This talk covers which tools and techniques work well, and which don’t, for data scientists working on Hadoop today; how to apply the lessons learned by the experts to increase your productivity; and what to expect for the future of data science on Hadoop. We draw on insights from the top data scientists working on big data systems at Cloudera, as well as experience running big data systems at Facebook, Google, and Yahoo.
Strata + Hadoop World 2012: Data Science on Hadoop: How Cloudera Impala Unlocks New Productivity and Insights
2. Data Science on Hadoop: How Cloudera Impala Unlocks New Productivity and Insights
Justin Erickson | Product Manager
Marcel Kornacker | Software Engineer
Ravikumar Visweswara | Software Engineer
October 2012
3. Why Data Scientists Love Hadoop
• Massive volumes of data
• Data preparation & analytics in 1 environment
• Highly flexible environment for creating & testing machine learning models
• 10% of the cost per TB under management
4. Hadoop Use Cases Moving to Real-Time
• Already query Hadoop using Hive
• Already load data into CDH every 90 mins or less
• Already use HBase for real-time data access
Source: Cloudera customer survey August 2012
5. But Hadoop Isn’t Fast Enough
• Need faster queries on Hadoop data
• Move data from Hadoop to RDBMS for interactive SQL
• See value today in consolidating to a single platform
Source: Cloudera customer survey August 2012
6. Beyond Batch – The Next Stage for Hadoop
HADOOP TODAY IS TOO SLOW
MapReduce is batch
Simple queries can take minutes / tens of minutes
CURRENT DATA MANAGEMENT IS TOO COMPLEX
Optimized for rigid schemas & special-purpose applications
Redundant data storage & processes
Very expensive systems: $20K-150K / TB
7. Cloudera Enterprise RTQ
Real-Time Query for Data Stored in Hadoop
Powered by Cloudera Impala.
Supports Hive SQL
4-30X faster than Hive over MapReduce
Supports multiple storage engines & file formats
Uses existing drivers, integrates with existing metastore, works with leading BI tools
Flexible, cost-effective, no lock-in
Deploy & operate with Cloudera Manager
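As a rough back-of-the-envelope illustration of the 4-30X figure on this slide, the sketch below converts a Hive-over-MapReduce query time into the latency range that such a speedup would imply. The function name and the 10-minute example input are hypothetical and chosen only for illustration; actual gains depend on the query, the data, and the cluster.

```python
def impala_latency_range(hive_seconds, min_speedup=4.0, max_speedup=30.0):
    """Return (best_case, worst_case) latency in seconds implied by a speedup range.

    The default speedups are the 4-30X range quoted on the slide;
    everything else here is an illustrative assumption.
    """
    if hive_seconds < 0:
        raise ValueError("hive_seconds must be non-negative")
    best_case = hive_seconds / max_speedup   # largest quoted speedup
    worst_case = hive_seconds / min_speedup  # smallest quoted speedup
    return best_case, worst_case


# Hypothetical example: a query that takes 10 minutes in Hive over MapReduce.
best, worst = impala_latency_range(600)
print(f"Implied latency: {best:.0f}s to {worst:.0f}s")
```

Under these assumptions, a "tens of minutes" batch query lands somewhere between seconds and a few minutes, which is the gap between batch and interactive use that the slide is pointing at.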
8. Cloudera Now Powered by Impala
[Diagram: the platform BEFORE IMPALA and WITH IMPALA. Labels: USER INTERFACE, BATCH PROCESSING, REAL-TIME ACCESS]
• Unified Storage: supports HDFS and HBase, flexible file formats
• Unified Metastore
• Unified Security
• Unified Client Interfaces: ODBC, SQL syntax, Hue Beeswax
• With Impala: real-time SQL queries, native distributed query engine, optimized for low latency
• Provides: answers as fast as you can ask, the ability for everyone to ask questions of all data, big data storage and analytics together
9. Cloudera Impala Details
[Architecture diagram. Captions: "Common Hive SQL and interface", "Unified metadata and scheduler", "Fully MPP Distributed", "Local Direct Reads". Components: a SQL App connecting over ODBC; Hive Metastore, YARN, HDFS NN, and State Store; three nodes, each with a Query Planner, Query Coordinator, and Query Exec Engine running alongside an HDFS DN and HBase.]
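The planner / coordinator / exec engine layout above can be illustrated with a toy scatter-gather aggregation. This is a minimal sketch of the general MPP idea, not Impala's implementation: the function names, the sample clickstream rows, and the three-node layout are all hypothetical. Each "exec engine" aggregates only its local rows, and a "coordinator" merges the partial results, roughly what a `SELECT page, SUM(clicks) ... GROUP BY page` query would need.

```python
from collections import Counter


def exec_fragment(local_rows):
    """Runs on each node: aggregate only the rows stored locally."""
    partial = Counter()
    for page, clicks in local_rows:
        partial[page] += clicks
    return partial


def coordinate(partials):
    """Runs on the coordinating node: merge the per-node partial aggregates."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return dict(total)


# Hypothetical clickstream rows (page, clicks) spread across three nodes.
node_data = [
    [("home", 3), ("search", 1)],
    [("home", 2)],
    [("search", 4), ("checkout", 1)],
]

result = coordinate(exec_fragment(rows) for rows in node_data)
print(result)
```

Only the small partial aggregates cross the network in this sketch; the raw rows stay on the node that stores them.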
15. Advantages of Our Approach
• No high-latency MapReduce batch processing
• Local processing avoids network bottlenecks
• No costly data format conversion overhead
• All data immediately query-able
• Single machine pool to scale
• All machines available to both Impala and MapReduce
• Single, open, and unified metadata and scheduler
[Diagram comparing alternative approaches, with panels labeled MapReduce, Remote Query, and Side Storage. Components shown include Hive, MR, query nodes, a query engine, the NameNode (NN), DataNodes (DN), and HDFS.]
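The "local processing avoids network bottlenecks" point can be sketched as a locality-aware scan assignment: give each data block to a node that already holds a replica, and fall back to a remote read only when no such node is available. This is a simplified illustration under assumed inputs, not Impala's actual scheduler; the block and node names and the `assign_scans` helper are hypothetical.

```python
def assign_scans(block_replicas, nodes):
    """Map each block to (node, is_local).

    block_replicas: dict of block id -> list of nodes holding a replica.
    nodes: nodes available to run scans.
    Picks the least-loaded node that holds a replica; if none does,
    picks the least-loaded node overall and marks the read as remote.
    """
    load = {node: 0 for node in nodes}
    plan = {}
    for block, replicas in sorted(block_replicas.items()):
        local = [node for node in replicas if node in load]
        candidates = local or list(nodes)
        # Break load ties by node name so the plan is deterministic.
        chosen = min(candidates, key=lambda node: (load[node], node))
        plan[block] = (chosen, bool(local))
        load[chosen] += 1
    return plan


# Hypothetical cluster: three nodes, four blocks (b4 has no replica here).
nodes = ["dn1", "dn2", "dn3"]
block_replicas = {
    "b1": ["dn1", "dn2"],
    "b2": ["dn1", "dn3"],
    "b3": ["dn2", "dn3"],
    "b4": ["dn9"],
}
plan = assign_scans(block_replicas, nodes)
for block, (node, is_local) in plan.items():
    print(block, node, "local" if is_local else "remote")
```

In this toy plan, three of the four blocks are read from local disk; only the block with no co-located replica needs a network read.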
17. Benefits of Cloudera Impala
Real-Time Query for Data Stored in Hadoop
• Get answers as fast as you can ask questions
• Interactive analytics directly on source data
• No jumping between data silos
• Reduce duplicate storage with EDW
• Reduce data movement for interactive analysis
• Leverage existing tools and employee skills
• Ask questions of all your data
• No information loss from aggregation or conforming to relational schemas for analysis
• Single metadata store from origination through analysis
• No need to hunt through multiple data silos
18. Cloudera powers real-time data hub
The Challenge:
• Needs to understand 2 years of clickstream data for greater insight
• Legacy system cannot scale for data processing and analytics
So Expedia can optimize data-driven search results for end users and maximize Google AdWords spend.
The Solution:
• Cloudera Enterprise – 4 Petabytes
• One single scalable platform for big data for archive, ETL & analytics with real-time BI
• Running Impala
Expedia’s use case for Impala:
As the world’s leading online travel provider, Expedia’s business requires a fine-tuned website that understands what its visitors want and can deliver results to partner hotels, airlines and other travel vendors. Expedia has historically used traditional relational data warehouses to capture and analyze the clickstream data generated to, from and within its website, but saw the value in capturing greater volumes of detailed historical data with Hadoop. The goal: to better understand the keyword conversions driving traffic to the site in order to optimize Google AdWords spend.
Today, Expedia uses Hadoop to power its full data lifecycle: data is collected from online activity, loaded into Hadoop, scored and analyzed, and that data generates scoring engines that influence the recommendations, search results and sort orders on Expedia.com.
Most recently, Expedia has kicked off a project using HBase and Impala for real-time BI that will power its Market Manager, an interactive application used by merchants such as hotels to see how they are performing on Expedia vs. competitors. For example, if a hotel notices it isn’t getting many bookings through Expedia around Christmastime, it can drill into the application to find out why: are its prices too high, or is it running low on inventory for certain dates? With this solution, Expedia can glean these insights and proactively reach out to merchants with recommendations on how they might drive more bookings.
Impala will allow Expedia’s business users to access Hadoop in a more interactive, ad hoc, speed-of-thought manner. Latency will be cut in half, and Impala provides an extensible solution that will scale with the growth of the business.