Cloudera Impala: The Open Source, Distributed SQL Query Engine for Big Data. The Cloudera Impala project is pioneering the next generation of Hadoop capabilities: the convergence of fast SQL queries with the capacity, scalability, and flexibility of an Apache Hadoop cluster. With Impala, the Hadoop ecosystem now has an open-source codebase that helps users query data stored in Hadoop-based enterprise data hubs in real time, using familiar SQL syntax.
This talk will begin with an overview of the challenges organizations face as they collect and process more data than ever before, followed by an overview of Impala from the user's perspective and a dive into Impala's architecture. It concludes with stories of how Cloudera's customers are using Impala and the benefits they see.
5. Data is not what it used to be
[Chart: data growth, 1980–2012 — structured data ~20%, unstructured data ~80%]
6. Hadoop was Invented to Solve:
• Large volumes of data
• Data that is only valuable in bulk
• High ingestion rates
• Data that requires more processing
• Differently structured data
• Evolving data
• High license costs
7. What is Apache Hadoop?
Apache Hadoop is an open source platform for data storage and processing that is distributed, fault tolerant, and scalable.

Has the flexibility to store and mine any type of data:
• Ask questions across structured and unstructured data that were previously impossible to ask or solve
• Not bound by a single schema

Excels at processing complex data:
• Scale-out architecture divides workloads across multiple nodes
• Flexible file system eliminates ETL bottlenecks

Scales economically:
• Can be deployed on commodity hardware
• Open source platform guards against vendor lock-in

CORE HADOOP SYSTEM COMPONENTS
• Hadoop Distributed File System (HDFS) – self-healing, high-bandwidth clustered storage
• MapReduce – distributed computing framework
10. Hive & Pig
• Hive – Turn SQL into MapReduce
• Pig – Turn Pig Latin dataflow scripts into MapReduce
• Makes MapReduce easier
• But not any faster
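To make "turn SQL into MapReduce" concrete, here is a minimal, illustrative Python sketch (not Hive's actual compiler) of the map, shuffle, and reduce phases that a `GROUP BY` count compiles into; all names and data are hypothetical.

```python
# Hedged sketch: what "SELECT word, COUNT(*) ... GROUP BY word" becomes
# conceptually when compiled to MapReduce. Plain Python stands in for the
# framework; function names and input are illustrative only.
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (key, 1) pair for each word.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values by key (done by the framework between phases).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values, like the SQL COUNT(*).
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["to be or", "not to be"])))
# counts == {"to": 2, "be": 2, "or": 1, "not": 1}
```

Every phase here reads and writes materialized intermediate data, which is why this model is easy to generate from SQL but not fast for interactive queries.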
11. Towards a Better MapReduce
• Spark – next-generation MapReduce:
  • In-memory caching
  • Lazy evaluation
  • Fast recovery times from node failures
• Tez – next-generation MapReduce: reduced overhead, more flexibility; currently alpha
14. Impala Overview
• Interactive SQL for Hadoop – responses in seconds
• Nearly ANSI-92 standard SQL, compatible with Hive SQL
• Native MPP query engine:
  • Purpose-built for low-latency queries
  • Separate runtime from MapReduce
  • Designed as part of the Hadoop ecosystem
• Open source – Apache-licensed
15. Impala Overview
• Runs directly within Hadoop:
  • Reads widely used Hadoop file formats
  • Talks to widely used Hadoop storage managers
  • Runs on the same nodes that run Hadoop processes
• High performance:
  • C++ instead of Java
  • Runtime code generation
  • Completely new execution engine – no MapReduce
16. Impala is Production Ready
• Beta released in October 2012
• General availability (v1.0) released in April 2013
• Latest release (v1.2.3) shipped December 23rd
17. User View of Impala: Overview
• Distributed service in cluster:
one Impala daemon on each node with data
• Highly available: no single point of failure
• Submit query to any daemon:
• ODBC/JDBC
• Impala CLI
• Hue
• Query is distributed to all nodes with relevant data
• Impala uses Hive’s metadata
18. User View of Impala: File Formats
• There is no ‘Impala format’.
• Impala supports:
• Uncompressed/LZO-compressed text files
• SequenceFiles and RCFile with Snappy/gzip compression
• Avro data files
• Parquet columnar format (more on that later)
• HBase
19. User View of Impala: SQL Support
• Most of SQL-92
• INSERT INTO … SELECT …
• Only equi-joins; no non-equi joins, no cross products
• Order By requires Limit (for now)
• DDL support
• SQL-style authorization via Apache Sentry (incubating)
• UDFs and UDAFs are supported
21. Impala Use Cases
Cost-effective, ad hoc query environment that offloads the data warehouse for:
• Interactive BI/analytics on more data
• Asking new questions – exploration, ML
• Data processing with tight SLAs
• Query-able archive w/ full fidelity
22. Global Financial Services Company
Saved 90% on incremental EDW spend & improved performance by 5x:
• Offload data warehouse for query-able archive
• Store decades of data cost-effectively
• Process & analyze on the same system
• Improved capabilities through interactive query on more data
23. Digital Media Company
20x performance improvement for exploration & data discovery:
• Easily identify new data sets for modeling
• Interact with raw data directly to test hypotheses
• Avoid expensive DW schema changes
• Accelerate ‘time to answer’
25. Impala Architecture
• Impala daemon (impalad) – N instances
• Query execution
• State store daemon (statestored) – 1 instance
• Provides name service and metadata distribution
• Catalog daemon (catalogd) – 1 instance
• Relays metadata changes to all impalads
27. Impala Query Execution
[Diagram: a SQL app connects via ODBC to one impalad; each node runs a Query Planner, Query Coordinator, and Query Executor co-located with an HDFS DataNode and HBase; cluster services shown: Hive Metastore, HDFS NameNode, Statestore]
2) Planner turns request into collections of plan fragments
3) Coordinator initiates execution on impalad(s) local to data
28. Impala Query Execution
[Diagram: same cluster layout as above; arrows show results streaming between impalads and back to the client]
4) Intermediate results are streamed between impalad(s)
5) Query results are streamed back to client
29. Query Planner
• 2-phase planning
• Left-deep tree
• Partition plan to maximize data locality
• Join order:
  • Before 1.2.3: order of tables in the query
  • 1.2.3 and above: cost-based if statistics exist
• Plan operators: Scan, HashJoin, HashAggregation, Union, TopN, Exchange
• All operators are fully distributed
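The two join-ordering policies above can be sketched as follows. This is an illustrative heuristic in Python, not Impala's actual planner code; the function and parameter names are invented for the example. With statistics, a simple cost-based rule puts larger tables on the left so the smaller right-hand sides fit in the join hash tables.

```python
# Hedged sketch of the two join-ordering policies (names are hypothetical):
#   - no statistics (pre-1.2.3 behavior): keep the order tables appear in the query
#   - statistics available (1.2.3+): cost-based, largest table first (left-most)
def join_order(tables_in_query_order, row_counts=None):
    if not row_counts:
        # No stats: the planner falls back to query order.
        return list(tables_in_query_order)
    # Cost-based heuristic: sort by estimated row count, largest (probe side) first.
    return sorted(tables_in_query_order, key=lambda t: row_counts[t], reverse=True)

order_no_stats = join_order(["dim", "fact"])                            # ['dim', 'fact']
order_stats = join_order(["dim", "fact"], {"dim": 100, "fact": 1_000_000})  # ['fact', 'dim']
```

This is why "collect statistics" appears in the takeaways later: without them, a query written with the small table first builds an oversized hash table.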
31. Simple Example
SELECT state, SUM(revenue)
FROM HdfsTbl h
JOIN HbaseTbl b ON (h.id = b.id)
GROUP BY state
ORDER BY 2 DESC LIMIT 10
32. How does a database execute a query?
• Left-deep tree
• Data flows from bottom to top
[Execution tree: HDFS Scan and HBase Scan feed a Hash Join, which feeds Agg, which feeds TopN]
33. Wait – why is this a left-deep tree?
[Execution tree: Agg on top of a chain of HashJoins over scans of t0, t1, t2, t3; each join's right child is a single base-table scan, which is what makes the tree left-deep]
34. How does a database execute a query?
• The Hash Join node fills the hash table with the RHS table data
• So the RHS table (HBase scan) is scanned first
[Execution tree: TopN ← Agg ← Hash Join ← (HDFS Scan, HBase Scan); annotation: scan HBase first]
38. How does a database execute a query?
• Start scanning the LHS (HDFS) table
• For each row from the LHS, probe the hash table for matching rows
[Annotation: probe the hash table – a matching row is found]
39. How does a database execute a query?
• Matched rows are bubbled up the execution tree
40. How does a database execute a query?
• Continue scanning the LHS (HDFS) table
• For each row from the LHS, probe the hash table for matching rows
• Unmatched rows are discarded
[Annotation: no matching row]
45. How does a database execute a query?
• Once all rows have been returned from the hash join node, the Agg node can start returning rows
• Rows are bubbled up the execution tree
46. How does a database execute a query?
• Rows from the aggregation node bubble up to the top-n node
47. How does a database execute a query?
• Rows from the aggregation node bubble up to the top-n node
• When all rows are returned by the agg node, the top-n node can start returning rows to the end user
48. Key takeaways
• Data flows from bottom to top in the execution tree and finally goes to the end user
• Larger tables go on the left
• Collect statistics
• Filter early
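The whole walkthrough – RHS build, LHS probe, aggregation, top-n – can be condensed into a single-node Python sketch of the example query (SELECT state, SUM(revenue) … ORDER BY 2 DESC LIMIT n). This is illustrative pseudologic, not Impala's C++ execution engine, and the table contents are made up.

```python
# Hedged single-node sketch of the execution tree above:
# TopN <- Agg <- HashJoin <- (HDFS Scan, HBase Scan). Hypothetical data.
import heapq
from collections import defaultdict

def hash_join(lhs_rows, rhs_rows, key):
    # Build phase: the RHS (HBase scan) fills the hash table first.
    table = defaultdict(list)
    for row in rhs_rows:
        table[row[key]].append(row)
    # Probe phase: stream the LHS (HDFS scan); unmatched rows are discarded.
    for row in lhs_rows:
        for match in table[row[key]]:
            yield {**row, **match}

def run_query(hdfs_tbl, hbase_tbl, limit=10):
    # Agg node: GROUP BY state, SUM(revenue).
    sums = defaultdict(int)
    for row in hash_join(hdfs_tbl, hbase_tbl, "id"):
        sums[row["state"]] += row["revenue"]
    # TopN node: ORDER BY 2 DESC LIMIT n.
    return heapq.nlargest(limit, sums.items(), key=lambda kv: kv[1])

hdfs_tbl = [{"id": 1, "revenue": 10}, {"id": 2, "revenue": 20},
            {"id": 3, "revenue": 5}, {"id": 4, "revenue": 7}]
hbase_tbl = [{"id": 1, "state": "CA"}, {"id": 3, "state": "NY"},
             {"id": 4, "state": "CA"}]
top2 = run_query(hdfs_tbl, hbase_tbl, limit=2)
# top2 == [('CA', 17), ('NY', 5)]; the id=2 row finds no match and is discarded
```

Note how the LHS is only streamed, never held in memory, while the RHS must fit in the hash table – the reason larger tables go on the left.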
50. How does an MPP database execute a query?
[Plan diagram: per-node fragment – Scan Tbl a and Scan Tbl b feed a Hash Join, then a local Agg; one Exchange broadcasts the Tbl b scan, and a second Exchange re-distributes the local aggregates by "state" to a final Agg]
51. How does an MPP database execute a query?
[Diagram: each node does a local read of Tbl A and joins it against Tbl B, which is scanned and broadcast to every node ("A join B"); each node runs a local Agg; partial results are re-distributed by "state" for the final Agg on each node]
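The distributed plan above can be sketched in miniature: table B is broadcast to every node, each node joins its local block of A and pre-aggregates, and partial results are re-distributed by the GROUP BY key ("state") for the final aggregate. Python stands in for the impalad fragments; all data and names are hypothetical.

```python
# Hedged sketch of the broadcast-join + two-phase aggregation plan above.
from collections import defaultdict

def local_fragment(local_a_block, broadcast_b):
    # Broadcast join: every node holds all of B but only its local slice of A.
    b_by_id = {row["id"]: row for row in broadcast_b}
    partial = defaultdict(int)
    for row in local_a_block:
        match = b_by_id.get(row["id"])
        if match:
            partial[match["state"]] += row["revenue"]  # local (pre-)aggregation
    return partial

def exchange_and_final_agg(partials, num_nodes):
    # Exchange: redistribute partial sums by hash("state") so each node owns
    # a disjoint set of groups, then perform the final aggregation per node.
    shards = [defaultdict(int) for _ in range(num_nodes)]
    for partial in partials:
        for state, subtotal in partial.items():
            shards[hash(state) % num_nodes][state] += subtotal
    # The query result is the union of the per-node final aggregates.
    return {s: v for shard in shards for s, v in shard.items()}

table_b = [{"id": 1, "state": "CA"}, {"id": 2, "state": "NY"}]
a_blocks = [  # table A split across two nodes
    [{"id": 1, "revenue": 10}, {"id": 2, "revenue": 3}],
    [{"id": 1, "revenue": 4}, {"id": 3, "revenue": 9}],
]
partials = [local_fragment(block, table_b) for block in a_blocks]
result = exchange_and_final_agg(partials, num_nodes=2)
```

Broadcasting makes sense when B is small; for two large tables, an engine would instead hash-partition both sides of the join.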
57. Impala Scalability: 2x the Hardware and 2x Users/Data (Expectation: Constant Response Times)
[Charts: response times for "2x the users, 2x the hardware" and "2x the data, 2x the hardware"]
Open source system implemented (mostly) in Java. Provides reliable and scalable storage and processing of extremely large volumes of data (TB to PB) on commodity hardware. Two primary components: HDFS and MapReduce. Based on software originally developed at Google. An important aspect of Hadoop in the context of this talk is that it allows for inexpensive storage (<10% of the cost of other solutions) of any type of data, and places no constraints on how that data is processed. It allows companies to begin storing data that was previously thrown away. Hadoop has value as a stand-alone analytics system, but many if not most organizations will be looking to integrate it into existing data infrastructure.
Takeaway – MapReduce is going away in favor of multi-framework Hadoop. Most of the replacements are improved MR; Impala is different.
Interactive SQL for Hadoop. Responses in seconds vs. minutes or hours; 4–65x faster than Hive, up to 100x seen. Nearly ANSI-92 standard SQL with HiveQL: CREATE, ALTER, SELECT, INSERT, JOIN, subqueries, etc. ODBC/JDBC drivers. Compatible SQL interface for existing Hadoop/CDH applications. Native MPP query engine, purpose-built for low-latency queries – another application being brought to Hadoop – with a runtime separate from MapReduce, which is designed for batch processing. Tightly integrated with the Hadoop ecosystem – a major design imperative and differentiator for Cloudera: single system (no integration); native, open file formats that are compatible across the ecosystem (no copying); single metadata model (no synchronization); single set of hardware and system resources (better performance, lower cost); integrated, end-to-end security (no vulnerabilities). Open source: keeps with our strategy of an open platform – i.e., if it stores or processes data, it's open source. Apache-licensed; code available on GitHub.
Interactive BI/analytics on more data: raw, full-fidelity data – nothing lost through aggregation or ETL/ELT; new sources & types – structured/unstructured; historical data. Asking new questions: exploration and data discovery for analytics and machine learning – you need to find a data set for a model, which requires lots of simple queries to summarize, count, and validate; hypothesis testing – avoid having to subset and fit the data to a warehouse just to ask a single question. Data processing with tight SLAs: cost-effective platform; minimize data movement; reduce strain on the data warehouse. Query-able storage: replace the production data warehouse for DR/active archive; store decades of data cost-effectively (for better modeling or data-retention mandates) without sacrificing the capability to analyze.
Now we’ve finished scanning the RHS table and have finished building the hash table. We can start scanning the LHS table to do the join.
If there’s a match, the joined row will bubble up the execution tree to the aggregation node.
This row doesn’t match. So, it won’t bubble up.
Now that all the rows have been returned from the hash join node, the aggregation node can start returning rows.
Table B is scanned in parallel and broadcast to all impalads. Each impalad reads its local data blocks for A and does the join – this is a broadcast join. After the join is done, we do the aggregation. But before we can produce the final result, we need to redistribute the results of the local aggregation according to the GROUP BY expression ("state") and do the final aggregate.
We added a redundant condition in the WHERE clause that doesn't change the query semantics or the results returned. This is transparently mentioned in both our public blog post and the published queries as the "explicit partition filter/predicate." Like window functions, this is done as a workaround for a feature limitation in both Impala and Hive, to match what a user would do to optimize for these systems. Please also note that this change was made for all compared systems (Impala, Hive, and "DBMS-Y") to ensure an apples-to-apples comparison.