4. Agenda
• Part 1:
• Low-latency and Hadoop: a missing piece of the puzzle
• Impala: Goals, non-goals and features
• Demo
• Q+A
• Part 2:
• Impala Internals
• Comparing Impala to other systems
• Q+A
7. About Me
• Hi!
• Software Engineer at Cloudera since 2009
• Apache ZooKeeper
• First version of Flume
• Cloudera Enterprise
• Working on Impala since the beginning
of 2012
12. The Hadoop Landscape
• Hadoop MapReduce is a batch processing
system
• Ideally suited to long-running, high-latency data processing workloads
• But not as suitable for interactive queries,
data exploration or iterative query
refinement
• All of which are keystones of data
warehousing
16. Bringing Low-Latency to
Hadoop
• HDFS and HBase make data storage cheap
and flexible
• SQL / ODBC are industry-standards
• Analyst familiarity
• BI tool integration
• Legacy systems
• Can we get the advantages of both?
• With acceptable performance?
20. Impala Overview: Goals
• General-purpose SQL query engine
• should work both for analytical and transactional workloads
• will support queries that take from milliseconds to hours
• Runs directly within Hadoop:
• Reads widely-used Hadoop file formats
• talks to widely used Hadoop storage managers like HDFS and HBase
• runs on same nodes that run Hadoop processes
• High performance
• C++ instead of Java
• runtime code generation via LLVM
• completely new execution engine that doesn’t build on MapReduce
27. User View of Impala
• Runs as a distributed service in cluster: one
Impala daemon on each node with data
• User submits query via ODBC/Beeswax Thrift
API to any daemon
• Query is distributed to all nodes with relevant
data
• If any node fails, the query fails
• Impala uses Hive’s metadata interface
• Supported file formats:
• text files (GA: with compression, including lzo)
• sequence files with snappy / gzip compression
• GA: Avro data files / columnar format (more on this later)
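• For illustration, a minimal sketch (table, columns and path are hypothetical): because Impala reads the Hive metastore, a table defined through Hive is immediately queryable from Impala without reloading the data.
-- HiveQL: define a text-file table over data already in HDFS
CREATE EXTERNAL TABLE web_logs (
  ts     BIGINT,
  url    STRING,
  status INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/web_logs';
-- From the Impala shell or ODBC, the same table is then visible:
SELECT status, count(*) FROM web_logs GROUP BY status;
(Depending on the version, Impala may need its metadata refreshed before newly created tables appear.)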
30. User View of Impala: SQL
• SQL support:
• patterned after Hive’s version of SQL
• limited to Select, Project, Join, Union, Subqueries, Aggregation and Insert
• only equi-joins, no non-equi-joins, no cross products
• ORDER BY only with LIMIT
• GA: DDL support (CREATE, ALTER)
• Functional Limitations
• no custom UDFs, file formats, Hive SerDes
• only hash joins: joined table has to fit in memory of a single node (beta) / aggregate memory of all executing nodes (GA)
• join order = FROM clause order
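• To make the supported subset concrete, a hedged sketch (table and column names are hypothetical):
-- Allowed: select/project, equi-join, aggregation, ORDER BY with LIMIT
SELECT c.region, SUM(o.amount) AS total
FROM orders o
JOIN customers c ON (o.customer_id = c.customer_id)  -- equi-join only
GROUP BY c.region
ORDER BY total DESC
LIMIT 10;
-- Not supported in the beta:
--   JOIN ... ON (o.amount > c.credit_limit)   -- non-equi-join
--   ORDER BY total DESC                       -- ORDER BY without LIMIT
-- Because join order = FROM clause order and the joined table must fit in
-- memory, the larger table typically goes first in the FROM clause.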
33. User View of Impala: HBase
• HBase functionality
• uses Hive’s mapping of HBase table into metastore table
• predicates on rowkey columns are mapped into start / stop row
• predicates on other columns are mapped into SingleColumnValueFilters
• HBase functional limitations
• no nested-loop joins
• all data stored as text
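• A hedged sketch of the Hive-side mapping this relies on (table, column family and qualifier names are hypothetical):
-- HiveQL: map an existing HBase table into the metastore
CREATE EXTERNAL TABLE hbase_users (
  rowkey STRING,
  name   STRING,
  age    STRING   -- all values handled as text
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:name,info:age")
TBLPROPERTIES ("hbase.table.name" = "users");
-- In Impala, a rowkey predicate becomes an HBase start/stop row:
SELECT name FROM hbase_users WHERE rowkey = 'user_0042';
-- A predicate on another column becomes a SingleColumnValueFilter:
SELECT rowkey FROM hbase_users WHERE age = '29';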
39. TPC-DS
• TPC-DS is a benchmark dataset designed to model decision support systems
• We generated 500MB of data (not a lot, but enough to be illustrative!)
• Let’s run a sample query against Hive 0.9, and against Impala 0.3
• Single node (VM! - caveat emptor), so we’re testing execution engine speeds
40. TPC-DS Sample Query
select
i_item_id,
s_state,
avg(ss_quantity) agg1,
avg(ss_list_price) agg2,
avg(ss_coupon_amt) agg3,
avg(ss_sales_price) agg4
FROM store_sales
JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
JOIN customer_demographics on (store_sales.ss_cdemo_sk =
customer_demographics.cd_demo_sk)
JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
where
cd_gender = 'M' and
cd_marital_status = 'S' and
cd_education_status = 'College' and
d_year = 2002 and
s_state in ('TN','SD', 'SD', 'SD', 'SD', 'SD')
group by
i_item_id,
s_state
order by
i_item_id,
s_state
limit 100;
42. Impala is much faster
• Why?
• No materialisation of intermediate data
- less I/O
• No multi-phase queries - much smaller
startup / teardown overhead
• Faster execution engine: generates fast code for each individual query
43. Part 2:
Impala Internals /
Roadmap
47. Impala Architecture
• Two binaries: impalad and statestored
• Impala daemon (impalad)
• handles client requests and all internal requests related to query execution over Thrift
• runs on every datanode
• Statestore daemon (statestored)
• provides membership information and
metadata distribution
• only one per cluster
50. Query Execution
• Query execution phases:
• Request arrives via Thrift API (perhaps
from ODBC, or shell)
• Planner turns request into collections of plan fragments
• ‘Coordinator’ initiates execution on
remote impalad daemons
• During execution:
• Intermediate results are streamed
between impalad daemons
• Query results are streamed to client
58. The Planner
• Two-phase planning process:
• single-node plan: left-deep tree of plan operators
• plan partitioning: partition single-node plan to maximise scan locality,
minimise data movement
• Plan operators: Scan, HashJoin, HashAggregation, Union, TopN, Exchange
• Distributed aggregation: pre-aggregate in individual nodes, merge aggregation at root
• GA: rudimentary cost-based optimiser
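• A hedged way to see the two-phase plan for yourself (assuming an EXPLAIN statement is available in your Impala version; the table is hypothetical):
EXPLAIN
SELECT state, SUM(revenue)
FROM sales
GROUP BY state;
-- Expect a scan plus pre-aggregation fragment on the data nodes and a merge
-- aggregation fragment at the coordinator, connected by an Exchange.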
59. Plan Partitioning
• Example: query with join and aggregation
SELECT state, SUM(revenue)
FROM HdfsTbl h JOIN HbaseTbl b ON (...)
GROUP BY 1 ORDER BY 2 desc LIMIT 10
[Plan diagram: single-node plan (TopN over Agg over Hash Join over HDFS and HBase Scans), partitioned so the Scans run at the DataNodes / HBase region servers and feed the Hash Join and pre-aggregation there, with Exchange operators sending partial results to the coordinator for the final Agg and TopN]
63. Execution Engine
• Heavy-lifting component of each Impalad
• Written in C++
• runtime code generation for “big loops”
• Internal in-memory tuple format puts fixed-width data at fixed offsets
• Hand-optimised assembly where
needed
66. More on Code Generation
• For example: Inserting tuples into a hash-
table
• We know ahead of time the maximum
number of tuples (in a batch), the tuple
layout, what fields might be null and so
on.
• Pre-bake all this information into an unrolled loop that avoids branches and dead code
• Function calls are inlined at compile-
time
• Result: significant speedup in real queries
70. Statestore
• Central system state repository
• Membership / failure-detection
• GA: metadata
• GA: diagnostics, scheduling information
• Soft-state
• All data can be reconstructed from the rest of
the system
• Impala continues to run when statestore fails,
but per-node state becomes increasingly stale
• Sends periodic heartbeats
• Pushes new data
• Checks for liveness
74. Why not ZooKeeper?
• Apache ZooKeeper is not a good publish-
subscribe system
• API is awkward, and requires a lot of client logic
• Multiple round-trips required to get data for changes to node’s children
• Push model is more natural for our use case
• Don’t need all the guarantees ZK provides
• Serializability
• Persistence
• Avoid complexity where possible!
• ZK is bad at the things we care about, and
good at the things we don’t
78. Comparing Impala to Dremel
• Google’s Dremel
• Columnar storage for data with nested
structures
• Distributed scalable aggregation on top
of that
• Columnar storage coming to Hadoop via
joint project between Cloudera and Twitter
• Impala plus columnar format: a superset of the published version of Dremel (which had no joins)
81. Comparing Impala to Hive
• Hive: MapReduce as an execution engine
• High latency, low throughput queries
• Fault-tolerance based on MapReduce’s on-
disk checkpointing: materialises all
intermediate results
• Java runtime allows for extensibility: file
formats and UDFs
• Impala:
• Direct, process-to-process data exchange
• No fault tolerance
• Designed for low runtime overhead
• Not nearly as extensible
83. Impala and Hive: Performance
• No published benchmarks yet, but from the development process:
• I/O-bound workloads: Impala can get full disk throughput, faster by 3-4x
• Multiple phase Hive queries see larger
speedup in Impala
• Queries against in-memory data can be
up to 100x faster
91. Impala Roadmap to GA
• GA planned for second-quarter 2013
• New data formats
• LZO-compressed text
• Avro
• Columnar format
• Better metadata handling through statestore
• JDBC support
• Improved query execution, e.g. partitioned joins
• Production deployment guidelines
• Load-balancing across Impalad daemons
• Resource isolation within Hadoop cluster
• More packages: RHEL 5.7, Ubuntu, Debian
96. Impala Roadmap: Beyond GA
• Coming in 2013
• Improved HBase support
• Composite keys, Avro data in columns
• Indexed nested-loop joins
• INSERT / UPDATE / DELETE
• Additional SQL
• UDFs
• SQL authorisation and DDL
• ORDER BY without LIMIT
• Window functions
• Support for structured data types
• Runtime optimisations
• Straggler handling
• Join order optimisation
• Improved cache management
• Data co-location for improved join performance
98. Impala Roadmap: 2013
• Resource management
• Cluster-wide quotas
• “User X can never have more than 5 concurrent queries running”
• Goal: run exploratory and production
workloads in same cluster without
affecting production jobs
102. Try it out!
• Beta version available since October 2012
• Get started at www.cloudera.com/impala
• Questions / comments?
• impala-user@cloudera.org
• henry@cloudera.com