The document is a presentation about using Hadoop for analytic workloads. It discusses how Hadoop has traditionally been used for batch processing but can now also be used for interactive queries and business intelligence workloads using tools like Impala, Parquet, and HDFS. It summarizes performance tests showing Impala can outperform MapReduce for queries and scales linearly with additional nodes. The presentation argues Hadoop provides an effective solution for certain data warehouse workloads while maintaining flexibility, ease of scaling, and cost effectiveness.
1. Copyright © 2013 Cloudera Inc. All rights reserved.
Hadoop Beyond Batch: Real-time Workloads, SQL-on-Hadoop, and the Virtual EDW
Marcel Kornacker | marcel@cloudera.com
April 2014
2. Analytic Workloads on Hadoop: Where Do We Stand?
"DeWitt Clause" prohibits using DBMS vendor names
3. Hadoop for Analytic Workloads
• Hadoop has traditionally been used for offline batch processing: ETL and ELT
• Next step: Hadoop for traditional business intelligence (BI)/data warehouse (EDW) workloads:
  • interactive
  • concurrent users
• Topic of this talk: a Hadoop-based open-source stack for EDW workloads:
  • HDFS: a high-performance storage system
  • Parquet: a state-of-the-art columnar storage format
  • Impala: a modern, open-source SQL engine for Hadoop
4. Hadoop for Analytic Workloads
• Thesis of this talk:
  • techniques and functionality of established commercial solutions are either already available or are rapidly being implemented in the Hadoop stack
  • the Hadoop stack is an effective solution for certain EDW workloads
  • a Hadoop-based EDW solution maintains Hadoop's strengths: flexibility, ease of scaling, cost effectiveness
5. HDFS: A Storage System for Analytic Workloads
• Available in HDFS today:
  • high-efficiency data scans at or near hardware speed, both from disk and from memory
• On the immediate roadmap:
  • co-partitioned tables for even faster distributed joins
  • temp-FS: write temp table data straight to memory, bypassing disk
6. HDFS: The Details
• High-efficiency data transfers:
  • short-circuit reads: bypass the DataNode protocol when reading from local disk
    -> read at 100+ MB/s per disk
  • HDFS caching: access explicitly cached data without copying or checksumming
    -> access memory-resident data at memory bus speed
    -> enables in-memory processing
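Short-circuit reads are a client/DataNode configuration option rather than a default. A sketch of the relevant hdfs-site.xml settings, with property names taken from the Apache Hadoop 2.x documentation (verify against your Hadoop version):

```xml
<!-- hdfs-site.xml: let an HDFS client read local replicas directly,
     bypassing the DataNode protocol -->
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <!-- UNIX domain socket used to pass file descriptors between the
       DataNode and the client; the directory must exist and be writable -->
  <name>dfs.domain.socket.path</name>
  <value>/var/run/hadoop-hdfs/dn._PORT</value>
</property>
```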
7. HDFS: The Details
• Coming attractions:
  • affinity groups: collocate blocks from different files
    -> create co-partitioned tables for improved join performance
  • temp-FS: write temp table data straight to memory, bypassing disk
    -> ideal for iterative interactive data analysis
8. Parquet: Columnar Storage for Hadoop
• What it is:
  • a state-of-the-art, open-source columnar file format that's available for (most) Hadoop processing frameworks: Impala, Hive, Pig, MapReduce, Cascading, …
  • offers both high compression and high scan efficiency
  • co-developed by Twitter and Cloudera; hosted on GitHub and soon to be an Apache incubator project
    • with contributors from Criteo, Stripe, Berkeley AMPLab, LinkedIn
  • used in production at Twitter and Criteo
9. Parquet: The Details
• columnar storage: column-major instead of the traditional row-major layout; used by all high-end analytic DBMSs
• optimized storage of nested data structures: patterned after Dremel's ColumnIO format
• extensible set of column encodings:
  • run-length and dictionary encodings in the current version (1.2)
  • delta and optimized string encodings in 2.0
• embedded statistics: version 2.0 stores inlined column statistics for further optimization of scan efficiency
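As a toy illustration of the two encodings in the current version (this is not Parquet's actual on-disk layout, and the function names are invented): dictionary encoding replaces repeated values with small integer ids, and run-length encoding then collapses runs of identical ids.

```python
def dictionary_encode(column):
    """Map each distinct value to a small integer id; return (dict, ids)."""
    dictionary = {}
    ids = []
    for value in column:
        if value not in dictionary:
            dictionary[value] = len(dictionary)
        ids.append(dictionary[value])
    return dictionary, ids

def run_length_encode(ids):
    """Collapse runs of repeated ids into (id, run_length) pairs."""
    runs = []
    for i in ids:
        if runs and runs[-1][0] == i:
            runs[-1] = (i, runs[-1][1] + 1)
        else:
            runs.append((i, 1))
    return runs

column = ["US", "US", "US", "DE", "DE", "US"]
dictionary, ids = dictionary_encode(column)
print(dictionary)               # {'US': 0, 'DE': 1}
print(run_length_encode(ids))   # [(0, 3), (1, 2), (0, 1)]
```

Low-cardinality columns (country codes, flags, dates) compress dramatically this way, which is one reason columnar formats beat row-major layouts for analytic scans.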
10. Parquet: Storage Efficiency
11. Parquet: Scan Efficiency
12. Impala: A Modern, Open-Source SQL Engine
• implementation of an MPP SQL query engine for the Hadoop environment
• highest-performance SQL engine for the Hadoop ecosystem; already outperforms some of its commercial competitors
• effective for EDW-style workloads
• maintains Hadoop flexibility by utilizing standard Hadoop components (HDFS, HBase, Metastore, YARN)
• plays well with traditional BI tools: exposes/interacts with industry-standard interfaces (ODBC/JDBC, Kerberos and LDAP, ANSI SQL)
13. Impala: A Modern, Open-Source SQL Engine
• history:
  • developed by Cloudera and fully open-source; hosted on GitHub
  • released as beta in 10/2012
  • 1.0 version available in 05/2013
  • current version is 1.2.3, available for CDH4 and CDH5 beta
14. Impala from the User's Perspective
• create tables as virtual views over data stored in HDFS or HBase; schema metadata is stored in the Metastore (shared with Hive, Pig, etc.; basis of HCatalog)
• connect via ODBC/JDBC; authenticate via Kerberos or LDAP
• run standard SQL:
  • current version: ANSI SQL-92 (limited to SELECT and bulk insert) minus correlated subqueries; has UDFs and UDAs
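A hypothetical session in that SQL subset (table, column, and path names are invented; some 1.x releases spell the format clause STORED AS PARQUETFILE, so check your version). Note the LIMIT: ORDER BY without LIMIT only arrives later on the roadmap.

```sql
-- Define a table as a virtual view over Parquet files already in HDFS;
-- no data is loaded or converted.
CREATE EXTERNAL TABLE sales (
  sale_id BIGINT,
  region  STRING,
  amount  DOUBLE
)
STORED AS PARQUET
LOCATION '/user/warehouse/sales';

-- Standard SQL-92 aggregation over the data in place.
SELECT region, SUM(amount) AS total_amount
FROM sales
GROUP BY region
ORDER BY total_amount DESC
LIMIT 10;
```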
15. Impala from the User's Perspective
• 2014 roadmap:
  • 1.3: admission control
  • 1.4: DECIMAL(<precision>, <scale>)
  • 2.0 or earlier: analytic window functions, ORDER BY without LIMIT, support for nested types (structs, arrays, maps), UDTFs, disk-based joins and aggregation
16. Impala Architecture
• distributed service:
  • a daemon process (impalad) runs on every node with data
  • easily deployed with Cloudera Manager
  • each node can handle user requests; a load-balancer configuration is recommended for multi-user environments
• query execution phases:
  • client request arrives via ODBC/JDBC
  • planner turns the request into a collection of plan fragments
  • coordinator initiates execution on remote impalads
17. Impala Query Execution
• Request arrives via ODBC/JDBC
18. Impala Query Execution
• Planner turns the request into a collection of plan fragments
• Coordinator initiates execution on remote impalad nodes
19. Impala Query Execution
• Intermediate results are streamed between impalads
• Query results are streamed back to the client
20. Impala Architecture: Query Planning
• 2-phase process:
  • single-node plan: left-deep tree of query operators
  • partitioning into plan fragments for distributed parallel execution: maximize scan locality/minimize data movement, parallelize all query operators
• cost-based join order optimization
• cost-based join distribution optimization
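A deliberately tiny sketch of what cost-based join ordering means for a left-deep plan (this is not Impala's planner; the table cardinalities and the flat selectivity constant are made up): enumerate candidate orders and keep the one with the smallest estimated intermediate result sizes.

```python
from itertools import permutations

def plan_cost(order, sizes, selectivity=0.1):
    """Estimated cost of a left-deep join: sum of intermediate row counts."""
    rows = sizes[order[0]]
    cost = 0.0
    for table in order[1:]:
        rows = rows * sizes[table] * selectivity  # crude cardinality estimate
        cost += rows
    return cost

# Made-up table cardinalities for illustration.
sizes = {"store_sales": 1_000_000, "date_dim": 365, "item": 10_000}
best = min(permutations(sizes), key=lambda order: plan_cost(order, sizes))
print(best)  # the large fact table lands last in the left-deep order
```

Real optimizers prune this exponential search and use per-column statistics instead of a flat selectivity, but the principle is the same: keep intermediate results small.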
21. Impala Architecture: Query Execution
• execution engine designed for efficiency, written from scratch in C++; no reuse of decades-old open-source code
• circumvents MapReduce completely
• in-memory execution:
  • aggregation results and right-hand-side inputs of joins are cached in memory
  • example: join with a 1TB table, referencing 2 of 200 columns and 10% of rows
    -> need to cache about 1GB across all nodes in the cluster
    -> not a limitation for most workloads
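The example's arithmetic can be checked directly: with a columnar format only the referenced columns are read, and only the selected rows need caching (assuming, for simplicity, that all columns are the same width).

```python
table_bytes = 1 * 1024**4      # 1 TB table on the right-hand side of the join
column_fraction = 2 / 200      # only 2 of 200 columns are referenced
row_fraction = 0.10            # the predicate keeps 10% of the rows

cache_bytes = table_bytes * column_fraction * row_fraction
print(cache_bytes / 1024**3)   # ~1 GB, spread across the whole cluster
```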
22. Impala Architecture: Query Execution
• runtime code generation:
  • uses LLVM to JIT-compile the runtime-intensive parts of a query
  • effect is the same as custom-coding a query:
    • remove branches
    • propagate constants, offsets, pointers, etc.
    • inline function calls
  • optimized execution for modern CPUs (instruction pipelines)
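A loose analogy in Python (Impala emits native code via LLVM, not Python closures; all names here are invented): a generic evaluator re-checks the operator and re-reads constants for every row, while a "compiled" predicate resolves those branches once, up front.

```python
def interpret_predicate(row, col, op, constant):
    """Generic evaluator: branches on the operator for every single row."""
    value = row[col]
    if op == ">":
        return value > constant
    if op == "<":
        return value < constant
    if op == "=":
        return value == constant
    raise ValueError(f"unknown operator {op!r}")

def compile_predicate(col, op, constant):
    """Resolve the branch once; return a specialized row -> bool function."""
    if op == ">":
        return lambda row: row[col] > constant
    if op == "<":
        return lambda row: row[col] < constant
    if op == "=":
        return lambda row: row[col] == constant
    raise ValueError(f"unknown operator {op!r}")

rows = [{"amount": 5.0}, {"amount": 50.0}, {"amount": 500.0}]
pred = compile_predicate("amount", ">", 10.0)   # branch taken exactly once
matches = [row for row in rows if pred(row)]
print(len(matches))  # 2
```

In native code the payoff is much larger than in Python: the eliminated branches and inlined calls keep the CPU's instruction pipeline full in the per-row inner loop.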
23. Impala Architecture: Query Execution
24. Impala vs. MR for Analytic Workloads
• Impala vs. SQL-on-MR:
  • Impala 1.1.1 / Hive 0.12 ("Stinger Phases 1 and 2")
  • file formats: Parquet / ORCFile
  • TPC-DS, 3TB data set running on a 5-node cluster
25. Impala vs. MR for Analytic Workloads
• Impala speedup:
  • interactive: 8-69x
  • report: 6-68x
  • deep analytics: 10-58x
26. Impala vs. non-MR for Analytic Workloads
• Impala 1.2.3 / Presto 0.6 / Shark
• file formats: RCFile (+ Parquet)
• TPC-DS, 15TB data set running on a 21-node cluster
27. Impala vs. non-MR for Analytic Workloads
28. Impala vs. non-MR for Analytic Workloads
• Multi-user benchmark:
  • 10 concurrent users
  • same dataset, same hardware
  • workload: queries from the "interactive" group
29. Impala vs. non-MR for Analytic Workloads
30. Scalability in Hadoop
• Hadoop's promise of linear scalability: add more nodes to the cluster, gain a proportional increase in capabilities
  -> adapt to any kind of workload change simply by adding more nodes to the cluster
• scaling dimensions for EDW workloads:
  • response time
  • concurrency/query throughput
  • data size
31. Scalability in Hadoop
• Scalability results for Impala:
  • tests show linear scaling along all 3 dimensions
  • setup:
    • 2 clusters: 18 and 36 nodes
    • 15TB TPC-DS data set
    • 6 "interactive" TPC-DS queries
32. Impala Scalability: Latency
33. Impala Scalability: Concurrency
• Comparison: 10 vs. 20 concurrent users
34. Impala Scalability: Data Size
• Comparison: 15TB vs. 30TB data set
35. Summary: Hadoop for Analytic Workloads
• Thesis of this talk:
  • techniques and functionality of established commercial solutions are either already available or are rapidly being implemented in the Hadoop stack
  • Impala/Parquet/HDFS is an effective solution for certain EDW workloads
  • a Hadoop-based EDW solution maintains Hadoop's strengths: flexibility, ease of scaling, cost effectiveness
37. Summary: Hadoop for Analytic Workloads
• what the future holds:
  • further performance gains
  • more complete SQL capabilities
  • improved resource management and the ability to handle multiple concurrent workloads in a single cluster