SlideShare uma empresa Scribd logo
1 de 24
Page1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
LLAP: Sub-Second Analytical Queries in Hive
Gopal Vijayaraghavan
Page2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Why LLAP?
• People like Hive
• Disk->Mem is getting further away
– Cloud Storage isn’t co-located
– Disks are connected to the CPU via network
• Security landscape is changing
– Cells & Columns are the new security boundary, not files
– Safely masking columns needs a process boundary
• Concurrency, Performance & Scale are at conflict
– Concurrency at 100k queries/hour
– Latencies at 2-5 seconds/query
– Petabyte scale warehouses (with terabytes of “hot” data)
Node
LLAP Process
Cache
Query Fragment
HDFS
Query Fragment
Page3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
What is LLAP?
• Hybrid model combining daemons and containers for
fast, concurrent execution of analytical workloads
(e.g. Hive SQL queries)
• Concurrent queries without specialized YARN queue setup
• Multi-threaded execution of vectorized operator pipelines
• Asynchronous IO and efficient in-memory caching
• Relational view of the data available thru the API
• High performance scans, execution code pushdown
• Centralized data security
Node
LLAP Process
Cache
Query Fragment
HDFS
Query Fragment
Page4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Hive 2.0 (+ LLAP)
• Transparent to Hive users, BI tools, etc.
• Hive decides where query fragments run
(LLAP, Container, AM) based on
configuration, data size, format, etc.
• Each Query coordinated independently by
a Tez AM
• Number of concurrent queries throttled
by number of active AMs
• Hive Operators used for processing
• Tez Runtime used for data transfer
HiveServer2
Query/AM
Controller
Client(s) YARN Cluster
AM1
llapd
llapd
Container AM1
Container AM1
llapd
Container AM2
AM2
AM3
llapd
Page5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Industry benchmark – 10Tb scale
0
5
10
15
20
25
30
35
40
45
50
query3 query12 query20 query21 query26 query27 query42 query52 query55 query73 query89 query91 query98
Time(seconds)
LLAP vs Hive 1.x 10TB Scale
Hive 1.x LLAP90% faster
Page6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Evaluation from a customer case study
0
20000
1 3 5 7 9 11 13 15 17 19 21 23 25 27
• Aggregate daily statistics for a time interval:
SELECT yyyymmdd,
sum(total_1),
sum(total_2),
...
from table
where yyyymmdd >= xxx
and yyyymmdd < xxx
and userid = xxx
group by userid, yyyymmdd;
Max
Max
Max
Avg
Avg Avg
0
1
2
3
4
5
6
7
8
D W M Y D W M Y D W M Y
Tez Phoenix LLAP
Executiontime,s
Page7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Evaluation from a customer case study
• Display a large report
Execution Time in seconds over time range
Max
Max
Avg
Avg
0
5
10
15
20
D W M Y D W M Y
Tez LLAP
Executiontime,s
Page8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Cut-awaytodemo
(GIF)
Page9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
How does LLAP make queries faster?
Page10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
LLAP
Queue
Technical overview – execution
• LLAP daemon has a number of executors (think
containers) that execute work "fragments"
• Fragments are parts of one, or multiple parallel
workloads (e.g. Hive SQL queries)
• Work queue with pluggable priority
• Geared towards low latency queries over long-
running queries (by default)
• I/O is similar to containers – read/write to HDFS,
shuffle, other storages and formats
• Streaming output for data API
Executor
Q1 Map 1
Executor
External read
Executor
Q3 Reducer 3
Q1 Map 1
Q1 Map 1
Q3 Map 19
HDFS
Waiting for
shuffle inputs
HBase
Container
(shuffle input)
Spark
executor
Page11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Executor
Technical overview – IO layer
• Optional: when executing inside LLAP
• All other formats use in-sync mode
• Asynchronous IO for Hive
• Wraps over InputFormat, reads through cache
• Supported with ORC
• Transparent, compressed in-memory cache
• Format-specific, extensible
• NVMe/NVDIMM caches
RecordReade
r
Fragment
Cache
IO thread
Plan & decode
Read, decompress
Metadata cache
Actual data (HDFS, S3, …)
What to read Data buffers
Splits Vectorized data
Indexes
Page12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Parallel queries – priorities, preemption
• Lower-priority fragments can be preempted
• For example, a fragment can start running before its inputs are
ready, for better pipelining; such fragments may be preempted
• LLAP work queue examines the DAG parameters to give
preference to interactive (BI) queries
LLAP
QueueExecutor
Executor
Interactive
query map 1/3
…
Interactive
query map 3/3
Executor
Interactive
query map 2/3
Wide query
reduce waiting
Time
ClusterUtilization
Long-Running Query
Short-Running Query
Page13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
I/OExecution
In-memory processing – present and future
Decoder
Distributed FS
Fragment
Hive
operator
Hive
operator
Vectorized
processin
g
col1 col2
Low-level pushdown
(e.g. filter)*
SSD cache*
Off-heap cache Compression
codec
Native data
vectors
- work in
progress
*
Compact encoded data
Compressed data
col1
col2
Page14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
First query erformance
• Cold LLAP is nearly as fast as
shared pre-warmed containers
(impractical on real clusters)
• Realistic (long-running) LLAP
~3x faster than realistic (no
prewarm) Tez
No
prewarm,
20.94
Shared
prewarm
11.96
Cold LLAP
13.23
Realistic
LLAP
7.49
0
5
10
15
20
25
Firstqueryruntime,s
Page15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
JIT Performance – heavy use
• Cache disabled!
13.23
7.93 7.7
6.29
5.44 5.3 5.01 4.89 4.83
7.49
6.1
4.61 4.59
4.93
4.62 4.63 4.45 4.25
0
2
4
6
8
10
12
14
0 1 2 3 4 5 6 7 8
Queryruntime,s
Query run
Cold LLAP
LLAP after an
unrelated workload
~2xtimesaving
For cold start, the
warmup took 3 runs
Page16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Parallel query execution – LLAP vs Hive 1.2
3131
2416
1741
1420
1508
531
291
147
94 75
0
500
1000
1500
2000
2500
3000
3500
1 2 4 8 16
Totalruntimeforallqueries,s.
Number of concurrent users
Hive 1.2 total runtime
LLAP total runtime
Hive 1.2 linear scaling
LLAP linear scaling
Parallelism eats into latency
Page17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Parallel query execution – 10Tb scale
531
291
147
94
75
0
100
200
300
400
500
600
1 2 4 8 16
Totalruntimeforallqueries,s.
Number of concurrent users
LLAP total runtime
LLAP linear scaling
Latency sensitive pre-emption
Page18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Performance – cache on HDFS, 1Tb scale
20.2
18.3
17.6
16.2 16.2
16.8
14.5
11.3
0.00%
20.00%
40.00%
60.00%
80.00%
100.00%
0
5
10
15
20
25
24 26 28 30 32 34 36 38 40 42
Cachehitrate
Queryruntime,s
Cache size, Gb
Query runtime, after
unrelated workload
Query runtime, after
related workload
Metadata hit rate
Data hit rate
Page19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
LLAP as a “relational” datanode
Page20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Example - SparkSQL integration – execution flow
Page 20
HadoopRDD.compute()
LlapInputFormat.getRecordReader()
- Open socket for incoming data
- Send package/plan to LLAP
Check permissions
Generate splits w/LLAP locations
Return securely signed splits
Ranger+HiveServer2
Spark Executor
HadoopRDD.getPartitions()
Get Hive splits
Cluster nodes
Verify splits
Run scan + transform
Send data back
LLAP
Partitions/
Splits
Request
splits
: :
var llapContext =
LlapContext.newInstance(
sparkContext, jdbcUrl)
var df: DataFrame =
llapContext.sql("select *
from tpch_text_5.region")
DataFrame for Hive/LLAP data
Page21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Monitoring LLAP Queries
Page22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Monitoring
• LLAP exposes a UI for
monitoring
• Also has jmx endpoint with
much more data, logs and
jstack endpoints as usual
• Aggregate monitoring UI is
work in progress
Page23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Watching queries – Tez UI integration
Page24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Questions?
?
Interested? Stop by the Hortonworks booth to learn more

Mais conteúdo relacionado

Mais procurados

Innovation in the Enterprise Rent-A-Car Data Warehouse
Innovation in the Enterprise Rent-A-Car Data WarehouseInnovation in the Enterprise Rent-A-Car Data Warehouse
Innovation in the Enterprise Rent-A-Car Data WarehouseDataWorks Summit
 
Sharing metadata across the data lake and streams
Sharing metadata across the data lake and streamsSharing metadata across the data lake and streams
Sharing metadata across the data lake and streamsDataWorks Summit
 
A New "Sparkitecture" for modernizing your data warehouse
A New "Sparkitecture" for modernizing your data warehouseA New "Sparkitecture" for modernizing your data warehouse
A New "Sparkitecture" for modernizing your data warehouseDataWorks Summit/Hadoop Summit
 
The Next Generation of Data Processing and Open Source
The Next Generation of Data Processing and Open SourceThe Next Generation of Data Processing and Open Source
The Next Generation of Data Processing and Open SourceDataWorks Summit/Hadoop Summit
 
Preventative Maintenance of Robots in Automotive Industry
Preventative Maintenance of Robots in Automotive IndustryPreventative Maintenance of Robots in Automotive Industry
Preventative Maintenance of Robots in Automotive IndustryDataWorks Summit/Hadoop Summit
 
End-to-End Security and Auditing in a Big Data as a Service Deployment
End-to-End Security and Auditing in a Big Data as a Service DeploymentEnd-to-End Security and Auditing in a Big Data as a Service Deployment
End-to-End Security and Auditing in a Big Data as a Service DeploymentDataWorks Summit/Hadoop Summit
 
Is your Enterprise Data lake Metadata Driven AND Secure?
Is your Enterprise Data lake Metadata Driven AND Secure?Is your Enterprise Data lake Metadata Driven AND Secure?
Is your Enterprise Data lake Metadata Driven AND Secure?DataWorks Summit/Hadoop Summit
 
Real time fraud detection at 1+M scale on hadoop stack
Real time fraud detection at 1+M scale on hadoop stackReal time fraud detection at 1+M scale on hadoop stack
Real time fraud detection at 1+M scale on hadoop stackDataWorks Summit/Hadoop Summit
 
Enabling Modern Application Architecture using Data.gov open government data
Enabling Modern Application Architecture using Data.gov open government dataEnabling Modern Application Architecture using Data.gov open government data
Enabling Modern Application Architecture using Data.gov open government dataDataWorks Summit
 
Format Wars: from VHS and Beta to Avro and Parquet
Format Wars: from VHS and Beta to Avro and ParquetFormat Wars: from VHS and Beta to Avro and Parquet
Format Wars: from VHS and Beta to Avro and ParquetDataWorks Summit
 
Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It! Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It! Cécile Poyet
 
Built-In Security for the Cloud
Built-In Security for the CloudBuilt-In Security for the Cloud
Built-In Security for the CloudDataWorks Summit
 

Mais procurados (20)

Innovation in the Enterprise Rent-A-Car Data Warehouse
Innovation in the Enterprise Rent-A-Car Data WarehouseInnovation in the Enterprise Rent-A-Car Data Warehouse
Innovation in the Enterprise Rent-A-Car Data Warehouse
 
Sharing metadata across the data lake and streams
Sharing metadata across the data lake and streamsSharing metadata across the data lake and streams
Sharing metadata across the data lake and streams
 
Big Data Ready Enterprise
Big Data Ready Enterprise Big Data Ready Enterprise
Big Data Ready Enterprise
 
A New "Sparkitecture" for modernizing your data warehouse
A New "Sparkitecture" for modernizing your data warehouseA New "Sparkitecture" for modernizing your data warehouse
A New "Sparkitecture" for modernizing your data warehouse
 
The Next Generation of Data Processing and Open Source
The Next Generation of Data Processing and Open SourceThe Next Generation of Data Processing and Open Source
The Next Generation of Data Processing and Open Source
 
Preventative Maintenance of Robots in Automotive Industry
Preventative Maintenance of Robots in Automotive IndustryPreventative Maintenance of Robots in Automotive Industry
Preventative Maintenance of Robots in Automotive Industry
 
HDFS: Optimization, Stabilization and Supportability
HDFS: Optimization, Stabilization and SupportabilityHDFS: Optimization, Stabilization and Supportability
HDFS: Optimization, Stabilization and Supportability
 
End-to-End Security and Auditing in a Big Data as a Service Deployment
End-to-End Security and Auditing in a Big Data as a Service DeploymentEnd-to-End Security and Auditing in a Big Data as a Service Deployment
End-to-End Security and Auditing in a Big Data as a Service Deployment
 
Is your Enterprise Data lake Metadata Driven AND Secure?
Is your Enterprise Data lake Metadata Driven AND Secure?Is your Enterprise Data lake Metadata Driven AND Secure?
Is your Enterprise Data lake Metadata Driven AND Secure?
 
Real time fraud detection at 1+M scale on hadoop stack
Real time fraud detection at 1+M scale on hadoop stackReal time fraud detection at 1+M scale on hadoop stack
Real time fraud detection at 1+M scale on hadoop stack
 
Enabling Modern Application Architecture using Data.gov open government data
Enabling Modern Application Architecture using Data.gov open government dataEnabling Modern Application Architecture using Data.gov open government data
Enabling Modern Application Architecture using Data.gov open government data
 
Scheduling Policies in YARN
Scheduling Policies in YARNScheduling Policies in YARN
Scheduling Policies in YARN
 
Hybrid Data Platform
Hybrid Data Platform Hybrid Data Platform
Hybrid Data Platform
 
Accelerating Data Warehouse Modernization
Accelerating Data Warehouse ModernizationAccelerating Data Warehouse Modernization
Accelerating Data Warehouse Modernization
 
Active Learning for Fraud Prevention
Active Learning for Fraud PreventionActive Learning for Fraud Prevention
Active Learning for Fraud Prevention
 
Format Wars: from VHS and Beta to Avro and Parquet
Format Wars: from VHS and Beta to Avro and ParquetFormat Wars: from VHS and Beta to Avro and Parquet
Format Wars: from VHS and Beta to Avro and Parquet
 
Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It! Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It!
 
Apache Eagle: Secure Hadoop in Real Time
Apache Eagle: Secure Hadoop in Real TimeApache Eagle: Secure Hadoop in Real Time
Apache Eagle: Secure Hadoop in Real Time
 
Built-In Security for the Cloud
Built-In Security for the CloudBuilt-In Security for the Cloud
Built-In Security for the Cloud
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 

Semelhante a LLAP: Sub-Second Analytical Queries in Hive

LLAP: Building Cloud First BI
LLAP: Building Cloud First BILLAP: Building Cloud First BI
LLAP: Building Cloud First BIDataWorks Summit
 
Sub-second-sql-on-hadoop-at-scale
Sub-second-sql-on-hadoop-at-scaleSub-second-sql-on-hadoop-at-scale
Sub-second-sql-on-hadoop-at-scaleYifeng Jiang
 
Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_featuresAlberto Romero
 
LLAP: long-lived execution in Hive
LLAP: long-lived execution in HiveLLAP: long-lived execution in Hive
LLAP: long-lived execution in HiveDataWorks Summit
 
Stinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksStinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksData Con LA
 
Hive Performance Dataworks Summit Melbourne February 2019
Hive Performance Dataworks Summit Melbourne February 2019Hive Performance Dataworks Summit Melbourne February 2019
Hive Performance Dataworks Summit Melbourne February 2019alanfgates
 
Fast SQL on Hadoop, Really?
Fast SQL on Hadoop, Really?Fast SQL on Hadoop, Really?
Fast SQL on Hadoop, Really?DataWorks Summit
 
Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30Ashish Narasimham
 
Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017alanfgates
 
An Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseDataWorks Summit
 
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataDatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataHakka Labs
 
Gruter TECHDAY 2014 Realtime Processing in Telco
Gruter TECHDAY 2014 Realtime Processing in TelcoGruter TECHDAY 2014 Realtime Processing in Telco
Gruter TECHDAY 2014 Realtime Processing in TelcoGruter
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?DataWorks Summit
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoDataWorks Summit
 
Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?DataWorks Summit
 

Semelhante a LLAP: Sub-Second Analytical Queries in Hive (20)

LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
LLAP: Building Cloud First BI
LLAP: Building Cloud First BILLAP: Building Cloud First BI
LLAP: Building Cloud First BI
 
Sub-second-sql-on-hadoop-at-scale
Sub-second-sql-on-hadoop-at-scaleSub-second-sql-on-hadoop-at-scale
Sub-second-sql-on-hadoop-at-scale
 
Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_features
 
LLAP: long-lived execution in Hive
LLAP: long-lived execution in HiveLLAP: long-lived execution in Hive
LLAP: long-lived execution in Hive
 
Hadoop 3 in a Nutshell
Hadoop 3 in a NutshellHadoop 3 in a Nutshell
Hadoop 3 in a Nutshell
 
Stinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksStinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of Hortonworks
 
Hive Performance Dataworks Summit Melbourne February 2019
Hive Performance Dataworks Summit Melbourne February 2019Hive Performance Dataworks Summit Melbourne February 2019
Hive Performance Dataworks Summit Melbourne February 2019
 
Fast SQL on Hadoop, Really?
Fast SQL on Hadoop, Really?Fast SQL on Hadoop, Really?
Fast SQL on Hadoop, Really?
 
Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30
 
Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017
 
An Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data Warehouse
 
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataDatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
 
Apache Kafka Best Practices
Apache Kafka Best PracticesApache Kafka Best Practices
Apache Kafka Best Practices
 
Gruter TECHDAY 2014 Realtime Processing in Telco
Gruter TECHDAY 2014 Realtime Processing in TelcoGruter TECHDAY 2014 Realtime Processing in Telco
Gruter TECHDAY 2014 Realtime Processing in Telco
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - Tokyo
 
Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?
 

Mais de DataWorks Summit/Hadoop Summit

Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerDataWorks Summit/Hadoop Summit
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformDataWorks Summit/Hadoop Summit
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDataWorks Summit/Hadoop Summit
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...DataWorks Summit/Hadoop Summit
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...DataWorks Summit/Hadoop Summit
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLDataWorks Summit/Hadoop Summit
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)DataWorks Summit/Hadoop Summit
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...DataWorks Summit/Hadoop Summit
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesDataWorks Summit/Hadoop Summit
 

Mais de DataWorks Summit/Hadoop Summit (20)

Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
 
State of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache ZeppelinState of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache Zeppelin
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science Platform
 
Revolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and ZeppelinRevolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and Zeppelin
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSense
 
Hadoop Crash Course
Hadoop Crash CourseHadoop Crash Course
Hadoop Crash Course
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
 
Schema Registry - Set you Data Free
Schema Registry - Set you Data FreeSchema Registry - Set you Data Free
Schema Registry - Set you Data Free
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and ML
 
How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
 

Último

"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 

Último (20)

"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 

LLAP: Sub-Second Analytical Queries in Hive

  • 1. Page1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved LLAP: Sub-Second Analytical Queries in Hive Gopal Vijayaraghavan
  • 2. Page2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Why LLAP? • People like Hive • Disk->Mem is getting further away – Cloud Storage isn’t co-located – Disks are connected to the CPU via network • Security landscape is changing – Cells & Columns are the new security boundary, not files – Safely masking columns needs a process boundary • Concurrency, Performance & Scale are at conflict – Concurrency at 100k queries/hour – Latencies at 2-5 seconds/query – Petabyte scale warehouses (with terabytes of “hot” data) Node LLAP Process Cache Query Fragment HDFS Query Fragment
  • 3. Page3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved What is LLAP? • Hybrid model combining daemons and containers for fast, concurrent execution of analytical workloads (e.g. Hive SQL queries) • Concurrent queries without specialized YARN queue setup • Multi-threaded execution of vectorized operator pipelines • Asynchronous IO and efficient in-memory caching • Relational view of the data available thru the API • High performance scans, execution code pushdown • Centralized data security Node LLAP Process Cache Query Fragment HDFS Query Fragment
  • 4. Page4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Hive 2.0 (+ LLAP) • Transparent to Hive users, BI tools, etc. • Hive decides where query fragments run (LLAP, Container, AM) based on configuration, data size, format, etc. • Each Query coordinated independently by a Tez AM • Number of concurrent queries throttled by number of active AMs • Hive Operators used for processing • Tez Runtime used for data transfer HiveServer2 Query/AM Controller Client(s) YARN Cluster AM1 llapd llapd Container AM1 Container AM1 llapd Container AM2 AM2 AM3 llapd
  • 5. Page5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Industry benchmark – 10Tb scale 0 5 10 15 20 25 30 35 40 45 50 query3 query12 query20 query21 query26 query27 query42 query52 query55 query73 query89 query91 query98 Time(seconds) LLAP vs Hive 1.x 10TB Scale Hive 1.x LLAP90% faster
  • 6. Page6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Evaluation from a customer case study 0 20000 1 3 5 7 9 11 13 15 17 19 21 23 25 27 • Aggregate daily statistics for a time interval: SELECT yyyymmdd, sum(total_1), sum(total_2), ... from table where yyyymmdd >= xxx and yyyymmdd < xxx and userid = xxx group by userid, yyyymmdd; Max Max Max Avg Avg Avg 0 1 2 3 4 5 6 7 8 D W M Y D W M Y D W M Y Tez Phoenix LLAP Executiontime,s
  • 7. Page7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Evaluation from a customer case study • Display a large report Execution Time in seconds over time range Max Max Avg Avg 0 5 10 15 20 D W M Y D W M Y Tez LLAP Executiontime,s
  • 8. Page8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Cut-awaytodemo (GIF)
  • 9. Page9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved How does LLAP make queries faster?
  • 10. Page10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved LLAP Queue Technical overview – execution • LLAP daemon has a number of executors (think containers) that execute work "fragments" • Fragments are parts of one, or multiple parallel workloads (e.g. Hive SQL queries) • Work queue with pluggable priority • Geared towards low latency queries over long- running queries (by default) • I/O is similar to containers – read/write to HDFS, shuffle, other storages and formats • Streaming output for data API Executor Q1 Map 1 Executor External read Executor Q3 Reducer 3 Q1 Map 1 Q1 Map 1 Q3 Map 19 HDFS Waiting for shuffle inputs HBase Container (shuffle input) Spark executor
  • 11. Page11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Executor Technical overview – IO layer • Optional: when executing inside LLAP • All other formats use in-sync mode • Asynchronous IO for Hive • Wraps over InputFormat, reads through cache • Supported with ORC • Transparent, compressed in-memory cache • Format-specific, extensible • NVMe/NVDIMM caches RecordReade r Fragment Cache IO thread Plan & decode Read, decompress Metadata cache Actual data (HDFS, S3, …) What to read Data buffers Splits Vectorized data Indexes
  • 12. Page12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Parallel queries – priorities, preemption • Lower-priority fragments can be preempted • For example, a fragment can start running before its inputs are ready, for better pipelining; such fragments may be preempted • LLAP work queue examines the DAG parameters to give preference to interactive (BI) queries LLAP QueueExecutor Executor Interactive query map 1/3 … Interactive query map 3/3 Executor Interactive query map 2/3 Wide query reduce waiting Time ClusterUtilization Long-Running Query Short-Running Query
  • 13. Page13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved I/OExecution In-memory processing – present and future Decoder Distributed FS Fragment Hive operator Hive operator Vectorized processin g col1 col2 Low-level pushdown (e.g. filter)* SSD cache* Off-heap cache Compression codec Native data vectors - work in progress * Compact encoded data Compressed data col1 col2
  • 14. Page14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved First query erformance • Cold LLAP is nearly as fast as shared pre-warmed containers (impractical on real clusters) • Realistic (long-running) LLAP ~3x faster than realistic (no prewarm) Tez No prewarm, 20.94 Shared prewarm 11.96 Cold LLAP 13.23 Realistic LLAP 7.49 0 5 10 15 20 25 Firstqueryruntime,s
  • 15. Page15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved JIT Performance – heavy use • Cache disabled! 13.23 7.93 7.7 6.29 5.44 5.3 5.01 4.89 4.83 7.49 6.1 4.61 4.59 4.93 4.62 4.63 4.45 4.25 0 2 4 6 8 10 12 14 0 1 2 3 4 5 6 7 8 Queryruntime,s Query run Cold LLAP LLAP after an unrelated workload ~2xtimesaving For cold start, the warmup took 3 runs
  • 16. Page16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Parallel query execution – LLAP vs Hive 1.2 3131 2416 1741 1420 1508 531 291 147 94 75 0 500 1000 1500 2000 2500 3000 3500 1 2 4 8 16 Totalruntimeforallqueries,s. Number of concurrent users Hive 1.2 total runtime LLAP total runtime Hive 1.2 linear scaling LLAP linear scaling Parallelism eats into latency
  • 17. Page17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Parallel query execution – 10Tb scale 531 291 147 94 75 0 100 200 300 400 500 600 1 2 4 8 16 Totalruntimeforallqueries,s. Number of concurrent users LLAP total runtime LLAP linear scaling Latency sensitive pre-emption
  • 18. Page18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Performance – cache on HDFS, 1Tb scale 20.2 18.3 17.6 16.2 16.2 16.8 14.5 11.3 0.00% 20.00% 40.00% 60.00% 80.00% 100.00% 0 5 10 15 20 25 24 26 28 30 32 34 36 38 40 42 Cachehitrate Queryruntime,s Cache size, Gb Query runtime, after unrelated workload Query runtime, after related workload Metadata hit rate Data hit rate
  • 19. Page19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved LLAP as a “relational” datanode
  • 20. Page20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Example - SparkSQL integration – execution flow Page 20 HadoopRDD.compute() LlapInputFormat.getRecordReader() - Open socket for incoming data - Send package/plan to LLAP Check permissions Generate splits w/LLAP locations Return securely signed splits Ranger+HiveServer2 Spark Executor HadoopRDD.getPartitions() Get Hive splits Cluster nodes Verify splits Run scan + transform Send data back LLAP Partitions/ Splits Request splits : : var llapContext = LlapContext.newInstance( sparkContext, jdbcUrl) var df: DataFrame = llapContext.sql("select * from tpch_text_5.region") DataFrame for Hive/LLAP data
  • 21. Page21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Monitoring LLAP Queries
  • 22. Page22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Monitoring • LLAP exposes a UI for monitoring • Also has jmx endpoint with much more data, logs and jstack endpoints as usual • Aggregate monitoring UI is work in progress
  • 23. Page23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Watching queries – Tez UI integration
  • 24. Page24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Questions? ? Interested? Stop by the Hortonworks booth to learn more