1. Apache Tez: Accelerating Hadoop Query Processing
Arun C. Murthy Bikas Saha
Founder & Architect Hortonworks
@acmurthy @bikassaha
(@hortonworks)
2. © Hortonworks Inc. 2013
Hello!
• Founder/Architect at Hortonworks Inc.
–Lead - Map-Reduce/YARN/Tez
–Formerly, Architect Hadoop MapReduce, Yahoo
–Responsible for running Hadoop MapReduce as a service for all
of Yahoo (~50k nodes footprint)
• Apache Hadoop, ASF
–Former VP, Apache Hadoop, ASF (Chair of Apache Hadoop PMC)
–Long-term Committer/PMC member (full time for 7 years)
–Release Manager for hadoop-2.x
3. Once upon a time …
… long, long ago, there was a kingdom we shall call
Apache Hadoop
http://2.bp.blogspot.com/-hIp99urgxCk/UAsSFo4i8YI/AAAAAAAAAFg/IzjNDwrBBVg/s1600/magickingdo
4. Hadoop begat …
… a two-headed monster on every node in the kingdom;
each belonged to a different clan and answered to a
different master
http://4.bp.blogspot.com/_C7CsfdqySYc/TNSKvIwiFcI/AAAAAAAAAbs/2FSU2TV_rRA/s1600/Two-Headed+Monster+-+With+Identifiers+-+Jan+19,+2009_0.jpg
5. Knights of Bytes - HDFS
… stored data uncompromisingly in directories/files, nary a
care about contents
http://whoiscraigmoser.com/Images/identity/knight.png
6. Prince of Processing - MapReduce
He ruled with an iron fist by mapping,
and then by mercilessly reducing data
http://media.comicvine.com/uploads/14/144886/2868181-sauron.jpg
7. Peace Reigned
… for a while with the odd change in the direction of the wind
http://www.get-covers.com/wp-content/uploads/2012/07/Peace.jpg
8. Slowly, but surely …
Human beings define reality through misery and suffering.
- Agent Smith
http://api.ning.com/files/*oWmhl7LBlXuodD2itWUUtOautEVfD*pbBn57L8ThCyYIykiTuzkO4lJY1bwaNbJF7GecTDwsVj3EFHpDM-F1y-UW4b3Xsvh/matrix_revolutions_agent_smith_04.bmp
10. Slowly, but surely …
… people of the kingdom clamored for more.
A palpable sense of greed & expectation.
http://sidoxia.files.wordpress.com/2011/11/wall-st-greed-st1.jpg
11. Signs of Distress
SQL said some, others said Machine Learning,
still others said Real-Time Event Processing
http://www.truth-seeker.info/wp-content/uploads/2012/11/distress.jpg
12. A Meeting at the Summit
MapReduce is dead!
Err… not quite.
We need more options! We need more!
True…
http://4.bp.blogspot.com/-oqr1t6avx6g/TW55kUnmQvI/AAAAAAAAMMk/q9Jc87MSG4g/s400/arab%2Bleague%2Bround%2Btable%2B%2Bbig%2Bgood%2B2011.bmp
13. A Meeting at the Summit
A common thread YARN running through all applications…
Long live the King!
http://whipup.net/wp-content/images/2008/08/yarn.gif
14. The Edict
Henceforth, in the Kingdom of King YARN…
MapReduce has been relegated to the status
of, merely, one of the applications!
http://www.napavintners.org/images/winery_Labels/EdictWines-800HW.jpg
15. Reign of King YARN
King YARN came to the throne with promises to return power
to all applications equally, lower performance taxes and resource management…
http://images.fineartamerica.com/images-medium-large/the-coronation-the-crown-that-queen-everett.jpg
16. Oh the Shame!
Well, at least, Prince MapReduce still had powerful allies like
Highness Hive, Powerful Pig, Cheery Cascading…
http://www.gibbsmagazine.com/MPj03414090000%5B1%5D.jpg
17. Things get worse before better
Unfortunately, things got a lot worse for Prince MapReduce…
http://www.deviantart.com/download/144412184/Smile__Tomorrow_will_be_worse__by_daGrevis.jpg
18. Knight Tez
He did MapReduce, and so much more…
Smartly aligned himself to Kingdom YARN.
http://twomorrows.com/alterego/media/08shiningknight.gif
19. Knight Tez
Long-term alliances of MapReduce with
Hive, Pig, Cascading etc. broke up…
http://www.officialpsds.com/images/thumbs/broken-glass-psd44132.png
… they decided to throw their
lot in with Knight Tez!
http://informatica.upg-ploiesti.ro/62689/img/partners.jpg
22. Every season has a flavor…
SQL-on-Hadoop is the new black!
SQL-on-Hadoop will be solved within
the existing ecosystem
23. Looking ahead
What will it be next year?
Real-time event processing?
Machine Learning?
24. Play to our strengths
Invest in the Apache Hadoop platform
and the ecosystem (Hive et al).
26. Tez – Introduction
• Distributed execution
framework targeted towards
data-processing applications.
• Based on expressing a
computation as a dataflow
graph.
• Built on top of YARN – the
resource management
framework for Hadoop.
• Open source Apache incubator
project and Apache licensed.
27. Tez – Design Themes
• Empowering End Users
• Execution Performance
28. Tez – Empowering End Users
• Expressive dataflow definition APIs
• Flexible Input-Processor-Output runtime model
• Data type agnostic
• Simplifying deployment
29. Tez – Empowering End Users
• Expressive dataflow definition APIs
–Enable definition of complex data flow pipelines using simple
graph connection APIs. Tez expands the logical plan at runtime.
–Targeted towards data processing applications like Hive/Pig but
not limited to them. Hive/Pig query plans naturally map to Tez dataflow
graphs with no translation impedance.
[Figure: dataflow graph expanded into tasks TaskA-1, TaskA-2, TaskB-1, TaskB-2, TaskC-1, TaskC-2, TaskD-1, TaskD-2, TaskE-1, TaskE-2]
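The idea of a simple graph connection API that is expanded into tasks at runtime can be sketched in a few lines. This is a toy model, not the Tez API: the `Dag` class, its methods, the parallelism of 2 per vertex and the edges chosen in `build_example` are all assumptions made for illustration (the slide does not state which vertices are connected).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class Dag:
    """Toy logical dataflow graph: named vertices plus directed edges (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.parallelism = {}  # vertex name -> number of tasks
        self.edges = []        # (producer, consumer) pairs

    def add_vertex(self, name, parallelism):
        self.parallelism[name] = parallelism
        return self

    def add_edge(self, producer, consumer):
        if producer not in self.parallelism or consumer not in self.parallelism:
            raise KeyError("both endpoints must be added as vertices first")
        self.edges.append((producer, consumer))
        return self

    def expand(self):
        """Expand the logical plan into per-vertex task names, producers first."""
        deps = {v: set() for v in self.parallelism}
        for producer, consumer in self.edges:
            deps[consumer].add(producer)
        order = TopologicalSorter(deps).static_order()
        return [
            (v, [f"Task{v}-{i + 1}" for i in range(self.parallelism[v])])
            for v in order
        ]


def build_example():
    # Assumed wiring: A and B feed D; C and D feed E.
    dag = Dag("example")
    for vertex in "ABCDE":
        dag.add_vertex(vertex, parallelism=2)
    return (dag.add_edge("A", "D").add_edge("B", "D")
               .add_edge("C", "E").add_edge("D", "E"))
```

Calling `build_example().expand()` lists each vertex after all of its producers, together with its task names such as `TaskD-1` and `TaskD-2`.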
30. Tez – Empowering End Users
• Expressive dataflow definition APIs
[Figure: Distributed Sort. Stages: Preprocessor Stage, Partition Stage, Aggregate Stage, each with Task-1 and Task-2; a Sampler; data labels: Samples, Ranges]
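A sampled, range-partitioned sort is one common way to realize the stages named on this slide. The sketch below assumes that reading: preprocessor tasks produce samples, a sampler turns them into range split points, the partition stage routes records by range, and the aggregate stage sorts each range. Function names and parameters are hypothetical and everything runs in a single process.

```python
import bisect
import random


def sample_stage(partitions, per_task, rng):
    """Preprocessor stage: each task contributes a small random sample."""
    samples = []
    for part in partitions:
        samples.extend(rng.sample(part, min(per_task, len(part))))
    return samples


def choose_ranges(samples, num_ranges):
    """Sampler: pick num_ranges - 1 split points from the sorted samples."""
    ordered = sorted(samples)
    return [ordered[(i * len(ordered)) // num_ranges] for i in range(1, num_ranges)]


def partition_stage(partitions, splits):
    """Partition stage: route every record to the range that contains it."""
    buckets = [[] for _ in range(len(splits) + 1)]
    for part in partitions:
        for record in part:
            buckets[bisect.bisect_right(splits, record)].append(record)
    return buckets


def aggregate_stage(buckets):
    """Aggregate stage: sort each range; concatenating the ranges is globally sorted."""
    return [sorted(bucket) for bucket in buckets]


def distributed_sort(partitions, num_ranges=2, per_task=4, seed=0):
    rng = random.Random(seed)
    splits = choose_ranges(sample_stage(partitions, per_task, rng), num_ranges)
    return aggregate_stage(partition_stage(partitions, splits))
```

Because every record in range k is smaller than every record in range k+1, the aggregate tasks never need to exchange data with each other; the quality of the samples only affects how evenly the ranges are loaded.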
31. Tez – Empowering End Users
• Flexible Input-Processor-Output runtime model
–Construct physical runtime executors dynamically by connecting
different inputs, processors and outputs.
–End goal is to have a library of inputs, outputs and processors that
can be programmatically composed to generate useful operators.
[Figure: example compositions]
IntermediateReduce: ShuffleInput, ReduceProcessor, FileSortedOutput
FinalReduce: ShuffleInput, ReduceProcessor, HDFSOutput
PairwiseJoin: Input1 and Input2, JoinProcessor, FileSortedOutput
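Composing a task from pluggable inputs, a processor and an output might look roughly like the following. It is a minimal sketch under stated assumptions: `TaskSpec`, `join_processor`, `sorted_sink` and `run_pairwise_join` are hypothetical names, inputs are plain in-memory lists, and none of this is the actual Tez runtime interface.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class TaskSpec:
    """A task assembled from interchangeable inputs, one processor and one output."""
    name: str
    inputs: List[Callable[[], Iterable]]
    processor: Callable[[List[Iterable]], Iterable]
    output: Callable[[Iterable], None]

    def run(self):
        self.output(self.processor([read() for read in self.inputs]))


def join_processor(streams):
    """Hash join of two (key, value) streams on key."""
    left, right = streams
    index = {}
    for key, value in left:
        index.setdefault(key, []).append(value)
    for key, value in right:
        for left_value in index.get(key, []):
            yield key, left_value, value


def sorted_sink(target):
    """Output that writes records to `target` in sorted order."""
    def write(records):
        target.extend(sorted(records))
    return write


def run_pairwise_join(left, right):
    out = []
    TaskSpec("PairwiseJoin", [lambda: left, lambda: right],
             join_processor, sorted_sink(out)).run()
    return out
```

Swapping `sorted_sink` for a different output, or `join_processor` for a reduce-style processor, changes the operator without touching the other two pieces, which is the point of the model.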
32. Tez – Empowering End Users
• Data type agnostic
–Tez is only concerned with the movement of data. Files and
streams of bytes.
–Does not impose any data format on the user application. MR
application can use Key-Value pairs on top of Tez. Hive and Pig
can use tuple oriented formats that are natural and native to them.
[Figure: a Tez Task moves Bytes in and out as a File or Stream; User Code interprets them as Key Value pairs or Tuples]
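The separation between byte movement and data format can be illustrated with a toy transport that never looks inside its payload. Everything here is an assumption for illustration: `move_bytes` stands in for the framework, and the two codecs (a length-prefixed key-value encoding and a JSON tuple encoding) are arbitrary choices, not formats used by MapReduce, Hive or Pig.

```python
import json
import struct


def move_bytes(payload: bytes) -> bytes:
    """Stand-in for the framework: it forwards bytes and never inspects them."""
    return bytes(payload)


def encode_kv(key: str, value: int) -> bytes:
    """Key-value codec chosen by one application (hypothetical format)."""
    raw_key = key.encode("utf-8")
    return struct.pack(">I", len(raw_key)) + raw_key + struct.pack(">q", value)


def decode_kv(data: bytes):
    (key_len,) = struct.unpack_from(">I", data, 0)
    key = data[4:4 + key_len].decode("utf-8")
    (value,) = struct.unpack_from(">q", data, 4 + key_len)
    return key, value


def encode_tuple(row) -> bytes:
    """Tuple codec chosen by another application (hypothetical format)."""
    return json.dumps(list(row)).encode("utf-8")


def decode_tuple(data: bytes):
    return tuple(json.loads(data.decode("utf-8")))
```

Both applications share `move_bytes` unchanged; only the user code on either side of it knows whether the bytes are key-value pairs or tuples.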
33. Tez – Empowering End Users
• Simplifying deployment
–Tez is a completely client side application.
–No deployments to do. Simply upload to any accessible
FileSystem and change local Tez configuration to point to that.
–Enables running different versions concurrently. Easy to test new
functionality while keeping stable versions for production.
–Leverages YARN local resources and distributed cache.
[Figure: a TezClient on each Client Machine; Tez Lib 1 and Tez Lib 2 stored in HDFS; TezTasks running under Node Managers]
34. Tez – Empowering End Users
• Expressive dataflow definition APIs
• Flexible Input-Processor-Output runtime model
• Data type agnostic
• Simplifying usage
With great power (APIs) come great responsibilities
35. Tez – Execution Performance
• Performance gains over Map Reduce
• Plan reconfiguration at runtime
• Optimal resource management
• Dynamic physical data flow decisions
36. Tez – Execution Performance
• Performance gains over Map Reduce
–Eliminate replicated write barrier between successive
computations.
–Eliminate job launch overhead of workflow jobs.
–Eliminate extra stage of map reads in every workflow job.
–Eliminate queue and resource contention suffered by workflow
jobs that are started after a predecessor job completes.
[Figure: Pig/Hive on MR compared with Pig/Hive on Tez]
37. Tez – Execution Performance
• Plan reconfiguration at runtime
–Dynamic runtime concurrency control based on data size, user
operator resources, available cluster resources and locality.
–Advanced changes in dataflow graph structure.
–Progressive graph construction in concert with user optimizer.
[Figure: inputs to the decision are HDFS blocks and YARN resources. Planned: Stage 1 with 50 maps and 100 partitions, Stage 2 with 100 reducers. At runtime, with only 10 GB of data, Stage 2 is reconfigured from 100 reducers to 10.]
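The arithmetic behind shrinking Stage 2 from 100 reducers to 10 can be sketched as follows. The 1 GiB-per-reducer target is an assumption chosen so that the numbers match the slide; the function names are hypothetical and the real decision also weighs cluster resources and locality, which this sketch ignores.

```python
import math

GIB = 1024 ** 3


def choose_reducers(planned, observed_bytes, bytes_per_reducer):
    """Never exceed the planned count; otherwise size reducers to the observed data."""
    needed = max(1, math.ceil(observed_bytes / bytes_per_reducer))
    return min(planned, needed)


def regroup_partitions(num_partitions, num_reducers):
    """Assign the already-written partitions to fewer reducers in contiguous groups."""
    groups = [[] for _ in range(num_reducers)]
    for partition in range(num_partitions):
        groups[partition * num_reducers // num_partitions].append(partition)
    return groups


# Slide scenario: 100 partitions planned, only 10 GiB actually produced.
reducers = choose_reducers(planned=100, observed_bytes=10 * GIB, bytes_per_reducer=GIB)
groups = regroup_partitions(100, reducers)
```

Stage 1 keeps writing its 100 partitions as planned; only the grouping of partitions onto Stage 2 tasks changes, so nothing already computed has to be redone.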
38. Tez – Execution Performance
• Optimal resource management
–Reuse YARN containers to launch new tasks.
–Reuse YARN containers to enable shared objects across tasks.
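[Figure: the Tez Application Master, in its own YARN container, sends Start Task to a TezTask Host in another YARN container; the host runs TezTask1, reports Task Done, and receives Start Task for TezTask2. SharedObjects are held by the host across tasks.]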
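Container reuse and shared objects can be modeled with a small simulation. This is a toy sketch, not YARN or Tez code: `Container`, `AppMaster` and `make_lookup_task` are hypothetical, tasks run sequentially in one process, and the "expensive load" is just a dictionary literal.

```python
class Container:
    """Toy stand-in for a YARN container that hosts tasks one after another."""

    def __init__(self, container_id):
        self.container_id = container_id
        self.shared_objects = {}  # survives across tasks run in this container
        self.tasks_run = 0

    def run(self, task):
        self.tasks_run += 1
        return task(self.shared_objects)


class AppMaster:
    """Hands each task to an idle container, launching a new one only if none is idle."""

    def __init__(self):
        self.idle = []
        self.launched = 0

    def _acquire(self):
        if self.idle:
            return self.idle.pop()
        self.launched += 1
        return Container(self.launched)

    def run_all(self, tasks):
        results = []
        for task in tasks:
            container = self._acquire()
            results.append(container.run(task))  # "Start Task" ... "Task Done"
            self.idle.append(container)
        return results


def make_lookup_task(key, load_log):
    """Task that needs a lookup table; loads it only if the container lacks it."""
    def task(shared):
        if "lookup" not in shared:
            load_log.append("load")  # record the expensive load
            shared["lookup"] = {"a": 1, "b": 2}
        return shared["lookup"][key]
    return task
```

Running two lookup tasks through one `AppMaster` launches a single container and loads the table once; the second task finds it already present in `shared_objects`.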
39. Tez – Execution Performance
• Dynamic physical data flow decisions
–Decide the type of physical byte movement and storage on the fly.
–Store intermediate data on distributed store, local store or in-memory.
–Transfer bytes via blocking files or streaming and the spectrum in
between.
[Figure: decided at runtime. A small producer output is handed to the consumer in-memory; otherwise producer and consumer exchange data through a local file.]
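Choosing the physical transfer on the fly can be sketched as a size check made when the producer finishes. The memory budget threshold, the function names and the use of a temporary file as the "local store" are assumptions for illustration; the distributed-store and streaming options from the slide are left out.

```python
import os
import tempfile


def choose_store(size_bytes, memory_budget_bytes):
    """Pick the cheapest store that can hold the producer's output."""
    return "in-memory" if size_bytes <= memory_budget_bytes else "local-file"


def hand_off(data: bytes, memory_budget_bytes: int):
    """Return the chosen store and a zero-argument reader for the consumer."""
    kind = choose_store(len(data), memory_budget_bytes)
    if kind == "in-memory":
        return kind, (lambda: data)

    fd, path = tempfile.mkstemp(prefix="edge-")
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)

    def read():
        try:
            with open(path, "rb") as handle:
                return handle.read()
        finally:
            os.remove(path)  # intermediate data is deleted after it is consumed

    return kind, read
```

The consumer only ever calls the reader it is given, so the producer side can change its decision per edge and per run without the consumer code changing.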
40. Tez – Current status
• Apache Incubator Project
–Rapid development. Over 270 JIRA issues opened, over 170 resolved.
–Growing community.
• Focus on stability
–Testing and quality are highest priority.
–Code ready and deployed on multi-node clusters.
• DAG of MR processing is working
– Already functionally equivalent to Map Reduce. Existing Map
Reduce jobs can be executed on Tez with few or no changes.
– Working Hive prototype that can target Tez for execution of
queries.
–Work started on prototype of Pig that can target Tez.
41. Tez – Current status
[Figure: Typical pattern in a TPC-DS query: the Fact Table is joined with Dimension Table 1 to give Result Table 1, which is joined with Dimension Table 2 to give Result Table 2, which is joined with Dimension Table 3 to give Result Table 3. Optimization for small data sets: the Fact Table is combined with the dimension tables directly. Both can now run as a single Tez job.]
42. Tez – Looking ahead
• Early adopters and contributors welcome
–Adopters to drive more scenarios. Contributors to make them
happen.
• Stay tuned for Tez meetups with deep dives on Tez
architecture and using Tez
• Useful links
–Work tracking: https://issues.apache.org/jira/browse/TEZ
–Code: https://github.com/apache/incubator-tez
–High level design document and API specification:
https://issues.apache.org/jira/browse/TEZ-65
–Developer list: dev@tez.incubator.apache.org
–User list: user@tez.incubator.apache.org
–Issues list: issues@tez.incubator.apache.org
43. Tez – Takeaways
• Distributed execution framework that works on
computations represented as dataflow graphs
• Naturally maps to execution plans produced by query
optimizers
• Execution architecture designed to enable dynamic
performance optimizations at runtime
• Open source Apache project – your use-cases and
code are welcome
• It works and is already being used by Hive