Pig on Tez - Low Latency ETL with Big Data
1. Pig on Tez
Daniel Dai
@daijy
Rohini Palaniswamy
@rohini_pswamy
Hadoop Summit 2014, San Jose
2. Agenda
Team Introduction
Apache Pig
Why Pig on Tez?
Pig on Tez
- Design
- Tez features in Pig
- Performance
- Current status
- Future Plan
3. Apache Pig on Tez Team
Daniel Dai
Pig PMC
Hortonworks
Rohini Palaniswamy
Pig PMC
Yahoo!
Olga Natkovich
Pig PMC
Yahoo!
Cheolsoo Park
VP Pig, Pig PMC
Netflix
Mark Wagner
Pig Committer
LinkedIn
Alex Bain
Pig Contributor
LinkedIn
4. Pig Latin
Procedural scripting language
Closer to relational algebra
Heavily used for ETL
Schema or no schema, Pig eats everything
More expressive than SQL and feature rich
- Multiquery, Nested Foreach, Illustrate
- Algebraic and Accumulator Java UDFs
- Script embedding, Scalars, Macros
- Non-Java UDFs (Jython, Python, JavaScript, Groovy, JRuby)
- Distributed Order by, Skewed Join
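A minimal sketch of two of these features, multiquery and nested FOREACH, using hypothetical relation and path names:

logs = LOAD 'logs' AS (user, url, time);
by_user = GROUP logs BY user;
-- nested FOREACH: a per-group pipeline inside the block
per_user = FOREACH by_user {
    urls = DISTINCT logs.url;
    GENERATE group, COUNT(urls);
};
by_url = GROUP logs BY url;
hits = FOREACH by_url GENERATE group, COUNT(logs);
-- multiquery: both STOREs share the single LOAD of 'logs'
STORE per_user INTO 'distinct_urls_per_user';
STORE hits INTO 'hits_per_url';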
5. Pig users
Heavily used for ETL at Web Scale by Major Internet Companies
At Yahoo!
- 60% of total Hadoop jobs run daily
- 12 million monthly Pig jobs
Other heavy users
- Twitter
- Netflix
- LinkedIn
- eBay
- Salesforce
Standard data science tool, covered in university textbooks
6. Why Pig on Tez?
DAG execution framework
Low-level DAG framework
- Build the DAG by defining vertices and edges
- Customize scheduling of the DAG and routing of data
Highly customizable with pluggable implementations
Resource efficient
Performance
- Without having to increase memory
Natively built on top of YARN
- Multi-tenancy, resource allocation come for free
Scale
Security
Excellent support from Tez community
- Bikas Saha, Siddharth Seth, Hitesh Shah
8. Design
- Logical Plan → Physical Plan (LogToPhyTranslationVisitor)
- Physical Plan → Tez Plan (TezCompiler) → Tez Execution Engine
- Physical Plan → MR Plan (MRCompiler) → MR Execution Engine
9. DAG Plan – Split Group by + Join
f = LOAD 'foo' AS (x, y, z);
g1 = GROUP f BY y;
g2 = GROUP f BY z;
j = JOIN g1 BY group, g2 BY group;
[Diagram] MR plan: Load foo is split with multiple outputs; Group by y and Group by z each write to HDFS, and separate jobs load g1 and g2 for the Join.
[Diagram] Tez plan: the split in the Load foo vertex multiplexes into the Group by y and Group by z vertices, which de-multiplex directly into the Join vertex (reduce follows reduce, no intermediate HDFS).
11. DAG Plan – Distributed Orderby
A = LOAD 'foo' AS (x, y);
B = FILTER A BY $0 IS NOT NULL;
C = ORDER B BY x;
[Diagram] MR plan: separate jobs for Load/Filter, Sample, Aggregate, Partition and Sort, with HDFS writes in between; the sample map is staged on the distributed cache.
[Diagram] Tez plan: a Load/Filter & Sample vertex feeds the Aggregate vertex, which broadcasts the sample map to the Partition vertex; a 1-1 unsorted edge connects Partition to Sort, and the sample map is cached in memory.
12. Session Reuse
Feature
- Session reuse
Submit more than one DAG to the same AM
Usage
- Each Pig script uses a single session
- Grunt shell uses one session for all commands until timeout
- More than one DAG is submitted for merge join and 'exec'
Benefits
- A Pig script with 5 MR jobs launches 5 AM containers; a single AM per Pig script in Tez saves capacity.
- Eliminates the queue and resource contention that every new MR job in the pipeline of a multi-stage Pig script faces in MR.
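One case where a single script submits more than one DAG to the same session is merge join, which builds an index over the sorted input before the join itself; a minimal sketch with hypothetical sorted inputs:

A = LOAD 'sorted_a' AS (x, y);
B = LOAD 'sorted_b' AS (x, z);
-- 'merge' join runs an index-building DAG first, then the join DAG,
-- both inside the same Tez session (one AM)
J = JOIN A BY x, B BY x USING 'merge';
STORE J INTO 'joined';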
13. Container Reuse
Features
- Container reuse
Run new tasks on already-launched containers (JVMs)
Usage
- Turned on by default for all pig scripts and grunt shell
Benefits
- Reduced launch overhead
Container request and release overhead
Resource localization overhead
JVM launch time overhead
- Reduced network IO
1-1 edge tasks are launched on the same node
- Object caching
User impact
- Have to review/profile and fix custom LoadFunc/StoreFunc/UDFs for static variables and memory leaks due to JVM reuse.
14. Custom Vertex Input/Output/Processor/Manager
Features
- Custom Vertex Processor
- Custom Input and Output between vertices
- Custom Vertex Manager
Usage
- PigProcessor instead of MapProcessor and ReduceProcessor
- Unsorted input/output
with Partitioner: Union
without Partitioner: Broadcast Edge (Replicate join, Order by and Skewed join), 1-1 Edge (Order by, Skewed join and Multiquery off)
- Custom Vertex Manager – Automatic Parallelism Estimation
Benefits
- No framework restrictions like MR
- More efficient processing and algorithms
15. Broadcast Edge and Object Caching
Feature
- Broadcast Edge
Broadcast same data to all tasks in successor vertices
- Object Caching
Ability to cache objects in memory for the scope of a Vertex, DAG or Session
- Input fetch on choice
Usage
- Replicate join small table
- Orderby and Skewed join partitioning samples
Benefits
- Replaces use of the distributed cache and avoids the NodeManager localization bottleneck
- Avoids input fetching when the data is already in cache on container reuse
- Performance gains of up to 3x in tests for replicated join on smaller clusters with higher container reuse
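A minimal sketch of a replicated join, the main consumer of the broadcast edge, with hypothetical relation names; the last (small) input is broadcast to every task:

big = LOAD 'big_table' AS (x, y);
small = LOAD 'small_table' AS (x, z);
-- 'replicated' broadcasts the small table over the Tez broadcast edge
-- instead of the MR distributed cache
J = JOIN big BY x, small BY x USING 'replicated';
STORE J INTO 'out';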
16. Vertex Groups
Feature
- Vertex Grouping
Ability to group multiple vertices into one vertex group and produce a combined output
Usage
- Union operator
Benefits
- Better performance due to elimination of an additional vertex
- Performance gains of 1.2x to 2x over MR
A = LOAD 'a';
B = LOAD 'b';
C = UNION A, B;
D = GROUP C BY $0;
[Diagram] The Load A and Load B vertices form a vertex group whose combined output feeds the GROUP vertex directly.
17. Dynamic Parallelism
Determining parallelism beforehand is hard
Dynamically adjust parallelism at runtime
Tez VertexManagerPlugin
- Custom policy to determine parallelism at runtime
- Library of common policies: ShuffleVertexManager
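For contrast, a sketch of how parallelism is set by hand in Pig Latin (hypothetical names); the VertexManagerPlugin makes such guesses unnecessary by adjusting the count at runtime:

SET default_parallel 100;          -- script-wide reducer count, a guess
f = LOAD 'foo' AS (x, y);
g = GROUP f BY x PARALLEL 40;      -- per-operator override, also a guess
STORE g INTO 'grouped';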
18. Dynamic Parallelism - ShuffleVertexManager
Stock VertexManagerPlugin from Tez
Used by Group by, Hash Join, etc.
Dynamically reduces the parallelism of a vertex based on estimated input size
[Diagram] Load A and Load B feed a JOIN vertex whose parallelism is reduced from 4 to 2 at runtime.
19. Dynamic Parallelism – PartitionerDefinedVertexManager
Custom VertexManagerPlugin used by Order by and Skewed join
Dynamically increases or decreases parallelism based on input size
[Diagram] Load/Filter & Sample → Sample Aggregate → Calculate #Parallelism → Partition → Sort.
24. Performance Numbers – Interactive Query
[Chart] TPC-H Q10: time in seconds vs input size, MR vs Tez. Tez speedup over MR: 2.49x at 10G, 3.41x at 5G, 4.89x at 1G and 6x at 500M.
When the input data is small, latency dominates
Tez significantly reduces latency through session/container reuse
25. Performance Numbers – Iterative Algorithm
Pig can be used to implement iterative algorithms using embedding
Iterative algorithms are ideal for container reuse
Example: k-means algorithm
- Each iteration takes an average of 1.48s after the first iteration (vs 27s for MR)
[Chart] k-means: time in seconds vs number of iterations (10, 50, 100), MR vs Tez, with Tez speedups over MR ranging from 5.37x to 14.84x.
* Source code can be downloaded at http://hortonworks.com/blog/new-apache-pig-features-part-2-embedding
26. Performance is proportional to …
Number of stages in the DAG
- The more stages in the DAG, the better Tez performs relative to MR, due to the elimination of map read stages.
Size of intermediate output
- The larger the intermediate output, the better Tez performs relative to MR, due to reduced HDFS usage.
Cluster/queue capacity
- The more congested a queue is, the better Tez performs relative to MR, due to container reuse.
Size of data in the job
- For smaller data and more stages, Tez performs better relative to MR, since launch overhead is a larger percentage of total time for small jobs.
28. Where are we?
90% feature parity with Pig on MR
- No Local mode (TEZ-235)
- Rarely used operators not implemented
MAPREDUCE (native mapreduce jobs)
Collected CoGroup
98% of ~1300 e2e tests pass.
35% of ~2850 unit tests pass; porting of the rest is pending on Tez Local mode.
Tez branch merged into trunk and will be part of the Pig 0.14 release
Netflix has Lipstick working with Pig on Tez
- Credits: Jacob Perkins, Cheolsoo Park
29. User Impact
Tez
- Zero pain deployment
- Tez library installation on local disk and copy to HDFS
Pig
- No pain migration from Pig on MR to Pig on Tez
Existing scripts work as is without any modification
Only two additional steps to execute in Tez mode
– export TEZ_HOME=/tez-install-location
– pig -x tez myscript.pig
- Users should review/profile and fix custom LoadFunc/StoreFunc/UDFs for static variables and memory leaks due to JVM reuse.
30. What next?
Support for Tez Local mode
All unit tests ported
Improve
- Stability
- Usability
- Debuggability
Apache Release
- Pig 0.14 with Tez released by Sep 2014
Deployment
- On research clusters at Yahoo! by early Q3
- In production at Yahoo! and Netflix by Q3/Q4
Performance
- From 1.2-3x to 1.5x-5x by Q4
31. Tez Features - WIP
Tez UI
- An Application Master UI and job history UI are in the works, integrating via the Application Timeline Server.
- Currently only AM logs are easily viewable; task logs are available, but one has to grep the AM log to find the URL.
Tez Local mode
Tez AM Recovery
- Tez checkpointing and resuming on AM failure is functional but needs more work; with single-DAG execution of the whole script, AM retries can be very costly.
Input fetch optimizations
- Custom ShuffleHandler on NodeManager
- Local input fetch on container reuse
32. What next - Performance?
Shared Edges
- Same output to multiple downstream vertices
Multiple Vertex Caching
Unsorted shuffle for skewed join and order by
Custom edge manager and data routing for skewed join
Groupby and join using hashing and avoid sorting
Better memory management
Dynamic reconfiguration of DAG
- Automatically determine type of join - replicate, skewed or hash join