SlideShare uma empresa Scribd logo
1 de 33
Pig on Tez
Daniel Dai
@daijy
Rohini Palaniswamy
@rohini_pswamy
H a d o o p S u m m i t 2 0 1 4 , S a n J o s e
Agenda
 Team Introduction
 Apache Pig
 Why Pig on Tez?
 Pig on Tez
- Design
- Tez features in Pig
- Performance
- Current status
- Future Plan
2
3
Apache Pig on Tez Team
Daniel Dai
Pig PMC
Hortonworks
Rohini Palaniswamy
Pig PMC
Yahoo!
Olga Natkovich
Pig PMC
Yahoo!
Cheolsoo Park
VP Pig, Pig PMC
Netflix
Mark Wagner
Pig Committer
LinkedIn
Alex Bain
Pig Contributor
LinkedIn
Pig Latin
 Procedural scripting language
 Closer to relational algebra
 Heavily used for ETL
 Schema / No schema data, Pig eats everything
 More than SQL and Feature rich
4
Multiquery Nested Foreach Illustrate
Algebraic and Accumulator java
UDFs
Script Embedding Scalars
Macros
non-java UDFs (jython, python,
javascript, groovy, jruby)
Distributed Orderby, Skewed
Join
Pig users
 Heavily used for ETL at Web Scale by Major Internet Companies
 At Yahoo!
- 60% of total hadoop jobs run daily
- 12 million monthly pig jobs
 Other heavy users
- Twitter
- Netflix
- LinkedIn
- Ebay
- Salesforce
 Standard data science tool, in university textbook
5
Why Pig on Tez?
 DAG execution framework
 Low level DAG framework
- Build DAG by defining vertices and edges
- Customize scheduling of DAG and routing of data
 Highly customizable with pluggable implementations
 Resource efficient
 Performance
- Without having to increase memory
 Natively built on top of YARN
- Multi-tenancy, resource allocation come for free
 Scale
 Security
 Excellent support from Tez community
- Bikas Saha, Siddharth Seth, Hitesh Shah
6
PIG on TEZ
Design
8
Logical Plan
Tez Plan MR Plan
Physical Plan
Tez Execution Engine MR Execution Engine
LogToPhyTranslationVisitor
MRCompilerTezCompiler
DAG Plan – Split Group by + Join
9
f = LOAD ‘foo’ AS (x, y, z);
g1 = GROUP f BY y;
g2 = GROUP f BY z;
j = JOIN g1 BY group,
g2 BY group;
Group by y Group by z
Load foo
Join
Load g1 and Load g2
Group by y Group by z
Load foo
Join
Multiple outputs
Reduce follows
reduce
HDFS HDFS
Split multiplex de-multiplex
DAG Execution - Visualization
10
Vertex 1
(Load)
Vertex 2
(Group)
Vertex 3
(Group)
Vertex 4
(Join)
MROutput
MRInput
DAG Plan – Distributed Orderby
11
Aggregate
Sample
Sort
Partition
A = LOAD ‘foo’ AS (x, y);
B = FILTER A by $0 is not
null;
C = ORDER f BY x;
Stage sample map
on distributed cache
Load/Filter
& Sample
Aggregate
Partition
Sort
Broadcast sample map
HDFS
HDFS
Load/FilterHDFS
HDFS
Map
Reduce
Map
Reduce
Map
1-1 Unsorted
Edge
Cache sample map
Session Reuse
 Feature
- Session reuse
 Submit more than one DAG to same AM
 Usage
- Each Pig script uses a single session
- Grunt shell uses one session for all commands till timeout
- More than one DAG submitted for merge join, ‘exec’
 Benefits
- A pig script with 5 MR jobs has 5 AM containers launched. Single AM for one
pig script in Tez saves capacity.
- Eliminates issue of queue and resource contention faced in MR by every new
MR job in the pipeline of a multi-stage pig script.
12
Container Reuse
 Features
- Container reuse
 Rerun new tasks on already launched containers (jvm)
 Usage
- Turned on by default for all pig scripts and grunt shell
 Benefits
- Reduced launch overhead
 Container request and release overhead
 Resource localization overhead
 JVM launch time overhead
- Reduced network IO
 1-1 edge tasks are launched on same node
- Object caching
 User impact
- Have to review/profile and fix custom LoadFunc/StoreFunc/UDFs for static variables
and memory leaks due to jvm reuse.
13
Custom Vertex Input/Output/Processor/Manager
 Features
- Custom Vertex Processor
- Custom Input and Output between vertices
- Custom Vertex Manager
 Usage
- PigProcessor instead of MapProcessor and ReduceProcessor
- Unsorted input/output
 with Partitioner – Union
 without Partitioner – Broadcast Edge (Replicate join, Orderby and Skewed join), 1-1
Edge (Order by, Skewed join and Multiquery off)
- Custom Vertex Manager – Automatic Parallelism Estimation
 Benefits
- No framework restrictions like MR
- More efficient processing and algorithms
14
Broadcast Edge and Object Caching
 Feature
- Broadcast Edge
 Broadcast same data to all tasks in successor vertices
- Object Caching
 Ability to cache objects in memory for scope of Vertex, DAG and Session
- Input fetch on choice
 Usage
- Replicate join small table
- Orderby and Skewed join partitioning samples
 Benefits
- Replace use of Distributed cache and avoid NodeManager bottleneck of localization
- Avoid input fetching if in cache on container reuse
- Performance gains of upto 3x in tests for replicated join on smaller clusters with
higher container reuse
15
Vertex Groups
 Feature
- Vertex Grouping
 Ability to group multiple vertices into one vertex group and produce a combined output
 Usage
- Union operator
 Benefits
- Better performance due to elimination
of an additional vertex
- Performance gains of 1.2x to 2x over MR
16
A = LOAD ‘a’;
B = LOAD ‘b’;
C = UNION A, B;
D = GROUP C by $0;
Load A Load B
GROUP
Dynamic Parallelism
 Determine parallelism beforehand is hard
 Dynamic adjust parallelism at runtime
 Tez VertexManagerPlugin
- Custom policy to determine parallelism at runtime
- Library of common policy: ShuffleVertexManager
17
Dynamic Parallelism - ShuffleVertexManager
18
Load A
JOIN
Load A
JOIN 4 2
Load B
Load B
 Stock VertexManagerPlugin from Tez
 Used by Group, Hash Join, etc
 Dynamic reduce parallelism of vertex based on estimated input size
Dynamic Parallelism – PartitionerDefinedVertexManager
 Custom VertexManagerPlugin Used by Order by / Skewed Join
 Dynamic increase / decrease parallelism based on input size
19
Load/Filter
& Sample
Sample
Aggregate
Partition
Sort
Calculate
#Parallelism
PERFORMANCE
Performance numbers –
21
0
10
20
30
40
50
60
70
80
Prod script 1
1.5x
1 MR Job
3172 vs 3172 Tasks
Prod script 2
2.1x
12 MR jobs
966 vs 941 Tasks
Prod script 3
1.5x
4 MR jobs on 8.4
TB input
21397 vs 21382
Tasks
Prod script 4
2%
4 MR Jobs on 25.2
TB input
101864 vs 101856
tasks
Timeinmins
MR
Tez
28 vs 18m
11 vs 5m
50 vs 35m
74 vs 72m
Performance numbers –
22
0
20
40
60
80
100
120
140
160
Prod script 1
2.52x
5 MR Jobs
Prod script 2
2.02x
5 MR Jobs
Prod script 3
2.22x
12 MR Jobs
Prod script 4
1.75x
15 MR jobs
Timeinmins
MR
Tez
25 vs 10m
34 vs 16m
2h 22m vs 1h 21m
1h 46m vs 48m
Lipstick from
23
Performance Numbers – Interactive Query
24
0
100
200
300
400
500
600
700
10G 5G 1G 500M
Timeinsecs
Input Size
TPC-H Q10
MR
Tez
2.49X
3.41X
4.89X 6X
 When the input data is small, latency dominates
 Tez significantly reduce latency through session/container reuse
Performance Numbers – Iterative Algorithm
25
 Pig can be used to implement iterative algorithm using embedding
 Iterative algorithm is ideal for container reuse
 Example: k-means Algorithm
- Each iteration takes an average 1.48s after the first iteration (vs 27s for MR)
0
1000
2000
3000
10 50 100
Timeinsecs
Iteration
k-means
MR
Tez
14.84X
13.12X
5.37X
* Source code can be downloaded at http://hortonworks.com/blog/new-apache-pig-features-part-2-embedding
Performance is proportional to …
 Number of stages in the DAG
- Higher the number of stages in the DAG, performance of Tez over MR will be
better due to elimination of map read stages.
 Size of intermediate output
- More the size of intermediate output, the performance of Tez over MR will be
better due to reduced HDFS usage.
 Cluster/queue capacity
- More congested a queue is, the performance of Tez over MR will be better due
to container reuse.
 Size of data in the job
- For smaller data and more stages, the performance of Tez over MR will be
better as percentage of launch overhead in the total time is high for smaller
jobs.
26
CURRENT & FUTURE
Where are we?
 90% feature parity with Pig on MR
- No Local mode (TEZ-235)
- Rarely used operators not implemented
 MAPREDUCE (native mapreduce jobs)
 Collected CoGroup
 98% of ~1300 e2e tests pass.
 35% of ~2850 unit tests pass. Porting of rest pending on Tez Local mode.
 Tez branch merged into trunk and will be part of Pig 0.14 release
 Netflix has Lipstick working with Pig on Tez
- Credits: Jacob Perkins, Cheolsoo Park
28
User Impact
 Tez
- Zero pain deployment
- Tez library installation on local disk and copy to HDFS
 Pig
- No pain migration from Pig on MR to Pig on Tez
 Existing scripts work as is without any modification
 Only two additional steps to execute in Tez mode
– export TEZ_HOME=/tez-install-location
– pig -x tez myscript.pig
- Users to review/profile and fix custom LoadFunc/StoreFunc/UDFs for static
variables and memory leaks due to jvm reuse.
29
What next?
 Support for Tez Local mode
 All unit tests ported
 Improve
- Stability
- Usability
- Debuggability
 Apache Release
- Pig 0.14 with Tez released by Sep 2014
 Deployment
- In research in Yahoo! by early Q3
- In production in Yahoo and Netflix by Q3/Q4
 Performance
- From 1.2-3x to 1.5x-5x by Q4
30
Tez Features - WIP
 Tez UI
- Application Master UI and Job history UI is in the works by integrating via
Application Timeline server.
- Currently only AM logs are easily viewable. Task logs are available but have to
grep the AM log to find the URL.
 Tez Local mode
 Tez AM Recovery
- Tez checkpointing and resuming on AM failure is functional but needs more
work. With single DAG execution of whole script, AM retries can be very costly.
 Input fetch optimizations
- Custom ShuffleHandler on NodeManager
- Local input fetch on container reuse
31
What next - Performance?
 Shared Edges
- Same output to multiple downstream vertices
 Multiple Vertex Caching
 Unsorted shuffle for skewed join and order by
 Custom edge manager and data routing for skewed join
 Groupby and join using hashing and avoid sorting
 Better memory management
 Dynamic reconfiguration of DAG
- Automatically determine type of join - replicate, skewed or hash join
32
We are hiring!!!
Hortonworks
Stop by Kiosk D5
Yahoo!
Stop by Kiosk P9
or reach out to us at
bigdata@yahoo-inc.com.
Thank You

Mais conteúdo relacionado

Mais procurados

Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep divet3rmin4t0r
 
Tez Data Processing over Yarn
Tez Data Processing over YarnTez Data Processing over Yarn
Tez Data Processing over YarnInMobi Technology
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingBikas Saha
 
Tune up Yarn and Hive
Tune up Yarn and HiveTune up Yarn and Hive
Tune up Yarn and Hiverxu
 
Apache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data ProcessingApache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data Processinghitesh1892
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopHortonworks
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkDataWorks Summit
 
Hive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenchesHive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenchesDataWorks Summit
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelt3rmin4t0r
 
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with YarnScale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with YarnDavid Kaiser
 
Apache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingApache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingDataWorks Summit
 
Have your Cake and Eat it Too - Architecture for Batch and Real-time processing
Have your Cake and Eat it Too - Architecture for Batch and Real-time processingHave your Cake and Eat it Too - Architecture for Batch and Real-time processing
Have your Cake and Eat it Too - Architecture for Batch and Real-time processingDataWorks Summit
 
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache HadoopTez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache HadoopDataWorks Summit
 
Performance Hive+Tez 2
Performance Hive+Tez 2Performance Hive+Tez 2
Performance Hive+Tez 2t3rmin4t0r
 
Scaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresScaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresDataWorks Summit
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing DataWorks Summit
 
Troubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingTroubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingGreat Wide Open
 

Mais procurados (20)

Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
 
Tez Data Processing over Yarn
Tez Data Processing over YarnTez Data Processing over Yarn
Tez Data Processing over Yarn
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query Processing
 
Tune up Yarn and Hive
Tune up Yarn and HiveTune up Yarn and Hive
Tune up Yarn and Hive
 
Apache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data ProcessingApache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data Processing
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with Hadoop
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
 
Hive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenchesHive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenches
 
Yahoo's Experience Running Pig on Tez at Scale
Yahoo's Experience Running Pig on Tez at ScaleYahoo's Experience Running Pig on Tez at Scale
Yahoo's Experience Running Pig on Tez at Scale
 
The Future of Apache Storm
The Future of Apache StormThe Future of Apache Storm
The Future of Apache Storm
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthel
 
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with YarnScale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
 
Apache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingApache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data Processing
 
Have your Cake and Eat it Too - Architecture for Batch and Real-time processing
Have your Cake and Eat it Too - Architecture for Batch and Real-time processingHave your Cake and Eat it Too - Architecture for Batch and Real-time processing
Have your Cake and Eat it Too - Architecture for Batch and Real-time processing
 
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache HadoopTez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
 
Performance Hive+Tez 2
Performance Hive+Tez 2Performance Hive+Tez 2
Performance Hive+Tez 2
 
Scaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresScaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value Stores
 
Hadoop scheduler
Hadoop schedulerHadoop scheduler
Hadoop scheduler
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
 
Troubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingTroubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed Debugging
 

Destaque

Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoopmarkgrover
 
Architecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for successArchitecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for successDataWorks Summit
 
Introduction to Pig | Pig Architecture | Pig Fundamentals
Introduction to Pig | Pig Architecture | Pig FundamentalsIntroduction to Pig | Pig Architecture | Pig Fundamentals
Introduction to Pig | Pig Architecture | Pig FundamentalsSkillspeed
 
Introduction to Apache Pig
Introduction to Apache PigIntroduction to Apache Pig
Introduction to Apache PigJason Shao
 
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadooproyans
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop EasyNick Dimiduk
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and PigRicardo Varela
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigMilind Bhandarkar
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopZheng Shao
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start TutorialCarl Steinbach
 
Integration of Hive and HBase
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBaseHortonworks
 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Kevin Weil
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reducerantav
 
Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2Cloudera, Inc.
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 

Destaque (18)

Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoop
 
Architecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for successArchitecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for success
 
06 pig etl features
06 pig etl features06 pig etl features
06 pig etl features
 
Apache Kylinについて #hcj2016
Apache Kylinについて #hcj2016Apache Kylinについて #hcj2016
Apache Kylinについて #hcj2016
 
Introduction to Pig | Pig Architecture | Pig Fundamentals
Introduction to Pig | Pig Architecture | Pig FundamentalsIntroduction to Pig | Pig Architecture | Pig Fundamentals
Introduction to Pig | Pig Architecture | Pig Fundamentals
 
Introduction to Apache Pig
Introduction to Apache PigIntroduction to Apache Pig
Introduction to Apache Pig
 
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop Easy
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pig
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & Pig
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on Hadoop
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start Tutorial
 
Integration of Hive and HBase
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBase
 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
 
Introduction to Pig
Introduction to PigIntroduction to Pig
Introduction to Pig
 
Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 

Semelhante a Pig on Tez - Low Latency ETL with Big Data

[2C1] 아파치 피그를 위한 테즈 연산 엔진 개발하기 최종
[2C1] 아파치 피그를 위한 테즈 연산 엔진 개발하기 최종[2C1] 아파치 피그를 위한 테즈 연산 엔진 개발하기 최종
[2C1] 아파치 피그를 위한 테즈 연산 엔진 개발하기 최종NAVER D2
 
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache TezYahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache TezDataWorks Summit
 
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...Yahoo Developer Network
 
Tajo_Meetup_20141120
Tajo_Meetup_20141120Tajo_Meetup_20141120
Tajo_Meetup_20141120Hyoungjun Kim
 
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014Modern Data Stack France
 
Hadoop - Lessons Learned
Hadoop - Lessons LearnedHadoop - Lessons Learned
Hadoop - Lessons Learnedtcurdt
 
Katello on TorqueBox
Katello on TorqueBoxKatello on TorqueBox
Katello on TorqueBoxlzap
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingTeddy Choi
 
Gruter_TECHDAY_2014_03_ApacheTajo (in Korean)
Gruter_TECHDAY_2014_03_ApacheTajo (in Korean)Gruter_TECHDAY_2014_03_ApacheTajo (in Korean)
Gruter_TECHDAY_2014_03_ApacheTajo (in Korean)Gruter
 
Big Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMRBig Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMRVijay Rayapati
 
20141111 파이썬으로 Hadoop MR프로그래밍
20141111 파이썬으로 Hadoop MR프로그래밍20141111 파이썬으로 Hadoop MR프로그래밍
20141111 파이썬으로 Hadoop MR프로그래밍Tae Young Lee
 
Automate Oracle database patches and upgrades using Fleet Provisioning and Pa...
Automate Oracle database patches and upgrades using Fleet Provisioning and Pa...Automate Oracle database patches and upgrades using Fleet Provisioning and Pa...
Automate Oracle database patches and upgrades using Fleet Provisioning and Pa...Nelson Calero
 
Advanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop ConsultingAdvanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop ConsultingImpetus Technologies
 
NetFlow Data processing using Hadoop and Vertica
NetFlow Data processing using Hadoop and VerticaNetFlow Data processing using Hadoop and Vertica
NetFlow Data processing using Hadoop and VerticaJosef Niedermeier
 
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsApache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsViswanath Gangavaram
 
Fabian Hueske – Cascading on Flink
Fabian Hueske – Cascading on FlinkFabian Hueske – Cascading on Flink
Fabian Hueske – Cascading on FlinkFlink Forward
 
A Java Implementer's Guide to Better Apache Spark Performance
A Java Implementer's Guide to Better Apache Spark PerformanceA Java Implementer's Guide to Better Apache Spark Performance
A Java Implementer's Guide to Better Apache Spark PerformanceTim Ellison
 
Quadrupling your elephants - RDF and the Hadoop ecosystem
Quadrupling your elephants - RDF and the Hadoop ecosystemQuadrupling your elephants - RDF and the Hadoop ecosystem
Quadrupling your elephants - RDF and the Hadoop ecosystemRob Vesse
 
Ceph Day Seoul - AFCeph: SKT Scale Out Storage Ceph
Ceph Day Seoul - AFCeph: SKT Scale Out Storage Ceph Ceph Day Seoul - AFCeph: SKT Scale Out Storage Ceph
Ceph Day Seoul - AFCeph: SKT Scale Out Storage Ceph Ceph Community
 

Semelhante a Pig on Tez - Low Latency ETL with Big Data (20)

[2C1] 아파치 피그를 위한 테즈 연산 엔진 개발하기 최종
[2C1] 아파치 피그를 위한 테즈 연산 엔진 개발하기 최종[2C1] 아파치 피그를 위한 테즈 연산 엔진 개발하기 최종
[2C1] 아파치 피그를 위한 테즈 연산 엔진 개발하기 최종
 
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache TezYahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
 
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
 
Tajo_Meetup_20141120
Tajo_Meetup_20141120Tajo_Meetup_20141120
Tajo_Meetup_20141120
 
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
 
November 2014 HUG: Apache Pig 0.14
November 2014 HUG: Apache Pig 0.14 November 2014 HUG: Apache Pig 0.14
November 2014 HUG: Apache Pig 0.14
 
Hadoop - Lessons Learned
Hadoop - Lessons LearnedHadoop - Lessons Learned
Hadoop - Lessons Learned
 
Katello on TorqueBox
Katello on TorqueBoxKatello on TorqueBox
Katello on TorqueBox
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query Processing
 
Gruter_TECHDAY_2014_03_ApacheTajo (in Korean)
Gruter_TECHDAY_2014_03_ApacheTajo (in Korean)Gruter_TECHDAY_2014_03_ApacheTajo (in Korean)
Gruter_TECHDAY_2014_03_ApacheTajo (in Korean)
 
Big Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMRBig Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMR
 
20141111 파이썬으로 Hadoop MR프로그래밍
20141111 파이썬으로 Hadoop MR프로그래밍20141111 파이썬으로 Hadoop MR프로그래밍
20141111 파이썬으로 Hadoop MR프로그래밍
 
Automate Oracle database patches and upgrades using Fleet Provisioning and Pa...
Automate Oracle database patches and upgrades using Fleet Provisioning and Pa...Automate Oracle database patches and upgrades using Fleet Provisioning and Pa...
Automate Oracle database patches and upgrades using Fleet Provisioning and Pa...
 
Advanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop ConsultingAdvanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop Consulting
 
NetFlow Data processing using Hadoop and Vertica
NetFlow Data processing using Hadoop and VerticaNetFlow Data processing using Hadoop and Vertica
NetFlow Data processing using Hadoop and Vertica
 
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsApache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
 
Fabian Hueske – Cascading on Flink
Fabian Hueske – Cascading on FlinkFabian Hueske – Cascading on Flink
Fabian Hueske – Cascading on Flink
 
A Java Implementer's Guide to Better Apache Spark Performance
A Java Implementer's Guide to Better Apache Spark PerformanceA Java Implementer's Guide to Better Apache Spark Performance
A Java Implementer's Guide to Better Apache Spark Performance
 
Quadrupling your elephants - RDF and the Hadoop ecosystem
Quadrupling your elephants - RDF and the Hadoop ecosystemQuadrupling your elephants - RDF and the Hadoop ecosystem
Quadrupling your elephants - RDF and the Hadoop ecosystem
 
Ceph Day Seoul - AFCeph: SKT Scale Out Storage Ceph
Ceph Day Seoul - AFCeph: SKT Scale Out Storage Ceph Ceph Day Seoul - AFCeph: SKT Scale Out Storage Ceph
Ceph Day Seoul - AFCeph: SKT Scale Out Storage Ceph
 

Mais de DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Mais de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Último

Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 

Último (20)

Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 

Pig on Tez - Low Latency ETL with Big Data

  • 1. Pig on Tez Daniel Dai @daijy Rohini Palaniswamy @rohini_pswamy H a d o o p S u m m i t 2 0 1 4 , S a n J o s e
  • 2. Agenda  Team Introduction  Apache Pig  Why Pig on Tez?  Pig on Tez - Design - Tez features in Pig - Performance - Current status - Future Plan 2
  • 3. 3 Apache Pig on Tez Team Daniel Dai Pig PMC Hortonworks Rohini Palaniswamy Pig PMC Yahoo! Olga Natkovich Pig PMC Yahoo! Cheolsoo Park VP Pig, Pig PMC Netflix Mark Wagner Pig Committer LinkedIn Alex Bain Pig Contributor LinkedIn
  • 4. Pig Latin  Procedural scripting language  Closer to relational algebra  Heavily used for ETL  Schema / No schema data, Pig eats everything  More than SQL and Feature rich 4 Multiquery Nested Foreach Illustrate Algebraic and Accumulator java UDFs Script Embedding Scalars Macros non-java UDFs (jython, python, javascript, groovy, jruby) Distributed Orderby, Skewed Join
  • 5. Pig users  Heavily used for ETL at Web Scale by Major Internet Companies  At Yahoo! - 60% of total hadoop jobs run daily - 12 million monthly pig jobs  Other heavy users - Twitter - Netflix - LinkedIn - Ebay - Salesforce  Standard data science tool, in university textbook 5
  • 6. Why Pig on Tez?  DAG execution framework  Low level DAG framework - Build DAG by defining vertices and edges - Customize scheduling of DAG and routing of data  Highly customizable with pluggable implementations  Resource efficient  Performance - Without having to increase memory  Natively built on top of YARN - Multi-tenancy, resource allocation come for free  Scale  Security  Excellent support from Tez community - Bikas Saha, Siddharth Seth, Hitesh Shah 6
  • 8. Design 8 Logical Plan Tez Plan MR Plan Physical Plan Tez Execution Engine MR Execution Engine LogToPhyTranslationVisitor MRCompilerTezCompiler
  • 9. DAG Plan – Split Group by + Join 9 f = LOAD ‘foo’ AS (x, y, z); g1 = GROUP f BY y; g2 = GROUP f BY z; j = JOIN g1 BY group, g2 BY group; Group by y Group by z Load foo Join Load g1 and Load g2 Group by y Group by z Load foo Join Multiple outputs Reduce follows reduce HDFS HDFS Split multiplex de-multiplex
  • 10. DAG Execution - Visualization 10 Vertex 1 (Load) Vertex 2 (Group) Vertex 3 (Group) Vertex 4 (Join) MROutput MRInput
  • 11. DAG Plan – Distributed Orderby 11 Aggregate Sample Sort Partition A = LOAD ‘foo’ AS (x, y); B = FILTER A by $0 is not null; C = ORDER f BY x; Stage sample map on distributed cache Load/Filter & Sample Aggregate Partition Sort Broadcast sample map HDFS HDFS Load/FilterHDFS HDFS Map Reduce Map Reduce Map 1-1 Unsorted Edge Cache sample map
  • 12. Session Reuse  Feature - Session reuse  Submit more than one DAG to same AM  Usage - Each Pig script uses a single session - Grunt shell uses one session for all commands till timeout - More than one DAG submitted for merge join, ‘exec’  Benefits - A pig script with 5 MR jobs has 5 AM containers launched. Single AM for one pig script in Tez saves capacity. - Eliminates issue of queue and resource contention faced in MR by every new MR job in the pipeline of a multi-stage pig script. 12
  • 13. Container Reuse  Features - Container reuse  Rerun new tasks on already launched containers (jvm)  Usage - Turned on by default for all pig scripts and grunt shell  Benefits - Reduced launch overhead  Container request and release overhead  Resource localization overhead  JVM launch time overhead - Reduced network IO  1-1 edge tasks are launched on same node - Object caching  User impact - Have to review/profile and fix custom LoadFunc/StoreFunc/UDFs for static variables and memory leaks due to jvm reuse. 13
  • 14. Custom Vertex Input/Output/Processor/Manager  Features - Custom Vertex Processor - Custom Input and Output between vertices - Custom Vertex Manager  Usage - PigProcessor instead of MapProcessor and ReduceProcessor - Unsorted input/output  with Partitioner – Union  without Partitioner – Broadcast Edge (Replicate join, Orderby and Skewed join), 1-1 Edge (Order by, Skewed join and Multiquery off) - Custom Vertex Manager – Automatic Parallelism Estimation  Benefits - No framework restrictions like MR - More efficient processing and algorithms 14
  • 15. Broadcast Edge and Object Caching  Feature - Broadcast Edge  Broadcast same data to all tasks in successor vertices - Object Caching  Ability to cache objects in memory for scope of Vertex, DAG and Session - Input fetch on choice  Usage - Replicate join small table - Orderby and Skewed join partitioning samples  Benefits - Replace use of Distributed cache and avoid NodeManager bottleneck of localization - Avoid input fetching if in cache on container reuse - Performance gains of upto 3x in tests for replicated join on smaller clusters with higher container reuse 15
  • 16. Vertex Groups  Feature - Vertex Grouping  Ability to group multiple vertices into one vertex group and produce a combined output  Usage - Union operator  Benefits - Better performance due to elimination of an additional vertex - Performance gains of 1.2x to 2x over MR 16 A = LOAD ‘a’; B = LOAD ‘b’; C = UNION A, B; D = GROUP C by $0; Load A Load B GROUP
  • 17. Dynamic Parallelism  Determine parallelism beforehand is hard  Dynamic adjust parallelism at runtime  Tez VertexManagerPlugin - Custom policy to determine parallelism at runtime - Library of common policy: ShuffleVertexManager 17
  • 18. Dynamic Parallelism - ShuffleVertexManager 18 Load A JOIN Load A JOIN 4 2 Load B Load B  Stock VertexManagerPlugin from Tez  Used by Group, Hash Join, etc  Dynamic reduce parallelism of vertex based on estimated input size
  • 19. Dynamic Parallelism – PartitionerDefinedVertexManager  Custom VertexManagerPlugin Used by Order by / Skewed Join  Dynamic increase / decrease parallelism based on input size 19 Load/Filter & Sample Sample Aggregate Partition Sort Calculate #Parallelism
  • 21. Performance numbers – 21 0 10 20 30 40 50 60 70 80 Prod script 1 1.5x 1 MR Job 3172 vs 3172 Tasks Prod script 2 2.1x 12 MR jobs 966 vs 941 Tasks Prod script 3 1.5x 4 MR jobs on 8.4 TB input 21397 vs 21382 Tasks Prod script 4 2% 4 MR Jobs on 25.2 TB input 101864 vs 101856 tasks Timeinmins MR Tez 28 vs 18m 11 vs 5m 50 vs 35m 74 vs 72m
  • 22. Performance numbers – 22 0 20 40 60 80 100 120 140 160 Prod script 1 2.52x 5 MR Jobs Prod script 2 2.02x 5 MR Jobs Prod script 3 2.22x 12 MR Jobs Prod script 4 1.75x 15 MR jobs Timeinmins MR Tez 25 vs 10m 34 vs 16m 2h 22m vs 1h 21m 1h 46m vs 48m
  • 24. Performance Numbers – Interactive Query 24 0 100 200 300 400 500 600 700 10G 5G 1G 500M Timeinsecs Input Size TPC-H Q10 MR Tez 2.49X 3.41X 4.89X 6X  When the input data is small, latency dominates  Tez significantly reduce latency through session/container reuse
  • 25. Performance Numbers – Iterative Algorithm 25  Pig can be used to implement iterative algorithm using embedding  Iterative algorithm is ideal for container reuse  Example: k-means Algorithm - Each iteration takes an average 1.48s after the first iteration (vs 27s for MR) 0 1000 2000 3000 10 50 100 Timeinsecs Iteration k-means MR Tez 14.84X 13.12X 5.37X * Source code can be downloaded at http://hortonworks.com/blog/new-apache-pig-features-part-2-embedding
  • 26. Performance is proportional to …  Number of stages in the DAG - Higher the number of stages in the DAG, performance of Tez over MR will be better due to elimination of map read stages.  Size of intermediate output - More the size of intermediate output, the performance of Tez over MR will be better due to reduced HDFS usage.  Cluster/queue capacity - More congested a queue is, the performance of Tez over MR will be better due to container reuse.  Size of data in the job - For smaller data and more stages, the performance of Tez over MR will be better as percentage of launch overhead in the total time is high for smaller jobs. 26
  • 28. Where are we?  90% feature parity with Pig on MR - No Local mode (TEZ-235) - Rarely used operators not implemented  MAPREDUCE (native mapreduce jobs)  Collected CoGroup  98% of ~1300 e2e tests pass.  35% of ~2850 unit tests pass. Porting of rest pending on Tez Local mode.  Tez branch merged into trunk and will be part of Pig 0.14 release  Netflix has Lipstick working with Pig on Tez - Credits: Jacob Perkins, Cheolsoo Park 28
  • 29. User Impact  Tez - Zero pain deployment - Tez library installation on local disk and copy to HDFS  Pig - No pain migration from Pig on MR to Pig on Tez  Existing scripts work as is without any modification  Only two additional steps to execute in Tez mode – export TEZ_HOME=/tez-install-location – pig -x tez myscript.pig - Users to review/profile and fix custom LoadFunc/StoreFunc/UDFs for static variables and memory leaks due to jvm reuse. 29
  • 30. What next?  Support for Tez Local mode  All unit tests ported  Improve - Stability - Usability - Debuggability  Apache Release - Pig 0.14 with Tez released by Sep 2014  Deployment - In research in Yahoo! by early Q3 - In production in Yahoo and Netflix by Q3/Q4  Performance - From 1.2-3x to 1.5x-5x by Q4 30
  • 31. Tez Features - WIP  Tez UI - Application Master UI and Job history UI is in the works by integrating via Application Timeline server. - Currently only AM logs are easily viewable. Task logs are available but have to grep the AM log to find the URL.  Tez Local mode  Tez AM Recovery - Tez checkpointing and resuming on AM failure is functional but needs more work. With single DAG execution of whole script, AM retries can be very costly.  Input fetch optimizations - Custom ShuffleHandler on NodeManager - Local input fetch on container reuse 31
  • 32. What next - Performance?  Shared Edges - Same output to multiple downstream vertices  Multiple Vertex Caching  Unsorted shuffle for skewed join and order by  Custom edge manager and data routing for skewed join  Groupby and join using hashing and avoid sorting  Better memory management  Dynamic reconfiguration of DAG - Automatically determine type of join - replicate, skewed or hash join 32
  • 33. We are hiring!!! Hortonworks Stop by Kiosk D5 Yahoo! Stop by Kiosk P9 or reach out to us at bigdata@yahoo-inc.com. Thank You