Making Pig Fly 
Optimizing Data Processing on Hadoop 
Daniel Dai (@daijy) 
Thejas Nair (@thejasn) 
© Hortonworks Inc. 2011 
Page 1
What is Apache Pig? 
Pig Latin, a high level data processing language. 
An engine that executes Pig Latin locally or on a Hadoop cluster. 
Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/
Pig-latin example 
• Query: Get the list of web pages visited by users whose age is between 20 and 29 years. 
USERS = load 'users' as (uid, age); 
USERS_20s = filter USERS by age >= 20 and age <= 29; 
PVs = load 'pages' as (url, uid, timestamp); 
PVs_u20s = join USERS_20s by uid, PVs by uid;
Why pig? 
• Faster development 
– Fewer lines of code 
– Don't re-invent the wheel 
• Flexible 
– Metadata is optional 
– Extensible 
– Procedural programming 
Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/
Pig optimizations 
• Ideally the user should not have to bother 
• Reality 
– Pig is still young and immature 
– Pig does not have the whole picture 
– Cluster configuration 
– Data histogram 
– Pig philosophy: Pig is docile
Pig optimizations 
• What pig does for you 
– Does safe transformations of the query to optimize 
– Optimized operations (join, sort) 
• What you do 
– Organize input in an optimal way 
– Optimize the pig-latin query 
– Tell pig what join/group algorithm to use
Rule based optimizer 
• Column pruner 
• Push up filter 
• Push down flatten 
• Push up limit 
• Partition pruning 
• Global optimizer 
Column Pruner 
• Pig will do column pruning automatically 
A = load 'input' as (a0, a1, a2); 
B = foreach A generate a0+a1; 
C = order B by $0; 
Store C into 'output'; 
Pig will prune a2 automatically. 
• Cases Pig will not do column pruning automatically 
– No schema specified in the load statement 
A = load 'input'; 
B = order A by $0; 
C = foreach B generate $0+$1; 
Store C into 'output'; 
DIY: 
A = load 'input'; 
A1 = foreach A generate $0, $1; 
B = order A1 by $0; 
C = foreach B generate $0+$1; 
Store C into 'output';
Column Pruner 
• Another case Pig does not do column pruning 
– Pig does not keep track of unused columns after grouping 
A = load 'input' as (a0, a1, a2); 
B = group A by a0; 
C = foreach B generate SUM(A.a1); 
Store C into 'output'; 
DIY: 
A = load 'input' as (a0, a1, a2); 
A1 = foreach A generate $0, $1; 
B = group A1 by a0; 
C = foreach B generate SUM(A1.a1); 
Store C into 'output';
Push up filter 
• Pig splits the filter condition before pushing it up 
– Original query: A, B → Join → Filter (a0>0 && b0>10) 
– Split filter condition: A, B → Join → Filter (a0>0) → Filter (b0>10) 
– Push up filter: A → Filter (a0>0), B → Filter (b0>10) → Join
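The split step can be illustrated with a small sketch. This is not Pig source code; it is a hypothetical Python model of how a conjunctive filter is divided by which join input each conjunct references:

```python
# Illustrative sketch (not the Pig optimizer): split a conjunctive filter so
# each conjunct can be pushed above the join input whose columns it uses,
# as with a0>0 && b0>10 in the example above.

def split_filter(conjuncts, columns_of_input):
    """Group filter conjuncts by the single input whose columns they use.

    conjuncts: list of (condition_string, set_of_columns_referenced)
    columns_of_input: dict input_name -> set of its columns
    Returns (pushable, remaining): per-input pushable conditions, plus
    conjuncts touching more than one input, which must stay above the join.
    """
    pushable = {name: [] for name in columns_of_input}
    remaining = []
    for cond, cols in conjuncts:
        owners = [name for name, owned in columns_of_input.items()
                  if cols <= owned]
        if len(owners) == 1:
            pushable[owners[0]].append(cond)   # push above this one input
        else:
            remaining.append(cond)             # references both inputs
    return pushable, remaining

pushable, remaining = split_filter(
    [("a0 > 0", {"a0"}), ("b0 > 10", {"b0"})],
    {"A": {"a0", "a1"}, "B": {"b0", "b1"}},
)
```

A condition such as `a0 > b0` would land in `remaining`, since it needs columns from both sides and cannot be pushed above either input.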
Other push up/down 
• Push down flatten 
– Load → Flatten → Order becomes Load → Order → Flatten 
A = load 'input' as (a0:bag, a1); 
B = foreach A generate flatten(a0), a1; 
C = order B by a1; 
Store C into 'output'; 
• Push up limit 
– Load → Foreach → Limit becomes Load (limited) → Foreach 
– Load → Order → Limit becomes Load → Order (limited)
Partition pruning 
• Prune unnecessary partitions entirely 
– HCatLoader 
– Without pruning: HCatLoader reads partitions 2010, 2011, 2012, then Filter (year>=2011) 
– With pruning: HCatLoader (year>=2011) reads only partitions 2011 and 2012
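The effect of pushing the filter into the loader can be sketched as follows. This is an illustrative Python model, not HCatLoader internals, and the partition paths are hypothetical:

```python
# Illustrative sketch: once the filter is pushed into the loader, whole
# partitions are skipped instead of being read and filtered afterwards.

partitions = {2010: "/data/year=2010",
              2011: "/data/year=2011",
              2012: "/data/year=2012"}

def prune_partitions(partitions, predicate):
    """Keep only partitions whose key satisfies the pushed-down filter."""
    return {key: path for key, path in partitions.items() if predicate(key)}

# With year >= 2011 pushed down, the 2010 partition is never read:
to_read = prune_partitions(partitions, lambda year: year >= 2011)
```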
Intermediate file compression 
A Pig script compiles into a chain of jobs: 
map 1 → reduce 1 → Pig temp file → map 2 → reduce 2 → Pig temp file → map 3 → reduce 3 
• Intermediate file between map and reduce 
– Snappy 
• Temp file between mapreduce jobs 
– No compression by default
Enable temp file compression 
• Pig temp files are not compressed by default 
– Issues with snappy (HADOOP-7990) 
– LZO: not Apache license 
• Enable LZO compression 
– Install LZO for Hadoop 
– In conf/pig.properties: 
pig.tmpfilecompression = true 
pig.tmpfilecompression.codec = lzo 
– With LZO: over 90% disk saving and up to 4x query speedup
Multiquery 
• Combine two or more map/reduce jobs into one 
– Happens automatically 
– Cases where we want to control multiquery: combining too many 
– Example: one Load feeding three branches (Group by $0 / $1 / $2, each followed by Foreach and Store) runs as a single job
Control multiquery 
• Disable multiquery 
– Command line option: -M 
• Use "exec" to mark the boundary 
A = load 'input'; 
B0 = group A by $0; 
C0 = foreach B0 generate group, COUNT(A); 
Store C0 into 'output0'; 
B1 = group A by $1; 
C1 = foreach B1 generate group, COUNT(A); 
Store C1 into 'output1'; 
exec 
B2 = group A by $2; 
C2 = foreach B2 generate group, COUNT(A); 
Store C2 into 'output2';
Implement the right UDF 
• Algebraic UDF 
– Initial: runs in the map 
– Intermediate: runs in the combiner 
– Final: runs in the reduce 
A = load 'input'; 
B0 = group A by $0; 
C0 = foreach B0 generate group, SUM(A); 
Store C0 into 'output0';
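The three phases can be sketched outside the Java API. This is an illustrative Python model of an algebraic SUM, not Pig's Algebraic interface, showing why the combiner can merge partial results:

```python
# Illustrative sketch: an algebraic SUM evaluated in three phases, so the
# combiner can pre-aggregate map output before it is shuffled.

def initial(values):          # map side: one bag chunk -> partial sum
    return sum(values)

def intermediate(partials):   # combiner: merge partial sums
    return sum(partials)

def final(partials):          # reduce side: merge into the final answer
    return sum(partials)

# Two map tasks, each combined locally, then merged in the reduce:
m1 = intermediate([initial([1, 2]), initial([3])])
m2 = intermediate([initial([4, 5])])
total = final([m1, m2])  # 15
```

Because each phase only merges partial sums, the shuffle carries one number per key per map task instead of the whole bag.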
Implement the right UDF 
• Accumulator UDF 
– Reduce side UDF 
– Normally takes a bag 
• Benefit 
– Big bags are passed in batches 
– Avoids using too much memory 
– Batch size: pig.accumulative.batchsize=20000 
A = load 'input'; 
B0 = group A by $0; 
C0 = foreach B0 generate group, my_accum(A); 
Store C0 into 'output0'; 

my_accum implements Accumulator { 
  public void accumulate(Tuple bag) { 
    // take one chunk of the bag 
  } 
  public Object getValue() { 
    // called after all bag chunks are processed 
  } 
}
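The batching behaviour can be modelled in a few lines. This is an illustrative Python sketch, not the Pig Accumulator Java interface; `BATCH_SIZE` stands in for pig.accumulative.batchsize:

```python
# Illustrative sketch: an accumulator-style UDF receives a large bag in
# fixed-size batches instead of one giant in-memory bag.

BATCH_SIZE = 3  # stands in for pig.accumulative.batchsize

class CountAccumulator:
    def __init__(self):
        self.count = 0

    def accumulate(self, bag_chunk):      # called once per batch
        self.count += len(bag_chunk)

    def get_value(self):                  # called after all batches
        return self.count

def run(bag, acc, batch_size=BATCH_SIZE):
    """Feed the bag to the accumulator one batch at a time."""
    for i in range(0, len(bag), batch_size):
        acc.accumulate(bag[i:i + batch_size])  # only one batch held here
    return acc.get_value()

n = run(list(range(10)), CountAccumulator())  # 10
```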
Memory optimization 
• Control bag size on the reduce side 
– MapReduce gives the reducer an iterator: reduce(Text key, Iterator<Writable> values, …); Pig materializes it into one bag per input 
– If the bag size exceeds a threshold, spill to disk 
– Control the bag size to fit the bag in memory if possible: 
pig.cachedbag.memusage=0.2
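The spill-to-disk behaviour can be sketched as follows. This is an illustrative Python model, not Pig's actual spillable bag; for simplicity the threshold here is an item count rather than the memory fraction set by pig.cachedbag.memusage:

```python
# Illustrative sketch: a bag that spills to disk once it grows past a
# threshold, bounding reduce-side memory use.

import pickle
import tempfile

class SpillableBag:
    def __init__(self, max_in_memory=1000):
        self.max_in_memory = max_in_memory
        self.in_memory = []
        self.spill_file = None

    def add(self, item):
        self.in_memory.append(item)
        if len(self.in_memory) > self.max_in_memory:
            self._spill()

    def _spill(self):
        # append the current in-memory chunk to the spill file
        if self.spill_file is None:
            self.spill_file = tempfile.TemporaryFile()
        pickle.dump(self.in_memory, self.spill_file)
        self.in_memory = []

    def __iter__(self):
        # replay spilled chunks first, then whatever is still in memory
        if self.spill_file is not None:
            self.spill_file.seek(0)
            while True:
                try:
                    yield from pickle.load(self.spill_file)
                except EOFError:
                    break
        yield from self.in_memory
```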
Optimization starts before pig 
• Input format 
• Serialization format 
• Compression 
Input format - Test Query 
> searches = load 'aol_search_logs.txt' using PigStorage() as (ID, Query, …); 
> search_thejas = filter searches by Query matches '.*thejas.*'; 
> dump search_thejas; 
(1568578, thejasminesupperclub, ….)
Input formats 
[Bar chart: runtime in seconds (0–140) for the test query under different input formats]
Columnar format 
• RCFile 
• Columnar format for a group of rows 
• More efficient if you query a subset of columns
Tests with RCFile 
• Tests with load + project + filter out all records 
• Using HCatalog, with compression and types 
• Test 1: project 1 out of 5 columns 
• Test 2: project all 5 columns
RCFile test results 
[Bar chart: runtime in seconds (0–140) for Plain Text vs RCFile, projecting 1 column vs all 5 columns]
Cost based optimizations 
• Optimization decisions based on your query/data 
• Often an iterative process: run query → measure → tune
Cost based optimization - Aggregation 
• Hash Based Agg (HBA) 
– Runs inside the map task, between the map logic and the map output, ahead of the reduce task 
• Use pig.exec.mapPartAgg=true to enable
Cost based optimization – Hash Agg. 
• Auto off feature 
– Switches off HBA if the output reduction is not good enough 
• Configuring Hash Agg 
– Configure the auto off feature: pig.exec.mapPartAgg.minReduction 
– Configure memory used: pig.cachedbag.memusage
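The auto-off heuristic amounts to aggregating a sample and comparing input rows to output groups. An illustrative Python sketch, not Pig's implementation; `MIN_REDUCTION` stands in for pig.exec.mapPartAgg.minReduction:

```python
# Illustrative sketch: keep map-side hash aggregation only if it shrinks
# the sampled output by at least the configured factor.

from collections import defaultdict

MIN_REDUCTION = 10  # stands in for pig.exec.mapPartAgg.minReduction

def hash_aggregate(records):
    """Partial map-side aggregation of (key, value) pairs into sums."""
    table = defaultdict(int)
    for key, value in records:
        table[key] += value
    return dict(table)

def worth_aggregating(sample):
    """True if hash agg reduces the sampled rows by the minimum factor."""
    aggregated = hash_aggregate(sample)
    return len(sample) >= MIN_REDUCTION * len(aggregated)

# Many rows per key -> large reduction -> hash aggregation stays on:
keep_on = worth_aggregating([("a", 1)] * 50 + [("b", 1)] * 50)
```

With mostly-unique keys the table grows without shrinking the output, so the feature switches itself off and avoids wasting map memory.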
Cost based optimization - Join 
• Use the appropriate join algorithm 
• Skew on join key – skew join 
• One input fits in memory – FR (fragment-replicate) join
Cost based optimization – MR tuning 
• Tune MR parameters to reduce IO 
• Control spills using map sort params 
• Reduce shuffle/sort-merge params
Parallelism of reduce tasks 
[Line chart: runtime (0:14:24 to 0:25:55) vs. number of reduce tasks (4, 6, 8, 24, 48, 256)] 
• Number of reduce slots = 6 
• Factors affecting runtime 
– Cores simultaneously used/skew 
– Cost of having additional reduce tasks
Cost based optimization – keep data sorted 
• Frequent join operations on the same keys 
• Keep data sorted on the keys 
• Use merge join 
• Optimized group on sorted keys 
• Works with few load functions – needs an additional interface implementation
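Merge join works by advancing a cursor on each sorted input, which is why no shuffle or sort phase is needed. An illustrative Python sketch, not Pig's implementation:

```python
# Illustrative sketch: join two inputs already sorted on the join key by
# walking both with cursors, so no shuffle or sort is required.

def merge_join(left, right):
    """Join two (key, value) lists sorted by key; returns (key, lval, rval)."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # emit every right-side match for this key (handles duplicates)
            j0 = j
            while j0 < len(right) and right[j0][0] == lk:
                out.append((lk, left[i][1], right[j0][1]))
                j0 += 1
            i += 1
    return out

rows = merge_join([(1, "a"), (2, "b"), (4, "d")],
                  [(2, "x"), (3, "y"), (4, "z")])  # [(2,'b','x'), (4,'d','z')]
```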
Optimizations for sorted data 
[Bar chart: runtime in seconds (0–90) for sort+sort+join+join vs join+join, stacked as Sort1, Sort2, Join 1, Join 2]
Future Directions 
• Optimize using stats 
• Using historical stats with HCatalog 
• Sampling
Questions 
?

Mais conteúdo relacionado

Mais procurados

Adding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive
Adding ACID Transactions, Inserts, Updates, and Deletes in Apache HiveAdding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive
Adding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive
DataWorks Summit
 
Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2
Wes Floyd
 
Adding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive
Adding ACID Transactions, Inserts, Updates, and Deletes in Apache HiveAdding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive
Adding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive
DataWorks Summit
 

Mais procurados (20)

TEZ-8 UI Walkthrough
TEZ-8 UI WalkthroughTEZ-8 UI Walkthrough
TEZ-8 UI Walkthrough
 
Architecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for successArchitecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for success
 
HiveACIDPublic
HiveACIDPublicHiveACIDPublic
HiveACIDPublic
 
Spectra Logic's BlackPearl Developers Summit 2016
Spectra Logic's BlackPearl Developers Summit 2016Spectra Logic's BlackPearl Developers Summit 2016
Spectra Logic's BlackPearl Developers Summit 2016
 
Hive: Loading Data
Hive: Loading DataHive: Loading Data
Hive: Loading Data
 
Adding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive
Adding ACID Transactions, Inserts, Updates, and Deletes in Apache HiveAdding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive
Adding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive
 
Hive join optimizations
Hive join optimizationsHive join optimizations
Hive join optimizations
 
Llap: Locality is Dead
Llap: Locality is DeadLlap: Locality is Dead
Llap: Locality is Dead
 
Data Warehousing with Amazon Redshift: Data Analytics Week SF
Data Warehousing with Amazon Redshift: Data Analytics Week SFData Warehousing with Amazon Redshift: Data Analytics Week SF
Data Warehousing with Amazon Redshift: Data Analytics Week SF
 
Hive Does ACID
Hive Does ACIDHive Does ACID
Hive Does ACID
 
Data Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftData Warehousing with Amazon Redshift
Data Warehousing with Amazon Redshift
 
Cost-based query optimization in Apache Hive
Cost-based query optimization in Apache HiveCost-based query optimization in Apache Hive
Cost-based query optimization in Apache Hive
 
Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2
 
Query optimization techniques in Apache Hive
Query optimization techniques in Apache Hive Query optimization techniques in Apache Hive
Query optimization techniques in Apache Hive
 
Apache Hive ACID Project
Apache Hive ACID ProjectApache Hive ACID Project
Apache Hive ACID Project
 
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the CloudSpeed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
 
Quick Introduction to Apache Tez
Quick Introduction to Apache TezQuick Introduction to Apache Tez
Quick Introduction to Apache Tez
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
17th Athens Big Data Meetup - 1st Talk - Speedup Machine Application Learning...
17th Athens Big Data Meetup - 1st Talk - Speedup Machine Application Learning...17th Athens Big Data Meetup - 1st Talk - Speedup Machine Application Learning...
17th Athens Big Data Meetup - 1st Talk - Speedup Machine Application Learning...
 
Adding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive
Adding ACID Transactions, Inserts, Updates, and Deletes in Apache HiveAdding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive
Adding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive
 

Destaque

A Mobile-First, Cloud-First Stack at Pearson
A Mobile-First, Cloud-First Stack at PearsonA Mobile-First, Cloud-First Stack at Pearson
A Mobile-First, Cloud-First Stack at Pearson
MongoDB
 
Hardware Provisioning for MongoDB
Hardware Provisioning for MongoDBHardware Provisioning for MongoDB
Hardware Provisioning for MongoDB
MongoDB
 

Destaque (8)

MongoDB in the Middle of a Hybrid Cloud and Polyglot Persistence Architecture
MongoDB in the Middle of a Hybrid Cloud and Polyglot Persistence ArchitectureMongoDB in the Middle of a Hybrid Cloud and Polyglot Persistence Architecture
MongoDB in the Middle of a Hybrid Cloud and Polyglot Persistence Architecture
 
A Mobile-First, Cloud-First Stack at Pearson
A Mobile-First, Cloud-First Stack at PearsonA Mobile-First, Cloud-First Stack at Pearson
A Mobile-First, Cloud-First Stack at Pearson
 
Hardware Provisioning for MongoDB
Hardware Provisioning for MongoDBHardware Provisioning for MongoDB
Hardware Provisioning for MongoDB
 
Big Data Paris - A Modern Enterprise Architecture
Big Data Paris - A Modern Enterprise ArchitectureBig Data Paris - A Modern Enterprise Architecture
Big Data Paris - A Modern Enterprise Architecture
 
An Enterprise Architect's View of MongoDB
An Enterprise Architect's View of MongoDBAn Enterprise Architect's View of MongoDB
An Enterprise Architect's View of MongoDB
 
Cassandra Community Webinar: From Mongo to Cassandra, Architectural Lessons
Cassandra Community Webinar: From Mongo to Cassandra, Architectural LessonsCassandra Community Webinar: From Mongo to Cassandra, Architectural Lessons
Cassandra Community Webinar: From Mongo to Cassandra, Architectural Lessons
 
Algorithm Analyzing
Algorithm AnalyzingAlgorithm Analyzing
Algorithm Analyzing
 
Webinar: When to Use MongoDB
Webinar: When to Use MongoDBWebinar: When to Use MongoDB
Webinar: When to Use MongoDB
 

Semelhante a Making pig fly optimizing data processing on hadoop presentation

Semelhante a Making pig fly optimizing data processing on hadoop presentation (20)

Stinger Initiative - Deep Dive
Stinger Initiative - Deep DiveStinger Initiative - Deep Dive
Stinger Initiative - Deep Dive
 
Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3
 
Bring Cartography to the Cloud
Bring Cartography to the CloudBring Cartography to the Cloud
Bring Cartography to the Cloud
 
Mutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable WorldMutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable World
 
Mutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable WorldMutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable World
 
Pig programming is more fun: New features in Pig
Pig programming is more fun: New features in PigPig programming is more fun: New features in Pig
Pig programming is more fun: New features in Pig
 
Introduction to Presto at Treasure Data
Introduction to Presto at Treasure DataIntroduction to Presto at Treasure Data
Introduction to Presto at Treasure Data
 
Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?
 
Machine Learning for Smarter Apps - Jacksonville Meetup
Machine Learning for Smarter Apps - Jacksonville MeetupMachine Learning for Smarter Apps - Jacksonville Meetup
Machine Learning for Smarter Apps - Jacksonville Meetup
 
Apache Flink and Apache Hudi.pdf
Apache Flink and Apache Hudi.pdfApache Flink and Apache Hudi.pdf
Apache Flink and Apache Hudi.pdf
 
The All-In-One Package for Massively Multicore, Heterogeneous Jobs with Hotsp...
The All-In-One Package for Massively Multicore, Heterogeneous Jobs with Hotsp...The All-In-One Package for Massively Multicore, Heterogeneous Jobs with Hotsp...
The All-In-One Package for Massively Multicore, Heterogeneous Jobs with Hotsp...
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
 
Smart Enterprise Big Data Bus for the Modern Responsive Enterprise- StreamAna...
Smart Enterprise Big Data Bus for the Modern Responsive Enterprise- StreamAna...Smart Enterprise Big Data Bus for the Modern Responsive Enterprise- StreamAna...
Smart Enterprise Big Data Bus for the Modern Responsive Enterprise- StreamAna...
 
Experimentation Platform on Hadoop
Experimentation Platform on HadoopExperimentation Platform on Hadoop
Experimentation Platform on Hadoop
 
eBay Experimentation Platform on Hadoop
eBay Experimentation Platform on HadoopeBay Experimentation Platform on Hadoop
eBay Experimentation Platform on Hadoop
 
Pig programming is fun
Pig programming is funPig programming is fun
Pig programming is fun
 
Preventative Maintenance of Robots in Automotive Industry
Preventative Maintenance of Robots in Automotive IndustryPreventative Maintenance of Robots in Automotive Industry
Preventative Maintenance of Robots in Automotive Industry
 
Hadoop Summit Europe 2015 - YARN Present and Future
Hadoop Summit Europe 2015 - YARN Present and FutureHadoop Summit Europe 2015 - YARN Present and Future
Hadoop Summit Europe 2015 - YARN Present and Future
 
Apache Hadoop YARN 2015: Present and Future
Apache Hadoop YARN 2015: Present and FutureApache Hadoop YARN 2015: Present and Future
Apache Hadoop YARN 2015: Present and Future
 
Demystifying web performance tooling and metrics
Demystifying web performance tooling and metricsDemystifying web performance tooling and metrics
Demystifying web performance tooling and metrics
 

Último

Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Medical / Health Care (+971588192166) Mifepristone and Misoprostol tablets 200mg
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
masabamasaba
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
masabamasaba
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
masabamasaba
 
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
chiefasafspells
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
masabamasaba
 

Último (20)

OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 
WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...
WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...
WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
WSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open Source
WSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open SourceWSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open Source
WSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open Source
 
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
 
%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benoni%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benoni
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
 
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
 
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 

Making pig fly optimizing data processing on hadoop presentation

  • 1. Making Pig Fly Optimizing Data Processing on Hadoop Daniel Dai (@daijy) Thejas Nair (@thejasn) © Hortonworks Inc. 2011 Page 1
  • 2. What is Apache Pig? Pig Latin, a high level data processing language. © Hortonworks Inc. 2011 Page 2 Architecting the Future of Big Data An engine that executes Pig Latin locally or on a Hadoop cluster. Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/
  • 3. Pig-latin example • Query : Get the list of web pages visited by users whose age is between 20 and 29 years. USERS = load ‘users’ as (uid, age); USERS_20s = filter USERS by age >= 20 and age <= 29; PVs = load ‘pages’ as (url, uid, timestamp); PVs_u20s = join USERS_20s by uid, PVs by uid; © Hortonworks Inc. 2011 Page 3 Architecting the Future of Big Data
  • 4. Why pig ? •Faster development – Fewer lines of code – Don’t re-invent the wheel • Flexible – Metadata is optional – Extensible – Procedural programming Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/ © Hortonworks Inc. 2011 Page 4 Architecting the Future of Big Data
  • 5. Pig optimizations • Ideally user should not have to bother • Reality – Pig is still young and immature – Pig does not have the whole picture –Cluster configuration –Data histogram – Pig philosophy: Pig is docile © Hortonworks Inc. 2011 Page 5 Architecting the Future of Big Data
  • 6. Pig optimizations • What pig does for you – Do safe transformations of query to optimize – Optimized operations (join, sort) • What you do – Organize input in optimal way – Optimize pig-latin query – Tell pig what join/group algorithm to use © Hortonworks Inc. 2011 Page 6 Architecting the Future of Big Data
  • 7. Rule based optimizer • Column pruner • Push up filter • Push down flatten • Push up limit • Partition pruning • Global optimizer © Hortonworks Inc. 2011 Page 7 Architecting the Future of Big Data
  • 8. Column Pruner • Pig will do column pruning automatically A = load ‘input’ as (a0, a1, a2); B = foreach A generate a0+a1; C = order B by $0; Store C into ‘output’; • Cases Pig will not do column pruning automatically – No schema specified in load statement © Hortonworks Inc. 2011 Page 8 Architecting the Future of Big Data Pig will prune a2 automatically A = load ‘input’; B = order A by $0; C = foreach B generate $0+$1; Store C into ‘output’; DIY A = load ‘input’; A1 = foreach A generate $0, $1; B = order A1 by $0; C = foreach B generate $0+$1; Store C into ‘output’;
  • 9. Column Pruner • Another case Pig does not do column pruning – Pig does not keep track of unused column after grouping A = load ‘input’ as (a0, a1, a2); B = group A by a0; C = foreach B generate SUM(A.a1); Store C into ‘output’; © Hortonworks Inc. 2011 Page 9 Architecting the Future of Big Data DIY A = load ‘input’ as (a0, a1, a2); A1 = foreach A generate $0, $1; B = group A1 by a0; C = foreach B generate SUM(A.a1); Store C into ‘output’;
  • 10. Push up filter • Pig split the filter condition before push B © Hortonworks Inc. 2011 Page 10 Architecting the Future of Big Data A Join a0>0 && b0>10 Filter A Join a0>0 B Filter b0>10 Original query Split filter condition A Join a0>0 B Filter b0>10 Push up filter
  • 11. Other push up/down • Push down flatten • Push up limit Limit © Hortonworks Inc. 2011 Page 11 Architecting the Future of Big Data Load Flatten Order Load Order Flatten A = load ‘input’ as (a0:bag, a1); B = foreach A generate flattten(a0), a1; C = order B by a1; Store C into ‘output’; Load Foreach Limit Load Foreach Load (limited) Foreach Load Order Limit Load Order (limited)
  • 12. Partition pruning • Prune unnecessary partitions entirely – HCatLoader 2010 2011 2012 © Hortonworks Inc. 2011 Page 12 HCatLoader Architecting the Future of Big Data Filter (year>=2011) 2010 2011 2012 HCatLoader (year>=2011)
  • 13. Intermediate file compression Pig Script © Hortonworks Inc. 2011 Page 13 Architecting the Future of Big Data map 1 reduce 1 Pig temp file map 2 reduce 2 Pig temp file map 3 reduce 3 •Intermediate file between map and reduce – Snappy •Temp file between mapreduce jobs – No compression by default
  • 14. Enable temp file compression
  • Pig temp files are not compressed by default
  – Issues with snappy (HADOOP-7990)
  – LZO: not Apache license
  • Enable LZO compression
  – Install LZO for Hadoop
  – In conf/pig.properties:
  pig.tmpfilecompression = true
  pig.tmpfilecompression.codec = lzo
  – With LZO, up to 90%+ disk saving and 4x query speedup
  • 15. Multiquery
  • Combine two or more map/reduce jobs into one
  – Happens automatically
  – Cases we want to control multiquery: combining too many jobs
  – Example plan: one Load feeding Group by $0, Group by $1, and Group by $2, each followed by Foreach and Store
  • 16. Control multiquery
  • Disable multiquery
  – Command line option: -M
  • Using "exec" to mark the boundary
  A = load 'input';
  B0 = group A by $0;
  C0 = foreach B0 generate group, COUNT(A);
  Store C0 into 'output0';
  B1 = group A by $1;
  C1 = foreach B1 generate group, COUNT(A);
  Store C1 into 'output1';
  exec
  B2 = group A by $2;
  C2 = foreach B2 generate group, COUNT(A);
  Store C2 into 'output2';
  • 17. Implement the right UDF
  • Algebraic UDF
  – Initial: runs in the map
  – Intermediate: runs in the combiner
  – Final: runs in the reduce
  A = load 'input';
  B0 = group A by $0;
  C0 = foreach B0 generate group, SUM(A);
  Store C0 into 'output0';
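Pig's Algebraic UDFs are Java classes, but the three-phase decomposition is easy to show in a language-neutral way. A minimal Python sketch (not the actual Pig `Algebraic` interface) of how SUM splits into Initial / Intermediate / Final so the combiner can pre-aggregate map output:

```python
# Sketch of algebraic decomposition of SUM: each phase produces a
# partial result that later phases can merge, so the combiner can
# shrink map output before the shuffle.

def initial(values):          # map side: one partial per chunk of input
    return sum(values)

def intermediate(partials):   # combiner: merge partial sums
    return sum(partials)

def final(partials):          # reduce side: merge into the final answer
    return sum(partials)

# Two map tasks each produce a partial; the combiner may merge any
# subset of them; the reducer merges whatever arrives.
map1 = initial([1, 2, 3])
map2 = initial([4, 5])
combined = intermediate([map1])      # combiner is optional in MR
result = final([combined, map2])
assert result == sum([1, 2, 3, 4, 5])
```

The key property is that Intermediate and Final only ever see partials, never raw input, so the phases compose in any grouping the framework chooses.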
  • 18. Implement the right UDF
  • Accumulator UDF
  – Reduce side UDF
  – Normally takes a bag
  • Benefit
  – Big bags are passed in batches
  – Avoids using too much memory
  – Batch size: pig.accumulative.batchsize=20000
  A = load 'input';
  B0 = group A by $0;
  C0 = foreach B0 generate group, my_accum(A);
  Store C0 into 'output0';
  my_accum extends Accumulator {
    public void accumulate() {
      // take a bag chunk
    }
    public void getValue() {
      // called after all bag chunks are processed
    }
  }
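The Accumulator contract above can be sketched concretely. This Python sketch (the real interface is Java; `MyAccum` and `feed_in_batches` are hypothetical names) shows the driver feeding the bag to the UDF in fixed-size batches, as controlled by pig.accumulative.batchsize:

```python
# Sketch of the Accumulator contract: the UDF never sees the whole
# bag; it receives fixed-size chunks and is asked for the final value
# only after the last chunk.

class MyAccum:
    def __init__(self):
        self.total = 0

    def accumulate(self, bag_chunk):   # called once per batch of tuples
        self.total += sum(bag_chunk)

    def get_value(self):               # called after all batches are in
        return self.total

def feed_in_batches(udf, bag, batch_size):
    """Stand-in for Pig's reduce-side driver loop."""
    for i in range(0, len(bag), batch_size):
        udf.accumulate(bag[i:i + batch_size])
    return udf.get_value()

bag = list(range(100))
assert feed_in_batches(MyAccum(), bag, batch_size=20) == sum(bag)
```

Because only one batch is resident at a time, memory use is bounded by the batch size rather than the bag size.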
  • 19. Memory optimization
  • Control bag size on reduce side
  – MapReduce reducer signature: reduce(Text key, Iterator<Writable> values, …)
  – Pig materializes the values iterator into one bag per input (Bag of Input 1, Bag of Input 2, Bag of Input 3)
  – If bag size exceeds a threshold, it spills to disk
  – Control the bag size to fit the bag in memory if possible: pig.cachedbag.memusage=0.2
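The spilling behavior can be illustrated with a toy spillable bag. This is a simplified Python sketch, not Pig's implementation (Pig's threshold is a fraction of heap via pig.cachedbag.memusage; here it is just a tuple count):

```python
# Sketch of a spillable bag: keep tuples in memory up to a threshold,
# spill full batches to a temp file, and iterate spilled batches
# before the in-memory remainder.
import pickle
import tempfile

class SpillableBag:
    def __init__(self, max_in_memory):
        self.max_in_memory = max_in_memory
        self.in_memory = []
        self.spill_file = None

    def add(self, item):
        self.in_memory.append(item)
        if len(self.in_memory) > self.max_in_memory:
            if self.spill_file is None:
                self.spill_file = tempfile.TemporaryFile()
            pickle.dump(self.in_memory, self.spill_file)  # spill a batch
            self.in_memory = []

    def __iter__(self):
        if self.spill_file is not None:
            self.spill_file.seek(0)
            while True:
                try:
                    yield from pickle.load(self.spill_file)
                except EOFError:
                    break
        yield from self.in_memory

bag = SpillableBag(max_in_memory=3)
for x in range(10):
    bag.add(x)
assert sorted(bag) == list(range(10))
```

A larger threshold means fewer spills and less IO, which is why fitting the bag in memory, when possible, pays off.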
  • 20. Optimization starts before pig
  • Input format
  • Serialization format
  • Compression
  • 21. Input format - Test Query
  > searches = load 'aol_search_logs.txt' using PigStorage() as (ID, Query, …);
  > search_thejas = filter searches by Query matches '.*thejas.*';
  > dump search_thejas;
  (1568578, thejasminesupperclub, …)
  • 22. Input formats
  – [Bar chart: runtime in seconds (0–140) for the test query under different input formats]
  • 23. Columnar format
  • RCFile
  • Columnar format for a group of rows
  • More efficient if you query a subset of columns
  • 24. Tests with RCFile
  • Tests with load + project + filter out all records
  • Using HCatalog, with compression and types
  • Test 1: project 1 out of 5 columns
  • Test 2: project all 5 columns
  • 25. RCFile test results
  – [Bar chart: runtime in seconds (0–140) for Plain Text vs RCFile, projecting 1 column and projecting all columns]
  • 26. Cost based optimizations
  • Optimization decisions based on your query/data
  • Often an iterative process: Run query → Measure → Tune
  • 27. Cost based optimization - Aggregation
  • Hash Based Agg
  – In the map task, the map logic's output goes to the HBA operator, which partially aggregates it before the reduce task
  – Use pig.exec.mapPartAgg=true to enable
  • 28. Cost based optimization - Hash Agg.
  • Auto off feature
  – Switches off HBA if the output reduction is not good enough
  • Configuring Hash Agg
  – Configure the auto off feature: pig.exec.mapPartAgg.minReduction
  – Configure the memory used: pig.cachedbag.memusage
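The auto-off idea can be sketched in a few lines. This Python sketch is an assumption-laden illustration, not Pig's code: it aggregates (key, value) pairs in a hash table and disables itself after a sample if the distinct-key count shows the output is not shrinking by at least the minReduction factor.

```python
# Sketch of in-map hash aggregation with auto-off: partially aggregate
# map output, but fall back to pass-through if the reduction factor is
# below min_reduction (cf. pig.exec.mapPartAgg.minReduction).

def map_side_aggregate(records, min_reduction=10, sample_size=1000):
    """records: (key, value) pairs emitted by the map logic."""
    table, seen, enabled, out = {}, 0, True, []
    for key, value in records:
        if not enabled:
            out.append((key, value))   # pass through; reducer aggregates
            continue
        table[key] = table.get(key, 0) + value
        seen += 1
        if seen == sample_size and len(table) * min_reduction > seen:
            # Not shrinking to 1/min_reduction of the input: switch off,
            # flush what we have, and stop paying the hashing cost.
            out.extend(table.items())
            table, enabled = {}, False
    out.extend(table.items())
    return out

# Heavily repeated keys: aggregation stays on and output collapses.
records = [("a", 1)] * 50 + [("b", 2)] * 50
assert sorted(map_side_aggregate(records)) == [("a", 50), ("b", 100)]
```

With mostly-unique keys the same function would disable itself after the sample, which is exactly the case where HBA's CPU cost is not worth paying.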
  • 29. Cost based optimization - Join
  • Use the appropriate join algorithm
  – Skew on join key: skew join
  – One input fits in memory: FR (fragment-replicate) join
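A fragment-replicate join is simple to sketch. In this Python illustration (not Pig's implementation), the small input is the part that would be replicated to every map task and loaded into a hash table, so the join needs no reduce phase at all:

```python
# Sketch of a fragment-replicate (FR) join: hash the small, replicated
# input in memory and stream the big input's fragment through it.

def fr_join(big_fragment, small_input):
    # Build the hash table once per map task from the replicated input.
    lookup = {}
    for key, value in small_input:
        lookup.setdefault(key, []).append(value)
    # Probe with each record of the big fragment; unmatched keys drop out.
    return [(k, v, s) for k, v in big_fragment for s in lookup.get(k, [])]

big = [(1, "pv1"), (2, "pv2"), (1, "pv3")]
small = [(1, "user1"), (3, "user3")]
assert fr_join(big, small) == [(1, "pv1", "user1"), (1, "pv3", "user1")]
```

The trade-off is memory: the whole small input must fit in each map task's heap, which is exactly the condition the slide states.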
  • 30. Cost based optimization - MR tuning
  • Tune MR parameters to reduce IO
  – Control spills using map sort params
  – Reduce shuffle/sort-merge params
  • 31. Parallelism of reduce tasks
  – [Chart: runtime (0:14:24–0:25:55) vs number of reduce tasks: 4, 6, 8, 24, 48, 256]
  • Number of reduce slots = 6
  • Factors affecting runtime
  – Cores simultaneously used/skew
  – Cost of having additional reduce tasks
  • 32. Cost based optimization - keep data sorted
  • Frequent join operations on the same keys
  – Keep data sorted on the keys
  – Use merge join
  – Optimized group on sorted keys
  – Works with few load functions; needs an additional interface implementation
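The payoff of sorted data is that a merge join is a single forward pass with no sort or shuffle. A minimal Python sketch of the algorithm (assuming, for simplicity, unique keys on the right side; Pig's merge join has the same sorted-input requirement but handles more cases):

```python
# Sketch of a merge join: both inputs already sorted by join key, so
# one linear pass over each input produces the joined output.

def merge_join(left, right):
    """left, right: lists of (key, value) sorted by key; right keys unique."""
    out, j = [], 0
    for k, lv in left:
        while j < len(right) and right[j][0] < k:
            j += 1                      # advance right cursor, never rewind
        if j < len(right) and right[j][0] == k:
            out.append((k, lv, right[j][1]))
    return out

left = [(1, "a"), (2, "b"), (2, "c"), (5, "d")]
right = [(1, "x"), (2, "y"), (4, "z")]
assert merge_join(left, right) == [(1, "a", "x"), (2, "b", "y"), (2, "c", "y")]
```

Sorting once and reusing the sorted copy amortizes the sort cost across every subsequent join and group on those keys.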
  • 33. Optimizations for sorted data
  – [Bar chart: runtime in seconds (0–90) comparing sort+sort+join+join (Sort1, Sort2, Join 1, Join 2 stages) against join + join on pre-sorted data]
  • 34. Future Directions
  • Optimize using stats
  – Using historical stats with HCatalog
  – Sampling
  • 35. Questions?
  • 36. © Hortonworks Inc. 2011 Page 36

Editor's Notes

  1. Pig’s optimizer applies these for you in most cases, but the user can often apply these rules more aggressively.
  2. With gzip we saw better compression (96-99%) but at the cost of a 4% slowdown. The compression of map output is enabled by default, and is done using snappy, which is part of hadoop. But the output of an MR job currently does not support snappy, and the lightweight compression algorithm LZO does not ship with apache hadoop as its license is GPL.
  3. Optimizations start before you write your pig query. The choice of how input is stored is made before you use pig, so pig does not get to help you there. The important criteria include serialization format and choice of compression.
  4. The numbers you see in practice can be different from what you expect from theory. So I ran some experiments to see how pig performs with different input options. I used the famous/infamous AOL search data released back in 2006. I wanted a query that does not do much, so I added a filter that looks for my name in the data. I was quite sure AOL users are not likely to be searching for me! But apparently, there is one row out of 36 million that matched my name, but that wasn’t actually me!
  5. I tried different ways of storing input, starting with the default PigStorage(), which uses a human readable text format. I measured the total time it took for all the map tasks, which was a total of 69 seconds for 36 M recs, that is around ½ M per core/sec. Then I tried the compressed form of PigStorage, which uses LZO for compression – the data size is reduced to a third. LZO compression is lightweight, so it does not add too much CPU overhead. The reduced input file size will save on IO, but in this case, the data copy is available locally and since the size is small it is likely to be in the OS cache. Compression will add more value when that is not the case. In the first two cases, I ran the query without specifying the data type for each column, so they did not get deserialized to the corresponding java types. When I specify the datatypes, PigStorage takes a lot longer. I tried the AvroStorage load function with types, and it performs significantly better than PigStorage with types.
  6. Pig introduced a new aggregation algorithm in pig 0.10. The only supported algorithm earlier was one that used the combiner. But the problem with the combiner is that MR serializes map output to a buffer, and then deserializes it, in the process of getting sorted data out to the combiner phase. The serialization-deserialization is expensive. So in 0.10, we use a hash based aggregation within the map itself, so that this cost can be avoided. Instead of the map logic’s output going to the combiner, it goes to the new HBA operator, which does partial aggregation and reduces output size. In 0.10 the hash based agg is off by default. This is because it is a new feature, and we thought of letting people try it out and give feedback. In most cases this should outperform combiner based aggregation. In theory there are a few extreme cases where combiner based aggregation can be useful.
  7. As you can see in the previous diagram, HBA’s usefulness depends on how much it reduces the map output. If it does not reduce it by much, the cpu cost of using HBA is not worth it. So hash-based agg has an auto-off feature. The operator stops trying to do aggregation if it sees that there is not much output size reduction happening. It is set to a factor of 10, i.e. if the data size does not get reduced to a 10th, it disables itself. But based on some performance tests we did, values like 3 or 4 are also safe for most cases. You can also configure the memory used by hash based agg using pig.cachedbag.memusage. It is the percentage of memory to be used for retaining bags of data in memory; a higher value keeps more records in memory and can help reduce output size, but if the value is too high, you run the risk of running out of memory. For most cases, the default of 20% is likely to work; it is not one of the first things to look at to improve performance.
  8. The common mapreduce parameters you can tweak are also applicable to pig. You would want to look at the map task spill counts, to see if spilling is happening more than once. If that is the case, then you want to see if you can allocate a larger sort buffer by increasing the value of the io.sort.mb configuration parameter. There are also other parameters that decide how the regions within the buffer are allocated; you can look at those to optimize it further. There are also reduce side shuffle parameters that you can look at; they can also help in reducing IO. You can specify the MR properties on the pig command line or set them in the properties file.
  9. TODO: open jira for optimized group on sorted data.
  10. Numbers using the Google 1-gram data, doing a join of the data against itself on word+year.