This document discusses optimizing data processing on Apache Pig. It describes Pig as a high-level language for analyzing large datasets. Various optimization techniques for Pig are covered, including pushing filters, partition pruning, intermediate file compression, and controlling multiquery jobs. Cost-based optimizations like aggregation algorithms and join strategies are also discussed. Keeping data sorted and using columnar formats can further improve performance. Future work includes optimizing queries using statistics and sampling.
Pig’s optimizer applies these optimizations for you in most cases, but you can often apply the rules more aggressively yourself.
With gzip we saw better compression (96-99%), but at the cost of a roughly 4% slowdown. Compression of map output is enabled by default and is done using snappy, which is part of Hadoop. But the final output of an MR job currently does not support snappy, and the lightweight LZO compression codec does not ship with Apache Hadoop because its license is GPL.
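The compression choices above map to a handful of properties. This is a sketch of how you might set them from a Pig script; pig.tmpfilecompression and its codec are standard Pig properties, while the map-output settings are the usual Hadoop 1.x names:

```pig
-- Compress the temporary files Pig writes between the MR jobs of a script
SET pig.tmpfilecompression true;
SET pig.tmpfilecompression.codec gz;  -- or lzo, if LZO is installed on the cluster

-- Map-output compression is a Hadoop setting; snappy is the usual codec
SET mapred.compress.map.output true;
SET mapred.map.output.compression.codec org.apache.hadoop.io.compress.SnappyCodec;
```

The same properties can also go in the pig.properties file or on the command line with -D.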
Optimization starts before you write your Pig query. The choice of how the input is stored is made before you use Pig, so Pig cannot help you there. The important criteria are the serialization format and the choice of compression.
The numbers you see in practice can differ from what theory predicts, so I ran some experiments to see how Pig performs with different input options. I used the famous (or infamous) AOL search data released back in 2006. I wanted a query that does not do much work, so I added a filter that looks for my name in the data. I was quite sure AOL users were not likely to be searching for me! But apparently there is one row out of 36 million that matched my name, though that wasn't actually me!
I tried different ways of storing the input. The default, PigStorage(), uses a human-readable text format. I measured the total time taken by all the map tasks: 69 seconds for 36 million records, which is around half a million records per core per second. Then I tried a compressed form of PigStorage, which uses LZO; the data size is reduced to a third. LZO is a lightweight compression, so it does not add much CPU overhead. The reduced input file size saves on IO, but in this case the data copy was available locally, and since it is small it is likely to be in the OS cache; compression adds more value when that is not the case. In the first two cases I ran the query without specifying a data type for each column, so the fields did not get deserialized into the corresponding Java types. When I specify the data types, PigStorage takes a lot longer. I also tried the AvroStorage load function with types, and it performs significantly better than PigStorage with types.
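The three variants look roughly like this in Pig Latin. The file paths, field names, and the filtered string are illustrative, not from the actual experiment; AvroStorage here is the piggybank load function:

```pig
-- 1. Untyped load: fields stay as bytearrays, no per-field deserialization
raw = LOAD 'aol_data' USING PigStorage('\t');

-- 2. Typed load: PigStorage parses every field into a Java type, which costs more
typed = LOAD 'aol_data' USING PigStorage('\t')
        AS (user:chararray, query:chararray, querytime:chararray);

-- 3. Avro: the schema travels with the data, and deserialization is cheaper
avro = LOAD 'aol_data.avro'
       USING org.apache.pig.piggybank.storage.avro.AvroStorage();
hits = FILTER avro BY query MATCHES '.*myname.*';
```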
Pig introduced a new aggregation algorithm in 0.10. The only algorithm supported earlier was combiner-based. The problem with the combiner is that MR serializes the map output into a buffer and then deserializes it again to feed sorted data to the combiner phase, and this serialization/deserialization is expensive. So in 0.10 we do a hash-based aggregation within the map itself, avoiding that cost. Instead of the map logic's output going to the combiner, it goes to the new HBA operator, which does partial aggregation and reduces the output size. In 0.10 hash-based aggregation is off by default: it is a new feature, and we wanted to let people try it out and give feedback. In most cases it should outperform combiner-based aggregation; in theory there are a few extreme cases where combiner-based aggregation can still be useful.
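Turning hash-based aggregation on is a one-line property. A minimal sketch, with an assumed input path and field names, of a group-and-count that would benefit from it:

```pig
-- Hash-based (in-map) aggregation is off by default in 0.10; enable it with:
SET pig.exec.mapPartAgg true;

queries = LOAD 'aol_data' USING PigStorage('\t')
          AS (user:chararray, query:chararray);
grouped = GROUP queries BY query;
-- Partial counts are now computed in the map by the HBA operator
counts  = FOREACH grouped GENERATE group, COUNT(queries);
```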
As you can see in the previous diagram, HBA's usefulness depends on how much it reduces the map output. If it does not reduce it by much, the CPU cost of using HBA is not worth it, so hash-based aggregation has an auto-off feature: the operator stops trying to aggregate if it sees that the output size is not shrinking much. The threshold is set to a factor of 10, i.e., if the data size does not get reduced to a tenth, HBA disables itself. But based on some performance tests we did, values like 3 or 4 are also safe for most cases. You can also configure the memory used by hash-based aggregation with pig.cachedbag.memusage, the percentage of memory to be used for retaining bags of data in memory. A higher value keeps more records in memory and can help reduce the output size, but if the value is too high you risk running out of memory. For most cases the default of 20% is likely to work, so this is not one of the first things to look at to improve performance.
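The two knobs mentioned above can be set per script; the values here are illustrative, chosen to match the discussion rather than recommended defaults:

```pig
-- Disable HBA only if output is not reduced to a third (default factor is 10)
SET pig.exec.mapPartAgg.minReduction 3;

-- Fraction of memory used to retain bags for partial aggregation (default 0.2)
SET pig.cachedbag.memusage 0.2;
```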
The common MapReduce parameters you can tweak also apply to Pig. Look at the map task spill counters to see whether spilling is happening more than once; if so, see whether you can allocate a larger sort buffer by increasing the io.sort.mb configuration parameter. There are other parameters that decide how the regions within the buffer are allocated, and you can look at those to optimize further. There are also reduce-side shuffle parameters that can help reduce IO. You can specify the MR properties on the Pig command line or set them in the properties file.
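For example, the sort-buffer tuning might look like this from inside a script (the specific values are illustrative; io.sort.mb and io.sort.spill.percent are the standard Hadoop 1.x names):

```pig
-- Larger map-side sort buffer, to avoid spilling more than once
SET io.sort.mb 256;
-- Spill later, when the buffer is 90% full
SET io.sort.spill.percent 0.90;
```

The same effect on the command line would be something like pig -Dio.sort.mb=256 myscript.pig.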
TODO: open jira for optimized group on sorted data.
Numbers are from the Google 1-gram data, doing a join of the data against itself on word+year.
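If the inputs are already sorted on the join keys, Pig's merge join can exploit that ordering and skip the shuffle. A sketch of the self-join under that assumption (field names and paths are illustrative):

```pig
-- Both inputs must already be sorted on (word, year) for a merge join
a = LOAD '1gram_sorted' AS (word:chararray, year:int, cnt:long);
b = LOAD '1gram_sorted' AS (word:chararray, year:int, cnt:long);
j = JOIN a BY (word, year), b BY (word, year) USING 'merge';
```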