This document discusses optimizing data processing on Apache Pig. It describes Pig as a high-level language for analyzing large datasets. Various optimization techniques for Pig are covered, including pushing filters, partition pruning, intermediate file compression, and controlling multiquery jobs. Cost-based optimizations like aggregation algorithms and join strategies are also discussed. Keeping data sorted and using columnar formats can further improve performance. Future work includes optimizing queries using statistics and sampling.
Pig’s optimizer applies these optimizations for you in most cases, but you can often apply the rules more aggressively yourself.
With gzip we saw better compression (96-99%), but at the cost of a roughly 4% slowdown. Compression of map output is enabled by default and is done using snappy, which is part of Hadoop. But the final output of an MR job currently does not support snappy, and the lightweight LZO compression codec does not ship with Apache Hadoop because its license is GPL.
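The compression choices above map to a handful of properties. This is a sketch of how you might set them from a Pig script; pig.tmpfilecompression and its codec are standard Pig properties, while the map-output settings are the usual Hadoop 1.x names:

```pig
-- Compress the temporary files Pig writes between the MR jobs of a script
SET pig.tmpfilecompression true;
SET pig.tmpfilecompression.codec gz;  -- or lzo, if LZO is installed on the cluster

-- Map-output compression is a Hadoop setting; snappy is the usual codec
SET mapred.compress.map.output true;
SET mapred.map.output.compression.codec org.apache.hadoop.io.compress.SnappyCodec;
```

The same properties can also go in the pig.properties file or on the command line with -D.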
Optimization starts before you write your Pig query. The choice of how the input is stored is made before you use Pig, so Pig cannot help you there. The important criteria are the serialization format and the choice of compression.
The numbers you see in practice can differ from what theory predicts, so I ran some experiments to see how Pig performs with different input options. I used the famous (or infamous) AOL search data released back in 2006. I wanted a query that does not do much work, so I added a filter that looks for my name in the data. I was quite sure AOL users were not likely to be searching for me! But apparently there is one row out of 36 million that matched my name, though that wasn't actually me!
I tried different ways of storing the input. The default, PigStorage(), uses a human-readable text format. I measured the total time taken by all the map tasks: 69 seconds for 36 million records, which is around half a million records per core per second. Then I tried a compressed form of PigStorage, which uses LZO; the data size is reduced to a third. LZO is a lightweight compression, so it does not add much CPU overhead. The reduced input file size saves on IO, but in this case the data copy was available locally, and since it is small it is likely to be in the OS cache; compression adds more value when that is not the case. In the first two cases I ran the query without specifying a data type for each column, so the fields did not get deserialized into the corresponding Java types. When I specify the data types, PigStorage takes a lot longer. I also tried the AvroStorage load function with types, and it performs significantly better than PigStorage with types.
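The three variants look roughly like this in Pig Latin. The file paths, field names, and the filtered string are illustrative, not from the actual experiment; AvroStorage here is the piggybank load function:

```pig
-- 1. Untyped load: fields stay as bytearrays, no per-field deserialization
raw = LOAD 'aol_data' USING PigStorage('\t');

-- 2. Typed load: PigStorage parses every field into a Java type, which costs more
typed = LOAD 'aol_data' USING PigStorage('\t')
        AS (user:chararray, query:chararray, querytime:chararray);

-- 3. Avro: the schema travels with the data, and deserialization is cheaper
avro = LOAD 'aol_data.avro'
       USING org.apache.pig.piggybank.storage.avro.AvroStorage();
hits = FILTER avro BY query MATCHES '.*myname.*';
```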
Pig introduced a new aggregation algorithm in 0.10. The only algorithm supported earlier was combiner-based. The problem with the combiner is that MR serializes the map output into a buffer and then deserializes it again to feed sorted data to the combiner phase, and this serialization/deserialization is expensive. So in 0.10 we do a hash-based aggregation within the map itself, avoiding that cost. Instead of the map logic's output going to the combiner, it goes to the new HBA operator, which does partial aggregation and reduces the output size. In 0.10 hash-based aggregation is off by default: it is a new feature, and we wanted to let people try it out and give feedback. In most cases it should outperform combiner-based aggregation; in theory there are a few extreme cases where combiner-based aggregation can still be useful.
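Turning hash-based aggregation on is a one-line property. A minimal sketch, with an assumed input path and field names, of a group-and-count that would benefit from it:

```pig
-- Hash-based (in-map) aggregation is off by default in 0.10; enable it with:
SET pig.exec.mapPartAgg true;

queries = LOAD 'aol_data' USING PigStorage('\t')
          AS (user:chararray, query:chararray);
grouped = GROUP queries BY query;
-- Partial counts are now computed in the map by the HBA operator
counts  = FOREACH grouped GENERATE group, COUNT(queries);
```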
As you can see in the previous diagram, HBA's usefulness depends on how much it reduces the map output. If it does not reduce it by much, the CPU cost of using HBA is not worth it, so hash-based aggregation has an auto-off feature: the operator stops trying to aggregate if it sees that the output size is not shrinking much. The threshold is set to a factor of 10, i.e., if the data size does not get reduced to a tenth, HBA disables itself. But based on some performance tests we did, values like 3 or 4 are also safe for most cases. You can also configure the memory used by hash-based aggregation with pig.cachedbag.memusage, the percentage of memory to be used for retaining bags of data in memory. A higher value keeps more records in memory and can help reduce the output size, but if the value is too high you risk running out of memory. For most cases the default of 20% is likely to work, so this is not one of the first things to look at to improve performance.
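The two knobs mentioned above can be set per script; the values here are illustrative, chosen to match the discussion rather than recommended defaults:

```pig
-- Disable HBA only if output is not reduced to a third (default factor is 10)
SET pig.exec.mapPartAgg.minReduction 3;

-- Fraction of memory used to retain bags for partial aggregation (default 0.2)
SET pig.cachedbag.memusage 0.2;
```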
The common MapReduce parameters you can tweak also apply to Pig. Look at the map task spill counters to see whether spilling is happening more than once; if so, see whether you can allocate a larger sort buffer by increasing the io.sort.mb configuration parameter. There are other parameters that decide how the regions within the buffer are allocated, and you can look at those to optimize further. There are also reduce-side shuffle parameters that can help reduce IO. You can specify the MR properties on the Pig command line or set them in the properties file.
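For example, the sort-buffer tuning might look like this from inside a script (the specific values are illustrative; io.sort.mb and io.sort.spill.percent are the standard Hadoop 1.x names):

```pig
-- Larger map-side sort buffer, to avoid spilling more than once
SET io.sort.mb 256;
-- Spill later, when the buffer is 90% full
SET io.sort.spill.percent 0.90;
```

The same effect on the command line would be something like pig -Dio.sort.mb=256 myscript.pig.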
TODO: open jira for optimized group on sorted data.
Numbers are from the Google 1-gram data, doing a join of the data against itself on word+year.
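If the inputs are already sorted on the join keys, Pig's merge join can exploit that ordering and skip the shuffle. A sketch of the self-join under that assumption (field names and paths are illustrative):

```pig
-- Both inputs must already be sorted on (word, year) for a merge join
a = LOAD '1gram_sorted' AS (word:chararray, year:int, cnt:long);
b = LOAD '1gram_sorted' AS (word:chararray, year:int, cnt:long);
j = JOIN a BY (word, year), b BY (word, year) USING 'merge';
```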