You can’t yet inline Python functions in a Pig Latin script. In 0.9 we’ll add the ability to put them in the same file.
Before 0.8, reusing a scalar result was hard in Pig because you could not use the result of one Pig Latin operation in another operation without joining them, even if the result was a single scalar value.
January 2011 HUG: Pig Presentation
Alan F. Gates<br />Pig 0.8 New Features<br />
Who am I?<br />Pig committer and PMC member<br />Architect in the grid team at Yahoo<br />Photo credit: Steven Guarnaccia, The Three Little Pigs<br />
Focus of Pig 0.8<br />Usability<br />Integration<br />Performance<br />Backwards compatibility with 0.7<br />
Better statistics<br />Statistics printed out at end of job run<br />Pig information stored in Hadoop’s job history files so you can mine the information and analyze your Pig usage<br />Loader for reading job history files included in Piggybank<br />New PigRunner interface that allows users to invoke Pig and get back a statistics object that contains stats information<br />Can also pass a listener to track Pig jobs as they run<br />Done for Oozie so it can show users Pig statistics<br />
Sample stats info<br />Job Stats (time in seconds):<br />JobId Maps Reduces MxMT MnMT AMT MxRT MnRT ART Alias<br />job_0 2 1 15 3 9 27 27 27 a,b,c,d,e<br />job_1 1 1 3 3 3 12 12 12 g,h<br />job_2 1 1 3 3 3 12 12 12 i<br />job_3 1 1 3 3 3 12 12 12 i<br />Input(s):<br />Successfully read 10000 records from: "studenttab10k"<br />Successfully read 10000 records from: "votertab10k"<br />Output(s):<br />Successfully stored 6 records (150 bytes) in: "outfile"<br />Counters:<br />Total records written : 6<br />Total bytes written : 150<br />
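As an illustration of mining these statistics, here is a minimal Python sketch that parses the Job Stats rows from the sample above into dictionaries. It is not part of Pig; `parse_job_stats` is a hypothetical helper, and the reading of the column abbreviations (max/min/average map and reduce times) is an assumption based on the header.

```python
# Column names follow the header of the sample output; the meanings given in
# the comments are an assumed reading of the abbreviations.
COLUMNS = [
    "JobId", "Maps", "Reduces",
    "MxMT", "MnMT", "AMT",   # max / min / average map time (assumed)
    "MxRT", "MnRT", "ART",   # max / min / average reduce time (assumed)
    "Alias",
]

SAMPLE = """\
job_0 2 1 15 3 9 27 27 27 a,b,c,d,e
job_1 1 1 3 3 3 12 12 12 g,h
job_2 1 1 3 3 3 12 12 12 i
job_3 1 1 3 3 3 12 12 12 i
"""


def parse_job_stats(text):
    """Parse whitespace-separated Job Stats rows into a list of dicts."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(COLUMNS):
            continue  # skip anything that is not a complete stats row
        row = dict(zip(COLUMNS, fields))
        for name in COLUMNS[1:-1]:
            row[name] = int(row[name])          # numeric columns
        row["Alias"] = row["Alias"].split(",")  # aliases handled by the job
        rows.append(row)
    return rows


jobs = parse_job_stats(SAMPLE)
slowest = max(jobs, key=lambda job: job["MxRT"])  # job with the longest reduce
total_maps = sum(job["Maps"] for job in jobs)
```

Once the rows are structured like this, questions such as "which job had the slowest reducer" become one-liners.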
Invoke Static Java Functions as UDFs<br />Often the UDF you need already exists as a Java function, e.g. Java’s URLDecoder.decode() for decoding URLs<br />define UrlDecode InvokeForString('java.net.URLDecoder.decode', 'String String');<br />A = load 'encoded.txt' as (e:chararray);<br />B = foreach A generate UrlDecode(e, 'UTF-8');<br />Currently only works with simple types and static functions<br />
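To make the effect of the decoding step concrete, the following Python sketch applies the same kind of URL decoding to a couple of sample strings. It uses Python's standard `urllib.parse.unquote_plus` as a stand-in for `java.net.URLDecoder.decode`; the `url_decode` wrapper and the sample inputs are hypothetical, and this is not how Pig invokes the Java function.

```python
from urllib.parse import unquote_plus


def url_decode(encoded, encoding="utf-8"):
    """Decode %XX escapes and '+' (as space), similar in spirit to URLDecoder.decode."""
    return unquote_plus(encoded, encoding=encoding)


# Hypothetical stand-in for the lines of 'encoded.txt'.
encoded_lines = ["Pig%200.8+features%21", "a%2Fb%3Fc%3D1"]
decoded = [url_decode(line) for line in encoded_lines]
```

Each input line maps to exactly one output value, which is why a static function over simple types is enough for this kind of UDF.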
Improved HBase Integration<br />Can now read records as bytes instead of auto converting to strings<br />Filters can be pushed down<br />Can store data in HBase as well as load from it<br />Works with HBase 0.20 but not 0.89 or 0.90. Patch in PIG-1680 addresses this but has not been committed yet.<br />
Casting Relations to Scalars<br />Say you want to calculate the percentage of page views per browser type (e.g. IE, Firefox, etc.)<br />views = load 'views' as (url, browser);<br />gv = group views all;<br />numviews = foreach gv generate COUNT(views) as total;<br />gb = group views by browser;<br />perbrowser = foreach gb generate group, COUNT(views) / (double)numviews.total;<br />Now it is possible to cast the relation numviews to a scalar value for use in later calculations<br />Pig handles storing the results in a file and retrieving it when needed<br />Only works for single row results<br />
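The computation on this slide can be sketched outside Pig to show why a scalar is needed: the total is computed once over all rows, then reused in every per-browser row. The Python below is only an illustration with made-up sample data; the variable names mirror the aliases in the script but nothing here is Pig API.

```python
from collections import Counter

# Hypothetical sample of the 'views' relation: (url, browser) tuples.
views = [
    ("/home", "IE"),
    ("/home", "Firefox"),
    ("/about", "IE"),
    ("/faq", "Safari"),
]

# Equivalent of "group views all" followed by COUNT: a single-row result
# that can be treated as a plain scalar.
total = len(views)

# Equivalent of "group views by browser" followed by COUNT per group.
per_browser_counts = Counter(browser for _url, browser in views)

# Each group's count is divided by the scalar total. Floating-point division
# is used here so the fraction is not truncated to an integer.
per_browser = {
    browser: count / total for browser, count in per_browser_counts.items()
}
```

Without a scalar, the single-row total would have to be joined against every per-browser row before the division could happen.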
Integrating MapReduce Jobs<br />Sometimes you need to integrate MR and Pig jobs<br />Legacy code<br />Algorithm that’s hard to implement in Pig<br />A = load 'WordcountInput.txt';<br />B = mapreduce 'wordcount.jar' store A into 'inputDir' load 'outputDir' as (word:chararray, count:int) `org.myorg.WordCount inputDir outputDir`;<br />C = foreach B …<br />
Plus a Whole Lot More<br />Custom Partitioners<br />B = group A by $0 partition by YourPartitioner parallel 2;<br />Greatly expanded string and math built-in UDFs<br />Performance Improvements<br />Automatic merging of small files<br />Compression of intermediate results<br />Safety Features<br />Parallel set automatically when not specified<br />Monitor your UDF by annotating it with @MonitoredUDF. If it takes too long to return, Pig will kill it and return a default value instead.<br />PigUnit for unit testing your Pig Latin scripts<br />
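The idea behind monitoring a UDF (give up on a slow call and substitute a default value) can be sketched in plain Python. This is a conceptual illustration only, not the @MonitoredUDF implementation; `call_with_timeout` and `slow_length` are hypothetical names, and unlike Pig this sketch cannot kill the slow call, it merely stops waiting for it.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def call_with_timeout(func, args, timeout_seconds, default):
    """Run func(*args); return default if it does not finish in time."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args)
    try:
        return future.result(timeout=timeout_seconds)
    except TimeoutError:
        return default  # the worker keeps running in the background
    finally:
        executor.shutdown(wait=False)  # do not block on the slow worker


def slow_length(text):
    """Hypothetical slow function standing in for a misbehaving UDF."""
    time.sleep(0.5)
    return len(text)


fast = call_with_timeout(len, ("pig",), 1.0, -1)           # finishes in time
slow = call_with_timeout(slow_length, ("pig",), 0.05, -1)  # times out
```

The default value lets the rest of the job continue with a recognizable placeholder instead of stalling on one bad record.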
Plus Even More I Probably Don’t Have Time to Talk About<br />New option for UNION to merge schemas<br />Map side COGROUP<br />DESCRIBE now works in nested FOREACH<br />Local shell commands can now be run from Grunt<br />Support for jars and scripts stored on dfs<br />Arbitrary jobconf key-value pairs can be set inside Pig Latin script using SET<br />Merge join extended<br />Support for more than two tables for inner join<br />Support for left, right, or full outer join for 2 tables<br />Pig artifacts now available via Maven<br />Significant memory improvements<br />
What’s Next?<br />Preview of Pig 0.9<br />Integrate Pig with scripting languages for control flow<br />Add macros to Pig Latin<br />Revive ILLUSTRATE<br />Fix most runtime type errors<br />Rewrite parser to give useful error messages<br />Programming Pig from O’Reilly Press<br />
Acknowledgements<br />Much of the content of this talk was taken from Dmitriy Ryaboy’s very nice summary of features in Pig 0.8: http://squarecog.wordpress.com/2010/12/19/new-features-in-apache-pig-0-8/<br />The Pig team, for writing and testing all this code, including many non-Yahoo contributors who contributed significantly to this release<br />