Pig, Making Hadoop Easy

Presentation by Alan Gates, Yahoo!, gates@yahoo-inc.com. Slides posted with permission.

  1. 1. Alan F. Gates<br />Yahoo!<br />Pig, Making Hadoop Easy<br />
  2. 2. Who Am I?<br />Pig committer<br />Hadoop PMC Member<br />An architect on the Yahoo! grid team<br />Or, as one coworker put it, “the lipstick on the Pig”<br />
  3. 3. Who are you?<br />
  4. 4. Motivation By Example<br /> Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited pages by users aged 18 - 25.<br />Load Users<br />Load Pages<br />Filter by age<br />Join on name<br />Group on url<br />Count clicks<br />Order by clicks<br />Take top 5<br />
  5. 5. In Map Reduce<br />
  6. 6. In Pig Latin<br />Users = load 'users' as (name, age);<br />Fltrd = filter Users by age >= 18 and age <= 25;<br />Pages = load 'pages' as (user, url);<br />Jnd = join Fltrd by name, Pages by user;<br />Grpd = group Jnd by url;<br />Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;<br />Srtd = order Smmd by clicks desc;<br />Top5 = limit Srtd 5;<br />store Top5 into 'top5sites';<br />
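For readers who do not know Pig Latin, the data flow of the script above can be sketched in plain Python. This is only an in-memory illustration of what the script computes, not how Pig executes it. The function name `top_pages` and the sample data are made up for the example, and it assumes user names are unique.

```python
from collections import Counter

def top_pages(users, pages, low=18, high=25, n=5):
    """users: iterable of (name, age); pages: iterable of (user, url)."""
    # "filter Users by age >= 18 and age <= 25"
    eligible = {name for name, age in users if low <= age <= high}
    # "join ... by name, ... by user", then "group by url" and COUNT
    clicks = Counter(url for user, url in pages if user in eligible)
    # "order by clicks desc" and "limit 5"
    return clicks.most_common(n)

# Hypothetical sample data
users = [("alice", 22), ("bob", 40), ("carol", 19)]
pages = [("alice", "a.com"), ("carol", "a.com"),
         ("bob", "b.com"), ("alice", "c.com")]
print(top_pages(users, pages))
```

Each Pig Latin statement maps to roughly one line here; the difference is that Pig compiles the same steps into Map Reduce jobs that run on a cluster.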
  7. 7. Performance<br />[Chart; point labels: 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7 (likely Pig release versions)]<br />
  8. 8. Why not SQL?<br />Data Collection<br />Data Factory: Pig (Pipelines, Iterative Processing, Research)<br />Data Warehouse: Hive (BI Tools, Analysis)<br />
  9. 9. Pig Highlights<br />User-defined functions (UDFs) can be written for column transformation (TOUPPER) or aggregation (SUM)<br />UDFs can be written to take advantage of the combiner<br />Four join implementations built in: hash, fragment-replicate, merge, skewed<br />Multi-query: Pig will combine certain types of operations together in a single pipeline to reduce the number of times data is scanned<br />Order by provides total ordering across reducers in a balanced way<br />Writing load and store functions is easy once an InputFormat and OutputFormat exist<br />Piggybank, a collection of user-contributed UDFs<br />
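Of the join implementations listed above, fragment-replicate is the easiest to picture: the small relation is held in memory and the large relation is streamed past it, so no reduce phase is needed. The sketch below shows that general technique in plain Python. It is not Pig's code; the function name and the tuple-index keys are assumptions made for the example.

```python
from collections import defaultdict

def fragment_replicate_join(large, small, large_key, small_key):
    """Join two relations of tuples; keys are tuple indexes (hypothetical API)."""
    # Build an in-memory hash table from the small (replicated) relation
    table = defaultdict(list)
    for row in small:
        table[row[small_key]].append(row)
    # Stream the large relation and emit one joined row per match
    for row in large:
        for match in table.get(row[large_key], ()):
            yield row + match

# Hypothetical sample data
pages = [("alice", "a.com"), ("bob", "b.com"), ("dave", "d.com")]
users = [("alice", 22), ("bob", 40)]
joined = list(fragment_replicate_join(pages, users, 0, 0))
print(joined)
```

The trade-off is memory: this only works when the replicated relation fits in memory on every task.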
  10. 10. Who uses Pig for What?<br />70% of production jobs at Yahoo! (tens of thousands per day)<br />Also used by Twitter, LinkedIn, eBay, AOL, …<br />Used to<br />Process web logs<br />Build user behavior models<br />Process images<br />Build maps of the web<br />Do research on raw data sets<br />
  11. 11. Accessing Pig<br />Submit a script directly<br />Grunt, the Pig shell<br />PigServer Java class, a JDBC-like interface<br />
  12. 12. Components<br />Job executes on cluster<br />Hadoop Cluster<br />Pig resides on user machine<br />User machine<br />No need to install anything extra on your Hadoop cluster.<br />
  13. 13. How It Works<br />Pig Latin<br />A = LOAD 'myfile' AS (x, y, z);<br />B = FILTER A BY x > 0;<br />C = GROUP B BY x;<br />D = FOREACH C GENERATE group, COUNT(B);<br />STORE D INTO 'output';<br />pig.jar:<br /><ul><li>parses</li><li>checks</li><li>optimizes</li><li>plans execution</li><li>submits jar to Hadoop</li><li>monitors job progress</li></ul>Execution Plan<br />Map:<br />Filter<br />Count<br />Combine/Reduce:<br />Sum<br />
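The execution plan on this slide (filter and count in the map, sum in the combine/reduce) can be imitated in a few lines of Python. This is a single-process sketch of the idea, not Pig's or Hadoop's implementation; the function names and the input splits are invented for the example.

```python
from collections import defaultdict

def map_phase(split):
    """Map: keep rows with x > 0 and emit (x, 1) for each one."""
    return [(x, 1) for (x, y, z) in split if x > 0]

def sum_counts(pairs):
    """Combine/Reduce: sum the partial counts for each key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return sorted(totals.items())

# Two hypothetical input splits of (x, y, z) rows
splits = [
    [(1, "a", "b"), (-1, "c", "d"), (1, "e", "f")],
    [(2, "g", "h"), (1, "i", "j")],
]
# The combiner runs once per split, shrinking the data sent to the reducer
partials = [sum_counts(map_phase(s)) for s in splits]
# The reducer sums the partial counts from all splits
result = sum_counts(pair for part in partials for pair in part)
print(result)
```

Because summing partial counts gives the same answer as counting everything at once, the same function can serve as both combiner and reducer.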
  19. 19. Demo<br />s3://hadoopday/pig_tutorial<br />
  20. 20. Upcoming Features<br />In 0.8 (plan to branch end of August, release this fall):<br />Runtime statistics collection<br />UDFs in scripting languages (e.g. Python)<br />Ability to specify a custom partitioner<br />Adding many string and math functions as Pig-supported UDFs<br />Post 0.8<br />Adding branches, loops, functions, and modules<br />Usability<br />Better error messages<br />Fix ILLUSTRATE<br />Improved integration with workflow systems<br />
  21. 21. Learn More<br />Read the online documentation: http://hadoop.apache.org/pig/<br />Online tutorials<br />From Yahoo!, http://developer.yahoo.com/hadoop/tutorial/<br />From Cloudera, http://www.cloudera.com/hadoop-training<br />Using Pig on EC2: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2728<br />A couple of Hadoop books include chapters on Pig; search at your favorite bookstore<br />Join the mailing lists:<br />pig-user@hadoop.apache.org for user questions<br />pig-dev@hadoop.apache.org for developer issues<br />howldev@yahoogroups.com for Howl<br />