Data-Intensive Computing for Text Analysis
                 CS395T / INF385T / LIN386M
            University of Texas at Austin, Fall 2011



                      Lecture 10
                   October 27, 2011

Jason Baldridge, Department of Linguistics, University of Texas at Austin
(jasonbaldridge at gmail dot com)

Matt Lease, School of Information, University of Texas at Austin
(ml at ischool dot utexas dot edu)




Acknowledgments
        Course design and slides based on
     Jimmy Lin’s cloud computing courses at
    the University of Maryland, College Park

Some figures and examples courtesy of the
following great Hadoop book (order yours today!):
• Tom White’s Hadoop: The Definitive Guide,
  2nd Edition (2010)

Today’s Agenda
• Machine Translation (wrap-up)
• Apache Pig
• Language Modeling

• Only Pig included in this slide deck
  – slides for other topics will be posted separately on
    course site



Apache Pig
grunt> DUMP A;
(Joe,cherry,2)
(Ali,apple,3)
(Joe,banana,2)
(Eve,apple,7)

grunt> B = FOREACH A GENERATE $0, $2+1, 'Constant';
grunt> DUMP B;
(Joe,3,Constant)
(Ali,4,Constant)
(Joe,3,Constant)
(Eve,8,Constant)
-- A and B now hold new sample data:
grunt> DUMP A;
(2,Tie)
(4,Coat)
(3,Hat)
(1,Scarf)

grunt> DUMP B;
(Joe,2)
(Hank,4)
(Ali,0)
(Eve,3)
(Hank,2)

grunt> C = JOIN A BY $0, B BY $1;
grunt> DUMP C;
(2,Tie,Joe,2)
(2,Tie,Hank,2)
(3,Hat,Eve,3)
(4,Coat,Hank,4)

grunt> C = JOIN A BY $0 LEFT OUTER, B BY $1;
grunt> DUMP C;
(1,Scarf,,)
(2,Tie,Joe,2)
(2,Tie,Hank,2)
(3,Hat,Eve,3)
(4,Coat,Hank,4)

grunt> D = COGROUP A BY $0, B BY $1;
grunt> DUMP D;
(0,{},{(Ali,0)})
(1,{(1,Scarf)},{})
(2,{(2,Tie)},{(Joe,2),(Hank,2)})
(3,{(3,Hat)},{(Eve,3)})
(4,{(4,Coat)},{(Hank,4)})

grunt> E = COGROUP A BY $0 INNER, B BY $1;
grunt> DUMP E;
(1,{(1,Scarf)},{})
(2,{(2,Tie)},{(Joe,2),(Hank,2)})
(3,{(3,Hat)},{(Eve,3)})
(4,{(4,Coat)},{(Hank,4)})

grunt> F = FOREACH E GENERATE FLATTEN(A), B.$0;
grunt> DUMP F;
(1,Scarf,{})
(2,Tie,{(Joe),(Hank)})
(3,Hat,{(Eve)})
(4,Coat,{(Hank)})
grunt> I = CROSS A, B;
grunt> DUMP I;
(2,Tie,Joe,2)
(2,Tie,Hank,4)
(2,Tie,Ali,0)
(2,Tie,Eve,3)
(2,Tie,Hank,2)
(4,Coat,Joe,2)
(4,Coat,Hank,4)
(4,Coat,Ali,0)
(4,Coat,Eve,3)
(4,Coat,Hank,2)
(3,Hat,Joe,2)
(3,Hat,Hank,4)
(3,Hat,Ali,0)
(3,Hat,Eve,3)
(3,Hat,Hank,2)
(1,Scarf,Joe,2)
(1,Scarf,Hank,4)
(1,Scarf,Ali,0)
(1,Scarf,Eve,3)
(1,Scarf,Hank,2)
-- A now holds two-field data:
grunt> DUMP A;
(Joe,cherry)
(Ali,apple)
(Joe,banana)
(Eve,apple)

grunt> B = GROUP A BY $0;
grunt> DUMP B;
(Joe,{(Joe,cherry),(Joe,banana)})
(Ali,{(Ali,apple)})
(Eve,{(Eve,apple)})

grunt> C = GROUP A BY $1;
grunt> DUMP C;
(cherry,{(Joe,cherry)})
(apple,{(Ali,apple),(Eve,apple)})
(banana,{(Joe,banana)})

-- group by the number of characters in the second field:
grunt> B = GROUP A BY SIZE($1);
grunt> DUMP B;
(5L,{(Ali,apple),(Eve,apple)})
(6L,{(Joe,cherry),(Joe,banana)})

grunt> C = GROUP A ALL;
grunt> DUMP C;
(all,{(Joe,cherry),(Ali,apple),(Joe,banana),(Eve,apple)})

grunt> D = GROUP A ANY; -- random sampling
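(Pig also provides a dedicated SAMPLE operator for random sampling; e.g. S = SAMPLE A 0.1; keeps roughly 10% of A’s tuples.)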
-- back to the original three-field A:
grunt> DUMP A;
(Joe,cherry,2)
(Ali,apple,3)
(Joe,banana,2)
(Eve,apple,7)

-- Streaming as in Hadoop via stdin and stdout
grunt> C = STREAM A THROUGH `cut -f 2`;
grunt> DUMP C;
(cherry)
(apple)
(banana)
(apple)


grunt> D = DISTINCT C;
grunt> DUMP D;
(cherry)
(apple)
(banana)
grunt> C = STREAM A THROUGH `cut -f 2`;


-- use external script (e.g. Python)
-- use DEFINE not only to create alias, but to ship to cluster
-- cluster needs appropriate software installed (e.g. Python)
grunt> DEFINE my_function `myfunc.py` SHIP ('foo/myfunc.py');

grunt> C = STREAM A THROUGH my_function;
-- A now holds numeric pairs:
grunt> DUMP A;
(2,3)
(1,2)
(2,4)

grunt> B = ORDER A BY $0, $1 DESC;
grunt> DUMP B;
(1,2)
(2,4)
(2,3)

-- further processing is not guaranteed to preserve the ordering!
grunt> C = FOREACH B GENERATE *;

-- LIMIT, however, preserves the order
grunt> D = LIMIT B 2;
grunt> DUMP D;
(1,2)
(2,4)
grunt> DUMP A;
(2,3)
(1,2)
(2,4)

-- B now holds new data:
grunt> DUMP B;
(z,x,8)
(w,y,1)

grunt> C = UNION A, B;

grunt> DUMP C;
(z,x,8)
(w,y,1)
(2,3)
(1,2)
(2,4)

grunt> DESCRIBE A;
A: {f0: int,f1: int}

grunt> DESCRIBE B;
B: {f0: chararray,f1: chararray,f2: int}

grunt> DESCRIBE C;
Schema for C unknown.
grunt> records = LOAD 'foo.txt'
>> AS (year:chararray, temperature:int, quality:int);

grunt> DUMP records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)

grunt> DESCRIBE records;
records: {year: chararray, temperature:int, quality: int}

grunt> DUMP records; -- note: the last record differs from before (quality is now 2)
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,2)

grunt> filtered_records = FILTER records
>> BY temperature >= 0 AND
>> (quality == 0 OR quality == 1);

grunt> DUMP filtered_records;
(1950,0,1)
(1950,22,1)
(1949,111,1)
-- reload the original data:
grunt> records = LOAD 'foo.txt'
>> AS (year:chararray, temperature:int, quality:int);

grunt> DUMP records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)

grunt> grouped = GROUP records BY year;

grunt> DUMP grouped;
(1949,{(1949,111,1),(1949,78,1)})
(1950,{(1950,0,1),(1950,22,1),(1950,-11,1)})

grunt> DESCRIBE grouped;
grouped: {group: chararray,
          records: {year: chararray, temperature:int, quality: int} }

grunt> max_temp = FOREACH grouped GENERATE group,
>> MAX(records.temperature);

grunt> DUMP max_temp;
(1949,111)
(1950,22)

-- let’s put it all together
records = LOAD 'input/ncdc/micro-tab/sample.txt'
AS (year:chararray, temperature:int, quality:int);

filtered_records = FILTER records BY temperature != 9999 AND
(quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR
quality == 9);

grouped_records = GROUP filtered_records BY year;

max_temp = FOREACH grouped_records GENERATE group,
MAX(filtered_records.temperature);

DUMP max_temp;
grunt> ILLUSTRATE max_temp;
---------------------------------------------------------------------------
| records | year: bytearray | temperature: bytearray | quality: bytearray |
---------------------------------------------------------------------------
|         | 1949            | 9999                   | 1                  |
|         | 1949            | 111                    | 1                  |
|         | 1949            | 78                     | 1                  |
---------------------------------------------------------------------------
---------------------------------------------------------------
| records | year: chararray | temperature: int | quality: int |
---------------------------------------------------------------
|         | 1949            | 9999             | 1            |
|         | 1949            | 111              | 1            |
|         | 1949            | 78               | 1            |
---------------------------------------------------------------
------------------------------------------------------------------------
| filtered_records | year: chararray | temperature: int | quality: int |
------------------------------------------------------------------------
|                  | 1949            | 111              | 1            |
|                  | 1949            | 78               | 1            |
------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------
| grouped_records | group: chararray | filtered_records: bag({year: chararray, temperature: int, quality: int}) |
-----------------------------------------------------------------------------------------------------------------
|                 | 1949             | {(1949, 111, 1), (1949, 78, 1)}                                          |
-----------------------------------------------------------------------------------------------------------------
-------------------------------------
| max_temp | group: chararray | int |
-------------------------------------
|          | 1949             | 111 |
-------------------------------------
White p. 172

Multiquery execution
A = LOAD 'input/pig/multiquery/A';
B = FILTER A BY $1 == 'banana';
C = FILTER A BY $1 != 'banana';
STORE B INTO 'output/b';
STORE C INTO 'output/c';

“Relations B and C are both derived from A, so to save reading A
twice, Pig can run this script as a single MapReduce job by reading A
once and writing two output files from the job, one for each of B and
C.”
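To get the old behavior of one MapReduce job per STORE (e.g., when debugging a script stage by stage), multiquery execution can be disabled on the command line; in the Pig releases covered here the flag is -M (long form -no_multiquery):

pig -no_multiquery script.pig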




White p. 172

Handling data corruption

grunt> records = LOAD 'corrupt.txt'
>> AS (year:chararray, temperature:int, quality:int);

grunt> DUMP records;
(1950,0,1)
(1950,22,1)
(1950,,1)
(1949,111,1)
(1949,78,1)

grunt> corrupt = FILTER records BY temperature is null;
grunt> DUMP corrupt;
(1950,,1)


White p. 172

Handling data corruption
grunt> corrupt = FILTER records BY temperature is null;
grunt> DUMP corrupt;
(1950,,1)

grunt> grouped = GROUP corrupt ALL;

grunt> all_grouped = FOREACH grouped GENERATE group,
COUNT(corrupt);

grunt> DUMP all_grouped;
(all,1)




White p. 172

Handling data corruption
grunt> corrupt = FILTER records BY temperature is null;
grunt> DUMP corrupt;
(1950,,1)

grunt> SPLIT records INTO good IF temperature is not null,
>> bad IF temperature is null;

grunt> DUMP good;
(1950,0,1)
(1950,22,1)
(1949,111,1)
(1949,78,1)

grunt> DUMP bad;
(1950,,1)
White p. 172

Handling missing data
grunt> A = LOAD 'input/pig/corrupt/missing_fields';
grunt> DUMP A;
(2,Tie)
(4,Coat)
(3)
(1,Scarf)

grunt> B = FILTER A BY SIZE(*) > 1;

grunt> DUMP B;
(2,Tie)
(4,Coat)
(1,Scarf)


User Defined Functions (UDFs)
   Written in Java
   Sub-class EvalFunc or FilterFunc
       FilterFunc sub-classes EvalFunc with type T=Boolean


public abstract class EvalFunc<T> {
        public abstract T exec(Tuple input) throws IOException;
}




PiggyBank: public library of Pig functions
   http://wiki.apache.org/pig/PiggyBank
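
For concreteness, a minimal sketch of what a filter UDF like the IsGoodQuality function used on the next slide could look like (a plausible reconstruction, not copied verbatim from White’s book):

package com.hadoopbook.pig;

import java.io.IOException;
import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

// Filter UDF: returns true iff the tuple's single field is a "good" quality code
public class IsGoodQuality extends FilterFunc {
    @Override
    public Boolean exec(Tuple tuple) throws IOException {
        if (tuple == null || tuple.size() == 0) {
            return false;
        }
        Object object = tuple.get(0);
        if (object == null) {
            return false;  // a null quality field is not good
        }
        int i = (Integer) object;
        return i == 0 || i == 1 || i == 4 || i == 5 || i == 9;
    }
}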
UDF Example
 filtered = FILTER records BY temperature != 9999 AND
 (quality == 0 OR quality == 1 OR quality == 4
               OR quality == 5 OR quality == 9);


 grunt> REGISTER my-udfs.jar;

 grunt> filtered = FILTER records BY temperature != 9999
 AND com.hadoopbook.pig.IsGoodQuality(quality);

 -- aliasing
 grunt> DEFINE isGood com.hadoopbook.pig.IsGoodQuality();

 grunt> filtered_records = FILTER records
 >> BY temperature != 9999 AND isGood(quality);
More on UDFs (functionals)
Pig translates
          com.hadoopbook.pig.IsGoodQuality(x)
to
          com.hadoopbook.pig.IsGoodQuality.exec(x);


    Look for a class named “com.hadoopbook.pig.IsGoodQuality” in the registered JAR
    Instantiate an instance as specified by DEFINE clause
        Example below uses the default constructor (no arguments)
          • Default behavior if no corresponding DEFINE clause is specified
        Can optionally pass other constructor arguments to parameterize
         different UDF behaviors at run-time

     grunt> DEFINE isGood com.hadoopbook.pig.IsGoodQuality();
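
For example, a UDF whose constructor takes a String could be parameterized at DEFINE time (the class name and argument below are hypothetical, for illustration only):

grunt> DEFINE isAbove com.example.pig.IsAboveThreshold('9999');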
Case Sensitivity
    Operators & commands are not case-sensitive
    Aliases & function names are case-sensitive
        Why?
        Pig resolves function calls by
          • Treating the function’s name as a Java classname
          • Trying to load a class with that name
        Java classnames are case-sensitive
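
A quick sketch of the distinction:

grunt> a = load 'foo.txt';  -- fine: load is the same operator as LOAD
grunt> DUMP A;              -- fails: alias A was never defined (only a was)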
Setting the number of reducers
   Like Hadoop, the number of reduce tasks defaults to 1
grouped_records = GROUP records BY year PARALLEL 30;
   Can use optional PARALLEL clause for reduce operators
       grouping & joining (GROUP, COGROUP, JOIN, CROSS)
       DISTINCT
       ORDER


   Number of map tasks auto-determined as in Hadoop
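
Later Pig releases (0.8 and up, newer than the versions matched below) also allow a script-wide default, e.g.:

grunt> set default_parallel 30;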
Setting and using parameters
 pig -param input=in.txt -param output=out.txt foo.pig

 OR

 # foo.param
 input=/user/tom/input/ncdc/micro-tab/sample.txt
 output=/tmp/out

 pig -param_file foo.param foo.pig

 THEN

 records = LOAD '$input';
 …
 STORE x into '$output';
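
To inspect the script after parameter substitution without running it, Pig also offers a -dryrun flag:

pig -dryrun -param_file foo.param foo.pig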
Running Pig
   Version Matching: Hadoop & Pig
       Use Pig 0.3-0.4 with Hadoop 0.18
       Use Pig 0.5-0.7 with Hadoop 0.20.x.
         • uses the new MapReduce API



   Pig is pure client-side
       no software to install on cluster
       Pig run-time generates Hadoop programs



   As with Hadoop, can run Pig local or in distributed mode
Ways to run Pig
   Script file: pig script.pig
   Command line: pig -e "DUMP a;"
   grunt> interactive shell
   Embedded: launch Pig programs from Java code


   PigPen Eclipse plug-in
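
A minimal sketch of the embedded route via PigServer (paths and relation names are illustrative):

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPig {
    public static void main(String[] args) throws Exception {
        // ExecType.LOCAL runs Pig locally; use ExecType.MAPREDUCE against a cluster
        PigServer pig = new PigServer(ExecType.LOCAL);
        pig.registerQuery("records = LOAD 'input.txt' AS (year:chararray, temperature:int);");
        pig.registerQuery("hot = FILTER records BY temperature > 100;");
        pig.store("hot", "output");  // triggers execution and writes the results
    }
}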
Pig types
   See the tables in White, pp. 333 and 337 (not reproduced in this deck)
For More Information on Pig…
   http://hadoop.apache.org/pig
   http://wiki.apache.org/pig
