Here are a few ways to handle the corrupted record:
1. Remove it: filtered = FILTER records BY temperature is not null;
2. Assign a default value: records = FOREACH records GENERATE year, (temperature is null ? 0 : temperature) AS temperature, quality;
3. Log/report it without including it in the analysis.
The key is to first identify corrupted records, then decide how to handle them based on your analysis needs; removing them is common for aggregation. (White p. 172)
This document provides an overview and agenda for a lecture on Apache Pig. Pig is presented as a platform for analyzing large datasets. The lecture will cover topics like loading and filtering data, joins, grouping, ordering, and user-defined functions (UDFs) in Pig. UDFs allow users to extend Pig's capabilities by writing functions in Java. The lecture aims to demonstrate how Pig can simplify working with large datasets by providing high-level abstractions over MapReduce.
1. Data-Intensive Computing for Text Analysis
CS395T / INF385T / LIN386M
University of Texas at Austin, Fall 2011
Lecture 10
October 27, 2011
Jason Baldridge Matt Lease
Department of Linguistics School of Information
University of Texas at Austin University of Texas at Austin
Jasonbaldridge at gmail dot com ml at ischool dot utexas dot edu
2. Acknowledgments
Course design and slides based on
Jimmy Lin’s cloud computing courses at
the University of Maryland, College Park
Some figures and examples courtesy of the
following great Hadoop book (order yours today!)
• Tom White’s Hadoop: The Definitive Guide,
2nd Edition (2010)
3. Today’s Agenda
• Machine Translation (wrap-up)
• Apache Pig
• Language Modeling
• Only Pig included in this slide deck
– slides for other topics will be posted separately on
course site
9. grunt> DUMP A;
(2,Tie)
(4,Coat)
(3,Hat)
(1,Scarf)
grunt> DUMP B;
(Joe,2)
(Hank,4)
(Ali,0)
(Eve,3)
(Hank,2)
grunt> C = JOIN A BY $0 LEFT OUTER, B BY $1;
grunt> DUMP C;
(1,Scarf,,)
(2,Tie,Joe,2)
(2,Tie,Hank,2)
(3,Hat,Eve,3)
(4,Coat,Hank,4)
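A left outer join keeps every tuple of A; where B has no matching key, the B-side fields come out null, as in (1,Scarf,,). A minimal Python sketch of these semantics (not Pig's implementation; output is sorted here for determinism, whereas Pig's output order is unspecified):

```python
def left_outer_join(a, b, a_key, b_key):
    """Join a with b on the given field indexes, keeping unmatched a-tuples."""
    result = []
    for ta in a:
        matches = [tb for tb in b if tb[b_key] == ta[a_key]]
        if matches:
            result.extend(ta + tb for tb in matches)
        else:
            result.append(ta + (None, None))  # null-padded, like Pig's (1,Scarf,,)
    return sorted(result)

A = [(2, 'Tie'), (4, 'Coat'), (3, 'Hat'), (1, 'Scarf')]
B = [('Joe', 2), ('Hank', 4), ('Ali', 0), ('Eve', 3), ('Hank', 2)]
C = left_outer_join(A, B, a_key=0, b_key=1)
# C[0] == (1, 'Scarf', None, None); key 2 matches both Joe and Hank
```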
10. grunt> DUMP A;
(2,Tie)
(4,Coat)
(3,Hat)
(1,Scarf)
grunt> DUMP B;
(Joe,2)
(Hank,4)
(Ali,0)
(Eve,3)
(Hank,2)
grunt> D = COGROUP A BY $0, B BY $1;
grunt> DUMP D;
(0,{},{(Ali,0)})
(1,{(1,Scarf)},{})
(2,{(2,Tie)},{(Joe,2),(Hank,2)})
(3,{(3,Hat)},{(Eve,3)})
(4,{(4,Coat)},{(Hank,4)})
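COGROUP produces one tuple per key seen in either relation, carrying a bag of matching tuples from each input; a key with no match in one relation gets an empty bag there. A Python sketch of these semantics (bags here preserve input order; Pig guarantees no bag order):

```python
def cogroup(a, b, a_key, b_key):
    """For each key in either relation, emit (key, bag_from_a, bag_from_b)."""
    keys = {t[a_key] for t in a} | {t[b_key] for t in b}
    return sorted(
        (k,
         [t for t in a if t[a_key] == k],
         [t for t in b if t[b_key] == k])
        for k in keys
    )

A = [(2, 'Tie'), (4, 'Coat'), (3, 'Hat'), (1, 'Scarf')]
B = [('Joe', 2), ('Hank', 4), ('Ali', 0), ('Eve', 3), ('Hank', 2)]
D = cogroup(A, B, a_key=0, b_key=1)
# D[0] == (0, [], [('Ali', 0)])  -- key 0 appears only in B
```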
11. grunt> DUMP A;
(2,Tie)
(4,Coat)
(3,Hat)
(1,Scarf)
grunt> DUMP B;
(Joe,2)
(Hank,4)
(Ali,0)
(Eve,3)
(Hank,2)
grunt> E = COGROUP A BY $0 INNER, B BY $1;
grunt> DUMP E;
(1,{(1,Scarf)},{})
(2,{(2,Tie)},{(Joe,2),(Hank,2)})
(3,{(3,Hat)},{(Eve,3)})
(4,{(4,Coat)},{(Hank,4)})
12. grunt> E = COGROUP A BY $0 INNER, B BY $1;
grunt> DUMP E;
(1,{(1,Scarf)},{})
(2,{(2,Tie)},{(Joe,2),(Hank,2)})
(3,{(3,Hat)},{(Eve,3)})
(4,{(4,Coat)},{(Hank,4)})
grunt> F = FOREACH E GENERATE FLATTEN(A), B.$0;
grunt> DUMP F;
(1,Scarf,{})
(2,Tie,{(Joe),(Hank)})
(3,Hat,{(Eve)})
(4,Coat,{(Hank)})
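FLATTEN(A) removes the nesting around A's bag, yielding one output tuple per tuple in the bag, while B.$0 projects field 0 inside each B-bag but leaves the result as a bag. A Python sketch of this FOREACH applied to the cogrouped tuples above:

```python
def flatten_cogroup(e):
    """Mimic FOREACH E GENERATE FLATTEN(A), B.$0 on (key, a_bag, b_bag) tuples."""
    out = []
    for key, a_bag, b_bag in e:
        names = [(t[0],) for t in b_bag]   # B.$0: project field 0, still a bag
        for a_tuple in a_bag:              # FLATTEN: one output row per A tuple
            out.append(a_tuple + (names,))
    return out

E = [(1, [(1, 'Scarf')], []),
     (2, [(2, 'Tie')], [('Joe', 2), ('Hank', 2)]),
     (3, [(3, 'Hat')], [('Eve', 3)]),
     (4, [(4, 'Coat')], [('Hank', 4)])]
F = flatten_cogroup(E)
# F[1] == (2, 'Tie', [('Joe',), ('Hank',)])
```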
14. grunt> DUMP A;
(Joe,cherry)
(Ali,apple)
(Joe,banana)
(Eve,apple)
grunt> B = GROUP A BY $0;
grunt> DUMP B;
(Joe,{(Joe,cherry),(Joe,banana)})
(Ali,{(Ali,apple)})
(Eve,{(Eve,apple)})
grunt> C = GROUP A BY $1;
grunt> DUMP C;
(cherry,{(Joe,cherry)})
(apple,{(Ali,apple),(Eve,apple)})
(banana,{(Joe,banana)})
15. grunt> DUMP A;
(Joe,cherry)
(Ali,apple)
(Joe,banana)
(Eve,apple)
-- group by the number of characters in the second field:
grunt> B = GROUP A BY SIZE($1);
grunt> DUMP B;
(5L,{(Ali,apple),(Eve,apple)})
(6L,{(Joe,cherry),(Joe,banana)})
grunt> C = GROUP A ALL;
grunt> DUMP C;
(all,{(Joe,cherry),(Ali,apple),(Joe,banana),(Eve,apple)})
grunt> D = GROUP A ANY; -- random sampling
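GROUP collects tuples into one (key, bag) pair per distinct key; the key can be any expression (here SIZE($1), the length of the second field), and GROUP ... ALL puts everything into a single group keyed "all". A Python sketch of these semantics:

```python
from collections import defaultdict

def group_by(relation, keyfunc):
    """Mimic GROUP A BY <expr>: one (key, bag) tuple per distinct key value."""
    groups = defaultdict(list)
    for t in relation:
        groups[keyfunc(t)].append(t)
    return sorted(groups.items())

A = [('Joe', 'cherry'), ('Ali', 'apple'), ('Joe', 'banana'), ('Eve', 'apple')]
# GROUP A BY SIZE($1): key is the length of the second field (5 or 6)
B = group_by(A, lambda t: len(t[1]))
# GROUP A ALL: every tuple falls into the single group 'all'
C = group_by(A, lambda t: 'all')
```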
17. grunt> C = STREAM A THROUGH `cut -f 2`;
-- use external script (e.g. Python)
-- use DEFINE not only to create alias, but to ship to cluster
-- cluster needs appropriate software installed (e.g. Python)
grunt> DEFINE my_function `myfunc.py` SHIP ('foo/myfunc.py');
grunt> C = STREAM A THROUGH my_function;
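A script shipped with DEFINE ... SHIP is an ordinary executable: Pig streams each tuple to it as one tab-separated line on stdin and reads output tuples from stdout, one per line. A sketch of what a myfunc.py could look like; the transformation (upper-casing the second field) is purely illustrative, not from the slides:

```python
def transform(line):
    """Transform one tab-separated tuple; here, upper-case the second field."""
    fields = line.rstrip('\n').split('\t')
    if len(fields) > 1:
        fields[1] = fields[1].upper()
    return '\t'.join(fields)

# In the streaming script itself, the main loop would be:
#   import sys
#   for line in sys.stdin:
#       print(transform(line))
```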
18. grunt> DUMP A;
(2,3)
(1,2)
(2,4)
grunt> B = ORDER A BY $0, $1 DESC;
grunt> DUMP B;
(1,2)
(2,4)
(2,3)
-- ordering not preserved!
grunt> C = FOREACH B GENERATE *;
-- order preserved
grunt> D = LIMIT B 2;
grunt> DUMP D;
(1,2)
(2,4)
25. records = LOAD 'input/ncdc/micro-tab/sample.txt'
AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
(quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR
quality == 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
MAX(filtered_records.temperature);
DUMP max_temp;
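The same LOAD / FILTER / GROUP / MAX pipeline can be sketched in plain Python to check the expected result; the quality codes and the 9999 sentinel come from the script above, while the sample records are illustrative:

```python
from collections import defaultdict

GOOD_QUALITY = {0, 1, 4, 5, 9}

def max_temp(records):
    """FILTER out bad readings, GROUP BY year, take MAX temperature per year."""
    filtered = [(y, t, q) for (y, t, q) in records
                if t != 9999 and q in GOOD_QUALITY]
    by_year = defaultdict(list)
    for year, temp, _ in filtered:
        by_year[year].append(temp)
    return sorted((year, max(temps)) for year, temps in by_year.items())

records = [('1950', 0, 1), ('1950', 22, 1), ('1950', 9999, 1),
           ('1949', 111, 1), ('1949', 78, 1)]
# max_temp(records) == [('1949', 111), ('1950', 22)]
```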
27. White p. 172
Multiquery execution
A = LOAD 'input/pig/multiquery/A';
B = FILTER A BY $1 == 'banana';
C = FILTER A BY $1 != 'banana';
STORE B INTO 'output/b';
STORE C INTO 'output/c';
“Relations B and C are both derived from A, so to save reading A
twice, Pig can run this script as a single MapReduce job by reading A
once and writing two output files from the job, one for each of B and
C.”
28. White p. 172
Handling data corruption
grunt> records = LOAD 'corrupt.txt'
>> AS (year:chararray, temperature:int, quality:int);
grunt> DUMP records;
(1950,0,1)
(1950,22,1)
(1950,,1)
(1949,111,1)
(1949,78,1)
grunt> corrupt = FILTER records BY temperature is null;
grunt> DUMP corrupt;
(1950,,1)
29. White p. 172
Handling data corruption
grunt> corrupt = FILTER records BY temperature is null;
grunt> DUMP corrupt;
(1950,,1)
grunt> grouped = GROUP corrupt ALL;
grunt> all_grouped = FOREACH grouped GENERATE group,
COUNT(corrupt);
grunt> DUMP all_grouped;
(all,1)
30. White p. 172
Handling data corruption
grunt> corrupt = FILTER records BY temperature is null;
grunt> DUMP corrupt;
(1950,,1)
grunt> SPLIT records INTO good IF temperature is not null,
>> bad IF temperature is null;
grunt> DUMP good;
(1950,0,1)
(1950,22,1)
(1949,111,1)
(1949,78,1)
grunt> DUMP bad;
(1950,,1)
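SPLIT routes each tuple to every relation whose condition it satisfies (in general the conditions need not partition the input, though the two used above are complementary). A Python sketch for this complementary case:

```python
def split_records(records, predicate):
    """good gets tuples where the predicate holds, bad gets the rest
    (mirrors SPLIT ... INTO good IF <pred>, bad IF NOT <pred>)."""
    good = [t for t in records if predicate(t)]
    bad = [t for t in records if not predicate(t)]
    return good, bad

records = [('1950', 0, 1), ('1950', 22, 1), ('1950', None, 1),
           ('1949', 111, 1), ('1949', 78, 1)]
good, bad = split_records(records, lambda t: t[1] is not None)
# bad == [('1950', None, 1)]
```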
32. White p. 172
Handling missing data
grunt> A = LOAD 'input/pig/corrupt/missing_fields';
grunt> DUMP A;
(2,Tie)
(4,Coat)
(3)
(1,Scarf)
grunt> B = FILTER A BY SIZE(*) > 1;
grunt> DUMP B;
(2,Tie)
(4,Coat)
(1,Scarf)
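SIZE(*) evaluates to the number of fields in the tuple, so the filter drops tuples with only one field. A Python sketch of the same check:

```python
def keep_complete(relation, min_fields=2):
    """Mimic B = FILTER A BY SIZE(*) > 1: keep tuples with 2+ fields."""
    return [t for t in relation if len(t) >= min_fields]

A = [(2, 'Tie'), (4, 'Coat'), (3,), (1, 'Scarf')]
B = keep_complete(A)
# B == [(2, 'Tie'), (4, 'Coat'), (1, 'Scarf')]
```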
33. User Defined Functions (UDFs)
Written in Java
Sub-class EvalFunc or FilterFunc
FilterFunc sub-classes EvalFunc with type T=Boolean
public abstract class EvalFunc<T> {
public abstract T exec(Tuple input) throws IOException;
}
PiggyBank: Public library of pig functions
http://wiki.apache.org/pig/PiggyBank
34. UDF Example
filtered = FILTER records BY temperature != 9999 AND
(quality == 0 OR quality == 1 OR quality == 4
OR quality == 5 OR quality == 9);
grunt> REGISTER my-udfs.jar;
grunt> filtered = FILTER records BY temperature != 9999
AND com.hadoopbook.pig.IsGoodQuality(quality);
-- aliasing
grunt> DEFINE isGood com.hadoopbook.pig.IsGoodQuality();
grunt> filtered_records = FILTER records
>> BY temperature != 9999 AND isGood(quality);
35. More on UDFs (functionals)
Pig translates
com.hadoopbook.pig.IsGoodQuality(x)
to
com.hadoopbook.pig.IsGoodQuality.exec(x);
Looks for a class named "…IsGoodQuality" in the registered JAR
Instantiates an instance as specified by the DEFINE clause
Example below uses the default constructor (no arguments)
• Default behavior if no corresponding DEFINE clause is specified
Can optionally pass other constructor arguments to parameterize
different UDF behaviors at run-time
grunt> DEFINE isGood com.hadoopbook.pig.IsGoodQuality();
36. Case Sensitivity
Operators & commands are not case-sensitive
Aliases & function names are case-sensitive
Why?
Pig resolves function calls by
• Treating the function’s name as a Java classname
• Trying to load a class with that name
Java classnames are case-sensitive
37. Setting the number of reducers
Like Hadoop, defaults to 1
grouped_records = GROUP records BY year PARALLEL 30;
Can use optional PARALLEL clause for reduce operators
grouping & joining (GROUP, COGROUP, JOIN, CROSS)
DISTINCT
ORDER
Number of map tasks auto-determined as in Hadoop
38. Setting and using parameters
pig -param input=in.txt -param output=out.txt foo.pig
OR
# foo.param
input=/user/tom/input/ncdc/micro-tab/sample.txt
output=/tmp/out
pig -param_file foo.param foo.pig
THEN
records = LOAD '$input';
…
STORE x into '$output';
39. Running Pig
Version Matching: Hadoop & Pig
Use Pig 0.3-0.4 with Hadoop 0.18
Use Pig 0.5-0.7 with Hadoop 0.20.x.
• uses the new MapReduce API
Pig is pure client-side
no software to install on cluster
Pig run-time generates Hadoop programs
As with Hadoop, can run Pig local or in distributed mode