A basic introduction to Apache Pig, focused on understanding what it is and on quickly getting started with it through Amazon's Elastic MapReduce service. The second part details my experience following along with Amazon's Pig tutorial at: http://aws.amazon.com/articles/2729
2. What is Apache Pig?
Pig is an execution framework that interprets scripts written in a language called Pig Latin and then runs them on a Hadoop cluster.
3. Pig is a tool that...
● creates complex jobs that efficiently process large volumes of data
● supports many relational features, making it easy to join, group, and aggregate data
● performs ETL tasks quickly, on many servers simultaneously
4. What is Pig Latin?
It is a high-level data transformation language that:
● allows you to concentrate on the data transformations you require
Rather than:
● forcing you to worry about individual map and reduce functions
5. Walkthrough - Create a Job Flow
* Basically following the Amazon Pig tutorial at: http://aws.amazon.com/articles/2729
15. Type "pig" to enter the grunt shell
$ pig
grunt> _
It's a freakin' shell!
grunt> pwd
hdfs://10.174.115.214:9000/
16. You can enter the HDFS file system:
grunt> cd hdfs:///
grunt> ls
hdfs://10.174.115.214:9000/mnt <dir>
Even enter an S3 bucket:
grunt> cd s3://elasticmapreduce/samples/pig-apache/input/
grunt> ls
s3://elasticmapreduce/samples/pig-apache/input/access_log_1<r 1> 8754118
s3://elasticmapreduce/samples/pig-apache/input/access_log_2<r 1> 8902171
17. Load Piggybank - an open-source library of user-contributed functions
grunt> register file:/home/hadoop/lib/pig/piggybank.jar
DEFINE the EXTRACT alias from Piggybank:
grunt> DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT;
18. LOAD
Use TextLoader (a built-in Pig function) to load each line of the source file:
grunt> RAW_LOGS = LOAD 's3://elasticmapreduce/samples/pig-apache/input/access_log_1' USING TextLoader as (line:chararray);
19. ILLUSTRATE
Shows, step by step, how Pig would transform a small sample of the data:
grunt> illustrate RAW_LOGS;
Connecting to hadoop file system at: hdfs://10.174.115.214:9000
Connecting to map-reduce job tracker at: 10.174.115.214:9001
...
---------------------------------------------------------------
| RAW_LOGS | line:chararray |
---------------------------------------------------------------
| | 65.55.106.160 - - [21/Jul/2009:02:29:56 -0700]
"GET /gallery/main.php?g2_itemId=32050 HTTP/1.1" 200 7119 "-"
"msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
---------------------------------------------------------------
20. Now let's:
● split each line into fields
● store everything in a bag
grunt> LOGS_BASE = FOREACH RAW_LOGS GENERATE
FLATTEN(
EXTRACT(line, '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"')
)
as (
remoteAddr: chararray,
remoteLogname: chararray,
user: chararray,
time: chararray,
request: chararray,
status: int,
bytes_string: chararray,
referrer: chararray,
browser: chararray
);
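As a sanity check, the pattern above can be exercised outside Pig. Here is a minimal Python sketch (my own, not part of the tutorial) that applies the same regular expression to the sample line from the ILLUSTRATE output; note that in the Pig Latin string literal the backslashes are doubled (\\S), while a Python raw string writes them singly (\S).

```python
import re

# The same pattern the Pig script passes to EXTRACT, as a Python raw string.
PATTERN = (r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+-]\d{4})\] '
           r'"(.+?)" (\S+) (\S+) "([^"]*)" "([^"]*)"')

# Sample line taken from the ILLUSTRATE output above.
line = ('65.55.106.160 - - [21/Jul/2009:02:29:56 -0700] '
        '"GET /gallery/main.php?g2_itemId=32050 HTTP/1.1" 200 7119 "-" '
        '"msnbot/2.0b (+http://search.msn.com/msnbot.htm)"')

m = re.match(PATTERN, line)
fields = dict(zip(
    ["remoteAddr", "remoteLogname", "user", "time", "request",
     "status", "bytes_string", "referrer", "browser"],
    m.groups()))

# Every captured group is a string here; Pig casts status to int
# via the schema in the `as (...)` clause.
print(fields["remoteAddr"], fields["status"], fields["referrer"])
```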
22. Create a bag containing tuples with just the referrer element (limit 10 items):
grunt> REFERRER_ONLY = FOREACH LOGS_BASE GENERATE referrer;
grunt> TEMP = LIMIT REFERRER_ONLY 10;
Output the contents of the bag:
grunt> DUMP TEMP;
Pig features used in the script: LIMIT
File concatenation threshold: 100 optimistic? false
MR plan size before optimization: 1
MR plan size after optimization: 1
Pig script settings are added to the job
creating jar file Job5394669249002614476.jar
Setting up single store job
1 map-reduce job(s) waiting for submission.
...
23. More log output before we get our results (cleaned up here)
...
Input(s):
Successfully read 39344 records (126 bytes) from: "s3://elasticmapreduce/samples/pig-apache/input/access_log_1"
Output(s):
Successfully stored 10 records (126 bytes) in: "hdfs://10.174.115.214:9000/tmp/temp948493830/tmp76754790"
Counters:
Total records written : 10
...
24. Voila! Our exciting results:
(-)
(-)
(-)
(-)
(-)
(-)
(http://example.org/)
(http://example.org/)
(-)
(-)
First 10 referrers (the dashes represent no referrer)
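In relational terms, FOREACH … GENERATE is a projection and LIMIT truncates the bag. A rough Python analogue of those two statements (a sketch of mine, with a hypothetical handful of parsed records standing in for the LOGS_BASE bag):

```python
import itertools

# Hypothetical records standing in for the LOGS_BASE bag.
logs_base = [
    {"referrer": "-", "status": 200},
    {"referrer": "-", "status": 200},
    {"referrer": "http://example.org/", "status": 200},
]

# REFERRER_ONLY = FOREACH LOGS_BASE GENERATE referrer;
referrer_only = (rec["referrer"] for rec in logs_base)

# TEMP = LIMIT REFERRER_ONLY 10;  DUMP TEMP;
temp = list(itertools.islice(referrer_only, 10))
print(temp)  # -> ['-', '-', 'http://example.org/']
```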
25. Now let's filter for referrals from bing.com*
grunt> FILTERED = FILTER REFERRER_ONLY BY referrer matches '.*bing.*';
grunt> TEMP = LIMIT FILTERED 9;
grunt> DUMP TEMP;
(http://www.bing.com/search?q=login)
(http://www.bing.com/search?q=value)
(http://www.bing.com/search?q=value)
(http://www.bing.com/search?q=value)
(http://www.bing.com/search?q=value)
(http://www.bing.com/search?q=views)
(http://www.bing.com/search?q=views)
(http://www.bing.com/search?q=search)
(http://www.bing.com/search?q=philmont)
* We all use Bing, am I right?
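Worth noting: Pig's matches operator requires the regular expression to match the entire string (Java String.matches semantics), which is why the pattern needs the leading and trailing .* around bing. A small Python sketch of the same filter, using re.fullmatch and a few made-up referrer values:

```python
import re

# Made-up referrer values for illustration.
referrers = [
    "-",
    "http://www.bing.com/search?q=login",
    "http://example.org/",
]

# Pig's `matches` must match the WHOLE string, hence '.*bing.*'
# rather than a bare 'bing'; re.fullmatch mirrors that behavior.
filtered = [r for r in referrers if re.fullmatch(r'.*bing.*', r)]
print(filtered)  # -> ['http://www.bing.com/search?q=login']
```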
26. Don't forget to terminate your Job Flow
Amazon will charge you even if it's idle!