A basic introduction to Apache Pig, focused on understanding what it is and on quickly getting started with it through Amazon's Elastic MapReduce service. The second part details my experience following along with Amazon's Pig tutorial at: http://aws.amazon.com/articles/2729
2. What is Apache Pig?
Pig is an execution framework that interprets scripts written in a language called Pig Latin and then runs them on a Hadoop cluster.
3. Pig is a tool that...
● creates complex jobs that efficiently process large volumes of data
● supports many relational features, making it easy to join, group, and aggregate data
● performs ETL tasks quickly, on many servers simultaneously
4. What is Pig Latin?
It is a high-level data transformation language that:
● allows you to concentrate on the data transformations you require
Rather than:
● forcing you to worry about individual map and reduce functions
5. Walkthrough - Create a Job Flow
* Basically following the Amazon Pig tutorial at: http://aws.amazon.com/articles/2729
15. Type "pig" to enter the grunt shell
$ pig
grunt> _
It's a freakin' shell!
grunt> pwd
hdfs://10.174.115.214:9000/
16. You can enter the HDFS file system:
grunt> cd hdfs:///
grunt> ls
hdfs://10.174.115.214:9000/mnt <dir>
Even enter an S3 bucket:
grunt> cd s3://elasticmapreduce/samples/pig-apache/input/
grunt> ls
s3://elasticmapreduce/samples/pig-apache/input/access_log_1<r 1> 8754118
s3://elasticmapreduce/samples/pig-apache/input/access_log_2<r 1> 8902171
17. Load Piggybank - an open-source library of user-contributed functions
grunt> register file:/home/hadoop/lib/pig/piggybank.jar
DEFINE the EXTRACT alias from Piggybank:
grunt> DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT;
18. LOAD
Use TextLoader (a built-in Pig function) to load each line of the source file:
grunt> RAW_LOGS = LOAD 's3://elasticmapreduce/samples/pig-apache/input/access_log_1' USING TextLoader as (line:chararray);
19. ILLUSTRATE
Shows, step by step, how Pig would transform a small sample of the data:
grunt> illustrate RAW_LOGS;
Connecting to hadoop file system at: hdfs://10.174.115.214:9000
Connecting to map-reduce job tracker at: 10.174.115.214:9001
...
---------------------------------------------------------------
| RAW_LOGS | line:chararray |
---------------------------------------------------------------
| | 65.55.106.160 - - [21/Jul/2009:02:29:56 -0700]
"GET /gallery/main.php?g2_itemId=32050 HTTP/1.1" 200 7119 "-"
"msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
---------------------------------------------------------------
20. Now let's:
● split each line into fields
● store everything in a bag
grunt> LOGS_BASE = FOREACH RAW_LOGS GENERATE
FLATTEN(
EXTRACT(line, '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"')
)
as (
remoteAddr: chararray,
remoteLogname: chararray,
user: chararray,
time: chararray,
request: chararray,
status: int,
bytes_string: chararray,
referrer: chararray,
browser: chararray
);
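As a sanity check, the pattern above can be exercised outside Pig. Here is a minimal Python sketch (my own, not part of the tutorial) that applies the same regular expression to the sample line from the ILLUSTRATE output; note that in the Pig Latin string literal the backslashes are doubled (\\S), while a Python raw string writes them singly (\S).

```python
import re

# The same pattern the Pig script passes to EXTRACT, as a Python raw string.
PATTERN = (r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+-]\d{4})\] '
           r'"(.+?)" (\S+) (\S+) "([^"]*)" "([^"]*)"')

# Sample line taken from the ILLUSTRATE output above.
line = ('65.55.106.160 - - [21/Jul/2009:02:29:56 -0700] '
        '"GET /gallery/main.php?g2_itemId=32050 HTTP/1.1" 200 7119 "-" '
        '"msnbot/2.0b (+http://search.msn.com/msnbot.htm)"')

m = re.match(PATTERN, line)
fields = dict(zip(
    ["remoteAddr", "remoteLogname", "user", "time", "request",
     "status", "bytes_string", "referrer", "browser"],
    m.groups()))

# Every captured group is a string here; Pig casts status to int
# via the schema in the `as (...)` clause.
print(fields["remoteAddr"], fields["status"], fields["referrer"])
```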
22. Create a bag containing tuples with just the referrer element (limit 10 items):
grunt> REFERRER_ONLY = FOREACH LOGS_BASE GENERATE referrer;
grunt> TEMP = LIMIT REFERRER_ONLY 10;
Output the contents of the bag:
grunt> DUMP TEMP;
Pig features used in the script: LIMIT
File concatenation threshold: 100 optimistic? false
MR plan size before optimization: 1
MR plan size after optimization: 1
Pig script settings are added to the job
creating jar file Job5394669249002614476.jar
Setting up single store job
1 map-reduce job(s) waiting for submission.
...
23. More log output before we get our results (cleaned up here)
...
Input(s):
Successfully read 39344 records (126 bytes) from: "s3://elasticmapreduce/samples/pig-apache/input/access_log_1"
Output(s):
Successfully stored 10 records (126 bytes) in: "hdfs://10.174.115.214:9000/tmp/temp948493830/tmp76754790"
Counters:
Total records written : 10
...
24. Voila! Our exciting results:
(-)
(-)
(-)
(-)
(-)
(-)
(http://example.org/)
(http://example.org/)
(-)
(-)
First 10 referrers (the dashes represent no referrer)
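In relational terms, FOREACH … GENERATE is a projection and LIMIT truncates the bag. A rough Python analogue of those two statements (a sketch of mine, with a hypothetical handful of parsed records standing in for the LOGS_BASE bag):

```python
import itertools

# Hypothetical records standing in for the LOGS_BASE bag.
logs_base = [
    {"referrer": "-", "status": 200},
    {"referrer": "-", "status": 200},
    {"referrer": "http://example.org/", "status": 200},
]

# REFERRER_ONLY = FOREACH LOGS_BASE GENERATE referrer;
referrer_only = (rec["referrer"] for rec in logs_base)

# TEMP = LIMIT REFERRER_ONLY 10;  DUMP TEMP;
temp = list(itertools.islice(referrer_only, 10))
print(temp)  # -> ['-', '-', 'http://example.org/']
```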
25. Now let's filter for referrals from bing.com*
grunt> FILTERED = FILTER REFERRER_ONLY BY referrer matches '.*bing.*';
grunt> TEMP = LIMIT FILTERED 9;
grunt> DUMP TEMP;
(http://www.bing.com/search?q=login)
(http://www.bing.com/search?q=value)
(http://www.bing.com/search?q=value)
(http://www.bing.com/search?q=value)
(http://www.bing.com/search?q=value)
(http://www.bing.com/search?q=views)
(http://www.bing.com/search?q=views)
(http://www.bing.com/search?q=search)
(http://www.bing.com/search?q=philmont)
* We all use Bing, am I right?
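Worth noting: Pig's matches operator requires the regular expression to match the entire string (Java String.matches semantics), which is why the pattern needs the leading and trailing .* around bing. A small Python sketch of the same filter, using re.fullmatch and a few made-up referrer values:

```python
import re

# Made-up referrer values for illustration.
referrers = [
    "-",
    "http://www.bing.com/search?q=login",
    "http://example.org/",
]

# Pig's `matches` must match the WHOLE string, hence '.*bing.*'
# rather than a bare 'bing'; re.fullmatch mirrors that behavior.
filtered = [r for r in referrers if re.fullmatch(r'.*bing.*', r)]
print(filtered)  # -> ['http://www.bing.com/search?q=login']
```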
26. Don't forget to terminate your Job Flow
Amazon will charge you even if it's idle!