Apache Pig
on Amazon AWS
Swine Not?
What is Apache Pig?
Pig is an execution framework that interprets
scripts written in a language called Pig Latin
and then runs them on a Hadoop cluster.
(Disturbing logo -->)
Pig is a tool that...
● creates complex jobs that efficiently process
large volumes of data
● supports many relational features, making it
easy to join, group, and aggregate data
● performs ETL tasks quickly, on many
servers simultaneously
What is Pig Latin?
It is a high-level data transformation language
that:
● allows you to concentrate on the data
transformations you require
Rather than:
● forcing you to be concerned with individual
map and reduce functions
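To make the contrast concrete, here is a rough sketch in plain Python (not Hadoop code) of the map and reduce functions you would otherwise write by hand just to count referrers. The function names and sample records are made up for illustration. In Pig Latin the same idea is a GROUP followed by a COUNT.

```python
from collections import defaultdict


def map_referrer(record):
    """Map step: emit one (key, 1) pair per log record."""
    yield (record["referrer"], 1)


def shuffle(pairs):
    """Group emitted values by key, the job the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_count(key, values):
    """Reduce step: collapse all values for one key into a total."""
    return (key, sum(values))


def count_referrers(records):
    """Run map, shuffle, and reduce in sequence on an in-memory list."""
    pairs = (pair for record in records for pair in map_referrer(record))
    return dict(reduce_count(key, values) for key, values in shuffle(pairs).items())


# Hypothetical sample records, not taken from the real access log.
sample = [
    {"referrer": "-"},
    {"referrer": "http://example.org/"},
    {"referrer": "-"},
]
counts = count_referrers(sample)
```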
Walkthrough - Create a Job Flow
* Basically following the Amazon Pig tutorial at: http://aws.amazon.com/articles/2729
And now we wait...
SSH into the master instance
$ ssh -i ~/keys/crocs.pem -l hadoop ec2-54-215-107-197.us-west-1.compute.amazonaws.com
Type "pig" to enter the grunt shell
$ pig
grunt> _
It's a freakin' shell!
grunt> pwd
hdfs://10.174.115.214:9000/
You can enter the HDFS file system:
grunt> cd hdfs:///
grunt> ls
hdfs://10.174.115.214:9000/mnt <dir>
Even enter an S3 bucket:
grunt> cd s3://elasticmapreduce/samples/pig-apache/input/
grunt> ls
s3://elasticmapreduce/samples/pig-apache/input/access_log_1<r 1> 8754118
s3://elasticmapreduce/samples/pig-apache/input/access_log_2<r 1> 8902171
Load Piggybank - an open source library of
user-contributed functions
grunt> register file:/home/hadoop/lib/pig/piggybank.jar
DEFINE the EXTRACT alias from piggybank
grunt> DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT;
LOAD
Use TextLoader (a built-in Pig function) to load
each line of the source file:
grunt> RAW_LOGS = LOAD 's3://elasticmapreduce/samples/pig-apache/input/access_log_1' USING TextLoader as (line:chararray);
ILLUSTRATE
Shows a step-by-step process on how Pig would
transform a small sample of data
grunt> illustrate RAW_LOGS;
Connecting to hadoop file system at: hdfs://10.174.115.214:9000
Connecting to map-reduce job tracker at: 10.174.115.214:9001
...
---------------------------------------------------------------
| RAW_LOGS | line:chararray |
---------------------------------------------------------------
| | 65.55.106.160 - - [21/Jul/2009:02:29:56 -0700]
"GET /gallery/main.php?g2_itemId=32050 HTTP/1.1" 200 7119 "-"
"msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
---------------------------------------------------------------
Now let's:
● split each line into fields
● store everything in a bag
grunt> LOGS_BASE = FOREACH RAW_LOGS GENERATE
FLATTEN(
EXTRACT(line, '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"')
)
as (
remoteAddr: chararray,
remoteLogname: chararray,
user: chararray,
time: chararray,
request: chararray,
status: int,
bytes_string: chararray,
referrer: chararray,
browser: chararray
);
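The pattern handed to EXTRACT is a regular expression for Apache's combined log format. As a sanity check, here is a minimal Python sketch of the same pattern, written with the usual regex escapes (\S, \w, \s, \d and escaped square brackets), applied to the sample line from the ILLUSTRATE output above. The names LOG_PATTERN, FIELD_NAMES and extract are made up for this sketch, and returning None on a non-matching line is this sketch's choice, not a statement about how piggybank behaves.

```python
import re

# Same nine capture groups as the Pig statement, as a Python raw string.
LOG_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+-]\d{4})\] "(.+?)" (\S+) (\S+) "([^"]*)" "([^"]*)"'
)

# Field names copied from the "as (...)" clause of the Pig statement.
FIELD_NAMES = (
    "remoteAddr", "remoteLogname", "user", "time", "request",
    "status", "bytes_string", "referrer", "browser",
)


def extract(line):
    """Split one access-log line into named fields, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = dict(zip(FIELD_NAMES, match.groups()))
    fields["status"] = int(fields["status"])  # the Pig schema declares status as int
    return fields


# The sample line shown in the ILLUSTRATE RAW_LOGS output.
sample_line = (
    '65.55.106.160 - - [21/Jul/2009:02:29:56 -0700] '
    '"GET /gallery/main.php?g2_itemId=32050 HTTP/1.1" 200 7119 "-" '
    '"msnbot/2.0b (+http://search.msn.com/msnbot.htm)"'
)
parsed = extract(sample_line)
```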
ILLUSTRATE an example of our work
grunt> illustrate LOGS_BASE;
...
| LOGS_BASE |
| remoteAddr:chararray | 74.125.74.193
| remoteLogname:chararray | -
| user:chararray | -
| time:chararray | 20/Jul/2009:20:30:55 -0700
| request:chararray | GET /gwidgets/alexa.xml HTTP/1.1
| status:int | 200
| bytes_string:chararray | 2969
| referrer:chararray | -
| browser:chararray | Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
Create a bag containing tuples with just the
referrer element (limit 10 items):
grunt> REFERRER_ONLY = FOREACH LOGS_BASE GENERATE referrer;
grunt> TEMP = LIMIT REFERRER_ONLY 10;
Output the contents of the bag:
grunt> DUMP TEMP;
Pig features used in the script: LIMIT
File concatenation threshold: 100 optimistic? false
MR plan size before optimization: 1
MR plan size after optimization: 1
Pig script settings are added to the job
creating jar file Job5394669249002614476.jar
Setting up single store job
1 map-reduce job(s) waiting for submission.
...
More log output before we get our results (cleaned
up here)
...
Input(s):
Successfully read 39344 records (126 bytes) from: "s3://elasticmapreduce/samples/pig-apache/input/access_log_1"
Output(s):
Successfully stored 10 records (126 bytes) in: "hdfs://10.174.115.214:9000/tmp/temp948493830/tmp76754790"
Counters:
Total records written : 10
...
Voila! Our exciting results:
(-)
(-)
(-)
(-)
(-)
(-)
(http://example.org/)
(http://example.org/)
(-)
(-)
First 10 referrers (the dashes represent no
referrer)
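Note that nothing ran when REFERRER_ONLY and TEMP were defined. Pig only builds a plan from those statements and launches a job once DUMP asks for output. A loose Python analogy using generators is sketched below. The helper names and sample rows are invented for illustration, and this says nothing about how Pig schedules the real job.

```python
from itertools import islice

stats = {"reads": 0}  # counts how many source rows have actually been pulled


def load(rows):
    """Stand-in for LOAD: yields rows lazily and records each read."""
    for row in rows:
        stats["reads"] += 1
        yield row


def generate_referrer(rows):
    """Stand-in for FOREACH ... GENERATE referrer: one single-field tuple per row."""
    for row in rows:
        yield (row["referrer"],)


def limit(rows, n):
    """Stand-in for LIMIT: at most n rows."""
    return islice(rows, n)


def dump(rows):
    """Stand-in for DUMP: the only step that forces evaluation."""
    return list(rows)


# Hypothetical rows, shaped like LOGS_BASE but with only the referrer field.
source = [
    {"referrer": "-"},
    {"referrer": "http://example.org/"},
    {"referrer": "-"},
]

temp = limit(generate_referrer(load(source)), 2)  # defines the pipeline only
reads_before_dump = stats["reads"]                # still zero at this point
result = dump(temp)                               # now the rows are pulled
```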
Now let's keep only the referrals from bing.com*
grunt> FILTERED = FILTER REFERRER_ONLY BY referrer matches '.*bing.*';
grunt> TEMP = LIMIT FILTERED 9;
grunt> DUMP TEMP;
(http://www.bing.com/search?q=login)
(http://www.bing.com/search?q=value)
(http://www.bing.com/search?q=value)
(http://www.bing.com/search?q=value)
(http://www.bing.com/search?q=value)
(http://www.bing.com/search?q=views)
(http://www.bing.com/search?q=views)
(http://www.bing.com/search?q=search)
(http://www.bing.com/search?q=philmont)
* We all use Bing, am I right?
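Why the leading and trailing `.*` in the pattern? Pig's matches operator uses Java regular expressions and, as far as I know, has to match the whole string rather than a substring. Here is a small Python sketch of that behaviour using re.fullmatch. The matches helper and the sample list are assumptions for illustration, with the URLs taken from the outputs above.

```python
import re


def matches(value, pattern):
    """Whole-string regex match, mirroring the assumed behaviour of Pig's matches."""
    return re.fullmatch(pattern, value) is not None


# A few referrer values as they appear in the DUMP outputs.
referrers = [
    "-",
    "http://example.org/",
    "http://www.bing.com/search?q=login",
    "http://www.bing.com/search?q=value",
]

# Equivalent of: FILTER REFERRER_ONLY BY referrer matches '.*bing.*'
filtered = [r for r in referrers if matches(r, r".*bing.*")]

# Without the surrounding .* the pattern would have to equal the whole URL.
bare_pattern_hits = [r for r in referrers if matches(r, r"bing")]
```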
Don't forget to terminate your Job Flow
Amazon will charge you even if it's idle!
