Pig vs. MapReduce
By Donald Miner
NYC Pig User Group
August 21, 2013
About Don
@donaldpminer
dminer@clearedgeit.com
I’ll be talking about
What is Java MapReduce good for?
Why is Pig better in some ways?
When should I use which?
When do I use Pig??
Let’s get to the point:

Can I use Pig to do this?
  YES → USE PIG!
  NO → TRY TO USE PIG ANYWAYS!
    Did that work?
      YES → great, stick with Pig
      NO → OK… use Java MapReduce
Why?
• If you can do it with Pig, save yourself the pain
• Developer time is almost always worth more than machine time
• Trying something out in Pig is not risky (time-wise) – you might learn something about your problem
  – Ok, so it turned out to look a bit like a hack, but who cares?
  – Ok, so it ended up being slow, but who cares?
Use the right tool for the job:
Pig, Java MapReduce, HTML…
Get the job done faster and better on your Big Data Problem™
Which is faster,
Pig or Java MapReduce?
Hypothetically, any Pig job could be
rewritten using MapReduce… so Java
MR can only be faster.
The TRUE battle is the
Pig optimizer vs. the developer
VS
Are you better than the Pig optimizer at figuring out how
to string multiple jobs together (and other things)?
Things that are hard to express in Pig
• When something is hard to express succinctly in Pig,
you are going to end up with a slow job
(i.e., building something up out of several primitives)
• Some examples:
– Tricky groupings or joins
– Combining lots of data sets
– Tricky usage of the distributed cache (replicated join)
– Tricky cross products
– Doing crazy stuff in nested FOREACH
• In these cases, Pig is going to spawn off a bunch of
MapReduce jobs that could have been done with fewer
This is a difference in “speed” that doesn’t just come from the cost of abstraction
The Fancy MAPREDUCE keyword!
Pig has a relational operator called MAPREDUCE
that allows you to plug in a Java MapReduce
job!
Use this to only replace the tricky things
… don’t throw out all the stuff Pig is good at
B = MAPREDUCE 'wordcount.jar' STORE A INTO 'inputDir' LOAD 'outputDir'
AS (word:chararray, count: int) `org.myorg.WordCount inputDir outputDir`;
Have the best of both worlds!
To the rescue…
Somewhat related:
Is developer time worthless?
Does speed really matter?
Time spent writing Pig job
+ runtime of Pig job × number of times the job is run
+ time spent maintaining Pig job

vs.

Time spent writing MR job
+ runtime of MR job × number of times the job is run
+ time spent maintaining MR job
When does the scale tip in one direction or the other?
Will the job run many times? Or once?
Are your Java programmers sloppy?
Is the Java MR significantly faster in this case?
Is 14 minutes really that different from 20 minutes?
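The scale on this slide can be written out as back-of-the-envelope arithmetic. A minimal sketch, with all the numbers and rates made up for illustration (the function name and parameters are my own, not from the talk):

```python
def total_cost(dev_hours, maint_hours, runtime_minutes, runs,
               hourly_rate=100.0, cluster_rate=20.0):
    """Rough total cost of a job: people time plus machine time.

    All rates here are hypothetical placeholders.
    """
    people = (dev_hours + maint_hours) * hourly_rate
    machine = (runtime_minutes / 60.0) * runs * cluster_rate
    return people + machine

# Hypothetical numbers: Pig is quicker to write but slower to run.
pig = total_cost(dev_hours=4, maint_hours=2, runtime_minutes=20, runs=30)
mr = total_cost(dev_hours=24, maint_hours=8, runtime_minutes=14, runs=30)
```

With these made-up inputs the Pig job comes out far cheaper; the scale only tips toward Java MR when the job runs a very large number of times or the runtime gap is big.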
Why is development so much faster in Pig?
• Fewer Java-level bugs to work out
… but the bugs you do hit might be harder to figure out
• Fewer lines of code simply means less typing
• Compilation and deployment can significantly slow
down incremental improvements
• Easier to read: The purpose of the analytic is more
straightforward (the context is self-evident)
Avoiding Java!
• Not everyone is a Java expert
… especially all those SQL guys you are repurposing
• The higher level of abstraction makes Pig
easier to learn and read
– I’ve had both software engineers and SQL
developers become productive in Pig in <4 days
Oh, you want to learn Hadoop? Read this first!
But can I really?
not really.
Pig is good at moving data sets between states
… but not so good at manipulating the data itself
examples: advanced string operations, math,
complex aggregates, dates, NLP, model building
You need user-defined functions (UDFs)
I’ve seen too many people try to avoid UDFs
UDFs are powerful:
manipulate bags after a GROUP BY
Plug into external libraries like NLTK or OpenNLP
Loaders for complex custom data types
Exploiting the order of data
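As a sketch of the “manipulate bags after a GROUP BY” point: a Python (Jython) UDF receives a grouped bag as a list of tuples. The `outputSchema` decorator follows the convention in the Pig docs for Python UDFs; the function name, schema, and registration alias here are hypothetical, and the fallback stub is only there so the function can be exercised outside of Pig.

```python
# Sketch of a Python UDF for Pig. Inside a Pig script it would be
# registered and called roughly like (alias is hypothetical):
#   REGISTER 'myudfs.py' USING jython AS myudfs;
#   B = FOREACH grouped GENERATE group, myudfs.top_score(records);
try:
    from pig_util import outputSchema   # provided by Pig's Jython runtime
except ImportError:                     # allow plain-Python testing outside Pig
    def outputSchema(schema):
        def wrap(fn):
            return fn
        return wrap

@outputSchema('top:double')
def top_score(bag):
    """Given a bag of (name, score) tuples from a GROUP BY, return the max score."""
    if not bag:
        return None
    return max(float(t[1]) for t in bag)
```

The point is that the bag arrives as an ordinary list, so anything you can write in Python over a list — not just the built-in aggregates — is fair game after a GROUP BY.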
Ok, so I still want to avoid Java
Do you work by yourself???
Give someone else the task of writing you a UDF!
(they are bite-size little projects)
Current UDF support in 0.11.1:
Java, Python, JavaScript, Ruby, Groovy
These can help you avoid Java if you simply don’t like it (me)
Why did you write a book on MR
Design Patterns if you think you should
do stuff in Pig??
Good question!
• I’ve seen plenty of devs do DUMB stuff
in Pig just because there is a keyword
for it
e.g., silly joins, ordering, using the
PARALLEL keyword wrong
• Knowing how MapReduce works will
result in you writing better Pig
• In particular: how do Pig optimizations
and relational keywords translate into
MapReduce design patterns?
SCENARIO #1:
JUST CHANGE THAT ONE LITTLE LINE
A STORY ABOUT MAINTAINABILITY

IT guy: IT guy here. Your MapReduce job is blowing up the cluster, how do I fix this thing?

Developer: Ah, that’s pretty easy to fix. Just comment out that first line in the mapper function.

IT guy: Ok, how do I do that?

Developer: Oh, that’s easy.
First, check the code out of git.
Then, download, install and configure Eclipse. Don’t forget to set your CLASSPATH!
Ok, now comment out line # 851 in
/home/itguy/java/src/com/hadooprus/hadoop/hadoop/mapreducejobs/jobs/codes/analytic/mymapreducejob/mapper.java
Now, compile the .jar.
And ship the .jar to the cluster, replacing the old one.
Ok, now run the hadoop jar command. Don’t forget the CLASSPATH!
Did that work?

IT guy: No

Developer: . . . Ah, let’s try something else and do that again!
SCENARIO #2:
JUST CHANGE THAT ONE LITTLE LINE
(this time with Pig)

IT guy: IT guy here. Your MapReduce job is blowing up the cluster, how do I fix this thing?

Developer: Ah, that’s pretty easy to fix. Just comment out that line that says “FILTER blah blah” and save the file.

IT guy: Ok, thanks!
Pig: Deployment & Maintainability
• Don’t have to worry about version mismatch (for the most part)
• You can have multiple Pig client libraries installed at once
• Takes compilation out of the build and deployment process
• Can make changes to scripts in place if you have to
• Can iteratively tweak scripts during development and debugging
• Fewer chances for the developer to write Java-level bugs
Some Caveats
• Hadoop Streaming provides some of these
same benefits
• Big problems in both are still going to take
time
• If you are using Java UDFs, you still need to
compile them (which is why I use Python)
Unstructured Data
• Delimited data is pretty easy
• Pig has issues dealing with these out of the box:
– Media: images, videos, audio
– Time series: utilizing order of data, lists
– Ambiguously delimited text
– Log data: rows with different context/meaning/format
You can write custom loaders and tons of UDFs…
but what’s the point?
What about semi-structured data?
• Some forms are more natural than others
– Well-defined JSON/XML schemas are usually OK
• Pig has trouble dealing with:
– Complex operations on unbounded lists of objects
(e.g., bags)
– Very flexible schemas (think BigTable/HBase)
– Poorly designed JSON/XML
Sometimes, it’s just more pain than it’s worth to try
to do in Pig
Pig vs. Hive vs. MapReduce
• Same arguments apply for Hive vs. Java MR
• Using Pig or Hive doesn’t make that big of a difference
… but pick one because UDFs/Storage functions aren’t easily interchangeable
• I think you’ll like Pig better than Hive
(just like everyone likes emacs more than vi)
WRAP UP: AN ANALOGY (#1)
Pig is a scripting language,
Hadoop’s MapReduce is a compiled language.
PYTHON
C
::
WRAP UP: AN ANALOGY (#2)
Pig is a higher level of abstraction,
Hadoop’s MapReduce is a lower level of abstraction.
SQL
C
::
A lot of the same arguments apply!
• Compilation
– Don’t have to compile Pig
• Efficiency of code
– Pig will be a bit less efficient (but…)
• Lines of code and verbosity
– Pig will have fewer lines of code
• Optimization
– Pig has more opportunities to do automatic optimization of queries
• Code portability
– The same Pig script will work across versions (for the most part)
• Code readability
– It should be easier to understand a Pig script
• Underlying bugs
– Underlying bugs in Pig can cause frustrating problems (thanks be to God for open source)
• Amount of control and space of possibilities
– There are fewer things you CAN do in Pig
Editor's Notes
1. Donald's talk will cover how to use native MapReduce in conjunction with Pig, including a detailed discussion of when users might be best served to use one or the other.