EEDC 34330: Execution Environments for Distributed Computing
Master in Computer Architecture, Networks and Systems - CANS

Apache Pig

Homework number: 3
Group number: EEDC-3
Group members:
  Javier Álvarez – javicid@gmail.com
  Francesc Lordan – francesc.lordan@gmail.com
  Roger Rafanell – rogerrafanell@gmail.com

Outline

1.- Introduction

2.- Pig Latin
    2.1.- Data Model
    2.2.- Programming Model

3.- Implementation

4.- Conclusions




Part 1: Introduction

Why Apache Pig?

Today's Internet companies need to process huge data sets:

   – Parallel databases can be prohibitively expensive at this scale.

   – Programmers tend to find declarative languages such as SQL very
     unnatural.

   – Other approaches such as map-reduce are low-level and rigid.

What is Apache Pig?

A platform for analyzing large data sets that:

   – Is based on Pig Latin, a language that lies between declarative (SQL) and
     procedural (C++) programming languages.

   – At the same time, enables the construction of programs with an easily
     parallelizable structure.

Which features does it have?

 Dataflow Language
   – Data processing is expressed step by step.

 Quick Start & Interoperability
   – Pig can work over any kind of input and produce any kind of output.

 Nested Data Model
   – Pig works with complex types like tuples, bags, ...

 User Defined Functions (UDFs)
   – Potentially in any programming language (only Java for the moment).

 Only parallel
   – Pig Latin forces the use of directives that are directly parallelizable.

 Debugging environment
   – Debugging at programming time (see the ILLUSTRATE sketch below).

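The debugging support mentioned in the last bullet is exposed, for example, through the ILLUSTRATE command, which runs a small sample of the input through the script so intermediate results can be inspected while the script is being written. A minimal sketch in the grunt shell, reusing the visits.txt file from the later examples:

    grunt> visits = LOAD 'visits.txt' AS (user, url, time);
    grunt> ILLUSTRATE visits;    -- shows sample tuples flowing through the (trivial) plan
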
Part 2: Pig Latin

Section 2.1: Data Model

Data Model

A very rich data model consisting of 4 simple data types:

 Atom: a simple atomic value such as a string or a number.
        ‘Alice’

 Tuple: a sequence of fields, each of which can be of any data type.
        (‘Alice’, ‘Apple’)
        (‘Alice’, (‘Barça’, ‘football’))

 Bag: a collection of tuples, with possible duplicates.
        { (‘Alice’, ‘Apple’),
          (‘Alice’, (‘Barça’, ‘football’)) }

 Map: a collection of data items, each with an associated key (always an atom).
        ‘Fan of’ → { (‘Apple’),
                     (‘Barça’, ‘football’) }
        ‘Age’    → ’20’

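As an illustration (not on the original slides), these nested types can be declared directly in a LOAD schema. The file name and field names below are hypothetical, and whether such nested values can actually be parsed depends on the load function used:

    -- hypothetical relation: an atom, a bag of tuples, and a map per record
    fans = LOAD 'fans.txt'
           AS (name:chararray,
               likes:bag{ t:tuple(item:chararray) },
               info:map[]);
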
Section 2.2: Programming Model

Programming Model

visits = LOAD 'visits.txt' AS (user, url, time);
pages  = LOAD 'pages.txt' AS (url, rank);

visits: (‘Amy’, ‘cnn.com’, ‘8am’)
        (‘Amy’, ‘nytimes.com’, ‘9am’)
        (‘Bob’, ‘elmundotoday.com’, ’11am’)

pages:  (‘cnn.com’, ‘0.8’)
        (‘nytimes.com’, ‘0.6’)
        (‘elmundotoday.com’, ‘0.2’)

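In the grunt shell, the contents of a relation can be inspected with DUMP (which, like STORE, triggers execution). A sketch, assuming visits.txt is a plain tab-delimited file readable by the default loader:

    grunt> DUMP visits;
    (Amy,cnn.com,8am)
    (Amy,nytimes.com,9am)
    (Bob,elmundotoday.com,11am)
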
Programming Model

visits = LOAD 'visits.txt' AS (user, url, time);
pages  = LOAD 'pages.txt' AS (url, rank);
vp     = JOIN visits BY url, pages BY url;

vp: (‘Amy’, ‘cnn.com’, ‘8am’, ‘cnn.com’, ‘0.8’)
    (‘Amy’, ‘nytimes.com’, ‘9am’, ‘nytimes.com’, ‘0.6’)
    (‘Bob’, ‘elmundotoday.com’, ’11am’, ‘elmundotoday.com’, ‘0.2’)

Programming Model

visits = LOAD 'visits.txt' AS (user, url, time);
pages  = LOAD 'pages.txt' AS (url, rank);
vp     = JOIN visits BY url, pages BY url;
users  = GROUP vp BY user;

users: (‘Amy’, { (‘Amy’, ‘cnn.com’, ‘8am’, ‘cnn.com’, ‘0.8’),
                 (‘Amy’, ‘nytimes.com’, ‘9am’, ‘nytimes.com’, ‘0.6’) })
       (‘Bob’, { (‘Bob’, ‘elmundotoday.com’, ’11am’, ‘elmundotoday.com’, ‘0.2’) })

Programming Model

visits  = LOAD 'visits.txt' AS (user, url, time);
pages   = LOAD 'pages.txt' AS (url, rank);
vp      = JOIN visits BY url, pages BY url;
users   = GROUP vp BY user;
useravg = FOREACH users GENERATE group, AVG(vp.rank) AS avgpr;

useravg: (‘Amy’, ‘0.7’)
         (‘Bob’, ‘0.2’)

Programming Model

visits  = LOAD 'visits.txt' AS (user, url, time);
pages   = LOAD 'pages.txt' AS (url, rank);
vp      = JOIN visits BY url, pages BY url;
users   = GROUP vp BY user;
useravg = FOREACH users GENERATE group, AVG(vp.rank) AS avgpr;
answer  = FILTER useravg BY avgpr > '0.5';

answer: (‘Amy’, ‘0.7’)

Programming Model

Other relational operators:

    – STORE: exports data into a file.
      STORE var1_name INTO 'output.txt';

    – COGROUP: groups together tuples from different datasets.
      COGROUP var1_name BY field_id, var2_name BY field_id;

    – UNION: computes the union of two variables.
    – CROSS: computes the cross product of two variables.
    – ORDER: sorts a data set by one or more fields.
    – DISTINCT: removes duplicated tuples in a dataset.

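Putting the preceding slides together, the whole running example can be written as a single script, sketched below. The rank:double typing, the numeric 0.5 comparison and the 'answer_out' output path are illustrative additions, not part of the original slides:

    -- Load the two input relations (schemas as on the slides);
    -- typing rank as double lets AVG work on numbers rather than strings.
    visits = LOAD 'visits.txt' AS (user, url, time);
    pages  = LOAD 'pages.txt'  AS (url, rank:double);

    -- Join on url, group per user, and average the page rank per user.
    vp      = JOIN visits BY url, pages BY url;
    users   = GROUP vp BY user;
    useravg = FOREACH users GENERATE group, AVG(vp.rank) AS avgpr;

    -- Keep users whose average rank exceeds 0.5; STORE materializes the result.
    answer  = FILTER useravg BY avgpr > 0.5;
    STORE answer INTO 'answer_out';
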
Part 3: Implementation

Implementation: Highlights

 Works on top of the Hadoop ecosystem:
   – The current implementation uses Hadoop as its execution platform.

 On-the-fly compilation:
   – Pig translates Pig Latin commands into Map and Reduce methods.

 Lazy-style language:
   – Pig tries to postpone data materialization (writes to disk) as much as
     possible.

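Both points can be checked from the grunt shell: EXPLAIN prints the logical, physical and map-reduce plans for an alias without launching anything, while STORE (or DUMP) is what actually triggers compilation and execution. A sketch reusing the useravg alias from the programming-model example (the output path is hypothetical):

    grunt> EXPLAIN useravg;                   -- plans are printed, no job is submitted
    grunt> STORE useravg INTO 'useravg_out';  -- only now are map-reduce jobs compiled and run
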
Implementation: Building the logical plan

 Query parsing:
   – The Pig interpreter parses each command, verifying that the input files and
     bags it references are valid.

 On-the-fly compilation:
   – Pig compiles the logical plan for a bag into a physical plan (Map-Reduce
     statements) only when the command can no longer be delayed and must be
     executed.

 Lazy characteristics:
   – No processing is carried out while the logical plan is constructed.
   – Processing is triggered only when the user invokes the STORE command
     on a bag.
   – Lazy-style execution permits in-memory pipelining and other interesting
     optimizations.

Implementation: Map-Reduce plan compilation

   (CO)GROUP:
     – Each command is compiled into a distinct map-reduce job with its own map and reduce functions.
     – Parallelism is achieved since the output of multiple map instances is repartitioned in parallel to
       multiple reduce instances (see the PARALLEL sketch after this list).

   LOAD:
     – Parallelism is obtained since Pig operates over files residing in the Hadoop distributed file system.

   FILTER/FOREACH:
     – Parallelism comes automatically, since several map and reduce instances of a map-reduce job run
       in parallel.

   ORDER (compiled into two map-reduce jobs):
     – First: determines the quantiles of the sort key.
     – Second: partitions the data according to the quantiles and performs a local sort in the reduce
       phase, resulting in a globally sorted file.

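As a sketch of the (CO)GROUP point above, the number of reduce instances used for the repartitioning can be requested explicitly with the PARALLEL clause; the value 10 is illustrative:

    users = GROUP vp BY user PARALLEL 10;   -- ask for 10 reduce tasks for this grouping
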
Part 4: Conclusions

Conclusions

   Advantages:
     – Step-by-step syntax.
     – Flexible: UDFs, not locked to a fixed schema (allows schema changes over time).
     – Exposes a set of widely used functions: FOREACH, FILTER, ORDER, GROUP, …
     – Takes advantage of native Hadoop properties such as parallelism, load balancing and fault tolerance.
     – Debugging environment.
     – Open source (IMPORTANT!!).

   Disadvantages:
     – UDFs can be a source of performance loss (their efficiency is in the user's hands).
     – Overhead when compiling Pig Latin into map-reduce jobs.

   Usage scenarios:
     – Temporal analysis: studying search logs, mainly how the distribution of search queries changes
       over time.
     – Session analysis: web user sessions, i.e. sequences of page views and clicks made by users, are
       analyzed to compute metrics such as:
         – how long is the average user session?
         – how many links does a user click on before leaving a website?
         – others, ...

Q&A




FLATTERED




