SlideShare uma empresa Scribd logo
1 de 31
Embedding Pig in scripting languages What happens when you feed a Pig to a Python? Julien Le Dem – Principal Engineer - Content Platforms at Yahoo! Pig committer julien@ledem.net @julienledem
Disclaimer No animals were hurt in the making of this presentation I’m cute I’m hungry Picture credits: OZinOH: http://www.flickr.com/photos/75905404@N00/5421543577/ Stephen & Claire Farnsworth: http://www.flickr.com/photos/the_farnsworths/4720850597/
What for ? Simplifying the implementation of  iterative algorithms: Loop and exit criteria Simpler User Defined Functions Easier parameter passing
Before The implementation  has the following  artifacts:
Pig Script(s) warshall_n_minus_1 = LOAD '$workDir/warshall_0' 	USING BinStorage AS (id1:chararray, id2:chararray, status:chararray); to_join_n_minus_1 = LOAD '$workDir/to_join_0' USING BinStorage AS (id1:chararray, id2:chararray, status:chararray); joined = COGROUP to_join_n_minus_1 BY id2, warshall_n_minus_1 BY id1; followed = FOREACH joined GENERATE FLATTEN(followRel(to_join_n_minus_1,warshall_n_minus_1)); followed_byid = GROUP followed BY id1; warshall_n = FOREACH followed_byid GENERATE group, FLATTEN(coalesceLine(followed.(id2, status))); to_join_n = FILTER warshall_n BY $2 == 'notfollowed' AND $0!=$1; STORE warshall_n INTO '$workDir/warshall_1' USING BinStorage; STORE to_join_n INTO '$workDir/to_join_1 USING BinStorage;
External loop #!/usr/bin/python  import os num_iter=int(10) for i in range(num_iter): os.system('java -jar ./lib/pig.jar -x local plsi_singleiteration.pig') os.rename('output_results/p_z_u','output_results/p_z_u.'+str(i)) os.system('cpoutput_results/p_z_u.nxtoutput_results/p_z_u'); 	os.rename('output_results/p_z_u.nxt','output_results/p_z_u.'+str(i+1)) os.rename('output_results/p_s_z','output_results/p_s_z.'+str(i)) os.system('cpoutput_results/p_s_z.nxtoutput_results/p_s_z'); 	os.rename('output_results/p_s_z.nxt','output_results/p_s_z.'+str(i+1))
Java UDF(s)
Development Iteration
So… What happens? Credits:  Mango Atchar: http://www.flickr.com/photos/mangoatchar/362439607/
After One script (to rule them all): 	- main program 	- UDFs as script functions 	- embedded Pig statements All the algorithm in one place
References It uses JVM implementations of scripting languages (Jython, Rhino). This is a joint effort, see the following Jiras:  	in Pig 0.8: PIG-928 Python UDFs    in Pig0.9: PIG-1479 embedding, PIG-1794 JavaScript support Doc: http://pig.apache.org/docs/
Examples 1) Simple example: fixed loop iteration 2) Adding convergence criteria and accessing intermediary output 3)More advanced example with UDFs
1) A Simple Example PageRank: A system of linear equations (as many as there are pages on the web, yeah, a lot):  It can be approximated iteratively: compute the new page rank based on the page ranks of the previous iteration. Start with some value. Ref: http://en.wikipedia.org/wiki/PageRank
Or more visually Each page sends a fraction of its PageRank to the pages linked to. Inversely proportional to the number of links.
Let’s zoom in pig script: PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) Iterate 10 times Pass parameters as a dictionary Pass parameters as a dictionary Just run P, that was declared above The output becomes the new input
Practical result Applied to the English Wikipedia link graph: http://wiki.dbpedia.org/Downloads36#owikipediapagelinks It turns out that the max PageRank is awarded to: http://en.wikipedia.org/wiki/United_States Thanks @ogrisel for the suggestion
2) Same example, one step further Now let’s say that we define a threshold as a convergence criteria instead of a fixed iteration count.
Same thing as previously Computation of the maximum difference with the previous iteration … continued next slide
The main program The parameter-less bind() uses the variables in the current scope We can easily read the output of Pig from the grid Stop if we reach a threshold
3) Now somethingmore complex Compute a transitive closure:  find the connected components of a graph.  - Useful if you’re doing de-duplication  - Requires iterations and UDFs
Or more visually Turn this:	                Into this:
Convergence Converges in :  log2(max(Diameter of a component)) Diameter = “the longest shortest path” Bottom line: the number of iterations will be reasonable.
UDFs are in the same script as the main program Zoom next slide Page 1/3
Zoom on UDFs The output schema of the UDF is defined using a decorator The native structures of the language can be used directly
Zoom next slides Zoom next slides Page 2/3
Zoom on the Pig script … UDFs are directly available, no extra declaration needed
Zoom on the loop Iterate a maximum of 10 times (2^10 maximum diameter of a component) Convergence criteria: all links have been followed
Final part: formatting Turning the graph representation into a component list representation This is necessary when we have UDFs, so that the script can be evaluated again on the slaves without running main() Page 3/3
One more thing … I presented Python but JavaScript is available as well (experimental). The framework is extensible. Any JVM implementation of a language could be integrated (contribute!). The examples can be found at: https://github.com/julienledem/Pig-scripting-examples
Questions? ? ?

Mais conteúdo relacionado

Mais procurados

Python in big data world
Python in big data worldPython in big data world
Python in big data worldRohit
 
Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2Wes Floyd
 
Massively Parallel Processing with Procedural Python (PyData London 2014)
Massively Parallel Processing with Procedural Python (PyData London 2014)Massively Parallel Processing with Procedural Python (PyData London 2014)
Massively Parallel Processing with Procedural Python (PyData London 2014)Ian Huston
 
Scalable Hadoop with succinct Python: the best of both worlds
Scalable Hadoop with succinct Python: the best of both worldsScalable Hadoop with succinct Python: the best of both worlds
Scalable Hadoop with succinct Python: the best of both worldsDataWorks Summit
 
Introduction of the Design of A High-level Language over MapReduce -- The Pig...
Introduction of the Design of A High-level Language over MapReduce -- The Pig...Introduction of the Design of A High-level Language over MapReduce -- The Pig...
Introduction of the Design of A High-level Language over MapReduce -- The Pig...Yu Liu
 
IPython Notebook as a Unified Data Science Interface for Hadoop
IPython Notebook as a Unified Data Science Interface for HadoopIPython Notebook as a Unified Data Science Interface for Hadoop
IPython Notebook as a Unified Data Science Interface for HadoopDataWorks Summit
 
Data Science Amsterdam - Massively Parallel Processing with Procedural Languages
Data Science Amsterdam - Massively Parallel Processing with Procedural LanguagesData Science Amsterdam - Massively Parallel Processing with Procedural Languages
Data Science Amsterdam - Massively Parallel Processing with Procedural LanguagesIan Huston
 
Pig on Tez - Low Latency ETL with Big Data
Pig on Tez - Low Latency ETL with Big DataPig on Tez - Low Latency ETL with Big Data
Pig on Tez - Low Latency ETL with Big DataDataWorks Summit
 
Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...
Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...
Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...PyData
 
Apache Pig: Making data transformation easy
Apache Pig: Making data transformation easyApache Pig: Making data transformation easy
Apache Pig: Making data transformation easyVictor Sanchez Anguix
 
Pig Tutorial | Twitter Case Study | Apache Pig Script and Commands | Edureka
Pig Tutorial | Twitter Case Study | Apache Pig Script and Commands | EdurekaPig Tutorial | Twitter Case Study | Apache Pig Script and Commands | Edureka
Pig Tutorial | Twitter Case Study | Apache Pig Script and Commands | EdurekaEdureka!
 
Apache Pig for Data Scientists
Apache Pig for Data ScientistsApache Pig for Data Scientists
Apache Pig for Data ScientistsDataWorks Summit
 
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)Jeffrey Breen
 
Hive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReadingHive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReadingMitsuharu Hamba
 
Hadoop interview question
Hadoop interview questionHadoop interview question
Hadoop interview questionpappupassindia
 
Word Embedding for Nearest Words
Word Embedding for Nearest WordsWord Embedding for Nearest Words
Word Embedding for Nearest WordsEkaKurniawan40
 
Hadoop, Hbase and Hive- Bay area Hadoop User Group
Hadoop, Hbase and Hive- Bay area Hadoop User GroupHadoop, Hbase and Hive- Bay area Hadoop User Group
Hadoop, Hbase and Hive- Bay area Hadoop User GroupHadoop User Group
 
Hadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questionsHadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questionsAsad Masood Qazi
 
GPU Accelerated Machine Learning
GPU Accelerated Machine LearningGPU Accelerated Machine Learning
GPU Accelerated Machine LearningSri Ambati
 
Hadoop for Java Professionals
Hadoop for Java ProfessionalsHadoop for Java Professionals
Hadoop for Java ProfessionalsEdureka!
 

Mais procurados (20)

Python in big data world
Python in big data worldPython in big data world
Python in big data world
 
Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2
 
Massively Parallel Processing with Procedural Python (PyData London 2014)
Massively Parallel Processing with Procedural Python (PyData London 2014)Massively Parallel Processing with Procedural Python (PyData London 2014)
Massively Parallel Processing with Procedural Python (PyData London 2014)
 
Scalable Hadoop with succinct Python: the best of both worlds
Scalable Hadoop with succinct Python: the best of both worldsScalable Hadoop with succinct Python: the best of both worlds
Scalable Hadoop with succinct Python: the best of both worlds
 
Introduction of the Design of A High-level Language over MapReduce -- The Pig...
Introduction of the Design of A High-level Language over MapReduce -- The Pig...Introduction of the Design of A High-level Language over MapReduce -- The Pig...
Introduction of the Design of A High-level Language over MapReduce -- The Pig...
 
IPython Notebook as a Unified Data Science Interface for Hadoop
IPython Notebook as a Unified Data Science Interface for HadoopIPython Notebook as a Unified Data Science Interface for Hadoop
IPython Notebook as a Unified Data Science Interface for Hadoop
 
Data Science Amsterdam - Massively Parallel Processing with Procedural Languages
Data Science Amsterdam - Massively Parallel Processing with Procedural LanguagesData Science Amsterdam - Massively Parallel Processing with Procedural Languages
Data Science Amsterdam - Massively Parallel Processing with Procedural Languages
 
Pig on Tez - Low Latency ETL with Big Data
Pig on Tez - Low Latency ETL with Big DataPig on Tez - Low Latency ETL with Big Data
Pig on Tez - Low Latency ETL with Big Data
 
Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...
Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...
Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...
 
Apache Pig: Making data transformation easy
Apache Pig: Making data transformation easyApache Pig: Making data transformation easy
Apache Pig: Making data transformation easy
 
Pig Tutorial | Twitter Case Study | Apache Pig Script and Commands | Edureka
Pig Tutorial | Twitter Case Study | Apache Pig Script and Commands | EdurekaPig Tutorial | Twitter Case Study | Apache Pig Script and Commands | Edureka
Pig Tutorial | Twitter Case Study | Apache Pig Script and Commands | Edureka
 
Apache Pig for Data Scientists
Apache Pig for Data ScientistsApache Pig for Data Scientists
Apache Pig for Data Scientists
 
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)
 
Hive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReadingHive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReading
 
Hadoop interview question
Hadoop interview questionHadoop interview question
Hadoop interview question
 
Word Embedding for Nearest Words
Word Embedding for Nearest WordsWord Embedding for Nearest Words
Word Embedding for Nearest Words
 
Hadoop, Hbase and Hive- Bay area Hadoop User Group
Hadoop, Hbase and Hive- Bay area Hadoop User GroupHadoop, Hbase and Hive- Bay area Hadoop User Group
Hadoop, Hbase and Hive- Bay area Hadoop User Group
 
Hadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questionsHadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questions
 
GPU Accelerated Machine Learning
GPU Accelerated Machine LearningGPU Accelerated Machine Learning
GPU Accelerated Machine Learning
 
Hadoop for Java Professionals
Hadoop for Java ProfessionalsHadoop for Java Professionals
Hadoop for Java Professionals
 

Destaque

BioPig for scalable analysis of big sequencing data
BioPig for scalable analysis of big sequencing dataBioPig for scalable analysis of big sequencing data
BioPig for scalable analysis of big sequencing dataZhong Wang
 
Poster Hadoop summit 2011: pig embedding in scripting languages
Poster Hadoop summit 2011: pig embedding in scripting languagesPoster Hadoop summit 2011: pig embedding in scripting languages
Poster Hadoop summit 2011: pig embedding in scripting languagesJulien Le Dem
 
Data Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet ArrowData Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet ArrowJulien Le Dem
 
Reducing the dimensionality of data with neural networks
Reducing the dimensionality of data with neural networksReducing the dimensionality of data with neural networks
Reducing the dimensionality of data with neural networksHakky St
 
Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Strata NY 2016: The future of column-oriented data processing with Arrow and ...Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Strata NY 2016: The future of column-oriented data processing with Arrow and ...Julien Le Dem
 
Low Latency Execution For Apache Spark
Low Latency Execution For Apache SparkLow Latency Execution For Apache Spark
Low Latency Execution For Apache SparkJen Aman
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Julien Le Dem
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Cloudera, Inc.
 
If you have your own Columnar format, stop now and use Parquet 😛
If you have your own Columnar format,  stop now and use Parquet  😛If you have your own Columnar format,  stop now and use Parquet  😛
If you have your own Columnar format, stop now and use Parquet 😛Julien Le Dem
 
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...StampedeCon
 
Pig and Python to Process Big Data
Pig and Python to Process Big DataPig and Python to Process Big Data
Pig and Python to Process Big DataShawn Hermans
 

Destaque (12)

BioPig for scalable analysis of big sequencing data
BioPig for scalable analysis of big sequencing dataBioPig for scalable analysis of big sequencing data
BioPig for scalable analysis of big sequencing data
 
Poster Hadoop summit 2011: pig embedding in scripting languages
Poster Hadoop summit 2011: pig embedding in scripting languagesPoster Hadoop summit 2011: pig embedding in scripting languages
Poster Hadoop summit 2011: pig embedding in scripting languages
 
Data Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet ArrowData Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet Arrow
 
Reducing the dimensionality of data with neural networks
Reducing the dimensionality of data with neural networksReducing the dimensionality of data with neural networks
Reducing the dimensionality of data with neural networks
 
Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Strata NY 2016: The future of column-oriented data processing with Arrow and ...Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Strata NY 2016: The future of column-oriented data processing with Arrow and ...
 
05 k-means clustering
05 k-means clustering05 k-means clustering
05 k-means clustering
 
Low Latency Execution For Apache Spark
Low Latency Execution For Apache SparkLow Latency Execution For Apache Spark
Low Latency Execution For Apache Spark
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
 
If you have your own Columnar format, stop now and use Parquet 😛
If you have your own Columnar format,  stop now and use Parquet  😛If you have your own Columnar format,  stop now and use Parquet  😛
If you have your own Columnar format, stop now and use Parquet 😛
 
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
 
Pig and Python to Process Big Data
Pig and Python to Process Big DataPig and Python to Process Big Data
Pig and Python to Process Big Data
 

Semelhante a Embedding Pig in scripting languages

BRV CTO Summit Deep Learning Talk
BRV CTO Summit Deep Learning TalkBRV CTO Summit Deep Learning Talk
BRV CTO Summit Deep Learning TalkDoug Chang
 
Need to make a horizontal change across 100+ microservices? No worries, Sheph...
Need to make a horizontal change across 100+ microservices? No worries, Sheph...Need to make a horizontal change across 100+ microservices? No worries, Sheph...
Need to make a horizontal change across 100+ microservices? No worries, Sheph...Aori Nevo, PhD
 
Incredible Machine with Pipelines and Generators
Incredible Machine with Pipelines and GeneratorsIncredible Machine with Pipelines and Generators
Incredible Machine with Pipelines and Generatorsdantleech
 
High Quality Symfony Bundles tutorial - Dutch PHP Conference 2014
High Quality Symfony Bundles tutorial - Dutch PHP Conference 2014High Quality Symfony Bundles tutorial - Dutch PHP Conference 2014
High Quality Symfony Bundles tutorial - Dutch PHP Conference 2014Matthias Noback
 
Streaming Data in R
Streaming Data in RStreaming Data in R
Streaming Data in RRory Winston
 
Introduction to Google App Engine with Python
Introduction to Google App Engine with PythonIntroduction to Google App Engine with Python
Introduction to Google App Engine with PythonBrian Lyttle
 
MobileConf 2021 Slides: Let's build macOS CLI Utilities using Swift
MobileConf 2021 Slides:  Let's build macOS CLI Utilities using SwiftMobileConf 2021 Slides:  Let's build macOS CLI Utilities using Swift
MobileConf 2021 Slides: Let's build macOS CLI Utilities using SwiftDiego Freniche Brito
 
Concurrent Programming OpenMP @ Distributed System Discussion
Concurrent Programming OpenMP @ Distributed System DiscussionConcurrent Programming OpenMP @ Distributed System Discussion
Concurrent Programming OpenMP @ Distributed System DiscussionCherryBerry2
 
Easy deployment & management of cloud apps
Easy deployment & management of cloud appsEasy deployment & management of cloud apps
Easy deployment & management of cloud appsDavid Cunningham
 
Who pulls the strings?
Who pulls the strings?Who pulls the strings?
Who pulls the strings?Ronny
 
High Performance NodeJS
High Performance NodeJSHigh Performance NodeJS
High Performance NodeJSDicoding
 
2nd puc computer science chapter 8 function overloading
 2nd puc computer science chapter 8   function overloading 2nd puc computer science chapter 8   function overloading
2nd puc computer science chapter 8 function overloadingAahwini Esware gowda
 
Pemrograman Python untuk Pemula
Pemrograman Python untuk PemulaPemrograman Python untuk Pemula
Pemrograman Python untuk PemulaOon Arfiandwi
 
The GO Language : From Beginners to Gophers
The GO Language : From Beginners to GophersThe GO Language : From Beginners to Gophers
The GO Language : From Beginners to GophersAlessandro Sanino
 
python introduction initial lecture unit1.pptx
python introduction initial lecture unit1.pptxpython introduction initial lecture unit1.pptx
python introduction initial lecture unit1.pptxChandraPrakash715640
 
ClojureScript - Making Front-End development Fun again - John Stevenson - Cod...
ClojureScript - Making Front-End development Fun again - John Stevenson - Cod...ClojureScript - Making Front-End development Fun again - John Stevenson - Cod...
ClojureScript - Making Front-End development Fun again - John Stevenson - Cod...Codemotion
 
Network Analysis with networkX : Real-World Example-1
Network Analysis with networkX : Real-World Example-1Network Analysis with networkX : Real-World Example-1
Network Analysis with networkX : Real-World Example-1Kyunghoon Kim
 

Semelhante a Embedding Pig in scripting languages (20)

BRV CTO Summit Deep Learning Talk
BRV CTO Summit Deep Learning TalkBRV CTO Summit Deep Learning Talk
BRV CTO Summit Deep Learning Talk
 
Need to make a horizontal change across 100+ microservices? No worries, Sheph...
Need to make a horizontal change across 100+ microservices? No worries, Sheph...Need to make a horizontal change across 100+ microservices? No worries, Sheph...
Need to make a horizontal change across 100+ microservices? No worries, Sheph...
 
Incredible Machine with Pipelines and Generators
Incredible Machine with Pipelines and GeneratorsIncredible Machine with Pipelines and Generators
Incredible Machine with Pipelines and Generators
 
Pig
PigPig
Pig
 
High Quality Symfony Bundles tutorial - Dutch PHP Conference 2014
High Quality Symfony Bundles tutorial - Dutch PHP Conference 2014High Quality Symfony Bundles tutorial - Dutch PHP Conference 2014
High Quality Symfony Bundles tutorial - Dutch PHP Conference 2014
 
Streaming Data in R
Streaming Data in RStreaming Data in R
Streaming Data in R
 
Introduction to Google App Engine with Python
Introduction to Google App Engine with PythonIntroduction to Google App Engine with Python
Introduction to Google App Engine with Python
 
MobileConf 2021 Slides: Let's build macOS CLI Utilities using Swift
MobileConf 2021 Slides:  Let's build macOS CLI Utilities using SwiftMobileConf 2021 Slides:  Let's build macOS CLI Utilities using Swift
MobileConf 2021 Slides: Let's build macOS CLI Utilities using Swift
 
Concurrent Programming OpenMP @ Distributed System Discussion
Concurrent Programming OpenMP @ Distributed System DiscussionConcurrent Programming OpenMP @ Distributed System Discussion
Concurrent Programming OpenMP @ Distributed System Discussion
 
Easy deployment & management of cloud apps
Easy deployment & management of cloud appsEasy deployment & management of cloud apps
Easy deployment & management of cloud apps
 
Who pulls the strings?
Who pulls the strings?Who pulls the strings?
Who pulls the strings?
 
Python1
Python1Python1
Python1
 
High Performance NodeJS
High Performance NodeJSHigh Performance NodeJS
High Performance NodeJS
 
2nd puc computer science chapter 8 function overloading
 2nd puc computer science chapter 8   function overloading 2nd puc computer science chapter 8   function overloading
2nd puc computer science chapter 8 function overloading
 
Pemrograman Python untuk Pemula
Pemrograman Python untuk PemulaPemrograman Python untuk Pemula
Pemrograman Python untuk Pemula
 
The GO Language : From Beginners to Gophers
The GO Language : From Beginners to GophersThe GO Language : From Beginners to Gophers
The GO Language : From Beginners to Gophers
 
python introduction initial lecture unit1.pptx
python introduction initial lecture unit1.pptxpython introduction initial lecture unit1.pptx
python introduction initial lecture unit1.pptx
 
Puppet quick start guide
Puppet quick start guidePuppet quick start guide
Puppet quick start guide
 
ClojureScript - Making Front-End development Fun again - John Stevenson - Cod...
ClojureScript - Making Front-End development Fun again - John Stevenson - Cod...ClojureScript - Making Front-End development Fun again - John Stevenson - Cod...
ClojureScript - Making Front-End development Fun again - John Stevenson - Cod...
 
Network Analysis with networkX : Real-World Example-1
Network Analysis with networkX : Real-World Example-1Network Analysis with networkX : Real-World Example-1
Network Analysis with networkX : Real-World Example-1
 

Mais de Julien Le Dem

Data and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineageData and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineageJulien Le Dem
 
Data pipelines observability: OpenLineage & Marquez
Data pipelines observability:  OpenLineage & MarquezData pipelines observability:  OpenLineage & Marquez
Data pipelines observability: OpenLineage & MarquezJulien Le Dem
 
Open core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageOpen core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageJulien Le Dem
 
Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020Julien Le Dem
 
Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020Julien Le Dem
 
Strata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed databaseStrata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed databaseJulien Le Dem
 
From flat files to deconstructed database
From flat files to deconstructed databaseFrom flat files to deconstructed database
From flat files to deconstructed databaseJulien Le Dem
 
Strata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmapStrata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmapJulien Le Dem
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowJulien Le Dem
 
Improving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache ArrowImproving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache ArrowJulien Le Dem
 
Mule soft mar 2017 Parquet Arrow
Mule soft mar 2017 Parquet ArrowMule soft mar 2017 Parquet Arrow
Mule soft mar 2017 Parquet ArrowJulien Le Dem
 
Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...Julien Le Dem
 
Sql on everything with drill
Sql on everything with drillSql on everything with drill
Sql on everything with drillJulien Le Dem
 
How to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analyticsHow to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analyticsJulien Le Dem
 
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014Julien Le Dem
 
Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Julien Le Dem
 
Parquet Twitter Seattle open house
Parquet Twitter Seattle open houseParquet Twitter Seattle open house
Parquet Twitter Seattle open houseJulien Le Dem
 

Mais de Julien Le Dem (18)

Data and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineageData and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineage
 
Data pipelines observability: OpenLineage & Marquez
Data pipelines observability:  OpenLineage & MarquezData pipelines observability:  OpenLineage & Marquez
Data pipelines observability: OpenLineage & Marquez
 
Open core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageOpen core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineage
 
Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020
 
Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020
 
Strata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed databaseStrata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed database
 
From flat files to deconstructed database
From flat files to deconstructed databaseFrom flat files to deconstructed database
From flat files to deconstructed database
 
Strata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmapStrata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmap
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
 
Improving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache ArrowImproving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache Arrow
 
Mule soft mar 2017 Parquet Arrow
Mule soft mar 2017 Parquet ArrowMule soft mar 2017 Parquet Arrow
Mule soft mar 2017 Parquet Arrow
 
Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...
 
Sql on everything with drill
Sql on everything with drillSql on everything with drill
Sql on everything with drill
 
How to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analyticsHow to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analytics
 
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
 
Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013
 
Parquet Twitter Seattle open house
Parquet Twitter Seattle open houseParquet Twitter Seattle open house
Parquet Twitter Seattle open house
 
Parquet overview
Parquet overviewParquet overview
Parquet overview
 

Último

Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024The Digital Insurer
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024SynarionITSolutions
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 

Último (20)

Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 

Embedding Pig in scripting languages

  • 1. Embedding Pig in scripting languages What happens when you feed a Pig to a Python? Julien Le Dem – Principal Engineer - Content Platforms at Yahoo! Pig committer julien@ledem.net @julienledem
  • 2. Disclaimer No animals were hurt in the making of this presentation I’m cute I’m hungry Picture credits: OZinOH: http://www.flickr.com/photos/75905404@N00/5421543577/ Stephen & Claire Farnsworth: http://www.flickr.com/photos/the_farnsworths/4720850597/
  • 3. What for ? Simplifying the implementation of iterative algorithms: Loop and exit criteria Simpler User Defined Functions Easier parameter passing
  • 4. Before The implementation has the following artifacts:
  • 5. Pig Script(s) warshall_n_minus_1 = LOAD '$workDir/warshall_0' USING BinStorage AS (id1:chararray, id2:chararray, status:chararray); to_join_n_minus_1 = LOAD '$workDir/to_join_0' USING BinStorage AS (id1:chararray, id2:chararray, status:chararray); joined = COGROUP to_join_n_minus_1 BY id2, warshall_n_minus_1 BY id1; followed = FOREACH joined GENERATE FLATTEN(followRel(to_join_n_minus_1,warshall_n_minus_1)); followed_byid = GROUP followed BY id1; warshall_n = FOREACH followed_byid GENERATE group, FLATTEN(coalesceLine(followed.(id2, status))); to_join_n = FILTER warshall_n BY $2 == 'notfollowed' AND $0!=$1; STORE warshall_n INTO '$workDir/warshall_1' USING BinStorage; STORE to_join_n INTO '$workDir/to_join_1 USING BinStorage;
  • 6. External loop #!/usr/bin/python import os num_iter=int(10) for i in range(num_iter): os.system('java -jar ./lib/pig.jar -x local plsi_singleiteration.pig') os.rename('output_results/p_z_u','output_results/p_z_u.'+str(i)) os.system('cpoutput_results/p_z_u.nxtoutput_results/p_z_u'); os.rename('output_results/p_z_u.nxt','output_results/p_z_u.'+str(i+1)) os.rename('output_results/p_s_z','output_results/p_s_z.'+str(i)) os.system('cpoutput_results/p_s_z.nxtoutput_results/p_s_z'); os.rename('output_results/p_s_z.nxt','output_results/p_s_z.'+str(i+1))
  • 9. So… What happens? Credits: Mango Atchar: http://www.flickr.com/photos/mangoatchar/362439607/
  • 10. After One script (to rule them all): - main program - UDFs as script functions - embedded Pig statements All the algorithm in one place
  • 11. References It uses JVM implementations of scripting languages (Jython, Rhino). This is a joint effort, see the following Jiras: in Pig 0.8: PIG-928 Python UDFs in Pig0.9: PIG-1479 embedding, PIG-1794 JavaScript support Doc: http://pig.apache.org/docs/
  • 12. Examples 1) Simple example: fixed loop iteration 2) Adding convergence criteria and accessing intermediary output 3)More advanced example with UDFs
  • 13. 1) A Simple Example PageRank: A system of linear equations (as many as there are pages on the web, yeah, a lot): It can be approximated iteratively: compute the new page rank based on the page ranks of the previous iteration. Start with some value. Ref: http://en.wikipedia.org/wiki/PageRank
  • 14. Or more visually Each page sends a fraction of its PageRank to the pages linked to. Inversely proportional to the number of links.
  • 15.
  • 16. Let’s zoom in pig script: PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) Iterate 10 times Pass parameters as a dictionary Pass parameters as a dictionary Just run P, that was declared above The output becomes the new input
  • 17. Practical result Applied to the English Wikipedia link graph: http://wiki.dbpedia.org/Downloads36#owikipediapagelinks It turns out that the max PageRank is awarded to: http://en.wikipedia.org/wiki/United_States Thanks @ogrisel for the suggestion
  • 18. 2) Same example, one step further Now let’s say that we define a threshold as a convergence criteria instead of a fixed iteration count.
  • 19. Same thing as previously Computation of the maximum difference with the previous iteration … continued next slide
  • 20. The main program The parameter-less bind() uses the variables in the current scope We can easily read the output of Pig from the grid Stop if we reach a threshold
  • 21. 3) Now somethingmore complex Compute a transitive closure: find the connected components of a graph. - Useful if you’re doing de-duplication - Requires iterations and UDFs
  • 22. Or more visually Turn this: Into this:
  • 23. Convergence Converges in : log2(max(Diameter of a component)) Diameter = “the longest shortest path” Bottom line: the number of iterations will be reasonable.
  • 24. UDFs are in the same script as the main program Zoom next slide Page 1/3
  • 25. Zoom on UDFs The output schema of the UDF is defined using a decorator The native structures of the language can be used directly
  • 26. Zoom next slides Zoom next slides Page 2/3
  • 27. Zoom on the Pig script … UDFs are directly available, no extra declaration needed
  • 28. Zoom on the loop Iterate a maximum of 10 times (2^10 maximum diameter of a component) Convergence criteria: all links have been followed
  • 29. Final part: formatting Turning the graph representation into a component list representation This is necessary when we have UDFs, so that the script can be evaluated again on the slaves without running main() Page 3/3
  • 30. One more thing … I presented Python but JavaScript is available as well (experimental). The framework is extensible. Any JVM implementation of a language could be integrated (contribute!). The examples can be found at: https://github.com/julienledem/Pig-scripting-examples