PyCascading for Intuitive
Flow Processing With
Hadoop
Gabor Szabo
Senior Data Scientist
Twitter, Inc.
Outline
• Basic concepts in the Hadoop ecosystem, with an example
• Hadoop
• Cascading
• PyCascading
• Essential PyCascading operations
• PyCascading by example: discovering main interests among friends
• Miscellaneous remarks, caveats
Hadoop
Architecture
• The Hadoop file system (HDFS)
• Large, distributed file system
• Thousands of nodes, PBs of data
• The storage layer for Apache Hive, HBase, ...
• MapReduce
• Idea: ship the code to the data, not the other way around
• Do aggregations locally
• Iterate on the results
• Map phase: process the input records, emit a key & a value
• Reduce phase: collect records with the same key from Map, emit a new (aggregate) record
• Fault tolerance
• Both storage and compute are fault tolerant (redundancy, replication, restart)
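The Map and Reduce phases described above can be sketched in a few lines of plain Python. This is only an in-memory illustration of the idea, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`) are made up for this sketch, and the `shuffle` step stands in for the grouping by key that the framework does between the two phases.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (key, value) pair for every word in one input record."""
    for word in line.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Collect the values emitted by the mappers under their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: turn all values of one key into a single aggregate record."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["the quick fox", "the lazy dog", "The end"])
```

On a real cluster the map calls run next to the data blocks and only the emitted pairs travel over the network, which is the "ship the code to the data" idea.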
Hadoop
In practice
• Language
• Java
• Need to think in MapReduce
• Hard to translate the problem to MR
• Hard to maintain and make changes in the topology
• Best used for
• Archiving (HDFS)
• Batch processing (MR)
Cascading
The Cascading way: flow processing
• Cascading is built on top of Hadoop
• Introduces semi-structured flow processing of tuples with typed fields
• Analogy: data is flowing in pipes
• Input comes from source taps
• Output goes to sink taps
• Data is reshaped in the pipes by different operations
• Builds a DAG from the job, and optimizes the topology to minimize the number of MapReduce phases
• The pipes analogy is more intuitive to use than raw MapReduce
Flow processing in (Py)Cascading
Source: cascading.org
• Diagram annotations: source tap, filter, “Map”, join, group & aggregate, sink tap
PyCascading
Design
• Built on top of Cascading
• Uses the Jython 2.5 interpreter
• Everything in Python
• Building the pipelines
• User-defined functions that operate on data
• Completely hides Java if the user wants it to
• However, due to the strong ties with Java, it’s worth knowing the Cascading classes
Example: as always, WordCount
Writing MapReduce jobs by hand is hard
• WordCount: split the input file into words, and count how many times each word occurs
• Code listing annotations: M (map), R (reduce), support code
Cascading WordCount
• Still in Java, but algorithm design is easier
• Need to write separate classes for each user-defined operation
• Code listing annotations: M (map), G (group), support code
word_count.py
PyCascading minimizes programmer effort
• Code listing annotations: Map, G (group), support code
PyCascading workflow
The basics of writing a Cascading flow in Python
• There is one main script that must contain a main() function
• We build the pipeline in main()
• Pipes are joined together with the pipe operator |
• Pipe ends may be assigned to variables and reused (split)
• All the user-defined operations are Python functions
• Globally or locally-scoped
• Then submit the pipeline to be run to PyCascading
• The main Python script will be executed on each of the workers when they spin up to import global declarations
• This is why the pipeline is built in main(): so that it won’t be executed again on the workers
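To make the pipe operator and the role of main() more concrete, here is a minimal plain-Python sketch of how `|` chaining and pipe splitting can be built with operator overloading. The `Pipe` class and the three operations are hypothetical and are not the PyCascading API; the `__main__` guard only stands in for the framework calling main() itself, so that importing the module (as the workers do) defines the functions without building the pipeline again.

```python
class Pipe(object):
    """A hypothetical pipe: a chain of functions applied to a stream of records."""

    def __init__(self, steps=()):
        self.steps = tuple(steps)

    def __or__(self, operation):
        # `pipe | operation` returns a NEW pipe, so the old pipe end stays
        # usable and can be extended again later (a split).
        return Pipe(self.steps + (operation,))

    def run(self, records):
        for step in self.steps:
            records = step(records)
        return list(records)


# User-defined operations are ordinary, globally scoped Python functions.
def to_upper(records):
    return (r.upper() for r in records)


def only_short(records):
    return (r for r in records if len(r) <= 3)


def exclaim(records):
    return (r + "!" for r in records)


def main():
    source = Pipe()
    upper = source | to_upper      # pipe end kept in a variable
    short = upper | only_short     # first branch of the split
    loud = upper | exclaim         # second branch reuses the same pipe end
    data = ["cat", "horse", "ox"]
    return short.run(data), loud.run(data)


if __name__ == "__main__":
    print(main())
```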
PyCascading by example
Walk through the operations using an example
• Data
• A friendship network in long format
• List of interests per user, ordered by decreasing importance
• Question
• For every user, find which main interest among the friends occurs the most
• Workflow
• Take the most important interest per user, and join it to the friendship table
• For each user, count how many times each interest appears, and select the one with the maximum count
(Figure: Friendship network and User interests tables)
The full source
Defining the inputs
• Need to use Java types since this is a Cascading call
Shaping the fields: “mapping”
• Replace the interest field with the result yielded by take_first, and call it interest
• Decorators annotate user-defined functions
• tuple is a Cascading record type
• We can return any number of new records with yield
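The slide's code is not part of this transcript, so the following is only a plain-Python sketch of the idea: a generator-style user-defined function whose yielded values replace a field. It assumes, for illustration, that the interest field is a comma-separated string ordered by importance; `map_replace` is a hypothetical helper and records are plain dicts rather than Cascading tuples.

```python
def take_first(record):
    """Yield the first (most important) interest of a user, if there is one."""
    interests = [i.strip() for i in record["interest"].split(",") if i.strip()]
    if interests:
        yield interests[0]  # a UDF may yield zero, one, or many results


def map_replace(records, field, udf):
    """Replace `field` in every record with each value yielded by `udf`."""
    for record in records:
        for value in udf(record):
            new_record = dict(record)
            new_record[field] = value
            yield new_record


users = [
    {"user": "alice", "interest": "chess, hiking, jazz"},
    {"user": "bob", "interest": "cooking"},
    {"user": "carol", "interest": ""},
]
first_interests = list(map_replace(users, "interest", take_first))
```

Because the function yields instead of returning, a record with no interests simply produces no output record, and a function that yields several times would fan one input record out into several.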
Checkpointing
• Take the data EITHER from the cache (ID: “users_first_interests”), OR generate it if it’s not cached yet
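The either/or logic of a checkpoint can be sketched as follows. This is a local-file illustration under assumed names (`checkpoint`, `expensive_step`, a JSON file per cache ID), not how PyCascading stores its cache.

```python
import json
import os
import tempfile


def checkpoint(cache_dir, cache_id, generate):
    """Return the cached data for `cache_id`, generating and storing it if absent."""
    path = os.path.join(cache_dir, cache_id + ".json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    data = generate()
    with open(path, "w") as f:
        json.dump(data, f)
    return data


calls = []


def expensive_step():
    calls.append(1)  # record how often the upstream work really runs
    return [["alice", "chess"], ["bob", "cooking"]]


cache_dir = tempfile.mkdtemp()
first = checkpoint(cache_dir, "users_first_interests", expensive_step)
second = checkpoint(cache_dir, "users_first_interests", expensive_step)
```

The second call reads the stored result, so the upstream part of the flow does not have to be recomputed while the downstream part is still being developed.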
Grouping & aggregating
• Group by user, and call the two result fields user and friend
• Define a UDF that takes the grouping fields, a tuple iterator, and optional arguments
• Use the .get getter with the field name
• Yield any number of results
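The shape of such a grouping UDF can be illustrated in plain Python with itertools.groupby. The helper `group_by`, the UDF `first_friends`, and its `limit` argument are hypothetical names for this sketch; records are dicts, so `t["friend"]` plays the role of the `.get` getter.

```python
from itertools import groupby
from operator import itemgetter


def group_by(records, key_field, udf, *args):
    """Sort by the grouping field, then hand each group to the UDF."""
    ordered = sorted(records, key=itemgetter(key_field))
    for key, group in groupby(ordered, key=itemgetter(key_field)):
        # The UDF receives the grouping value, an iterator over the
        # group's records, and any optional arguments.
        for result in udf(key, group, *args):
            yield result


def first_friends(user, tuples, limit):
    """Yield at most `limit` records per user, with fields user and friend."""
    for n, t in enumerate(tuples):
        if n >= limit:
            break
        yield {"user": user, "friend": t["friend"]}


friends = [
    {"user": "alice", "friend": "bob"},
    {"user": "bob", "friend": "alice"},
    {"user": "alice", "friend": "carol"},
    {"user": "alice", "friend": "dave"},
]
result = list(group_by(friends, "user", first_friends, 2))
```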
Joins & field algebra
• Join on the friend field from the 1st stream, and on the user field from the 2nd
• No field name overlap is allowed
• Keep certain fields & rename
• Use built-in aggregators where possible
• Save this stream!
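The join step can be sketched in plain Python as a hash join that refuses overlapping field names, followed by a projection that keeps only the wanted fields. The helpers `rename` and `inner_join` and the field name `friend_name` are assumptions made for this illustration, not PyCascading calls.

```python
def rename(records, mapping):
    """Rename fields according to `mapping`, leaving other fields untouched."""
    for r in records:
        yield dict((mapping.get(k, k), v) for k, v in r.items())


def inner_join(left, right, left_key, right_key):
    """Hash join; like Cascading, reject streams whose field names overlap."""
    index = {}
    for r in right:
        index.setdefault(r[right_key], []).append(r)
    for l in left:
        for r in index.get(l[left_key], []):
            overlap = set(l) & set(r)
            if overlap:
                raise ValueError("field names overlap: %s" % sorted(overlap))
            joined = dict(l)
            joined.update(r)
            yield joined


friends = [
    {"user": "alice", "friend": "bob"},
    {"user": "alice", "friend": "carol"},
    {"user": "bob", "friend": "alice"},
]
interests = [
    {"user": "alice", "interest": "chess"},
    {"user": "bob", "interest": "cooking"},
    {"user": "carol", "interest": "cooking"},
]

# Both streams have a `user` field, so the second one is renamed first.
renamed = rename(interests, {"user": "friend_name"})
joined = inner_join(friends, renamed, "friend", "friend_name")
# Keep only the fields needed downstream.
friends_interests = [{"user": j["user"], "interest": j["interest"]} for j in joined]
```

Joining the two streams without the rename raises the ValueError, which mirrors the "no field name overlap" rule.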
Split & aggregate more
• Split the stream to group by user, and find the interest that appears most by count
• Once the data flow is built, submit and run it!
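The last two aggregations, counting each interest per user and then selecting the most frequent one, can be sketched in plain Python as below. The function names and the alphabetical tie-breaking are choices made for this sketch; the original flow may resolve ties differently.

```python
from collections import Counter


def count_interests(friends_interests):
    """Count how often each (user, interest) pair occurs."""
    return Counter((r["user"], r["interest"]) for r in friends_interests)


def top_interest(counts):
    """Per user, keep the interest with the highest count."""
    best = {}
    # Sorting makes ties deterministic: the alphabetically first interest wins.
    for (user, interest), n in sorted(counts.items()):
        if user not in best or n > best[user][1]:
            best[user] = (interest, n)
    return best


friends_interests = [
    {"user": "alice", "interest": "cooking"},
    {"user": "alice", "interest": "chess"},
    {"user": "alice", "interest": "cooking"},
    {"user": "bob", "interest": "jazz"},
    {"user": "bob", "interest": "chess"},
]
counts = count_interests(friends_interests)
recommendations = top_interest(counts)
```

Keeping `counts` as a separate result corresponds to saving the intermediate stream before splitting it for the final aggregation.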
Running the script
Local or remote runs
• Cascading flows can run locally or on HDFS
• Local run for testing
• local_run.sh recommendation.py
• Remote run in production
• remote_deploy.sh -s server.com recommendation.py
• The example had 5 MR stages
• Although the problem was simple, doing it by hand would have been inconvenient
(Figure: Friendship network, User interests, friends_interests_counts, recommendations)
Some remarks
• Benefits
• Can use any Java class
• Can be mixed with Java code
• Can use Python libraries
• Caveats
• Only pure Python code can be used, no compiled C (numpy, scipy)
• But with streaming it’s possible to execute a CPython interpreter
• Some idiosyncrasies because of Jython’s representation of basic types
• Strings are OK, but Python integers are represented as java.math.BigInteger, so explicit conversion is needed before yielding (important for joins!)
Contact
• Javadoc: http://cascading.org
• Other Cascading-based wrappers
• Scalding (Scala), Cascalog (Clojure), Cascading-JRuby (Ruby)
http://github.org/twitter/pycascading
http://pycascading.org
@gaborjszabo
gabor@twitter.com
Implementation details
Challenges due to an interpreted language
• We need to make code available on all workers
• Java bytecode is easy, same .jar everywhere
• Although Jython represents Python functions as classes, they cannot be serialized
• We need to start an interpreter on every worker
• The Python source of the UDFs is retrieved and shipped in the .jar
• Different Hadoop distributions explode the .jar differently, so the Hadoop distributed cache has to be used

Mais conteúdo relacionado

Mais procurados

MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce ParadigmDilip Reddy
 
Map reduce paradigm explained
Map reduce paradigm explainedMap reduce paradigm explained
Map reduce paradigm explainedDmytro Sandu
 
Map reduce and Hadoop on windows
Map reduce and Hadoop on windowsMap reduce and Hadoop on windows
Map reduce and Hadoop on windowsMuhammad Shahid
 
Spark Sql and DataFrame
Spark Sql and DataFrameSpark Sql and DataFrame
Spark Sql and DataFramePrashant Gupta
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduceNewvewm
 
Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)Emilio Coppa
 
Data profiling in Apache Calcite
Data profiling in Apache CalciteData profiling in Apache Calcite
Data profiling in Apache CalciteDataWorks Summit
 
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17spark-project
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Xuan-Chao Huang
 
Free Code Friday - Spark Streaming with HBase
Free Code Friday - Spark Streaming with HBaseFree Code Friday - Spark Streaming with HBase
Free Code Friday - Spark Streaming with HBaseMapR Technologies
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferretAndrii Gakhov
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkDatio Big Data
 
Introduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemIntroduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemBojan Babic
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesDatabricks
 
Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesKelly Technologies
 
Apache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce TutorialApache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce TutorialFarzad Nozarian
 

Mais procurados (19)

Apache Flink Deep Dive
Apache Flink Deep DiveApache Flink Deep Dive
Apache Flink Deep Dive
 
R for hadoopers
R for hadoopersR for hadoopers
R for hadoopers
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce Paradigm
 
Map reduce paradigm explained
Map reduce paradigm explainedMap reduce paradigm explained
Map reduce paradigm explained
 
Map reduce and Hadoop on windows
Map reduce and Hadoop on windowsMap reduce and Hadoop on windows
Map reduce and Hadoop on windows
 
Spark Sql and DataFrame
Spark Sql and DataFrameSpark Sql and DataFrame
Spark Sql and DataFrame
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduce
 
Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)
 
Data profiling in Apache Calcite
Data profiling in Apache CalciteData profiling in Apache Calcite
Data profiling in Apache Calcite
 
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Free Code Friday - Spark Streaming with HBase
Free Code Friday - Spark Streaming with HBaseFree Code Friday - Spark Streaming with HBase
Free Code Friday - Spark Streaming with HBase
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Introduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemIntroduction to Apache Spark Ecosystem
Introduction to Apache Spark Ecosystem
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
 
Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologies
 
Apache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce TutorialApache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce Tutorial
 

Semelhante a PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)

Asynchronous single page applications without a line of HTML or Javascript, o...
Asynchronous single page applications without a line of HTML or Javascript, o...Asynchronous single page applications without a line of HTML or Javascript, o...
Asynchronous single page applications without a line of HTML or Javascript, o...Robert Schadek
 
Design for Scalability in ADAM
Design for Scalability in ADAMDesign for Scalability in ADAM
Design for Scalability in ADAMfnothaft
 
Scaling Analytics with Apache Spark
Scaling Analytics with Apache SparkScaling Analytics with Apache Spark
Scaling Analytics with Apache SparkQuantUniversity
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2Fabio Fumarola
 
Hadoop MapReduce Streaming and Pipes
Hadoop MapReduce  Streaming and PipesHadoop MapReduce  Streaming and Pipes
Hadoop MapReduce Streaming and PipesHanborq Inc.
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectMao Geng
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with PythonDonald Miner
 
m2r2: A Framework for Results Materialization and Reuse
m2r2: A Framework for Results Materialization and Reusem2r2: A Framework for Results Materialization and Reuse
m2r2: A Framework for Results Materialization and ReuseVasia Kalavri
 
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightEnterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightPaco Nathan
 
Cassandra Talk: Austin JUG
Cassandra Talk: Austin JUGCassandra Talk: Austin JUG
Cassandra Talk: Austin JUGStu Hood
 
High Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OHigh Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OSri Ambati
 
Accelerating apache spark with rdma
Accelerating apache spark with rdmaAccelerating apache spark with rdma
Accelerating apache spark with rdmainside-BigData.com
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataRahul Jain
 
Creating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at ScaleCreating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at ScaleSean Chittenden
 
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014NoSQLmatters
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impalamarkgrover
 
Scalable Ensemble Machine Learning @ Harvard Health Policy Data Science Lab
Scalable Ensemble Machine Learning @ Harvard Health Policy Data Science LabScalable Ensemble Machine Learning @ Harvard Health Policy Data Science Lab
Scalable Ensemble Machine Learning @ Harvard Health Policy Data Science LabSri Ambati
 
The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemCloudera, Inc.
 

Semelhante a PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo) (20)

Asynchronous single page applications without a line of HTML or Javascript, o...
Asynchronous single page applications without a line of HTML or Javascript, o...Asynchronous single page applications without a line of HTML or Javascript, o...
Asynchronous single page applications without a line of HTML or Javascript, o...
 
Design for Scalability in ADAM
Design for Scalability in ADAMDesign for Scalability in ADAM
Design for Scalability in ADAM
 
Scaling Analytics with Apache Spark
Scaling Analytics with Apache SparkScaling Analytics with Apache Spark
Scaling Analytics with Apache Spark
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2
 
Hadoop MapReduce Streaming and Pipes
Hadoop MapReduce  Streaming and PipesHadoop MapReduce  Streaming and Pipes
Hadoop MapReduce Streaming and Pipes
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log project
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with Python
 
m2r2: A Framework for Results Materialization and Reuse
m2r2: A Framework for Results Materialization and Reusem2r2: A Framework for Results Materialization and Reuse
m2r2: A Framework for Results Materialization and Reuse
 
Data Science
Data ScienceData Science
Data Science
 
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightEnterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
 
Cassandra Talk: Austin JUG
Cassandra Talk: Austin JUGCassandra Talk: Austin JUG
Cassandra Talk: Austin JUG
 
High Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OHigh Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2O
 
Accelerating apache spark with rdma
Accelerating apache spark with rdmaAccelerating apache spark with rdma
Accelerating apache spark with rdma
 
2014 hadoop wrocław jug
2014 hadoop   wrocław jug2014 hadoop   wrocław jug
2014 hadoop wrocław jug
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big Data
 
Creating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at ScaleCreating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at Scale
 
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impala
 
Scalable Ensemble Machine Learning @ Harvard Health Policy Data Science Lab
Scalable Ensemble Machine Learning @ Harvard Health Policy Data Science LabScalable Ensemble Machine Learning @ Harvard Health Policy Data Science Lab
Scalable Ensemble Machine Learning @ Harvard Health Policy Data Science Lab
 
The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop Ecosystem
 

Mais de PyData

Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...PyData
 
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshUnit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshPyData
 
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiThe TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiPyData
 
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...PyData
 
Deploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerDeploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerPyData
 
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaGraph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaPyData
 
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...PyData
 
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroRESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroPyData
 
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...PyData
 
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottAvoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottPyData
 
Words in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroWords in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroPyData
 
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...PyData
 
Pydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica PuertoPydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica PuertoPyData
 
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...PyData
 
Extending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will AydExtending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will AydPyData
 
Measuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen HooverMeasuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen HooverPyData
 
What's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper SeaboldWhat's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper SeaboldPyData
 
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...PyData
 
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardSolving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardPyData
 
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...PyData
 

Mais de PyData (20)

Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
 
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshUnit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
 
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiThe TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
 
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
 
Deploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerDeploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne Bauer
 
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma



PyCascading for Intuitive Flow Processing with Hadoop (Gabor Szabo)

  • 1. PyCascading for Intuitive Flow Processing With Hadoop Gabor Szabo Senior Data Scientist Twitter, Inc.
  • 2. Outline • Basic concepts in the Hadoop ecosystem, with an example • Hadoop • Cascading • PyCascading • Essential PyCascading operations • PyCascading by example: discovering main interests among friends • Miscellaneous remarks, caveats
  • 3. Hadoop Architecture • The Hadoop file system (HDFS) • Large, distributed file system • Thousands of nodes, PBs of data • The storage layer for Apache Hive, HBase, ... • MapReduce • Idea: ship the code to the data, not the other way around • Do aggregations locally • Iterate on the results • Map phase: process the input records, emit a key & a value • Reduce phase: collect records with the same key from Map, emit a new (aggregate) record • Fault tolerance • Both storage and compute are fault tolerant (redundancy, replication, restart)
  • 4. Hadoop In practice • Language • Java • Need to think in MapReduce • Hard to translate the problem to MR • Hard to maintain and make changes in the topology • Best used for • Archiving (HDFS) • Batch processing (MR)
  • 5. Cascading The Cascading way: flow processing • Cascading is built on top of Hadoop • Introduces semi-structured flow processing of tuples with typed fields • Analogy: data is flowing in pipes • Input comes from source taps • Output goes to sink taps • Data is reshaped in the pipes by different operations • Builds a DAG from the job, and optimizes the topology to minimize the number of MapReduce phases • The pipes analogy is more intuitive to use than raw MapReduce
  • 12. Flow processing in (Py)Cascading (Source: cascading.org) • Diagram labels: Source tap • Filter • “Map” • Join • Group & aggregate • Sink tap
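
The stages labeled in this diagram can be imitated in plain Python with generators, which may help make the pipes analogy concrete. This is only a sketch: it runs on an in-memory list, does not use Cascading or PyCascading, and every function name, field name, and record value below is made up for illustration.

```python
from collections import defaultdict


def source_tap(lines):
    """Source tap: turn each tab-separated line into a record with named fields."""
    for line in lines:
        user, score = line.split("\t")
        yield {"user": user, "score": int(score)}


def keep_positive(records):
    """Filter: drop records whose score is not positive."""
    return (r for r in records if r["score"] > 0)


def double_score(records):
    """'Map': reshape each record, here by doubling the score field."""
    for r in records:
        yield {"user": r["user"], "score": 2 * r["score"]}


def sum_by_user(records):
    """Group & aggregate: sum the scores per user."""
    totals = defaultdict(int)
    for r in records:
        totals[r["user"]] += r["score"]
    for user in sorted(totals):
        yield {"user": user, "total": totals[user]}


def sink_tap(records):
    """Sink tap: materialize the stream (a real sink would write to storage)."""
    return list(records)


raw = ["ann\t3", "bob\t-1", "ann\t2", "bob\t4"]
result = sink_tap(sum_by_user(double_score(keep_positive(source_tap(raw)))))
print(result)
```

Each stage only consumes the stream produced by the previous one, which is the property that lets a planner such as Cascading decide how to lay the stages out over MapReduce phases.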
  • 13. PyCascading Design • Built on top of Cascading • Uses the Jython 2.5 interpreter • Everything in Python • Building the pipelines • User-defined functions that operate on data • Completely hides Java if the user wants it to • However, due to the strong ties with Java, it’s worth knowing the Cascading classes
  • 17. Example: as always, WordCount Writing MapReduce jobs by hand is hard • WordCount: split the input file into words, and count how many times each word occurs • Code screenshot labels: M • R • Support code
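
The code shown on this slide is not part of the transcript. As a stand-in, here is a minimal plain-Python sketch of WordCount split into a map step, a shuffle that groups values by key, and a reduce step. It runs in a single process and is not Hadoop code; the function names and the sample lines are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: collect all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reduce: emit one aggregate record per key."""
    return word, sum(counts)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


counts = word_count(["the quick fox", "the lazy dog", "The end"])
print(counts)
```

In a real Hadoop job the shuffle is done by the framework between the Map and Reduce phases, and the surrounding job setup is the part a hand-written job has to add on top of these two functions.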
  • 21. Cascading WordCount • Still in Java, but algorithm design is easier • Need to write separate classes for each user-defined operation • Code screenshot labels: M • G • Support code
  • 26. PyCascading workflow The basics of writing a Cascading flow in Python • There is one main script that must contain a main() function • We build the pipeline in main() • Pipes are joined together with the pipe operator | • Pipe ends may be assigned to variables and reused (split) • All the user-defined operations are Python functions • Globally or locally scoped • Then submit the pipeline to PyCascading to be run • The main Python script will be executed on each of the workers when they spin up, to import global declarations • This is the reason we have to have main(), so that the pipeline is not built and submitted again on the workers
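
To illustrate how a `|` operator can chain operations and how a pipe end can be reused, here is a small plain-Python sketch based on overloading `__or__`. The `Stage` class is hypothetical and is not PyCascading's implementation or API; it only mimics the syntax on in-memory lists.

```python
class Stage(object):
    """Hypothetical pipeline stage wrapping a function from records to records."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # a | b builds a new stage that runs a, then feeds its output into b.
        return Stage(lambda records: other.func(self.func(records)))

    def run(self, records):
        return list(self.func(records))


def main():
    strip = Stage(lambda recs: (r.strip() for r in recs))
    non_empty = Stage(lambda recs: (r for r in recs if r))
    upper = Stage(lambda recs: (r.upper() for r in recs))
    length = Stage(lambda recs: (len(r) for r in recs))

    cleaned = strip | non_empty   # a pipe end kept in a variable...
    shouted = cleaned | upper     # ...and reused by two branches (a split)
    lengths = cleaned | length

    data = [" a ", "", "bc "]
    return shouted.run(data), lengths.run(data)


if __name__ == "__main__":
    print(main())
```

Keeping the pipeline construction inside `main()` means that merely importing the script, which is what the slide describes the workers doing, defines the functions without building or running anything.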
  • 27. PyCascading by example Walk through the operations using an example • Data • A friendship network in long format • List of interests per user, ordered by decreasing importance • Question • For every user, find which main interest among the friends occurs the most • Workflow • Take the most important interest per user, and join it to the friendship table • For each user, count how many times each interest appears, and select the one with the maximum count • Diagram labels: Friendship network • User interests
  • 30. Defining the inputs • Need to use Java types since this is a Cascading call
  • 35. Shaping the fields: “mapping” • Decorators annotate user-defined functions • tuple is a Cascading record type • We can return any number of new records with yield • Replace the interest field with the result yielded by take_first, and call it interest
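
The slide's code is not in the transcript, so the following is only a plain-Python sketch of the idea: a generator-style user-defined function whose yielded values replace one field of each record. The name `take_first` comes from the slide; its body, the `map_replace` helper, the comma-separated interest format, and the sample records are assumptions. PyCascading's decorators and Cascading's tuple type are not reproduced here; records are ordinary dicts.

```python
def take_first(record):
    """Yield the first (most important) interest, or nothing if there is none."""
    interests = [i.strip() for i in record["interest"].split(",") if i.strip()]
    if interests:
        yield interests[0]


def map_replace(records, field, udf):
    """Replace `field` with each value the UDF yields (zero, one, or many)."""
    for record in records:
        for value in udf(record):
            out = dict(record)
            out[field] = value
            yield out


users = [
    {"user": "ann", "interest": "hiking, chess, jazz"},
    {"user": "bob", "interest": ""},
    {"user": "cat", "interest": "jazz"},
]
first_interests = list(map_replace(users, "interest", take_first))
print(first_interests)
```

Because the function yields rather than returns, a record with no interests simply produces no output record, and a function could equally emit several records per input.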
  • 37. Checkpointing • Take the data EITHER from the cache (ID: “users_first_interests”), OR generate it if it’s not cached yet
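
The cache-or-generate behavior can be sketched in plain Python as below. This is not PyCascading's caching mechanism: the `cached` helper, the JSON file layout, and the temporary directory are assumptions chosen so the example runs anywhere. Only the cache ID is taken from the slide.

```python
import json
import os
import tempfile


def cached(cache_dir, cache_id, generate):
    """Return the data stored under cache_id, generating and storing it if missing."""
    path = os.path.join(cache_dir, cache_id + ".json")
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    data = generate()
    with open(path, "w") as handle:
        json.dump(data, handle)
    return data


calls = []


def expensive_step():
    """Stand-in for an upstream part of the flow that is costly to recompute."""
    calls.append(1)
    return [["ann", "hiking"], ["cat", "jazz"]]


cache_dir = tempfile.mkdtemp()
first = cached(cache_dir, "users_first_interests", expensive_step)
second = cached(cache_dir, "users_first_interests", expensive_step)
print(first == second, len(calls))
```

The second call reads the stored result instead of recomputing it, which is what makes checkpoints useful while iterating on the later stages of a flow.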
  • 42. Grouping & aggregating • Group by user, and call the two result fields user and friend • Define a UDF that takes the grouping fields, a tuple iterator, and optional arguments • Use the .get getter with the field name • Yield any number of results
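
What the UDF on this slide computes is not recoverable from the transcript, so the sketch below uses a made-up one, `first_friends`, that keeps at most `limit` friends per user. It follows the described shape (grouping fields, an iterator over the group's records, optional arguments, results produced with yield) in plain Python; the `group_by` helper and the data are hypothetical, and records are dicts, whose `.get` stands in for the field getter mentioned on the slide.

```python
from itertools import groupby
from operator import itemgetter


def first_friends(group, tuples, limit=2):
    """Hypothetical UDF: yield up to `limit` (user, friend) records per group."""
    for n, record in enumerate(tuples):
        if n >= limit:
            break
        yield {"user": group["user"], "friend": record.get("friend")}


def group_by(records, field, udf, *args):
    """Sort by `field`, then call the UDF once per group with an iterator."""
    ordered = sorted(records, key=itemgetter(field))
    for key, tuples in groupby(ordered, key=itemgetter(field)):
        for result in udf({field: key}, tuples, *args):
            yield result


friendships = [
    {"user": "bob", "friend": "ann"},
    {"user": "ann", "friend": "bob"},
    {"user": "ann", "friend": "cat"},
    {"user": "ann", "friend": "dan"},
]
limited = list(group_by(friendships, "user", first_friends, 2))
print(limited)
```

Passing the group's records as an iterator rather than a list means the function never needs a whole group in memory at once.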
  • 53. Joins & field algebra • Join on the friend field from the 1st stream, and on the user field from the 2nd • No field name overlap is allowed • Keep certain fields & rename • Use built-in aggregators where possible • Save this stream!
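
The join, the renaming step, and the counting can be sketched in plain Python as follows. This is not the PyCascading code from the slide: `rename`, `inner_join`, the field name `friend_user`, and the sample data are assumptions, and `collections.Counter` stands in for a built-in count aggregator. The stream name `friends_interests_counts` appears on a later slide of the deck.

```python
from collections import Counter, defaultdict


def rename(records, mapping):
    """Keep only the fields named in `mapping`, renaming old -> new."""
    for record in records:
        yield {new: record[old] for old, new in mapping.items()}


def inner_join(left, right, left_key, right_key):
    """Hash join; refuses overlapping field names, as the slide requires."""
    index = defaultdict(list)
    for rrec in right:
        index[rrec[right_key]].append(rrec)
    for lrec in left:
        for rrec in index.get(lrec[left_key], []):
            overlap = set(lrec) & set(rrec)
            if overlap:
                raise ValueError("field name overlap: %s" % sorted(overlap))
            merged = dict(lrec)
            merged.update(rrec)
            yield merged


friendships = [
    {"user": "ann", "friend": "bob"},
    {"user": "ann", "friend": "cat"},
    {"user": "ann", "friend": "dan"},
    {"user": "bob", "friend": "ann"},
]
first_interests = [
    {"user": "ann", "interest": "hiking"},
    {"user": "bob", "interest": "jazz"},
    {"user": "cat", "interest": "jazz"},
    {"user": "dan", "interest": "chess"},
]

# Both streams have a `user` field, so rename it in the second stream first.
right = rename(first_interests, {"user": "friend_user", "interest": "interest"})
joined = inner_join(friendships, right, "friend", "friend_user")
pairs = rename(joined, {"user": "user", "interest": "interest"})
friends_interests_counts = Counter((p["user"], p["interest"]) for p in pairs)
print(friends_interests_counts)
```

Renaming before the join is what avoids the field-name clash, and dropping the join keys afterwards leaves only the fields the aggregation needs.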
  • 56. Split & aggregate more • Split the stream to group by user, and find the interest that appears most by count • Once the data flow is built, submit and run it!
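
Selecting the most frequent interest per user amounts to a per-group maximum. A minimal plain-Python sketch, assuming the counts arrive as (user, interest, count) triples; the function name and the sample values are made up, and ties are resolved in favor of the first record seen, which is an arbitrary choice.

```python
def top_interest_per_user(counts):
    """Return {user: (interest, count)} with the highest count per user."""
    best = {}
    for user, interest, count in counts:
        if user not in best or count > best[user][1]:
            best[user] = (interest, count)
    return best


friends_interests_counts = [
    ("ann", "jazz", 2),
    ("ann", "chess", 1),
    ("bob", "hiking", 1),
]
recommendations = top_interest_per_user(friends_interests_counts)
print(recommendations)
```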
  • 57. Running the script Local or remote runs • Cascading flows can run locally or on HDFS • Local run for testing • local_run.sh recommendation.py • Remote run in production • remote_deploy.sh -s server.com recommendation.py • The example had 5 MR stages • Although the problem was simple, doing it by hand would have been inconvenient • Diagram labels: Friendship network • User interests • friends_interests_counts • recommendations
  • 58. Some remarks • Benefits • Can use any Java class • Can be mixed with Java code • Can use Python libraries • Caveats • Only pure Python code can be used, no compiled C (numpy, scipy) • But with streaming it’s possible to execute a CPython interpreter • Some idiosyncrasies because of Jython’s representation of basic types • Strings are OK, but Python integers are represented as java.math.BigInteger, so an explicit conversion is needed before yielding (joins!)
  • 59. Contact • http://github.org/twitter/pycascading • http://pycascading.org • @gaborjszabo • gabor@twitter.com • Javadoc: http://cascading.org • Other Cascading-based wrappers • Scalding (Scala), Cascalog (Clojure), Cascading-JRuby (Ruby)
  • 60. Implementation details Challenges due to an interpreted language • We need to make code available on all workers • Java bytecode is easy, same .jar everywhere • Although Jython represents Python functions as classes, they cannot be serialized • We need to start an interpreter on every worker • The Python source of the UDFs is retrieved and shipped in the .jar • Different Hadoop distributions explode the .jar differently, so we need to use the Hadoop distributed cache