In the Hadoop world, writing MapReduce has always been a painful task, especially when you want to create complex workflows with multiple MapReduce jobs and complex dependencies between them. High-level languages like Pig Latin or Hive make writing MapReduce jobs easier, but when you need complex logic you have to write custom functions in Java, which makes testing and debugging difficult. This is where Cascading makes a developer's life easy: everything is written in Java, yet with the ease of writing MapReduce in a high-level language, similar to Pig or Hive. Once the logic is written, it can easily be tested by running in stand-alone mode, since everything is in Java. Not only that, Cascading provides APIs that are independent of Hadoop (or any other framework), which means workflows written in Cascading can be executed on multiple frameworks without any code change, as long as a Cascading connector is available.
4. What is Cascading?
• Abstraction over MapReduce
• API for data-processing workflows
• JVM framework and SDK for creating abstracted data flows
• Translates data flows into actual Hadoop/local jobs
6. Why use Cascading?
• Options in Hadoop
  – MapReduce API
  – Pig
  – Hive
7. Why use Cascading?
• MapReduce
  – Low-level API
  – Requires a lot of boilerplate code
  – Chaining jobs can be difficult
8. Why use Cascading?
• Pig
  – Good for scripting and ad-hoc queries
  – Multi-module workflows can be difficult to manage
  – Can be difficult to debug and unit test
  – UDFs are written in a different language (Java, Python via streaming, etc.)
9. Why use Cascading?
• Hive
  – Good for simple queries
  – Complex workflows can be very difficult to implement
10. Why use Cascading?
• Cascading
  – Helps create higher-level data-processing abstractions
  – Sophisticated data pipelines
  – Re-usable components
15. Cascading Terminologies
• Flow
  – A path for data with some number of inputs, some operations, and some outputs
• Cascade
  – A series of connected flows
• Operation
  – A function applied to data, yielding new data
16. Cascading Terminologies
• Pipe
  – Moves data from one place to another
• Tap
  – Feeds data from outside the flow into it and writes data from inside the flow out of it
17. Pipe Assemblies
• Define work against a tuple stream
• Read from tap sources and write to tap sinks
• Work includes actions such as
  – Filtering
  – Transforming
  – Calculating
• May use multiple sources and sinks
• May define splits, merges, and joins to manipulate tuple streams
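The terminology above can be sketched in code. The following is a minimal sketch, assuming the Cascading 2.x Hadoop-mode API (`Hfs`, `HadoopFlowConnector`); the input and output paths are hypothetical placeholders:

```java
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class CopyFlow {
  public static void main(String[] args) {
    // Source tap: feeds data from outside the flow into it
    Tap source = new Hfs(new TextLine(new Fields("offset", "line")), "input/path");
    // Sink tap: writes data from inside the flow out of it
    Tap sink = new Hfs(new TextLine(), "output/path");

    // Pipe: moves tuples from the source toward the sink
    Pipe copy = new Pipe("copy");

    // Flow: binds the pipe assembly to concrete taps; connect() plans the
    // Hadoop jobs and complete() runs them
    Flow flow = new HadoopFlowConnector(new Properties()).connect(source, sink, copy);
    flow.complete();
  }
}
```

Compiling and running this requires the Cascading and Hadoop jars on the classpath; swapping the connector and taps is what lets the same assembly run on other platforms.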
20. Each and Every Pipe
• new Each( previousPipe, argumentSelector, operation, outputSelector )
• new Every( previousPipe, argumentSelector, operation, outputSelector )
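For example, the classic word count combines both: Each applies a Function to every tuple, while Every applies an Aggregator to each group. A sketch against the Cascading 2.x API (the selector arguments from the constructors above are left at their defaults here; field names are illustrative):

```java
// Assumes imports from cascading.pipe, cascading.operation.aggregator,
// cascading.operation.regex, and cascading.tuple
Pipe assembly = new Pipe("wordcount");

// Each + Function: split the "line" field of every tuple into "word" tuples
assembly = new Each(assembly, new Fields("line"),
    new RegexSplitGenerator(new Fields("word"), "\\s+"));

// Group the stream so an aggregator can run once per key
assembly = new GroupBy(assembly, new Fields("word"));

// Every + Aggregator: count the tuples in each "word" group
assembly = new Every(assembly, new Count(new Fields("count")));
```

Each operates on individual tuples; Every only follows a GroupBy (or CoGroup) and operates on whole groups.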
26. Custom Function
• A function expects a stream of individual tuples
• Returns zero or more tuples
• Used with an Each pipe
• To create a custom function
  – Subclass cascading.operation.BaseOperation
  – Implement cascading.operation.Function
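A sketch of such a custom function, assuming the Cascading 2.x API (the `ToUpperCase` example itself is hypothetical):

```java
import cascading.flow.FlowProcess;
import cascading.operation.BaseOperation;
import cascading.operation.Function;
import cascading.operation.FunctionCall;
import cascading.tuple.Fields;
import cascading.tuple.Tuple;

public class ToUpperCase extends BaseOperation implements Function {

  public ToUpperCase(Fields fieldDeclaration) {
    // expects one incoming argument, declares the output fields
    super(1, fieldDeclaration);
  }

  @Override
  public void operate(FlowProcess flowProcess, FunctionCall functionCall) {
    // read the single incoming argument from the current tuple
    String value = functionCall.getArguments().getString(0);
    if (value != null) {
      // emit one new tuple; emitting nothing would drop the row,
      // which is how a function can return zero or more tuples
      functionCall.getOutputCollector().add(new Tuple(value.toUpperCase()));
    }
  }
}
```

It would then be wired in with an Each pipe, e.g. `new Each(pipe, new Fields("line"), new ToUpperCase(new Fields("upper")))`.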
34. Testing
• Create flows entirely in code on a local machine
• Write tests against controlled sample data sets
• Run tests as regular Java tests without needing access to an actual Hadoop cluster or databases
• Local-machine and CI testing are easy
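A sketch of what such a test can look like, assuming Cascading 2.x with the cascading-local module (`FileTap`, `LocalFlowConnector`) and JUnit; the file paths are illustrative:

```java
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.local.LocalFlowConnector;
import cascading.pipe.Pipe;
import cascading.scheme.local.TextLine;
import cascading.tap.local.FileTap;
import cascading.tuple.Fields;
import org.junit.Test;

public class CopyFlowTest {

  @Test
  public void runsWithoutHadoop() {
    // Taps over plain local files: no HDFS involved
    FileTap source = new FileTap(new TextLine(new Fields("offset", "line")),
        "src/test/data/sample.txt");
    FileTap sink = new FileTap(new TextLine(), "target/test-output.txt");

    // Same pipe assembly as in production, planned by the local connector
    Flow flow = new LocalFlowConnector(new Properties())
        .connect(source, sink, new Pipe("copy"));

    // Runs entirely in-process, so it works on a laptop or a CI machine
    flow.complete();

    // ...then assert on the sink file's contents with plain file IO
  }
}
```

Because only the connector and taps differ, the pipe assembly under test is the same code that runs on the cluster.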
37. Advantages
• Re-usability
  – Pipe assemblies are designed for reuse
  – Once created and tested, use them in other flows
  – Write the logic to do something only once
• Simple Stack
  – Cascading creates the DAG of dependent jobs for us
  – Removes most of the need for Oozie
38. Advantages
• Simpler Stack
  – Keeps track of where a flow fails and can rerun from that point on failure
• Common Code Base
  – Since everything is written in Java, everybody can use the same terms and the same tech
39. Disadvantages
• JVM-based
  – Java / Scala / Clojure
• No built-in job scheduler
  – Can figure out the dependency graph for jobs, but nothing to run them on a regular interval
  – We still need a scheduler like Quartz
• No real built-in monitoring