Simplifying Application Development on Hadoop
WidasConcepts Unternehmensberatung GmbH, Maybachstraße 2, 71299 Wimsheim, http://www.widas.de
Big Data Engineer, WidasConcepts
Vinoth Kannan
Cascading User Group Meet
Berlin, Germany
26.05.2014
What is Hadoop?
“Apache Hadoop is an open-source software framework for storage
and large-scale processing of data-sets on clusters of commodity
hardware.“ (Wikipedia)
Principles of Hadoop:
• Designed for: batch processing
• Possible to: scale horizontally
• Works on: bringing computation to data
Main Features
Reliable and Redundant
• No performance or data loss even on failure
Powerful
• Possible to have huge clusters (largest 40,000 nodes)
• Supports “Best of Breed Analytics“
Scalable
• Linearly scalable with increase in data volume
Cost Efficient
• No need for expensive hardware. Supports commodity hardware
Simple and flexible APIs
• Great ecosystem with a multitude of supporting solutions
Traditional vs. Hadoop
Traditional:
More and larger servers are necessary to accomplish tasks:
• computing capacity
• data capacity
Hadoop:
Instead of upgrading the server, the cluster size is increased with more machines
What is MapReduce?
MapReduce is a programming model for running applications, mostly on Hadoop.
Mapper
• Converts input (K,V) to new (K,V)
Shuffle
• Sorts and groups similar keys with all their values
Reducer
• Translates the values of each unique key to new (K,V)
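The three phases above can be sketched in plain Java, without any Hadoop dependency. This is only an in-memory illustration of the model, assuming a word-count task; the class name `MapReduceSketch` and its methods are hypothetical and are not part of the Hadoop API.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, single-process illustration of Map -> Shuffle -> Reduce.
class MapReduceSketch {

    // Map phase: turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: sort by key and collect all values that share a key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: collapse the values of one key into a single value.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the three phases in order over all input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(mapped).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {hadoop=1, jobs=1, mapreduce=2, runs=2}
        System.out.println(wordCount(Arrays.asList("Hadoop runs MapReduce", "MapReduce runs jobs")));
    }
}
```

On a real cluster the map and reduce calls run in parallel on different machines and the shuffle moves data over the network; the sketch only mirrors the data flow.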
MapReduce Paradigm
[Figure: key-value pairs such as (K1, V1), (K2, V2), (K3, V3) passing through the Map, Shuffle, and Reduce stages]
MapReduce with Multiple Data Sources
Input: HDFS, Cassandra, SQL, HBase
Processing: MapReduce job
Output: HDFS, Neo4j, SQL, MongoDB
Jumping on the Hadoop Bandwagon
Challenges with MapReduce
• Complex jobs that require multiple mappers and reducers
• Chaining multiple MR jobs and scheduling them together
• Wrong level of granularity of MR
• Transforming business rules into the MapReduce paradigm
• Testing and maintaining the code
Growing Opportunities in Hadoop
• With the growing job trends in Hadoop, there is a huge gap in the skillset required to meet the demand
• Huge investment already made by enterprises in existing business processes and training
How to Train Your Elephant?!
Cascading
What is Cascading?
Cascading is an open-source Java framework that provides an application development platform for building data applications on Hadoop.
Developed by Chris Wensel in 2007
Underlying motivation for developing the Cascading Java framework:
• Difficulty for Java developers to write MapReduce code
• MapReduce is based on functional programming elements
Enterprise Data Flow - Challenge
Business Goals
Data Sources
Using existing skillset, business processes and tools
Cascading Building Blocks – High-level Overview
Cascading
MapReduce
HDFS (Distributed Storage)
Cascading in Short
• Functional programming approach to Hadoop
• Alternative and easy API for MapReduce
• Reusable Java components
• Possibility for test-driven development
• Can be used with any JVM-based language: Java, JRuby, Clojure, etc.
Cascading Building Blocks
• Pipes
• Sinks
• Taps
• Flow
Sample Look of Cascading Flow
Source Tap
Sink Tap
Pipe Assembly
Flow
Cascading Pipe Assemblies
Original Tuple Streams → Transformed Tuple Streams
Pipe types:
• Pipe
• Each
• GroupBy
• CoGroup
• Every
• SubAssembly
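To make the roles of the pipe types more concrete, here is a plain-Java analogy of a small assembly: an Each-like step that handles every tuple, a GroupBy-like step that groups on a field, and an Every-like step that aggregates each group. It does not use the Cascading API; `PipeAssemblySketch`, `PriceTuple`, and the sample fuel-price data are hypothetical names and values chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical analogy of a pipe assembly; not Cascading code.
class PipeAssemblySketch {

    // Minimal stand-in for a tuple with two fields: "station" and "price".
    static final class PriceTuple {
        final String station;
        final double price;

        PriceTuple(String station, double price) {
            this.station = station;
            this.price = price;
        }
    }

    // Each-like step: visits every tuple; here it acts as a filter that drops invalid prices.
    static List<PriceTuple> eachValidPrice(List<PriceTuple> in) {
        List<PriceTuple> out = new ArrayList<>();
        for (PriceTuple t : in) {
            if (t.price > 0) {
                out.add(t);
            }
        }
        return out;
    }

    // GroupBy-like step: groups the stream on the "station" field.
    static Map<String, List<Double>> groupByStation(List<PriceTuple> in) {
        Map<String, List<Double>> groups = new TreeMap<>();
        for (PriceTuple t : in) {
            groups.computeIfAbsent(t.station, k -> new ArrayList<>()).add(t.price);
        }
        return groups;
    }

    // Every-like step: runs an aggregation (the average) over every group.
    static Map<String, Double> everyAverage(Map<String, List<Double>> groups) {
        Map<String, Double> out = new TreeMap<>();
        for (Map.Entry<String, List<Double>> g : groups.entrySet()) {
            double sum = 0;
            for (double p : g.getValue()) {
                sum += p;
            }
            out.put(g.getKey(), sum / g.getValue().size());
        }
        return out;
    }

    // The assembly: the output of one step is the input of the next.
    static Map<String, Double> assembly(List<PriceTuple> source) {
        return everyAverage(groupByStation(eachValidPrice(source)));
    }

    public static void main(String[] args) {
        List<PriceTuple> source = Arrays.asList(
                new PriceTuple("A", 1.5),
                new PriceTuple("A", 2.5),
                new PriceTuple("A", -1.0), // invalid reading, removed by the Each-like step
                new PriceTuple("B", 1.0));
        System.out.println(assembly(source));
    }
}
```

In Cascading the same shape is declared once as a chain of pipes and then planned onto MapReduce jobs by the framework, rather than executed step by step in memory as here.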
The quintessential WordCount Example
• Initialize properties and tell Hadoop which jar file to use
• Word-count
Typical Pipe Assembly
CSV
NoSQL
Sequence File
Flow Definition
Flow A
Cascading Multiple Flows
Flow A
Flow E
Flow B
Flow C
Flow D
Flow F
Flow G
Flow H
Cascading Pipe Assemblies
lhs pipe definition
rhs pipe definition
Join lhs & rhs pipes
Join pipe assembly
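The idea of joining a left-hand and a right-hand stream on a common key can be pictured with a small plain-Java inner join. In Cascading such a join is typically expressed with a CoGroup pipe; the sketch below does not use the Cascading API, and `JoinSketch`, its tuple layout, and the sample data are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of joining two tuple streams on a key; not Cascading code.
class JoinSketch {

    // Inner-joins two streams of {key, value} pairs on the key (element 0).
    // Returns {key, lhsValue, rhsValue} rows in lhs order.
    static List<String[]> innerJoin(List<String[]> lhs, List<String[]> rhs) {
        // Index the rhs stream by key so each lhs tuple can look up its matches.
        Map<String, List<String>> rhsByKey = new HashMap<>();
        for (String[] r : rhs) {
            rhsByKey.computeIfAbsent(r[0], k -> new ArrayList<>()).add(r[1]);
        }
        List<String[]> joined = new ArrayList<>();
        for (String[] l : lhs) {
            // Keys without a match on the rhs produce no output (inner join).
            for (String rValue : rhsByKey.getOrDefault(l[0], Collections.<String>emptyList())) {
                joined.add(new String[] {l[0], l[1], rValue});
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String[]> users = Arrays.asList(
                new String[] {"u1", "alice"},
                new String[] {"u2", "bob"});
        List<String[]> events = Arrays.asList(
                new String[] {"u1", "login"},
                new String[] {"u1", "logout"},
                new String[] {"u3", "login"});
        for (String[] row : innerJoin(users, events)) {
            System.out.println(String.join(",", row));
        }
    }
}
```

Indexing one side in a hash map only works when that side fits in memory; on Hadoop the grouping is done by the shuffle so that both sides can be large.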
Cascading Real-world Data Flow Use Cases
Analytics on login information
Analytics from ClickStream Data
Support for Multiple Data Sources
HDFS
Cassandra
MongoDB
ElasticSearch
HBase
Memcached
Neo4j
Solr
ElephantDB
RDBMS
Splunk
http://www.cascading.org/extensions/
Support for Major Serializers
http://www.cascading.org/extensions/
JSON
Avro
Kryo
Thrift
Predictive Models on Hadoop
Cascading Pattern
• Cascading Pattern is a machine learning project within the Cascading development framework used to build enterprise data workflows
• Pattern uses the industry-standard Predictive Model Markup Language (PMML), an XML-based file format developed by the Data Mining Group
• PMML is supported by most of the popular analytical tools such as R, SAS, Teradata, Weka, KNIME, Microsoft, etc.
http://www.dmg.org/
Cascading Pattern on CarbookPlus
• Track trips
• Maintain logbook
• Get notified about best gas stations
• Manage and compare vehicle cost
• Fleet management
• Social platform connecting drivers
www.carbookplus.com
CarbookPlus Fuel Cost Prediction
“MDM: Mobilitäts Daten Marktplatz” is a German federal government organization that provides open data about fuel prices across Germany in real time.
http://www.mdm-portal.de/
Our objective:
• Store the data from MDM in HDFS
• Process and clean the data with Cascading
• Build a model with R, predicting the fuel price trend for the next 7 days & 24 hours
• Export the model as PMML
• Scale out on the Hadoop cluster with Cascading Pattern
• Store the results in MongoDB
Exporting PMML Model from R
Export model as PMML file
Cascading Pattern Flow Definition
Fuel Cost Predictor Result
Algorithms Supported by Cascading Pattern
• Random Forest
• Linear Regression
• Logistic Regression
• K-Means Clustering
• Hierarchical Clustering
• Multinomial Model
https://github.com/cascading/pattern
Future of Cascading
• Cascading Pattern to support more predictive models: Neural Network, Support Vector Machine
• More new features in Cascading 3.0
Cascading 3.0
Execution Engine: Spark, Tez, Storm
YARN (Cluster Resource Management)
HDFS (Distributed Storage)
When do you start?
Questions?
Q & A
Thank you!
Vinoth Kannan
Big Data Engineer
WidasConcepts GmbH
www.widas.de
@WidasConcepts @vinoth4v
/WidasConcepts
vinoth.kannan@widas.de
Credits
www.soundcloud.com
www.concurrentinc.com
www.cascading.org

Cascading User Group Meet

  • 1. Simplifying Application Development on Hadoop WidasConcepts Unternehmensberatung GmbH  Maybachstraße 2  71299 Wimsheim  http://www.widas.de Big Data Engineer, WidasConcepts Vinoth Kannan Cascading User Group Meet Berlin, Germany 26.05.2014
  • 2. 2What is Hadoop? “Apache Hadoop is an open-source software framework for storage and large-scale processing of data-sets on clusters of commodity hardware.“ (Wikipedia) Designed for Possible to: Works on: • Batch Processing • Horizontal Scaling • Bringing Computation to Data Principles of Hadoop:
  • 3. 3Main Features Reliable and Redundant • No performance or data loss even on failure Powerful • Possible to have huge clusters (largest 40,000 nodes) • Supports “Best of Breed Analytics“ Scalable • Linearly scalable with increase in data volume Cost Efficient • No need for expensive hardware. Supports commodity hardware Simple and flexible APIs • Great ecosystem with multitude of solutions to support
  • 4. Traditional vs. Hadoop. Traditional: more and larger servers are necessary to accomplish tasks: • computing capacity • data capacity. Hadoop: instead of upgrading the server, the cluster size is increased with more machines
  • 5. What is MapReduce? MapReduce is a programming model for running applications, mostly on Hadoop. Mapper: • Converts input (K,V) to new (K,V). Shuffle: • Sorts and groups similar keys with all their values. Reducer: • Translates the values of each unique key to new (K,V)
  • 6. MapReduce Paradigm. [Diagram: key-value pairs such as (K1, V1), (K2, V2), (K3, V3), (K5, V5), (K6, V6), (K7, V7) passing through the Map, Shuffle and Reduce stages]
  • 7. MapReduce with Multiple Data Sources. Input: HDFS, Cassandra, SQL, HBase. Processing: MapReduce job. Output: HDFS, Neo4j, SQL, MongoDB
  • 8. Jumping on the Hadoop Bandwagon
  • 9. Challenges with MapReduce: • Complex jobs that require multiple mappers and reducers • Chaining multiple MR jobs and scheduling them together • Wrong level of granularity of MR • Transforming business rules into the MapReduce paradigm • Testing and maintaining the code
  • 10. Growing Opportunities in Hadoop. With the growing job trends in Hadoop, there is a huge gap in the skill set required to meet the demand. Enterprises have already made huge investments in existing business processes and training
  • 11. How to Train Your Elephant?!
  • 13. What is Cascading? Cascading is an open-source Java framework that provides an application development platform for building data applications on Hadoop. Developed by Chris Wensel in 2007. Underlying motivation for developing the Cascading Java framework: • It is difficult for Java developers to write MapReduce code • MapReduce is based on functional programming elements
  • 14. Enterprise Data Flow: Challenge. Business goals. Data sources. Using existing skill sets, business processes and tools
  • 15. Cascading Building Blocks: High-level Overview. Layers: Cascading, MapReduce, HDFS (distributed storage)
  • 16. Cascading in Short: • A functional programming approach to Hadoop • An alternative and easy API for MapReduce • Reusable Java components • Possibility of test-driven development • Can be used with any JVM-based language (Java, JRuby, Clojure, etc.)
  • 18. Sample Look of a Cascading Flow: Source Tap, Pipe Assembly, Sink Tap, Flow
  • 19. Cascading Pipe Assemblies. Original tuple streams, transformed tuple streams. Pipe types: Pipe, Each, GroupBy, CoGroup, Every, SubAssembly
  • 23. The Quintessential WordCount Example: initialize properties and tell Hadoop which jar file to use
  • 24. The Quintessential WordCount Example: word count
  • 25. The Quintessential WordCount Example: word count
  • 26. Typical Pipe Assembly: CSV, NoSQL, Sequence File; Flow Definition; Flow A
  • 27. Cascading Multiple Flows: Flow A, Flow B, Flow C, Flow D, Flow E, Flow F, Flow G, Flow H
  • 28. Cascading Pipe Assemblies: • lhs pipe definition • rhs pipe definition • Join lhs & rhs pipes • Join pipe assembly
  • 29. Cascading Real-World Data Flow Use Cases: • Analytics on login information • Analytics from clickstream data
  • 30. Support for Multiple Data Sources: HDFS, Cassandra, MongoDB, Elasticsearch, HBase, Memcached, Neo4j, Solr, ElephantDB, RDBMS, Splunk. http://www.cascading.org/extensions/
  • 31. Support for Major Serializers: JSON, Avro, Kryo, Thrift. http://www.cascading.org/extensions/
  • 33. Cascading Pattern. Cascading Pattern is a machine learning project within the Cascading development framework, used to build enterprise data workflows. Pattern uses the industry-standard Predictive Model Markup Language (PMML), an XML-based file format developed by the Data Mining Group (http://www.dmg.org/). PMML is supported by most of the popular analytical tools, such as R, SAS, Teradata, Weka, KNIME and Microsoft
  • 34. Cascading Pattern on CarbookPlus (www.carbookplus.com): • Track trips • Maintain logbook • Get notified about the best gas stations • Manage and compare vehicle costs • Fleet management • Social platform connecting drivers
  • 35. CarbookPlus Fuel Cost Prediction. “MDM: Mobilitäts Daten Marktplatz” is a German federal government organization that provides open data about fuel prices across Germany in real time. http://www.mdm-portal.de/ Our objective: • Store the data from MDM in HDFS • Process and clean the data with Cascading • Build a model with R, predicting the fuel price trend for the next 7 days & 24 hours • Export the model as PMML • Scale out on the Hadoop cluster with Cascading Pattern • Store the results in MongoDB
  • 36. Exporting a PMML Model from R: export the model as a PMML file
  • 39. Algorithms Supported by Cascading Pattern: Random Forest, Linear Regression, Logistic Regression, K-Means Clustering, Hierarchical Clustering, Multinomial Model. https://github.com/cascading/pattern
  • 40. Future of Cascading. Cascading Pattern to support more predictive models: • Neural Network • Support Vector Machine. More new features in Cascading 3.0. Stack: Cascading 3.0; execution engine (Spark, Tez, Storm); YARN (cluster resource management); HDFS (distributed storage)
  • 41. When Do You Start?
  • 42. Questions? Q & A. Thank you! Vinoth Kannan, Big Data Engineer, WidasConcepts GmbH. www.widas.de, vinoth.kannan@widas.de, @WidasConcepts, @vinoth4v, /WidasConcepts. Credits: www.soundcloud.com, www.concurrentinc.com, www.cascading.org
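
The Mapper, Shuffle and Reducer roles described on slides 5 and 6 can be illustrated with a small word count in plain Java. This is only a single-process sketch of the paradigm: it does not use the Hadoop or Cascading APIs, and the class and method names (`MapReduceSketch`, `map`, `shuffle`, `reduce`, `wordCount`) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, in-memory illustration of Map -> Shuffle -> Reduce.
// Not Hadoop code: everything runs in one JVM on small lists.
public class MapReduceSketch {

    // Map phase: convert one input line into new (K,V) pairs of the form (word, 1).
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: sort by key and group every key with all of its values.
    public static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: translate the values of one unique key into a single value.
    public static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs the three phases in order over all input lines.
    public static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(mapped).entrySet()) {
            counts.put(group.getKey(), reduce(group.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("Hadoop runs MapReduce", "Cascading runs on Hadoop")));
    }
}
```

On a real cluster the three phases run on different machines and the shuffle moves data across the network. The sketch only shows how the (K,V) pairs change shape from one phase to the next.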

Editor's Notes

  1. Pipe types. Each: defines a Filter or Function that each tuple has to pass through. GroupBy: groups the selected tuple stream by field name; also allows merging. CoGroup: joins on a common set of values; joins can be inner, outer, left or right. Every: applies an aggregator to every group of tuples. SubAssembly: nests reusable pipe assemblies into a Pipe class for inclusion in a larger pipe assembly.
  2. A Scheme defines what is stored in a Tap instance by declaring the Tuple field names, and alternately parsing or rendering the incoming or outgoing Tuple stream, respectively. A Scheme defines the type of resource that data will be sourced from or sunk to.
  3. Meeting Notes (21/05/14 11:43). Pipe. Each: Filters, Functions. GroupBy: Merge. CoGroup: Joins (Left, Right, Inner, Outer). Every: Aggregator (Sum & Count), Buffer. SubAssembly: Nesting reusable pipe
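
The difference between the inner and left joins mentioned for CoGroup can be shown with a small plain-Java sketch. It is not the Cascading API: the class `JoinSketch`, its methods and the sample user and login data are hypothetical, and it assumes at most one tuple per key on each side, which a real CoGroup does not require.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of join semantics on a common key.
// Simplification: each side holds at most one value per key.
public class JoinSketch {

    // Inner join: emit a row only for keys present on both sides.
    public static List<String> innerJoin(Map<String, String> lhs, Map<String, String> rhs) {
        List<String> rows = new ArrayList<>();
        // TreeMap gives a deterministic, key-sorted iteration order.
        for (Map.Entry<String, String> left : new TreeMap<>(lhs).entrySet()) {
            if (rhs.containsKey(left.getKey())) {
                rows.add(left.getKey() + "," + left.getValue() + "," + rhs.get(left.getKey()));
            }
        }
        return rows;
    }

    // Left join: emit a row for every lhs key; a missing rhs value becomes "null".
    public static List<String> leftJoin(Map<String, String> lhs, Map<String, String> rhs) {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, String> left : new TreeMap<>(lhs).entrySet()) {
            String right = rhs.getOrDefault(left.getKey(), "null");
            rows.add(left.getKey() + "," + left.getValue() + "," + right);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up sample data: user id -> name, user id -> last login date.
        Map<String, String> users = Map.of("u1", "alice", "u2", "bob");
        Map<String, String> logins = Map.of("u1", "2014-05-26");
        System.out.println(innerJoin(users, logins));
        System.out.println(leftJoin(users, logins));
    }
}
```

A right join is the same as a left join with the two sides swapped, and an outer join keeps the unmatched keys of both sides.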