2. What is Hadoop?
“Apache Hadoop is an open-source software framework for storage and large-scale processing of data-sets on clusters of commodity hardware.” (Wikipedia)
Principles of Hadoop:
• Designed for batch processing
• Horizontal scaling possible
• Works by bringing computation to the data
3. Main Features
Reliable and Redundant
• No performance or data loss, even on failure
Powerful
• Supports huge clusters (the largest around 40,000 nodes)
• Supports “best of breed” analytics
Scalable
• Scales linearly with increasing data volume
Cost Efficient
• No need for expensive hardware; runs on commodity hardware
Simple and Flexible APIs
• Great ecosystem with a multitude of supporting solutions
4. Traditional vs. Hadoop
Traditional: more and larger servers are necessary to accomplish tasks, in both
• computing capacity
• data capacity
Hadoop: instead of upgrading the server, the cluster size is increased by adding more machines
5. What is MapReduce?
MapReduce is a programming model for running applications, used mostly on Hadoop.
Mapper
• Converts input (K,V) pairs to new (K,V) pairs
Shuffle
• Sorts and groups similar keys with all their values
Reducer
• Translates the values of each unique key to new (K,V) pairs
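The three phases above can be sketched, Hadoop-free, with plain Java collections, using word count as the classic example. The class and method names here are purely illustrative, not part of any Hadoop API:

```java
import java.util.*;
import java.util.stream.*;

// Minimal simulation of the Mapper / Shuffle / Reducer phases (word count).
public class MiniMapReduce {

    // Mapper: converts each input line into (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // Shuffle: sorts and groups identical keys with all their values
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reducer: translates each unique key's list of values into a new (K,V) pair
    static Map<String, Integer> reduce(SortedMap<String, List<Integer>> grouped) {
        Map<String, Integer> result = new LinkedHashMap<>();
        grouped.forEach((k, vs) ->
                result.put(k, vs.stream().mapToInt(Integer::intValue).sum()));
        return result;
    }

    public static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = lines.stream()
                .flatMap(l -> map(l).stream())
                .collect(Collectors.toList());
        return reduce(shuffle(mapped));
    }
}
```

On a real cluster the same three phases run distributed: mappers on the nodes holding the data blocks, a network shuffle, then reducers.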
9. Challenges with MapReduce
Complex jobs require multiple mappers and reducers
Chaining multiple MR jobs and scheduling them together
Wrong level of granularity of MR
Transforming business rules into the MapReduce paradigm
Testing and maintaining the code
10. Growing Opportunities in Hadoop
With the growing job trends in Hadoop, there is a huge gap between the demand and the available skillset
Enterprises have already invested heavily in existing business processes and training
13. What is Cascading?
Cascading is an open-source Java framework that provides an application development platform for building data applications on Hadoop.
Developed by Chris Wensel in 2007.
Underlying motivations for developing the Cascading Java framework:
• Difficulty for Java developers to write MapReduce code
• MapReduce is based on functional programming elements
14. Enterprise Data Flow - Challenge
Connecting business goals to data sources using the existing skillset, business processes and tools
16. Cascading in Short
A functional programming way to Hadoop
An alternative, easy API for MapReduce
Reusable Java components
Enables test-driven development
Can be used with any JVM-based language: Java, JRuby, Clojure, etc.
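For a flavor of the API, here is the classic word-count flow, adapted from Cascading's own tutorial examples. This is a sketch assuming the Cascading 2.x Hadoop API; `docPath` and `wcPath` are placeholder HDFS paths supplied by the caller:

```java
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class WordCount {
  public static void main(String[] args) {
    String docPath = args[0];  // placeholder: input documents
    String wcPath = args[1];   // placeholder: word-count output

    // Taps: where data is sourced from and sinked to
    Tap docTap = new Hfs(new TextDelimited(true, "\t"), docPath);
    Tap wcTap = new Hfs(new TextDelimited(true, "\t"), wcPath);

    // Split each line of text into a stream of "token" tuples
    Fields token = new Fields("token");
    Fields text = new Fields("text");
    Pipe docPipe = new Each("token", text,
        new RegexSplitGenerator(token, "[ \\[\\]\\(\\),.]"), Fields.RESULTS);

    // Group by token and count each group -- no explicit MapReduce code
    Pipe wcPipe = new Pipe("wc", docPipe);
    wcPipe = new GroupBy(wcPipe, token);
    wcPipe = new Every(wcPipe, Fields.ALL, new Count(), Fields.ALL);

    // Wire sources, sinks and pipes into a Flow and run it on the cluster
    FlowDef flowDef = FlowDef.flowDef().setName("wc")
        .addSource(docPipe, docTap)
        .addTailSink(wcPipe, wcTap);
    new HadoopFlowConnector().connect(flowDef).complete();
  }
}
```

Note the difference from raw MapReduce: the developer composes reusable Pipes (Each, GroupBy, Every), and Cascading plans the underlying MR jobs.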
33. Cascading Pattern
Cascading Pattern is a machine-learning project within the Cascading development framework, used to build enterprise data workflows
Pattern uses the industry-standard Predictive Model Markup Language (PMML), an XML-based file format developed by the Data Mining Group
PMML is supported by most popular analytical tools, such as R, SAS, Teradata, Weka, KNIME, Microsoft, etc.
http://www.dmg.org/
34. Cascading Pattern on CarbookPlus
• Track trips
• Maintain a logbook
• Get notified about the best gas stations
• Manage and compare vehicle costs
• Fleet management
• Social platform connecting drivers
www.carbookplus.com
35. CarbookPlus Fuel Cost Prediction
“MDM: Mobilitäts Daten Marktplatz” is a German federal government organization that provides open data about fuel prices across Germany in real time.
http://www.mdm-portal.de/
Our objective:
• Store the data from MDM in HDFS
• Process and clean the data with Cascading
• Build a model with R, predicting the fuel price trend for the next 7 days and 24 hours
• Export the model as PMML
• Scale out on the Hadoop cluster with Cascading Pattern
• Store the results in MongoDB
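The scale-out step can be sketched roughly as below. This assumes the `PMMLPlanner` API described in the Cascading Pattern project documentation; all paths, tap schemes and field names are illustrative placeholders, not the actual CarbookPlus code:

```java
import java.io.File;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pattern.pmml.PMMLPlanner;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class ScoreFuelPrices {
  public static void main(String[] args) {
    // Placeholders: cleaned MDM data in, predictions out
    Tap inputTap = new Hfs(new TextDelimited(true, ","), args[0]);
    Tap classifyTap = new Hfs(new TextDelimited(true, ","), args[1]);

    FlowDef flowDef = FlowDef.flowDef().setName("classifier")
        .addSource("input", inputTap)
        .addSink("classify", classifyTap);

    // The planner turns the PMML model exported from R into a pipe assembly
    PMMLPlanner pmmlPlanner = new PMMLPlanner()
        .setPMMLInput(new File(args[2]))   // placeholder: path to the PMML file
        .retainOnlyActiveIncomingFields()
        .setDefaultPredictedField(new Fields("predict", Double.class));
    flowDef.addAssemblyPlanner(pmmlPlanner);

    // Run the scoring flow across the Hadoop cluster
    new HadoopFlowConnector().connect(flowDef).complete();
  }
}
```

The point of this design: the model is trained once in R, and Pattern replays it at scale without rewriting the scoring logic in Java.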
39. Algorithms Supported by Cascading Pattern
Random Forest
Linear Regression
Logistic Regression
K-Means Clustering
Hierarchical Clustering
Multinomial Model
https://github.com/cascading/pattern
40. Future of Cascading
Cascading Pattern to support more predictive models:
• Neural Network
• Support Vector Machine
More new features in Cascading 3.0:
• Pluggable execution engines: MapReduce, Tez, Spark, Storm
• YARN for cluster resource management
• HDFS for distributed storage
42. Q &amp; A
Questions? Thank you!
Vinoth Kannan
Big Data Engineer, WidasConcepts GmbH
vinoth.kannan@widas.de | @vinoth4v
www.widas.de | @WidasConcepts | /WidasConcepts
Credits: www.soundcloud.com, www.concurrentinc.com, www.cascading.org
Editor's Notes
Pipe operations:
Each – defines a Filter or Function that each tuple has to pass through
GroupBy – groups the selected tuple stream by field name; allows merging
CoGroup – joins on a common set of values; joins can be Inner, Outer, Left or Right
Every – applies an Aggregator to every group of tuples
SubAssembly – nests reusable pipe assemblies into a Pipe class for inclusion in a larger pipe assembly
A Scheme defines what is stored in a Tap instance by declaring the Tuple field names, and parsing or rendering the incoming or outgoing Tuple stream, respectively. A Scheme also defines the type of resource that data will be sourced from or sinked to.