2. What is Hadoop?
“Apache Hadoop is an open-source software framework for storage and large-scale processing of data-sets on clusters of commodity hardware.” (Wikipedia)
Principles of Hadoop:
• Designed for batch processing
• Horizontal scaling possible
• Works by bringing computation to the data
3. Main Features
Reliable and Redundant
• No performance or data loss, even on failure
Powerful
• Supports huge clusters (the largest around 40,000 nodes)
• Supports “best of breed” analytics
Scalable
• Scales linearly with increasing data volume
Cost Efficient
• No need for expensive hardware; runs on commodity hardware
Simple and Flexible APIs
• Great ecosystem with a multitude of supporting solutions
4. Traditional vs. Hadoop
Traditional: more and larger servers are necessary to accomplish tasks, in both
• computing capacity
• data capacity
Hadoop: instead of upgrading the server, the cluster size is increased by adding more machines
5. What is MapReduce?
MapReduce is a programming model for running applications, used mostly on Hadoop.
Mapper
• Converts input (K,V) pairs to new (K,V) pairs
Shuffle
• Sorts and groups similar keys with all their values
Reducer
• Translates the values of each unique key to new (K,V) pairs
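The three phases above can be sketched, Hadoop-free, with plain Java collections, using word count as the classic example. The class and method names here are purely illustrative, not part of any Hadoop API:

```java
import java.util.*;
import java.util.stream.*;

// Minimal simulation of the Mapper / Shuffle / Reducer phases (word count).
public class MiniMapReduce {

    // Mapper: converts each input line into (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // Shuffle: sorts and groups identical keys with all their values
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reducer: translates each unique key's list of values into a new (K,V) pair
    static Map<String, Integer> reduce(SortedMap<String, List<Integer>> grouped) {
        Map<String, Integer> result = new LinkedHashMap<>();
        grouped.forEach((k, vs) ->
                result.put(k, vs.stream().mapToInt(Integer::intValue).sum()));
        return result;
    }

    public static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = lines.stream()
                .flatMap(l -> map(l).stream())
                .collect(Collectors.toList());
        return reduce(shuffle(mapped));
    }
}
```

On a real cluster the same three phases run distributed: mappers on the nodes holding the data blocks, a network shuffle, then reducers.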
9. Challenges with MapReduce
Complex jobs require multiple mappers and reducers
Chaining multiple MR jobs and scheduling them together
Wrong level of granularity of MR
Transforming business rules into the MapReduce paradigm
Testing and maintaining the code
10. Growing Opportunities in Hadoop
With the growing job trends in Hadoop, there is a huge gap between the demand and the available skillset
Enterprises have already invested heavily in existing business processes and training
13. What is Cascading?
Cascading is an open-source Java framework that provides an application development platform for building data applications on Hadoop.
Developed by Chris Wensel in 2007.
Underlying motivations for developing the Cascading Java framework:
• Difficulty for Java developers to write MapReduce code
• MapReduce is based on functional programming elements
14. Enterprise Data Flow - Challenge
Connecting business goals to data sources using the existing skillset, business processes and tools
16. Cascading in Short
A functional programming way to Hadoop
An alternative, easy API for MapReduce
Reusable Java components
Enables test-driven development
Can be used with any JVM-based language: Java, JRuby, Clojure, etc.
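For a flavor of the API, here is the classic word-count flow, adapted from Cascading's own tutorial examples. This is a sketch assuming the Cascading 2.x Hadoop API; `docPath` and `wcPath` are placeholder HDFS paths supplied by the caller:

```java
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class WordCount {
  public static void main(String[] args) {
    String docPath = args[0];  // placeholder: input documents
    String wcPath = args[1];   // placeholder: word-count output

    // Taps: where data is sourced from and sinked to
    Tap docTap = new Hfs(new TextDelimited(true, "\t"), docPath);
    Tap wcTap = new Hfs(new TextDelimited(true, "\t"), wcPath);

    // Split each line of text into a stream of "token" tuples
    Fields token = new Fields("token");
    Fields text = new Fields("text");
    Pipe docPipe = new Each("token", text,
        new RegexSplitGenerator(token, "[ \\[\\]\\(\\),.]"), Fields.RESULTS);

    // Group by token and count each group -- no explicit MapReduce code
    Pipe wcPipe = new Pipe("wc", docPipe);
    wcPipe = new GroupBy(wcPipe, token);
    wcPipe = new Every(wcPipe, Fields.ALL, new Count(), Fields.ALL);

    // Wire sources, sinks and pipes into a Flow and run it on the cluster
    FlowDef flowDef = FlowDef.flowDef().setName("wc")
        .addSource(docPipe, docTap)
        .addTailSink(wcPipe, wcTap);
    new HadoopFlowConnector().connect(flowDef).complete();
  }
}
```

Note the difference from raw MapReduce: the developer composes reusable Pipes (Each, GroupBy, Every), and Cascading plans the underlying MR jobs.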
33. Cascading Pattern
Cascading Pattern is a machine-learning project within the Cascading development framework, used to build enterprise data workflows
Pattern uses the industry-standard Predictive Model Markup Language (PMML), an XML-based file format developed by the Data Mining Group
PMML is supported by most popular analytical tools, such as R, SAS, Teradata, Weka, KNIME, Microsoft, etc.
http://www.dmg.org/
34. Cascading Pattern on CarbookPlus
• Track trips
• Maintain a logbook
• Get notified about the best gas stations
• Manage and compare vehicle costs
• Fleet management
• Social platform connecting drivers
www.carbookplus.com
35. CarbookPlus Fuel Cost Prediction
“MDM: Mobilitäts Daten Marktplatz” is a German federal government organization that provides open data about fuel prices across Germany in real time.
http://www.mdm-portal.de/
Our objective:
• Store the data from MDM in HDFS
• Process and clean the data with Cascading
• Build a model with R, predicting the fuel price trend for the next 7 days and 24 hours
• Export the model as PMML
• Scale out on the Hadoop cluster with Cascading Pattern
• Store the results in MongoDB
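The scale-out step can be sketched roughly as below. This assumes the `PMMLPlanner` API described in the Cascading Pattern project documentation; all paths, tap schemes and field names are illustrative placeholders, not the actual CarbookPlus code:

```java
import java.io.File;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pattern.pmml.PMMLPlanner;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class ScoreFuelPrices {
  public static void main(String[] args) {
    // Placeholders: cleaned MDM data in, predictions out
    Tap inputTap = new Hfs(new TextDelimited(true, ","), args[0]);
    Tap classifyTap = new Hfs(new TextDelimited(true, ","), args[1]);

    FlowDef flowDef = FlowDef.flowDef().setName("classifier")
        .addSource("input", inputTap)
        .addSink("classify", classifyTap);

    // The planner turns the PMML model exported from R into a pipe assembly
    PMMLPlanner pmmlPlanner = new PMMLPlanner()
        .setPMMLInput(new File(args[2]))   // placeholder: path to the PMML file
        .retainOnlyActiveIncomingFields()
        .setDefaultPredictedField(new Fields("predict", Double.class));
    flowDef.addAssemblyPlanner(pmmlPlanner);

    // Run the scoring flow across the Hadoop cluster
    new HadoopFlowConnector().connect(flowDef).complete();
  }
}
```

The point of this design: the model is trained once in R, and Pattern replays it at scale without rewriting the scoring logic in Java.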
39. Algorithms Supported by Cascading Pattern
Random Forest
Linear Regression
Logistic Regression
K-Means Clustering
Hierarchical Clustering
Multinomial Model
https://github.com/cascading/pattern
40. Future of Cascading
Cascading Pattern to support more predictive models:
• Neural Network
• Support Vector Machine
More new features in Cascading 3.0:
• Pluggable execution engines: MapReduce, Tez, Spark, Storm
• YARN for cluster resource management
• HDFS for distributed storage
42. Q &amp; A
Questions? Thank you!
Vinoth Kannan
Big Data Engineer, WidasConcepts GmbH
vinoth.kannan@widas.de | @vinoth4v
www.widas.de | @WidasConcepts | /WidasConcepts
Credits: www.soundcloud.com, www.concurrentinc.com, www.cascading.org
Editor's Notes
Pipe operations:
Each – defines a Filter or Function that each tuple has to pass through
GroupBy – groups the selected tuple stream by field name; allows merging
CoGroup – joins on a common set of values; joins can be Inner, Outer, Left or Right
Every – applies an Aggregator to every group of tuples
SubAssembly – nests reusable pipe assemblies into a Pipe class for inclusion in a larger pipe assembly
A Scheme defines what is stored in a Tap instance by declaring the Tuple field names, and parsing or rendering the incoming or outgoing Tuple stream, respectively. A Scheme also defines the type of resource that data will be sourced from or sinked to.