Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
1.
2. Josh Patterson
Email: josh@floe.tv
Twitter: @jpatanooga
Github: https://github.com/jpatanooga
Past
Published in IAAI-09: “TinyTermite: A Secure Routing Algorithm”
Grad work in Meta-heuristics, Ant-algorithms
Tennessee Valley Authority (TVA): Hadoop and the Smartgrid
Cloudera: Principal Solution Architect
Today
Independent Consultant
5. The World as Optimization
Data tells us about our model/engine/product
We take this data and evolve our product towards a
state of minimal market error
WSJ Special Section, Monday March 11, 2013
Zynga changing games based on player behavior
UPS cut fuel consumption by 8.4MM gallons
Ford used sentiment analysis to look at how new car
features would be received
6. The Modern Data Landscape
Apps are coming but they need
Platforms
Components
Workflows
Lots of investment in Hadoop in this space
Lots of ETL pipelines
Lots of descriptive Statistics
Growing interest in Machine Learning
7. Hadoop as The Linux of Data
Hadoop has won the Cycle
Gartner: Hadoop will be in 2/3s of advanced analytics products by 2015 [1]
“Hadoop is the kernel of a distributed operating system, and all the other components around the kernel are now arriving on this stage” ---Doug Cutting
8. Today’s Hadoop ML Pipeline
Data cleansing / ETL performed with Hive or Pig
Data In Place Processed
Mahout
R
Custom MapReduce Algorithm
Or Externally Processed
SAS
SPSS
KXEN
Weka
9. As Focus Shifts to Applications
Data rates have been climbing fast
Speed at Scale becomes the new Killer App
Companies will want to leverage the Big Data
infrastructure they’ve already been working with
Hadoop
HDFS as main storage system
A drive to validate big data investments with results
Emergence of applications which create “data products”
10. Patterson’s Law
“As the percent of your total data held in a storage system approaches 100%, the amount of in-system processing and analytics also approaches 100%”
11. Tools Will Move onto Hadoop
Already seeing this with Vendors
Who hasn’t announced a SQL engine on Hadoop
lately?
Trend will continue with machine learning tools
Mahout was the beginning
More are following
But what about parallel iterative algorithms?
12. Distributed Systems Are Hard
Lots of moving parts
Especially as these applications become more complicated
Machine learning can be a non-trivial operation
We need great building blocks that work well together
I agree with Jimmy Lin [3]: “keep it simple”
“make sure costs don’t outweigh benefits”
Minimize “Yet Another Tool To Learn” (YATTL) as much as
we can!
13. To Summarize
Data moving into Hadoop everywhere
Patterson’s Law
Focus on Hadoop, build around the next-gen “Linux of data”
Need simple components to build next-gen data apps
They should work cleanly with the cluster the Fortune 500 already has: Hadoop
Also should be easy to integrate into Hadoop and with the Hadoop tool ecosystem
Minimize YATTL
14.
15. Linear Regression
In linear regression, data is
modeled using linear predictor
functions
unknown model parameters are
estimated from the data.
We use optimization techniques like Stochastic Gradient Descent to find the coefficients in the model
Y = (c0*1) + (c1*x1) + … + (cN*xN), where the leading 1 is the constant (intercept) feature
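As a concrete illustration of that predictor function, here is a minimal Java sketch (illustrative only, not Metronome's actual classes) that evaluates the linear combination of coefficients and inputs, with the constant feature fixed at 1.0 so the first coefficient acts as the intercept.

// Minimal sketch: evaluating a linear regression hypothesis
// y = c0*x0 + c1*x1 + ... + cN*xN, with x[0] fixed to 1.0 so that
// c[0] acts as the intercept. Hypothetical class, for illustration.
public class LinearModel {

    private final double[] coefficients; // c0..cN

    public LinearModel(double[] coefficients) {
        this.coefficients = coefficients;
    }

    /** Dot product of coefficients and the feature vector (x[0] == 1.0). */
    public double predict(double[] x) {
        double y = 0.0;
        for (int i = 0; i < coefficients.length; i++) {
            y += coefficients[i] * x[i];
        }
        return y;
    }

    public static void main(String[] args) {
        LinearModel model = new LinearModel(new double[] {0.5, 2.0, -1.0});
        double[] x = {1.0, 3.0, 4.0}; // x[0] = 1.0 is the bias feature
        System.out.println(model.predict(x)); // 0.5 + 6.0 - 4.0 = 2.5
    }
}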
17. Stochastic Gradient Descent
Hypothesis about data
Cost function
Update function
Andrew Ng’s Tutorial: https://class.coursera.org/ml/lecture/preview_view/11
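For reference, the hypothesis, cost, and per-example update the slide alludes to, in the standard formulation used in Andrew Ng's course (learning rate alpha, m training examples), can be written as:

h_\theta(x) = \theta^T x
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
\theta_j := \theta_j - \alpha \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \quad \text{(stochastic update, one example at a time)}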
18. Stochastic Gradient Descent
Training Data -> Training -> Model
Simple gradient descent procedure
Loss function needs to be convex (with exceptions)
Linear Regression SGD
Loss function: squared error of prediction
Prediction: linear combination of coefficients and input variables
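Putting the pieces of the slide together, a single stochastic update for squared-error linear regression might look like the following Java sketch (a hypothetical helper, not Metronome's implementation):

// Sketch of one SGD step for linear regression with squared-error loss.
public final class SgdStep {

    private SgdStep() {}

    /**
     * Updates the coefficient vector in place for one training example.
     * The gradient of 0.5 * (prediction - y)^2 w.r.t. c[j] is (prediction - y) * x[j].
     */
    public static void update(double[] c, double[] x, double y, double learningRate) {
        double prediction = 0.0;
        for (int j = 0; j < c.length; j++) {
            prediction += c[j] * x[j];
        }
        double error = prediction - y;
        for (int j = 0; j < c.length; j++) {
            c[j] -= learningRate * error * x[j];
        }
    }
}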
19.
Mahout’s SGD
Currently Single Process
Multi-threaded parallel, but not cluster parallel
Runs locally, not deployed to the cluster
Tied to logistic regression implementation
20. Current Limitations
Sequential algorithms on a single node only go so far
The “Data Deluge”
Presents algorithmic challenges when combined with
large data sets
need to design algorithms that are able to perform in a
distributed fashion
MapReduce only fits certain types of algorithms
21.
Distributed Learning Strategies
McDonald, 2010
Distributed Training Strategies for the Structured
Perceptron
Langford, 2007
Vowpal Wabbit
Jeff Dean’s Work on Parallel SGD
Downpour SGD
Sandblaster
23. YARN
Yet Another Resource Negotiator
Framework for scheduling distributed applications
Allows for any type of parallel application to run natively on Hadoop
MRv2 is now a distributed application
[Architecture diagram: Clients submit jobs to the ResourceManager; an Application Master runs in a container on a NodeManager, sends resource requests and status (e.g. MapReduce status) to the ResourceManager, and launches containers on the NodeManagers, which report node status back]
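To show what "running natively on Hadoop" means in practice, here is a heavily simplified Java sketch of the ApplicationMaster side of a YARN application using the Hadoop 2.x client API: it registers with the ResourceManager and requests worker containers. It is illustrative only, not IterativeReduce's actual ApplicationMaster.

// Simplified ApplicationMaster sketch: register with the ResourceManager
// and ask for a few worker containers. Illustrative only.
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SimpleAppMaster {
    public static void main(String[] args) throws Exception {
        AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
        rmClient.init(new YarnConfiguration());
        rmClient.start();

        // Tell the ResourceManager this ApplicationMaster is alive.
        rmClient.registerApplicationMaster("", 0, "");

        // Ask for three 1 GB / 1 vcore containers for the workers.
        Resource capability = Resource.newInstance(1024, 1);
        Priority priority = Priority.newInstance(0);
        for (int i = 0; i < 3; i++) {
            rmClient.addContainerRequest(new ContainerRequest(capability, null, null, priority));
        }

        // A real application would now loop on rmClient.allocate(progress),
        // launch its worker code in the granted containers, and finally
        // unregister when the job completes.
    }
}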
24.
IterativeReduce
Designed specifically for parallel iterative
algorithms on Hadoop
Implemented directly on top of YARN
Intrinsic Parallelism
Easier to focus on problem
Not focusing on the distributed application part
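The programming model can be pictured as a master/worker contract. The interfaces below are hypothetical, invented purely for illustration of that contract; the real API lives in the IterativeReduce / Metronome sources on GitHub.

// Hypothetical master/worker contract for a parallel-iterative framework.
import java.util.List;

interface Worker<MODEL> {
    /** Run one local pass over this worker's data split and return a partial model. */
    MODEL computeLocalPass();

    /** Replace the local model with the new global model from the master. */
    void applyGlobalModel(MODEL global);
}

interface Master<MODEL> {
    /** Combine the workers' partial models (e.g. by averaging) into a global model. */
    MODEL merge(List<MODEL> partialModels);
}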
26.
SGD Master
Collects all parameter vectors at each pass /
superstep
Produces new global parameter vector
By averaging workers’ vectors
Sends update to all workers
Workers replace local parameter vector with new
global parameter vector
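The averaging step described above is just an element-wise mean of the workers' coefficient vectors; a minimal Java sketch (not the actual Metronome code):

// Sketch of the master's parameter averaging: element-wise mean of the
// workers' coefficient vectors, producing the new global parameter vector.
import java.util.List;

public final class ParameterAveraging {

    private ParameterAveraging() {}

    public static double[] average(List<double[]> workerVectors) {
        int dims = workerVectors.get(0).length;
        double[] global = new double[dims];
        for (double[] v : workerVectors) {
            for (int j = 0; j < dims; j++) {
                global[j] += v[j];
            }
        }
        for (int j = 0; j < dims; j++) {
            global[j] /= workerVectors.size();
        }
        return global;
    }
}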
27.
SGD Worker
Each given a split of the total dataset
Similar to a map task
Performs local SGD pass
Local parameter vector sent to master at
superstep
Stays active/resident between iterations
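Correspondingly, a worker's job per superstep is one local SGD pass over its split, after which it hands its parameter vector to the master. This sketch reuses the hypothetical SgdStep helper from the earlier slide and is an illustration, not the actual worker implementation.

// Sketch of a worker's local pass for one superstep.
public class SgdWorkerPass {
    public static double[] localPass(double[] coefficients,
                                     double[][] split,   // feature vectors in this worker's split
                                     double[] labels,
                                     double learningRate) {
        for (int i = 0; i < split.length; i++) {
            SgdStep.update(coefficients, split[i], labels[i], learningRate);
        }
        return coefficients; // sent to the master at the superstep
    }
}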
28. SGD: Serial vs Parallel
[Diagram: Training Data is divided into Split 1 … Split N; Worker 1 … Worker N each compute a Partial Model; the Master merges them into a Global Model]
29. Parallel Linear Regression with IterativeReduce
Based directly on work we did with Knitting Boar
Parallel logistic regression
Scales linearly with input size
Can produce a linear regression model from large amounts of data
Packaged in a new suite of parallel iterative algorithms
called Metronome
100% Java, ASF 2.0 Licensed, on github
30. Unit Testing and IRUnit
Simulates the IterativeReduce parallel framework
Uses the same app.properties file that YARN applications do
Examples
https://github.com/jpatanooga/Metronome/blob/master/src/test/java/tv/floe/metronome/linearregression/iterativereduce/TestSimulateLinearRegressionIterativeReduce.java
https://github.com/jpatanooga/KnittingBoar/blob/master/src/test/java/com/cloudera/knittingboar/sgd/iterativereduce/TestKnittingBoar_IRUnitSim.java
31.
32. Running the Job via YARN
Build with Maven
Copy Jar to host with cluster access
Copy dataset to HDFS
Run job
yarn jar iterativereduce-0.1-SNAPSHOT.jar app.properties
33. Results
[Chart: Linear Regression - Parallel vs Serial. X axis: Megabytes Processed Total (64 to 320); Y axis: Total Processing Time (0 to 200); series: Parallel Runs and Serial Runs]
34. Lessons Learned
Linear scale continues to be achieved with
parameter averaging variations
Tuning is critical
Need to be good at selecting a learning rate
YARN still experimental, has caveats
Container allocation is still slow
Metronome continues to be experimental
35. Special Thanks
Michael Katzenellenbogen
Dr. James Scott
University of Texas at Austin
Dr. Jason Baldridge
University of Texas at Austin
36. Future Directions
More testing, stability
Cache vectors in memory for speed
Metronome
Take on properties of LibLinear
Pluggable optimization, general linear models
YARN-centric first class Hadoop citizen
Focus on being a complement to Mahout
K-means, PageRank implementations
38. References
1. http://www.infoworld.com/d/business-intelligence/gartner-hadoop-will-be-in-two-thirds-of-advanced-analytics-products-2015-211475
2. https://cwiki.apache.org/MAHOUT/logistic-regression.html
3. MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That’s Not a Nail! http://arxiv.org/pdf/1209.2191.pdf
Editor’s Notes
Reference some thoughts on attribution pipelines
Talk about how you normally would use the Normal equation, notes from Andrew Ng
“Unlikely optimization algorithms such as stochastic gradient descent show amazing performance for large-scale problems.” (Bottou, 2010). SGD has been around for decades, yet recently Langford, Bottou, and others have shown impressive speed increases. SGD has been shown to train multiple orders of magnitude faster than batch-style learners, with no loss in model accuracy.
The most important additions in Mahout’s SGD are: confidence-weighted learning rates per term, evolutionary tuning of hyper-parameters, mixed ranking and regression, and grouped AUC. The implication of it being local is that you are limited to the compute capacity of the local machine, as opposed to even a single machine on the cluster.
At current disk bandwidth and capacity (2TB at 100MB/s throughput), it takes about 6 hours to read the contents of a single hard drive.
Bottou is similar to Xu (2010) in the 2010 paper.
Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures. Acyclic data flow is a powerful abstraction, but it is not efficient for applications that repeatedly reuse a working set of data, such as iterative algorithms (many in machine learning). No single programming model or framework can excel at every problem; there are always tradeoffs between simplicity, expressivity, fault tolerance, performance, etc.
Performance is still largely dependent on the implementation of the algorithm.
POLR: Parallel Online Logistic Regression. Talking points: we wanted to start with a tool known to the Hadoop community, with expected characteristics. Mahout’s SGD is well known, and so we used that as a base point.