Josh Patterson
 Email: josh@floe.tv
 Twitter: @jpatanooga
 Github: https://github.com/jpatanooga
 Past
   Published in IAAI-09: “TinyTermite: A Secure Routing Algorithm”
   Grad work in Meta-heuristics, Ant-algorithms
   Tennessee Valley Authority (TVA): Hadoop and the Smartgrid
   Cloudera: Principal Solution Architect
 Today
   Independent Consultant
Sections

1. Modern Data Analytics
2. Parallel Linear Regression
3. Performance and Results
The World as Optimization
 Data tells us about our model/engine/product
   We take this data and evolve our product towards a
   state of minimal market error
 WSJ Special Section, Monday March 11, 2013
   Zynga changing games based off player behavior
   UPS cut fuel consumption by 8.4MM gallons
   Ford used sentiment analysis to look at how new car
   features would be received
The Modern Data Landscape
 Apps are coming but they need
   Platforms
   Components
   Workflows
 Lots of investment in Hadoop in this space
   Lots of ETL pipelines
   Lots of descriptive Statistics
   Growing interest in Machine Learning
Hadoop as The Linux of Data
 Hadoop has won the Cycle
   Gartner: Hadoop will be in 2/3s of advanced analytics
   products by 2015 [1]
 “Hadoop is the kernel of a distributed operating system, and all
 the other components around the kernel are now arriving on
 this stage”
   ---Doug Cutting
Today’s Hadoop ML Pipeline
 Data cleansing / ETL performed with Hive or Pig
 Data processed in place
    Mahout
    R
    Custom MapReduce algorithm
 Or processed externally
    SAS
    SPSS
    KXEN
    Weka
As Focus Shifts to Applications

 Data rates have been climbing fast

   Speed at Scale becomes the new Killer App
 Companies will want to leverage the Big Data
 infrastructure they’ve already been working with

   Hadoop
   HDFS as main storage system
 A drive to validate big data investments with results

   Emergence of applications which create “data products”
Patterson’s Law

“As the percent of your total data held
in a storage system approaches 100%
the amount of in-system processing
and analytics also approaches 100%”
Tools Will Move onto Hadoop

 Already seeing this with Vendors
  Who hasn’t announced a SQL engine on Hadoop
  lately?
 Trend will continue with machine learning tools
  Mahout was the beginning
  More are following
  But what about parallel iterative algorithms?
Distributed Systems Are Hard
 Lots of moving parts
   Especially as these applications become more complicated
 Machine learning can be a non-trivial operation
   We need great building blocks that work well together
 I agree with Jimmy Lin [3]: “keep it simple”
   “make sure costs don’t outweigh benefits”
 Minimize “Yet Another Tool To Learn” (YATTL) as much as
 we can!
To Summarize
 Data moving into Hadoop everywhere
   Patterson’s Law
   Focus on Hadoop, build around next-gen “Linux of data”
 Need simple components to build next-gen data base apps
   They should work cleanly with the cluster that the Fortune
   500 has: Hadoop
   Also should be easy to integrate into Hadoop and with the
   hadoop-tool ecosystem
   Minimize YATTL
Linear Regression
 In linear regression, data is modeled using linear predictor
 functions
   Unknown model parameters are estimated from the data
 We use optimization techniques like Stochastic Gradient Descent
 to find the coefficients in the model


  Y = (c0*x0) + (c1*x1) + … + (cN*xN), with x0 = 1 for the intercept term
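A minimal sketch of the linear predictor in Python. The names `predict`, `coefficients`, and `features` are illustrative and not taken from Metronome. The sketch assumes the intercept is stored as the first coefficient and paired with a constant input x0 = 1.

```python
def predict(coefficients, features):
    """Return the linear combination c0*1 + c1*x1 + ... + cN*xN.

    coefficients: [c0, c1, ..., cN], where c0 is the intercept (assumption).
    features:     [x1, ..., xN], without the constant x0 = 1 term.
    """
    if len(coefficients) != len(features) + 1:
        raise ValueError("expected one more coefficient than features")
    inputs = [1.0] + list(features)  # prepend the constant x0 = 1
    return sum(c * x for c, x in zip(coefficients, inputs))


# Hypothetical model: y = 2 + 3*x1 - 1*x2
print(predict([2.0, 3.0, -1.0], [4.0, 5.0]))  # prints 9.0
```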
     Machine Learning and Optimization

      Algorithms

      (Convergent) Iterative Methods

        Newton’s Method
        Quasi-Newton
        Gradient Descent
      Heuristics

        AntNet
        PSO
        Genetic Algorithms
        Stochastic Gradient Descent

         Hypothesis about data

         Cost function

         Update function
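For linear regression with squared-error loss, these three pieces can be written out as follows. This is the standard textbook formulation rather than a transcription of the slide. The symbols are notation assumed here: θ is the parameter vector, α the learning rate, and (x^(i), y^(i)) a single training example.

```latex
\begin{align}
h_\theta(x) &= \sum_{j=0}^{N} \theta_j x_j, \qquad x_0 = 1 \\
J(\theta) &= \frac{1}{2}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2 \\
\theta_j &\leftarrow \theta_j - \alpha\,\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\,x_j^{(i)}
\end{align}
```

The update line is the gradient of the per-example cost with respect to θ_j, scaled by the learning rate.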




     Andrew Ng’s Tutorial:
     https://class.coursera.org/ml/lecture/preview_view/11
     Stochastic Gradient Descent
     Training
       Simple gradient descent procedure
       Loss function needs to be convex
       (with exceptions)
     Linear Regression
       Loss Function: squared error of
       prediction
       Prediction: linear combination of
       coefficients and input variables
     [Diagram: Training Data -> SGD -> Model]
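The training procedure described here can be sketched as a serial, single-process loop. This is an illustrative Python version, not Mahout's or Metronome's implementation. The function name, default learning rate, number of passes, and toy data are all assumptions made for the example.

```python
import random


def sgd_linear_regression(examples, learning_rate=0.01, passes=500, seed=0):
    """Fit coefficients [c0, c1, ..., cN] by stochastic gradient descent.

    examples: list of (features, target) pairs; features excludes the
    constant term, which is added here as x0 = 1.
    """
    rng = random.Random(seed)          # fixed seed keeps the run repeatable
    n_features = len(examples[0][0])
    coefficients = [0.0] * (n_features + 1)
    order = list(range(len(examples)))
    for _ in range(passes):
        rng.shuffle(order)             # visit examples in random order
        for i in order:
            features, target = examples[i]
            inputs = [1.0] + list(features)
            prediction = sum(c * x for c, x in zip(coefficients, inputs))
            error = prediction - target
            # Gradient of 0.5 * error**2 with respect to each coefficient
            coefficients = [c - learning_rate * error * x
                            for c, x in zip(coefficients, inputs)]
    return coefficients


# Noise-free toy data from y = 1 + 2*x (made up for illustration)
data = [([float(x)], 1.0 + 2.0 * x) for x in range(10)]
model = sgd_linear_regression(data)
print([round(c, 3) for c in model])
```

On this toy data the fitted coefficients should come out close to the generating values 1 and 2.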
     Mahout’s SGD
      Currently Single Process
       Multi-threaded parallel, but not cluster parallel
       Runs locally, not deployed to the cluster
       Tied to logistic regression implementation
     Current Limitations
     Sequential algorithms on a single node only go so
     far
     The “Data Deluge”
      Presents algorithmic challenges when combined with
      large data sets
      Need to design algorithms that are able to perform in a
      distributed fashion
     MapReduce only fits certain types of algorithms
     Distributed Learning Strategies

      McDonald, 2010
        Distributed Training Strategies for the Structured
        Perceptron
      Langford, 2007
        Vowpal Wabbit
      Jeff Dean’s Work on Parallel SGD
        DownPour SGD
        Sandblaster
     MapReduce vs. Parallel Iterative
     [Diagram, MapReduce: Input -> Map, Map, Map -> Reduce, Reduce -> Output]
     [Diagram, Parallel Iterative: Processor, Processor, Processor -> Superstep 1 -> Processor, Processor, Processor -> Superstep 2 -> . . .]
     YARN
     Yet Another Resource Negotiator
     Framework for scheduling distributed applications
       Allows for any type of parallel application to run natively on
       Hadoop
       MRv2 is now a distributed application
     [Diagram: YARN architecture. Components: Client (x2), Resource Manager, Node Manager (x3), App Mstr (x2), Container (x4). Arrow legend: MapReduce Status, Job Submission, Node Status, Resource Request]
     IterativeReduce
      Designed specifically for parallel iterative
      algorithms on Hadoop
        Implemented directly on top of YARN
      Intrinsic Parallelism
        Easier to focus on problem
        Not focusing on the distributed application part
     IterativeReduce API
      ComputableMaster
       Setup()
       Compute()
       Complete()
      ComputableWorker
       Setup()
       Compute()
     [Diagram: Worker, Worker, Worker -> Master -> Worker, Worker, Worker -> Master -> . . .]
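The master/worker contract can be pictured with a small single-process simulation. This is a hypothetical Python analog written for illustration. The actual IterativeReduce interfaces are Java, and their exact signatures are not reproduced here. The class names mirror the slide, while `run_supersteps`, the worker `update` hook, and the toy `MeanMaster` and `HalfwayWorker` classes are invented for this sketch.

```python
from abc import ABC, abstractmethod


class ComputableWorker(ABC):
    """Worker side: set up once, then compute once per superstep."""

    @abstractmethod
    def setup(self, split): ...

    @abstractmethod
    def compute(self): ...

    @abstractmethod
    def update(self, global_state): ...  # assumed hook for the master's reply


class ComputableMaster(ABC):
    """Master side: combine worker results each superstep, then finish."""

    @abstractmethod
    def setup(self): ...

    @abstractmethod
    def compute(self, worker_results): ...

    @abstractmethod
    def complete(self): ...


def run_supersteps(master, workers, splits, supersteps):
    """Single-process stand-in for the framework's superstep loop."""
    master.setup()
    for worker, split in zip(workers, splits):
        worker.setup(split)
    for _ in range(supersteps):
        results = [w.compute() for w in workers]  # parallel on a real cluster
        global_state = master.compute(results)    # synchronization barrier
        for w in workers:
            w.update(global_state)
    return master.complete()


class HalfwayWorker(ComputableWorker):
    """Toy worker: moves its estimate halfway toward its split's mean."""

    def setup(self, split):
        self.target = sum(split) / len(split)
        self.estimate = 0.0

    def compute(self):
        self.estimate += 0.5 * (self.target - self.estimate)
        return self.estimate

    def update(self, global_state):
        self.estimate = global_state


class MeanMaster(ComputableMaster):
    """Toy master: averages the workers' estimates."""

    def setup(self):
        self.state = 0.0

    def compute(self, worker_results):
        self.state = sum(worker_results) / len(worker_results)
        return self.state

    def complete(self):
        return self.state


result = run_supersteps(MeanMaster(), [HalfwayWorker(), HalfwayWorker()],
                        [[1, 2, 3], [7, 8, 9]], supersteps=20)
print(result)
```

With split means of 2 and 8, the shared estimate approaches 5 as the supersteps proceed. The point of the sketch is the shape of the loop: the user writes only the `compute` logic, and the driver owns the iteration and synchronization.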
     SGD Master
      Collects all parameter vectors at each pass /
      superstep
      Produces new global parameter vector
       By averaging workers’ vectors
      Sends update to all workers
       Workers replace local parameter vector with new
       global parameter vector
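The averaging step on the master can be sketched in a few lines of Python. The function name and the sample vectors are hypothetical, and plain lists stand in for whatever vector type the real implementation uses.

```python
def average_parameter_vectors(worker_vectors):
    """Return the element-wise mean of the workers' parameter vectors."""
    if not worker_vectors:
        raise ValueError("need at least one worker vector")
    length = len(worker_vectors[0])
    if any(len(v) != length for v in worker_vectors):
        raise ValueError("all parameter vectors must have the same length")
    count = len(worker_vectors)
    return [sum(v[j] for v in worker_vectors) / count for j in range(length)]


# Three hypothetical workers reporting [intercept, slope] after one pass
global_vector = average_parameter_vectors([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(global_vector)  # prints [3.0, 4.0]
```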
     SGD Worker
     Each given a split of the total dataset
       Similar to a map task
     Performs local SGD pass

     Local parameter vector sent to master at
     superstep

     Stays active/resident between iterations
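A worker along these lines might look like the following Python sketch. The `SGDWorker` class, its method names, the learning rate, and the toy data are assumptions for illustration. The in-line averaging is only a stand-in for the master so that the example runs in one process.

```python
class SGDWorker:
    """Holds one split of the data and a local parameter vector."""

    def __init__(self, split, n_features, learning_rate=0.01):
        self.split = split                       # list of (features, target)
        self.params = [0.0] * (n_features + 1)   # index 0 is the intercept
        self.learning_rate = learning_rate

    def compute(self):
        """Run one SGD pass over the local split; return the local vector."""
        for features, target in self.split:
            inputs = [1.0] + list(features)
            error = sum(p * x for p, x in zip(self.params, inputs)) - target
            self.params = [p - self.learning_rate * error * x
                           for p, x in zip(self.params, inputs)]
        return list(self.params)

    def update(self, global_params):
        """Replace the local vector with the master's global vector."""
        self.params = list(global_params)


# Toy data from y = 1 + 2*x, dealt out to two workers (made up for illustration)
data = [([float(x)], 1.0 + 2.0 * x) for x in range(10)]
workers = [SGDWorker(data[0::2], n_features=1),
           SGDWorker(data[1::2], n_features=1)]
global_params = [0.0, 0.0]
for superstep in range(1000):
    local_vectors = [w.compute() for w in workers]
    # Stand-in for the master: element-wise average of the workers' vectors
    global_params = [sum(col) / len(col) for col in zip(*local_vectors)]
    for w in workers:
        w.update(global_params)
print([round(p, 3) for p in global_params])
```

Because the worker objects persist across supersteps, their splits are loaded once and reused, which is the practical benefit of staying resident between iterations.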
     SGD: Serial vs Parallel
     [Diagram, serial: Training Data -> Model]
     [Diagram, parallel: Training Data -> Split 1, Split 2, Split 3 -> Worker 1, Worker 2, … Worker N -> Partial Model (one per worker) -> Master -> Global Model]
Parallel Linear Regression with IterativeReduce


  Based directly on work we did with Knitting Boar
    Parallel logistic regression
  Scales linearly with input size
  Can produce a linear regression model off large amounts
  of data
  Packaged in a new suite of parallel iterative algorithms
  called Metronome
    100% Java, ASF 2.0 Licensed, on github
Unit Testing and IRUnit
 Simulates the IterativeReduce parallel framework
   Uses the same app.properties file that YARN applications do
 Examples
   https://github.com/jpatanooga/Metronome/blob/master/src/test/java/tv/floe/metronome/linearregression/iterativereduce/TestSimulateLinearRegressionIterativeReduce.java
   https://github.com/jpatanooga/KnittingBoar/blob/master/src/test/java/com/cloudera/knittingboar/sgd/iterativereduce/TestKnittingBoar_IRUnitSim.java
Running the Job via YARN
 Build with Maven

 Copy Jar to host with cluster access

 Copy dataset to HDFS

 Run job
  yarn jar iterativereduce-0.1-SNAPSHOT.jar app.properties
Results
 [Figure: Linear Regression - Parallel vs Serial. Y-axis: Total Processing Time (ticks 0 to 200); X-axis: Megabytes Processed Total (64, 128, 192, 256, 320); series: Parallel Runs, Serial Runs]
Lessons Learned
 Linear scale continues to be achieved with
 parameter averaging variations
 Tuning is critical
   Need to be good at selecting a learning rate
 YARN still experimental, has caveats
   Container allocation is still slow
   Metronome continues to be experimental
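To see why the learning rate matters, consider plain gradient descent on the one-dimensional cost f(w) = w^2, a deliberately simple stand-in for the regression loss. The function name and the specific rates are chosen for illustration only.

```python
def descend(learning_rate, steps=50, start=1.0):
    """Run gradient descent on f(w) = w**2 and return the final w."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * w          # derivative of w**2
        w -= learning_rate * gradient
    return w


too_small = descend(0.001)   # barely moves away from the start
reasonable = descend(0.1)    # shrinks toward the minimum at 0
too_large = descend(1.1)     # overshoots further on every step
print(too_small, reasonable, too_large)
```

Each step multiplies w by (1 - 2 * learning_rate), so a rate that is too small wastes passes and a rate above 1 makes the iterate grow instead of shrink.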
Special Thanks
 Michael Katzenellenbollen

 Dr. James Scott
  University of Texas at Austin
 Dr. Jason Baldridge
  University of Texas at Austin
Future Directions
 More testing, stability
 Cache vectors in memory for speed
 Metronome
   Take on properties of LibLinear
     Pluggable optimization, general linear models
   YARN-centric first class Hadoop citizen
   Focus on being a complement to Mahout
   K-means, PageRank implementations
Github
 IterativeReduce
  https://github.com/emsixteeen/IterativeReduce
 Metronome
  https://github.com/jpatanooga/Metronome
 Knitting Boar
  https://github.com/jpatanooga/KnittingBoar
References
1. http://www.infoworld.com/d/business-intelligence/gartner-hadoop-will-be-in-two-thirds-of-advanced-analytics-products-2015-211475
2. https://cwiki.apache.org/MAHOUT/logistic-regression.html
3. MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That’s Not a Nail! http://arxiv.org/pdf/1209.2191.pdf

Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN

Knitting boar atl_hug_jan2013_v2
Knitting boar atl_hug_jan2013_v2Knitting boar atl_hug_jan2013_v2
Knitting boar atl_hug_jan2013_v2
Josh Patterson
 
KnittingBoar Toronto Hadoop User Group Nov 27 2012
KnittingBoar Toronto Hadoop User Group Nov 27 2012KnittingBoar Toronto Hadoop User Group Nov 27 2012
KnittingBoar Toronto Hadoop User Group Nov 27 2012
Adam Muise
 
Apache Hadoop India Summit 2011 Keynote talk "Programming Abstractions for Sm...
Apache Hadoop India Summit 2011 Keynote talk "Programming Abstractions for Sm...Apache Hadoop India Summit 2011 Keynote talk "Programming Abstractions for Sm...
Apache Hadoop India Summit 2011 Keynote talk "Programming Abstractions for Sm...
Yahoo Developer Network
 
Introduction to map reduce s. jency jayastina II MSC COMPUTER SCIENCE BON SEC...
Introduction to map reduce s. jency jayastina II MSC COMPUTER SCIENCE BON SEC...Introduction to map reduce s. jency jayastina II MSC COMPUTER SCIENCE BON SEC...
Introduction to map reduce s. jency jayastina II MSC COMPUTER SCIENCE BON SEC...
jencyjayastina
 
Big Data on Implementation of Many to Many Clustering
Big Data on Implementation of Many to Many ClusteringBig Data on Implementation of Many to Many Clustering
Big Data on Implementation of Many to Many Clustering
paperpublications3
 
Mapreduce - Simplified Data Processing on Large Clusters
Mapreduce - Simplified Data Processing on Large ClustersMapreduce - Simplified Data Processing on Large Clusters
Mapreduce - Simplified Data Processing on Large Clusters
Abhishek Singh
 

Semelhante a Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN (20)

Parallel Linear Regression in Interative Reduce and YARN
Parallel Linear Regression in Interative Reduce and YARNParallel Linear Regression in Interative Reduce and YARN
Parallel Linear Regression in Interative Reduce and YARN
 
Strata + Hadoop World 2012: Knitting Boar
Strata + Hadoop World 2012: Knitting BoarStrata + Hadoop World 2012: Knitting Boar
Strata + Hadoop World 2012: Knitting Boar
 
Knitting boar atl_hug_jan2013_v2
Knitting boar atl_hug_jan2013_v2Knitting boar atl_hug_jan2013_v2
Knitting boar atl_hug_jan2013_v2
 
KnittingBoar Toronto Hadoop User Group Nov 27 2012
KnittingBoar Toronto Hadoop User Group Nov 27 2012KnittingBoar Toronto Hadoop User Group Nov 27 2012
KnittingBoar Toronto Hadoop User Group Nov 27 2012
 
Knitting boar - Toronto and Boston HUGs - Nov 2012
Knitting boar - Toronto and Boston HUGs - Nov 2012Knitting boar - Toronto and Boston HUGs - Nov 2012
Knitting boar - Toronto and Boston HUGs - Nov 2012
 
Apache Hadoop India Summit 2011 Keynote talk "Programming Abstractions for Sm...
Apache Hadoop India Summit 2011 Keynote talk "Programming Abstractions for Sm...Apache Hadoop India Summit 2011 Keynote talk "Programming Abstractions for Sm...
Apache Hadoop India Summit 2011 Keynote talk "Programming Abstractions for Sm...
 
Introduction to map reduce s. jency jayastina II MSC COMPUTER SCIENCE BON SEC...
Introduction to map reduce s. jency jayastina II MSC COMPUTER SCIENCE BON SEC...Introduction to map reduce s. jency jayastina II MSC COMPUTER SCIENCE BON SEC...
Introduction to map reduce s. jency jayastina II MSC COMPUTER SCIENCE BON SEC...
 
Hadoop mapreduce and yarn frame work- unit5
Hadoop mapreduce and yarn frame work-  unit5Hadoop mapreduce and yarn frame work-  unit5
Hadoop mapreduce and yarn frame work- unit5
 
Characterization of hadoop jobs using unsupervised learning
Characterization of hadoop jobs using unsupervised learningCharacterization of hadoop jobs using unsupervised learning
Characterization of hadoop jobs using unsupervised learning
 
Big Data on Implementation of Many to Many Clustering
Big Data on Implementation of Many to Many ClusteringBig Data on Implementation of Many to Many Clustering
Big Data on Implementation of Many to Many Clustering
 
E031201032036
E031201032036E031201032036
E031201032036
 
Mapreduce Osdi04
Mapreduce Osdi04Mapreduce Osdi04
Mapreduce Osdi04
 
Hadoop at JavaZone 2010
Hadoop at JavaZone 2010Hadoop at JavaZone 2010
Hadoop at JavaZone 2010
 
Mapreduce - Simplified Data Processing on Large Clusters
Mapreduce - Simplified Data Processing on Large ClustersMapreduce - Simplified Data Processing on Large Clusters
Mapreduce - Simplified Data Processing on Large Clusters
 
mapReduce.pptx
mapReduce.pptxmapReduce.pptx
mapReduce.pptx
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Next generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph labNext generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph lab
 
Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014
 
Finalprojectpresentation
FinalprojectpresentationFinalprojectpresentation
Finalprojectpresentation
 
Scheduling in distributed systems - Andrii Vozniuk
Scheduling in distributed systems - Andrii VozniukScheduling in distributed systems - Andrii Vozniuk
Scheduling in distributed systems - Andrii Vozniuk
 

Mais de Josh Patterson

Modeling Electronic Health Records with Recurrent Neural Networks
Modeling Electronic Health Records with Recurrent Neural NetworksModeling Electronic Health Records with Recurrent Neural Networks
Modeling Electronic Health Records with Recurrent Neural Networks
Josh Patterson
 

Mais de Josh Patterson (20)

Patterson Consulting: What is Artificial Intelligence?
Patterson Consulting: What is Artificial Intelligence?Patterson Consulting: What is Artificial Intelligence?
Patterson Consulting: What is Artificial Intelligence?
 
What is Artificial Intelligence
What is Artificial IntelligenceWhat is Artificial Intelligence
What is Artificial Intelligence
 
Smart Data Conference: DL4J and DataVec
Smart Data Conference: DL4J and DataVecSmart Data Conference: DL4J and DataVec
Smart Data Conference: DL4J and DataVec
 
Deep Learning: DL4J and DataVec
Deep Learning: DL4J and DataVecDeep Learning: DL4J and DataVec
Deep Learning: DL4J and DataVec
 
Deep Learning and Recurrent Neural Networks in the Enterprise
Deep Learning and Recurrent Neural Networks in the EnterpriseDeep Learning and Recurrent Neural Networks in the Enterprise
Deep Learning and Recurrent Neural Networks in the Enterprise
 
Modeling Electronic Health Records with Recurrent Neural Networks
Modeling Electronic Health Records with Recurrent Neural NetworksModeling Electronic Health Records with Recurrent Neural Networks
Modeling Electronic Health Records with Recurrent Neural Networks
 
Building Deep Learning Workflows with DL4J
Building Deep Learning Workflows with DL4JBuilding Deep Learning Workflows with DL4J
Building Deep Learning Workflows with DL4J
 
How to Build Deep Learning Models
How to Build Deep Learning ModelsHow to Build Deep Learning Models
How to Build Deep Learning Models
 
Deep learning with DL4J - Hadoop Summit 2015
Deep learning with DL4J - Hadoop Summit 2015Deep learning with DL4J - Hadoop Summit 2015
Deep learning with DL4J - Hadoop Summit 2015
 
Enterprise Deep Learning with DL4J
Enterprise Deep Learning with DL4JEnterprise Deep Learning with DL4J
Enterprise Deep Learning with DL4J
 
Deep Learning Intro - Georgia Tech - CSE6242 - March 2015
Deep Learning Intro - Georgia Tech - CSE6242 - March 2015Deep Learning Intro - Georgia Tech - CSE6242 - March 2015
Deep Learning Intro - Georgia Tech - CSE6242 - March 2015
 
Vectorization - Georgia Tech - CSE6242 - March 2015
Vectorization - Georgia Tech - CSE6242 - March 2015Vectorization - Georgia Tech - CSE6242 - March 2015
Vectorization - Georgia Tech - CSE6242 - March 2015
 
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
 
Georgia Tech cse6242 - Intro to Deep Learning and DL4J
Georgia Tech cse6242 - Intro to Deep Learning and DL4JGeorgia Tech cse6242 - Intro to Deep Learning and DL4J
Georgia Tech cse6242 - Intro to Deep Learning and DL4J
 
Intro to Vectorization Concepts - GaTech cse6242
Intro to Vectorization Concepts - GaTech cse6242Intro to Vectorization Concepts - GaTech cse6242
Intro to Vectorization Concepts - GaTech cse6242
 
Hadoop Summit 2014 - San Jose - Introduction to Deep Learning on Hadoop
Hadoop Summit 2014 - San Jose - Introduction to Deep Learning on HadoopHadoop Summit 2014 - San Jose - Introduction to Deep Learning on Hadoop
Hadoop Summit 2014 - San Jose - Introduction to Deep Learning on Hadoop
 
MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN
MLConf 2013: Metronome and Parallel Iterative Algorithms on YARNMLConf 2013: Metronome and Parallel Iterative Algorithms on YARN
MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN
 
LA HUG Dec 2011 - Recommendation Talk
LA HUG Dec 2011 - Recommendation TalkLA HUG Dec 2011 - Recommendation Talk
LA HUG Dec 2011 - Recommendation Talk
 
Oct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on HadoopOct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on Hadoop
 
Machine Learning and Hadoop
Machine Learning and HadoopMachine Learning and Hadoop
Machine Learning and Hadoop
 


Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN

  • 1.
  • 2. Josh Patterson Email: josh@floe.tv Twitter: @jpatanooga Github: https://github.com/jpatanooga Past: Published in IAAI-09: “TinyTermite: A Secure Routing Algorithm”; Grad work in Meta-heuristics, Ant-algorithms; Tennessee Valley Authority (TVA): Hadoop and the Smartgrid; Cloudera: Principal Solution Architect Today: Independent Consultant
  • 3. Sections 1. Modern Data Analytics 2. Parallel Linear Regression 3. Performance and Results
  • 4.
  • 5. The World as Optimization Data tells us about our model/engine/product We take this data and evolve our product towards a state of minimal market error WSJ Special Section, Monday March 11, 2013 Zynga changing games based off player behavior UPS cut fuel consumption by 8.4MM gallons Ford used sentiment analysis to look at how new car features would be received
  • 6. The Modern Data Landscape Apps are coming but they need Platforms Components Workflows Lots of investment in Hadoop in this space Lots of ETL pipelines Lots of descriptive Statistics Growing interest in Machine Learning
  • 7. Hadoop as The Linux of Data Hadoop has won the Cycle Gartner: Hadoop will be in 2/3s of advanced analytics products by 2015 [1] “Hadoop is the kernel of a distributed operating system, and all the other components around the kernel are now arriving on this stage” ---Doug Cutting
  • 8. Today’s Hadoop ML Pipeline Data cleansing / ETL performed with Hive or Pig Data In Place Processed Mahout R Custom MapReduce Algorithm Or Externally Processed SAS SPSS KXEN Weka
  • 9. As Focus Shifts to Applications Data rates have been climbing fast Speed at Scale becomes the new Killer App Companies will want to leverage the Big Data infrastructure they’ve already been working with Hadoop HDFS as main storage system A drive to validate big data investments with results Emergence of applications which create “data products”
  • 10. Patterson’s Law “As the percent of your total data held in a storage system approaches 100% the amount of in-system processing and analytics also approaches 100%”
  • 11. Tools Will Move onto Hadoop Already seeing this with Vendors Who hasn’t announced a SQL engine on Hadoop lately? Trend will continue with machine learning tools Mahout was the beginning More are following But what about parallel iterative algorithms?
  • 12. Distributed Systems Are Hard Lots of moving parts Especially as these applications become more complicated Machine learning can be a non-trivial operation We need great building blocks that work well together I agree with Jimmy Lin [3]: “keep it simple” “make sure costs don’t outweigh benefits” Minimize “Yet Another Tool To Learn” (YATTL) as much as we can!
  • 13. To Summarize Data moving into Hadoop everywhere Patterson’s Law Focus on Hadoop, build around next-gen “Linux of data” Need simple components to build next-gen data base apps They should work cleanly with the cluster that the Fortune 500 has: Hadoop Also should be easy to integrate into Hadoop and with the Hadoop-tool ecosystem Minimize YATTL
  • 14.
  • 15. Linear Regression In linear regression, data is modeled using linear predictor functions, and unknown model parameters are estimated from the data. We use optimization techniques like Stochastic Gradient Descent to find the coefficients in the model Y = (1*x0) + (c1*x1) + … + (cN*xN)
  • 16. Machine Learning and Optimization Algorithms (Convergent) Iterative Methods Newton’s Method Quasi-Newton Gradient Descent Heuristics AntNet PSO Genetic Algorithms
  • 17. Stochastic Gradient Descent Hypothesis about data Cost function Update function Andrew Ng’s Tutorial: https://class.coursera.org/ml/lecture/preview_view/11
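The formulas for slide 17's three headings did not survive extraction. For linear regression they are conventionally written as below. This is a sketch under assumed notation, not a transcription of the slide: the coefficients are written as c_j to match slide 15, c_0 is an assumed intercept coefficient paired with a constant input x_0 = 1, m is the number of training examples, and α is the learning rate.

```latex
\begin{align}
  h_c(x) &= \sum_{j=0}^{N} c_j x_j, \qquad x_0 = 1
    && \text{(hypothesis)} \\
  J(c) &= \frac{1}{2m} \sum_{i=1}^{m} \left( h_c(x^{(i)}) - y^{(i)} \right)^2
    && \text{(cost function)} \\
  c_j &\leftarrow c_j - \alpha \left( h_c(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
    && \text{(update for one example } i\text{)}
\end{align}
```

The update is the gradient of the squared error of a single example, which is what distinguishes the stochastic variant from batch gradient descent over all m examples.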
  • 18. Stochastic Gradient Descent Simple gradient descent procedure Loss function needs to be convex (with exceptions) Linear Regression Loss Function: squared error of prediction Prediction: linear combination of coefficients and input variables [Diagram labels: Training Data, Training, SGD, Model]
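The procedure on slide 18 can be sketched in a few lines of Python. This is a minimal illustration, not Mahout's or Metronome's implementation: the function names, the learning rate, the epoch count, and the toy dataset are all assumptions chosen for the example.

```python
import random


def predict(coeffs, x):
    """Prediction: linear combination of coefficients and input variables."""
    return sum(c * xi for c, xi in zip(coeffs, x))


def sgd_linear_regression(rows, learning_rate=0.01, epochs=200, seed=0):
    """Fit coefficients with plain SGD on the squared error of prediction.

    rows is a list of (x, y) pairs, where x is a list of floats whose
    first entry is 1.0 so that coeffs[0] acts as the intercept.
    """
    rng = random.Random(seed)
    coeffs = [0.0] * len(rows[0][0])
    order = list(range(len(rows)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in a new random order each epoch
        for i in order:
            x, y = rows[i]
            error = predict(coeffs, x) - y
            # Gradient of 0.5 * error^2 with respect to coeffs is error * x.
            coeffs = [c - learning_rate * error * xi for c, xi in zip(coeffs, x)]
    return coeffs


# Hypothetical noise-free toy data generated from y = 2 + 3 * v.
rows = [([1.0, float(v)], 2.0 + 3.0 * v) for v in range(-5, 6)]
coeffs = sgd_linear_regression(rows)
print(coeffs)
```

On this toy data the fitted coefficients should land close to the intercept 2 and slope 3 used to generate it. With noisy data or a poorly chosen learning rate, convergence is slower or may fail.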
  • 19. Mahout’s SGD Currently Single Process Multi-threaded parallel, but not cluster parallel Runs locally, not deployed to the cluster Tied to logistic regression implementation
  • 20. Current Limitations Sequential algorithms on a single node only go so far The “Data Deluge” Presents algorithmic challenges when combined with large data sets Need to design algorithms that are able to perform in a distributed fashion MapReduce only fits certain types of algorithms
  • 21. Distributed Learning Strategies McDonald, 2010: Distributed Training Strategies for the Structured Perceptron Langford, 2007: Vowpal Wabbit Jeff Dean’s Work on Parallel SGD: DownPour SGD, Sandblaster
  • 22. MapReduce vs. Parallel Iterative [Diagram: MapReduce shown as Input, Map tasks, Reduce tasks, Output; Parallel Iterative shown as a set of Processors running Superstep 1, Superstep 2, . . .]
  • 23. YARN Yet Another Resource Negotiator Framework for scheduling distributed applications Allows for any type of parallel application to run natively on Hadoop MRv2 is now a distributed application [Diagram: Clients, Resource Manager, Node Managers, App Mstr, Containers; message types: MapReduce Status, Job Submission, Node Status, Resource Request]
  • 24. IterativeReduce Designed specifically for parallel iterative algorithms on Hadoop Implemented directly on top of YARN Intrinsic Parallelism Easier to focus on problem Not focusing on the distributed application part
  • 25. IterativeReduce API ComputableMaster: Setup() Compute() Complete() ComputableWorker: Setup() Compute() [Diagram: groups of Workers reporting to a Master, repeated . . .]
  • 26. SGD Master Collects all parameter vectors at each pass / superstep Produces new global parameter vector By averaging workers’ vectors Sends update to all workers Workers replace local parameter vector with new global parameter vector
  • 27. SGD Worker Each given a split of the total dataset Similar to a map task Performs local SGD pass Local parameter vector sent to master at superstep Stays active/resident between iterations
  • 28. SGD: Serial vs Parallel [Diagram: serial case shows Training Data and Model; parallel case shows Split 1, Split 2, Split 3 feeding Worker 1, Worker 2, … Worker N, each producing a Partial Model, which the Master combines into the Global Model]
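The master and worker roles from slides 26 to 28 can be simulated in a single process. The sketch below is a hypothetical Python analogue, not the IterativeReduce Java API: `local_sgd_pass`, `average_vectors`, `train`, the superstep count, and the toy splits are all assumed names and values for illustration.

```python
def local_sgd_pass(global_coeffs, split, learning_rate):
    """Worker: one SGD pass over its own split, starting from the global vector."""
    local = list(global_coeffs)
    for x, y in split:
        error = sum(c * xi for c, xi in zip(local, x)) - y
        local = [c - learning_rate * error * xi for c, xi in zip(local, x)]
    return local


def average_vectors(vectors):
    """Master: element-wise average of the workers' parameter vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def train(splits, num_features, supersteps=500, learning_rate=0.01):
    """Run supersteps: every worker does a local pass, then the master averages."""
    global_coeffs = [0.0] * num_features
    for _ in range(supersteps):
        partials = [local_sgd_pass(global_coeffs, s, learning_rate) for s in splits]
        # Workers replace their local vector with this new global vector.
        global_coeffs = average_vectors(partials)
    return global_coeffs


# Hypothetical noise-free data from y = 2 + 3 * v, divided into three splits.
rows = [([1.0, float(v)], 2.0 + 3.0 * v) for v in range(-6, 6)]
splits = [rows[0::3], rows[1::3], rows[2::3]]
model = train(splits, num_features=2)
print(model)
```

In a real deployment each worker would run in its own YARN container and stay resident between supersteps. Here the workers run one after another in a list comprehension, which keeps the averaging logic visible without any cluster code.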
  • 29. Parallel Linear Regression with IterativeReduce Based directly on work we did with Knitting Boar Parallel logistic regression Scales linearly with input size Can produce a linear regression model off large amounts of data Packaged in a new suite of parallel iterative algorithms called Metronome 100% Java, ASF 2.0 Licensed, on github
  • 30. Unit Testing and IRUnit Simulates the IterativeReduce parallel framework Uses the same app.properties file that YARN applications do Examples https://github.com/jpatanooga/Metronome/blob/master/src/test/java/tv/floe/metronome/linearregression/iterativereduce/TestSimulateLinearRegressionIterativeReduce.java https://github.com/jpatanooga/KnittingBoar/blob/master/src/test/java/com/cloudera/knittingboar/sgd/iterativereduce/TestKnittingBoar_IRUnitSim.java
  • 31.
  • 32. Running the Job via YARN Build with Maven Copy Jar to host with cluster access Copy dataset to HDFS Run job: yarn jar iterativereduce-0.1-SNAPSHOT.jar app.properties
  • 33. Results [Chart: Linear Regression - Parallel vs Serial; y-axis: Total Processing Time (0 to 200); x-axis: Megabytes Processed Total (64 to 320); series: Parallel Runs, Serial Runs]
  • 34. Lessons Learned Linear scale continues to be achieved with parameter averaging variations Tuning is critical Need to be good at selecting a learning rate YARN still experimental, has caveats Container allocation is still slow Metronome continues to be experimental
  • 35. Special Thanks Michael Katzenellenbollen Dr. James Scott University of Texas at Austin Dr. Jason Baldridge University of Texas at Austin
  • 36. Future Directions More testing, stability Cache vectors in memory for speed Metronome Take on properties of LibLinear Pluggable optimization, general linear models YARN-centric first class Hadoop citizen Focus on being a complement to Mahout K-means, PageRank implementations
  • 37. Github IterativeReduce https://github.com/emsixteeen/IterativeReduce Metronome https://github.com/jpatanooga/Metronome Knitting Boar https://github.com/jpatanooga/KnittingBoar
  • 38. References 1. http://www.infoworld.com/d/business-intelligence/gartner-hadoop-will-be-in-two-thirds-of-advanced-analytics-products-2015-211475 2. https://cwiki.apache.org/MAHOUT/logistic-regression.html 3. MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That’s Not a Nail! http://arxiv.org/pdf/1209.2191.pdf

Editor's Notes

  1. Reference some thoughts on attribution pipelines
  2. Talk about how you normally would use the Normal equation, notes from Andrew Ng
  3. “Unlikely optimization algorithms such as stochastic gradient descent show amazing performance for large-scale problems.” (Bottou, 2010) SGD has been around for decades, yet recently Langford, Bottou, and others have shown impressive speed increases. SGD has been shown to train multiple orders of magnitude faster than batch-style learners, with no loss of model accuracy.
  4. “Unlikely optimization algorithms such as stochastic gradient descent show amazing performance for large-scale problems.” (Bottou, 2010) SGD has been around for decades, yet recently Langford, Bottou, and others have shown impressive speed increases. SGD has been shown to train multiple orders of magnitude faster than batch-style learners, with no loss of model accuracy.
  5. The most important additions in Mahout’s SGD are: confidence-weighted learning rates per term; evolutionary tuning of hyper-parameters; mixed ranking and regression; grouped AUC. The implication of it being local is that you are limited to the compute capacity of the local machine as opposed to even a single machine on the cluster.
  6. At current disk bandwidth and capacity (2TB at 100MB/s throughput), it takes about 6 hours to read the contents of a single hard disk.
  7. Bottou similar to Xu2010 in the 2010 paper
  8. Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures. Acyclic data flow is a powerful abstraction, but is not efficient for applications that repeatedly reuse a working set of data: iterative algorithms (many in machine learning). No single programming model or framework can excel at every problem; there are always tradeoffs between simplicity, expressivity, fault tolerance, performance, etc.
  9. Performance still largely dependent on implementation of algo
  10. POLR: Parallel Online Logistic Regression. Talking points: wanted to start with a tool known to the Hadoop community, with expected characteristics. Mahout’s SGD is well known, and so we used that as a base point.
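The six-hour figure in note 6 can be checked with a short calculation. The sketch assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), which the note does not specify.

```latex
t = \frac{2\ \mathrm{TB}}{100\ \mathrm{MB/s}}
  = \frac{2 \times 10^{12}\ \mathrm{B}}{10^{8}\ \mathrm{B/s}}
  = 2 \times 10^{4}\ \mathrm{s}
  \approx 5.6\ \mathrm{h}
```

Rounded up, that matches the roughly 6 hours quoted for a single sequential read of the whole disk.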