Knitting boar - Toronto and Boston HUGs - Nov 2012

KNITTING BOAR
Machine Learning, Mahout, and Parallel Iterative Algorithms

Josh Patterson
Principal Solutions Architect


✛ Josh Patterson
> Master’s Thesis: self-organizing mesh networks
∗ Published in IAAI-09: TinyTermite: A Secure Routing Algorithm
> Conceived, built, and led Hadoop integration for openPDC project
at Tennessee Valley Authority (TVA)
> Twitter: @jpatanooga

> Email: josh@cloudera.com

✛ Introduction to Machine Learning
✛ Mahout
✛ Knitting Boar and YARN
✛ Parting Thoughts

Introduction to
MACHINE LEARNING


✛ What is Data Mining?
> “the process of extracting patterns from data”
✛ Why are we interested in Data Mining?
> Raw data essentially useless
∗ Data is simply recorded facts
∗ Information is the patterns underlying the data

✛ Machine Learning
> Algorithms for acquiring structural descriptions from
data “examples”
∗ Process of learning “concepts”

✛ Information Retrieval
> Draws on information science, information
architecture, cognitive psychology, linguistics, and
statistics
✛ Natural Language Processing
> grounded in machine learning, especially statistical
machine learning
✛ Statistics
> Math and stuff
✛ Machine Learning
> Considered a branch of artificial intelligence

✛ ETL
✛ Joining multiple disparate data sources
✛ Filtering data
✛ Aggregation
✛ Cube materialization

“Descriptive Statistics”

✛ Don’t always assume you need “scale” and
parallelization
> Try it out on a single machine first
> See if it becomes a bottleneck!
✛ Will the data fit in memory on a beefy
machine?
✛ We can always use the constructed model
back in MapReduce to score a ton of new
data

✛ http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Kolcz_SIGMOD2012.pdf
> Looks to study data with descriptive statistics in the hopes of building models for
predictive analytics

✛ Does the majority of ML work via custom Pig integrations
> Pipeline is very “Pig-centric”
> Example: https://github.com/tdunning/pig-vector
> They mostly use SGD and ensemble methods, which are conducive
to large-scale data mining
✛ Questions they try to answer
> Is this tweet spam?
> What star rating might this user give this movie?

✛ Data collection performed with Flume
✛ Data cleansing / ETL performed with Hive
or Pig
✛ ML work performed with
> SAS
> SPSS
> R
> Mahout

Introduction to
MAHOUT

✛ Classification
> “Fraud detection”
✛ Recommendation
> “Collaborative Filtering”
✛ Clustering
> “Segmentation”
✛ Frequent Itemset Mining

Copyright 2010 Cloudera Inc. All rights reserved

✛ Stochastic Gradient Descent
> Single process
> Logistic Regression Model Construction
✛ Naïve Bayes
> MapReduce-based
> Text Classification
✛ Random Forests
> MapReduce-based


✛ An algorithm that looks at a user’s past actions
and suggests
> Products
> Services
> People
✛ Advertisement
> Cloudera has a great Data Science training course on
this topic
> http://university.cloudera.com/training/data_science/introduction_to_data_science_-_building_recommender_systems.html

✛ Cluster words across docs to identify topics
✛ Latent Dirichlet Allocation

✛ Why Machine Learning?
> Growing interest in predictive modeling

✛ Linear Models are Simple, Useful
> Stochastic Gradient Descent is a very popular tool for
building linear models like Logistic Regression

✛ Building Models Is Still Time Consuming
> The “Need for speed”
> “More data beats a cleverer algorithm”

Introducing
KNITTING BOAR


✛ Parallelize Mahout’s Stochastic Gradient Descent
> With as few extra dependencies as possible

✛ Wanted to explore parallel iterative algorithms
using YARN
> Wanted a first-class Hadoop YARN citizen
> Work through dev progressions towards a stable state
> Worry about “frameworks” later

✛ Training
> Simple gradient descent procedure
> Loss function needs to be convex
✛ Prediction
> Logistic Regression:
∗ Sigmoid function using parameter vector (dot)
example as exponential parameter

[Diagram: Training Data → SGD → Model]
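
The training and prediction steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not Mahout's implementation: the function names, the learning rate, and the toy data are all assumptions made for the example.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, example):
    """P(label = 1): sigmoid of (parameter vector dot example)."""
    return sigmoid(sum(w * x for w, x in zip(weights, example)))


def sgd_step(weights, example, label, learning_rate=0.1):
    """One stochastic gradient step on the convex logistic loss."""
    error = label - predict(weights, example)
    return [w + learning_rate * error * x for w, x in zip(weights, example)]


def train(examples, labels, passes=200, learning_rate=0.1):
    """Plain sequential SGD: visit one example at a time, update, repeat."""
    weights = [0.0] * len(examples[0])
    for _ in range(passes):
        for example, label in zip(examples, labels):
            weights = sgd_step(weights, example, label, learning_rate)
    return weights


# Hypothetical toy data: the first feature is a constant bias term.
examples = [[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]]
labels = [0, 0, 1, 1]
model = train(examples, labels)
```

Because the logistic loss is convex, the per-example updates move toward a single optimum, which is why such a simple procedure works.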


Current Limitations
✛ Sequential algorithms on a single node only
go so far
✛ The “Data Deluge”
> Presents algorithmic challenges when combined with
large data sets
> Need to design algorithms that are able to perform in
a distributed fashion
✛ MapReduce only fits certain types of algorithms


Distributed Learning Strategies
✛ Langford, 2007
> Vowpal Wabbit
✛ McDonald 2010
> Distributed Training Strategies for the Structured
Perceptron
✛ Dekel 2010
> Optimal Distributed Online Prediction Using Mini-Batches


[Diagram: Input feeds a row of Processors (Superstep 1, shown alongside Map tasks), then a second row of Processors (Superstep 2, shown alongside Reduce tasks), and so on until Output]


“Are the gains gotten from using X worth the integration
costs incurred in building the end-to-end solution?

If no, then operationally, we can consider the Hadoop
stack …

there are substantial costs in knitting together a
patchwork of different frameworks, programming
models, etc.”

–– Lin, 2012


✛ Parallel Iterative implementation of SGD on
YARN

✛ Workers work on partitions of the data
✛ Master keeps global copy of merged parameter
vector


✛ Each given a split of the total dataset
> Similar to a map task
✛ Using a modified OLR
> Processes N samples in an epoch (subset of split)
✛ Local parameter vector sent to master node
> Master averages all workers’ vectors together


✛ Gathers and averages worker parameter vectors
> From worker OLR runs
✛ Produces new global parameter vector
> By averaging workers’ vectors
✛ Sends update to all workers
> Workers replace local parameter vector with new
global parameter vector
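
The worker/master exchange described above can be simulated in a single process. The sketch below is an assumption-laden illustration, not Knitting Boar's code: `local_sgd`, `average_vectors`, `train_parallel`, the batch size, and the toy splits are all hypothetical names and values chosen for the example.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def local_sgd(weights, batch, learning_rate):
    """Worker side: run SGD over one batch, starting from the global vector."""
    weights = list(weights)
    for features, label in batch:
        p = sigmoid(sum(w * x for w, x in zip(weights, features)))
        error = label - p
        weights = [w + learning_rate * error * x
                   for w, x in zip(weights, features)]
    return weights


def average_vectors(vectors):
    """Master side: element-wise average of the workers' parameter vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def train_parallel(splits, num_features, batch_size, learning_rate=0.1):
    """Alternate local training and global averaging until splits are used up."""
    global_weights = [0.0] * num_features
    rounds = max(math.ceil(len(split) / batch_size) for split in splits)
    for r in range(rounds):
        # Each worker trains on the next batch of its own split.
        local_vectors = [
            local_sgd(global_weights,
                      split[r * batch_size:(r + 1) * batch_size],
                      learning_rate)
            for split in splits
        ]
        # The master merges the results and the workers adopt the new vector.
        global_weights = average_vectors(local_vectors)
    return global_weights


# Hypothetical toy splits of (features, label); first feature is a bias term.
splits = [
    [([1.0, 0.0], 0), ([1.0, 4.0], 1)] * 100,
    [([1.0, 1.0], 0), ([1.0, 3.0], 1)] * 100,
]
weights = train_parallel(splits, num_features=2, batch_size=10)
```

In a real deployment each `local_sgd` call would run on a separate node; here the list comprehension stands in for that parallelism.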


✛ ComputableMaster
> Setup()
> Compute()
> Complete()
✛ ComputableWorker
> Setup()
> Compute()

[Diagram: a row of Workers feeds a Master, then another row of Workers feeds the Master again, and so on]
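
To make the division of labor concrete, here is a hypothetical Python analogue of the two roles. It only mirrors the method names listed on the slide; the argument lists, the `run` driver, and the mean-computing example classes are assumptions for illustration and are not the IterativeReduce Java signatures.

```python
from abc import ABC, abstractmethod


class ComputableWorker(ABC):
    @abstractmethod
    def setup(self, split):
        """Receive this worker's share of the input."""

    @abstractmethod
    def compute(self, global_update):
        """Do one round of local work and return a partial result."""


class ComputableMaster(ABC):
    @abstractmethod
    def setup(self):
        """Initialize global state."""

    @abstractmethod
    def compute(self, worker_updates):
        """Merge the workers' partial results into a new global result."""

    @abstractmethod
    def complete(self):
        """Return the final result once all rounds are done."""


def run(master, workers, splits, iterations):
    """Hypothetical driver: each iteration is one worker pass, then one merge."""
    master.setup()
    for worker, split in zip(workers, splits):
        worker.setup(split)
    global_update = None
    for _ in range(iterations):
        partials = [worker.compute(global_update) for worker in workers]
        global_update = master.compute(partials)
    return master.complete()


class MeanWorker(ComputableWorker):
    def setup(self, split):
        self.split = split

    def compute(self, global_update):
        return sum(self.split) / len(self.split)


class MeanMaster(ComputableMaster):
    def setup(self):
        self.result = None

    def compute(self, worker_updates):
        self.result = sum(worker_updates) / len(worker_updates)
        return self.result

    def complete(self):
        return self.result


result = run(MeanMaster(), [MeanWorker(), MeanWorker()],
             [[1.0, 3.0], [5.0, 7.0]], iterations=1)
```

The point of the split is that algorithm authors only fill in the two small classes, while the framework owns the loop in `run`.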


[Diagram, OnlineLogisticRegression: Training Data → OnlineLogisticRegression → Model]

[Diagram, Knitting Boar’s POLR: Split 1, Split 2, Split 3 → Worker 1, Worker 2, … Worker N → Partial Models → Master → Global Model]


[Chart: Input Size vs Processing Time, comparing OLR and POLR; x-axis from 4.1 to 41, y-axis from 0 to 300]


Knitting Boar
PARTING THOUGHTS


✛ Parallel SGD
> The Boar is temperamental, experimental
∗ Linear speedup (roughly)

✛ Developing YARN Applications
> More complex than just MapReduce
> Requires lots of “plumbing”
✛ IterativeReduce
> Great native-Hadoop way to implement algorithms
> Easy to use and well integrated


✛ Knitting Boar
> https://github.com/jpatanooga/KnittingBoar
> 100% Java
> ASF 2.0 Licensed
> Quick Start
∗ https://github.com/jpatanooga/KnittingBoar/wiki/Quick-Start

✛ IterativeReduce
> https://github.com/emsixteeen/IterativeReduce
> 100% Java
> ASF 2.0 Licensed


✛ Machine Learning is hard
> Don’t believe the hype
> Do the work
✛ Model development takes time
> Lots of iterations
> Speed is key here

Picture: http://evertrek.files.wordpress.com/2011/06/everestsign.jpg


✛ Strata / Hadoop World 2012 Slides
> http://www.cloudera.com/content/cloudera/en/resources/library/hadoopworld/strata-hadoop-world-2012-knitting-boar_slide_deck.html
✛ Mahout’s SGD implementation
> http://lingpipe.files.wordpress.com/2008/04/lazysgdregression.pdf
✛ MapReduce is Good Enough? If All You Have is
a Hammer, Throw Away Everything That’s Not a
Nail!
> http://arxiv.org/pdf/1209.2191v1.pdf


✛ Langford
> http://hunch.net/~vw/
✛ McDonald, 2010
> http://dl.acm.org/citation.cfm?id=1858068


✛ http://eteamjournal.files.wordpress.com/2011/03/photos-of-mount-everest-pictures.jpg
✛ http://images.fineartamerica.com/images-medium-large/-say-hello-to-my-little-friend--luis-ludzska.jpg
✛ http://freewallpaper.in/wallpaper2/2202-2-2001_space_odyssey_-_5.jpg
