Machine Learning Pipeline with Spark ML

End-to-End Machine Learning
https://github.com/RamkSwamy/sparkmlpipeline

● Ram Kuppuswamy
● Worked at Microsoft for 13 years
● Co-Founder, Zinnia Systems Pvt. Ltd.
● Big data consultant and trainer at datamantra.io

Agenda
● Machine Learning Pipeline
● Spark ML API
● Components of ML API
● Building a pipeline
● Persisting a model
● Evaluating a pipeline model
● Cross-validating a pipeline model

Machine learning
● Most developers think machine learning is mostly about the learning algorithm
● Most big data libraries, such as MLlib and Mahout, focus on implementing algorithms in a distributed manner
● But when you try to productionize an end-to-end solution, you quickly realize that machine learning is not just about the learning algorithm
● There are many other important steps in building an end-to-end machine learning application

Stages of Machine Learning application
● Data Exploration
○ Read Data
○ Missing data
○ Look at correlation
○ Statistics of independent variables
● Data Preparation (preprocessing)
○ Indexing the labels
○ Handling categorical variables
○ Numeric representation of text data (Word2Vec)

Stages of ML Application continued...
● Model training
● Model evaluation
● Model Tuning
● Repeat this process many times

Spark MLlib
● Only focused on model learning
● No standard way to do other steps of ML pipeline
● No way to combine all these steps and execute them
● Based on RDD API
● Though some of these steps were added later, they were not uniform across algorithms

Spark ML
● Provides a higher-level API for constructing and tuning ML workflows
● Built on top of DataFrames
● We are using Spark 2.0, where the DataFrame-based ML API is the primary machine learning API going forward and the RDD-based MLlib API is in maintenance mode

Case Study
● We use the Census Income dataset from the UCI Machine Learning Repository:
http://archive.ics.uci.edu/ml/datasets/Census+Income
● The training and test data are provided in two separate files:
○ adult.data
○ adult.test
● Objective: predict whether an individual's income is >50K or <=50K by constructing a pipeline

Abstractions of Spark ML
● Transformer
● Estimator
● Evaluator
● Pipeline
● Params

Data Exploration
● Read the data and create a DataFrame
○ Util:loadSalaryCsvTrain
○ Util:loadSalaryCsvTest
● Look at the schema
○ SalaryDataSchema
● Look at statistics of the variables
○ SalaryDataExplore

Data Preparation
● Clean the data
○ cleanDataFrame Util
● Label Indexing
● Categorical handling
○ String indexing
○ One-hot encoding

Estimator
● An Estimator is an abstraction of a learning algorithm that is fitted on a DataFrame to produce a model
● It implements a method fit():
DF → Estimator.fit() → Model

Label Indexing
● We want to create the label, which is the dependent variable
● The label has two values: >50K and <=50K
● One takes the value 0 and the other 1
● We use the StringIndexer API to achieve this
● It encodes a string column of labels to a column of label
indices.
● The indices are in [0, numLabels), ordered by label
frequencies, so the most frequent label gets index 0

String Indexer
● SalaryLabelIndexing

Categorical handling
● We have many categorical fields, such as occupation, sex, workclass, relationship, and marital_status
● They are all String types, so we first use StringIndexer to generate indices and then use OneHotEncoder, which maps a column of label indices to a column of binary vectors with at most a single one-value
● This encoding allows algorithms that expect continuous features, such as Logistic Regression, to use categorical features
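
A minimal plain-Python sketch of the encoding step, assuming the indices come from a previous indexing step. The function name and the drop_last flag are hypothetical; the flag mimics the encoder's default of dropping the last category, which is why a vector has "at most" a single one-value and can be all zeros.

```python
def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a binary vector."""
    # With drop_last, the last category is represented by the all-zero vector.
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector


# Three categories, e.g. indices produced for a column such as workclass.
first = one_hot(0, 3)
last = one_hot(2, 3)
last_kept = one_hot(2, 3, drop_last=False)
```

With three categories, index 0 becomes [1.0, 0.0], while the last index becomes [0.0, 0.0] unless drop_last is turned off.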

StringIndexer
● It is an Estimator: fitting it on the data produces a StringIndexerModel, which is a Transformer

Transformer
● A Transformer is an abstraction of an algorithm that transforms one DataFrame into another
● It implements a method transform():
DF → Transformer.transform() → DF

OneHotEncoder
● It is a Transformer that converts a column of category indices into a column of binary vectors
● CategoricalForWorkClass

Vector assembler
● A feature transformer that merges multiple columns into
a vector column.
● LogisticRegression expects a vector column named “features” by default; the column name can be changed
● SalaryVectorAssembler

Pipeline
● Chain of Transformers and Estimators
● Pipeline itself is an Estimator
● Fitting it on a DataFrame produces a pipeline model, which is a Transformer
● Once you have defined a pipeline, the training and test datasets go through the same stages of processing
○ buildOneHotPipeLine
○ buildPipeLineForFeaturePreparation
○ buildDataPrepPipeLine
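
How a pipeline chains Estimators and Transformers can be sketched in plain Python over lists of dicts. This is a conceptual illustration, not Spark's implementation: all class names are hypothetical, and the toy Indexer orders labels alphabetically rather than by frequency.

```python
class Indexer:
    """Toy Estimator: learns a label -> index mapping for one column."""

    def __init__(self, input_col, output_col):
        self.input_col, self.output_col = input_col, output_col

    def fit(self, rows):
        labels = sorted({row[self.input_col] for row in rows})
        mapping = {label: i for i, label in enumerate(labels)}
        return IndexerModel(self.input_col, self.output_col, mapping)


class IndexerModel:
    """Toy Transformer returned by Indexer.fit()."""

    def __init__(self, input_col, output_col, mapping):
        self.input_col, self.output_col, self.mapping = input_col, output_col, mapping

    def transform(self, rows):
        return [{**row, self.output_col: self.mapping[row[self.input_col]]} for row in rows]


class Assembler:
    """Toy Transformer: merges several columns into one list column."""

    def __init__(self, input_cols, output_col):
        self.input_cols, self.output_col = input_cols, output_col

    def transform(self, rows):
        return [{**row, self.output_col: [row[c] for c in self.input_cols]} for row in rows]


class PipelineModel:
    """Fitted pipeline: applies every fitted stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """A pipeline is itself an Estimator: fit() returns a PipelineModel."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # Estimators are fitted on the data seen so far; Transformers are used as-is.
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)
            fitted.append(model)
        return PipelineModel(fitted)


train = [{"workclass": "Private", "age": 39}, {"workclass": "State-gov", "age": 50}]
test = [{"workclass": "State-gov", "age": 28}]
pipeline = Pipeline([Indexer("workclass", "workclass_idx"),
                     Assembler(["age", "workclass_idx"], "features")])
model = pipeline.fit(train)    # learned on the training data only
scored = model.transform(test) # the same stages applied to the test data
```

The mapping is learned once from the training rows, and the fitted model then applies identical processing to the test rows.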

Logistic regression
● We want to train the LogisticRegression model on the training data
● We create a pipeline that takes the training data, trains the model on it, and returns the fitted model, which we then use to predict on the test data
● Example code: LRTraining module

Model training with training data
● By default, the model expects the feature vectors in a column named “features” and the labels (the dependent variable that we are trying to predict) in a column named “label”

Evaluator
● Area under the ROC curve is used to measure the quality of our model
● We use evaluators such as BinaryClassificationEvaluator in this process
● An Evaluator takes the data and a metric parameter and computes the requested metric, for example area under the ROC curve or area under the PR curve
● It has an evaluate() method

Evaluator continued
● Takes Data and Parameters and provides metrics
○ SalaryEvaluator
(DF, Param) → Evaluator.evaluate() → Metric
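
To make the metric concrete, here is a plain-Python sketch of area under the ROC curve using its pairwise-ranking definition: the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half. The function name is hypothetical, and this is not how Spark computes the metric internally.

```python
def area_under_roc(labels, scores):
    """AUC as the probability that a positive outranks a negative."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # correctly ranked pair
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = area_under_roc(labels, scores)
```

In this example three of the four positive/negative pairs are ranked correctly, so the value is 0.75; a perfect ranking gives 1.0 and a random one about 0.5.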

Cross-validator
● We want to tune the performance of our models
● It takes the following inputs:
○ Estimator: the pipeline we have built
○ Parameter grid: regularization and number of iterations for logistic regression
○ Evaluator: binary classification evaluator
● It finds the best parameters
○ SalaryCrossValidator
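
The search a cross-validator performs can be sketched in plain Python: for every parameter combination in the grid, fit on k-1 folds, evaluate on the held-out fold, and keep the combination with the best mean metric. All names are hypothetical, and the toy threshold model with its "margin" hyperparameter only stands in for the pipeline and the logistic regression settings.

```python
from itertools import product


def k_folds(rows, k):
    """Yield (training, validation) splits for k-fold cross-validation."""
    folds = [rows[i::k] for i in range(k)]
    for i in range(k):
        training = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield training, folds[i]


def cross_validate(fit, evaluate, rows, param_grid, k=3):
    """Return (best mean score, best params) over the whole grid."""
    names = sorted(param_grid)
    best = None
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [evaluate(fit(train, **params), valid)
                  for train, valid in k_folds(rows, k)]
        mean_score = sum(scores) / len(scores)
        if best is None or mean_score > best[0]:
            best = (mean_score, params)
    return best


def fit_threshold(rows, margin):
    """Toy estimator: the model is a threshold, mean of x plus a margin."""
    return sum(x for x, _ in rows) / len(rows) + margin


def accuracy(threshold, rows):
    """Toy evaluator: share of rows where (x > threshold) matches the label."""
    return sum(int(x > threshold) == y for x, y in rows) / len(rows)


data = [(x, int(x > 5)) for x in range(12)]
best_score, best_params = cross_validate(
    fit_threshold, accuracy, data, {"margin": [-3.0, 0.0, 3.0]})
```

On this toy data the search selects a margin of 0.0, since shifting the threshold in either direction lowers the mean validation accuracy.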

Recap ML Pipelines
● Load Data
● Pipeline: StringIndexer → OneHotEncoder → VectorAssembler → LogisticRegression
● Evaluate
● Each pipeline stage is either a Transformer or an Estimator
