Kaminski, Schlegel | Oct. 25, 2017
BUILDING CUSTOM ML PIPELINESTAGES
FOR FEATURE SELECTION.
SPARK SUMMIT EUROPE 2017.
WHAT YOU WILL LEARN DURING THIS SESSION.
• What data-driven car diagnostics look like at BMW.
• The most important elements of Spark ML PipelineStages (using a feature selection example).
• Attention: There will be Scala code examples!
• How to use spark-FeatureSelection in your Spark ML Pipeline.
• The impact of feature selection on learning performance and on understanding the big data black box.
Building custom ml PipelineStages for feature selection | BMW | Oct. 25, 2017 Page 2
MOTIVATION.
• The #3 contributor to warranty incidents for OEMs is "no trouble found" cases. [1]
• Potential root causes:
  • Manually formalized expert knowledge cannot cope with the vast number of possibilities.
  • Cars are getting more and more complex (hybridization, connectivity).
  • Less experienced workshop staff in evolving markets.
• Improve three workflows at once by shifting from a manual to a data-driven approach:
  • Automatic knowledge generation.
  • Automatic workshop diagnostics.
  • Predictive maintenance.
[1] BearingPoint, Global Automotive Warranty Survey Report 2009
THE DATASET AND ITS CHALLENGES.
MV_S MV_0 … MV_4000 MV_BGEE_TP SC_IP SC_1 SC_2 DTC_PU DTC_1 DTC_2 CP Label
44 3 … 20 -0.06 false 2 77 27 false true v.10 false
72 36 … 73 -0.01 false 16 29 false v.10 false
100 4 … 16 -0.02 true 45 1 false false v.10 false
44 14 … 54 -0.02 true 76 false v.10 true
95 34 … 73 -0.07 false 80 22 false false v.10 false
16 50 … 33 -0.02 true 61 93 false false false v.11 false
4 … 27 -0.09 false 59 91 false v.10 false
48 … 20 -0.07 false 32 31 false v.10 false
88 60 … 72 -0.01 true 1.9 96 53 true false true v.10 false
27 14 … 88 false 73 14 false v.10 false
High-dimensional feature space (7,000+ features)
High sparsity
High class imbalance
SPARK PIPELINE.
Relational DWH → ETL → Handling imbalance → Preprocessing → Cross-validation loop (Feature selection → Classifier) → Model
ETL: Imputation, Loading
MV_S SC_IP DTC_PU CP Label
44 2 false v.10 false
72 1.5 true v.11 false
23 1.4 false v.11 false
44 1.5 true v.10 true
Handling imbalance (true vs. false labels): SMOTE [2], Undersampling, …
Preprocessing: StringIndexer, OneHotEncoder, VectorAssembler, Discretization, StandardScaler
Features Label
[0.34,0.8,0,1] 0.0
[0.7,0.4,1,0] 0.0
[0.31,0.35,1,0] 1.0
[0.3,0.4,1.1] 1.0
Cross-validation loop:
Feature selection [3]: InformationGain, Correlation, ChiSquared, Random Forest, Gini, L1 LogReg
Classifier: Logistic Regression / Random Forest
[2] Chawla et al.: SMOTE: Synthetic Minority Over-sampling Technique
[3] Schlegel et al.: Design and optimization of an autonomous feature selection pipeline for high dimensional, heterogeneous feature spaces.
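The preprocessing and cross-validation part of this flow can be sketched with stock Spark ML stages. This is only an illustration under assumptions: the column names are taken from the toy table on the slide, Label is assumed to be numeric (0.0/1.0) already, the regularization grid is made up, and imbalance handling and the feature selection stage are left out.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{OneHotEncoder, StandardScaler, StringIndexer, VectorAssembler}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

object DiagnosticsPipelineSketch {
  // Preprocessing stages; column names (CP, MV_S, SC_IP) are placeholders from the slide's toy table.
  def preprocessingStages(): Array[PipelineStage] = {
    val cpIndexer = new StringIndexer().setInputCol("CP").setOutputCol("CP_idx")
    val cpEncoder = new OneHotEncoder().setInputCol("CP_idx").setOutputCol("CP_vec")
    val assembler = new VectorAssembler()
      .setInputCols(Array("MV_S", "SC_IP", "CP_vec"))
      .setOutputCol("raw")
    val scaler = new StandardScaler().setInputCol("raw").setOutputCol("features")
    Array(cpIndexer, cpEncoder, assembler, scaler)
  }

  // Wraps preprocessing plus a classifier in a cross-validation loop.
  def buildCrossValidator(numFolds: Int): CrossValidator = {
    val lr = new LogisticRegression().setFeaturesCol("features").setLabelCol("Label")
    val pipeline = new Pipeline().setStages(preprocessingStages() :+ lr)
    // Illustrative grid only; the talk does not state which values were tuned.
    val grid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.01, 0.1)).build()
    new CrossValidator()
      .setEstimator(pipeline)
      .setEstimatorParamMaps(grid)
      .setEvaluator(new BinaryClassificationEvaluator().setLabelCol("Label"))
      .setNumFolds(numFolds)
  }
}
```

Calling `DiagnosticsPipelineSketch.buildCrossValidator(3).fit(df)` on a suitable DataFrame would then fit every stage inside each fold, which is why a feature selection stage placed in the same pipeline is re-learned per fold.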
SPARK PIPELINE API.
PipelineStage: interface for usage in a Pipeline (data → ?)
Transformer: "transforms data" (data → data)
Estimator: "learns from data" (data → Transformer)
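The difference between the two stage types can be illustrated with a toy model of the contract. The traits below are deliberately NOT Spark's classes; they are a stripped-down, hypothetical analogy in plain Scala in which an Estimator learns a mean from the data and returns a Transformer that applies it.

```scala
// Toy model of the Estimator/Transformer contract; these are not Spark's classes.
object PipelineContractSketch {
  trait Transformer { def transform(data: Seq[Double]): Seq[Double] }
  trait Estimator[M <: Transformer] { def fit(data: Seq[Double]): M }

  // The "fitted model": remembers what was learned during fit and only transforms.
  final class MeanCenterModel(val mean: Double) extends Transformer {
    def transform(data: Seq[Double]): Seq[Double] = data.map(_ - mean)
  }

  // The estimator: learns the mean from data and returns a Transformer.
  final class MeanCenterer extends Estimator[MeanCenterModel] {
    def fit(data: Seq[Double]): MeanCenterModel = {
      require(data.nonEmpty, "cannot fit on empty data")
      new MeanCenterModel(data.sum / data.size)
    }
  }
}
```

For example, `new PipelineContractSketch.MeanCenterer().fit(Seq(1.0, 2.0, 3.0))` yields a model that can be applied to any other sequence without seeing the training data again.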
ORG.APACHE.SPARK.ML.*
PipelineStage: interface for usage in a Pipeline
  Estimator: learns from data
    Pipeline: concatenates PipelineStages
    Predictor: interface for predictors
    FeatureSelector: interface for feature selection (FS)
  Transformer: transforms data
    Model: fitted model
      PipelineModel: model from a Pipeline
      PredictionModel: model from a Predictor
      FeatureSelectorModel: model from a FeatureSelector
MAKING AN ESTIMATOR – FEATURESELECTION EXAMPLE.
transformSchema (= input validation): takes the schema of a DataFrame and returns the transformed schema, or throws an exception.
DataFrame with schema:
features label
[0,1,0,1] 1.0
[1,0,0,0] 1.0
features: VectorColumn
selected: VectorColumn
label: Double
Attention: VectorColumns have metadata: name, type, range, etc.
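The kind of check that transformSchema performs can be sketched as a plain function over a StructType. This is an assumed, minimal version (the object and function names are hypothetical, and the package's real checks may differ): it verifies that the features column exists and is a vector column, refuses to overwrite an existing output column, and returns the schema with the output column appended.

```scala
import org.apache.spark.ml.linalg.SQLDataTypes
import org.apache.spark.sql.types.{StructField, StructType}

object SchemaCheckSketch {
  // Fails fast with IllegalArgumentException (thrown by require) if the input schema is unusable.
  def validateAndTransformSchema(
      schema: StructType,
      featuresCol: String,
      outputCol: String): StructType = {
    require(schema.fieldNames.contains(featuresCol),
      s"Column $featuresCol does not exist.")
    require(schema(featuresCol).dataType == SQLDataTypes.VectorType,
      s"Column $featuresCol must be a vector column.")
    require(!schema.fieldNames.contains(outputCol),
      s"Output column $outputCol already exists.")
    // The transformed schema is the input schema plus the new vector column.
    schema.add(StructField(outputCol, SQLDataTypes.VectorType, nullable = false))
  }
}
```

Because only the schema is inspected, such a check runs before any data is touched, which is what lets a Pipeline reject a misconfigured stage early.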
fit (= learn from data): Dataset → Transformer
abstract class FeatureSelector[
    Learner <: FeatureSelector[Learner, M],
    M <: FeatureSelectorModel[M]]
  extends Estimator[M] with FeatureSelectorParams with DefaultParamsWritable {

  // Setters for params in FeatureSelectorParams
  def setParam*(value: ParamType): Learner = set(param, value).asInstanceOf[Learner]

  // PipelineStage and Estimator methods
  override def transformSchema(schema: StructType): StructType = {}
  override def fit(dataset: Dataset[_]): M = {}
  override def copy(extra: ParamMap): Learner

  // Abstract methods that are called from fit()
  protected def train(dataset: Dataset[_]): Array[(Int, Double)]
  protected def make(uid: String, selectedFeatures: Array[Int],
                     featureImportances: Map[String, Double]): M
}
Estimator[M]: needs to know what it shall return.
FeatureSelectorParams: defined later.
DefaultParamsWritable: makes all Params writable.
Setters return Learner for setter concatenation: mdl.setParam1(val1).setParam2(val2)...
transformSchema: performs input checking and fails fast. Can throw exceptions.
fit: learns from data and returns a Model. Here: calculates feature importances.
train and make: not necessary, but avoid code duplication.
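Why the setters return the Learner type parameter can be seen in a Spark-free miniature. The classes below are hypothetical stand-ins, not the package's code: the base class declares its setters once, and because each returns the concrete subclass type, calls to base-class and subclass setters can be chained in any order.

```scala
// Plain-Scala miniature of the self-typed ("F-bounded") setter pattern; not Spark code.
object SetterChainingSketch {
  abstract class Selector[Learner <: Selector[Learner]] {
    private var numTopFeatures: Int = 10
    private var outputCol: String = "selected"

    // Returning Learner (not Selector) keeps the concrete type through the chain.
    def setNumTopFeatures(value: Int): Learner = {
      numTopFeatures = value
      this.asInstanceOf[Learner]
    }
    def setOutputCol(value: String): Learner = {
      outputCol = value
      this.asInstanceOf[Learner]
    }
    def getNumTopFeatures: Int = numTopFeatures
    def getOutputCol: String = outputCol
  }

  // Hypothetical concrete selector with one extra parameter of its own.
  final class GiniLikeSelector extends Selector[GiniLikeSelector] {
    private var seed: Long = 0L
    def setSeed(value: Long): GiniLikeSelector = { seed = value; this }
    def getSeed: Long = seed
  }
}
```

With this, `new GiniLikeSelector().setNumTopFeatures(5).setSeed(42L).setOutputCol("gini")` compiles; if the base setters returned the base type instead, the call to `setSeed` after `setNumTopFeatures` would not.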
MAKING A TRANSFORMER – FEATURESELECTION EXAMPLE.
abstract class FeatureSelectorModel[M <: FeatureSelectorModel[M]](
    override val uid: String,
    val selectedFeatures: Array[Int],
    val featureImportances: Map[String, Double])
  extends Model[M] with FeatureSelectorParams with MLWritable {

  // Setters for params in FeatureSelectorParams
  def setFeaturesCol(value: String): this.type = set(featuresCol, value)

  // PipelineStage and Transformer methods
  override def transformSchema(schema: StructType): StructType = {}
  override def transform(dataset: Dataset[_]): DataFrame = {}

  def write: MLWriter
}
MLWritable: for persistence.
transformSchema: same idea as in the Estimator, but different tasks.
transform: transforms data.
write: adds persistence.
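At its core, the transform of a feature selection model keeps only the vector entries at the selected indices. The helper below is a hypothetical, per-row sketch of that idea using Spark's local linear algebra types; it produces a dense vector and ignores metadata, which a real implementation would also have to carry over.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}

object SelectIndicesSketch {
  // Keeps only the entries at the given indices, in the given order.
  def select(features: Vector, selectedFeatures: Array[Int]): Vector =
    Vectors.dense(selectedFeatures.map(i => features(i)))
}
```

Applied to the vector [0,1,0,1] with selected indices 0 and 1, this returns [0,1]. Spark's built-in `VectorSlicer` offers comparable index-based slicing for whole columns.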
GIVING YOUR NEW PIPELINESTAGE PARAMETERS.
import org.apache.spark.ml.param._
import org.apache.spark.ml.param.shared._

private[selection] trait FeatureSelectorParams extends Params
  with HasFeaturesCol with HasOutputCol with HasLabelCol {

  // Define params and getters here...
  final val param = new Param[Type](this, "name", "description")
  def getParam: Type = $(param)
}
Importing org.apache.spark.ml.param.shared._ is possible because the package is in org.apache.spark.ml.
Params work out of the box for several types, e.g.: DoubleParam, IntParam, BooleanParam, StringArrayParam, ...
Other types: need to implement jsonEncode and jsonDecode to maintain persistence.
Getters are shared between Estimator and Transformer. Setters are not, so that they can be concatenated.
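For a parameter type that is not covered out of the box, overriding jsonEncode and jsonDecode could look like the sketch below. The IntRangeParam class and its bracketed encoding are invented for illustration; only `Param`, `jsonEncode` and `jsonDecode` are Spark's own names.

```scala
import org.apache.spark.ml.param.Param

object CustomParamSketch {
  // Hypothetical parameter holding a pair of integers, e.g. an index range.
  class IntRangeParam(parent: String, name: String, doc: String)
      extends Param[(Int, Int)](parent, name, doc) {

    // Encode as a small JSON array such as [3,7].
    override def jsonEncode(value: (Int, Int)): String =
      s"[${value._1},${value._2}]"

    // Decode the same format; hand-rolled parsing keeps the sketch dependency-free.
    override def jsonDecode(json: String): (Int, Int) = {
      val parts = json.trim.stripPrefix("[").stripSuffix("]").split(",").map(_.trim.toInt)
      require(parts.length == 2, s"Expected two integers but got: $json")
      (parts(0), parts(1))
    }
  }
}
```

The two methods must be exact inverses of each other, since the value written with the stage's metadata is read back through jsonDecode on load.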
ADDING PERSISTENCE TO YOUR NEW PIPELINEMODEL.
• What has to be saved?
  • Metadata: uid, timestamp, version, …
  • Parameters
  • Learnt data: selectedFeatures & featureImportances
    • Create a DataFrame and use write.parquet(…)
Since we are in org.apache.spark.ml, use:
DefaultParamsWriter.saveMetadata()
DefaultParamsReader.loadMetadata()
• How do we do that?
  • Create a companion object FeatureSelectorModel, which offers the following classes:
    • abstract class FeatureSelectorModelReader[M <: FeatureSelectorModel[M]] extends MLReader[M] {…}
    • class FeatureSelectorModelWriter[M <: FeatureSelectorModel[M]](instance: M) extends MLWriter {…}
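The "learnt data" part of persistence can be sketched with public Spark APIs only. The object, function names and the single-row layout below are assumptions made for illustration; the metadata and parameter part (DefaultParamsWriter / DefaultParamsReader) is not shown because those helpers are only reachable from inside org.apache.spark.ml.

```scala
import org.apache.spark.sql.SparkSession

object LearntDataPersistenceSketch {
  // Writes the learnt data as a single-row Parquet dataset.
  def save(
      spark: SparkSession,
      path: String,
      selectedFeatures: Array[Int],
      featureImportances: Map[String, Double]): Unit = {
    import spark.implicits._
    Seq((selectedFeatures, featureImportances))
      .toDF("selectedFeatures", "featureImportances")
      .write.mode("overwrite").parquet(path)
  }

  // Reads the single row back and restores the Scala collections.
  def load(spark: SparkSession, path: String): (Array[Int], Map[String, Double]) = {
    val row = spark.read.parquet(path)
      .select("selectedFeatures", "featureImportances")
      .head()
    (row.getSeq[Int](0).toArray, row.getMap[String, Double](1).toMap)
  }
}
```

A writer class would call something like `save` from its save implementation, and the matching reader would call `load` and pass the result to the model's constructor.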
HOW TO USE SPARK-FEATURESELECTION.
import org.apache.spark.ml.feature.selection.filter._
import org.apache.spark.ml.feature.selection.util.VectorMerger
import org.apache.spark.ml.Pipeline
// Load data
val df = spark.read.parquet("path/to/data/train.parquet")
// Feature selectors offer different selection methods
val corSel = new CorrelationSelector().setInputCol("features").setOutputCol("cor")
val giniSel = new GiniSelector().setInputCol("features").setOutputCol("gini")
// VectorMerger merges vector columns and removes duplicates. Requires vector columns with names!
val merger = new VectorMerger().setInputCols(Array("cor", "gini")).setOutputCol("selected")
// Put everything in a pipeline and fit together
val plModel = new Pipeline().setStages(Array(corSel, giniSel, merger)).fit(df)
// Transform, then drop the original feature column
val dfT = plModel.transform(df).drop("features")
df:
features Label
[0,1,0,1] 1.0
[0,0,0,0] 0.0
[1,1,0,0] 0.0
[1,0,0,0] 1.0
fit:
Feature F1 F2 F3 F4
Score 1 0.9 0.7 0.0 0.5
Score 2 0.6 0.8 0.0 0.4
transform (dfT):
selected Label
[0,1] 1.0
[0,0] 0.0
[1,1] 0.0
[1,0] 1.0
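The merge step can be pictured without Spark. The following is a minimal sketch in plain Scala, not the package's implementation: NamedVector and merge are hypothetical names introduced here, standing in for a vector column that carries feature names. It only illustrates why names are required: duplicates are detected by name, and the first occurrence is kept.

```scala
import scala.collection.mutable

object VectorMergeSketch {
  // Hypothetical stand-in for a vector column with feature names attached.
  final case class NamedVector(names: Seq[String], values: Seq[Double])

  // Concatenate the vectors and keep only the first occurrence of each feature name.
  def merge(vectors: Seq[NamedVector]): NamedVector = {
    val merged = mutable.LinkedHashMap.empty[String, Double] // preserves insertion order
    for (v <- vectors; (name, value) <- v.names.zip(v.values))
      if (!merged.contains(name)) merged.update(name, value)
    NamedVector(merged.keys.toList, merged.values.toList)
  }
}
```

For the first row of df ([0,1,0,1]), assuming both selectors kept F1 and F2, merging the two outputs yields a single vector with F1 and F2 rather than four entries.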
SPARK-FEATURESELECTION PACKAGE.
 Offers selection based on:
   Gini coefficient
   Correlation coefficient
   Information gain
   L1-Logistic regression weights
   Random forest importances
 Utility stage:
   VectorMerger
 Three modes:
   Percentile (default)
   Fixed number of columns
   Compare to random column [4]
Find on GitHub: spark-FeatureSelection or on Spark-packages
[4] Stoppiglia et al., Ranking a Random Feature for Variable and Feature Selection
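The three modes can be sketched in plain Scala, independent of the package. This is an illustration under assumptions, not the package's API: the object and method names are hypothetical, the scores are the "Score 1" row from the usage slide, and the random-column mode is simplified to a comparison against the score of a single random probe feature.

```scala
object SelectionModes {
  type Scores = Seq[(String, Double)] // (feature name, score)

  // Highest score first; sortWith is stable, so ties keep their input order.
  private def ranked(scores: Scores): Scores =
    scores.sortWith((a, b) => a._2 > b._2)

  // Percentile mode: keep the top fraction of all features (rounded up).
  def byPercentile(scores: Scores, fraction: Double): Seq[String] = {
    val k = math.ceil(scores.size * fraction).toInt
    ranked(scores).take(k).map(_._1)
  }

  // Fixed-number mode: keep the n best features.
  def byFixedNumber(scores: Scores, n: Int): Seq[String] =
    ranked(scores).take(n).map(_._1)

  // Random-column mode (simplified): keep features that score strictly
  // higher than a random probe feature.
  def aboveRandomProbe(scores: Scores, probeScore: Double): Seq[String] =
    ranked(scores).filter(_._2 > probeScore).map(_._1)
}
```

With the scores F1 = 0.9, F2 = 0.7, F3 = 0.0, F4 = 0.5, a fraction of 0.5 and a probe score of 0.6 both keep F1 and F2, while a fixed number of 3 also keeps F4.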
PERFORMANCE.
[Chart: Area under normalized PRC and ROC. Bars for FS - 25 Trees, FS - 100 Trees, No FS - 25 Trees, No FS - 100 Trees; axis 0 to 1; series: Normalized Area under PRC, Area under ROC.]
[Chart: Time for FS methods and random forest. Bars for FS - 25 Trees, FS - 100 Trees, No FS - 25 Trees, No FS - 100 Trees; axis Time [s], 0 to 450; series: Multibucketizer, Gini, Correlation, Informationgain, Chi², Randomforest.]
[Chart: Correlation between feature importances from feature selection and random forest. Bars for Chi², Correlation, Gini, InfoGain; axis 0 to 1.2.]
LESSONS LEARNT.
 Know what your data looks like and where it is located! Example:
   Operations can succeed in local mode, but fail on a cluster.
   Use .persist(StorageLevel.MEMORY_ONLY) when the data fits into memory. The default for DataFrame .cache is MEMORY_AND_DISK.
 Do not reinvent the wheel for common methods → Consider putting your stages into the spark.ml namespace.
 Use the Spark Web UI to understand your Spark jobs.
QUESTIONS?
Marc.Kaminski@bmw.de
Bernhard.bb.Schegel@bmw.de
BACKUP.
DETERMINING WHERE YOUR PIPELINESTAGE SHOULD LIVE.
Own namespace
Pro: Safer solution
Con: Code duplication
vs.
org.apache.spark.ml.*
Pro: Less code duplication (sharedParams, SchemaUtils, …); easier to implement persistence
Con: More dangerous when not cautious
FEATURE SELECTION.
 Motivation:
   Many sparse features → feature space has to be reduced → select features that carry a lot of information for prediction.
   Feature selection (unlike feature transformation) enables understanding of which features have a high impact on the model.
Input:
F1 F2 Noise Label = F1 XOR F2
0 0 0 0
1 0 0 1
0 1 0 1
1 1 1 0
Feature selection (e.g. Correlation, InformationGain, RandomForest, etc.):
Feature Importance
Feature 1 0.7
Feature 2 0.7
Noise 0.2
Result:
F1 F2 Label = F1 XOR F2
0 0 0
1 0 1
0 1 1
1 1 0
FEATURE SELECTION.
Filter
Description: Evaluate intrinsic data properties
Advantages: Fast; scalable
Disadvantages: Ignore inter-feature dependencies; ignore interaction with classifier
Examples: Chi-squared; information gain; correlation
Wrapper
Description: Evaluate model performance of feature subset
Advantages: Accounts for feature dependencies; simple
Disadvantages: Classifier dependent selection; computationally expensive; risk of overfitting
Examples: Genetic algorithms; search algorithms
Embedded
Description: Feature selection is embedded in classifier training
Advantages: Accounts for feature dependencies
Disadvantages: Classifier dependent selection
Examples: L1-Logistic regression; random forest
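The "ignore inter-feature dependencies" drawback of filter methods can be made concrete on the XOR toy table from the backup slides. The sketch below is plain Scala written for this illustration (object and method names are hypothetical): it scores each column with the Pearson correlation against the label. Because the label depends on F1 and F2 only jointly, a univariate filter scores both at zero, while the Noise column gets a non-zero score on these four rows.

```scala
object XorFilterSketch {
  // Pearson correlation of two equally long sequences.
  // Returns NaN if either sequence is constant (zero variance).
  def pearson(x: Seq[Double], y: Seq[Double]): Double = {
    require(x.nonEmpty && x.size == y.size, "inputs must be non-empty and of equal length")
    val mx = x.sum / x.size
    val my = y.sum / y.size
    val cov = x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum
    val vx = x.map(a => (a - mx) * (a - mx)).sum
    val vy = y.map(b => (b - my) * (b - my)).sum
    cov / math.sqrt(vx * vy)
  }

  // Columns of the XOR table, in row order (0,0), (1,0), (0,1), (1,1).
  val f1    = Seq(0.0, 1.0, 0.0, 1.0)
  val f2    = Seq(0.0, 0.0, 1.0, 1.0)
  val noise = Seq(0.0, 0.0, 0.0, 1.0)
  val label = Seq(0.0, 1.0, 1.0, 0.0) // F1 XOR F2
}
```

On this table, pearson(f1, label) and pearson(f2, label) are both 0, and pearson(noise, label) is about -0.58. Methods that evaluate features jointly, such as the wrapper and embedded approaches, do not share this blind spot.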
CHALLENGES.
 Big plans for DataFrames when performing many operations on many columns → Can take a long time to build and optimize the DAG.
 Column limit for DataFrames, tracked in several Jira issues, especially SPARK-18016 → Hopefully fixed in Spark 2.3.0.
 Spark PipelineStages are not consistent in how they handle DataFrame schemas → Sometimes no schema is appended.

 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
 
Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...
Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...
Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...
 

Último

CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...amitlee9823
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramMoniSankarHazra
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...amitlee9823
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 

Último (20)

CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 

Building Custom ML PipelineStages for Feature Selection with Marc Kaminski

  • 1. Kaminski, Schlegel | Oct. 25, 2017 BUILDING CUSTOM ML PIPELINESTAGES FOR FEATURE SELECTION. SPARK SUMMIT EUROPE 2017.
  • 2. WHAT YOU WILL LEARN DURING THIS SESSION.
      - What data-driven car diagnostics look like at BMW.
      - Get a good understanding of the most important elements in Spark ML PipelineStages (using a feature selection example).
      - Attention: There will be Scala code examples!
      - How to use spark-FeatureSelection in your Spark ML Pipeline.
      - The impact of feature selection on learning performance and on the understanding of the big data black box.
      Building custom ml PipelineStages for feature selection | BMW | Oct. 25, 2017 Page 2
  • 3-7. MOTIVATION.
      - The #3 contributor to warranty incidents for OEMs is "no trouble found" cases. [1]
      - Potential root causes:
          - Manually formalized expert knowledge cannot cope with the vast number of possibilities.
          - Cars are getting more and more complex (hybridization, connectivity).
          - Less experienced workshop staff in evolving markets.
      - Improve three workflows at once by shifting from a manual to a data-driven approach:
          - Automatic knowledge generation.
          - Automatic workshop diagnostics.
          - Predictive maintenance.
      [1] BearingPoint, Global Automotive Warranty Survey Report 2009
  • 8-11. THE DATASET AND ITS CHALLENGES.
      Columns: MV_S, MV_0, …, MV_4000, MV_BGEE_TP, SC_IP, SC_1, SC_2, DTC_PU, DTC_1, DTC_2, CP, Label
      Sample rows (some cells are empty):
        44 3 … 20 -0.06 false 2 77 27 false true v.10 false
        72 36 … 73 -0.01 false 16 29 false v.10 false
        100 4 … 16 -0.02 true 45 1 false false v.10 false
        44 14 … 54 -0.02 true 76 false v.10 true
        95 34 … 73 -0.07 false 80 22 false false v.10 false
        16 50 … 33 -0.02 true 61 93 false false false v.11 false
        4 … 27 -0.09 false 59 91 false v.10 false
        48 … 20 -0.07 false 32 31 false v.10 false
        88 60 … 72 -0.01 true 1.9 96 53 true false true v.10 false
        27 14 … 88 false 73 14 false v.10 false
      Challenges:
      - High-dimensional feature space (7000+ features)
      - High sparsity
      - High class imbalance
  • 12-17. SPARK PIPELINE. Relational DWH → ETL → handling imbalance → preprocessing → cross-validation loop → model.
      - ETL: loading, imputation. Sample output (MV_S, SC_IP, DTC_PU, CP, Label):
          44, 2, false, v.10, false
          72, 1.5, true, v.11, false
          23, 1.4, false, v.11, false
          44, 1.5, true, v.10, true
      - Handling imbalance: SMOTE [2], undersampling, …
      - Preprocessing: StringIndexer, OneHotEncoder, VectorAssembler, discretization, Std. Scaler. Sample output (Features, Label):
          [0.34,0.8,0,1], 0.0
          [0.7,0.4,1,0], 0.0
          [0.31,0.35,1,0], 1.0
          [0.3,0.4,1.1], 1.0
      - Cross-validation loop: feature selection [3] (InformationGain, Correlation, ChiSquared, Random Forest Gini, L1 LogReg), followed by a classifier (Logistic Regression / Random Forest).
      [2] Chawla et al.: SMOTE: Synthetic Minority Over-sampling Technique
      [3] Schlegel et al.: Design and optimization of an autonomous feature selection pipeline for high dimensional, heterogeneous feature spaces.
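The preprocessing, feature selection and classifier stages above can be wired together roughly as follows. This is only a sketch under stated assumptions: it uses Spark's built-in `ChiSqSelector` as a stand-in for the selectors from the talk, borrows the column names from the sample table, assumes `Label` has already been converted to a numeric 0.0/1.0 column, and requires Spark ML on the classpath. It only builds the objects; nothing is fitted.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{ChiSqSelector, OneHotEncoder, StringIndexer, VectorAssembler}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

object PipelineSketch {
  // Preprocessing: index and one-hot encode the categorical column "CP".
  val indexer = new StringIndexer().setInputCol("CP").setOutputCol("CP_idx")
  val encoder = new OneHotEncoder().setInputCol("CP_idx").setOutputCol("CP_vec")

  // Assemble numeric and encoded columns into a single vector column.
  val assembler = new VectorAssembler()
    .setInputCols(Array("MV_S", "SC_IP", "CP_vec"))
    .setOutputCol("features")

  // Stand-in feature selector (assumption: not the selector used in the talk).
  // ChiSqSelector treats every distinct value as a category, so real use
  // would discretize continuous features first.
  val selector = new ChiSqSelector()
    .setFeaturesCol("features")
    .setLabelCol("Label")
    .setOutputCol("selected")
    .setNumTopFeatures(2)

  val classifier = new LogisticRegression()
    .setFeaturesCol("selected")
    .setLabelCol("Label")

  val pipeline = new Pipeline().setStages(
    Array[PipelineStage](indexer, encoder, assembler, selector, classifier))

  // Cross-validation loop over the number of selected features.
  val grid = new ParamGridBuilder()
    .addGrid(selector.numTopFeatures, Array(1, 2, 3))
    .build()

  val cv = new CrossValidator()
    .setEstimator(pipeline)
    .setEvaluator(new BinaryClassificationEvaluator().setLabelCol("Label"))
    .setEstimatorParamMaps(grid)
    .setNumFolds(3)
}
```

Calling `PipelineSketch.cv.fit(df)` on a DataFrame with these columns would then run selection and classification inside every fold, so the selector never sees the held-out data.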
  • 18-20. SPARK PIPELINE API.
      - PipelineStage: interface for usage in a Pipeline (data → ?).
      - Transformer: "transforms data" (data → data).
      - Estimator: "learns from data" (data → Transformer).
  • 21-23. ORG.APACHE.SPARK.ML.*
      - PipelineStage: interface for usage in a Pipeline.
      - Estimator: learns from data.
          - Pipeline: concatenates PipelineStages.
          - Predictor: interface for predictors.
          - FeatureSelector: interface for feature selection (FS).
      - Transformer: transforms data.
          - Model: fitted model.
              - PipelineModel: model from a Pipeline.
              - PredictionModel: model from a Predictor.
              - FeatureSelectionModel: model from a FeatureSelector.
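The relationships in this hierarchy can be checked against Spark's own classes at compile time. The following is a small sketch, assuming Spark ML is on the classpath; `StandardScaler` and `VectorAssembler` are picked only as familiar built-in examples and do not appear on the slide.

```scala
import org.apache.spark.ml.{Estimator, Model, Pipeline, PipelineModel, PipelineStage, Transformer}
import org.apache.spark.ml.feature.{StandardScaler, StandardScalerModel, VectorAssembler}

object HierarchySketch {
  // An Estimator is parameterised by the Model it produces when fitted.
  val scaler: Estimator[StandardScalerModel] = new StandardScaler()

  // A Pipeline is itself an Estimator; fitting it yields a PipelineModel.
  val pipeline: Estimator[PipelineModel] = new Pipeline()

  // A stage that has nothing to learn is a plain Transformer.
  val assembler: Transformer = new VectorAssembler()

  // Compile-time evidence for the arrows in the diagram: these lines only
  // compile if the stated subtype relationship holds.
  val modelIsModel = implicitly[StandardScalerModel <:< Model[StandardScalerModel]]
  val modelIsTransformer = implicitly[Model[StandardScalerModel] <:< Transformer]
  val estimatorIsStage = implicitly[Estimator[StandardScalerModel] <:< PipelineStage]
  val transformerIsStage = implicitly[Transformer <:< PipelineStage]
}
```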
  • 24-31. MAKING AN ESTIMATOR – FEATURESELECTION EXAMPLE.
        abstract class FeatureSelector[
            Learner <: FeatureSelector[Learner, M],
            M <: FeatureSelectorModel[M]]
          extends Estimator[M] with FeatureSelectorParams with DefaultParamsWritable {
          // Setters for params in FeatureSelectorParams
          def setParam*(value: ParamType): Learner = set(param, value).asInstanceOf[Learner]
        }
      Annotations on the class declaration: Defined later. Makes all Params writable. Needs to know what it shall return. For setter concatenation: mdl.setParam1(val1).setParam2(val2)...
      Diagram: transformSchema (= input validation) takes a DataFrame with schema and yields either the transformed schema or an exception ⚡. Example DataFrame (features, label): [0,1,0,1], 1.0; [1,0,0,0], 1.0. Schema shown: features: VectorColumn, selected: VectorColumn, label: Double.
      Attention: VectorColumns have metadata: name, type, range, etc.
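The self-referencing `Learner` type parameter is what makes setter concatenation work. It can be illustrated without Spark. This is a minimal sketch: `Selector`, `TopKSelector` and the string-keyed parameter map are hypothetical stand-ins, not Spark's `Params` machinery.

```scala
object FBoundedSketch {
  // Simplified stand-in for an Estimator base class (assumption: not Spark's API).
  // Learner is the concrete subclass, so setters can return the precise type.
  abstract class Selector[Learner <: Selector[Learner]] {
    protected var params: Map[String, Any] = Map.empty

    // Returning Learner (not Selector) keeps the concrete type through a chain.
    def setFeaturesCol(value: String): Learner = set("featuresCol", value)
    def setOutputCol(value: String): Learner = set("outputCol", value)

    protected def set(name: String, value: Any): Learner = {
      params += (name -> value)
      this.asInstanceOf[Learner]
    }

    def get(name: String): Option[Any] = params.get(name)
  }

  // A concrete selector adds its own setter on top of the inherited ones.
  final class TopKSelector extends Selector[TopKSelector] {
    def setK(value: Int): TopKSelector = set("k", value)
  }

  // setK is callable after setFeaturesCol only because the inherited setter
  // returned TopKSelector rather than the abstract base type.
  val selector: TopKSelector =
    new TopKSelector().setFeaturesCol("features").setK(10).setOutputCol("selected")
}
```

Had `setFeaturesCol` returned the base type, the call to `setK` in the chain would not compile.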
  • 32. MAKING AN ESTIMATOR – FEATURESELECTION EXAMPLE. abstract class FeatureSelector[ Learner <: FeatureSelector[Learner, M], M <: FeatureSelectorModel[M]] extends Estimator[M] with FeatureSelectorParams with DefaultParamsWritable { // Setters for params in FeatureSelectorParams def setParam*(value: ParamType): Learner = set(param, value).asInstanceOf[Learner] // PipelineStage and Estimator methods override def transformSchema(schema: StructType): StructType = {} } Performs input checking and fails fast. Canthrow exceptions. Building custom ml PipelineStages for feature selection | BMW | Oct. 25, 2017 Page 9 For setter concatenation: mdl.setParam1(val1) .setParam2(val2)... Defined later. Makes all Param writable.Needsto know, what it shall return.
  • 33. MAKING AN ESTIMATOR – FEATURESELECTION EXAMPLE. abstract class FeatureSelector[ Learner <: FeatureSelector[Learner, M], M <: FeatureSelectorModel[M]] extends Estimator[M] with FeatureSelectorParams with DefaultParamsWritable { // Setters for params in FeatureSelectorParams def setParam*(value: ParamType): Learner = set(param, value).asInstanceOf[Learner] // PipelineStage and Estimator methods override def transformSchema(schema: StructType): StructType = {} } Performs input checking and fails fast. Canthrow exceptions. fit (= learn from data) Dataset Transformer Building custom ml PipelineStages for feature selection | BMW | Oct. 25, 2017 Page 9 For setter concatenation: mdl.setParam1(val1) .setParam2(val2)... Defined later. Makes all Param writable.Needsto know, what it shall return.
  • 34. MAKING AN ESTIMATOR – FEATURESELECTION EXAMPLE. abstract class FeatureSelector[ Learner <: FeatureSelector[Learner, M], M <: FeatureSelectorModel[M]] extends Estimator[M] with FeatureSelectorParams with DefaultParamsWritable { // Setters for params in FeatureSelectorParams def setParam*(value: ParamType): Learner = set(param, value).asInstanceOf[Learner] // PipelineStage and Estimator methods override def transformSchema(schema: StructType): StructType = {} override def fit(dataset: Dataset[_]): M = {} override def copy(extra: ParamMap): Learner } Learns from data and returns a Model. Here: calculate feature importances. Building custom ml PipelineStages for feature selection | BMW | Oct. 25, 2017 Page 10 For setter concatenation: mdl.setParam1(val1) .setParam2(val2)... Defined later. Makes all Param writable. Performs input checking and fails fast. Canthrow exceptions. Needsto know, what it shall return.
  • 35. MAKING AN ESTIMATOR – FEATURESELECTION EXAMPLE. abstract class FeatureSelector[ Learner <: FeatureSelector[Learner, M], M <: FeatureSelectorModel[M]] extends Estimator[M] with FeatureSelectorParams with DefaultParamsWritable { // Setters for params in FeatureSelectorParams def setParam*(value: ParamType): Learner = set(param, value).asInstanceOf[Learner] // PipelineStage and Estimator methods override def transformSchema(schema: StructType): StructType = {} override def fit(dataset: Dataset[_]): M = {} override def copy(extra: ParamMap): Learner // Abstract methods that are called from fit() protected def train(dataset: Dataset[_]): Array[(Int, Double)] protected def make(uid: String, selectedFeatures: Array[Int], featureImportances: Map[String, Double]): M } Learns from data and returns a Model. Here: calculate feature importances. Building custom ml PipelineStages for feature selection | BMW | Oct. 25, 2017 Page 10 For setter concatenation: mdl.setParam1(val1) .setParam2(val2)... Defined later. Makes all Param writable. Performs input checking and fails fast. Canthrow exceptions. Needsto know, what it shall return.
  • 36. MAKING AN ESTIMATOR – FEATURESELECTION EXAMPLE. abstract class FeatureSelector[ Learner <: FeatureSelector[Learner, M], M <: FeatureSelectorModel[M]] extends Estimator[M] with FeatureSelectorParams with DefaultParamsWritable { // Setters for params in FeatureSelectorParams def setParam*(value: ParamType): Learner = set(param, value).asInstanceOf[Learner] // PipelineStage and Estimator methods override def transformSchema(schema: StructType): StructType = {} override def fit(dataset: Dataset[_]): M = {} override def copy(extra: ParamMap): Learner // Abstract methods that are called from fit() protected def train(dataset: Dataset[_]): Array[(Int, Double)] protected def make(uid: String, selectedFeatures: Array[Int], featureImportances: Map[String, Double]): M } Learns from data and returns a Model. Here: calculate feature importances. Not necessary, but avoids code duplication.Building custom ml PipelineStages for feature selection | BMW | Oct. 25, 2017 Page 10 For setter concatenation: mdl.setParam1(val1) .setParam2(val2)... Defined later. Makes all Param writable. Performs input checking and fails fast. Canthrow exceptions. Needsto know, what it shall return.
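The `Learner` type parameter is what lets setters defined once in the abstract class still return the concrete subclass, so calls can be chained. The following is a minimal, Spark-free sketch of that pattern. The names `Selector`, `BinnedSelector`, `setBins` and `describe` are hypothetical and are not part of spark-FeatureSelection or Spark.

```scala
// Hypothetical, Spark-free model of the self-typed setter pattern.
abstract class Selector[Learner <: Selector[Learner]] {
  protected var outputCol: String = "selected"
  protected var numTopFeatures: Int = 10

  // Shared setters return Learner, i.e. the concrete subclass type.
  def setOutputCol(value: String): Learner = {
    outputCol = value
    this.asInstanceOf[Learner]
  }

  def setNumTopFeatures(value: Int): Learner = {
    numTopFeatures = value
    this.asInstanceOf[Learner]
  }

  def describe: String
}

final class BinnedSelector extends Selector[BinnedSelector] {
  private var bins: Int = 2

  // Subclass-specific setter; stays reachable after a shared setter call.
  def setBins(value: Int): BinnedSelector = { bins = value; this }

  override def describe: String =
    s"outputCol=$outputCol, numTopFeatures=$numTopFeatures, bins=$bins"
}

object SetterChainingDemo {
  // setOutputCol returns BinnedSelector, so setBins can follow it directly.
  val selector: BinnedSelector =
    new BinnedSelector().setOutputCol("gini").setBins(4).setNumTopFeatures(3)
}
```

If the shared setters returned the abstract base type instead, `setBins` would no longer be callable after `setOutputCol` without a cast at the call site.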
  • 37-42. MAKING A TRANSFORMER – FEATURESELECTION EXAMPLE.
      abstract class FeatureSelectorModel[M <: FeatureSelectorModel[M]](
          override val uid: String,
          val selectedFeatures: Array[Int],
          val featureImportances: Map[String, Double])
        extends Model[M]
        with FeatureSelectorParams
        with MLWritable {                 // for persistence

        // Setters for params in FeatureSelectorParams
        def setFeaturesCol(value: String): this.type = set(featuresCol, value)

        // PipelineStage and Transformer methods
        // Same idea as in Estimator, but different tasks.
        override def transformSchema(schema: StructType): StructType = {}
        // Transforms data.
        override def transform(dataset: Dataset[_]): DataFrame = {}
        // Adds persistence.
        def write: MLWriter
      }
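Both the estimator and the model override `transformSchema` to validate input and fail fast. A minimal sketch of what such a check could look like is given below. The object `SchemaCheck` and the method `validateAndTransformSchema` are hypothetical names, and the actual checks in spark-FeatureSelection may differ. For brevity the sketch ignores the vector-column metadata (attribute names, types, ranges) mentioned on the slides.

```scala
import org.apache.spark.ml.linalg.SQLDataTypes.VectorType
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

object SchemaCheck {
  // Hypothetical helper: validates the input columns and appends the output column.
  def validateAndTransformSchema(
      schema: StructType,
      featuresCol: String,
      labelCol: String,
      outputCol: String): StructType = {
    // Fail fast with a readable message if an input column is missing or mistyped.
    require(schema.fieldNames.contains(featuresCol), s"Column '$featuresCol' is missing.")
    require(schema(featuresCol).dataType == VectorType, s"Column '$featuresCol' must be a vector column.")
    require(schema.fieldNames.contains(labelCol), s"Column '$labelCol' is missing.")
    require(schema(labelCol).dataType == DoubleType, s"Column '$labelCol' must be of type Double.")
    // Refuse to overwrite an existing column.
    require(!schema.fieldNames.contains(outputCol), s"Column '$outputCol' already exists.")
    // The transformed schema is the input schema plus the new vector column.
    StructType(schema.fields :+ StructField(outputCol, VectorType, nullable = false))
  }
}
```

Because `require` throws an `IllegalArgumentException`, a misconfigured pipeline fails during schema validation, before any data is processed.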
  • 43-46. GIVING YOUR NEW PIPELINESTAGE PARAMETERS.
      import org.apache.spark.ml.param._
      import org.apache.spark.ml.param.shared._   // possible, because the package is in org.apache.spark.ml

      private[selection] trait FeatureSelectorParams extends Params
        with HasFeaturesCol with HasOutputCol with HasLabelCol {
        // Define params and getters here...
        final val param = new Param[Type](this, "name", "description")
        def getParam: Type = $(param)
      }
    Params are available out of the box for several types, e.g. DoubleParam, IntParam, BooleanParam, StringArrayParam, ... For other types, you need to implement jsonEncode and jsonDecode to maintain persistence.
    Getters are shared between Estimator and Transformer. Setters are not, so that they can be concatenated.
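As a concrete instance of the schematic `param` / `getParam` pair, the sketch below defines one `DoubleParam` with a validator and a default in a shared trait, and puts the setter in the concrete class. The names `HasPercentile`, `percentile` and `PercentileHolder` are hypothetical and chosen only for illustration. They are not claimed to match the parameters of spark-FeatureSelection.

```scala
import org.apache.spark.ml.param.{DoubleParam, ParamMap, ParamValidators, Params}
import org.apache.spark.ml.util.Identifiable

// Hypothetical shared trait: the param and its getter live here.
trait HasPercentile extends Params {
  final val percentile = new DoubleParam(
    this,
    "percentile",
    "share of features to keep, in (0, 1]",
    ParamValidators.inRange(0.0, 1.0, lowerInclusive = false, upperInclusive = true))

  setDefault(percentile -> 0.5)

  def getPercentile: Double = $(percentile)
}

// Hypothetical concrete holder: the setter lives here so it can return this.type.
class PercentileHolder(override val uid: String) extends HasPercentile {
  def this() = this(Identifiable.randomUID("percentileHolder"))

  def setPercentile(value: Double): this.type = set(percentile, value)

  override def copy(extra: ParamMap): PercentileHolder = defaultCopy(extra)
}
```

Keeping the getter in the trait and the setter in the class mirrors the split described on the slide: getters can be shared, while each setter returns the type of the class it is called on.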
  • 47-50. ADDING PERSISTENCE TO YOUR NEW PIPELINEMODEL.
    - What has to be saved?
      - Metadata: uid, timestamp, version, …
      - Parameters
      - Learnt data: selectedFeatures & featureImportances
        - Create DataFrame and use write.parquet(…)
      Since we are in org.apache.spark.ml, use: DefaultParamsWriter.saveMetadata(), DefaultParamsReader.loadMetadata()
    - How do we do that?
      - Create companion object FeatureSelectorModel, which offers the following classes:
        - abstract class FeatureSelectorModelReader[M <: FeatureSelectorModel[M]] extends MLReader[M] {…}
        - class FeatureSelectorModelWriter[M <: FeatureSelectorModel[M]](instance: M) extends MLWriter {…}
  • 51-57. HOW TO USE SPARK-FEATURESELECTION.
      import org.apache.spark.ml.feature.selection.filter._
      import org.apache.spark.ml.feature.selection.util.VectorMerger
      import org.apache.spark.ml.Pipeline

      // Load data
      val df = spark.read.parquet("path/to/data/train.parquet")

      // Feature selectors. Offer different selection methods.
      val corSel = new CorrelationSelector().setInputCol("features").setOutputCol("cor")
      val giniSel = new GiniSelector().setInputCol("features").setOutputCol("gini")

      // VectorMerger merges VectorColumns and removes duplicates. Requires vector columns with names!
      val merger = new VectorMerger().setInputCols(Array("cor", "gini")).setOutputCol("selected")

      // Put everything in a pipeline and fit together
      val plModel = new Pipeline().setStages(Array(corSel, giniSel, merger)).fit(df)
      val dfT = plModel.transform(df).drop("features")
    Input df:
        features    Label
        [0,1,0,1]   1.0
        [0,0,0,0]   0.0
        [1,1,0,0]   0.0
        [1,0,0,0]   1.0
    Scores after fit:
        Feature   F1    F2    F3    F4
        Score 1   0.9   0.7   0.0   0.5
        Score 2   0.6   0.8   0.0   0.4
    Output dfT after transform:
        selected   Label
        [0,1]      1.0
        [0,0]      0.0
        [1,1]      0.0
        [1,0]      1.0
  • 58. SPARK-FEATURESELECTION PACKAGE.
    - Offers selection based on:
      - Gini coefficient
      - Correlation coefficient
      - Information gain
      - L1-Logistic regression weights
      - Random forest importances
    - Utility stage:
      - VectorMerger
    - Three modes:
      - Percentile (default)
      - Fixed number of columns
      - Compare to random column [4]
    Find on GitHub: spark-FeatureSelection or on Spark-packages
    [4]: Stoppiglia et al.: Ranking a Random Feature for Variable and Feature Selection
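The three modes can be read as three ways of turning per-feature scores into a set of kept column indices. The sketch below is a plain-Scala illustration under assumed semantics: "percentile" is taken to mean "keep the top share of features", and "compare to random column" is taken to mean "keep features scoring above a random feature's score". The object `SelectionModes` and its methods are hypothetical and do not reproduce the package's implementation.

```scala
object SelectionModes {
  /** (feature index, importance score) pairs, as a selector's training step might return them. */
  type Scores = Seq[(Int, Double)]

  /** Fixed number of columns: keep the n highest-scoring features. */
  def fixedNumber(scores: Scores, n: Int): Seq[Int] =
    scores.sortBy { case (_, score) => -score }.take(n).map(_._1).sorted

  /** Percentile (assumed semantics): keep the top `share` fraction of features, at least one. */
  def percentile(scores: Scores, share: Double): Seq[Int] = {
    require(share > 0.0 && share <= 1.0, "share must be in (0, 1]")
    fixedNumber(scores, math.max(1, math.ceil(scores.length * share).toInt))
  }

  /** Compare to random column (assumed semantics): keep features scoring above a random feature. */
  def aboveRandom(scores: Scores, randomScore: Double): Seq[Int] =
    scores.collect { case (index, score) if score > randomScore => index }.sorted
}
```

With the "Score 1" row from the usage slide, `Seq((0, 0.9), (1, 0.7), (2, 0.0), (3, 0.5))`, both `fixedNumber(scores, 2)` and `percentile(scores, 0.5)` keep the features at indices 0 and 1.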
  • 59-63. PERFORMANCE.
    [Figure: Area under normalized PRC and ROC. Series: normalized area under PRC, area under ROC. Configurations: FS - 25 Trees, FS - 100 Trees, No FS - 25 Trees, No FS - 100 Trees.]
    [Figure: Time for FS methods and random forest, time in s, for the same four configurations. Components: Multibucketizer, Gini, Correlation, Information gain, Chi², Random forest. The final build repeats this chart with only Multibucketizer, Gini, Information gain and Random forest.]
    [Figure: Correlation between feature importances from feature selection and random forest, for Chi², Correlation, Gini, InfoGain.]
  • 64. LESSONS LEARNT.
    - Know what your data looks like and where it is located! Example:
      - Operations can succeed in local mode, but fail on a cluster.
    - Use .persist(StorageLevel.MEMORY_ONLY) when data fits into memory. Default for .cache is MEMORY_AND_DISK.
    - Do not reinvent the wheel for common methods.
    - Consider putting your stages into the spark.ml namespace.
    - Use the Spark Web GUI to understand your Spark jobs.
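The persistence tip can be sketched as follows, assuming a DataFrame that fits into memory. The object `PersistSketch` and the method `persistInMemory` are hypothetical names for illustration.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  /** Persists df with MEMORY_ONLY and runs one action so the data is materialized. */
  def persistInMemory(df: DataFrame): DataFrame = {
    // Explicit storage level instead of .cache(), which uses MEMORY_AND_DISK for DataFrames.
    val persisted = df.persist(StorageLevel.MEMORY_ONLY)
    // Persisting is lazy; an action such as count() triggers the actual caching.
    persisted.count()
    persisted
  }
}
```

Calling `unpersist()` once the DataFrame is no longer needed releases the memory again.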
  • 65. QUESTIONS? Marc.Kaminski@bmw.de Bernhard.bb.Schegel@bmw.de
  • 66. BACKUP.
  • 67. DETERMINING WHERE YOUR PIPELINESTAGE SHOULD LIVE.
    Own namespace vs. org.apache.spark.ml.*
    - Own namespace
      - Pro: Safer solution
      - Con: Code duplication
    - org.apache.spark.ml.*
      - Pro: Less code duplication (sharedParams, SchemaUtils, …); easier to implement persistence
      - Con: More dangerous, when not cautious
  • 68-69. FEATURE SELECTION.
    - Motivation:
      - Many sparse features → feature space has to be reduced → select features that carry a lot of information for prediction.
    - Feature selection (unlike feature transformation) enables understanding of which features have a high impact on the model.
    Example input (Label = F1 XOR F2):
        F1   F2   Noise   Label
        0    0    0       0
        1    0    0       1
        0    1    0       1
        1    1    1       0
    Feature selection (e.g. Correlation, InformationGain, RandomForest etc.) yields feature importances:
        Feature 1   0.7
        Feature 2   0.7
        Noise       0.2
    Output after feature selection:
        F1   F2   Label
        0    0    0
        1    0    1
        0    1    1
        1    1    0
  • 70. FEATURE SELECTION.
    - Filter
      - Description: Evaluate intrinsic data properties
      - Advantages: Fast; scalable
      - Disadvantages: Ignore inter-feature dependencies; ignore interaction with classifier
      - Examples: Chi-squared; information gain; correlation
    - Wrapper
      - Description: Evaluate model performance of feature subset
      - Advantages: Feature dependencies; simple
      - Disadvantages: Classifier dependent selection; computationally expensive; risk of overfitting
      - Examples: Genetic algorithms; search algorithms
    - Embedded
      - Description: Feature selection is embedded in classifier training
      - Advantages: Feature dependencies
      - Disadvantages: Classifier dependent selection
      - Examples: L1-Logistic regression; random forest
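The filter disadvantage "ignore inter-feature dependencies" can be made concrete on the four-row XOR toy data from the backup slides. The sketch below computes a univariate Pearson correlation between each column and the label in plain Scala. The object `XorCorrelation` is hypothetical, and Pearson correlation is used here only as one simple filter score. It is not claimed to be how the importances shown on the slides were computed.

```scala
object XorCorrelation {
  /** Pearson correlation coefficient of two equally long, non-constant sequences. */
  def pearson(x: Seq[Double], y: Seq[Double]): Double = {
    require(x.length == y.length && x.nonEmpty, "inputs must be non-empty and equally long")
    val meanX = x.sum / x.length
    val meanY = y.sum / y.length
    val covariance = x.zip(y).map { case (a, b) => (a - meanX) * (b - meanY) }.sum
    val varianceX = x.map(a => math.pow(a - meanX, 2)).sum
    val varianceY = y.map(b => math.pow(b - meanY, 2)).sum
    covariance / math.sqrt(varianceX * varianceY)
  }

  // Columns of the toy table: rows are (0,0,0,0), (1,0,0,1), (0,1,0,1), (1,1,1,0).
  val f1: Seq[Double] = Seq(0.0, 1.0, 0.0, 1.0)
  val f2: Seq[Double] = Seq(0.0, 0.0, 1.0, 1.0)
  val noise: Seq[Double] = Seq(0.0, 0.0, 0.0, 1.0)
  val label: Seq[Double] = Seq(0.0, 1.0, 1.0, 0.0) // F1 XOR F2
}
```

On these four rows, F1 and F2 each have zero correlation with the label when looked at alone, while the noise column has a non-zero one. A score that looks at one feature at a time therefore cannot see that F1 and F2 together determine the label, which is the kind of dependency wrapper and embedded methods can pick up.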
  • 71. CHALLENGES.
    - Big plans for DataFrames when performing many operations on many columns
      - Can take a long time to build and optimize the DAG.
    - Column limit for DataFrames, reported in several Jira issues, especially: SPARK-18016
      - Hopefully fixed in Spark 2.3.0.
    - Spark PipelineStages are not consistent in how they handle DataFrame schemas
      - Sometimes no schema is appended.