"Comparing Variable Importance from Ensemble and Deep Learning Methods for AdTech Data"
Variable importance brings interpretability to popular black-box modeling techniques. In this talk we compare the performance of popular ensemble techniques, such as Random Forest and Gradient Boosting, with GLM. We observe that certain traits missed by GBM or Random Forest are magnified by non-linear techniques like Deep Learning.
We describe H2O, an open source, scalable machine learning package whose ease of use and speed make it more natural to compare models, pick the best of breed, and build ensembles. H2O's implementations of these algorithms closely track popular open source and textbook implementations.
18. Common ensemble techniques
Bayesian Classifiers
Ensembles of all hypotheses in hypothesis-space.
Bagging
Each model votes with equal weight.
Each model is trained on a randomly drawn subset of the training data.
Boosting
Incrementally builds an ensemble, one new model at a time.
H2O.ai
Machine Intelligence
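The bagging bullets above can be sketched in a few lines. This is a minimal illustration, not H2O's implementation: the decision-stump base learner, the toy data, and the names `train_stump`, `bagging`, and `vote` are all assumptions made for the example.

```python
import random
from collections import Counter

def train_stump(rows):
    """Fit a one-feature threshold classifier (decision stump) on (x, label) rows."""
    best = None
    for threshold, _ in rows:
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= threshold else right) != y for x, y in rows)
            if best is None or errors < best[0]:
                best = (errors, threshold, left, right)
    _, threshold, left, right = best
    return lambda x: left if x <= threshold else right

def bagging(rows, n_models, rng):
    """Train each model on a bootstrap sample drawn with replacement."""
    return [train_stump([rng.choice(rows) for _ in rows]) for _ in range(n_models)]

def vote(models, x):
    """Every model casts one equally weighted vote; the majority wins."""
    return Counter(m(x) for m in models).most_common(1)[0][0]

rng = random.Random(0)
data = [(x, int(x > 5)) for x in range(11)]  # toy data: label is 1 when x > 5
ensemble = bagging(data, n_models=25, rng=rng)
predictions = [vote(ensemble, x) for x, _ in data]
```

Boosting would differ in the training loop: each new model would be fitted with attention to what the models built so far got wrong, instead of on an independent random sample.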
25. Generalized Linear Modeling – Variable Importance
GLM, Elastic Net (Binomial)
Categorical expansion on Age
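A rough sketch of how a binomial GLM with an elastic net penalty can yield a variable ranking once a categorical column such as Age is expanded into indicator columns. Everything here is assumed for illustration: the Age buckets, the synthetic data, the plain (sub)gradient-descent solver, and the choice to rank by absolute coefficient size. It is not the solver or the data behind the slide's plots.

```python
import math
import random

def one_hot(value, levels):
    """Expand one categorical value into 0/1 indicator columns, one per level."""
    return [1.0 if value == level else 0.0 for level in levels]

def fit_logistic_elastic_net(X, y, alpha=0.5, lam=0.01, lr=0.1, epochs=500):
    """Binomial GLM fitted by (sub)gradient descent with an elastic net penalty.

    Penalty: lam * (alpha * |w|_1 + (1 - alpha) / 2 * |w|_2^2); intercept unpenalized.
    """
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * p, 0.0
        for row, target in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, row))
            err = 1.0 / (1.0 + math.exp(-z)) - target
            grad_b += err / n
            for j, xj in enumerate(row):
                grad_w[j] += err * xj / n
        for j in range(p):
            sign = (w[j] > 0) - (w[j] < 0)  # subgradient of |w_j|
            w[j] -= lr * (grad_w[j] + lam * (alpha * sign + (1 - alpha) * w[j]))
        b -= lr * grad_b
    return w, b

rng = random.Random(42)
levels = ["18-24", "25-34", "35+"]  # hypothetical Age buckets
names = ["Age." + level for level in levels] + ["noise"]
X, y = [], []
for _ in range(300):
    age = rng.choice(levels)
    X.append(one_hot(age, levels) + [rng.gauss(0.0, 1.0)])
    # Synthetic target: the youngest bucket responds far more often.
    y.append(1 if rng.random() < (0.9 if age == "18-24" else 0.1) else 0)

w, _ = fit_logistic_elastic_net(X, y)
# Rank expanded columns by absolute coefficient size (largest first).
importance = sorted(zip(names, (abs(c) for c in w)), key=lambda t: -t[1])
```

After expansion, each Age level gets its own coefficient, so the ranking can single out one bucket instead of reporting Age as a single variable.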
26. Variable Importance Comparison
Deep Learning (Tanh / 4-layer)
Deep Learning (Tanh / 3-layer)
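One way to compare importances across model families on equal footing is permutation importance, which only needs a model's predictions. It is swapped in here as a model-agnostic illustration and is not necessarily how the plots on this slide were produced. The toy interaction target and the stand-in `model` function are assumptions.

```python
import random

def accuracy(predict, X, y):
    """Fraction of rows where the prediction matches the target."""
    return sum(predict(row) == target for row, target in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, rng):
    """Accuracy drop when one column is shuffled, breaking its link to the target."""
    baseline = accuracy(predict, X, y)
    drops = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        rng.shuffle(column)
        X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
        drops.append(baseline - accuracy(predict, X_perm, y))
    return drops

rng = random.Random(7)
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(500)]
y = [int(a * b > 0) for a, b, _ in X]         # toy non-linear (XOR-like) target
model = lambda row: int(row[0] * row[1] > 0)  # stand-in for any trained model
drops = permutation_importance(model, X, y, rng)
```

The target depends only on the interaction of the first two columns, so a model that captures the interaction shows large drops for both and none for the third. A purely additive linear model would struggle to pick up this kind of signal at all.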
27. Every generation needs to invent its math.
Our data, our tools!
34. Sparkling Water Application Life Cycle
Diagram: a Sparkling App jar file is submitted with spark-submit to the Spark Master JVM, which fans out to three Spark Worker JVMs; arrows are labeled (1) to (3).
(1) User submits App to Spark cluster Master node
(2) App distributed to Spark cluster Worker nodes
(3) Spark Executor JVMs start for App
(4) H2O instance starts within each Executor JVM
(5) App’s Scala main program runs
Diagram: Sparkling Water Cluster of three Spark Executor JVMs, each containing an H2O instance; labeled (4).
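Step (1) amounts to a `spark-submit` invocation. The sketch below only assembles the command line; `--class` and `--master` are standard `spark-submit` options, while the Master URL, main class, and jar name are hypothetical placeholders.

```python
import shlex

def spark_submit_command(master_url, main_class, app_jar, app_args=()):
    """Build the spark-submit argument list for step (1)."""
    return ["spark-submit", "--class", main_class, "--master", master_url,
            app_jar, *app_args]

cmd = spark_submit_command(
    master_url="spark://master-host:7077",  # hypothetical standalone Master URL
    main_class="com.example.SparklingApp",  # hypothetical Scala main class
    app_jar="sparkling-app.jar",            # hypothetical application jar
)
print(shlex.join(cmd))
```

Steps (2) to (5) then happen on the cluster without further user action: the jar is shipped to the workers, executors start, and the app's main program runs with H2O inside each executor JVM.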
35. Sparkling Water Data Distribution
Diagram: a data source (e.g. HDFS) feeds a Sparkling Water Cluster of Spark Executor JVMs, each running H2O; arrows are labeled (1) to (3).
(1) Use Spark SQL to read data into a Spark RDD
(2) Convert Spark RDD to H2O RDD; H2O RDD is column-based and highly compressed
(Not shown) Run modeling and prediction workflows with H2O
(3) Convert H2O RDD (e.g. predictions) back to Spark RDD
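To give a feel for what "column-based and highly compressed" in step (2) can mean, here is a toy sketch that transposes row records into columns and compresses an integer column. The slide does not describe H2O's actual encodings; the offset-plus-deflate scheme and all function names below are assumptions for illustration only.

```python
import array
import zlib

def rows_to_columns(rows):
    """Transpose row-oriented records into one list per column."""
    return [list(column) for column in zip(*rows)]

def compress_column(values):
    """Store integers as offsets from the column minimum in the narrowest
    fitting unsigned type, then deflate the packed bytes."""
    base = min(values)
    span = max(values) - base
    typecode = "B" if span < 2**8 else "H" if span < 2**16 else "Q"
    packed = array.array(typecode, (v - base for v in values)).tobytes()
    return base, typecode, zlib.compress(packed)

def decompress_column(base, typecode, blob):
    """Invert compress_column and return the original integers."""
    offsets = array.array(typecode)
    offsets.frombytes(zlib.decompress(blob))
    return [base + o for o in offsets]

rows = [(1000 + i % 7, i % 3) for i in range(1000)]  # toy row-based records
columns = rows_to_columns(rows)
base, typecode, blob = compress_column(columns[0])
restored = decompress_column(base, typecode, blob)
```

Keeping each column together means values of one type and similar range sit side by side, which is what makes narrow encodings and general-purpose compression effective.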
37. H2O – The Killer-App for Spark
Diagram: Sparkling Water stack with the labels MLlib, H2O, SQL; H2ORDD; HDFS = DATA.
In-Memory: Big Data, Columnar
ML: 100x faster Algos
R: CRAN, API, fast engine
API: Spark API, Java MM
Community: Devs, Data Science