SlideShare uma empresa Scribd logo
1 de 29
Baixar para ler offline
Understanding Parallelization
of Machine Learning Algorithms
in Apache Spark™
About Richard
Richard Garris has spent 15 years
advising customers in data management
and analytics. As a Director of Solutions
Architecture at Databricks the past 3
years, he works closely with data
scientists to build machine learning
pipelines and advise companies on best
practices in advanced analytics. He has
advanced degrees from Carnegie Mellon
and The Ohio State University.
2
Agenda
●Machine Learning Overview
●Machine Learning as an Optimization Problem
●Logistic Regression
●Single Machine vs Distributed Machine Learning
●Other Algorithms e.g. Random Forest
●Other ML Parallelization Techniques
●Demonstration
●Q&A
3
AI is Changing the World
AlphaGoSelf-driving cars Alexa/Google Home
Data is the Fuel that Drives ML
Andrew Ng calls the algorithms the
“rocket ship” and the data “the fuel that
you feed machine learning” to build
deep learning applications
*Source: Buckham & Duffy
5
What is Machine Learning?
Supervised Machine Learning
A set of techniques that, given a set of examples, attempts to
predict outcomes for future values.
The set of techniques are the algorithms which includes both
transformational algorithms (featurizers) and machine learning
methods (estimators)
Examples are used to “train” the algorithm to produce the Model
A Model is a Mathematical Function
 
Example Bank Marketing
Who has heard of Signet Bank?
“data on the specific transactions of a bank’s customers can
improve models for deciding what product offers to make.1
”
(1) Data Science for Business
Provost, Foster
O’Reilly 2013
8
But …
given a set of examples how do I get training data to the
function (i.e. model)?
9
I want to
predict if my
customer
will do this
Given this data ...
I want to create this
Logistic regression
function
Machine Learning is an Optimization Problem
● Given the training data how do I as quickly as possible come up with
the most optimal equation given training data X that solves for Y
● In generalized linear regression, the goal is to learn the coefficients
(i.e. weights) based on the examples and fit a line based on the
points
● In decision tree learning, we are learning where to split on features
● Time can be saved by minimizing the amount of work and
optimization is achieved when the result produced yields the least
error on the test set
● Getting to the optimal model is called convergence
10
Error
The goal of the optimization is to either maximize a positive
metric like the area under the curve (AUC) or minimize a negative
metric (like error.)
This goal is called the objective function (or the loss / cost
function.)
Over each iteration
the AUC goes up.
11
Train - Tune - Test
Data is split into samples
Training is used to fit the model(s)
The tuning (or validation) set is used to find the optimal
parameters
The test set is used to
evaluate the final
model
12
Source: https://towardsdatascience.com/train-validation-and-test-sets-72cb40cba9e7
Amount of Work
● Big O Notation - provides a worst case scenario for how long an algorithm will take
relative to the amount of data (or number of iterations)
● In Spark MLLib we always want scalable machine learning so that the runtime
doesn’t explode when we get massive datasets
● Example: finding a single element in a set of 1M values (say it takes 0.001 seconds per
comparison):
○ Item by Item Search - O(N):
■ Worst Case - 16 minutes
○ Binary Search - O(Log N)
■ Worst Case - 6 milliseconds
○ Hash Search - O (1)
■ Worst Case - 1 millisecond
13
Data Representation
Pandas - DataFrames represented on a single machine as Python
data structures
RDDs - Spark’s foundational structure Resilient Distributed
Dataset is represented as a reference to partitioned data without
types
DataFrames - Spark’s strongly typed optimized distributed
collection of rows
All ML Algorithms need to represent data as vectors (array of
Double values) in order to train machine learning algorithms
14
ML Pipelines
Train model
Evaluate
Load data
Extract features
A very simple pipeline
ML Pipelines
Train model 1
Evaluate
Datasource 1
Datasource 2
Datasource 3
Extract featuresExtract features
Feature transform 1
Feature transform 2
Feature transform 3
Train model 2
Ensemble
A real pipeline!
Parallelization Technique # 1
Feature engineer in Spark but train the model on a single
driver machine
○ Gather source data, perform joins, do complex ETL and feature
engineer
○ Works well when you have big data for your transactions but smaller
aggregated or subsamples once you have features to train on
○ May need to use a larger size driver node that can fit all the training
features into memory (may need to adjust
spark.driver.maxResultSize)
○ Use Scikit-learn, R, TensorFlow (optionally with GPU) or other single
machine learners
17
Parallelization in ML Pipelines
Train model 1
Evaluate
Datasource 1
Datasource 2
Datasource 3
Extract featuresExtract features
Feature transform 1
Feature transform 2
Feature transform 3
Train model 2
Ensemble
You can parallelize
here
Single machine here
Parallelization Technique # 2
Train the entire model with distributed machine learning
○ MLLib built for large scale model creation (millions of instances)
19
○ You will create a single model with all this
data (you use model selection techniques
like Train Tune Validation splits or a cross
validator)
○ Spark Deep Learning, Horvod (for
TensorFlow), XGBoost (the JVM version)
and H2O also support distributed ML
Spark Architecture
20
Parallelization Technique # 3
Create many models one on each worker
○ Works for creating one model per customer, one model per set of
features for feature selection, one model per set of hyperparameters
for tuning…
○ Spark-sklearn helps facilitate creating one model per worker
○ You can as of Spark 2.3 use the cross validation parallelism parameter
to run do model selection / hyperparameter tuning in parallel in
MLLib as well
○ Broadcast your data but use Spark to train models on each worker
with different sets of hyperparameters to find the optimal model
21
Single Machine Learning
In a single machine algorithm such as scikit-learn, the data is
loaded into the memory of a single machine (usually as Pandas or
NumPy vectors)
The single machine algorithm iterates (loops) over the data in
memory and using a single thread (sometimes multicore) takes a
“guess” at the equation
The error is calculated and if the error from the previous iteration
is less than the tolerance or the iterations are >= max iterations
then STOP
22
Distributed Machine Learning
In Distributed Machine Learning like Spark MLlib, the data is represented as an RDD or
a DataFrame typically in distributed file storage such as S3, Blob Store or HDFS
As a best practice, you should cache the data in memory across the nodes so that
multiple iterations don’t have to query the disk each time
The calculation of the gradients is distributed across all the nodes using Spark’s
distributed compute engine (similar to MapReduce)
After each iteration, the results return to the driver and iterations continue until either
max_iter or tol is reached
23
Algorithms
Logistic Regression Basics
The goal of logistic regression is to fit a sigmoid curve over
the dataset.
Why not use a standard straight line?
Probabilities have to between 0 and 1
25
Logistic Regression Solvers
There are many methods to finding an
optimal equation in Logistic regression
● Stochastic Gradient Descent
● LBFGS
● Newtonian
The basic premise is at each step you
“guess” a new model by taking a step
toward reducing the error or increasing
the AUC
26
Random Forest
Random Forest uses a collection of weighted decision trees trained on
random subsample of data to “vote” on the label
Random Forest is a general purpose algorithm that works well for larger
datasets due to its ability to find non-linearities in the data
Because Random Forest uses a number of trees it “averages out” outliers
and noise and therefore is resistant to overfitting
Each decision tree has depth and a number of decision trees in the forest
27
Single vs Distributed Random Forest
In Spark because each tree is trained on a subsample of data (and
features) and do not depend on the other trees many trees can be
trained in parallel.
The selection of features to split on is also distributed
Splits are chosen based on information gain
Tree creation is stopped when maxDepth is
hit, information gain is < minInfoGain or
no split candidates have minInstancesPerNode
28
Demonstration
https://community.cloud.databricks.com/?o=15269310110807
74#notebook/2040911831196221/command/174442750355907
3
Databricks Community Edition or Free Trial
https://databricks.com/try-databricks
Additional Questions?
Contact us at http://go.databricks.com/contact-databricks
2
9

Mais conteúdo relacionado

Mais procurados

PyTorch for Deep Learning Practitioners
PyTorch for Deep Learning PractitionersPyTorch for Deep Learning Practitioners
PyTorch for Deep Learning PractitionersBayu Aldi Yansyah
 
2.5 backpropagation
2.5 backpropagation2.5 backpropagation
2.5 backpropagationKrish_ver2
 
Ml8 boosting and-stacking
Ml8 boosting and-stackingMl8 boosting and-stacking
Ml8 boosting and-stackingankit_ppt
 
NAIVE BAYES CLASSIFIER
NAIVE BAYES CLASSIFIERNAIVE BAYES CLASSIFIER
NAIVE BAYES CLASSIFIERKnoldus Inc.
 
Introduction to XGBoost
Introduction to XGBoostIntroduction to XGBoost
Introduction to XGBoostJoonyoung Yi
 
Machine Learning
Machine LearningMachine Learning
Machine LearningShrey Malik
 
Parametric & Non-Parametric Machine Learning (Supervised ML)
Parametric & Non-Parametric Machine Learning (Supervised ML)Parametric & Non-Parametric Machine Learning (Supervised ML)
Parametric & Non-Parametric Machine Learning (Supervised ML)Rehan Guha
 
Machine learning session4(linear regression)
Machine learning   session4(linear regression)Machine learning   session4(linear regression)
Machine learning session4(linear regression)Abhimanyu Dwivedi
 
K-Means, its Variants and its Applications
K-Means, its Variants and its ApplicationsK-Means, its Variants and its Applications
K-Means, its Variants and its ApplicationsVarad Meru
 
Scikit Learn Tutorial | Machine Learning with Python | Python for Data Scienc...
Scikit Learn Tutorial | Machine Learning with Python | Python for Data Scienc...Scikit Learn Tutorial | Machine Learning with Python | Python for Data Scienc...
Scikit Learn Tutorial | Machine Learning with Python | Python for Data Scienc...Edureka!
 
Machine Learning
Machine LearningMachine Learning
Machine LearningRahul Kumar
 
Lect6 Association rule & Apriori algorithm
Lect6 Association rule & Apriori algorithmLect6 Association rule & Apriori algorithm
Lect6 Association rule & Apriori algorithmhktripathy
 
Association rule mining and Apriori algorithm
Association rule mining and Apriori algorithmAssociation rule mining and Apriori algorithm
Association rule mining and Apriori algorithmhina firdaus
 

Mais procurados (20)

PyTorch for Deep Learning Practitioners
PyTorch for Deep Learning PractitionersPyTorch for Deep Learning Practitioners
PyTorch for Deep Learning Practitioners
 
2.5 backpropagation
2.5 backpropagation2.5 backpropagation
2.5 backpropagation
 
Ml8 boosting and-stacking
Ml8 boosting and-stackingMl8 boosting and-stacking
Ml8 boosting and-stacking
 
Clustering
ClusteringClustering
Clustering
 
Isolation Forest
Isolation ForestIsolation Forest
Isolation Forest
 
Support Vector machine
Support Vector machineSupport Vector machine
Support Vector machine
 
NAIVE BAYES CLASSIFIER
NAIVE BAYES CLASSIFIERNAIVE BAYES CLASSIFIER
NAIVE BAYES CLASSIFIER
 
Topic Modeling
Topic ModelingTopic Modeling
Topic Modeling
 
Introduction to XGBoost
Introduction to XGBoostIntroduction to XGBoost
Introduction to XGBoost
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Ensemble learning
Ensemble learningEnsemble learning
Ensemble learning
 
Parametric & Non-Parametric Machine Learning (Supervised ML)
Parametric & Non-Parametric Machine Learning (Supervised ML)Parametric & Non-Parametric Machine Learning (Supervised ML)
Parametric & Non-Parametric Machine Learning (Supervised ML)
 
Machine learning session4(linear regression)
Machine learning   session4(linear regression)Machine learning   session4(linear regression)
Machine learning session4(linear regression)
 
K-Means, its Variants and its Applications
K-Means, its Variants and its ApplicationsK-Means, its Variants and its Applications
K-Means, its Variants and its Applications
 
What is Machine Learning
What is Machine LearningWhat is Machine Learning
What is Machine Learning
 
Scikit Learn Tutorial | Machine Learning with Python | Python for Data Scienc...
Scikit Learn Tutorial | Machine Learning with Python | Python for Data Scienc...Scikit Learn Tutorial | Machine Learning with Python | Python for Data Scienc...
Scikit Learn Tutorial | Machine Learning with Python | Python for Data Scienc...
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Lect6 Association rule & Apriori algorithm
Lect6 Association rule & Apriori algorithmLect6 Association rule & Apriori algorithm
Lect6 Association rule & Apriori algorithm
 
Association rule mining and Apriori algorithm
Association rule mining and Apriori algorithmAssociation rule mining and Apriori algorithm
Association rule mining and Apriori algorithm
 
Ensemble methods
Ensemble methods Ensemble methods
Ensemble methods
 

Semelhante a Understanding Parallelization of Machine Learning Algorithms in Apache Spark™

Spark ml streaming
Spark ml streamingSpark ml streaming
Spark ml streamingAdam Doyle
 
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2Anant Corporation
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnBenjamin Bengfort
 
A Survey of Machine Learning Methods Applied to Computer ...
A Survey of Machine Learning Methods Applied to Computer ...A Survey of Machine Learning Methods Applied to Computer ...
A Survey of Machine Learning Methods Applied to Computer ...butest
 
Web Traffic Time Series Forecasting
Web Traffic  Time Series ForecastingWeb Traffic  Time Series Forecasting
Web Traffic Time Series ForecastingBillTubbs
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaChetan Khatri
 
House price prediction
House price predictionHouse price prediction
House price predictionSabahBegum
 
Data clustering using map reduce
Data clustering using map reduceData clustering using map reduce
Data clustering using map reduceVarad Meru
 
Presentation
PresentationPresentation
Presentationbutest
 
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...Universitat Politècnica de Catalunya
 
Deep Learning with Apache Spark: an Introduction
Deep Learning with Apache Spark: an IntroductionDeep Learning with Apache Spark: an Introduction
Deep Learning with Apache Spark: an IntroductionEmanuele Bezzi
 
Technical_Report_on_ML_Library
Technical_Report_on_ML_LibraryTechnical_Report_on_ML_Library
Technical_Report_on_ML_LibrarySaurabh Chauhan
 
FlinkML: Large Scale Machine Learning with Apache Flink
FlinkML: Large Scale Machine Learning with Apache FlinkFlinkML: Large Scale Machine Learning with Apache Flink
FlinkML: Large Scale Machine Learning with Apache FlinkTheodoros Vasiloudis
 
Towards Increasing Predictability of Machine Learning Research
Towards Increasing Predictability of Machine Learning ResearchTowards Increasing Predictability of Machine Learning Research
Towards Increasing Predictability of Machine Learning ResearchArtemSunfun
 
Biomedical Signal and Image Analytics using MATLAB
Biomedical Signal and Image Analytics using MATLABBiomedical Signal and Image Analytics using MATLAB
Biomedical Signal and Image Analytics using MATLABCodeOps Technologies LLP
 
Machine Learning for Dummies (without mathematics)
Machine Learning for Dummies (without mathematics)Machine Learning for Dummies (without mathematics)
Machine Learning for Dummies (without mathematics)ActiveEon
 

Semelhante a Understanding Parallelization of Machine Learning Algorithms in Apache Spark™ (20)

Spark ml streaming
Spark ml streamingSpark ml streaming
Spark ml streaming
 
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-Learn
 
A Survey of Machine Learning Methods Applied to Computer ...
A Survey of Machine Learning Methods Applied to Computer ...A Survey of Machine Learning Methods Applied to Computer ...
A Survey of Machine Learning Methods Applied to Computer ...
 
unit 2 hpc.pptx
unit 2 hpc.pptxunit 2 hpc.pptx
unit 2 hpc.pptx
 
Web Traffic Time Series Forecasting
Web Traffic  Time Series ForecastingWeb Traffic  Time Series Forecasting
Web Traffic Time Series Forecasting
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
 
House price prediction
House price predictionHouse price prediction
House price prediction
 
Data clustering using map reduce
Data clustering using map reduceData clustering using map reduce
Data clustering using map reduce
 
Presentation
PresentationPresentation
Presentation
 
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
 
Deep Learning with Apache Spark: an Introduction
Deep Learning with Apache Spark: an IntroductionDeep Learning with Apache Spark: an Introduction
Deep Learning with Apache Spark: an Introduction
 
Technical_Report_on_ML_Library
Technical_Report_on_ML_LibraryTechnical_Report_on_ML_Library
Technical_Report_on_ML_Library
 
Unit 1 dsa
Unit 1 dsaUnit 1 dsa
Unit 1 dsa
 
FlinkML: Large Scale Machine Learning with Apache Flink
FlinkML: Large Scale Machine Learning with Apache FlinkFlinkML: Large Scale Machine Learning with Apache Flink
FlinkML: Large Scale Machine Learning with Apache Flink
 
Towards Increasing Predictability of Machine Learning Research
Towards Increasing Predictability of Machine Learning ResearchTowards Increasing Predictability of Machine Learning Research
Towards Increasing Predictability of Machine Learning Research
 
AutoML lectures (ACDL 2019)
AutoML lectures (ACDL 2019)AutoML lectures (ACDL 2019)
AutoML lectures (ACDL 2019)
 
Using Apache Spark with IBM SPSS Modeler
Using Apache Spark with IBM SPSS ModelerUsing Apache Spark with IBM SPSS Modeler
Using Apache Spark with IBM SPSS Modeler
 
Biomedical Signal and Image Analytics using MATLAB
Biomedical Signal and Image Analytics using MATLABBiomedical Signal and Image Analytics using MATLAB
Biomedical Signal and Image Analytics using MATLAB
 
Machine Learning for Dummies (without mathematics)
Machine Learning for Dummies (without mathematics)Machine Learning for Dummies (without mathematics)
Machine Learning for Dummies (without mathematics)
 

Mais de Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Mais de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Último

Pharm-D Biostatistics and Research methodology
Pharm-D Biostatistics and Research methodologyPharm-D Biostatistics and Research methodology
Pharm-D Biostatistics and Research methodologyAnusha Are
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech studentsHimanshiGarg82
 
ManageIQ - Sprint 236 Review - Slide Deck
ManageIQ - Sprint 236 Review - Slide DeckManageIQ - Sprint 236 Review - Slide Deck
ManageIQ - Sprint 236 Review - Slide DeckManageIQ
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park masabamasaba
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension AidPhilip Schwarz
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...Shane Coughlan
 
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionOnePlan Solutions
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesVictorSzoltysek
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is insideshinachiaurasa2
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 
LEVEL 5 - SESSION 1 2023 (1).pptx - PDF 123456
LEVEL 5   - SESSION 1 2023 (1).pptx - PDF 123456LEVEL 5   - SESSION 1 2023 (1).pptx - PDF 123456
LEVEL 5 - SESSION 1 2023 (1).pptx - PDF 123456KiaraTiradoMicha
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnAmarnathKambale
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...Health
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdfThe Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdfayushiqss
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 

Último (20)

Pharm-D Biostatistics and Research methodology
Pharm-D Biostatistics and Research methodologyPharm-D Biostatistics and Research methodology
Pharm-D Biostatistics and Research methodology
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
ManageIQ - Sprint 236 Review - Slide Deck
ManageIQ - Sprint 236 Review - Slide DeckManageIQ - Sprint 236 Review - Slide Deck
ManageIQ - Sprint 236 Review - Slide Deck
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
 
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
LEVEL 5 - SESSION 1 2023 (1).pptx - PDF 123456
LEVEL 5   - SESSION 1 2023 (1).pptx - PDF 123456LEVEL 5   - SESSION 1 2023 (1).pptx - PDF 123456
LEVEL 5 - SESSION 1 2023 (1).pptx - PDF 123456
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdfThe Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 

Understanding Parallelization of Machine Learning Algorithms in Apache Spark™

  • 1. Understanding Parallelization of Machine Learning Algorithms in Apache Spark™
  • 2. About Richard Richard Garris has spent 15 years advising customers in data management and analytics. As a Director of Solutions Architecture at Databricks the past 3 years, he works closely with data scientists to build machine learning pipelines and advise companies on best practices in advanced analytics. He has advanced degrees from Carnegie Mellon and The Ohio State University. 2
  • 3. Agenda ●Machine Learning Overview ●Machine Learning as an Optimization Problem ●Logistic Regression ●Single Machine vs Distributed Machine Learning ●Other Algorithms e.g. Random Forest ●Other ML Parallelization Techniques ●Demonstration ●Q&A 3
  • 4. AI is Changing the World AlphaGoSelf-driving cars Alexa/Google Home
  • 5. Data is the Fuel that Drives ML Andrew Ng calls the algorithms the “rocket ship” and the data “the fuel that you feed machine learning” to build deep learning applications *Source: Buckham & Duffy 5
  • 6. What is Machine Learning? Supervised Machine Learning A set of techniques that, given a set of examples, attempts to predict outcomes for future values. The set of techniques are the algorithms which includes both transformational algorithms (featurizers) and machine learning methods (estimators) Examples are used to “train” the algorithm to produce the Model
  • 7. A Model is a Mathematical Function  
  • 8. Example Bank Marketing Who has heard of Signet Bank? “data on the specific transactions of a bank’s customers can improve models for deciding what product offers to make.1 ” (1) Data Science for Business Provost, Foster O’Reilly 2013 8
  • 9. But … given a set of examples how do I get training data to the function (i.e. model)? 9 I want to predict if my customer will do this Given this data ... I want to create this Logistic regression function
  • 10. Machine Learning is an Optimization Problem ● Given the training data how do I as quickly as possible come up with the most optimal equation given training data X that solves for Y ● In generalized linear regression, the goal is to learn the coefficients (i.e. weights) based on the examples and fit a line based on the points ● In decision tree learning, we are learning where to split on features ● Time can be saved by minimizing the amount of work and optimization is achieved when the result produced yields the least error on the test set ● Getting to the optimal model is called convergence 10
  • 11. Error The goal of the optimization is to either maximize a positive metric like the area under the curve (AUC) or minimize a negative metric (like error.) This goal is called the objective function (or the loss / cost function.) Over each iteration the AUC goes up. 11
  • 12. Train - Tune - Test Data is split into samples Training is used to fit the model(s) The tuning (or validation) set is used to find the optimal parameters The test set is used to evaluate the final model 12 Source: https://towardsdatascience.com/train-validation-and-test-sets-72cb40cba9e7
  • 13. Amount of Work ● Big O Notation - provides a worst case scenario for how long an algorithm will take relative to the amount of data (or number of iterations) ● In Spark MLLib we always want scalable machine learning so that the runtime doesn’t explode when we get massive datasets ● Example: finding a single element in a set of 1M values (say it takes 0.001 seconds per comparison): ○ Item by Item Search - O(N): ■ Worst Case - 16 minutes ○ Binary Search - O(Log N) ■ Worst Case - 6 milliseconds ○ Hash Search - O (1) ■ Worst Case - 1 millisecond 13
  • 14. Data Representation Pandas - DataFrames represented on a single machine as Python data structures RDDs - Spark’s foundational structure Resilient Distributed Dataset is represented as a reference to partitioned data without types DataFrames - Spark’s strongly typed optimized distributed collection of rows All ML Algorithms need to represent data as vectors (array of Double values) in order to train machine learning algorithms 14
  • 15. ML Pipelines Train model Evaluate Load data Extract features A very simple pipeline
  • 16. ML Pipelines Train model 1 Evaluate Datasource 1 Datasource 2 Datasource 3 Extract featuresExtract features Feature transform 1 Feature transform 2 Feature transform 3 Train model 2 Ensemble A real pipeline!
  • 17. Parallelization Technique # 1 Feature engineer in Spark but train the model on a single driver machine ○ Gather source data, perform joins, do complex ETL and feature engineer ○ Works well when you have big data for your transactions but smaller aggregated or subsamples once you have features to train on ○ May need to use a larger size driver node that can fit all the training features into memory (may need to adjust spark.driver.maxResultSize) ○ Use Scikit-learn, R, TensorFlow (optionally with GPU) or other single machine learners 17
  • 18. Parallelization in ML Pipelines Train model 1 Evaluate Datasource 1 Datasource 2 Datasource 3 Extract featuresExtract features Feature transform 1 Feature transform 2 Feature transform 3 Train model 2 Ensemble You can parallelize here Single machine here
  • 19. Parallelization Technique # 2 Train the entire model with distributed machine learning ○ MLLib built for large scale model creation (millions of instances) 19 ○ You will create a single model with all this data (you use model selection techniques like Train Tune Validation splits or a cross validator) ○ Spark Deep Learning, Horvod (for TensorFlow), XGBoost (the JVM version) and H2O also support distributed ML
  • 21. Parallelization Technique # 3 Create many models one on each worker ○ Works for creating one model per customer, one model per set of features for feature selection, one model per set of hyperparameters for tuning… ○ Spark-sklearn helps facilitate creating one model per worker ○ You can as of Spark 2.3 use the cross validation parallelism parameter to run do model selection / hyperparameter tuning in parallel in MLLib as well ○ Broadcast your data but use Spark to train models on each worker with different sets of hyperparameters to find the optimal model 21
  • 22. Single Machine Learning In a single machine algorithm such as scikit-learn, the data is loaded into the memory of a single machine (usually as Pandas or NumPy vectors) The single machine algorithm iterates (loops) over the data in memory and using a single thread (sometimes multicore) takes a “guess” at the equation The error is calculated and if the error from the previous iteration is less than the tolerance or the iterations are >= max iterations then STOP 22
  • 23. Distributed Machine Learning In Distributed Machine Learning like Spark MLlib, the data is represented as an RDD or a DataFrame typically in distributed file storage such as S3, Blob Store or HDFS As a best practice, you should cache the data in memory across the nodes so that multiple iterations don’t have to query the disk each time The calculation of the gradients is distributed across all the nodes using Spark’s distributed compute engine (similar to MapReduce) After each iteration, the results return to the driver and iterations continue until either max_iter or tol is reached 23
  • 25. Logistic Regression Basics The goal of logistic regression is to fit a sigmoid curve over the dataset. Why not use a standard straight line? Probabilities have to between 0 and 1 25
  • 26. Logistic Regression Solvers There are many methods to finding an optimal equation in Logistic regression ● Stochastic Gradient Descent ● LBFGS ● Newtonian The basic premise is at each step you “guess” a new model by taking a step toward reducing the error or increasing the AUC 26
  • 27. Random Forest Random Forest uses a collection of weighted decision trees trained on random subsample of data to “vote” on the label Random Forest is a general purpose algorithm that works well for larger datasets due to its ability to find non-linearities in the data Because Random Forest uses a number of trees it “averages out” outliers and noise and therefore is resistant to overfitting Each decision tree has depth and a number of decision trees in the forest 27
  • 28. Single vs Distributed Random Forest In Spark because each tree is trained on a subsample of data (and features) and do not depend on the other trees many trees can be trained in parallel. The selection of features to split on is also distributed Splits are chosen based on information gain Tree creation is stopped when maxDepth is hit, information gain is < minInfoGain or no split candidates have minInstancesPerNode 28
  • 29. Demonstration https://community.cloud.databricks.com/?o=15269310110807 74#notebook/2040911831196221/command/174442750355907 3 Databricks Community Edition or Free Trial https://databricks.com/try-databricks Additional Questions? Contact us at http://go.databricks.com/contact-databricks 2 9