SlideShare uma empresa Scribd logo
1 de 42
Baixar para ler offline
BrightTalk
5/20/2014
Building Random
Forest at Scale
Michal Malohlava!
@mmalohlava!
@hexadata
Who am I?
Background
•PhD in CS from Charles University in Prague, 2012
•1 year PostDoc at Purdue University experimenting with algos
for large computation
•1 year at 0xdata helping to develop H2O engine for big data
computation
!
Experience with domain-specific languages, distributed system,
software engineering, and big data.
Overview
1. A little bit of theory
!
2.Random Forest observations
!
3.Scaling & distribution of Random Forest

4.Q&A

Tree Planting
What is a model for
this data?
•Training sample of points covering area [0,3] x [0,3]
•Two possible colors of points
X
0 1 2 3
0
1
2
3
Y
What is a model for
this data?
The model should be able to predict a color of a new point
X
0 1 2 3
0
1
2
3
Y
What is a color !
of this point?
Decision tree
x<0.8
y<0.8
x< 2 x<1.7
y < 2 y<2.3
y < 2
y<0.8
X
0 1 2 3
0
1
2
3
Y
A. Possible impurity measures
• Gini, entropy, RSS
B. Respect type of feature -
nominal, ordinal, continuous
How to grow a
decision tree?!
Split rows in a given node
into two sets with respect to
impurity measure
•The smaller, the more
skewed is distribution
•Compare impurity of
parent with impurity of
children
x<0.8
???y < 2
y<0.8
When to stop growing
the tree?
1. Build full tree
or
2.Apply stopping criterion - limit on:
•Tree depth, or
•Minimum number of points in a leaf
x<0.8
y<0.8
x< 2 x<1.7
y < 2
y<0.8
How to assign leaf
value?
The leaf value is
•If leaf contains only one point
then its color represents leaf
value
!
•Else majority color is picked, or
color distribution is stored
x<0.8
y < 2
y<0.8 ???
Decision tree
Tree covered whole area by rectangles predicting a
point color
x<0.8
y<0.8
x< 2 x<1.7
y < 2 y<2.3
y < 2
y<0.8
X
0 1 2 3
0
1
2
3
Y
Decision tree scoring
The model can predict a point color based on its
coordinates.
x<0.8
y<0.8
x< 2 x<1.7
y < 2 y<2.3
y < 2
y<0.8
X
0 1 2 3
0
1
2
3
Y
X
0 1 2 3
0
1
2
3
Y
Overfitting
Tree perfectly represents training data (0% training error),
but also learned about noise!
Noise
X
0 1 2 3
0
1
2
3
Y
Overfitting
And hence poorly predicts a new point!
Expected color !
was RED!
since points follow !
“chess board”!
pattern
Handle overfitting
Pre-pruning via stopping criterion
!
Post-pruning: decreases complexity of model but helps with
model generalization
!
Randomize tree building and combine trees together
“Model should have low training error but also
generalization error!”
Random Forest idea
Breiman, L. (2001). Random forests. Machine Learning, 5–32.
http://link.springer.com/article/10.1023/A:1010933404324
Randomize #1
Bagging
X
0 1 2 3
0
1
2
3
Y
X
0 1 2 3
0
1
2
3
Y
Prepare bootstrap
sample for
each tree
by sampling
with replacement
X
0 1 2 3
0
1
2
3
Y
x<0.8
y<0.8
x< 2 x< 1.7
y < 2 y < 2.3
y < 2
y<0.8
Randomize #1
Bagging
x<0.8
y<0.8
x< 2 x< 1.7
y < 2 y < 2.3
y < 2
y<0.8
Randomize #1!
Bagging
Each tree sees only sample of training data and
captures only a part of the information.
!
Build multiple weak trees which vote together to
give resulting prediction
•voting is based on majority vote, or weighted
average
Randomize #2!
Feature selection
Randomized split selection
x<0.8
???y < 2
y<0.8
•Select randomly subset
of features of size sqrt(#features)

•Select the best split only using the
subset
Out of bag points
and validation
Each tree is built over a
sample of training points.
!
Remaining points are called
“out-of-bag” (OOB).
X
0 1 2 3
0
1
2
3
Y
These points are used for validation
as a good approximation for
generalization error. Almost
identical as N-fold cross validation.
Advantages of
Random Forest
Independent trees which can be built in parallel
!
The model does not overfit easily
!
Produces reasonable accuracy
!
Brings more features to analyze data - variable importance,
proximities, missing values imputation
!
Breiman, L., Cutler, A. Random Forest. http://www.stat.berkeley.edu/~breiman/
RandomForests/cc_home.htm
A Few
Observations
Covtype dataset
Sampling rate impact
Number of split features
Variable importance
Building Forests
with H2O
H2O platform
Challenges
Parallelize and distribute Random Forest algorithm
•Preserve computation with data
•Minimize data transfers
Preserve Random Forest properties
•Split nodes in an efficient way
•Sample and preserve track of OOB samples
Handle large trees
Implementation #1
Build independent trees per machine local data
•RVotes approach
•Each node builds a subset of forest
Chawla, N., & Hall, L. (2004). Learning ensembles from bites: A scalable and
accurate approach. The Journal of Machine Learning Research, 5, p421–451.
Implementation #1
Fast - trees are independent and can be built in parallel
Data have to fit into memory
!
Possible accuracy decrease if each node can see only
subset of data
Implementation #2
Build a distributed tree over all data
Dataset!
points
Implementation #2
Each data point has assigned a tree node
• stored in a temporary

vector
Dataset!
points
Pass over data points
means visiting tree nodes
Implementation #2
Each data point has in/out of bag flag
• stored in a temporary

vector
Dataset!
points
I O I I I O I II I I I I O
In/Out of!
bag !
flags
Trick for on-the-fly scoring:
position out-of-bag rows
inside a tree is tracked as
well
Implementation #2
Tree is built per layer
• Each node prepares

histogram for

splitting Active !
tree layer
Dataset!
points
I O I I I O I II I I I I O
In/Out of!
bag !
flags
Implementation #2
Tree is built per layer
• Histograms are reduced

and a new layer

is prepared Active !
tree layer
Dataset!
points
I O I I I O I II I I I I O
In/Out of!
bag !
flags
Implementation #2
Exact solution - no decrease of accuracy

Elegant solution merging tree building and OOB scoring
!
More data transfers to exchange histograms
Can produce huge trees (since tree size depends on
data)
Tree representation
Internally stored in compressed format as a byte array
•But it can be pretty huge (>10MB)
!
Externally can be stored as code
class Tree_1 {	
static final float predict(double[] data) {	
float pred = ((float) data[3 /* petal_wid */] < 0.8200002f ? 0.0f	
: ((float) data[2 /* petal_len */] < 4.835f	
? ((float) data[3 /* petal_wid */] < 1.6600002f ? 1.0f : 0.0f)	
: ((float) data[2 /* petal_len */] < 5.14475f	
? ((float) data[3 /* petal_wid */] < 1.7440002f	
? ((float) data[1 /* sepal_wid */] < 2.3600001f ? 0.0f : 1.0f)	
: 0.0f)	
: 0.0f)));	
return pred;	
}	
}
Lesson learned
Preserving deterministic computation is crucial!
!
Trees need to be sent around the cloud for validation which
can be expensive!
!
Tracking out-of-bag points can be tricky!
!
Clever data binning is a key trick to decrease memory
consumption
Time for questions
Thank you!
Learn more about H2O at
0xdata.com
or
git clone https://github.com/0xdata/h2o
Thank you!
Follow us at @hexadata
References
•0xdata, H2O: https://github.com/0xdata/h2o/
•Breiman, L. (1999). Pasting small votes for classification in large
databases and on-line. Machine Learning, Vol 36. Kluwer, p85–103.
•Breiman, L. (2001). Random forests. Machine Learning, 5–32. Retrieved
from http://link.springer.com/article/10.1023/A:1010933404324
•Breiman, L., Cutler, A. Random Forest. http://www.stat.berkeley.edu/
~breiman/RandomForests/cc_home.htm
•Chawla, N., & Hall, L. (2004). Learning ensembles from bites: A scalable
and accurate approach. The Journal of Machine Learning Research, 5,
p421–451.
•Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical
learning. cs.yale.edu. Retrieved from http://link.springer.com/content/
pdf/10.1007/978-0-387-84858-7.pdf

Mais conteúdo relacionado

Mais procurados

Scikit Learn Tutorial | Machine Learning with Python | Python for Data Scienc...
Scikit Learn Tutorial | Machine Learning with Python | Python for Data Scienc...Scikit Learn Tutorial | Machine Learning with Python | Python for Data Scienc...
Scikit Learn Tutorial | Machine Learning with Python | Python for Data Scienc...Edureka!
 
Training Neural Networks
Training Neural NetworksTraining Neural Networks
Training Neural NetworksDatabricks
 
Decision Tree Algorithm | Decision Tree in Python | Machine Learning Algorith...
Decision Tree Algorithm | Decision Tree in Python | Machine Learning Algorith...Decision Tree Algorithm | Decision Tree in Python | Machine Learning Algorith...
Decision Tree Algorithm | Decision Tree in Python | Machine Learning Algorith...Edureka!
 
Feature Engineering in Machine Learning
Feature Engineering in Machine LearningFeature Engineering in Machine Learning
Feature Engineering in Machine LearningKnoldus Inc.
 
Introduction to Linear Discriminant Analysis
Introduction to Linear Discriminant AnalysisIntroduction to Linear Discriminant Analysis
Introduction to Linear Discriminant AnalysisJaclyn Kokx
 
Decision tree induction \ Decision Tree Algorithm with Example| Data science
Decision tree induction \ Decision Tree Algorithm with Example| Data scienceDecision tree induction \ Decision Tree Algorithm with Example| Data science
Decision tree induction \ Decision Tree Algorithm with Example| Data scienceMaryamRehman6
 
Artificial Neural Networks Lect5: Multi-Layer Perceptron & Backpropagation
Artificial Neural Networks Lect5: Multi-Layer Perceptron & BackpropagationArtificial Neural Networks Lect5: Multi-Layer Perceptron & Backpropagation
Artificial Neural Networks Lect5: Multi-Layer Perceptron & BackpropagationMohammed Bennamoun
 
Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...
Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...
Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...Edureka!
 
The 7 steps of Machine Learning
The 7 steps of Machine LearningThe 7 steps of Machine Learning
The 7 steps of Machine LearningWaziri Shebogholo
 
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kambererror007
 
Support Vector Machines
Support Vector MachinesSupport Vector Machines
Support Vector Machinesnextlib
 
Classification
ClassificationClassification
ClassificationCloudxLab
 
Scikit Learn intro
Scikit Learn introScikit Learn intro
Scikit Learn intro9xdot
 
Neural Networks
Neural NetworksNeural Networks
Neural NetworksAdri Jovin
 
Deep learning: Overfitting , underfitting, and regularization
Deep learning: Overfitting , underfitting, and regularizationDeep learning: Overfitting , underfitting, and regularization
Deep learning: Overfitting , underfitting, and regularizationAly Abdelkareem
 
Decision Tree - C4.5&CART
Decision Tree - C4.5&CARTDecision Tree - C4.5&CART
Decision Tree - C4.5&CARTXueping Peng
 
Machine learning overview
Machine learning overviewMachine learning overview
Machine learning overviewprih_yah
 
Gradient Boosted trees
Gradient Boosted treesGradient Boosted trees
Gradient Boosted treesNihar Ranjan
 

Mais procurados (20)

Scikit Learn Tutorial | Machine Learning with Python | Python for Data Scienc...
Scikit Learn Tutorial | Machine Learning with Python | Python for Data Scienc...Scikit Learn Tutorial | Machine Learning with Python | Python for Data Scienc...
Scikit Learn Tutorial | Machine Learning with Python | Python for Data Scienc...
 
Training Neural Networks
Training Neural NetworksTraining Neural Networks
Training Neural Networks
 
Decision Tree Algorithm | Decision Tree in Python | Machine Learning Algorith...
Decision Tree Algorithm | Decision Tree in Python | Machine Learning Algorith...Decision Tree Algorithm | Decision Tree in Python | Machine Learning Algorith...
Decision Tree Algorithm | Decision Tree in Python | Machine Learning Algorith...
 
Anomaly detection
Anomaly detectionAnomaly detection
Anomaly detection
 
Feature Engineering in Machine Learning
Feature Engineering in Machine LearningFeature Engineering in Machine Learning
Feature Engineering in Machine Learning
 
Introduction to Linear Discriminant Analysis
Introduction to Linear Discriminant AnalysisIntroduction to Linear Discriminant Analysis
Introduction to Linear Discriminant Analysis
 
Decision tree induction \ Decision Tree Algorithm with Example| Data science
Decision tree induction \ Decision Tree Algorithm with Example| Data scienceDecision tree induction \ Decision Tree Algorithm with Example| Data science
Decision tree induction \ Decision Tree Algorithm with Example| Data science
 
Artificial Neural Networks Lect5: Multi-Layer Perceptron & Backpropagation
Artificial Neural Networks Lect5: Multi-Layer Perceptron & BackpropagationArtificial Neural Networks Lect5: Multi-Layer Perceptron & Backpropagation
Artificial Neural Networks Lect5: Multi-Layer Perceptron & Backpropagation
 
Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...
Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...
Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...
 
The 7 steps of Machine Learning
The 7 steps of Machine LearningThe 7 steps of Machine Learning
The 7 steps of Machine Learning
 
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
 
Support Vector Machines
Support Vector MachinesSupport Vector Machines
Support Vector Machines
 
Classification
ClassificationClassification
Classification
 
Learning from imbalanced data
Learning from imbalanced data Learning from imbalanced data
Learning from imbalanced data
 
Scikit Learn intro
Scikit Learn introScikit Learn intro
Scikit Learn intro
 
Neural Networks
Neural NetworksNeural Networks
Neural Networks
 
Deep learning: Overfitting , underfitting, and regularization
Deep learning: Overfitting , underfitting, and regularizationDeep learning: Overfitting , underfitting, and regularization
Deep learning: Overfitting , underfitting, and regularization
 
Decision Tree - C4.5&CART
Decision Tree - C4.5&CARTDecision Tree - C4.5&CART
Decision Tree - C4.5&CART
 
Machine learning overview
Machine learning overviewMachine learning overview
Machine learning overview
 
Gradient Boosted trees
Gradient Boosted treesGradient Boosted trees
Gradient Boosted trees
 

Semelhante a Building Random Forest at Scale

Jan vitek distributedrandomforest_5-2-2013
Jan vitek distributedrandomforest_5-2-2013Jan vitek distributedrandomforest_5-2-2013
Jan vitek distributedrandomforest_5-2-2013Sri Ambati
 
Memory efficient java tutorial practices and challenges
Memory efficient java tutorial practices and challengesMemory efficient java tutorial practices and challenges
Memory efficient java tutorial practices and challengesmustafa sarac
 
Interactive Latency in Big Data Visualization
Interactive Latency in Big Data VisualizationInteractive Latency in Big Data Visualization
Interactive Latency in Big Data Visualizationbigdataviz_bay
 
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)University of Washington
 
Distributed Decision Tree Learning for Mining Big Data Streams
Distributed Decision Tree Learning for Mining Big Data StreamsDistributed Decision Tree Learning for Mining Big Data Streams
Distributed Decision Tree Learning for Mining Big Data StreamsArinto Murdopo
 
High Dimensional Indexing using MongoDB (MongoSV 2012)
High Dimensional Indexing using MongoDB (MongoSV 2012)High Dimensional Indexing using MongoDB (MongoSV 2012)
High Dimensional Indexing using MongoDB (MongoSV 2012)Nicholas Knize, Ph.D., GISP
 
Performance and predictability
Performance and predictabilityPerformance and predictability
Performance and predictabilityRichardWarburton
 
Scalable Machine Learning: The Role of Stratified Data Sharding
Scalable Machine Learning: The Role of Stratified Data ShardingScalable Machine Learning: The Role of Stratified Data Sharding
Scalable Machine Learning: The Role of Stratified Data Shardinginside-BigData.com
 
Making Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and DistributedMaking Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and DistributedTuri, Inc.
 
How to win data science competitions with Deep Learning
How to win data science competitions with Deep LearningHow to win data science competitions with Deep Learning
How to win data science competitions with Deep LearningSri Ambati
 
Performance and predictability
Performance and predictabilityPerformance and predictability
Performance and predictabilityRichardWarburton
 
Machine Learning Summary for Caltech2
Machine Learning Summary for Caltech2Machine Learning Summary for Caltech2
Machine Learning Summary for Caltech2Lukas Mandrake
 
A Modern Introduction to Decision Tree Ensembles
A Modern Introduction to Decision Tree EnsemblesA Modern Introduction to Decision Tree Ensembles
A Modern Introduction to Decision Tree EnsemblesIchigaku Takigawa
 
Building Data Products
Building Data ProductsBuilding Data Products
Building Data ProductsCloudera, Inc.
 
Natural Language Processing in R (rNLP)
Natural Language Processing in R (rNLP)Natural Language Processing in R (rNLP)
Natural Language Processing in R (rNLP)fridolin.wild
 
Topological Data Analysis
Topological Data AnalysisTopological Data Analysis
Topological Data AnalysisDeviousQuant
 
Mining data streams using option trees
Mining data streams using option treesMining data streams using option trees
Mining data streams using option treesAlexander Decker
 
Functional Programming with Immutable Data Structures
Functional Programming with Immutable Data StructuresFunctional Programming with Immutable Data Structures
Functional Programming with Immutable Data Structureselliando dias
 

Semelhante a Building Random Forest at Scale (20)

Jan vitek distributedrandomforest_5-2-2013
Jan vitek distributedrandomforest_5-2-2013Jan vitek distributedrandomforest_5-2-2013
Jan vitek distributedrandomforest_5-2-2013
 
Adam Ashenfelter - Finding the Oddballs
Adam Ashenfelter - Finding the OddballsAdam Ashenfelter - Finding the Oddballs
Adam Ashenfelter - Finding the Oddballs
 
Memory efficient java tutorial practices and challenges
Memory efficient java tutorial practices and challengesMemory efficient java tutorial practices and challenges
Memory efficient java tutorial practices and challenges
 
Interactive Latency in Big Data Visualization
Interactive Latency in Big Data VisualizationInteractive Latency in Big Data Visualization
Interactive Latency in Big Data Visualization
 
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
 
Distributed Decision Tree Learning for Mining Big Data Streams
Distributed Decision Tree Learning for Mining Big Data StreamsDistributed Decision Tree Learning for Mining Big Data Streams
Distributed Decision Tree Learning for Mining Big Data Streams
 
High Dimensional Indexing using MongoDB (MongoSV 2012)
High Dimensional Indexing using MongoDB (MongoSV 2012)High Dimensional Indexing using MongoDB (MongoSV 2012)
High Dimensional Indexing using MongoDB (MongoSV 2012)
 
Performance and predictability
Performance and predictabilityPerformance and predictability
Performance and predictability
 
Scalable Machine Learning: The Role of Stratified Data Sharding
Scalable Machine Learning: The Role of Stratified Data ShardingScalable Machine Learning: The Role of Stratified Data Sharding
Scalable Machine Learning: The Role of Stratified Data Sharding
 
Making Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and DistributedMaking Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and Distributed
 
How to win data science competitions with Deep Learning
How to win data science competitions with Deep LearningHow to win data science competitions with Deep Learning
How to win data science competitions with Deep Learning
 
Performance and predictability
Performance and predictabilityPerformance and predictability
Performance and predictability
 
Machine Learning Summary for Caltech2
Machine Learning Summary for Caltech2Machine Learning Summary for Caltech2
Machine Learning Summary for Caltech2
 
A Modern Introduction to Decision Tree Ensembles
A Modern Introduction to Decision Tree EnsemblesA Modern Introduction to Decision Tree Ensembles
A Modern Introduction to Decision Tree Ensembles
 
Building Data Products
Building Data ProductsBuilding Data Products
Building Data Products
 
Ensembles.pdf
Ensembles.pdfEnsembles.pdf
Ensembles.pdf
 
Natural Language Processing in R (rNLP)
Natural Language Processing in R (rNLP)Natural Language Processing in R (rNLP)
Natural Language Processing in R (rNLP)
 
Topological Data Analysis
Topological Data AnalysisTopological Data Analysis
Topological Data Analysis
 
Mining data streams using option trees
Mining data streams using option treesMining data streams using option trees
Mining data streams using option trees
 
Functional Programming with Immutable Data Structures
Functional Programming with Immutable Data StructuresFunctional Programming with Immutable Data Structures
Functional Programming with Immutable Data Structures
 

Mais de Sri Ambati

H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Generative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptxGenerative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptxSri Ambati
 
AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek Sri Ambati
 
LLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5thLLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5thSri Ambati
 
Building, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for ProductionBuilding, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for ProductionSri Ambati
 
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...Sri Ambati
 
Risk Management for LLMs
Risk Management for LLMsRisk Management for LLMs
Risk Management for LLMsSri Ambati
 
Open-Source AI: Community is the Way
Open-Source AI: Community is the WayOpen-Source AI: Community is the Way
Open-Source AI: Community is the WaySri Ambati
 
Building Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2OBuilding Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2OSri Ambati
 
Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical Sri Ambati
 
Cutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM PapersCutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM PapersSri Ambati
 
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...Sri Ambati
 
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...Sri Ambati
 
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...Sri Ambati
 
LLM Interpretability
LLM Interpretability LLM Interpretability
LLM Interpretability Sri Ambati
 
Never Reply to an Email Again
Never Reply to an Email AgainNever Reply to an Email Again
Never Reply to an Email AgainSri Ambati
 
Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)Sri Ambati
 
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...Sri Ambati
 
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...Sri Ambati
 
AI Foundations Course Module 1 - An AI Transformation Journey
AI Foundations Course Module 1 - An AI Transformation JourneyAI Foundations Course Module 1 - An AI Transformation Journey
AI Foundations Course Module 1 - An AI Transformation JourneySri Ambati
 

Mais de Sri Ambati (20)

H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Generative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptxGenerative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptx
 
AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek
 
LLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5thLLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5th
 
Building, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for ProductionBuilding, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for Production
 
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
 
Risk Management for LLMs
Risk Management for LLMsRisk Management for LLMs
Risk Management for LLMs
 
Open-Source AI: Community is the Way
Open-Source AI: Community is the WayOpen-Source AI: Community is the Way
Open-Source AI: Community is the Way
 
Building Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2OBuilding Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2O
 
Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical
 
Cutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM PapersCutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM Papers
 
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
 
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
 
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
 
LLM Interpretability
LLM Interpretability LLM Interpretability
LLM Interpretability
 
Never Reply to an Email Again
Never Reply to an Email AgainNever Reply to an Email Again
Never Reply to an Email Again
 
Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)
 
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
 
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
 
AI Foundations Course Module 1 - An AI Transformation Journey
AI Foundations Course Module 1 - An AI Transformation JourneyAI Foundations Course Module 1 - An AI Transformation Journey
AI Foundations Course Module 1 - An AI Transformation Journey
 

Último

MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Bhuvaneswari Subramani
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 

Último (20)

MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 

Building Random Forest at Scale

  • 1. BrightTalk 5/20/2014 Building Random Forest at Scale Michal Malohlava! @mmalohlava! @hexadata
  • 2. Who am I? Background •PhD in CS from Charles University in Prague, 2012 •1 year PostDoc at Purdue University experimenting with algos for large computation •1 year at 0xdata helping to develop H2O engine for big data computation ! Experience with domain-specific languages, distributed system, software engineering, and big data.
  • 3. Overview 1. A little bit of theory ! 2.Random Forest observations ! 3.Scaling & distribution of Random Forest
 4.Q&A

  • 5. What is a model for this data? •Training sample of points covering area [0,3] x [0,3] •Two possible colors of points X 0 1 2 3 0 1 2 3 Y
  • 6. What is a model for this data? The model should be able to predict a color of a new point X 0 1 2 3 0 1 2 3 Y What is a color ! of this point?
  • 7. Decision tree x<0.8 y<0.8 x< 2 x<1.7 y < 2 y<2.3 y < 2 y<0.8 X 0 1 2 3 0 1 2 3 Y
  • 8. A. Possible impurity measures • Gini, entropy, RSS B. Respect type of feature - nominal, ordinal, continuous How to grow a decision tree?! Split rows in a given node into two sets with respect to impurity measure •The smaller, the more skewed is distribution •Compare impurity of parent with impurity of children x<0.8 ???y < 2 y<0.8
  • 9. When to stop growing the tree? 1. Build full tree or 2.Apply stopping criterion - limit on: •Tree depth, or •Minimum number of points in a leaf x<0.8 y<0.8 x< 2 x<1.7 y < 2 y<0.8
  • 10. How to assign leaf value? The leaf value is •If leaf contains only one point then its color represents leaf value ! •Else majority color is picked, or color distribution is stored x<0.8 y < 2 y<0.8 ???
  • 11. Decision tree Tree covered whole area by rectangles predicting a point color x<0.8 y<0.8 x< 2 x<1.7 y < 2 y<2.3 y < 2 y<0.8 X 0 1 2 3 0 1 2 3 Y
  • 12. Decision tree scoring The model can predict a point color based on its coordinates. x<0.8 y<0.8 x< 2 x<1.7 y < 2 y<2.3 y < 2 y<0.8 X 0 1 2 3 0 1 2 3 Y
  • 13. X 0 1 2 3 0 1 2 3 Y Overfitting Tree perfectly represents training data (0% training error), but also learned about noise! Noise
  • 14. X 0 1 2 3 0 1 2 3 Y Overfitting And hence poorly predicts a new point! Expected color ! was RED! since points follow ! “chess board”! pattern
  • 15. Handle overfitting Pre-pruning via stopping criterion ! Post-pruning: decreases complexity of model but helps with model generalization ! Randomize tree building and combine trees together “Model should have low training error but also generalization error!” Random Forest idea Breiman, L. (2001). Random forests. Machine Learning, 5–32. http://link.springer.com/article/10.1023/A:1010933404324
  • 16. Randomize #1 Bagging X 0 1 2 3 0 1 2 3 Y X 0 1 2 3 0 1 2 3 Y Prepare bootstrap sample for each tree by sampling with replacement X 0 1 2 3 0 1 2 3 Y
  • 17. x<0.8 y<0.8 x< 2 x< 1.7 y < 2 y < 2.3 y < 2 y<0.8 Randomize #1 Bagging x<0.8 y<0.8 x< 2 x< 1.7 y < 2 y < 2.3 y < 2 y<0.8
  • 18. Randomize #1! Bagging Each tree sees only sample of training data and captures only a part of the information. ! Build multiple weak trees which vote together to give resulting prediction •voting is based on majority vote, or weighted average
  • 19. Randomize #2! Feature selection Randomized split selection x<0.8 ???y < 2 y<0.8 •Select randomly subset of features of size sqrt(#features)
 •Select the best split only using the subset
  • 20. Out of bag points and validation Each tree is built over a sample of training points. ! Remaining points are called “out-of-bag” (OOB). X 0 1 2 3 0 1 2 3 Y These points are used for validation as a good approximation for generalization error. Almost identical as N-fold cross validation.
  • 21. Advantages of Random Forest Independent trees which can be built in parallel ! The model does not overfit easily ! Produces reasonable accuracy ! Brings more features to analyze data - variable importance, proximities, missing values imputation ! Breiman, L., Cutler, A. Random Forest. http://www.stat.berkeley.edu/~breiman/ RandomForests/cc_home.htm
  • 25. Number of split features
  • 29. Challenges Parallelize and distribute Random Forest algorithm •Preserve computation with data •Minimize data transfers Preserve Random Forest properties •Split nodes in an efficient way •Sample and preserve track of OOB samples Handle large trees
  • 30. Implementation #1 Build independent trees per machine local data •RVotes approach •Each node builds a subset of forest Chawla, N., & Hall, L. (2004). Learning ensembles from bites: A scalable and accurate approach. The Journal of Machine Learning Research, 5, p421–451.
  • 31. Implementation #1 Fast - trees are independent and can be built in parallel Data have to fit into memory ! Possible accuracy decrease if each node can see only subset of data
  • 32. Implementation #2 Build a distributed tree over all data Dataset! points
  • 33. Implementation #2 Each data point has assigned a tree node • stored in a temporary
 vector Dataset! points Pass over data points means visiting tree nodes
  • 34. Implementation #2 Each data point has in/out of bag flag • stored in a temporary
 vector Dataset! points I O I I I O I II I I I I O In/Out of! bag ! flags Trick for on-the-fly scoring: position out-of-bag rows inside a tree is tracked as well
  • 35. Implementation #2 Tree is built per layer • Each node prepares
 histogram for
 splitting Active ! tree layer Dataset! points I O I I I O I II I I I I O In/Out of! bag ! flags
  • 36. Implementation #2 Tree is built per layer • Histograms are reduced
 and a new layer
 is prepared Active ! tree layer Dataset! points I O I I I O I II I I I I O In/Out of! bag ! flags
  • 37. Implementation #2 Exact solution - no decrease of accuracy
 Elegant solution merging tree building and OOB scoring ! More data transfers to exchange histograms Can produce huge trees (since tree size depends on data)
  • 38. Tree representation Internally stored in compressed format as a byte array •But it can be pretty huge (>10MB) ! Externally can be stored as code class Tree_1 { static final float predict(double[] data) { float pred = ((float) data[3 /* petal_wid */] < 0.8200002f ? 0.0f : ((float) data[2 /* petal_len */] < 4.835f ? ((float) data[3 /* petal_wid */] < 1.6600002f ? 1.0f : 0.0f) : ((float) data[2 /* petal_len */] < 5.14475f ? ((float) data[3 /* petal_wid */] < 1.7440002f ? ((float) data[1 /* sepal_wid */] < 2.3600001f ? 0.0f : 1.0f) : 0.0f) : 0.0f))); return pred; } }
  • 39. Lesson learned Preserving deterministic computation is crucial! ! Trees need to be sent around the cloud for validation which can be expensive! ! Tracking out-of-bag points can be tricky! ! Clever data binning is a key trick to decrease memory consumption
  • 41. Learn more about H2O at 0xdata.com or git clone https://github.com/0xdata/h2o Thank you! Follow us at @hexadata
  • 42. References •0xdata, H2O: https://github.com/0xdata/h2o/ •Breiman, L. (1999). Pasting small votes for classification in large databases and on-line. Machine Learning, Vol 36. Kluwer, p85–103. •Breiman, L. (2001). Random forests. Machine Learning, 5–32. Retrieved from http://link.springer.com/article/10.1023/A:1010933404324 •Breiman, L., Cutler, A. Random Forest. http://www.stat.berkeley.edu/ ~breiman/RandomForests/cc_home.htm •Chawla, N., & Hall, L. (2004). Learning ensembles from bites: A scalable and accurate approach. The Journal of Machine Learning Research, 5, p421–451. •Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning. cs.yale.edu. Retrieved from http://link.springer.com/content/ pdf/10.1007/978-0-387-84858-7.pdf