Online Random Forest in 10 Minutes
Traditional Supervised Learning Algorithms
● Regression
● Random Forest
● Support Vector Machines
● Classification and Regression Tree (CART)
● etc.
Inputs
● Data Matrix (Regression)

Predictand  Predictor 1  Predictor 2  Predictor 3  Predictor 4
.56         Red          .456         Male         .589
.78         Green        .654         Female       .6654
.987        Blue         .678         Female       .789
.123        Blue         .999         Male         .543

Inputs
● Data Matrix (Binary Classification)

Predictand  Predictor 1  Predictor 2  Predictor 3  Predictor 4
Yes         Red          .456         Male         .589
No          Green        .654         Female       .6654
Yes         Blue         .678         Female       .789
No          Blue         .999         Male         .543

Inputs To Streaming Classification
● Observations now have an explicit arrival order.

Predictand  Predictor 1  Predictor 2  Predictor 3  Predictor 4  Time
Yes         Red          .456         Male         .589         Jan 1st 2011
No          Green        .654         Female       .6654        Feb 4th 2012
Yes         Blue         .678         Female       .789         Feb 5th 2013
No          Blue         .999         Male         .543         July 4th

Inputs To Streaming Classification
● New Observations can arrive at any time

Predictand  Predictor 1  Predictor 2  Predictor 3  Predictor 4  Time
Yes         Red          .456         Male         .589         Jan 1st 2011
No          Green        .654         Female       .6654        Feb 4th 2012
Yes         Blue         .678         Female       .789         Feb 5th 2013
No          Blue         .999         Male         .543         July 4th 2013
Yes         Red          .456         Male         .456         NOW

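A minimal sketch of how such a stream might be represented, assuming predictors 1 and 3 are categorical and predictors 2 and 4 are numeric. The names `Observation`, `HISTORY` and `stream` are illustrative and do not come from the slides; the rows are copied from the table above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Iterator, Tuple


@dataclass(frozen=True)
class Observation:
    predictand: str                             # class label: "Yes" or "No"
    predictors: Tuple[str, float, str, float]   # assumed types, see lead-in
    time: date                                  # arrival time


# The four dated rows from the slide's table.
HISTORY = [
    Observation("Yes", ("Red", 0.456, "Male", 0.589), date(2011, 1, 1)),
    Observation("No", ("Green", 0.654, "Female", 0.6654), date(2012, 2, 4)),
    Observation("Yes", ("Blue", 0.678, "Female", 0.789), date(2013, 2, 5)),
    Observation("No", ("Blue", 0.999, "Male", 0.543), date(2013, 7, 4)),
]


def stream(observations: Iterable[Observation]) -> Iterator[Observation]:
    """Yield observations one at a time, in arrival order."""
    yield from sorted(observations, key=lambda o: o.time)
```

A row arriving "NOW" would simply be one more `Observation` stamped with its arrival date and handed to the learner as soon as it shows up.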
Problems
● Do the important predictors change over time, and when does this change occur?
● How far back is data relevant to today’s problem?
● What happens when our predictors change again in the future?
● What if this is all happening rapidly… will it scale?
Enter Online Random Forest
● Input is a single new observation
● Trees learn incrementally on this new data
● Trees are dropped from the forest based on performance and replaced by a new “ungrown” tree
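The three steps can be sketched as a small update loop. This is only an illustration under stated assumptions: `TinyOnlineTree` is a hypothetical stand-in that watches one randomly chosen predictor and keeps class counts per value, not a real incrementally grown tree, and the fixed `min_age` and `min_accuracy` thresholds are assumed rather than taken from the talk.

```python
import random
from collections import Counter, defaultdict


class TinyOnlineTree:
    """Hypothetical stand-in for an incrementally grown tree."""

    def __init__(self, n_features, rng):
        self.feature = rng.randrange(n_features)   # the one predictor it watches
        self.counts = defaultdict(Counter)         # predictor value -> class counts
        self.seen = 0
        self.correct = 0

    def predict(self, x):
        votes = self.counts.get(x[self.feature])
        return votes.most_common(1)[0][0] if votes else None

    def learn(self, x, y):
        # Score the prediction first, then use the label to update the counts.
        self.seen += 1
        self.correct += int(self.predict(x) == y)
        self.counts[x[self.feature]][y] += 1

    def accuracy(self):
        return self.correct / self.seen if self.seen else 0.0


class OnlineForest:
    def __init__(self, n_trees, n_features, min_age=20, min_accuracy=0.5, seed=0):
        self.rng = random.Random(seed)
        self.n_features = n_features
        self.min_age = min_age              # assumed: no drops before this many observations
        self.min_accuracy = min_accuracy    # assumed: fixed accuracy threshold
        self.trees = [self._new_tree() for _ in range(n_trees)]

    def _new_tree(self):
        return TinyOnlineTree(self.n_features, self.rng)

    def update(self, x, y):
        """Feed one new observation to every tree, then replace poor performers."""
        for i, tree in enumerate(self.trees):
            tree.learn(x, y)
            if tree.seen >= self.min_age and tree.accuracy() < self.min_accuracy:
                self.trees[i] = self._new_tree()   # fresh, "ungrown" tree

    def predict(self, x):
        votes = Counter(p for p in (t.predict(x) for t in self.trees) if p is not None)
        return votes.most_common(1)[0][0] if votes else None
```

Calling `forest.update(x, y)` once per arriving observation is the whole training interface; there is no batch fit step. If the relationship between predictors and label flips, the running accuracy of the old trees falls and they are swapped out for fresh ones that learn only from newer data.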
Visualization of a single tree
Accuracy on test cases: 75%
[Figure: a single tree with class counts at its nodes: (5, 6) and (0, 70). Annotation: pure data, stop splitting.]
Visualization of a single tree
Accuracy on test cases: 55%
[Figure: a single tree with class counts at its nodes: (0, 70), (2, 25) and (20, 3).]
50 new observations have arrived, and we create another split off the parent node’s left branch.
Tree gets pruned
Accuracy on test cases: 55%. We compare this to a random variable and incorporate the age of the tree. The accuracy is too poor, so we prune the tree.
[Figure: the tree with class counts at its nodes: (0, 70), (2, 25) and (20, 3).]
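The slides do not give the exact rule, so the following is only one plausible reading of "compare to a random variable and incorporate the age of the tree". It assumes the tree's error rate is used directly as the drop probability and that trees younger than `min_age` observations are protected; `should_prune`, `min_age` and the formula are all hypothetical.

```python
import random


def should_prune(accuracy, age, min_age=50, rng=random):
    """Decide whether to drop a tree (assumed rule, not the authors' formula).

    accuracy: the tree's accuracy on test cases, between 0 and 1
    age:      number of observations the tree has seen
    """
    if age < min_age:
        return False                  # too young to judge
    drop_probability = 1.0 - accuracy
    # Compare against a uniform random draw in [0, 1).
    return rng.random() < drop_probability
```

Under this rule a tree at 55% accuracy is dropped with probability 0.45 each time it is checked once it is old enough, so poor trees leave the forest quickly while accurate ones are rarely removed.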
New Tree
It’s a stump that hasn’t yet split any data. If asked to classify an observation, it votes the prior probability calculated from the last 100 observations that the old pruned tree saw.
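A sketch of how that prior could be kept, assuming the old tree remembers the labels of its most recent observations in a fixed-size window. `RecentLabels` and `Stump` are illustrative names, not the authors' classes.

```python
from collections import Counter, deque


class RecentLabels:
    """Remembers the labels of the most recent observations a tree saw."""

    def __init__(self, window=100):
        self.labels = deque(maxlen=window)   # older labels fall off automatically

    def record(self, label):
        self.labels.append(label)

    def prior(self):
        total = len(self.labels)
        if total == 0:
            return {}
        return {label: n / total for label, n in Counter(self.labels).items()}


class Stump:
    """A new tree with no splits: it votes the prior handed over by the pruned tree."""

    def __init__(self, prior):
        self.prior = dict(prior)

    def predict_proba(self, x):
        return self.prior   # the same vote for every observation until it splits
```

When a tree is pruned, its replacement would be built as `Stump(old_labels.prior())`, so the forest's vote stays sensible while the new tree waits for enough data to make its first split.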
Online Random Forest
● By dropping trees that predict poorly, we can adapt to changes in the important predictors
● If previous data is relevant to today’s problem, the trees learned from it in the past. If it is no longer relevant, this will be reflected in the accuracy and the tree will get pruned
Online Random Forest
● This process of incremental learning and dropping occurs constantly, so we can continually adapt to a changing signal
● We built our Online Random Forest with Scala’s actor framework
● We distribute our trees’ computations (and physical location), so we can handle high-rate input data streams
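The talk's implementation uses Scala actors; the sketch here only imitates the message-passing structure in Python, with one thread and one mailbox per tree. `TreeActor` is a hypothetical name, and its class-count state is a stand-in for a real tree. Python threads do not provide the parallelism or the distribution across machines that an actor framework can, so read this as an outline of the design rather than a scalable implementation.

```python
import queue
import threading
from collections import Counter


class TreeActor(threading.Thread):
    """Actor-style worker: owns one tree's state and handles its mailbox in order."""

    def __init__(self):
        super().__init__(daemon=True)
        self.mailbox = queue.Queue()
        self.class_counts = Counter()    # stand-in for real tree state
        self.start()

    def run(self):
        while True:
            message = self.mailbox.get()
            if message is None:          # shutdown signal
                break
            kind, payload, reply = message
            if kind == "learn":
                self.class_counts[payload] += 1
            elif kind == "predict":
                top = self.class_counts.most_common(1)
                reply.put(top[0][0] if top else None)

    def learn(self, label):
        # Fire-and-forget: the caller does not wait for the tree to update.
        self.mailbox.put(("learn", label, None))

    def predict(self):
        reply = queue.Queue(maxsize=1)
        self.mailbox.put(("predict", None, reply))
        return reply.get()               # blocks until the actor answers

    def stop(self):
        self.mailbox.put(None)
        self.join()
```

Because each tree's state is touched only by its own worker, no locks are needed around the tree itself, and the sender of a new observation never waits for any tree to finish learning.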
Example Stream
Changing Feature Importance
