I Don’t Want to Be a Dummy!
Encoding Predictors for Trees
Max Kuhn
NYRC
Trees
Tree–based models are nested sets of if/else statements that make predictions in the
terminal nodes:
> library(rpart)
> library(AppliedPredictiveModeling)
> data(schedulingData)
> rpart(Class ~ ., data = schedulingData, control = rpart.control(maxdepth = 2))
n= 4331
node), split, n, loss, yval, (yprob)
      * denotes terminal node
1) root 4331 2100 VF (0.511 0.311 0.119 0.060)
  2) Protocol=C,D,E,F,G,I,J,K,L,N 2884 860 VF (0.703 0.206 0.068 0.023) *
  3) Protocol=A,H,M,O 1447 690 F (0.126 0.521 0.219 0.133)
    6) Iterations< 1.5e+02 1363 610 F (0.134 0.553 0.232 0.081) *
    7) Iterations>=1.5e+02 84 1 L (0.000 0.000 0.012 0.988) *
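Read as code, the printed tree is a short chain of nested if/else statements. The sketch below hand-translates it into base R. The function name predict_by_hand is hypothetical, and the cutoff 150 is the rounded value 1.5e+02 shown in the printout, not necessarily the exact split point stored by rpart.

```r
# Hand-written translation of the depth-2 tree printed above.
# predict_by_hand is a hypothetical helper, not part of rpart.
predict_by_hand <- function(protocol, iterations) {
  if (protocol %in% c("C", "D", "E", "F", "G", "I", "J", "K", "L", "N")) {
    "VF"                       # node 2 (terminal)
  } else {
    # node 3: Protocol is one of A, H, M, O
    if (iterations < 150) {    # rounded cutoff taken from the printout
      "F"                      # node 6 (terminal)
    } else {
      "L"                      # node 7 (terminal)
    }
  }
}

predict_by_hand("C", 10)
predict_by_hand("A", 10)
predict_by_hand("A", 200)
```

Every prediction is made at a terminal node, and the second test is only reached by samples that failed the first one. That nesting is what distinguishes a tree from a ruleset.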
Max Kuhn (NYRC) I Don’t Want to Be a Dummy! 2 / 16
Rules
Similarly, rule–based models are non–nested sets of if statements:
> library(C50)
> summary(C5.0(Class ~ ., data = schedulingData, rules = TRUE))
<snip>
Rule 109: (17/7, lift 9.7)
    Protocol in {F, J, N}
    Compounds > 818
    InputFields > 152
    NumPending <= 0
    Hour > 0.6333333
    Day = Tue
    -> class L [0.579]
Default class: VF
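Because rules are not nested, each one can be checked on its own as a single conjunction of conditions. The sketch below writes Rule 109 as a base R predicate. The helper rule_109 and the two example jobs are hypothetical; the column names are taken from the rule text above.

```r
# Rule 109 as a stand-alone predicate (hypothetical helper, not C50 output).
rule_109 <- function(d) {
  d$Protocol %in% c("F", "J", "N") &
    d$Compounds > 818 &
    d$InputFields > 152 &
    d$NumPending <= 0 &
    d$Hour > 0.6333333 &
    d$Day == "Tue"
}

# Two made-up jobs: the first meets every condition, the second fails on Day.
jobs <- data.frame(
  Protocol    = c("F", "F"),
  Compounds   = c(900, 900),
  InputFields = c(200, 200),
  NumPending  = c(0, 0),
  Hour        = c(12, 12),
  Day         = c("Tue", "Wed")
)

# Simplification: with this single rule, rows that fire it get class L and
# the rest fall through to the default class VF. A full ruleset would
# consult its other rules before using the default.
ifelse(rule_109(jobs), "L", "VF")
```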
Bayes!
Bayesian regression and classification models don’t really specify anything about the predictors
beyond Pr[X] and Pr[X|Y].
If there were only one categorical predictor, we could have Pr[X|Y] be a table of raw
probabilities:
> xtab <- table(schedulingData$Day, schedulingData$Class)
> apply(xtab, 2, function(x) x/sum(x))
        VF      F    M     L
Mon 0.1678 0.1492 0.15 0.162
Tue 0.1913 0.2019 0.27 0.255
Wed 0.2090 0.2101 0.19 0.228
Thu 0.1678 0.1589 0.18 0.154
Fri 0.2171 0.2183 0.20 0.178
Sat 0.0068 0.0082 0.00 0.023
Sun 0.0403 0.0535 0.00 0.000
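Each column of the table above is one class and sums to one, so the entries are Pr[Day | Class]. A minimal sketch of the same calculation on a made-up data frame follows. The object toy and its values are hypothetical, and prop.table() with margin = 2 is used as an equivalent of the apply() call.

```r
# Made-up data: six jobs, three days, two classes.
toy <- data.frame(
  Day   = factor(c("Mon", "Mon", "Tue", "Tue", "Tue", "Wed"),
                 levels = c("Mon", "Tue", "Wed")),
  Class = factor(c("VF", "F", "VF", "VF", "F", "F"),
                 levels = c("VF", "F"))
)

xtab <- table(toy$Day, toy$Class)

# Divide each column by its total: Pr[Day | Class].
cond_prob <- prop.table(xtab, margin = 2)
cond_prob

# Each class column is a probability distribution over days.
colSums(cond_prob)
```

A day that never occurs for a class (Wed for VF here) gets a raw probability of exactly zero, just like the Sat and Sun entries in the table above.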
Dummy Variables
For the other models, we typically encode a predictor with C categories into C − 1 binary
dummy variables:
> design_mat <- model.matrix(Class ~ Day, data = head(schedulingData))
> design_mat[, colnames(design_mat) != "(Intercept)"]
  DayTue DayWed DayThu DayFri DaySat DaySun
1      1      0      0      0      0      0
2      1      0      0      0      0      0
3      0      0      1      0      0      0
4      0      0      0      1      0      0
5      0      0      0      1      0      0
6      0      1      0      0      0      0
In this case, one predictor with seven levels generates six columns in the design matrix; the
first level (Mon) is the reference and is encoded as a row of zeros.
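A small sketch of where the six columns come from. The factor below is made up. It assumes R's default treatment contrasts, under which model.matrix() drops the first level as the reference; removing the intercept with - 1 keeps one indicator per level instead.

```r
# Made-up factor with all seven day levels declared, four of them observed.
toy <- data.frame(
  Day = factor(c("Mon", "Tue", "Sun", "Fri"),
               levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
)

# Reference-cell coding: C - 1 columns once the intercept is dropped.
ref_coded <- model.matrix(~ Day, data = toy)
ref_coded <- ref_coded[, colnames(ref_coded) != "(Intercept)"]
ncol(ref_coded)

# One indicator per level: C columns.
full_coded <- model.matrix(~ Day - 1, data = toy)
ncol(full_coded)

# The reference level (first row, Mon) is all zeros in the C - 1 coding.
ref_coded[1, ]
```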
Encoding Choices
We make the decision on how to encode the data prior to creating the model.
That means we choose whether to present the model with the grouped categories or
ungrouped binary dummy variables.
This means we could get different representations of the model (see the next two slides).
Does it matter? Let’s do some experiments!
A Tree with Categorical Data
[Figure: tree fit to the categorical predictor. Node 1 splits on wday: {Sun, Sat} goes to
Node 2 (n = 1530) and {Mon, Tues, Wed, Thurs, Fri} goes to Node 3 (n = 3826). Each terminal
node shows a box plot of the outcome on a 0 to 25000 axis.]
A Tree with Dummy Variables
[Figure: tree fit to dummy variables. Node 1 splits on Sun: ≥ 0.5 goes to Node 2 (n = 765),
< 0.5 goes to Node 3. Node 3 splits on Sat: ≥ 0.5 goes to Node 4 (n = 765), < 0.5 goes to
Node 5 (n = 3826). Each terminal node shows a box plot of the outcome on a 0 to 25000 axis.]
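The contrast between the two trees above can be reproduced on simulated data. This is a sketch under stated assumptions: the data frame toy, its outcome values, and the helper n_splits are made up, and the real data behind the figures are not used. With a factor, rpart can group Sat and Sun in a single split; with dummy variables it has to separate them one column at a time.

```r
library(rpart)

set.seed(1)
days <- c("Sun", "Mon", "Tues", "Wed", "Thurs", "Fri", "Sat")
n <- 700

# Simulated outcome: weekends are lower than weekdays (made-up values).
toy <- data.frame(wday = factor(sample(days, n, replace = TRUE), levels = days))
toy$y <- ifelse(toy$wday %in% c("Sat", "Sun"), 5000, 15000) + rnorm(n, sd = 1000)

# Factor encoding: the model sees one predictor with seven levels.
fit_factor <- rpart(y ~ wday, data = toy)

# Dummy encoding: the model sees seven separate 0/1 columns.
dummies <- as.data.frame(model.matrix(~ wday - 1, data = toy))
dummies$y <- toy$y
fit_dummy <- rpart(y ~ ., data = dummies)

# Hypothetical helper: count internal (non-leaf) nodes in a fitted tree.
n_splits <- function(fit) sum(fit$frame$var != "<leaf>")

n_splits(fit_factor)
n_splits(fit_dummy)
```

On data like these, the dummy-variable tree should need at least as many splits as the factor tree to isolate the same weekend group, which mirrors the extra Sat node in the second figure.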
Data Sets
Classification:
German Credit, 13 categorical predictors out of 20 (ROC AUC ≈ 0.76)
UCI Car Evaluation, 6 of 6 (Acc ≈ 0.96)
APM High Performance Computing, 2 of 7 (κ ≈ 0.7)
Regression:
Sacramento house prices, 3 of 8, but one has 37 unique values and another has 68
(RMSE ≈ 0.13, R² ≈ 0.6)
For each data set, we ran 10 separate simulations in which 20% of the data were held out for testing.
Repeated cross-validation was used to tune the models that have tuning parameters.
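A minimal sketch of the resampling layout described above: 10 independent splits, each holding out 20% of the rows for testing. The row count n_rows, the seed, and the helper make_split are hypothetical. The slides do not say how the splits were drawn (for example, whether they were stratified), so simple random sampling is an assumption here.

```r
set.seed(123)
n_rows    <- 1000   # hypothetical data set size
n_sims    <- 10     # separate simulations per data set
test_frac <- 0.20   # share of rows held out for testing

# Hypothetical helper: one random train/test partition of row indices.
make_split <- function(n, frac) {
  test_idx <- sample(n, size = round(frac * n))
  list(train = setdiff(seq_len(n), test_idx), test = test_idx)
}

splits <- lapply(seq_len(n_sims), function(i) make_split(n_rows, test_frac))

length(splits)
length(splits[[1]]$test)
length(intersect(splits[[1]]$train, splits[[1]]$test))
```

Both encodings would then be fit on the same training rows and scored on the same test rows within each simulation, so that differences reflect the encoding and not the split.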
Simulations
Models were fit twice on each data set (with and without dummy variables):
single trees (CART, C5.0)
single rulesets (C5.0, Cubist)
bagged trees
random forests
boosted models (SGB trees, C5.0, Cubist)
A number of performance metrics were computed for each (e.g. RMSE, binomial or
multinomial log–loss, etc.) and the test set results are used to compare models.
Confidence intervals were computed using a linear mixed model to account for the
resample–to–resample correlation structure.
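The two kinds of metric named above can be written in a few lines of base R. The helper names rmse and log_loss are hypothetical, the example predictions are made up, and the eps clipping is a common numerical safeguard, not something stated in the slides. The binomial log-loss is the two-class case of the multinomial version.

```r
# Root mean squared error for regression.
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Multinomial log-loss: mean negative log of the probability assigned to the
# true class. prob is an n x K matrix whose column names are the class labels.
log_loss <- function(truth, prob, eps = 1e-15) {
  col_idx <- match(as.character(truth), colnames(prob))
  p_true  <- prob[cbind(seq_along(truth), col_idx)]
  -mean(log(pmin(pmax(p_true, eps), 1)))
}

rmse(c(1, 2, 3), c(1, 2, 5))

prob <- matrix(c(0.7, 0.2, 0.1,
                 0.1, 0.8, 0.1),
               nrow = 2, byrow = TRUE,
               dimnames = list(NULL, c("VF", "F", "L")))
log_loss(c("VF", "F"), prob)
```

Lower is better for both metrics. A factor-versus-dummy comparison then reduces to a difference or ratio of these values computed on the same test set.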
Regression Model Results
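[Figure: RMSE difference between the two encodings for each regression model (RF, CART,
Cubist_boost, GBM, Cubist, Bagging). Horizontal axis: RMSE Difference, from −0.010 to 0.010;
the left side is labeled "DV Better" and the right side "Factors Better".]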
Classification Model Results
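[Figure: loss ratio for each classification model (CART, C50rule_boost, C50rule,
C50tree_boost, C50tree, RF, Bagging), in three panels: German Credit, UCI Cars, HPC.
Horizontal axis: Loss Ratio, with ticks at 1, 2 and 4. A ratio > 1 means factors did better.]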
It Depends!
For classification:
The larger differences in the UCI car data might indicate that, if the percentage of
categorical predictors is large, it might matter a lot.
However, the magnitude of improvement of factors over dummy variables depends on the
model.
For 2 of the 3 data sets, there was no real difference.
For regression:
It doesn’t seem to matter (except when it does)
Two very similar models (bagging and random forests) showed effects in different
directions.
It Depends!
All of this is also dependent on how easy the problem is.
If no models are able to adequately model the data, the choice of factor vs. dummy won’t
matter.
Also, if the categorical predictors are really important, the difference would most likely be
discernible.
For the Sacramento data, ZIP code is very important. For the HPC data, the protocol variable
is also very informative.
However, one thing is definitive:
Factors Usually Take Less Time to Train
[Figure: training-time speedup from using factors for each model (C50rule, C50tree, CART,
C50tree_boost, C50rule_boost, RF, Bagging, Cubist_boost, Cubist, GBM), in four panels:
German Credit, UCI Cars, HPC, Sacramento. Horizontal axis: Speedup for Using Factors, with
ticks at 1, 2 and 4.]
R and Dummy Variables
In almost all cases, using a formula with a model function will convert factors to dummy
variables.
However, some do not (e.g. rpart, randomForest, gbm, C5.0, NaiveBayes, etc.). This
makes sense for these models.
If you are tuning your model with train, the formula method will create dummy variables and
the non–formula method does not:
> ## dummy variables presented to underlying model:
> train(Class ~ ., data = schedulingData, ...)
>
> ## any factors are preserved
> train(x = schedulingData[, -ncol(schedulingData)],
+ y = schedulingData$Class,
+ ...)