Forecasting Bike-Share Usage via Conditional
Inference Trees and Random Forests
Andrew Trice
April 29, 2015
1 Introduction
A bike-share is a system in which bikes are rented on an on-demand basis. They
can be returned when and where they are no longer needed. Rentals are paid for
through online kiosks that record arrival location, trip length, trip duration,
and departure location. These data are valuable because they give researchers
an indication of a city's transportation patterns. These transportation
patterns can be extrapolated to the city's vitality, its nuances, and the
improvements it needs.
2 Materials
This is a purely research-based competition hosted on the competitive data-
science platform Kaggle. The dataset was provided by Hadi Fanaee-T using
data from Capital Bikeshare and is hosted in the UCI Machine Learning
Repository.
Capital Bikeshare in Washington, DC presents bikeshare rental data
from January 1st, 2011 to December 31st, 2012. The season, whether the day was a
holiday, whether it was a working day, severity of the weather, temperature, ad-
justed temperature, humidity, windspeed, casual rentals, and registered rentals
are all included as features. The goal of the competition is to predict the
bicycle rental count for each individual hour.
We use the open-source R statistical programming language and the RStudio
software to conduct the analysis. The data are already partitioned into training
and testing sets in CSV (comma-separated values) format. We import the "party"
and "randomForest" packages in R.
3 Methods
3.1 Feature Engineering
All of the given data are numerical, and various regression methods could suffice
to forecast bike rentals. We instead forecast demand by first fitting a decision
tree to establish a benchmark and then building a random forests ensemble.
First, though, we must engineer the data into more useful variables.
The datetime feature contains the date in y-m-d form followed by the time.
We substring the time out of datetime and derive an integer variable hour for
the particular hour; later we convert hour back to a factor. We use the weekdays
and as.Date functions on the datetime variable to create a day-of-week factor
variable. We convert the weather, season, and working-day variables to factors.
We notice that there are minimal instances of weather = 4, so we reassign those
observations to weather = 3 to increase model robustness.
With these newly engineered variables we can view average rental counts by
hour and day of the week and understand their effect on bike rentals.
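The steps above can be sketched in a few lines of base R; this is a condensed illustration on a toy datetime vector (the full version, applied to the train and test frames, appears in the Appendix):

```r
# toy datetime strings in the same "y-m-d h:m:s" form as the Kaggle data
datetime <- c("2011-01-01 00:00:00", "2011-01-01 08:00:00", "2011-01-02 17:00:00")

# substring the time, derive an integer hour, then convert hour to a factor
time <- substring(datetime, 12, 20)
hour <- as.factor(as.numeric(substr(time, 1, 2)))

# day-of-week factor via weekdays() and as.Date()
day <- as.factor(weekdays(as.Date(datetime)))

# fold the rare weather = 4 observations into weather = 3
weather <- c(1, 4, 2)
weather[weather == 4] <- 3
weather <- factor(weather)
```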
3.2 Initial Plot Analysis
We notice that Saturday and Sunday have significantly different rental patterns
than the weekdays. The weekdays show the most pronounced rental peaks at 8AM
and 5PM, whereas the weekend shows a gradual rise from 10AM through 5PM. We
proceed to build our initial decision tree.
Figure 1: Bike Rental Averages by Day of Week and Hour
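The averages behind Figure 1 come from grouping rentals by day and hour; a minimal sketch, with a hypothetical toy frame standing in for the engineered training data:

```r
# toy engineered frame: two days, 24 hours each
toy <- expand.grid(day = c("Saturday", "Monday"), hour = 0:23)
toy$count <- ifelse(toy$day == "Monday" & toy$hour %in% c(8, 17), 400, 100)

# average rentals for each day/hour cell, the quantity plotted in Figure 1
avg <- aggregate(count ~ day + hour, data = toy, FUN = mean)
```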
3.3 Conditional Inference Trees
3.3.1 Decision Tree Overview
Decision trees are like flowcharts in which each node represents a test or class
label and each branch represents the decision made. In our case we might see a
decision node on weather: if the weather is 3 or 4, then we expect a reduction
in the number of bikes rented. A tree may pass through multiple nodes and
branches before reaching a conclusion.
Decision trees are commonly used in operations research to determine opti-
mal strategy but have applications to many fields. They are especially useful
for determining variable importance in our forecasting methods.
3.3.2 Conditional Inference Tree Overview
We choose to implement a conditional inference tree, or ctree, as a benchmark,
in preference to a traditional decision tree. The ctree avoids some of the
variable-selection bias of the traditional decision tree by using a
significance-test procedure to select splitting variables, as opposed to
selecting the variables that maximize information gain.
3.3.3 Conditional Inference Tree Implementation
Our ctree is built with the variables: season, holiday, workingday, weather,
temp, atemp, humidity, and hour. This builds an extensive tree with 393 indi-
vidual nodes.
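With the party package loaded, fitting such a tree reduces to a single ctree() call. A minimal sketch on synthetic data (the real model uses the engineered Kaggle features listed above):

```r
library(party)

# synthetic stand-in for the engineered training data
set.seed(1)
n <- 200
toy <- data.frame(hour    = factor(sample(0:23, n, replace = TRUE)),
                  weather = factor(sample(1:3, n, replace = TRUE)),
                  temp    = runif(n, 0, 40))
toy$count <- 50 + 5 * toy$temp + rnorm(n, sd = 10)

# splits are chosen by significance tests rather than information gain
fit.toy <- ctree(count ~ hour + weather + temp, data = toy)
preds <- predict(fit.toy, toy)
```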
3.4 Random Forests
3.4.1 Random Forests Overview
Random forests construct many decision trees and output the mean prediction
of the trees they build. Each tree is grown on a bootstrap sample of the
training data, and each split considers only a random subset of the features,
a procedure known as "feature bagging". This helps correct for overfitting, in
which a model simply memorizes the training data rather than learning the
relationships that drive the dependent variable.
Figure 2: Random Forests Unseen Predictions
Figure 3: Random Forests Variable Importance
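The mean-of-trees behavior can be observed directly: randomForest's predict() can return each tree's individual prediction alongside the forest's aggregate. A toy sketch, not the competition model:

```r
library(randomForest)

set.seed(42)
toy <- data.frame(x1 = runif(100), x2 = runif(100))
toy$y <- 3 * toy$x1 + rnorm(100, sd = 0.1)

rf <- randomForest(y ~ x1 + x2, data = toy, ntree = 25)

# predict.all = TRUE returns every tree's prediction alongside the aggregate
p <- predict(rf, toy, predict.all = TRUE)

# for regression the aggregate is the mean over the individual trees
agree <- all.equal(as.numeric(p$aggregate), as.numeric(rowMeans(p$individual)))
```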
3.4.2 Random Forests Implementation
We use the same variables as before, except we now create two formulas:
one for casual rentals and one for registered rentals. We hypothesize
that the random forests can pick up on the nuances of the two rental types. We
set the seed to 192, the mean rental count of the training data, for
reproducibility; the number of trees in each forest to 2,000; the number of
variables tried at each split to 2, which is the default; and importance = TRUE
so that variable importance is assessed.
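The two-formula scheme boils down to two randomForest() fits whose predictions are summed into the total count. A condensed sketch on synthetic data (the full code, with the engineered features and the parameters above, is in the Appendix; ntree is reduced here for speed):

```r
library(randomForest)

# set the seed to 192, the training-data mean rental count
set.seed(192)

# synthetic stand-in for the engineered training data
toy <- data.frame(hour = factor(sample(0:23, 300, replace = TRUE)),
                  temp = runif(300, 0, 40))
toy$casual     <- 10 + 0.5 * toy$temp + rnorm(300, sd = 2)
toy$registered <- 100 + 2.0 * toy$temp + rnorm(300, sd = 5)

# one forest per rental type
casualFit     <- randomForest(casual ~ hour + temp, data = toy,
                              ntree = 100, importance = TRUE)
registeredFit <- randomForest(registered ~ hour + temp, data = toy,
                              ntree = 100, importance = TRUE)

# total demand is the sum of the two predictions
count <- predict(casualFit, toy) + predict(registeredFit, toy)
```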
4 Results
4.1 Conditional Inference Tree
The conditional inference tree submission received a rank of 950 out of
2219 distinct teams at the time of submission, better than 57% of teams.
The average predicted bike rental value is 189.9, which differs only marginally
from the training-data average of 191.6. This is an appropriate benchmark and
sets a reasonable standard for the random forests implementation.
4.2 Random Forests
At the time of this publication the random forests submission places 479 out
of 2778 teams, better than 82% of teams. Our random forests model receives a
significantly improved score over the conditional inference tree benchmark,
and we can now examine the variables' importance.
4.3 Variable Importance
Figure 4: Casual Rental Variable Importance
Figure 5: Registered Rental Variable Importance
Notice that, in order of importance, the variables for casual rentals are: hour,
workingday, humidity, season, temp, atemp, weather, and holiday. For registered
rentals the variables are: hour, workingday, season, humidity, temp, weather,
atemp, and holiday.
Hour carries far greater importance for registered riders than for casual
riders, primarily at the expense of the workingday feature.
5 References
Côme, Étienne, and Latifa Oukhellou. 2014. "Model-Based Count Series Clustering
for Bike Sharing System Usage Mining: A Case Study with the Vélib' System of
Paris." ACM Transactions on Intelligent Systems and Technology 5(3), Article 39
(July 2014), 21 pages. DOI: 10.1145/2560188.
Fanaee-T, Hadi, and João Gama. "Event Labeling Combining Ensemble Detectors
and Background Knowledge." Progress in Artificial Intelligence (2013): 1-15.
Springer Berlin Heidelberg.
Rixey, R. Alexander. "Station-Level Forecasting of Bikesharing Ridership."
Transportation Research Record: Journal of the Transportation Research Board
2387.1 (2013): 46-55.
6 Appendix
6.1 Acknowledgments
• Brandon Harris for his excellent blog post on feature engineering and
developing a basic conditional inference tree.
• Kaggle user bruschkov for his forum post with a random forests tutorial.
• Beata Strubel for her ggplot rental-times visualization.
• And the entire Kaggle community for the competition and its knowledge-
fostering environment.
6.2 Conditional Inference Tree R Code
#package party is used to implement conditional inference trees
#package ggplot2 is used for visualizations
library('party')
library('ggplot2')

#set working directory
setwd("/users/andrewtrice/Desktop/Data Science Capstone")

#read in train/test
train <- read.csv("train.csv")
test <- read.csv("test.csv")

str(train)

#factorize training set
train_factor <- train
train_factor$weather <- factor(train$weather)
train_factor$holiday <- factor(train$holiday)
train_factor$workingday <- factor(train$workingday)
train_factor$season <- factor(train$season)

#factorize test set
test_factor <- test
test_factor$weather <- factor(test$weather)
test_factor$holiday <- factor(test$holiday)
test_factor$workingday <- factor(test$workingday)
test_factor$season <- factor(test$season)

#create time column by stripping out timestamp
train_factor$time <- substring(train$datetime, 12, 20)
test_factor$time <- substring(test$datetime, 12, 20)

#factorize new timestamp column
train_factor$time <- factor(train_factor$time)
test_factor$time <- factor(test_factor$time)

#create day-of-week column
train_factor$day <- weekdays(as.Date(train_factor$datetime))
train_factor$day <- as.factor(train_factor$day)
test_factor$day <- weekdays(as.Date(test_factor$datetime))
test_factor$day <- as.factor(test_factor$day)

aggregate(train_factor[, "count"], list(train_factor$day), mean)

#create Sunday indicator variable
train_factor$sunday <- ifelse(train_factor$day == "Sunday", "1", "0")
test_factor$sunday <- ifelse(test_factor$day == "Sunday", "1", "0")

#convert to factor
train_factor$sunday <- as.factor(train_factor$sunday)
test_factor$sunday <- as.factor(test_factor$sunday)

#convert time and create $hour as integer to evaluate for daypart
train_factor$hour <- as.numeric(substr(train_factor$time, 1, 2))
test_factor$hour <- as.numeric(substr(test_factor$time, 1, 2))

#create daypart column, default to 4 to make things easier for ourselves
train_factor$daypart <- "4"
test_factor$daypart <- "4"

#4AM - 9AM = 1
train_factor$daypart[(train_factor$hour < 10) & (train_factor$hour > 3)] <- 1
test_factor$daypart[(test_factor$hour < 10) & (test_factor$hour > 3)] <- 1

#10AM - 3PM = 2
train_factor$daypart[(train_factor$hour < 16) & (train_factor$hour > 9)] <- 2
test_factor$daypart[(test_factor$hour < 16) & (test_factor$hour > 9)] <- 2

#4PM - 9PM = 3
train_factor$daypart[(train_factor$hour < 22) & (train_factor$hour > 15)] <- 3
test_factor$daypart[(test_factor$hour < 22) & (test_factor$hour > 15)] <- 3

#convert daypart to factor
train_factor$daypart <- as.factor(train_factor$daypart)
test_factor$daypart <- as.factor(test_factor$daypart)

#convert hour back to factor
train_factor$hour <- as.factor(train_factor$hour)
test_factor$hour <- as.factor(test_factor$hour)

#formula for total rental count
formula <- count ~ season + holiday + workingday + weather + temp +
  atemp + humidity + hour + daypart + sunday

#conditional inference tree
fit.ctree <- ctree(formula, data = train_factor)

#examine fitted model
fit.ctree

#run model against test data set
predict.ctree <- predict(fit.ctree, test_factor)

#note: importance() applies to randomForest fits, not ctree objects;
#variable importance plots are produced in the random forests code

#build a data frame with our results
submit.ctree <- data.frame(datetime = test$datetime, count = predict.ctree)

#write results to .csv for submission
write.csv(submit.ctree, file = "submit_ctree_v2.csv", row.names = FALSE)
6.3 Random Forests R Code
#import necessary packages
library('randomForest')
library('ggplot2')

#set working directory
setwd("/users/andrewtrice/Desktop/Data Science Capstone")

#import datasets from working directory
train <- read.csv("train.csv") #use nrows=1000 for speed during feature engineering
test <- read.csv("test.csv")   #use nrows=1000 for speed during feature engineering

str(train)

#factorize training set
train_factor <- train
train_factor$weather <- factor(train$weather)
train_factor$holiday <- factor(train$holiday)
train_factor$workingday <- factor(train$workingday)
train_factor$season <- factor(train$season)

#factorize test set
test_factor <- test
test_factor$weather <- factor(test$weather)
test_factor$holiday <- factor(test$holiday)
test_factor$workingday <- factor(test$workingday)
test_factor$season <- factor(test$season)

#create time column by stripping out timestamp
train_factor$time <- substring(train$datetime, 12, 20)
test_factor$time <- substring(test$datetime, 12, 20)

#factorize new timestamp column
train_factor$time <- factor(train_factor$time)
test_factor$time <- factor(test_factor$time)

#create day-of-week column
train_factor$day <- weekdays(as.Date(train_factor$datetime))
train_factor$day <- as.factor(train_factor$day)
test_factor$day <- weekdays(as.Date(test_factor$datetime))
test_factor$day <- as.factor(test_factor$day)

aggregate(train_factor[, "count"], list(train_factor$day), mean)

#create Sunday indicator variable
train_factor$sunday <- ifelse(train_factor$day == "Sunday", "1", "0")
test_factor$sunday <- ifelse(test_factor$day == "Sunday", "1", "0")

#convert to factor
train_factor$sunday <- as.factor(train_factor$sunday)
test_factor$sunday <- as.factor(test_factor$sunday)

#convert time and create $hour as integer to evaluate for daypart
train_factor$hour <- as.numeric(substr(train_factor$time, 1, 2))
test_factor$hour <- as.numeric(substr(test_factor$time, 1, 2))

#create daypart column, default to 4 to make things easier for ourselves
train_factor$daypart <- "4"
test_factor$daypart <- "4"

#4AM - 9AM = 1
train_factor$daypart[(train_factor$hour < 10) & (train_factor$hour > 3)] <- 1
test_factor$daypart[(test_factor$hour < 10) & (test_factor$hour > 3)] <- 1

#10AM - 3PM = 2
train_factor$daypart[(train_factor$hour < 16) & (train_factor$hour > 9)] <- 2
test_factor$daypart[(test_factor$hour < 16) & (test_factor$hour > 9)] <- 2

#4PM - 9PM = 3
train_factor$daypart[(train_factor$hour < 22) & (train_factor$hour > 15)] <- 3
test_factor$daypart[(test_factor$hour < 22) & (test_factor$hour > 15)] <- 3

#convert daypart to factor
train_factor$daypart <- as.factor(train_factor$daypart)
test_factor$daypart <- as.factor(test_factor$daypart)

#convert hour back to factor
train_factor$hour <- as.factor(train_factor$hour)
test_factor$hour <- as.factor(test_factor$hour)

#get rid of weather 4: fold the rare cases into weather 3
train_factor$weather[train_factor$weather == 4] <- 3
train_factor$weather <- droplevels(train_factor$weather)
test_factor$weather[test_factor$weather == 4] <- 3
test_factor$weather <- droplevels(test_factor$weather)

#model parameters
myNtree <- 500
myMtry <- 7
myImportance <- TRUE

#set the seed to the training mean for model reproducibility
set.seed(192)

#fit and predict casual
casualFit <- randomForest(casual ~ season + holiday + workingday + weather +
  temp + atemp + humidity + hour, data = train_factor,
  ntree = myNtree, mtry = myMtry, importance = myImportance)
predictCasual <- predict(casualFit, test_factor)

#fit and predict registered
registeredFit <- randomForest(registered ~ season + holiday + workingday +
  weather + temp + atemp + humidity + hour, data = train_factor,
  ntree = myNtree, mtry = myMtry, importance = myImportance)
predictRegistered <- predict(registeredFit, test_factor)

#add both predictions into final count, round to whole number
test$count <- round(predictCasual + predictRegistered)

#test plots
plot(train$count)
plot(test$count)

#write output to csv for submission
submit <- data.frame(datetime = test$datetime, count = test$count)
write.csv(submit, file = "randomForest_Prediction.csv", row.names = FALSE)

#label variable importances for both casual and registered
imp1 <- importance(casualFit, type = 1)
imp2 <- importance(registeredFit, type = 1)
featureImportance1 <- data.frame(Feature = row.names(imp1), Importance = imp1[, 1])
featureImportance2 <- data.frame(Feature = row.names(imp2), Importance = imp2[, 1])

#plot variable importances
casualImportance <- ggplot(featureImportance1,
    aes(x = reorder(Feature, Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "blue") +
  coord_flip() +
  theme_light(base_size = 20) +
  xlab("") +
  ylab("") +
  ggtitle("Casual Rental Feature Importance\n") +
  theme(plot.title = element_text(size = 18))

registeredImportance <- ggplot(featureImportance2,
    aes(x = reorder(Feature, Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "blue") +
  coord_flip() +
  theme_light(base_size = 20) +
  xlab("") +
  ylab("") +
  ggtitle("Registered Rental Feature Importance\n") +
  theme(plot.title = element_text(size = 18))

train_factor$workingday <- factor(train$workingday)
train_factor$season <- factor(train$season)

#factorize test set
test_factor <- test
test_factor$weather <- factor(test$weather)
test_factor$holiday <- factor(test$holiday)
test_factor$workingday <- factor(test$workingday)
test_factor$season <- factor(test$season)

#create time column by stripping out timestamp
train_factor$time <- substring(train$datetime, 12, 20)
test_factor$time <- substring(test$datetime, 12, 20)

#factorize new timestamp column
train_factor$time <- factor(train_factor$time)
test_factor$time <- factor(test_factor$time)

#create day of week column
train_factor$day <- weekdays(as.Date(train_factor$datetime))
train_factor$day <- as.factor(train_factor$day)
test_factor$day <- weekdays(as.Date(test_factor$datetime))
test_factor$day <- as.factor(test_factor$day)

aggregate(train_factor[, "count"], list(train_factor$day), mean)

#create Sunday indicator (compare against "Sunday", not "1")
train_factor$sunday[train_factor$day == "Sunday"] <- "1"
train_factor$sunday[train_factor$day != "Sunday"] <- "0"

test_factor$sunday[test_factor$day == "Sunday"] <- "1"
test_factor$sunday[test_factor$day != "Sunday"] <- "0"

#convert to factor
train_factor$sunday <- as.factor(train_factor$sunday)
test_factor$sunday <- as.factor(test_factor$sunday)

#convert time and create $hour as integer to evaluate for daypart
train_factor$hour <- as.numeric(substr(train_factor$time, 1, 2))
test_factor$hour <- as.numeric(substr(test_factor$time, 1, 2))

#create daypart column, default to 4 to make things easier for ourselves
train_factor$daypart <- "4"
test_factor$daypart <- "4"

#4AM - 10AM = 1
train_factor$daypart[(train_factor$hour < 10) & (train_factor$hour > 3)] <- 1
test_factor$daypart[(test_factor$hour < 10) & (test_factor$hour > 3)] <- 1

#11AM - 3PM = 2
train_factor$daypart[(train_factor$hour < 16) & (train_factor$hour > 9)] <- 2
test_factor$daypart[(test_factor$hour < 16) & (test_factor$hour > 9)] <- 2

#4PM - 9PM = 3
train_factor$daypart[(train_factor$hour < 22) & (train_factor$hour > 15)] <- 3
test_factor$daypart[(test_factor$hour < 22) & (test_factor$hour > 15)] <- 3

#convert daypart to factor
train_factor$daypart <- as.factor(train_factor$daypart)
test_factor$daypart <- as.factor(test_factor$daypart)

#convert hour back to factor
train_factor$hour <- as.factor(train_factor$hour)
test_factor$hour <- as.factor(test_factor$hour)

#create model formula for the rental count
formula <- count ~ season + holiday + workingday + weather + temp + atemp + humidity + hour + daypart + sunday

#conditional inference tree
fit.ctree <- ctree(formula, data = train_factor)

#examine the fitted tree
fit.ctree

#run model against test data set
predict.ctree <- predict(fit.ctree, test_factor)

#label importance
#note: importance() comes from randomForest and does not apply to a ctree
#fit; this plotting block only runs against the random forest models below
imp <- importance(fit.ctree, type = 1)
featureImportance <- data.frame(Feature = row.names(imp), Importance = imp[, 1])

#plot variable importance
p <- ggplot(featureImportance, aes(x = reorder(Feature, Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "#12cfff") +
  coord_flip() +
  theme_light(base_size = 20) +
  xlab("Importance") +
  ylab("") +
  ggtitle("Random Forest Feature Importance\n") +
  theme(plot.title = element_text(size = 18))

#build a dataframe with our results
submit.ctree <- data.frame(datetime = test$datetime, count = predict.ctree)

#write results to .csv for submission
write.table(submit.ctree, file = "submit_ctree_v2.csv", sep = ",", row.names = FALSE, col.names = TRUE)

6.3 Random Forests R Code

#import necessary packages
library('randomForest')
library('ggplot2')

#set working directory
setwd("/users/andrewtrice/Desktop/Data Science Capstone")

#import datasets from working directory
train <- read.csv("train.csv") #use nrows=1000 for speed during feature engineering
test <- read.csv("test.csv") #use nrows=1000 for speed during feature engineering
str(train)

#factorize training set
train_factor <- train
train_factor$weather <- factor(train$weather)
train_factor$holiday <- factor(train$holiday)
train_factor$workingday <- factor(train$workingday)
train_factor$season <- factor(train$season)

#factorize test set
test_factor <- test
test_factor$weather <- factor(test$weather)
test_factor$holiday <- factor(test$holiday)
test_factor$workingday <- factor(test$workingday)
test_factor$season <- factor(test$season)

#create time column by stripping out timestamp
train_factor$time <- substring(train$datetime, 12, 20)
test_factor$time <- substring(test$datetime, 12, 20)

#factorize new timestamp column
train_factor$time <- factor(train_factor$time)
test_factor$time <- factor(test_factor$time)

#create day of week column
train_factor$day <- weekdays(as.Date(train_factor$datetime))
train_factor$day <- as.factor(train_factor$day)
test_factor$day <- weekdays(as.Date(test_factor$datetime))
test_factor$day <- as.factor(test_factor$day)

aggregate(train_factor[, "count"], list(train_factor$day), mean)

#create Sunday indicator (compare against "Sunday", not "1")
train_factor$sunday[train_factor$day == "Sunday"] <- "1"
train_factor$sunday[train_factor$day != "Sunday"] <- "0"

test_factor$sunday[test_factor$day == "Sunday"] <- "1"
test_factor$sunday[test_factor$day != "Sunday"] <- "0"

#convert to factor
train_factor$sunday <- as.factor(train_factor$sunday)
test_factor$sunday <- as.factor(test_factor$sunday)

#convert time and create $hour as integer to evaluate for daypart
train_factor$hour <- as.numeric(substr(train_factor$time, 1, 2))
test_factor$hour <- as.numeric(substr(test_factor$time, 1, 2))

#create daypart column, default to 4 to make things easier for ourselves
train_factor$daypart <- "4"
test_factor$daypart <- "4"

#4AM - 10AM = 1
train_factor$daypart[(train_factor$hour < 10) & (train_factor$hour > 3)] <- 1
test_factor$daypart[(test_factor$hour < 10) & (test_factor$hour > 3)] <- 1

#11AM - 3PM = 2
train_factor$daypart[(train_factor$hour < 16) & (train_factor$hour > 9)] <- 2
test_factor$daypart[(test_factor$hour < 16) & (test_factor$hour > 9)] <- 2

#4PM - 9PM = 3
train_factor$daypart[(train_factor$hour < 22) & (train_factor$hour > 15)] <- 3
test_factor$daypart[(test_factor$hour < 22) & (test_factor$hour > 15)] <- 3

#convert daypart to factor
train_factor$daypart <- as.factor(train_factor$daypart)
test_factor$daypart <- as.factor(test_factor$daypart)

#convert hour back to factor
train_factor$hour <- as.factor(train_factor$hour)
test_factor$hour <- as.factor(test_factor$hour)

#get rid of weather 4 (recode on the factorized copies used for fitting)
train_factor$weather[train_factor$weather == 4] <- 3
test_factor$weather[test_factor$weather == 4] <- 3
train_factor$weather <- droplevels(train_factor$weather)
test_factor$weather <- droplevels(test_factor$weather)

#variables
myNtree <- 500
myMtry <- 7
myImportance <- TRUE

#set the seed for model reproducibility
set.seed(192)

#fit and predict casual
casualFit <- randomForest(casual ~ season + holiday + workingday + weather + temp + atemp + humidity + hour,
                          data = train_factor, ntree = myNtree, mtry = myMtry, importance = myImportance)
predictCasual <- predict(casualFit, test_factor)

#fit and predict registered
registeredFit <- randomForest(registered ~ season + holiday + workingday + weather + temp + atemp + humidity + hour,
                              data = train_factor, ntree = myNtree, mtry = myMtry, importance = myImportance)
predictRegistered <- predict(registeredFit, test_factor)

#add both predictions into final count, round to whole number
test$count <- round(predictCasual + predictRegistered)

#test plot
plot(train$count)
plot(test$count)

#write output to csv for submission
submit <- data.frame(datetime = test$datetime, count = test$count)
write.csv(submit, file = "randomForest_Prediction.csv", row.names = FALSE)

#label variable importances for both casual and registered
imp1 <- importance(casualFit, type = 1)
imp2 <- importance(registeredFit, type = 1)
featureImportance1 <- data.frame(Feature = row.names(imp1), Importance = imp1[, 1])
featureImportance2 <- data.frame(Feature = row.names(imp2), Importance = imp2[, 1])
#plot variable importances
casualImportance <- ggplot(featureImportance1, aes(x = reorder(Feature, Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "blue") +
  coord_flip() +
  theme_light(base_size = 20) +
  xlab("") +
  ylab("") +
  ggtitle("Casual Rental Feature Importance\n") +
  theme(plot.title = element_text(size = 18))

registeredImportance <- ggplot(featureImportance2, aes(x = reorder(Feature, Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "blue") +
  coord_flip() +
  theme_light(base_size = 20) +
  xlab("") +
  ylab("") +
  ggtitle("Registered Rental Feature Importance\n") +
  theme(plot.title = element_text(size = 18))
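The leaderboard improvement reported in the Results can also be sanity-checked locally before submitting. Kaggle scores this competition on root mean squared logarithmic error (RMSLE); the following is a minimal sketch only — the helper name rmsle and the toy vectors are illustrative and not part of the scripts above — of how the two models could be compared on a held-out slice of the training set:

```r
#Sketch: RMSLE helper for local scoring (illustrative, not from the
#original scripts). Log-transforming penalizes relative, not absolute, error.
rmsle <- function(actual, predicted) {
  sqrt(mean((log(predicted + 1) - log(actual + 1))^2))
}

#A perfect prediction scores 0; larger deviations score higher.
rmsle(c(10, 20, 30), c(10, 20, 30))  # 0
rmsle(c(10, 20, 30), c(12, 18, 33))
```

Scoring the ctree and random forest predictions on the same held-out rows with a helper like this should roughly mirror the leaderboard ordering reported in Section 4.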