Nyc open data project ii -- predict where to get and return my citibike

CCiittiibbiikkeeddaattaaaannddpprreeddiiccttiioonn
WhichstationshouldIchoose?
Data Scientist Team at SupStat Inc (Vivian Zhang, Yibo Chen, Kai Xiao, Tong He)
Check out our blog and newsletters at http://www.supstat.com and http://nycdatascience.com
Citibike data and prediction http://nycdatascience.com/slides/NYCOpenDataMeetup/citibike_v2/citibike.html#1
1 of 58 6/12/14, 5:37 PM

Overview
Overview1.
Citibike Data2.
Scrapping3.
Data Description4.
Modeling5.
2/58
2 of 58 6/12/14, 5:37 PM

Citibike
Citibike is hosting a public bike service.
There are many bike stations in NYC.
People want to take bike from a station with at least one available bike.
And when they get to the destination, they want to return bike to a station with at least one
available slot.
Our goal is to predict where to rent and where to return
3/58
3 of 58 6/12/14, 5:37 PM

Citibike
4/58
4 of 58 6/12/14, 5:37 PM

Citibike data
Where are the data sets?
citibike is great in opening their datasets.
They provide previous datasets about trips.
But that's not what we are looking for now.
5/58
5 of 58 6/12/14, 5:37 PM

Citibike data
Where to find data for each stations bikes and slots?
We can visit http://citibikenyc.com/stations/json to see the current data.
With historical data, We want to provide prediction and guide people with a better choice.
6/58
6 of 58 6/12/14, 5:37 PM

Historical data
we want to scrap data from the website every 5 minutes.
How to do that in R?
7/58
7 of 58 6/12/14, 5:37 PM

Data scraping
We use the following code:
This is the time we get the data.
require(rjson)
jsonURL = "http://citibikenyc.com/stations/json"
json_data = fromJSON(file = jsonURL)
names(json_data)
## [1] "executionTime" "stationBeanList"
json_data$executionTime
## [1] "2014-04-24 11:11:03 AM"
8/58
8 of 58 6/12/14, 5:37 PM

Data scraping
Our data is in the form of list. We want to change it into data.frame.
What can we get from this data?
names(json_data$stationBeanList[[1]])
## [1] "id" "stationName"
## [3] "availableDocks" "totalDocks"
## [5] "latitude" "longitude"
## [7] "statusValue" "statusKey"
## [9] "availableBikes" "stAddress1"
## [11] "stAddress2" "city"
## [13] "postalCode" "location"
## [15] "altitude" "testStation"
## [17] "lastCommunicationTime" "landMark"
9/58
9 of 58 6/12/14, 5:37 PM

Data scraping
We just need id, availableDocks, availableBikes, and executionTime.
executionTime = json_data$executionTime
ids = sapply(json_data$stationBeanList, function(x) x$id)
free = sapply(json_data$stationBeanList, function(x) x$availableDocks)
bikes = sapply(json_data$stationBeanList, function(x) x$availableBikes)
data = data.frame(time = executionTime, station_id = ids, free = free, bikes = bikes)
10/58
10 of 58 6/12/14, 5:37 PM

Data scraping
And we can get something like this:
head(data)
## time station_id free bikes
## 1 2014-04-24 11:11:03 AM 72 19 18
## 2 2014-04-24 11:11:03 AM 79 13 15
## 3 2014-04-24 11:11:03 AM 82 10 17
## 4 2014-04-24 11:11:03 AM 83 44 17
## 5 2014-04-24 11:11:03 AM 116 8 30
## 6 2014-04-24 11:11:03 AM 119 16 2
11/58
11 of 58 6/12/14, 5:37 PM

CRON
We use cron to schedule our tasks, including our web scrapper.
The log service for cron is off by default. We can first
and delete the '#' before '#cron.*'. Then we restart rsyslog with
And now we have successfully enable the log management system of cron.
Use this to check the log of cron:
sudo vi /etc/rsyslog.d/50-default.conf
sudo service rsyslog restart
sudo vi /var/log/cron.log
12/58
12 of 58 6/12/14, 5:37 PM

CRON
Then we can restart the CRON service.
If the following command return a pid, then our cron service is on.
Or you can use this alternative command:
sudo service cron restart
pgrep cron
ps aux | grep 'cron'
13/58
13 of 58 6/12/14, 5:37 PM

CRON
The simplest way to add tasks is create a .sh script.
For example, we create a shell script named "citibike.sh".
It is preferred to use the absolute path.
/usr/R/R-3.0/bin/Rscript /home/vivianzhang/citibike/citibike.R
/usr/R/R-3.0/bin/Rscript /home/vivianzhang/citibike/writeDB.R
14/58
14 of 58 6/12/14, 5:37 PM

CRON
The final step is to add our script to the list of cron tasks.
And we can add the following line to the end of crontab:
And restart cron to validate our operation.
Here, the first parameter "*/5" means do it every 5 minutes.
Next four parameters correspond to hour, day, month, weekday.
And finally is the command to run.
sudo vi /etc/crontab
*/5 * * * * root /home/vivianzhang/citibike/citibike.sh
15/58
15 of 58 6/12/14, 5:37 PM

CRON
Other Examples for cron tasks.
0th min, 23:00 to 7:00,every 2 hours,"," mean 23:00-7:00 or 8:00
This task will print a sentence into test.txt at 23:00,1:00,3:00,5:00,7:00,8:00.
what if we want to cron every 30 minutes?
0 23-7/2,8 * * * echo "Have a good dream：）" >> /tmp/test.txt
0 0,3,6,9,12,15,18,21 ...
30 1,4,7,10,13,16,19,22 ...
16/58
16 of 58 6/12/14, 5:37 PM

CRONTAB
On Apple MAC machine, we use crontab.
Create a file, or open an existing file to put your task description. such as 'crontest'1.
Edit your tasks as stated previously.2.
Start crontab, and list running tasks.3.
Check whether it run correctly4.
You can remove all the cron tasks after you are done5.
17/58
17 of 58 6/12/14, 5:37 PM

CRONTAB
# make a new crontab file
sudo touch /etc/crontest
# change the content into this
sudo vi /etc/crontest
# content of the file
# solution to cron every minute
*/1 * * * * echo "test cron" >> /tmp/test.txt
# run the job into your cron task list
crontab /etc/crontest
# check crontab list
crontab -l
# check whether the log is written to your temp file
vi /tmp/test.txt
18/58
18 of 58 6/12/14, 5:37 PM

CRONTAB
# you should see a few works in the file
# remove the cron job
crontab -r
# double check to see if the job is removed
crontab -l
19/58
19 of 58 6/12/14, 5:37 PM

PostgreSQL
We choose PostgreSQL as the database, which is open-sourced and R-friendly.
We can easily connect to it with a command like this:
require(RPostgreSQL)
conn = dbConnect(dbDriver("PostgreSQL"), user = "vivianzhang", password = "123456",
dbname = "station_all", host = "127.0.0.1", port = "5432")
20/58
20 of 58 6/12/14, 5:37 PM

PostgreSQL
Our server has limited memory of 1GB, we can’t fetch too many records at once. 10000
records/fetch is okay.
The following code enable us extract the first 100 records in table:
And we can fetch 101th record to 10,000th record in the table
res <- dbSendQuery(conn, statement = "SELECT * FROM citibike limit 10000")
data1 <- fetch(res, n = 100)
data2 <- fetch(res, n = -1)
21/58
21 of 58 6/12/14, 5:37 PM

PostgreSQL
The size of the table may be larger than the memory.
An alternative method is to directly play with PostgreSQL. We can copy the table to a local file.
First we need to use a valid database user.
To use the default user in PostgreSQL, one can
Then in the interactive interface, use the following SQL command to export the table.
sudo su - postgres
psql
c station_all
copy (SELECT * FROM citibike) TO '/tmp/data.csv' WITH CSV HEADER
22/58
22 of 58 6/12/14, 5:37 PM

Data preprocessing
It is easy to handle date type of data with the following code:
Our data is clean, and useful information includes
dat$station_time = as.POSIXct(dat$station_time, format = "%Y-%m-%d %H:%M:%S")
time
available bikes
available spots.
·
·
·
23/58
23 of 58 6/12/14, 5:37 PM

Data preprocessing
We extract data from a single station, and name it "data_all". This is what we are gonna use:
Let us explore first 10,000 records.
load("data_all.rda")
head(data_all)
## station_time bikes free
## 1 2013-08-21 14:10:00 1 37
## 2 2013-08-21 14:15:00 2 36
## 3 2013-08-21 14:20:00 2 36
## 4 2013-08-21 14:25:00 2 36
## 5 2013-08-21 14:30:00 2 36
## 6 2013-08-21 14:35:00 3 35
data = data_all[1:10000, ]
24/58
24 of 58 6/12/14, 5:37 PM

Time Series Model
We would like to predict the ratio of bikes in this station.
data$total <- data$bikes + data$free
data$ratio <- data$bikes/data$total
head(data)
## station_time bikes free total ratio
## 1 2013-08-21 14:10:00 1 37 38 0.02632
## 2 2013-08-21 14:15:00 2 36 38 0.05263
## 3 2013-08-21 14:20:00 2 36 38 0.05263
## 4 2013-08-21 14:25:00 2 36 38 0.05263
## 5 2013-08-21 14:30:00 2 36 38 0.05263
## 6 2013-08-21 14:35:00 3 35 38 0.07895
25/58
25 of 58 6/12/14, 5:37 PM

Time Series Model
The time interval between our data points is 5 minutes. Let's check if there's any trends:
five_day_ind = 1:(288 * 5)
plot(data$ratio[five_day_ind], type = "l")
26/58
26 of 58 6/12/14, 5:37 PM

Time Series Model
Then we turn it into a time series object with frequency=288
Let's check our data
There is an NA value in our sequence.
data.ts <- ts(data$ratio, start = 1, frequency = 288)
sum(is.na(data.ts))
## [1] 1
27/58
27 of 58 6/12/14, 5:37 PM

Time Series Model
Use the following code to fill them with the previous value.
na.position <- which(is.na(data.ts))
data.ts[na.position] <- data.ts[na.position - 1]
any(is.na(data.ts))
## [1] FALSE
28/58
28 of 58 6/12/14, 5:37 PM

Time Series Model
The "seasonal" trend is obvious. We need to make use of this information.
It is a smooth function, extract seasonal pattern and enable us to focus on the higher-level
trends.
fit <- stl(data.ts, "periodic")
colnames(fit$time.series)
## [1] "seasonal" "trend" "remainder"
29/58
29 of 58 6/12/14, 5:37 PM

Time Series Model
The fitted result looks like:
head(fit$time.series)
## seasonal trend remainder
## [1,] -0.2251 0.2772 -0.025791
## [2,] -0.2133 0.2784 -0.012396
## [3,] -0.2126 0.2795 -0.014250
## [4,] -0.2156 0.2806 -0.012383
## [5,] -0.2067 0.2817 -0.022373
## [6,] -0.2089 0.2828 0.005042
30/58
30 of 58 6/12/14, 5:37 PM

Time Series Model
Black line is original data showing how much percentage of bikes are available at each time
point. Red line is extracted seasonal effect.
plot(data$ratio[five_day_ind], type = "l", ylim = c(-0.5, 1), xlim = c(0, 1500))
lines(fit$time.series[five_day_ind, 1], col = 2)
leg.txt = c("origin", "seasonal")
legend(1200, 1, leg.txt, cex = 1, lty = 1, col = 1:2)
31/58
31 of 58 6/12/14, 5:37 PM

Time Series Model
The green line is the trend:
leg.txt = c("origin", "seasonal", "trends")
legend(1200, 1, leg.txt, cex = 1, lty = 1, col = 1:3)
32/58
32 of 58 6/12/14, 5:37 PM

Time Series Model
We get an approximation of our data by adding trend and seasonal effects. Blue line shows the
mixed effect of trend and seasonal. The remaining difference is the remainder.
lines(fit$time.series[five_day_ind, 1] + fit$time.series[five_day_ind, 2], col = 4)
leg.txt = c("origin", "approx")
legend(1200, 1, leg.txt, cex = 1, lty = 1, col = c(1, 4))
33/58
33 of 58 6/12/14, 5:37 PM

Time Series Model
Generally, a single trip with citibike is around 30 minutes. And normal user will pay additional
charges for a journey over 30 minutes.
We want to focus on the prediction for next 30 minutes, given the update happens every 5
minutes, we will fit 6 data points.
34/58
34 of 58 6/12/14, 5:37 PM

Time Series Model
With the R package 'forecast', we can do time series prediction easily.
library(forecast)
# h is number of periods for forecasting
pred = as.numeric(forecast(fit, h = 6)$mean)
35/58
35 of 58 6/12/14, 5:37 PM

Machine Learning Model
Machine learning could also be applied to the time series data.
Here we are going to use GBM for demonstration.
Before we apply gbm to our data. We need to extract some more time related features.
Especially, we need to use previous values to predict.
36/58
36 of 58 6/12/14, 5:37 PM

Feature extraction
traindata = data[1:2000, ]
traindata = traindata[c("station_time", "ratio")]
names(traindata) <- c("time", "y")
head(traindata)
## time y
## 1 2013-08-21 14:10:00 0.02632
## 2 2013-08-21 14:15:00 0.05263
## 3 2013-08-21 14:20:00 0.05263
## 4 2013-08-21 14:25:00 0.05263
## 5 2013-08-21 14:30:00 0.05263
## 6 2013-08-21 14:35:00 0.07895
37/58
37 of 58 6/12/14, 5:37 PM

Feature extraction
Time points to make prediction:
h = 6
new_time <- seq(from=traindata$time[nrow(traindata)],
by='5 min', length.out=h+1)[-1]
new_time
## [1] "2013-08-28 12:50:00 EST" "2013-08-28 12:55:00 EST"
## [3] "2013-08-28 13:00:00 EST" "2013-08-28 13:05:00 EST"
## [5] "2013-08-28 13:10:00 EST" "2013-08-28 13:15:00 EST"
38/58
38 of 58 6/12/14, 5:37 PM

Feature extraction
Let's combind our train and test data for further features.
test_id <- seq(nrow(traindata) + 1, by = 1, length.out = h)
traindata <- rbind(traindata, data.frame(time = new_time, y = NA))
test_id
## [1] 2001 2002 2003 2004 2005 2006
39/58
39 of 58 6/12/14, 5:37 PM

Feature extraction
Of course, this service may be popular in weekends than weekdays. So we need a variable to
mark it.
traindata$weekday <- as.factor(weekdays(traindata$time))
head(traindata$weekday)
## [1] Wednesday Wednesday Wednesday Wednesday Wednesday Wednesday
## Levels: Friday Monday Saturday Sunday Thursday Tuesday Wednesday
40/58
40 of 58 6/12/14, 5:37 PM

Feature extraction
Time stamp is useful:
hh <- as.numeric(strftime(traindata$time, format = "%H", tz = "EST"))
mm <- as.numeric(strftime(traindata$time, format = "%M", tz = "EST"))
ss <- as.numeric(strftime(traindata$time, format = "%S", tz = "EST"))
traindata$time_hms <- hh + 60 * mm + 3600 * ss
head(traindata)
## time y weekday time_hms
## 1 2013-08-21 14:10:00 0.02632 Wednesday 614
## 2 2013-08-21 14:15:00 0.05263 Wednesday 914
## 3 2013-08-21 14:20:00 0.05263 Wednesday 1214
## 4 2013-08-21 14:25:00 0.05263 Wednesday 1514
## 5 2013-08-21 14:30:00 0.05263 Wednesday 1814
## 6 2013-08-21 14:35:00 0.07895 Wednesday 2114
41/58
41 of 58 6/12/14, 5:37 PM

Feature extraction
How to combine previous information? We need to compute a lagged time series.
A lagged time series is a "delayed" time series, as shown below
f_lag <- function(x, lag=0)
c(rep(NA, lag), x[1:(length(x)-lag)])
f_lag(1:10, 1)
## [1] NA 1 2 3 4 5 6 7 8 9
f_lag(1:10, 4)
## [1] NA NA NA NA 1 2 3 4 5 6
42/58
42 of 58 6/12/14, 5:37 PM

Feature extraction
To use the information from 12:30 in 12:40, we can do it with lagged time series.
for (lag in 1:12) {
traindata[[paste("lag_", lag, sep = "")]] <- f_lag(traindata$y, lag)
}
traindata[1:3, ]
## time y weekday time_hms lag_1 lag_2 lag_3
## 1 2013-08-21 14:10:00 0.02632 Wednesday 614 NA NA NA
## 2 2013-08-21 14:15:00 0.05263 Wednesday 914 0.02632 NA NA
## 3 2013-08-21 14:20:00 0.05263 Wednesday 1214 0.05263 0.02632 NA
## lag_4 lag_5 lag_6 lag_7 lag_8 lag_9 lag_10 lag_11 lag_12
## 1 NA NA NA NA NA NA NA NA NA
43/58
43 of 58 6/12/14, 5:37 PM

Feature extraction
Don't worry about those NAs! They are inevitable in a lagged series.
traindata[1:10, 5:7]
## lag_1 lag_2 lag_3
## 1 NA NA NA
## 2 0.02632 NA NA
## 3 0.05263 0.02632 NA
## 4 0.05263 0.05263 0.02632
## 5 0.05263 0.05263 0.05263
## 6 0.05263 0.05263 0.05263
## 7 0.07895 0.05263 0.05263
## 8 0.05263 0.07895 0.05263
## 9 0.05263 0.05263 0.07895
## 10 0.05263 0.05263 0.05263
44/58
44 of 58 6/12/14, 5:37 PM

Feature extraction
Finally, we have our data
test <- traindata[test_id, -1]
train <- traindata[-test_id, -1]
train <- train[!is.na(train$y), ]
head(train)
## y weekday time_hms lag_1 lag_2 lag_3 lag_4 lag_5 lag_6
## 1 0.02632 Wednesday 614 NA NA NA NA NA NA
## 2 0.05263 Wednesday 914 0.02632 NA NA NA NA NA
## 3 0.05263 Wednesday 1214 0.05263 0.02632 NA NA NA NA
## 4 0.05263 Wednesday 1514 0.05263 0.05263 0.02632 NA NA NA
## 5 0.05263 Wednesday 1814 0.05263 0.05263 0.05263 0.02632 NA NA
## 6 0.07895 Wednesday 2114 0.05263 0.05263 0.05263 0.05263 0.02632 NA
## lag_7 lag_8 lag_9 lag_10 lag_11 lag_12
## 1 NA NA NA NA NA NA
45/58
45 of 58 6/12/14, 5:37 PM

Now we can use gbm to do prediction.
Wait, what is gbm?
46/58
46 of 58 6/12/14, 5:37 PM

gbm refers to a certain supervised learning algorithm. It has a lot of names.
In the original publication, "gbm" is short for "Gradient Boosting Machine".
In the R package, it is short for "Generalized Boosting Model".
And its wiki page names it as "Gradient boosting".
·
·
·
47/58
47 of 58 6/12/14, 5:37 PM

gbm is derived from a relatively simple principle.
Briefly speaking, it is "hundreds of heads are better than one".
This algorithm generate many regression trees and combine their results for the final model.
48/58
48 of 58 6/12/14, 5:37 PM

With the following code, we can calculate the model:
Here n.trees is the number of "heads"(trees) for this problem.
model <- gbm(formula=y~.,
data=train[c('y','weekday','time_hms', paste('lag_',1:12,sep=''))],
distribution='gaussian', n.trees=2000,
interaction.depth=5, shrinkage=0.01,
cv.folds=0, keep.data=F)
49/58
49 of 58 6/12/14, 5:37 PM

In prediction, using too many trees may cause overfitting problem.
Therefore we need to use cross-validation to choose the number of trees to avoid it.
gbm provide us a convenient tool, here OOB means "Out Of Bag":
best_ntree <- gbm.perf(model, method = "OOB")
50/58
50 of 58 6/12/14, 5:37 PM

Then we can make the prediction:
best_ntree
## [1] 539
predict(model, as.data.frame(test[1,,drop=F]),
n.trees=best_ntree, type='response')
## [1] 0.1287
51/58
51 of 58 6/12/14, 5:37 PM

Performance testing
How to compare these two models? We set up a test.
Every day we will get 288 data points. And now we want to predict next 6 points with data from
the previous week, i.e. 2016 data points.
We randomly choose 50 time points and make prediction for the next 30 minutes.
Then compare their performance with RMSE:
rmse = function(pred, real) sqrt(mean((pred - real)^2))
52/58
52 of 58 6/12/14, 5:37 PM

Performance testing
Here is the result:
We can see that gbm is slightly better than the time series prediction.
stl_precision
## [1] 0.03496 0.04656 0.05912 0.07045 0.07626 0.08698
gbm_precision
## [1] 0.02011 0.03447 0.04900 0.06536 0.07186 0.08258
53/58
53 of 58 6/12/14, 5:37 PM

Performance testing
However, our performance is not ideal.
We can use a straight-forward prediction: treat the data stay stable in 30 minutes. How's the
result?
Why is this happenning?
y_precision
## [1] 0.01903 0.03021 0.02599 0.02401 0.02541 0.03311
54/58
54 of 58 6/12/14, 5:37 PM

Performance testing
This picture have some hints.
plot(diff(data.ts), type = "l")
55/58
55 of 58 6/12/14, 5:37 PM

Performance testing
We can see that this data is tend to stay the same in the next 5 minutes, or even longer.
There are so many 5-minutes that nobody come to this station. Therefore the most straight-
forward prediction out-performed those two advanced methods.
sum(diff(data.ts) == 0)
## [1] 6622
56/58
56 of 58 6/12/14, 5:37 PM

More to do
There are many things to do in the future:
The sky is the limit!
Apply other algorithms to this problem, like neural networks.
Use information from nearby station: empty nearby stations will lead people come to this
one.
Combine with weather record: nobody ride in rainy day!
Path finding: design the whole trip for people.
·
·
·
·
57/58
57 of 58 6/12/14, 5:37 PM

Our Packages
We are developing an R package for citibike, including
There was an app written in Ruby-On-Rails here, offering our prediction service. Our heroku
went to sleep since the service didn't get much traffic, but one of our meetup member spent
sometime to make it live today and emailed me the link! here 2
Data scraping
Database interaction and retrieve
Time Series prediction
GBM prediction
·
·
·
·
58/58
58 of 58 6/12/14, 5:37 PM

Nyc open data project ii -- predict where to get and return my citibike

Recomendados

Recomendados

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Semelhante a Nyc open data project ii -- predict where to get and return my citibike

Semelhante a Nyc open data project ii -- predict where to get and return my citibike (20)

Mais de Vivian S. Zhang

Mais de Vivian S. Zhang (20)

Último

Último (20)

Nyc open data project ii -- predict where to get and return my citibike