Citibike data and prediction
Which station should I choose?
Data Scientist Team at SupStat Inc (Vivian Zhang, Yibo Chen, Kai Xiao, Tong He)
Check out our blog and newsletters at http://www.supstat.com and http://nycdatascience.com
Overview
1. Overview
2. Citibike Data
3. Scraping
4. Data Description
5. Modeling
Citibike
Citibike operates a public bike-sharing service.
There are many bike stations in NYC.
People want to take a bike from a station with at least one available bike.
And when they get to their destination, they want to return the bike to a station with at least one available slot.
Our goal is to predict where to rent and where to return.
Citibike data
Where are the data sets?
Citibike is great about opening its data.
It publishes historical datasets about individual trips.
But that's not what we are looking for here.
Citibike data
Where can we find data on each station's bikes and slots?
We can visit http://citibikenyc.com/stations/json to see the current data.
With historical data, we want to make predictions and guide people toward better choices.
Historical data
We want to scrape data from this endpoint every 5 minutes.
How do we do that in R?
Data scraping
We use the following code:
The executionTime field below is the time at which we fetched the data.
require(rjson)
jsonURL = "http://citibikenyc.com/stations/json"
json_data = fromJSON(file = jsonURL)
names(json_data)
## [1] "executionTime" "stationBeanList"
json_data$executionTime
## [1] "2014-04-24 11:11:03 AM"
Data scraping
Our data is in the form of a list. We want to convert it into a data.frame.
What can we get from this data?
names(json_data$stationBeanList[[1]])
## [1] "id" "stationName"
## [3] "availableDocks" "totalDocks"
## [5] "latitude" "longitude"
## [7] "statusValue" "statusKey"
## [9] "availableBikes" "stAddress1"
## [11] "stAddress2" "city"
## [13] "postalCode" "location"
## [15] "altitude" "testStation"
## [17] "lastCommunicationTime" "landMark"
Data scraping
We just need id, availableDocks, availableBikes, and executionTime.
executionTime = json_data$executionTime
ids = sapply(json_data$stationBeanList, function(x) x$id)
free = sapply(json_data$stationBeanList, function(x) x$availableDocks)
bikes = sapply(json_data$stationBeanList, function(x) x$availableBikes)
data = data.frame(time = executionTime, station_id = ids, free = free, bikes = bikes)
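Each snapshot could then be persisted. As a minimal sketch (the path "citibike_log.csv" is hypothetical; the following slides use cron plus PostgreSQL instead), we could append every snapshot to a local CSV:
# append each 5-minute snapshot to a local CSV file; write the header
# only when the file does not exist yet
outfile <- "citibike_log.csv"
write.table(data, outfile, sep = ",", row.names = FALSE,
    col.names = !file.exists(outfile), append = file.exists(outfile))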
Data scraping
And we can get something like this:
head(data)
## time station_id free bikes
## 1 2014-04-24 11:11:03 AM 72 19 18
## 2 2014-04-24 11:11:03 AM 79 13 15
## 3 2014-04-24 11:11:03 AM 82 10 17
## 4 2014-04-24 11:11:03 AM 83 44 17
## 5 2014-04-24 11:11:03 AM 116 8 30
## 6 2014-04-24 11:11:03 AM 119 16 2
CRON
We use cron to schedule our tasks, including our web scraper.
Logging for cron is off by default. We first open the rsyslog configuration (first command below) and delete the '#' before '#cron.*'.
Then we restart rsyslog (second command).
Now we have successfully enabled logging for cron.
Use the third command to check the cron log:
sudo vi /etc/rsyslog.d/50-default.conf
sudo service rsyslog restart
sudo vi /var/log/cron.log
CRON
Then we can restart the cron service with the first command below.
If the second command returns a PID, our cron service is running.
Or you can use the third command as an alternative:
sudo service cron restart
pgrep cron
ps aux | grep 'cron'
CRON
The simplest way to add tasks is to create a .sh script.
For example, we create a shell script named "citibike.sh".
Using absolute paths is preferred.
#!/bin/bash
# citibike.sh: scrape the station feed, then write the results to the database
/usr/R/R-3.0/bin/Rscript /home/vivianzhang/citibike/citibike.R
/usr/R/R-3.0/bin/Rscript /home/vivianzhang/citibike/writeDB.R
CRON
The final step is to add our script to the list of cron tasks: open /etc/crontab (first command below) and add the second line at the end.
Then restart cron for the change to take effect.
Here, the first field "*/5" means every 5 minutes.
The next four fields correspond to hour, day of month, month, and weekday.
After those come the user to run as ("root") and the command to run.
sudo vi /etc/crontab
*/5 * * * * root /home/vivianzhang/citibike/citibike.sh
CRON
Other examples of cron tasks.
The first line below runs at minute 0 every 2 hours from 23:00 to 7:00, plus 8:00 (the ',' appends 8:00 to the hour list).
This task prints a sentence into test.txt at 23:00, 1:00, 3:00, 5:00, 7:00, and 8:00.
What if we want to run a job every 30 minutes? A step value handles that, as sketched below. Note that an interval such as 90 minutes cannot be written on one line; the last two schedules below combine to fire every 90 minutes:
0 23-7/2,8 * * * echo "Have a good dream:)" >> /tmp/test.txt
0 0,3,6,9,12,15,18,21 ...
30 1,4,7,10,13,16,19,22 ...
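A minimal sketch of the 30-minute case (command elided as in the examples above):
*/30 * * * * ...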
CRONTAB
On a Mac, we use crontab.
1. Create a file, or open an existing file, to hold your task description, such as 'crontest'.
2. Edit your tasks as described previously.
3. Start crontab and list the running tasks.
4. Check whether it runs correctly.
5. You can remove all the cron tasks after you are done.
CRONTAB
# make a new crontab file
sudo touch /etc/crontest
# change the content into this
sudo vi /etc/crontest
# content of the file
# solution to cron every minute
*/1 * * * * echo "test cron" >> /tmp/test.txt
# install the file into your cron task list
crontab /etc/crontest
# check crontab list
crontab -l
# check whether the log is written to your temp file
vi /tmp/test.txt
CRONTAB
# you should see a few lines in the file
# remove the cron job
crontab -r
# double check to see if the job is removed
crontab -l
PostgreSQL
We chose PostgreSQL as our database; it is open source and R-friendly.
We can easily connect to it with a command like this:
require(RPostgreSQL)
conn = dbConnect(dbDriver("PostgreSQL"), user = "vivianzhang", password = "123456",
dbname = "station_all", host = "127.0.0.1", port = "5432")
PostgreSQL
Our server has only 1GB of memory, so we can't fetch too many records at once; 10,000 records per fetch is okay.
The following code extracts the first 100 records of the result set.
Then we can fetch the 101st through 10,000th records:
res <- dbSendQuery(conn, statement = "SELECT * FROM citibike limit 10000")
data1 <- fetch(res, n = 100)
data2 <- fetch(res, n = -1)
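When we are done, we free the result set and close the connection:
dbClearResult(res)
dbDisconnect(conn)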
PostgreSQL
The table may be larger than memory.
An alternative is to work with PostgreSQL directly and copy the table to a local file.
First we need a valid database user. To use the default user in PostgreSQL, run the first two commands below.
Then in the interactive interface, switch to the database and use the SQL command to export the table.
sudo su - postgres
psql
\c station_all
copy (SELECT * FROM citibike) TO '/tmp/data.csv' WITH CSV HEADER;
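Back in R, the exported file can then be loaded with, for example:
dat <- read.csv("/tmp/data.csv")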
Data preprocessing
It is easy to handle date-type data with the following code:
dat$station_time = as.POSIXct(dat$station_time, format = "%Y-%m-%d %H:%M:%S")
Our data is clean, and the useful information includes:
· time
· available bikes
· available slots
Data preprocessing
We extract the data for a single station and name it "data_all". This is what we are going to use.
Let us explore the first 10,000 records.
load("data_all.rda")
head(data_all)
## station_time bikes free
## 1 2013-08-21 14:10:00 1 37
## 2 2013-08-21 14:15:00 2 36
## 3 2013-08-21 14:20:00 2 36
## 4 2013-08-21 14:25:00 2 36
## 5 2013-08-21 14:30:00 2 36
## 6 2013-08-21 14:35:00 3 35
data = data_all[1:10000, ]
Time Series Model
We would like to predict the ratio of available bikes at this station.
data$total <- data$bikes + data$free
data$ratio <- data$bikes/data$total
head(data)
## station_time bikes free total ratio
## 1 2013-08-21 14:10:00 1 37 38 0.02632
## 2 2013-08-21 14:15:00 2 36 38 0.05263
## 3 2013-08-21 14:20:00 2 36 38 0.05263
## 4 2013-08-21 14:25:00 2 36 38 0.05263
## 5 2013-08-21 14:30:00 2 36 38 0.05263
## 6 2013-08-21 14:35:00 3 35 38 0.07895
Time Series Model
The time interval between our data points is 5 minutes, i.e. 288 points per day. Let's check whether there are any trends:
five_day_ind = 1:(288 * 5)  # five days of 5-minute observations
plot(data$ratio[five_day_ind], type = "l")
Time Series Model
Then we turn it into a time series object with frequency = 288, i.e. one day per period.
Let's check our data.
There is one NA value in our sequence:
data.ts <- ts(data$ratio, start = 1, frequency = 288)
sum(is.na(data.ts))
## [1] 1
Time Series Model
Use the following code to fill it with the previous value:
na.position <- which(is.na(data.ts))
data.ts[na.position] <- data.ts[na.position - 1]
any(is.na(data.ts))
## [1] FALSE
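This one-step fix assumes no two NAs are adjacent and the series does not start with an NA. A more robust sketch, assuming the zoo package is available, carries the last observation forward through any run of NAs:
# last-observation-carried-forward fill for arbitrary runs of NAs
library(zoo)
data.ts <- na.locf(data.ts, na.rm = FALSE)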
Time Series Model
The "seasonal" trend is obvious. We need to make use of this information.
It is a smooth function, extract seasonal pattern and enable us to focus on the higher-level
trends.
fit <- stl(data.ts, "periodic")
colnames(fit$time.series)
## [1] "seasonal" "trend" "remainder"
Time Series Model
The fitted result looks like:
head(fit$time.series)
## seasonal trend remainder
## [1,] -0.2251 0.2772 -0.025791
## [2,] -0.2133 0.2784 -0.012396
## [3,] -0.2126 0.2795 -0.014250
## [4,] -0.2156 0.2806 -0.012383
## [5,] -0.2067 0.2817 -0.022373
## [6,] -0.2089 0.2828 0.005042
Time Series Model
The black line is the original data, showing what fraction of bikes is available at each time point. The red line is the extracted seasonal effect.
plot(data$ratio[five_day_ind], type = "l", ylim = c(-0.5, 1), xlim = c(0, 1500))
lines(fit$time.series[five_day_ind, 1], col = 2)
leg.txt = c("origin", "seasonal")
legend(1200, 1, leg.txt, cex = 1, lty = 1, col = 1:2)
Time Series Model
The green line is the trend:
plot(data$ratio[five_day_ind], type = "l", ylim = c(-0.5, 1), xlim = c(0, 1500))
lines(fit$time.series[five_day_ind, 1], col = 2)
lines(fit$time.series[five_day_ind, 2], col = 3)
leg.txt = c("origin", "seasonal", "trends")
legend(1200, 1, leg.txt, cex = 1, lty = 1, col = 1:3)
Time Series Model
We get an approximation of our data by adding the trend and seasonal effects. The blue line shows the combined trend and seasonal effect; the remaining difference is the remainder.
plot(data$ratio[five_day_ind], type = "l", ylim = c(-0.5, 1), xlim = c(0, 1500))
lines(fit$time.series[five_day_ind, 1] + fit$time.series[five_day_ind, 2], col = 4)
leg.txt = c("origin", "approx")
legend(1200, 1, leg.txt, cex = 1, lty = 1, col = c(1, 4))
Time Series Model
Generally, a single Citibike trip is around 30 minutes, and a regular user pays additional charges for a journey over 30 minutes.
We therefore focus on predicting the next 30 minutes. Given that updates arrive every 5 minutes, that means forecasting 6 data points.
Time Series Model
With the R package 'forecast', we can do time series prediction easily.
library(forecast)
# h is number of periods for forecasting
pred = as.numeric(forecast(fit, h = 6)$mean)
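Since we modeled the ratio, a small sketch of turning the forecasts back into bike counts, using the dock total from the most recent observation:
# convert predicted ratios into predicted bike counts
total <- tail(data$total, 1)
round(pred * total)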
Machine Learning Model
Machine learning can also be applied to time series data.
Here we use GBM for demonstration.
Before we apply gbm to our data, we need to extract some more time-related features.
In particular, we need to use previous values as predictors.
Feature extraction
traindata = data[1:2000, ]
traindata = traindata[c("station_time", "ratio")]
names(traindata) <- c("time", "y")
head(traindata)
## time y
## 1 2013-08-21 14:10:00 0.02632
## 2 2013-08-21 14:15:00 0.05263
## 3 2013-08-21 14:20:00 0.05263
## 4 2013-08-21 14:25:00 0.05263
## 5 2013-08-21 14:30:00 0.05263
## 6 2013-08-21 14:35:00 0.07895
Feature extraction
The time points at which to make predictions:
h = 6
new_time <- seq(from=traindata$time[nrow(traindata)],
by='5 min', length.out=h+1)[-1]
new_time
## [1] "2013-08-28 12:50:00 EST" "2013-08-28 12:55:00 EST"
## [3] "2013-08-28 13:00:00 EST" "2013-08-28 13:05:00 EST"
## [5] "2013-08-28 13:10:00 EST" "2013-08-28 13:15:00 EST"
Feature extraction
Let's combine our train and test data before building further features.
test_id <- seq(nrow(traindata) + 1, by = 1, length.out = h)
traindata <- rbind(traindata, data.frame(time = new_time, y = NA))
test_id
## [1] 2001 2002 2003 2004 2005 2006
Feature extraction
Of course, this service may be more popular on weekends than on weekdays, so we need a variable to mark the day of the week.
traindata$weekday <- as.factor(weekdays(traindata$time))
head(traindata$weekday)
## [1] Wednesday Wednesday Wednesday Wednesday Wednesday Wednesday
## Levels: Friday Monday Saturday Sunday Thursday Tuesday Wednesday
Feature extraction
The time of day is also useful:
hh <- as.numeric(strftime(traindata$time, format = "%H", tz = "EST"))
mm <- as.numeric(strftime(traindata$time, format = "%M", tz = "EST"))
ss <- as.numeric(strftime(traindata$time, format = "%S", tz = "EST"))
# note: as written this computes hh + 60*mm + 3600*ss, which matches the
# output below; seconds since midnight would be 3600*hh + 60*mm + ss
traindata$time_hms <- hh + 60 * mm + 3600 * ss
head(traindata)
## time y weekday time_hms
## 1 2013-08-21 14:10:00 0.02632 Wednesday 614
## 2 2013-08-21 14:15:00 0.05263 Wednesday 914
## 3 2013-08-21 14:20:00 0.05263 Wednesday 1214
## 4 2013-08-21 14:25:00 0.05263 Wednesday 1514
## 5 2013-08-21 14:30:00 0.05263 Wednesday 1814
## 6 2013-08-21 14:35:00 0.07895 Wednesday 2114
Feature extraction
How do we bring in previous values? We compute lagged time series.
A lagged time series is a "delayed" copy of the series, as shown below:
f_lag <- function(x, lag=0)
c(rep(NA, lag), x[1:(length(x)-lag)])
f_lag(1:10, 1)
## [1] NA 1 2 3 4 5 6 7 8 9
f_lag(1:10, 4)
## [1] NA NA NA NA 1 2 3 4 5 6
Feature extraction
To use the information from, say, 12:30 when predicting at 12:40, we add lagged series as features:
for (lag in 1:12) {
traindata[[paste("lag_", lag, sep = "")]] <- f_lag(traindata$y, lag)
}
traindata[1:3, ]
## time y weekday time_hms lag_1 lag_2 lag_3
## 1 2013-08-21 14:10:00 0.02632 Wednesday 614 NA NA NA
## 2 2013-08-21 14:15:00 0.05263 Wednesday 914 0.02632 NA NA
## 3 2013-08-21 14:20:00 0.05263 Wednesday 1214 0.05263 0.02632 NA
## lag_4 lag_5 lag_6 lag_7 lag_8 lag_9 lag_10 lag_11 lag_12
## 1 NA NA NA NA NA NA NA NA NA
## 2 NA NA NA NA NA NA NA NA NA
## 3 NA NA NA NA NA NA NA NA NA
Feature extraction
Don't worry about those NAs! They are inevitable in a lagged series.
traindata[1:10, 5:7]
## lag_1 lag_2 lag_3
## 1 NA NA NA
## 2 0.02632 NA NA
## 3 0.05263 0.02632 NA
## 4 0.05263 0.05263 0.02632
## 5 0.05263 0.05263 0.05263
## 6 0.05263 0.05263 0.05263
## 7 0.07895 0.05263 0.05263
## 8 0.05263 0.07895 0.05263
## 9 0.05263 0.05263 0.07895
## 10 0.05263 0.05263 0.05263
Feature extraction
Finally, we have our data:
test <- traindata[test_id, -1]
train <- traindata[-test_id, -1]
train <- train[!is.na(train$y), ]
head(train)
## y weekday time_hms lag_1 lag_2 lag_3 lag_4 lag_5 lag_6
## 1 0.02632 Wednesday 614 NA NA NA NA NA NA
## 2 0.05263 Wednesday 914 0.02632 NA NA NA NA NA
## 3 0.05263 Wednesday 1214 0.05263 0.02632 NA NA NA NA
## 4 0.05263 Wednesday 1514 0.05263 0.05263 0.02632 NA NA NA
## 5 0.05263 Wednesday 1814 0.05263 0.05263 0.05263 0.02632 NA NA
## 6 0.07895 Wednesday 2114 0.05263 0.05263 0.05263 0.05263 0.02632 NA
## lag_7 lag_8 lag_9 lag_10 lag_11 lag_12
## 1 NA NA NA NA NA NA
## 2 NA NA NA NA NA NA
## 3 NA NA NA NA NA NA
## 4 NA NA NA NA NA NA
## 5 NA NA NA NA NA NA
## 6 NA NA NA NA NA NA
Machine Learning Model
Now we can use gbm to do prediction.
Wait, what is gbm?
Machine Learning Model
gbm refers to a certain supervised learning algorithm. It goes by several names:
· In the original publication, "gbm" is short for "Gradient Boosting Machine".
· In the R package, it stands for "Generalized Boosted Regression Models".
· Its Wikipedia page calls it "Gradient boosting".
Machine Learning Model
gbm is derived from a relatively simple principle.
Briefly speaking, it is "hundreds of heads are better than one".
The algorithm generates many regression trees and combines their results into the final model.
Machine Learning Model
With the following code, we can fit the model.
Here n.trees is the number of "heads" (trees) for this problem.
library(gbm)
model <- gbm(formula=y~.,
data=train[c('y','weekday','time_hms', paste('lag_',1:12,sep=''))],
distribution='gaussian', n.trees=2000,
interaction.depth=5, shrinkage=0.01,
cv.folds=0, keep.data=F)
Machine Learning Model
In prediction, using too many trees may cause overfitting.
Therefore we need a principled way to choose the number of trees.
gbm provides a convenient tool; here OOB means "Out Of Bag":
best_ntree <- gbm.perf(model, method = "OOB")
Machine Learning Model
Then we can make the prediction:
best_ntree
## [1] 539
predict(model, as.data.frame(test[1,,drop=F]),
n.trees=best_ntree, type='response')
## [1] 0.1287
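Note that for horizons beyond one step, the lag features of the later test rows are still NA, since they depend on values we have not observed yet. A minimal sketch of iterative multi-step prediction, feeding each forecast back in as a lag feature for the later steps (lag features named lag_1..lag_12 as built above):
# iterate over the 6 test rows, filling lag features with predictions
preds <- numeric(h)
for (i in seq_along(test_id)) {
  preds[i] <- predict(model, traindata[test_id[i], ],
                      n.trees = best_ntree, type = "response")
  # this prediction stands in for y at row test_id[i] in later lags
  for (lag in 1:12) {
    j <- test_id[i] + lag
    if (j <= max(test_id))
      traindata[j, paste("lag_", lag, sep = "")] <- preds[i]
  }
}
preds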
Performance testing
How do we compare these two models? We set up a test.
Every day we get 288 data points, and we want to predict the next 6 points using the previous week of data, i.e. 2,016 points.
We randomly choose 50 time points and make predictions for the next 30 minutes at each.
Then we compare their performance with RMSE:
rmse = function(pred, real) sqrt(mean((pred - real)^2))
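As an illustrative check, we can compare the six stl forecasts from earlier (pred, fitted on the first 10,000 points) with the next observed ratios, assuming data_all extends past row 10,000:
actual <- with(data_all[10001:10006, ], bikes/(bikes + free))
rmse(pred, actual)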
Performance testing
Here is the result:
We can see that gbm is slightly better than the time series prediction.
stl_precision
## [1] 0.03496 0.04656 0.05912 0.07045 0.07626 0.08698
gbm_precision
## [1] 0.02011 0.03447 0.04900 0.06536 0.07186 0.08258
Performance testing
However, the performance is still not ideal.
Consider the most straightforward prediction: assume the data stays unchanged over the next 30 minutes. How does that do?
Why is this happening?
y_precision
## [1] 0.01903 0.03021 0.02599 0.02401 0.02541 0.03311
Performance testing
This plot gives some hints:
plot(diff(data.ts), type = "l")
Performance testing
We can see that the data tends to stay the same over the next 5 minutes, or even longer.
In many 5-minute intervals nobody comes to this station at all, which is why the most straightforward prediction outperformed the two more advanced methods.
sum(diff(data.ts) == 0)
## [1] 6622
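That is roughly two thirds of the 9,999 five-minute transitions in our 10,000-point sample:
# fraction of unchanged intervals: 6622/9999, about 0.66
mean(diff(data.ts) == 0)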
More to do
There are many things to do in the future:
· Apply other algorithms to this problem, such as neural networks.
· Use information from nearby stations: empty nearby stations will lead people to come to this one.
· Combine with weather records: nobody rides on a rainy day!
· Path finding: design the whole trip for people.
The sky is the limit!
Our Packages
We are developing an R package for citibike, including:
· Data scraping
· Database interaction and retrieval
· Time series prediction
· GBM prediction
There was an app written in Ruby on Rails offering our prediction service. Our Heroku instance went to sleep since the service didn't get much traffic, but one of our meetup members spent some time bringing it back to life today and emailed me the link!
Mais conteúdo relacionado

Mais procurados

The power of streams in node js
The power of streams in node jsThe power of streams in node js
The power of streams in node jsJawahar
 
Virtual Memory (Making a Process)
Virtual Memory (Making a Process)Virtual Memory (Making a Process)
Virtual Memory (Making a Process)David Evans
 
Building a DSL with GraalVM (CodeOne)
Building a DSL with GraalVM (CodeOne)Building a DSL with GraalVM (CodeOne)
Building a DSL with GraalVM (CodeOne)Maarten Mulders
 
SSL Failing, Sharing, and Scheduling
SSL Failing, Sharing, and SchedulingSSL Failing, Sharing, and Scheduling
SSL Failing, Sharing, and SchedulingDavid Evans
 
tranSMART Community Meeting 5-7 Nov 13 - Session 3: transmart-data
tranSMART Community Meeting 5-7 Nov 13 - Session 3: transmart-datatranSMART Community Meeting 5-7 Nov 13 - Session 3: transmart-data
tranSMART Community Meeting 5-7 Nov 13 - Session 3: transmart-dataDavid Peyruc
 
pg_proctab: Accessing System Stats in PostgreSQL
pg_proctab: Accessing System Stats in PostgreSQLpg_proctab: Accessing System Stats in PostgreSQL
pg_proctab: Accessing System Stats in PostgreSQLMark Wong
 
Putting a Fork in Fork (Linux Process and Memory Management)
Putting a Fork in Fork (Linux Process and Memory Management)Putting a Fork in Fork (Linux Process and Memory Management)
Putting a Fork in Fork (Linux Process and Memory Management)David Evans
 
It's 10pm: Do You Know Where Your Writes Are?
It's 10pm: Do You Know Where Your Writes Are?It's 10pm: Do You Know Where Your Writes Are?
It's 10pm: Do You Know Where Your Writes Are?MongoDB
 
Crossing into Kernel Space
Crossing into Kernel SpaceCrossing into Kernel Space
Crossing into Kernel SpaceDavid Evans
 
Making a Process
Making a ProcessMaking a Process
Making a ProcessDavid Evans
 
InfluxDB IOx Tech Talks: Query Processing in InfluxDB IOx
InfluxDB IOx Tech Talks: Query Processing in InfluxDB IOxInfluxDB IOx Tech Talks: Query Processing in InfluxDB IOx
InfluxDB IOx Tech Talks: Query Processing in InfluxDB IOxInfluxData
 
Building OpenDNS Stats
Building OpenDNS StatsBuilding OpenDNS Stats
Building OpenDNS StatsGeorge Ang
 
1404 app dev series - session 8 - monitoring & performance tuning
1404   app dev series - session 8 - monitoring & performance tuning1404   app dev series - session 8 - monitoring & performance tuning
1404 app dev series - session 8 - monitoring & performance tuningMongoDB
 
Jgrassnewage digital-watershed-model-component
Jgrassnewage digital-watershed-model-componentJgrassnewage digital-watershed-model-component
Jgrassnewage digital-watershed-model-componentCIAT
 
Smarter Scheduling
Smarter SchedulingSmarter Scheduling
Smarter SchedulingDavid Evans
 
Your first ClickHouse data warehouse
Your first ClickHouse data warehouseYour first ClickHouse data warehouse
Your first ClickHouse data warehouseAltinity Ltd
 
PostgreSQL: Data analysis and analytics
PostgreSQL: Data analysis and analyticsPostgreSQL: Data analysis and analytics
PostgreSQL: Data analysis and analyticsHans-Jürgen Schönig
 
Extra performance out of thin air
Extra performance out of thin airExtra performance out of thin air
Extra performance out of thin airKonstantine Krutiy
 
Raw system logs processing with hive
Raw system logs processing with hiveRaw system logs processing with hive
Raw system logs processing with hiveArpit Patil
 

Mais procurados (20)

Scheduling
SchedulingScheduling
Scheduling
 
The power of streams in node js
The power of streams in node jsThe power of streams in node js
The power of streams in node js
 
Virtual Memory (Making a Process)
Virtual Memory (Making a Process)Virtual Memory (Making a Process)
Virtual Memory (Making a Process)
 
Building a DSL with GraalVM (CodeOne)
Building a DSL with GraalVM (CodeOne)Building a DSL with GraalVM (CodeOne)
Building a DSL with GraalVM (CodeOne)
 
SSL Failing, Sharing, and Scheduling
SSL Failing, Sharing, and SchedulingSSL Failing, Sharing, and Scheduling
SSL Failing, Sharing, and Scheduling
 
tranSMART Community Meeting 5-7 Nov 13 - Session 3: transmart-data
tranSMART Community Meeting 5-7 Nov 13 - Session 3: transmart-datatranSMART Community Meeting 5-7 Nov 13 - Session 3: transmart-data
tranSMART Community Meeting 5-7 Nov 13 - Session 3: transmart-data
 
pg_proctab: Accessing System Stats in PostgreSQL
pg_proctab: Accessing System Stats in PostgreSQLpg_proctab: Accessing System Stats in PostgreSQL
pg_proctab: Accessing System Stats in PostgreSQL
 
Putting a Fork in Fork (Linux Process and Memory Management)
Putting a Fork in Fork (Linux Process and Memory Management)Putting a Fork in Fork (Linux Process and Memory Management)
Putting a Fork in Fork (Linux Process and Memory Management)
 
It's 10pm: Do You Know Where Your Writes Are?
It's 10pm: Do You Know Where Your Writes Are?It's 10pm: Do You Know Where Your Writes Are?
It's 10pm: Do You Know Where Your Writes Are?
 
Crossing into Kernel Space
Crossing into Kernel SpaceCrossing into Kernel Space
Crossing into Kernel Space
 
Making a Process
Making a ProcessMaking a Process
Making a Process
 
InfluxDB IOx Tech Talks: Query Processing in InfluxDB IOx
InfluxDB IOx Tech Talks: Query Processing in InfluxDB IOxInfluxDB IOx Tech Talks: Query Processing in InfluxDB IOx
InfluxDB IOx Tech Talks: Query Processing in InfluxDB IOx
 
Building OpenDNS Stats
Building OpenDNS StatsBuilding OpenDNS Stats
Building OpenDNS Stats
 
1404 app dev series - session 8 - monitoring & performance tuning
1404   app dev series - session 8 - monitoring & performance tuning1404   app dev series - session 8 - monitoring & performance tuning
1404 app dev series - session 8 - monitoring & performance tuning
 
Jgrassnewage digital-watershed-model-component
Jgrassnewage digital-watershed-model-componentJgrassnewage digital-watershed-model-component
Jgrassnewage digital-watershed-model-component
 
Smarter Scheduling
Smarter SchedulingSmarter Scheduling
Smarter Scheduling
 
Your first ClickHouse data warehouse
Your first ClickHouse data warehouseYour first ClickHouse data warehouse
Your first ClickHouse data warehouse
 
PostgreSQL: Data analysis and analytics
PostgreSQL: Data analysis and analyticsPostgreSQL: Data analysis and analytics
PostgreSQL: Data analysis and analytics
 
Extra performance out of thin air
Extra performance out of thin airExtra performance out of thin air
Extra performance out of thin air
 
Raw system logs processing with hive
Raw system logs processing with hiveRaw system logs processing with hive
Raw system logs processing with hive
 

Semelhante a Nyc open data project ii -- predict where to get and return my citibike

How to measure everything - a million metrics per second with minimal develop...
How to measure everything - a million metrics per second with minimal develop...How to measure everything - a million metrics per second with minimal develop...
How to measure everything - a million metrics per second with minimal develop...Jos Boumans
 
Deduplicating and analysing time-series data with Apache Beam and QuestDB
Deduplicating and analysing time-series data with Apache Beam and QuestDBDeduplicating and analysing time-series data with Apache Beam and QuestDB
Deduplicating and analysing time-series data with Apache Beam and QuestDBjavier ramirez
 
Presto anatomy
Presto anatomyPresto anatomy
Presto anatomyDongmin Yu
 
Why you should be using structured logs
Why you should be using structured logsWhy you should be using structured logs
Why you should be using structured logsStefan Krawczyk
 
Troubleshooting real production problems
Troubleshooting real production problemsTroubleshooting real production problems
Troubleshooting real production problemsTier1 app
 
112 portfpres.pdf
112 portfpres.pdf112 portfpres.pdf
112 portfpres.pdfsash236
 
maxbox starter72 multilanguage coding
maxbox starter72 multilanguage codingmaxbox starter72 multilanguage coding
maxbox starter72 multilanguage codingMax Kleiner
 
A miało być tak... bez wycieków
A miało być tak... bez wyciekówA miało być tak... bez wycieków
A miało być tak... bez wyciekówKonrad Kokosa
 
pg_proctab: Accessing System Stats in PostgreSQL
pg_proctab: Accessing System Stats in PostgreSQLpg_proctab: Accessing System Stats in PostgreSQL
pg_proctab: Accessing System Stats in PostgreSQLCommand Prompt., Inc
 
Beyond PHP - It's not (just) about the code
Beyond PHP - It's not (just) about the codeBeyond PHP - It's not (just) about the code
Beyond PHP - It's not (just) about the codeWim Godden
 
Advanced iOS Build Mechanics, Sebastien Pouliot
Advanced iOS Build Mechanics, Sebastien PouliotAdvanced iOS Build Mechanics, Sebastien Pouliot
Advanced iOS Build Mechanics, Sebastien PouliotXamarin
 
CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak
CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak   CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak
CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak PROIDEA
 
Beyond Breakpoints: A Tour of Dynamic Analysis
Beyond Breakpoints: A Tour of Dynamic AnalysisBeyond Breakpoints: A Tour of Dynamic Analysis
Beyond Breakpoints: A Tour of Dynamic AnalysisC4Media
 
[245] presto 내부구조 파헤치기
[245] presto 내부구조 파헤치기[245] presto 내부구조 파헤치기
[245] presto 내부구조 파헤치기NAVER D2
 
PVS-Studio and Continuous Integration: TeamCity. Analysis of the Open RollerC...
PVS-Studio and Continuous Integration: TeamCity. Analysis of the Open RollerC...PVS-Studio and Continuous Integration: TeamCity. Analysis of the Open RollerC...
PVS-Studio and Continuous Integration: TeamCity. Analysis of the Open RollerC...Andrey Karpov
 
Build your own_map_by_yourself
Build your own_map_by_yourselfBuild your own_map_by_yourself
Build your own_map_by_yourselfMarc Huang
 
Altitude San Francisco 2018: Logging at the Edge
Altitude San Francisco 2018: Logging at the Edge Altitude San Francisco 2018: Logging at the Edge
Altitude San Francisco 2018: Logging at the Edge Fastly
 

Semelhante a Nyc open data project ii -- predict where to get and return my citibike (20)

How to measure everything - a million metrics per second with minimal develop...
How to measure everything - a million metrics per second with minimal develop...How to measure everything - a million metrics per second with minimal develop...
How to measure everything - a million metrics per second with minimal develop...
 
Deduplicating and analysing time-series data with Apache Beam and QuestDB
Deduplicating and analysing time-series data with Apache Beam and QuestDBDeduplicating and analysing time-series data with Apache Beam and QuestDB
Deduplicating and analysing time-series data with Apache Beam and QuestDB
 
Presto anatomy
Presto anatomyPresto anatomy
Presto anatomy
 
Why you should be using structured logs
Why you should be using structured logsWhy you should be using structured logs
Why you should be using structured logs
 
Troubleshooting real production problems
Troubleshooting real production problemsTroubleshooting real production problems
Troubleshooting real production problems
 
112 portfpres.pdf
112 portfpres.pdf112 portfpres.pdf
112 portfpres.pdf
 
maxbox starter72 multilanguage coding
maxbox starter72 multilanguage codingmaxbox starter72 multilanguage coding
maxbox starter72 multilanguage coding
 
A miało być tak... bez wycieków
A miało być tak... bez wyciekówA miało być tak... bez wycieków
A miało być tak... bez wycieków
 
Osol Pgsql
Osol PgsqlOsol Pgsql
Osol Pgsql
 
Benchmarking_ML_Tools
Benchmarking_ML_ToolsBenchmarking_ML_Tools
Benchmarking_ML_Tools
 
pg_proctab: Accessing System Stats in PostgreSQL
pg_proctab: Accessing System Stats in PostgreSQLpg_proctab: Accessing System Stats in PostgreSQL
pg_proctab: Accessing System Stats in PostgreSQL
 
C++ Coroutines
C++ CoroutinesC++ Coroutines
C++ Coroutines
 
Beyond PHP - It's not (just) about the code
Beyond PHP - It's not (just) about the codeBeyond PHP - It's not (just) about the code
Beyond PHP - It's not (just) about the code
 
Advanced iOS Build Mechanics, Sebastien Pouliot
Advanced iOS Build Mechanics, Sebastien PouliotAdvanced iOS Build Mechanics, Sebastien Pouliot
Advanced iOS Build Mechanics, Sebastien Pouliot
 
CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak
CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak   CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak
CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak
 
Beyond Breakpoints: A Tour of Dynamic Analysis
Beyond Breakpoints: A Tour of Dynamic AnalysisBeyond Breakpoints: A Tour of Dynamic Analysis
Beyond Breakpoints: A Tour of Dynamic Analysis
 
[245] presto 내부구조 파헤치기
[245] presto 내부구조 파헤치기[245] presto 내부구조 파헤치기
[245] presto 내부구조 파헤치기
 
PVS-Studio and Continuous Integration: TeamCity. Analysis of the Open RollerC...
PVS-Studio and Continuous Integration: TeamCity. Analysis of the Open RollerC...PVS-Studio and Continuous Integration: TeamCity. Analysis of the Open RollerC...
PVS-Studio and Continuous Integration: TeamCity. Analysis of the Open RollerC...
 
Build your own_map_by_yourself
Build your own_map_by_yourselfBuild your own_map_by_yourself
Build your own_map_by_yourself
 
Altitude San Francisco 2018: Logging at the Edge
Altitude San Francisco 2018: Logging at the Edge Altitude San Francisco 2018: Logging at the Edge
Altitude San Francisco 2018: Logging at the Edge
 

Mais de Vivian S. Zhang

Career services workshop- Roger Ren
Career services workshop- Roger RenCareer services workshop- Roger Ren
Career services workshop- Roger RenVivian S. Zhang
 
Nycdsa wordpress guide book
Nycdsa wordpress guide bookNycdsa wordpress guide book
Nycdsa wordpress guide bookVivian S. Zhang
 
We're so skewed_presentation
We're so skewed_presentationWe're so skewed_presentation
We're so skewed_presentationVivian S. Zhang
 
Wikipedia: Tuned Predictions on Big Data
Wikipedia: Tuned Predictions on Big DataWikipedia: Tuned Predictions on Big Data
Wikipedia: Tuned Predictions on Big DataVivian S. Zhang
 
A Hybrid Recommender with Yelp Challenge Data
A Hybrid Recommender with Yelp Challenge Data A Hybrid Recommender with Yelp Challenge Data
A Hybrid Recommender with Yelp Challenge Data Vivian S. Zhang
 
Kaggle Top1% Solution: Predicting Housing Prices in Moscow
Kaggle Top1% Solution: Predicting Housing Prices in Moscow Kaggle Top1% Solution: Predicting Housing Prices in Moscow
Kaggle Top1% Solution: Predicting Housing Prices in Moscow Vivian S. Zhang
 
Data mining with caret package
Data mining with caret packageData mining with caret package
Data mining with caret packageVivian S. Zhang
 
Streaming Python on Hadoop
Streaming Python on HadoopStreaming Python on Hadoop
Streaming Python on HadoopVivian S. Zhang
 
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its authorKaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its authorVivian S. Zhang
 
Nyc open-data-2015-andvanced-sklearn-expanded
Nyc open-data-2015-andvanced-sklearn-expandedNyc open-data-2015-andvanced-sklearn-expanded
Nyc open-data-2015-andvanced-sklearn-expandedVivian S. Zhang
 
Nycdsa ml conference slides march 2015
Nycdsa ml conference slides march 2015 Nycdsa ml conference slides march 2015
Nycdsa ml conference slides march 2015 Vivian S. Zhang
 
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public dataTHE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public dataVivian S. Zhang
 
Max Kuhn's talk on R machine learning
Max Kuhn's talk on R machine learningMax Kuhn's talk on R machine learning
Max Kuhn's talk on R machine learningVivian S. Zhang
 
Winning data science competitions, presented by Owen Zhang
Winning data science competitions, presented by Owen ZhangWinning data science competitions, presented by Owen Zhang
Winning data science competitions, presented by Owen ZhangVivian S. Zhang
 
Using Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York TimesUsing Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York TimesVivian S. Zhang
 
Introducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with rIntroducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with rVivian S. Zhang
 

Mais de Vivian S. Zhang (20)

Why NYC DSA.pdf
Why NYC DSA.pdfWhy NYC DSA.pdf
Why NYC DSA.pdf
 
Career services workshop- Roger Ren
Career services workshop- Roger RenCareer services workshop- Roger Ren
Career services workshop- Roger Ren
 
Nycdsa wordpress guide book
Nycdsa wordpress guide bookNycdsa wordpress guide book
Nycdsa wordpress guide book
 
We're so skewed_presentation
We're so skewed_presentationWe're so skewed_presentation
We're so skewed_presentation
 
Wikipedia: Tuned Predictions on Big Data
Wikipedia: Tuned Predictions on Big DataWikipedia: Tuned Predictions on Big Data
Wikipedia: Tuned Predictions on Big Data
 
A Hybrid Recommender with Yelp Challenge Data
A Hybrid Recommender with Yelp Challenge Data A Hybrid Recommender with Yelp Challenge Data
A Hybrid Recommender with Yelp Challenge Data
 
Kaggle Top1% Solution: Predicting Housing Prices in Moscow
Kaggle Top1% Solution: Predicting Housing Prices in Moscow Kaggle Top1% Solution: Predicting Housing Prices in Moscow
Kaggle Top1% Solution: Predicting Housing Prices in Moscow
 
Data mining with caret package
Data mining with caret packageData mining with caret package
Data mining with caret package
 
Xgboost
XgboostXgboost
Xgboost
 
Streaming Python on Hadoop
Streaming Python on HadoopStreaming Python on Hadoop
Streaming Python on Hadoop
 
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its authorKaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
 
Xgboost
XgboostXgboost
Xgboost
 
Nyc open-data-2015-andvanced-sklearn-expanded
Nyc open-data-2015-andvanced-sklearn-expandedNyc open-data-2015-andvanced-sklearn-expanded
Nyc open-data-2015-andvanced-sklearn-expanded
 
Nycdsa ml conference slides march 2015
Nycdsa ml conference slides march 2015 Nycdsa ml conference slides march 2015
Nycdsa ml conference slides march 2015
 
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public dataTHE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
 
Max Kuhn's talk on R machine learning
Max Kuhn's talk on R machine learningMax Kuhn's talk on R machine learning
Max Kuhn's talk on R machine learning
 
Winning data science competitions, presented by Owen Zhang
Winning data science competitions, presented by Owen ZhangWinning data science competitions, presented by Owen Zhang
Winning data science competitions, presented by Owen Zhang
 
Using Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York TimesUsing Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York Times
 
Introducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with rIntroducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with r
 
Bayesian models in r
Bayesian models in rBayesian models in r
Bayesian models in r
 

Último

Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AIabhishek36461
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...VICTOR MAESTRE RAMIREZ
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...srsj9000
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfAsst.prof M.Gokilavani
 
Introduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHIntroduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHC Sai Kiran
 
Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...121011101441
 
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)Dr SOUNDIRARAJ N
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxKartikeyaDwivedi3
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfAsst.prof M.Gokilavani
 
Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxPoojaBan
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024Mark Billinghurst
 
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfCCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfAsst.prof M.Gokilavani
 
An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...Chandu841456
 
An introduction to Semiconductor and its types.pptx
An introduction to Semiconductor and its types.pptxAn introduction to Semiconductor and its types.pptx
An introduction to Semiconductor and its types.pptxPurva Nikam
 
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncssuser2ae721
 
computer application and construction management
computer application and construction managementcomputer application and construction management
computer application and construction managementMariconPadriquez1
 

Último (20)

POWER SYSTEMS-1 Complete notes examples
POWER SYSTEMS-1 Complete notes  examplesPOWER SYSTEMS-1 Complete notes  examples
POWER SYSTEMS-1 Complete notes examples
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AI
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
 
Introduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHIntroduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECH
 
Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...
 
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptx
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
 
Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptx
 
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024
 
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfCCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
 
An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...
 
An introduction to Semiconductor and its types.pptx
An introduction to Semiconductor and its types.pptxAn introduction to Semiconductor and its types.pptx
An introduction to Semiconductor and its types.pptx
 
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
 
computer application and construction management
computer application and construction managementcomputer application and construction management
computer application and construction management
 

Nyc open data project ii -- predict where to get and return my citibike

• 11. Data scraping
And we can get something like this:
head(data)
##                      time station_id free bikes
## 1 2014-04-24 11:11:03 AM          72   19    18
## 2 2014-04-24 11:11:03 AM          79   13    15
## 3 2014-04-24 11:11:03 AM          82   10    17
## 4 2014-04-24 11:11:03 AM          83   44    17
## 5 2014-04-24 11:11:03 AM         116    8    30
## 6 2014-04-24 11:11:03 AM         119   16     2
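Putting slides 8-11 together, the scraper that cron runs every 5 minutes could be as small as the sketch below. Appending to a local CSV is our own illustration (the file path is hypothetical); the deck's actual scripts store snapshots into PostgreSQL, as shown from slide 20 on.
require(rjson)
# pull the current snapshot of every station
json_data <- fromJSON(file = "http://citibikenyc.com/stations/json")
snapshot <- data.frame(
  time       = json_data$executionTime,
  station_id = sapply(json_data$stationBeanList, function(x) x$id),
  free       = sapply(json_data$stationBeanList, function(x) x$availableDocks),
  bikes      = sapply(json_data$stationBeanList, function(x) x$availableBikes)
)
out <- "/home/vivianzhang/citibike/snapshots.csv"  # hypothetical output file
# write the header only on the first run, then append
write.table(snapshot, out, sep = ",", row.names = FALSE,
            col.names = !file.exists(out), append = file.exists(out))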
• 12. CRON
We use cron to schedule our tasks, including our web scraper.
The log service for cron is off by default. We can first open the rsyslog configuration and delete the '#' before '#cron.*':
sudo vi /etc/rsyslog.d/50-default.conf
Then we restart rsyslog:
sudo service rsyslog restart
Now we have enabled logging for cron. Use this to check the cron log:
sudo vi /var/log/cron.log
• 13. CRON
Then we can restart the cron service:
sudo service cron restart
If the following command returns a PID, our cron service is on:
pgrep cron
Or you can use this alternative command:
ps aux | grep 'cron'
• 14. CRON
The simplest way to add tasks is to create a .sh script. For example, we create a shell script named "citibike.sh". It is preferred to use absolute paths:
/usr/R/R-3.0/bin/Rscript /home/vivianzhang/citibike/citibike.R
/usr/R/R-3.0/bin/Rscript /home/vivianzhang/citibike/writeDB.R
• 15. CRON
The final step is to add our script to the list of cron tasks:
sudo vi /etc/crontab
We add the following line to the end of the crontab, then restart cron for the change to take effect:
*/5 * * * * root /home/vivianzhang/citibike/citibike.sh
Here the first field "*/5" means "every 5 minutes". The next four fields correspond to hour, day of month, month, and weekday; then comes the command to run (in /etc/crontab, the user to run as comes right before the command).
• 16. CRON
Other examples of cron schedules. In the hour field, "23-7/2" means "23:00 to 7:00, every 2 hours", and the comma means "or", so "23-7/2,8" reads "23:00-7:00 every 2 hours, or 8:00":
0 23-7/2,8 * * * echo "Have a good dream:)" >> /tmp/test.txt
This task will print a sentence into test.txt at 23:00, 1:00, 3:00, 5:00, 7:00 and 8:00.
What if we want to run a job every 90 minutes? A single entry cannot express that, so we need two staggered entries, which together fire at 0:00, 1:30, 3:00, 4:30, and so on:
0 0,3,6,9,12,15,18,21 ...
30 1,4,7,10,13,16,19,22 ...
• 17. CRONTAB
On a Mac, we use crontab.
1. Create a file, or open an existing file, to hold your task description, such as 'crontest'.
2. Edit your tasks as stated previously.
3. Start crontab, and list the running tasks.
4. Check whether it runs correctly.
5. You can remove all the cron tasks after you are done.
• 18. CRONTAB
# make a new crontab file
sudo touch /etc/crontest
# change the content into this
sudo vi /etc/crontest
# content of the file: run the job every minute
*/1 * * * * echo "test cron" >> /tmp/test.txt
# install the file into your cron task list
crontab /etc/crontest
# check the crontab list
crontab -l
# check whether the log is written to your temp file
vi /tmp/test.txt
• 19. CRONTAB
# you should see a few lines in the file
# remove the cron job
crontab -r
# double check to see if the job is removed
crontab -l
• 20. PostgreSQL
We choose PostgreSQL as the database, which is open source and R-friendly. We can easily connect to it with a command like this:
require(RPostgreSQL)
conn = dbConnect(dbDriver("PostgreSQL"), user = "vivianzhang", password = "123456",
                 dbname = "station_all", host = "127.0.0.1", port = "5432")
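For completeness, a minimal sketch of what a writeDB.R-style step might look like, assuming the snapshot data frame built on slide 10 and a table named "citibike" (the name the later queries use):
require(RPostgreSQL)
conn <- dbConnect(dbDriver("PostgreSQL"), user = "vivianzhang", password = "123456",
                  dbname = "station_all", host = "127.0.0.1", port = "5432")
# append the freshly scraped snapshot to the citibike table
# (dbWriteTable creates the table on the first run if it does not exist)
dbWriteTable(conn, "citibike", data, append = TRUE, row.names = FALSE)
dbDisconnect(conn)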
• 21. PostgreSQL
Our server has only 1GB of memory, so we can't fetch too many records at once; 10,000 records per fetch is okay. The following code lets us extract the first 100 records of the result set:
res <- dbSendQuery(conn, statement = "SELECT * FROM citibike limit 10000")
data1 <- fetch(res, n = 100)
And then we can fetch the 101st through 10,000th records:
data2 <- fetch(res, n = -1)
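If even a LIMITed result set were too big, we could process the table chunk by chunk instead. A minimal sketch using DBI's dbHasCompleted (our addition, with the per-chunk work left as a placeholder):
res <- dbSendQuery(conn, "SELECT * FROM citibike")
total <- 0
while (!dbHasCompleted(res)) {
  chunk <- fetch(res, n = 10000)   # pull at most 10,000 rows at a time
  total <- total + nrow(chunk)     # replace with real per-chunk processing
}
dbClearResult(res)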
• 22. PostgreSQL
The size of the table may be larger than the memory. An alternative method is to work with PostgreSQL directly: we can copy the table to a local file. First we need a valid database user; to use the default user in PostgreSQL, one can
sudo su - postgres
psql
Then in the interactive interface, connect to the database and export the table:
\c station_all
\copy (SELECT * FROM citibike) TO '/tmp/data.csv' WITH CSV HEADER
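To get that export back into R for the preprocessing on the next slide (which refers to it as dat), a plain read is enough; stringsAsFactors is our own precaution:
dat <- read.csv("/tmp/data.csv", stringsAsFactors = FALSE)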
• 23. Data preprocessing
It is easy to handle date-type data with the following code:
dat$station_time = as.POSIXct(dat$station_time, format = "%Y-%m-%d %H:%M:%S")
Our data is clean, and the useful information includes:
· time
· available bikes
· available spots
• 24. Data preprocessing
We extract data from a single station and name it "data_all". This is what we are going to use:
load("data_all.rda")
head(data_all)
##          station_time bikes free
## 1 2013-08-21 14:10:00     1   37
## 2 2013-08-21 14:15:00     2   36
## 3 2013-08-21 14:20:00     2   36
## 4 2013-08-21 14:25:00     2   36
## 5 2013-08-21 14:30:00     2   36
## 6 2013-08-21 14:35:00     3   35
Let us explore the first 10,000 records:
data = data_all[1:10000, ]
• 25. Time Series Model
We would like to predict the ratio of bikes in this station.
data$total <- data$bikes + data$free
data$ratio <- data$bikes/data$total
head(data)
##          station_time bikes free total   ratio
## 1 2013-08-21 14:10:00     1   37    38 0.02632
## 2 2013-08-21 14:15:00     2   36    38 0.05263
## 3 2013-08-21 14:20:00     2   36    38 0.05263
## 4 2013-08-21 14:25:00     2   36    38 0.05263
## 5 2013-08-21 14:30:00     2   36    38 0.05263
## 6 2013-08-21 14:35:00     3   35    38 0.07895
• 26. Time Series Model
The time interval between our data points is 5 minutes. Let's check if there is any trend:
five_day_ind = 1:(288 * 5)
plot(data$ratio[five_day_ind], type = "l")
• 27. Time Series Model
Then we turn it into a time series object with frequency = 288, i.e. one day of 5-minute points. Let's check our data:
data.ts <- ts(data$ratio, start = 1, frequency = 288)
sum(is.na(data.ts))
## [1] 1
There is one NA value in our sequence.
• 28. Time Series Model
Use the following code to fill it with the previous value:
na.position <- which(is.na(data.ts))
data.ts[na.position] <- data.ts[na.position - 1]
any(is.na(data.ts))
## [1] FALSE
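This one-step fill works here because the missing value is isolated. If the scraper ever went down for a stretch, several consecutive values would be NA, and a last-observation-carried-forward fill would be safer. A minimal sketch with the zoo package (our addition):
library(zoo)
# carry the last observed ratio forward across runs of NAs
data.ts <- na.locf(data.ts, na.rm = FALSE)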
• 29. Time Series Model
The "seasonal" (daily) pattern is obvious, and we need to make use of this information. stl() smooths the series, extracts the seasonal pattern, and lets us focus on the higher-level trend:
fit <- stl(data.ts, "periodic")
colnames(fit$time.series)
## [1] "seasonal"  "trend"     "remainder"
• 30. Time Series Model
The fitted result looks like:
head(fit$time.series)
##      seasonal  trend remainder
## [1,]  -0.2251 0.2772 -0.025791
## [2,]  -0.2133 0.2784 -0.012396
## [3,]  -0.2126 0.2795 -0.014250
## [4,]  -0.2156 0.2806 -0.012383
## [5,]  -0.2067 0.2817 -0.022373
## [6,]  -0.2089 0.2828  0.005042
• 31. Time Series Model
The black line is the original data, showing what percentage of bikes is available at each time point. The red line is the extracted seasonal effect.
plot(data$ratio[five_day_ind], type = "l", ylim = c(-0.5, 1), xlim = c(0, 1500))
lines(fit$time.series[five_day_ind, 1], col = 2)
leg.txt = c("origin", "seasonal")
legend(1200, 1, leg.txt, cex = 1, lty = 1, col = 1:2)
• 32. Time Series Model
The green line is the trend:
plot(data$ratio[five_day_ind], type = "l", ylim = c(-0.5, 1), xlim = c(0, 1500))
lines(fit$time.series[five_day_ind, 1], col = 2)
lines(fit$time.series[five_day_ind, 2], col = 3)
leg.txt = c("origin", "seasonal", "trends")
legend(1200, 1, leg.txt, cex = 1, lty = 1, col = 1:3)
• 33. Time Series Model
We get an approximation of our data by adding the trend and seasonal effects. The blue line shows the combined trend-plus-seasonal effect; the remaining difference is the remainder.
plot(data$ratio[five_day_ind], type = "l", ylim = c(-0.5, 1), xlim = c(0, 1500))
lines(fit$time.series[five_day_ind, 1] + fit$time.series[five_day_ind, 2], col = 4)
leg.txt = c("origin", "approx")
legend(1200, 1, leg.txt, cex = 1, lty = 1, col = c(1, 4))
• 34. Time Series Model
Generally, a single Citibike trip is around 30 minutes, and a normal user pays additional charges for a journey over 30 minutes. So we focus on predicting the next 30 minutes; given that the data updates every 5 minutes, that means forecasting 6 data points.
• 35. Time Series Model
With the R package 'forecast', we can do time series prediction easily:
library(forecast)
# h is the number of periods to forecast
pred = as.numeric(forecast(fit, h = 6)$mean)
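As a quick sanity check (our addition, assuming data_all extends past row 10,000), the true values for those 6 steps are rows 10001-10006 of data_all, so we can score this single forecast directly:
# actual ratios for the 6 time points right after the training window
real <- with(data_all[10001:10006, ], bikes / (bikes + free))
sqrt(mean((pred - real)^2))  # RMSE of this one 30-minute forecast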
• 36. Machine Learning Model
Machine learning can also be applied to time series data. Here we are going to use GBM for demonstration. Before we apply gbm to our data, we need to extract some more time-related features. In particular, we need to use previous values as predictors.
• 37. Feature extraction
traindata = data[1:2000, ]
traindata = traindata[c("station_time", "ratio")]
names(traindata) <- c("time", "y")
head(traindata)
##                  time       y
## 1 2013-08-21 14:10:00 0.02632
## 2 2013-08-21 14:15:00 0.05263
## 3 2013-08-21 14:20:00 0.05263
## 4 2013-08-21 14:25:00 0.05263
## 5 2013-08-21 14:30:00 0.05263
## 6 2013-08-21 14:35:00 0.07895
• 38. Feature extraction
Time points to make predictions for:
h = 6
new_time <- seq(from = traindata$time[nrow(traindata)], by = '5 min', length.out = h + 1)[-1]
new_time
## [1] "2013-08-28 12:50:00 EST" "2013-08-28 12:55:00 EST"
## [3] "2013-08-28 13:00:00 EST" "2013-08-28 13:05:00 EST"
## [5] "2013-08-28 13:10:00 EST" "2013-08-28 13:15:00 EST"
• 39. Feature extraction
Let's combine our train and test data, so the features below get built for both:
test_id <- seq(nrow(traindata) + 1, by = 1, length.out = h)
traindata <- rbind(traindata, data.frame(time = new_time, y = NA))
test_id
## [1] 2001 2002 2003 2004 2005 2006
• 40. Feature extraction
Of course, this service may be more popular on weekends than on weekdays, so we need a variable to mark that:
traindata$weekday <- as.factor(weekdays(traindata$time))
head(traindata$weekday)
## [1] Wednesday Wednesday Wednesday Wednesday Wednesday Wednesday
## Levels: Friday Monday Saturday Sunday Thursday Tuesday Wednesday
• 41. Feature extraction
The time stamp is useful; we encode the time of day as seconds since midnight:
hh <- as.numeric(strftime(traindata$time, format = "%H", tz = "EST"))
mm <- as.numeric(strftime(traindata$time, format = "%M", tz = "EST"))
ss <- as.numeric(strftime(traindata$time, format = "%S", tz = "EST"))
traindata$time_hms <- 3600 * hh + 60 * mm + ss
head(traindata)
##                  time       y   weekday time_hms
## 1 2013-08-21 14:10:00 0.02632 Wednesday    51000
## 2 2013-08-21 14:15:00 0.05263 Wednesday    51300
## 3 2013-08-21 14:20:00 0.05263 Wednesday    51600
## 4 2013-08-21 14:25:00 0.05263 Wednesday    51900
## 5 2013-08-21 14:30:00 0.05263 Wednesday    52200
## 6 2013-08-21 14:35:00 0.07895 Wednesday    52500
• 42. Feature extraction
How do we bring in previous information? We compute lagged time series. A lagged time series is a "delayed" copy of the series, as shown below:
f_lag <- function(x, lag = 0) c(rep(NA, lag), x[1:(length(x) - lag)])
f_lag(1:10, 1)
## [1] NA 1 2 3 4 5 6 7 8 9
f_lag(1:10, 4)
## [1] NA NA NA NA 1 2 3 4 5 6
• 43. Feature extraction
To use, say, the 12:30 value when predicting at 12:40, we add lagged columns:
for (lag in 1:12) {
    traindata[[paste("lag_", lag, sep = "")]] <- f_lag(traindata$y, lag)
}
traindata[1:3, ]
##                  time       y   weekday time_hms   lag_1   lag_2 lag_3
## 1 2013-08-21 14:10:00 0.02632 Wednesday    51000      NA      NA    NA
## 2 2013-08-21 14:15:00 0.05263 Wednesday    51300 0.02632      NA    NA
## 3 2013-08-21 14:20:00 0.05263 Wednesday    51600 0.05263 0.02632    NA
##   lag_4 lag_5 lag_6 lag_7 lag_8 lag_9 lag_10 lag_11 lag_12
## 1    NA    NA    NA    NA    NA    NA     NA     NA     NA
## 2    NA    NA    NA    NA    NA    NA     NA     NA     NA
## 3    NA    NA    NA    NA    NA    NA     NA     NA     NA
• 44. Feature extraction
Don't worry about those NAs! They are inevitable at the start of a lagged series.
traindata[1:10, 5:7]
##      lag_1   lag_2   lag_3
## 1       NA      NA      NA
## 2  0.02632      NA      NA
## 3  0.05263 0.02632      NA
## 4  0.05263 0.05263 0.02632
## 5  0.05263 0.05263 0.05263
## 6  0.05263 0.05263 0.05263
## 7  0.07895 0.05263 0.05263
## 8  0.05263 0.07895 0.05263
## 9  0.05263 0.05263 0.07895
## 10 0.05263 0.05263 0.05263
• 45. Feature extraction
Finally, we have our data:
test <- traindata[test_id, -1]
train <- traindata[-test_id, -1]
train <- train[!is.na(train$y), ]
head(train)
##         y   weekday time_hms   lag_1   lag_2   lag_3   lag_4   lag_5 lag_6
## 1 0.02632 Wednesday    51000      NA      NA      NA      NA      NA    NA
## 2 0.05263 Wednesday    51300 0.02632      NA      NA      NA      NA    NA
## 3 0.05263 Wednesday    51600 0.05263 0.02632      NA      NA      NA    NA
## 4 0.05263 Wednesday    51900 0.05263 0.05263 0.02632      NA      NA    NA
## 5 0.05263 Wednesday    52200 0.05263 0.05263 0.05263 0.02632      NA    NA
## 6 0.07895 Wednesday    52500 0.05263 0.05263 0.05263 0.05263 0.02632    NA
##   lag_7 lag_8 lag_9 lag_10 lag_11 lag_12
## 1    NA    NA    NA     NA     NA     NA
## 2    NA    NA    NA     NA     NA     NA
## 3    NA    NA    NA     NA     NA     NA
## 4    NA    NA    NA     NA     NA     NA
## 5    NA    NA    NA     NA     NA     NA
## 6    NA    NA    NA     NA     NA     NA
• 46. Machine Learning Model
Now we can use gbm to make the prediction. Wait, what is gbm?
• 47. Machine Learning Model
gbm refers to a certain supervised learning algorithm. It has a lot of names:
· In the original publication, "gbm" is short for "Gradient Boosting Machine".
· In the R package, it is short for "Generalized Boosting Model".
· Its wiki page names it "Gradient boosting".
• 48. Machine Learning Model
gbm is derived from a relatively simple principle; briefly speaking, "hundreds of heads are better than one". The algorithm generates many regression trees and combines their results into the final model.
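To make the principle concrete, here is a toy, hand-rolled version of boosting on simulated data (our illustration, not what the gbm package does internally): each small tree is fit to the residuals of the ensemble so far, and its prediction is added with a shrinkage factor.
library(rpart)
set.seed(1)
x <- runif(200)
toy <- data.frame(x = x, y = sin(2 * pi * x) + rnorm(200, sd = 0.2))
pred <- rep(mean(toy$y), nrow(toy))  # start from the mean
shrinkage <- 0.1
for (i in 1:100) {
  toy$resid <- toy$y - pred                      # residuals of the ensemble so far
  tree <- rpart(resid ~ x, data = toy,
                control = rpart.control(maxdepth = 2))
  pred <- pred + shrinkage * predict(tree, toy)  # take a small step towards them
}
sqrt(mean((toy$y - pred)^2))  # training RMSE shrinks as trees are added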
• 49. Machine Learning Model
With the following code, we can fit the model. Here n.trees is the number of "heads" (trees) for this problem:
library(gbm)
model <- gbm(formula = y ~ .,
             data = train[c('y', 'weekday', 'time_hms', paste('lag_', 1:12, sep = ''))],
             distribution = 'gaussian', n.trees = 2000, interaction.depth = 5,
             shrinkage = 0.01, cv.folds = 0, keep.data = F)
• 50. Machine Learning Model
Using too many trees in prediction may cause overfitting, so we need to validate the number of trees to avoid it. gbm provides a convenient tool for this; here "OOB" means the out-of-bag estimate:
best_ntree <- gbm.perf(model, method = "OOB")
• 51. Machine Learning Model
Then we can make the prediction:
best_ntree
## [1] 539
predict(model, as.data.frame(test[1, , drop = F]), n.trees = best_ntree, type = 'response')
## [1] 0.1287
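One caveat: only the first test row has all 12 lags observed; from the second step onward, the most recent lags refer to values we have not seen yet (they are NA in test). A common fix, sketched below as our own addition, is to predict one step at a time and feed each prediction back into the lag features:
preds <- numeric(h)
for (i in seq_len(h)) {
  row <- test[i, , drop = FALSE]
  if (i > 1) {
    # at step i, lags 1..(i-1) are the values we just predicted
    for (j in 1:(i - 1)) row[[paste('lag_', j, sep = '')]] <- preds[i - j]
  }
  preds[i] <- predict(model, row, n.trees = best_ntree, type = 'response')
}
preds  # the full 30-minute (6-step) forecast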
• 52. Performance testing
How do we compare these two models? We set up a test. Every day gives 288 data points, and we want to predict the next 6 points using data from the previous week, i.e. 2016 data points. We randomly choose 50 time points, make a prediction for the next 30 minutes at each, and then compare the models' performance with RMSE:
rmse = function(pred, real) sqrt(mean((pred - real)^2))
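The slides do not show the testing loop itself, so here is a minimal sketch of how it could be organized for the stl model (our reconstruction, assuming the NA-filled series from slide 28): each forecast origin is trained on the preceding week and scored separately at each of the 6 horizons.
set.seed(42)
ratio_clean <- as.numeric(data.ts)        # the NA-filled series from slide 28
week <- 288 * 7                           # one week of 5-minute points
starts <- sample(week:(length(ratio_clean) - h), 50)  # 50 random forecast origins
err <- matrix(NA, nrow = 50, ncol = h)    # squared error per origin and horizon
for (k in seq_along(starts)) {
  s <- starts[k]
  fit_k  <- stl(ts(ratio_clean[(s - week + 1):s], frequency = 288), "periodic")
  pred_k <- as.numeric(forecast(fit_k, h = h)$mean)
  err[k, ] <- (pred_k - ratio_clean[(s + 1):(s + h)])^2
}
stl_precision <- sqrt(colMeans(err))      # RMSE per 5-minute horizon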
• 53. Performance testing
Here is the result. We can see that gbm is slightly better than the time series prediction:
stl_precision
## [1] 0.03496 0.04656 0.05912 0.07045 0.07626 0.08698
gbm_precision
## [1] 0.02011 0.03447 0.04900 0.06536 0.07186 0.08258
• 54. Performance testing
However, our performance is not ideal. Consider the most straightforward prediction: assume the data stays unchanged for the next 30 minutes. How does that do? Why is this happening?
y_precision
## [1] 0.01903 0.03021 0.02599 0.02401 0.02541 0.03311
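Using the same forecast origins as above, this "persistence" baseline is a couple of lines (again our sketch, not from the slides): the forecast for all 6 steps is simply the last observed value.
err0 <- matrix(NA, nrow = 50, ncol = h)
for (k in seq_along(starts)) {
  s <- starts[k]
  # forecast = last observed value, repeated for all 6 horizons
  err0[k, ] <- (ratio_clean[s] - ratio_clean[(s + 1):(s + h)])^2
}
y_precision <- sqrt(colMeans(err0))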
• 55. Performance testing
This picture has some hints:
plot(diff(data.ts), type = "l")
• 56. Performance testing
We can see that the data tends to stay the same over the next 5 minutes, or even longer: in roughly two thirds of the 5-minute intervals, nobody takes or returns a bike at this station. That is why the most straightforward prediction outperformed the two more advanced methods.
sum(diff(data.ts) == 0)
## [1] 6622
• 57. More to do
There are many things to do in the future. The sky is the limit!
· Apply other algorithms to this problem, like neural networks.
· Use information from nearby stations: empty nearby stations will lead people to come to this one (a first sketch follows below).
· Combine with weather records: nobody rides on a rainy day!
· Path finding: design the whole trip for people.
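As a starting point for the nearby-stations idea, we could precompute each station's closest neighbors from the latitude and longitude already present in the JSON feed. A minimal sketch, assuming json_data is the object scraped on slide 8, using a flat-earth approximation that is fine at city scale:
lat <- sapply(json_data$stationBeanList, function(x) x$latitude)
lon <- sapply(json_data$stationBeanList, function(x) x$longitude)
# approximate pairwise distances in km (equirectangular projection)
ky <- 111.32                            # km per degree of latitude
kx <- ky * cos(mean(lat) * pi / 180)    # km per degree of longitude near NYC
d <- sqrt(outer(lat, lat, "-")^2 * ky^2 + outer(lon, lon, "-")^2 * kx^2)
diag(d) <- Inf                          # a station is not its own neighbor
# the 3 closest stations to each station, as indices into stationBeanList
neighbors <- apply(d, 1, function(row) order(row)[1:3])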
• 58. Our Packages
We are developing an R package for Citibike, including:
· Data scraping
· Database interaction and retrieval
· Time series prediction
· GBM prediction
There was an app written in Ruby on Rails here, offering our prediction service. Our Heroku app went to sleep since the service didn't get much traffic, but one of our meetup members spent some time to make it live again today and emailed me the link! here