Data Scientist Team at SupStat Inc (Vivian Zhang, Yibo Chen, Kai Xiao, Tong He)
Check out our blog and newsletters at and
Citibike data and prediction
1 of 58 6/12/14, 5:37 PM
2. Citibike Data
4. Data Description
Citibike runs a public bike-sharing service.
There are many bike stations in NYC.
People want to pick up a bike from a station with at least one available bike.
When they reach their destination, they want to return the bike to a station with at least one available slot.
Our goal is to predict where to rent and where to return.
Citibike data
Where are the data sets?
Citibike is great at opening up its datasets.
They publish historical datasets about trips.
But that's not what we are looking for now.
Citibike data
Where can we find each station's available bikes and slots?
We can visit to see the current data.
With historical data, we want to provide predictions and guide people to a better choice.
Historical data
We want to scrape data from the website every 5 minutes.
How do we do that in R?
Data scraping
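We use the following code:
library(rjson)  # provides fromJSON(file = ...)
jsonURL = ""
json_data = fromJSON(file = jsonURL)
names(json_data)
## [1] "executionTime" "stationBeanList"
executionTime is the time at which we fetched the data:
json_data$executionTime
## [1] "2014-04-24 11:11:03 AM"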
Data scraping
Our data comes as a list. We want to convert it into a data.frame.
What can we get from this data? Each station record has the following fields:
names(json_data$stationBeanList[[1]])
## [1] "id" "stationName"
## [3] "availableDocks" "totalDocks"
## [5] "latitude" "longitude"
## [7] "statusValue" "statusKey"
## [9] "availableBikes" "stAddress1"
## [11] "stAddress2" "city"
## [13] "postalCode" "location"
## [15] "altitude" "testStation"
## [17] "lastCommunicationTime" "landMark"
Data scraping
We just need id, availableDocks, availableBikes, and executionTime.
executionTime = json_data$executionTime
ids = sapply(json_data$stationBeanList, function(x) x$id)
free = sapply(json_data$stationBeanList, function(x) x$availableDocks)
bikes = sapply(json_data$stationBeanList, function(x) x$availableBikes)
data = data.frame(time = executionTime, station_id = ids, free = free, bikes = bikes)
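A minimal sketch of the same list-to-data.frame conversion on a hand-made list, so that it can be run without network access. The field names follow the feed shown above and the two station records reuse values from the sample output; get_field is a helper name introduced here for illustration.

```r
# Stand-in for the parsed JSON: same field names as the feed, two stations only.
json_data <- list(
  executionTime = "2014-04-24 11:11:03 AM",
  stationBeanList = list(
    list(id = 72, availableDocks = 19, availableBikes = 18),
    list(id = 79, availableDocks = 13, availableBikes = 15)
  )
)

# Pull one named field out of every station record.
get_field <- function(stations, field) {
  sapply(stations, function(x) x[[field]])
}

stations <- json_data$stationBeanList
snapshot <- data.frame(
  time       = json_data$executionTime,
  station_id = get_field(stations, "id"),
  free       = get_field(stations, "availableDocks"),
  bikes      = get_field(stations, "availableBikes"),
  stringsAsFactors = FALSE
)
print(snapshot)
```

Each scrape then yields one such data.frame, with one row per station and the shared execution time repeated on every row.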
Data scraping
And we can get something like this:
## time station_id free bikes
## 1 2014-04-24 11:11:03 AM 72 19 18
## 2 2014-04-24 11:11:03 AM 79 13 15
## 3 2014-04-24 11:11:03 AM 82 10 17
## 4 2014-04-24 11:11:03 AM 83 44 17
## 5 2014-04-24 11:11:03 AM 116 8 30
## 6 2014-04-24 11:11:03 AM 119 16 2
We use cron to schedule our tasks, including our web scraper.
The log service for cron is off by default. We can first open the rsyslog configuration
sudo vi /etc/rsyslog.d/50-default.conf
and delete the '#' before '#cron.*'. Then we restart rsyslog with
sudo service rsyslog restart
And now we have successfully enabled logging for cron.
Use this to check the log of cron:
sudo vi /var/log/cron.log
Then we can restart the cron service.
sudo service cron restart
If the following command returns a PID, then our cron service is running.
pgrep cron
Or you can use this alternative command:
ps aux | grep 'cron'
The simplest way to add tasks is to create a .sh script.
For example, we create a shell script named "".
It is preferable to use absolute paths.
/usr/R/R-3.0/bin/Rscript /home/vivianzhang/citibike/citibike.R
/usr/R/R-3.0/bin/Rscript /home/vivianzhang/citibike/writeDB.R
The final step is to add our script to the list of cron tasks.
sudo vi /etc/crontab
We add the following line to the end of the crontab:
*/5 * * * * root /home/vivianzhang/citibike/
Then we restart cron so that the change takes effect.
Here, the first field "*/5" means run every 5 minutes.
The next four fields correspond to hour, day of month, month and weekday.
In /etc/crontab the sixth field is the user to run as (root here), and the rest of the line is the command to run.
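Other examples of cron tasks.
Minute 0, from 23:00 to 7:00, every 2 hours; the "," adds 8:00 to the list.
0 23-7/2,8 * * * echo "Have a good dream:)" >> /tmp/test.txt
This task is meant to append a sentence to test.txt at 23:00, 1:00, 3:00, 5:00, 7:00 and 8:00. Not every cron implementation accepts a range that wraps past midnight such as 23-7, so listing the hours explicitly (23,1,3,5,7,8) is the safer form.
What if we want to run a task every 90 minutes? One line cannot express that, so we use two:
0 0,3,6,9,12,15,18,21 ...
30 1,4,7,10,13,16,19,22 ...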
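A quick way to see what the last two crontab lines do together is to list their firing times and check the spacing. The sketch below is plain R arithmetic rather than a cron parser, and the variable names are introduced here for illustration.

```r
# Minutes after midnight at which each crontab line fires.
line1 <- seq(0, 21, by = 3) * 60        # minute 0 of hours 0, 3, ..., 21
line2 <- seq(1, 22, by = 3) * 60 + 30   # minute 30 of hours 1, 4, ..., 22

fire_times <- sort(c(line1, line2))
gaps <- diff(fire_times)

# Show the schedule as HH:MM and the distinct gaps between consecutive runs.
print(sprintf("%02d:%02d", fire_times %/% 60, fire_times %% 60))
print(unique(gaps))
```

Every gap is 90 minutes, and the step from the last run at 22:30 to 00:00 the next day is 90 minutes as well.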
On a Mac, we use crontab.
1. Create a file, or open an existing file, to hold your task description, such as 'crontest'.
2. Edit your tasks as stated previously.
3. Start crontab, and list running tasks.
4. Check whether it runs correctly.
5. You can remove all the cron tasks after you are done.
# make a new crontab file
sudo touch /etc/crontest
# change the content into this
sudo vi /etc/crontest
# content of the file
# solution to cron every minute
*/1 * * * * echo "test cron" >> /tmp/test.txt
# run the job into your cron task list
crontab /etc/crontest
# check crontab list
crontab -l
# check whether the log is written to your temp file
vi /tmp/test.txt
# you should see a few lines in the file
# remove the cron job
crontab -r
# double check to see if the job is removed
crontab -l
We choose PostgreSQL as the database, which is open source and R-friendly.
We can easily connect to it with a command like this:
library(RPostgreSQL)  # DBI driver for PostgreSQL
conn = dbConnect(dbDriver("PostgreSQL"), user = "vivianzhang", password = "123456",
                 dbname = "station_all", host = "", port = "5432")
Our server has only 1 GB of memory, so we can't fetch too many records at once. 10,000 records per fetch is okay.
The following code lets us extract the first 100 records in the table:
res <- dbSendQuery(conn, statement = "SELECT * FROM citibike limit 10000")
data1 <- fetch(res, n = 100)
And we can fetch the 101st to the 10,000th record in the table:
data2 <- fetch(res, n = -1)
The table may be larger than the available memory.
An alternative is to work with PostgreSQL directly. We can copy the table to a local file.
First we need to use a valid database user.
To use the default user in PostgreSQL, one can
sudo su - postgres
psql
Then in the interactive interface, use the following commands to connect to the database and export the table.
\c station_all
copy (SELECT * FROM citibike) TO '/tmp/data.csv' WITH CSV HEADER;
Data preprocessing
It is easy to handle date-time data with the following code:
dat$station_time = as.POSIXct(dat$station_time, format = "%Y-%m-%d %H:%M:%S")
Our data is clean, and the useful information includes
available bikes
available spots.
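The format string above fits 24-hour timestamps. The scraped executionTime shown earlier carries an AM/PM suffix instead, so a sketch for that case might use %I and %p. The second timestamp and the UTC time zone below are assumptions made for illustration, and %p parsing relies on the locale providing AM/PM strings.

```r
raw <- c("2014-04-24 11:11:03 AM", "2014-04-24 01:05:00 PM")

# %I is the hour on the 12-hour clock, %p the AM/PM marker.
parsed <- as.POSIXct(raw, format = "%Y-%m-%d %I:%M:%S %p", tz = "UTC")

# Print back in 24-hour form to confirm the PM time was shifted correctly.
print(format(parsed, "%Y-%m-%d %H:%M:%S"))
```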
Data preprocessing
We extract data from a single station and name it "data_all". This is what we are going to use:
## station_time bikes free
## 1 2013-08-21 14:10:00 1 37
## 2 2013-08-21 14:15:00 2 36
## 3 2013-08-21 14:20:00 2 36
## 4 2013-08-21 14:25:00 2 36
## 5 2013-08-21 14:30:00 2 36
## 6 2013-08-21 14:35:00 3 35
Let us explore the first 10,000 records.
data = data_all[1:10000, ]
Time Series Model
We would like to predict the ratio of available bikes to total docks at this station.
data$total <- data$bikes + data$free
data$ratio <- data$bikes/data$total
## station_time bikes free total ratio
## 1 2013-08-21 14:10:00 1 37 38 0.02632
## 2 2013-08-21 14:15:00 2 36 38 0.05263
## 3 2013-08-21 14:20:00 2 36 38 0.05263
## 4 2013-08-21 14:25:00 2 36 38 0.05263
## 5 2013-08-21 14:30:00 2 36 38 0.05263
## 6 2013-08-21 14:35:00 3 35 38 0.07895
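One case the ratio cannot handle is a snapshot where both counts are zero, for example when a station reports no docks at all: 0/0 evaluates to NaN in R, which is.na() treats as missing. This may be one source of missing values in such a series, though the slides do not say so. A small sketch with made-up rows:

```r
# Made-up snapshots; the last row has no bikes and no free docks.
toy <- data.frame(bikes = c(1, 2, 0), free = c(37, 36, 0))

toy$total <- toy$bikes + toy$free
toy$ratio <- toy$bikes / toy$total   # 0 / 0 gives NaN in the last row

print(toy)
print(is.na(toy$ratio))
```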
Time Series Model
The time interval between our data points is 5 minutes, so one day has 24 * 60 / 5 = 288 points. Let's check whether there is any trend over the first five days:
five_day_ind = 1:(288 * 5)
plot(data$ratio[five_day_ind], type = "l")
Time Series Model
Then we turn it into a time series object with frequency = 288, i.e. one day of 5-minute samples:
data.ts <- ts(data$ratio, start = 1, frequency = 288)
Let's check our data:
sum(is.na(data.ts))
## [1] 1
There is an NA value in our sequence.
Time Series Model
Use the following code to fill it with the previous value.
na.position <- which(is.na(data.ts))
data.ts[na.position] <- data.ts[na.position - 1]
any(is.na(data.ts))
## [1] FALSE
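The one-line fill above works when missing values are isolated and never come first in the series. A sketch of a slightly more general version, which walks the vector once so that runs of consecutive NAs are filled too; fill_previous is a name introduced here, not part of any package.

```r
# Carry the last observed value forward over any run of NAs.
# A leading NA has no previous value and is left as NA.
fill_previous <- function(x) {
  for (i in seq_along(x)[-1]) {
    if (is.na(x[i])) x[i] <- x[i - 1]
  }
  x
}

print(fill_previous(c(0.10, NA, NA, 0.40)))
print(fill_previous(c(NA, 0.20, NA)))
```

Because the loop moves left to right, a value filled at position i is itself available to fill position i + 1.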
Time Series Model
The "seasonal" trend is obvious. We need to make use of this information.
stl is a smoothing function: it extracts the seasonal pattern and enables us to focus on the higher-level trend.
fit <- stl(data.ts, "periodic")
colnames(fit$time.series)
## [1] "seasonal" "trend" "remainder"
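To see the decomposition in isolation, here is a sketch on a synthetic series with a known daily cycle. The sine wave, drift, noise level and seed are arbitrary choices for illustration, not the Citibike data.

```r
set.seed(42)
n_days <- 5
idx <- 1:(288 * n_days)

# Synthetic ratio: daily cycle + slow drift + noise.
x <- 0.5 + 0.2 * sin(2 * pi * idx / 288) + 0.0001 * idx +
  rnorm(length(idx), sd = 0.02)
x.ts <- ts(x, start = 1, frequency = 288)

fit_toy <- stl(x.ts, "periodic")
print(colnames(fit_toy$time.series))

# The three components add back up to the original series.
recon <- rowSums(fit_toy$time.series)
print(max(abs(recon - as.numeric(x.ts))))
```

With "periodic" as the seasonal window, the seasonal component is the same for every day, so all day-to-day change ends up in the trend and the remainder.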
Time Series Model
The fitted result looks like:
## seasonal trend remainder
## [1,] -0.2251 0.2772 -0.025791
## [2,] -0.2133 0.2784 -0.012396
## [3,] -0.2126 0.2795 -0.014250
## [4,] -0.2156 0.2806 -0.012383
## [5,] -0.2067 0.2817 -0.022373
## [6,] -0.2089 0.2828 0.005042
Time Series Model
The black line is the original data, showing what fraction of the docks hold an available bike at each time point. The red line is the extracted seasonal effect.
plot(data$ratio[five_day_ind], type = "l", ylim = c(-0.5, 1), xlim = c(0, 1500))
lines(fit$time.series[five_day_ind, 1], col = 2)
leg.txt = c("origin", "seasonal")
legend(1200, 1, leg.txt, cex = 1, lty = 1, col = 1:2)
Time Series Model
The green line is the trend:
plot(data$ratio[five_day_ind], type = "l", ylim = c(-0.5, 1), xlim = c(0, 1500))
lines(fit$time.series[five_day_ind, 1], col = 2)
lines(fit$time.series[five_day_ind, 2], col = 3)
leg.txt = c("origin", "seasonal", "trends")
legend(1200, 1, leg.txt, cex = 1, lty = 1, col = 1:3)
Time Series Model
We get an approximation of our data by adding the trend and seasonal effects. The blue line shows the combined effect of trend and seasonality. What is left over is the remainder.
plot(data$ratio[five_day_ind], type = "l", ylim = c(-0.5, 1), xlim = c(0, 1500))
lines(fit$time.series[five_day_ind, 1] + fit$time.series[five_day_ind, 2], col = 4)
leg.txt = c("origin", "approx")
legend(1200, 1, leg.txt, cex = 1, lty = 1, col = c(1, 4))
Time Series Model
Generally, a single Citibike trip takes around 30 minutes, and a regular user pays additional charges for a journey over 30 minutes.
So we focus on predicting the next 30 minutes. Given that the data updates every 5 minutes, we forecast 6 data points.
Time Series Model
With the R package 'forecast', we can do time series prediction easily.
library(forecast)
# h is the number of periods to forecast
pred = as.numeric(forecast(fit, h = 6)$mean)
Machine Learning Model
Machine learning can also be applied to time series data.
Here we are going to use GBM for demonstration.
Before we apply gbm to our data, we need to extract some more time-related features.
In particular, we need previous values of the series as predictors.
Feature extraction
traindata = data[1:2000, ]
traindata = traindata[c("station_time", "ratio")]
names(traindata) <- c("time", "y")
## time y
## 1 2013-08-21 14:10:00 0.02632
## 2 2013-08-21 14:15:00 0.05263
## 3 2013-08-21 14:20:00 0.05263
## 4 2013-08-21 14:25:00 0.05263
## 5 2013-08-21 14:30:00 0.05263
## 6 2013-08-21 14:35:00 0.07895
Feature extraction
Time points to make prediction:
h = 6
new_time <- seq(from=traindata$time[nrow(traindata)],
by='5 min', length.out=h+1)[-1]
## [1] "2013-08-28 12:50:00 EST" "2013-08-28 12:55:00 EST"
## [3] "2013-08-28 13:00:00 EST" "2013-08-28 13:05:00 EST"
## [5] "2013-08-28 13:10:00 EST" "2013-08-28 13:15:00 EST"
Feature extraction
Let's combine our train and test data so that we can build further features.
test_id <- seq(nrow(traindata) + 1, by = 1, length.out = h)
traindata <- rbind(traindata, data.frame(time = new_time, y = NA))
## [1] 2001 2002 2003 2004 2005 2006
Feature extraction
Of course, this service may be more popular on weekends than on weekdays, so we need a variable to mark the day of the week.
traindata$weekday <- as.factor(weekdays(traindata$time))
## [1] Wednesday Wednesday Wednesday Wednesday Wednesday Wednesday
## Levels: Friday Monday Saturday Sunday Thursday Tuesday Wednesday
Feature extraction
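Time stamp is useful. We encode the time of day as a single number. Note that seconds since midnight would be 3600 * hh + 60 * mm + ss; the code below weights the fields the other way round, and the values printed on these slides come from the code as written: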
hh <- as.numeric(strftime(traindata$time, format = "%H", tz = "EST"))
mm <- as.numeric(strftime(traindata$time, format = "%M", tz = "EST"))
ss <- as.numeric(strftime(traindata$time, format = "%S", tz = "EST"))
traindata$time_hms <- hh + 60 * mm + 3600 * ss
## time y weekday time_hms
## 1 2013-08-21 14:10:00 0.02632 Wednesday 614
## 2 2013-08-21 14:15:00 0.05263 Wednesday 914
## 3 2013-08-21 14:20:00 0.05263 Wednesday 1214
## 4 2013-08-21 14:25:00 0.05263 Wednesday 1514
## 5 2013-08-21 14:30:00 0.05263 Wednesday 1814
## 6 2013-08-21 14:35:00 0.07895 Wednesday 2114
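For comparison, a sketch of the conventional seconds-since-midnight encoding, which weights hours by 3600 and minutes by 60. The time zone is fixed to UTC here purely so that the example prints the same result everywhere; that choice is an assumption, not something the slides specify.

```r
times <- as.POSIXct(c("2013-08-21 14:10:00", "2013-08-21 14:15:00"), tz = "UTC")

# format() on a POSIXct value uses the time zone stored with it (UTC here).
hh <- as.numeric(format(times, "%H"))
mm <- as.numeric(format(times, "%M"))
ss <- as.numeric(format(times, "%S"))

secs_since_midnight <- 3600 * hh + 60 * mm + ss
print(secs_since_midnight)
```

For 14:10:00 this gives 14 * 3600 + 10 * 60 = 51000, and the value grows by 300 for each 5-minute step within a day.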
Feature extraction
How do we bring in previous values? We need to compute a lagged time series.
A lagged time series is a "delayed" copy of the series, as shown below:
f_lag <- function(x, lag=0)
c(rep(NA, lag), x[1:(length(x)-lag)])
f_lag(1:10, 1)
## [1] NA 1 2 3 4 5 6 7 8 9
f_lag(1:10, 4)
## [1] NA NA NA NA 1 2 3 4 5 6
Feature extraction
To use the information from 12:30 at 12:40, we can use lagged time series.
for (lag in 1:12) {
  traindata[[paste("lag_", lag, sep = "")]] <- f_lag(traindata$y, lag)
}
traindata[1:3, ]
## time y weekday time_hms lag_1 lag_2 lag_3
## 1 2013-08-21 14:10:00 0.02632 Wednesday 614 NA NA NA
## 2 2013-08-21 14:15:00 0.05263 Wednesday 914 0.02632 NA NA
## 3 2013-08-21 14:20:00 0.05263 Wednesday 1214 0.05263 0.02632 NA
## lag_4 lag_5 lag_6 lag_7 lag_8 lag_9 lag_10 lag_11 lag_12
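The same loop on a short toy vector makes the resulting layout easier to see. f_lag is repeated from the earlier slide so that the sketch is self-contained; the toy values are arbitrary.

```r
# Shift x to the right by `lag` positions, padding the front with NA.
f_lag <- function(x, lag = 0) {
  c(rep(NA, lag), x[1:(length(x) - lag)])
}

toy <- data.frame(y = c(0.1, 0.2, 0.3, 0.4, 0.5))
for (lag in 1:2) {
  toy[[paste("lag_", lag, sep = "")]] <- f_lag(toy$y, lag)
}
print(toy)
```

Row i of lag_k holds the value of y from k steps earlier, which is k times 5 minutes in our data.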
Feature extraction
Don't worry about those NAs! They are inevitable in a lagged series.
traindata[1:10, 5:7]
## lag_1 lag_2 lag_3
## 1 NA NA NA
## 2 0.02632 NA NA
## 3 0.05263 0.02632 NA
## 4 0.05263 0.05263 0.02632
## 5 0.05263 0.05263 0.05263
## 6 0.05263 0.05263 0.05263
## 7 0.07895 0.05263 0.05263
## 8 0.05263 0.07895 0.05263
## 9 0.05263 0.05263 0.07895
## 10 0.05263 0.05263 0.05263
Feature extraction
Finally, we have our data
test <- traindata[test_id, -1]
train <- traindata[-test_id, -1]
train <- train[!is.na(train$y), ]
## y weekday time_hms lag_1 lag_2 lag_3 lag_4 lag_5 lag_6
## 1 0.02632 Wednesday 614 NA NA NA NA NA NA
## 2 0.05263 Wednesday 914 0.02632 NA NA NA NA NA
## 3 0.05263 Wednesday 1214 0.05263 0.02632 NA NA NA NA
## 4 0.05263 Wednesday 1514 0.05263 0.05263 0.02632 NA NA NA
## 5 0.05263 Wednesday 1814 0.05263 0.05263 0.05263 0.02632 NA NA
## 6 0.07895 Wednesday 2114 0.05263 0.05263 0.05263 0.05263 0.02632 NA
## lag_7 lag_8 lag_9 lag_10 lag_11 lag_12
Machine Learning Model
Now we can use gbm to do prediction.
Wait, what is gbm?
Machine Learning Model
gbm refers to a particular supervised learning algorithm. It goes by several names.
In the original publication, "gbm" is short for "Gradient Boosting Machine".
In the R package, it stands for "Generalized Boosted Regression Models".
And its Wikipedia page calls it "Gradient boosting".
Machine Learning Model
gbm is derived from a relatively simple principle.
Briefly speaking, "hundreds of heads are better than one".
The algorithm builds many small regression trees one after another, each fitted to the errors left by the trees before it, and combines their results into the final model.
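To make the idea concrete, here is a from-scratch sketch of boosting with squared-error loss, using one-split trees (stumps) on a single feature. It is a teaching toy in base R under our own assumptions, not the gbm package's implementation; fit_stump, predict_stump and boost are names introduced here, and the sine data is synthetic.

```r
# Fit a one-split regression tree (stump) to residuals r on feature x.
fit_stump <- function(x, r) {
  xs <- sort(unique(x))
  cuts <- (xs[-1] + xs[-length(xs)]) / 2   # midpoints between neighbours
  best <- NULL
  best_sse <- Inf
  for (thr in cuts) {
    left <- r[x <= thr]
    right <- r[x > thr]
    sse <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
    if (sse < best_sse) {
      best_sse <- sse
      best <- list(cut = thr, left = mean(left), right = mean(right))
    }
  }
  best
}

predict_stump <- function(s, x) ifelse(x <= s$cut, s$left, s$right)

# Each round fits a stump to the current residuals (the negative gradient
# of squared loss) and adds a shrunken copy of it to the running prediction.
boost <- function(x, y, n_trees = 100, shrinkage = 0.1) {
  pred <- rep(mean(y), length(y))
  for (m in seq_len(n_trees)) {
    stump <- fit_stump(x, y - pred)
    pred <- pred + shrinkage * predict_stump(stump, x)
  }
  pred
}

set.seed(1)
x <- seq(0, 2 * pi, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.1)

fitted <- boost(x, y)
mse_const <- mean((y - mean(y))^2)   # error of the constant starting model
mse_boost <- mean((y - fitted)^2)    # error after 100 boosting rounds
print(c(constant = mse_const, boosted = mse_boost))
```

The shrinkage factor plays the same role as the shrinkage argument in gbm: each tree contributes only a small correction, so many trees are needed, but the fit grows more gradually.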
Machine Learning Model
With the following code, we can fit the model:
library(gbm)
model <- gbm(formula=y~.,
             data=train[c('y','weekday','time_hms', paste('lag_',1:12,sep=''))],
             distribution='gaussian', n.trees=2000,
             interaction.depth=5, shrinkage=0.01)
Here n.trees is the number of "heads" (trees) for this problem.
Machine Learning Model
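In prediction, using too many trees may cause overfitting.
Therefore we need an estimate of out-of-sample error to choose the number of trees.
gbm provides a convenient tool for this. Here OOB means "out of bag"; gbm warns that the OOB estimate tends to pick too few trees, and cross-validation (the cv.folds argument) is the more careful alternative:
best_ntree <- gbm.perf(model, method = "OOB")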
Machine Learning Model
best_ntree
## [1] 539
Then we can make the prediction:
predict(model, newdata = test[1, ],
        n.trees=best_ntree, type='response')
## [1] 0.1287
Performance testing
How do we compare these two models? We set up a test.
Every day we get 288 data points. We want to predict the next 6 points using data from the previous week, i.e. 2016 data points.
We randomly choose 50 time points and make predictions for the next 30 minutes from each.
Then we compare the models' performance with RMSE:
rmse = function(pred, real) sqrt(mean((pred - real)^2))
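Since we forecast 6 steps ahead from each of the 50 starting points, one natural layout is a matrix with one row per starting point and one column per step, with RMSE computed column by column. This layout is our assumption about how the comparison could be organised; the numbers below are made up and use only 2 starting points and 2 steps.

```r
rmse <- function(pred, real) sqrt(mean((pred - real)^2))

# Made-up example: rows are starting points, columns are forecast steps.
pred <- matrix(0, nrow = 2, ncol = 2)
real <- matrix(c(1, 1,    # true values at step 1
                 0, 2),   # true values at step 2
               nrow = 2)

# One RMSE per forecast step.
rmse_by_step <- sapply(seq_len(ncol(pred)),
                       function(j) rmse(pred[, j], real[, j]))
print(rmse_by_step)
```

Reporting one value per step shows how quickly the error grows with the forecast horizon.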
Performance testing
Here is the result, one RMSE per forecast step (5 to 30 minutes ahead):
## [1] 0.03496 0.04656 0.05912 0.07045 0.07626 0.08698
## [1] 0.02011 0.03447 0.04900 0.06536 0.07186 0.08258
We can see that gbm, with the lower errors in the second row, is slightly better than the time series prediction.
Performance testing
However, our performance is not ideal.
We can use a straightforward prediction: assume the ratio stays unchanged over the next 30 minutes. How's the result?
## [1] 0.01903 0.03021 0.02599 0.02401 0.02541 0.03311
Why is this happening?
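The straightforward prediction simply repeats the last observed ratio for every future step. A sketch, with persistence_forecast as a name introduced here and a made-up history:

```r
# Repeat the last observed value h times.
persistence_forecast <- function(history, h = 6) {
  rep(history[length(history)], h)
}

history <- c(0.05, 0.05, 0.08, 0.08)
pred <- persistence_forecast(history, h = 6)
print(pred)
```

It needs no fitting at all, which makes it a useful baseline against which to judge any more elaborate model.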
Performance testing
This picture gives some hints.
plot(diff(data.ts), type = "l")
Performance testing
We can see that the ratio tends to stay the same over the next 5 minutes, or even longer.
There are many 5-minute intervals in which nobody comes to this station: 6622 of the 9999 differences below are zero. Therefore the most straightforward prediction outperformed the two more advanced methods.
sum(diff(data.ts) == 0)
## [1] 6622
More to do
There are many things to do in the future:
Apply other algorithms to this problem, like neural networks.
Use information from nearby stations: empty nearby stations will lead people to come to this station.
Combine with weather records: nobody rides on a rainy day!
Path finding: design the whole trip for people.
The sky is the limit!
Our Packages
We are developing an R package for citibike, including
Data scraping
Database interaction and retrieval
Time series prediction
GBM prediction
There was an app written in Ruby on Rails here, offering our prediction service. Our Heroku instance went to sleep since the service didn't get much traffic, but one of our meetup members spent some time bringing it back to life today and emailed me the link!
Streaming Python on HadoopVivian S. Zhang
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its authorKaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its authorVivian S. Zhang
Nyc open-data-2015-andvanced-sklearn-expanded
Nyc open-data-2015-andvanced-sklearn-expandedNyc open-data-2015-andvanced-sklearn-expanded
Nyc open-data-2015-andvanced-sklearn-expandedVivian S. Zhang
Nycdsa ml conference slides march 2015
Nycdsa ml conference slides march 2015 Nycdsa ml conference slides march 2015
Nycdsa ml conference slides march 2015 Vivian S. Zhang
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public dataTHE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public dataVivian S. Zhang
Max Kuhn's talk on R machine learning
Max Kuhn's talk on R machine learningMax Kuhn's talk on R machine learning
Max Kuhn's talk on R machine learningVivian S. Zhang
Winning data science competitions, presented by Owen Zhang
Winning data science competitions, presented by Owen ZhangWinning data science competitions, presented by Owen Zhang
Winning data science competitions, presented by Owen ZhangVivian S. Zhang
Using Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York TimesUsing Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York TimesVivian S. Zhang
Introducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with rIntroducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with rVivian S. Zhang

Mais de Vivian S. Zhang (20)

Why NYC DSA.pdf
Why NYC DSA.pdfWhy NYC DSA.pdf
Why NYC DSA.pdf
Career services workshop- Roger Ren
Career services workshop- Roger RenCareer services workshop- Roger Ren
Career services workshop- Roger Ren
Nycdsa wordpress guide book
Nycdsa wordpress guide bookNycdsa wordpress guide book
Nycdsa wordpress guide book
We're so skewed_presentation
We're so skewed_presentationWe're so skewed_presentation
We're so skewed_presentation
Wikipedia: Tuned Predictions on Big Data
Wikipedia: Tuned Predictions on Big DataWikipedia: Tuned Predictions on Big Data
Wikipedia: Tuned Predictions on Big Data
A Hybrid Recommender with Yelp Challenge Data
A Hybrid Recommender with Yelp Challenge Data A Hybrid Recommender with Yelp Challenge Data
A Hybrid Recommender with Yelp Challenge Data
Kaggle Top1% Solution: Predicting Housing Prices in Moscow
Kaggle Top1% Solution: Predicting Housing Prices in Moscow Kaggle Top1% Solution: Predicting Housing Prices in Moscow
Kaggle Top1% Solution: Predicting Housing Prices in Moscow
Data mining with caret package
Data mining with caret packageData mining with caret package
Data mining with caret package
Streaming Python on Hadoop
Streaming Python on HadoopStreaming Python on Hadoop
Streaming Python on Hadoop
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its authorKaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Nyc open-data-2015-andvanced-sklearn-expanded
Nyc open-data-2015-andvanced-sklearn-expandedNyc open-data-2015-andvanced-sklearn-expanded
Nyc open-data-2015-andvanced-sklearn-expanded
Nycdsa ml conference slides march 2015
Nycdsa ml conference slides march 2015 Nycdsa ml conference slides march 2015
Nycdsa ml conference slides march 2015
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public dataTHE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
THE HACK ON JERSEY CITY CONDO PRICES explore trends in public data
Max Kuhn's talk on R machine learning
Max Kuhn's talk on R machine learningMax Kuhn's talk on R machine learning
Max Kuhn's talk on R machine learning
Winning data science competitions, presented by Owen Zhang
Winning data science competitions, presented by Owen ZhangWinning data science competitions, presented by Owen Zhang
Winning data science competitions, presented by Owen Zhang
Using Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York TimesUsing Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York Times
Introducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with rIntroducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with r
Bayesian models in r
Bayesian models in rBayesian models in r
Bayesian models in r


Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AIabhishek36461
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...VICTOR MAESTRE RAMIREZ
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...srsj9000
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank M.Gokilavani
Introduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHIntroduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHC Sai Kiran
Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...121011101441
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxKartikeyaDwivedi3
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank M.Gokilavani
Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxPoojaBan
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024Mark Billinghurst
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfCCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank M.Gokilavani
An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...Chandu841456
An introduction to Semiconductor and its types.pptx
An introduction to Semiconductor and its types.pptxAn introduction to Semiconductor and its types.pptx
An introduction to Semiconductor and its types.pptxPurva Nikam
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncssuser2ae721
computer application and construction management
computer application and construction managementcomputer application and construction management
computer application and construction managementMariconPadriquez1

Último (20)

POWER SYSTEMS-1 Complete notes examples
POWER SYSTEMS-1 Complete notes  examplesPOWER SYSTEMS-1 Complete notes  examples
POWER SYSTEMS-1 Complete notes examples
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AI
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
Introduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHIntroduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECH
Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptx
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptx
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfCCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...
An introduction to Semiconductor and its types.pptx
An introduction to Semiconductor and its types.pptxAn introduction to Semiconductor and its types.pptx
An introduction to Semiconductor and its types.pptx
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
computer application and construction management
computer application and construction managementcomputer application and construction management
computer application and construction management

Nyc open data project ii -- predict where to get and return my citibike

  • 1. Citibike data and prediction: Which station should I choose?
    Data Scientist Team at SupStat Inc (Vivian Zhang, Yibo Chen, Kai Xiao, Tong He)
    Check out our blog and newsletters.
    Citibike data and prediction 1 of 58 6/12/14, 5:37 PM
  • 2. Overview
    1. Overview
    2. Citibike Data
    3. Scraping
    4. Data Description
    5. Modeling
  • 3. Citibike
    Citibike runs a public bike service with many stations across NYC. People want to take a bike from a station with at least one available bike. When they reach their destination, they want to return the bike to a station with at least one available slot. Our goal is to predict where to rent and where to return.
  • 4. Citibike
  • 5. Citibike data
    Where are the data sets? Citibike is good about opening its datasets and publishes historical data about trips. But that is not what we are looking for now.
  • 6. Citibike data
    Where can we find the number of bikes and slots at each station? We can visit the live station feed to see the current data. With historical data, we want to provide predictions and guide people to a better choice.
  • 7. Historical data
    We want to scrape data from the website every 5 minutes. How do we do that in R?
  • 8. Data scraping
    We use the following code:
        require(rjson)
        jsonURL = ""
        json_data = fromJSON(file = jsonURL)
        names(json_data)
        ## [1] "executionTime" "stationBeanList"
    This is the time at which we got the data:
        json_data$executionTime
        ## [1] "2014-04-24 11:11:03 AM"
  • 9. Data scraping
    Our data is in the form of a list. We want to change it into a data.frame. What can we get from this data?
        names(json_data$stationBeanList[[1]])
        ##  [1] "id"                    "stationName"
        ##  [3] "availableDocks"        "totalDocks"
        ##  [5] "latitude"              "longitude"
        ##  [7] "statusValue"           "statusKey"
        ##  [9] "availableBikes"        "stAddress1"
        ## [11] "stAddress2"            "city"
        ## [13] "postalCode"            "location"
        ## [15] "altitude"              "testStation"
        ## [17] "lastCommunicationTime" "landMark"
  • 10. Data scraping
    We just need id, availableDocks, availableBikes, and executionTime.
        executionTime = json_data$executionTime
        ids = sapply(json_data$stationBeanList, function(x) x$id)
        free = sapply(json_data$stationBeanList, function(x) x$availableDocks)
        bikes = sapply(json_data$stationBeanList, function(x) x$availableBikes)
        data = data.frame(time = executionTime, station_id = ids,
                          free = free, bikes = bikes)
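The extraction step on this slide can be tried without network access. The sketch below is only an illustration: the `json_data` list is a hand-made stand-in with the same field names as the feed, and `get_field` is a hypothetical helper, not part of rjson. It uses `vapply` instead of `sapply` so that a station with a missing field raises an error rather than silently producing a list column.

```r
# Hypothetical stand-in for the parsed feed; field names follow the slides.
json_data <- list(
  executionTime = "2014-04-24 11:11:03 AM",
  stationBeanList = list(
    list(id = 72, availableDocks = 19, availableBikes = 18),
    list(id = 79, availableDocks = 13, availableBikes = 15)
  )
)

# Hypothetical helper: pull one numeric field out of every station.
# vapply insists on exactly one number per station, so a missing
# field stops with an error instead of returning a ragged result.
get_field <- function(stations, field) {
  vapply(stations, function(s) as.numeric(s[[field]]), numeric(1))
}

stations <- json_data$stationBeanList
snapshot <- data.frame(
  time       = json_data$executionTime,
  station_id = get_field(stations, "id"),
  free       = get_field(stations, "availableDocks"),
  bikes      = get_field(stations, "availableBikes"),
  stringsAsFactors = FALSE
)
print(snapshot)
```

Each scrape then yields one such snapshot, with one row per station.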
  • 11. Data scraping
    And we get something like this:
        head(data)
        ##                     time station_id free bikes
        ## 1 2014-04-24 11:11:03 AM         72   19    18
        ## 2 2014-04-24 11:11:03 AM         79   13    15
        ## 3 2014-04-24 11:11:03 AM         82   10    17
        ## 4 2014-04-24 11:11:03 AM         83   44    17
        ## 5 2014-04-24 11:11:03 AM        116    8    30
        ## 6 2014-04-24 11:11:03 AM        119   16     2
  • 12. CRON
    We use cron to schedule our tasks, including our web scraper. The log service for cron is off by default. We first open the rsyslog configuration and delete the '#' before '#cron.*':
        sudo vi /etc/rsyslog.d/50-default.conf
    Then we restart rsyslog:
        sudo service rsyslog restart
    Now the log management for cron is enabled. Use this to check the cron log:
        sudo vi /var/log/cron.log
  • 13. CRON
    Then we can restart the cron service:
        sudo service cron restart
    If the following command returns a pid, then our cron service is on:
        pgrep cron
    Or you can use this alternative command:
        ps aux | grep 'cron'
  • 14. CRON
    The simplest way to add tasks is to create a .sh script. For example, we create a shell script named "". It is best to use absolute paths:
        /usr/R/R-3.0/bin/Rscript /home/vivianzhang/citibike/citibike.R
        /usr/R/R-3.0/bin/Rscript /home/vivianzhang/citibike/writeDB.R
  • 15. CRON
    The final step is to add our script to the list of cron tasks:
        sudo vi /etc/crontab
    We add the following line to the end of crontab:
        */5 * * * * root /home/vivianzhang/citibike/
    Then we restart cron so the change takes effect. Here, the first field "*/5" means every 5 minutes. The next four fields correspond to hour, day of month, month, and weekday. In /etc/crontab the sixth field is the user to run as (root here), and the rest of the line is the command to run.
  • 16. CRON
    Other examples of cron tasks:
        0 23-7/2,8 * * * echo "Have a good dream:)" >> /tmp/test.txt
    The fields mean: minute 0; hours 23:00 to 7:00 in steps of 2 hours; the "," adds 8:00 to that range. This task appends a sentence to /tmp/test.txt at 23:00, 1:00, 3:00, 5:00, 7:00, and 8:00.
    What if we want an interval that a single step expression cannot describe, such as every 90 minutes? We use two entries:
        0 0,3,6,9,12,15,18,21 ...
        30 1,4,7,10,13,16,19,22 ...
    (Every 30 minutes needs only */30 in the minute field.)
  • 17. CRONTAB
    On an Apple Mac, we use crontab.
    1. Create a file, or open an existing file, to hold your task description, such as 'crontest'.
    2. Edit your tasks as described previously.
    3. Start crontab, and list the running tasks.
    4. Check whether it runs correctly.
    5. Remove all the cron tasks after you are done.
  • 18. CRONTAB
        # make a new crontab file
        sudo touch /etc/crontest
        # edit its content
        sudo vi /etc/crontest
        # content of the file: run the task every minute
        */1 * * * * echo "test cron" >> /tmp/test.txt
        # install the file as your cron task list
        crontab /etc/crontest
        # check the crontab list
        crontab -l
        # check whether the output is written to your temp file
        vi /tmp/test.txt
  • 19. CRONTAB
        # you should see a few "test cron" lines in the file
        # remove the cron job
        crontab -r
        # double check that the job is removed
        crontab -l
  • 20. PostgreSQL
    We choose PostgreSQL as the database; it is open source and R-friendly. We can easily connect to it with a command like this:
        require(RPostgreSQL)
        conn = dbConnect(dbDriver("PostgreSQL"), user = "vivianzhang",
                         password = "123456", dbname = "station_all",
                         host = "", port = "5432")
  • 21. PostgreSQL
    Our server has only 1 GB of memory, so we can't fetch too many records at once; 10,000 records per fetch is fine. The following code lets us extract the first 100 records in the table:
        res <- dbSendQuery(conn, statement = "SELECT * FROM citibike limit 10000")
        data1 <- fetch(res, n = 100)
    And we can then fetch the 101st through the 10,000th record (n = -1 fetches all remaining rows of the result):
        data2 <- fetch(res, n = -1)
  • 22. PostgreSQL
    The table may be larger than memory. An alternative is to work with PostgreSQL directly and copy the table to a local file. First we need a valid database user. To use the default user in PostgreSQL, run:
        sudo su - postgres
        psql
    Then, in the interactive interface, connect to the database and use the following SQL command to export the table:
        \c station_all
        copy (SELECT * FROM citibike) TO '/tmp/data.csv' WITH CSV HEADER;
  • 23. Data preprocessing
    It is easy to handle date-type data with the following code:
        dat$station_time = as.POSIXct(dat$station_time, format = "%Y-%m-%d %H:%M:%S")
    Our data is clean, and the useful information includes:
    · time
    · available bikes
    · available spots
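The format string on this slide fits timestamps stored on a 24-hour clock. The raw `executionTime` shown earlier ("2014-04-24 11:11:03 AM") uses a 12-hour clock, so parsing it directly needs `%I` and `%p`. The sketch below is an illustration under assumptions: `parse_execution_time` is a hypothetical helper, the `"UTC"` time zone is a placeholder, and `%p` depends on the locale (it works in English and C locales).

```r
# Hypothetical helper: parse the feed's 12-hour timestamps.
# %I is the hour on a 12-hour clock and %p the AM/PM marker;
# %p is locale-dependent, so an English or C locale is assumed.
parse_execution_time <- function(x, tz = "UTC") {
  as.POSIXct(x, format = "%Y-%m-%d %I:%M:%S %p", tz = tz)
}

morning   <- parse_execution_time("2014-04-24 11:11:03 AM")
afternoon <- parse_execution_time("2014-04-24 01:05:00 PM")

# On a 24-hour clock the afternoon value becomes hour 13.
format(morning, "%H:%M:%S")
format(afternoon, "%H:%M:%S")
```

Using `%H` on such strings would read "01" as 1 AM and ignore the PM suffix, which shifts afternoon records by 12 hours.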
  • 24. Data preprocessing
    We extract the data from a single station and name it "data_all". This is what we are going to use:
        load("data_all.rda")
        head(data_all)
        ##          station_time bikes free
        ## 1 2013-08-21 14:10:00     1   37
        ## 2 2013-08-21 14:15:00     2   36
        ## 3 2013-08-21 14:20:00     2   36
        ## 4 2013-08-21 14:25:00     2   36
        ## 5 2013-08-21 14:30:00     2   36
        ## 6 2013-08-21 14:35:00     3   35
    Let us explore the first 10,000 records:
        data = data_all[1:10000, ]
  • 25. Time Series Model
    We would like to predict the ratio of bikes at this station.
        data$total <- data$bikes + data$free
        data$ratio <- data$bikes/data$total
        head(data)
        ##          station_time bikes free total   ratio
        ## 1 2013-08-21 14:10:00     1   37    38 0.02632
        ## 2 2013-08-21 14:15:00     2   36    38 0.05263
        ## 3 2013-08-21 14:20:00     2   36    38 0.05263
        ## 4 2013-08-21 14:25:00     2   36    38 0.05263
        ## 5 2013-08-21 14:30:00     2   36    38 0.05263
        ## 6 2013-08-21 14:35:00     3   35    38 0.07895
  • 26. Time Series Model
    The time interval between our data points is 5 minutes, so one day holds 288 points. Let's check whether there are any trends over the first five days:
        five_day_ind = 1:(288 * 5)
        plot(data$ratio[five_day_ind], type = "l")
  • 27. Time Series Model
    Then we turn it into a time series object with frequency = 288:
        data.ts <- ts(data$ratio, start = 1, frequency = 288)
    Let's check our data:
        sum(is.na(data.ts))
        ## [1] 1
    There is one NA value in our sequence.
  • 28. Time Series Model
    Use the following code to fill NA values with the previous value:
        na.position <- which(is.na(data.ts))
        data.ts[na.position] <- data.ts[na.position - 1]
        any(is.na(data.ts))
        ## [1] FALSE
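The one-line fill on this slide works for an isolated NA. If several NAs occur in a row, or the series starts with one, the previous value is itself missing. A minimal sketch of a more general "last observation carried forward" fill in base R follows; `fill_previous` is a hypothetical name, not a function from the slides or from a package.

```r
# Hypothetical helper: carry the last observed value forward.
# Leading NAs have nothing before them and are left as NA.
fill_previous <- function(x) {
  ok  <- !is.na(x)
  idx <- cumsum(ok)            # how many observed values seen so far
  filled   <- x
  has_prev <- idx > 0          # positions that have an earlier observation
  filled[has_prev] <- x[ok][idx[has_prev]]
  filled
}

fill_previous(c(NA, 0.2, NA, NA, 0.5, NA))
# → NA 0.2 0.2 0.2 0.5 0.5
```

Because the function assigns into a copy of `x`, attributes such as the `ts` frequency are kept.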
  • 29. Time Series Model
    The "seasonal" (daily) pattern is obvious, and we need to make use of this information. stl() decomposes the series with loess smoothing: it extracts the seasonal pattern and lets us focus on the higher-level trend.
        fit <- stl(data.ts, "periodic")
        colnames(fit$time.series)
        ## [1] "seasonal"  "trend"     "remainder"
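To see what the three components mean, the decomposition can be run on synthetic data. The sketch below is an illustration only: the sine-plus-noise series is made up and stands in for the bike ratio. It checks that seasonal, trend, and remainder add back up to the original series.

```r
set.seed(1)
points_per_day <- 288                    # 5-minute samples per day
n_days <- 5
idx <- seq_len(points_per_day * n_days)

# Made-up daily cycle around 0.5 with a little noise.
ratio <- 0.5 + 0.3 * sin(2 * pi * idx / points_per_day) +
  rnorm(length(idx), sd = 0.02)
ratio_ts <- ts(ratio, frequency = points_per_day)

fit   <- stl(ratio_ts, s.window = "periodic")
parts <- fit$time.series

# The decomposition is additive: the three columns sum to the data.
recon <- rowSums(parts)
max(abs(recon - ratio))
```

With `s.window = "periodic"` the seasonal column repeats identically from one day to the next.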
  • 30. Time Series Model
    The fitted result looks like:
        head(fit$time.series)
        ##      seasonal  trend remainder
        ## [1,]  -0.2251 0.2772 -0.025791
        ## [2,]  -0.2133 0.2784 -0.012396
        ## [3,]  -0.2126 0.2795 -0.014250
        ## [4,]  -0.2156 0.2806 -0.012383
        ## [5,]  -0.2067 0.2817 -0.022373
        ## [6,]  -0.2089 0.2828  0.005042
  • 31. Time Series Model
    The black line is the original data, showing the fraction of bikes available at each time point. The red line is the extracted seasonal effect.
        plot(data$ratio[five_day_ind], type = "l", ylim = c(-0.5, 1), xlim = c(0, 1500))
        lines(fit$time.series[five_day_ind, 1], col = 2)
        leg.txt = c("origin", "seasonal")
        legend(1200, 1, leg.txt, cex = 1, lty = 1, col = 1:2)
  • 32. Time Series Model
    The green line is the trend:
        plot(data$ratio[five_day_ind], type = "l", ylim = c(-0.5, 1), xlim = c(0, 1500))
        lines(fit$time.series[five_day_ind, 1], col = 2)
        lines(fit$time.series[five_day_ind, 2], col = 3)
        leg.txt = c("origin", "seasonal", "trends")
        legend(1200, 1, leg.txt, cex = 1, lty = 1, col = 1:3)
  • 33. Time Series Model
    We get an approximation of our data by adding the trend and seasonal effects. The blue line shows the sum of the trend and seasonal components. The difference from the original data is the remainder.
        plot(data$ratio[five_day_ind], type = "l", ylim = c(-0.5, 1), xlim = c(0, 1500))
        lines(fit$time.series[five_day_ind, 1] + fit$time.series[five_day_ind, 2], col = 4)
        leg.txt = c("origin", "approx")
        legend(1200, 1, leg.txt, cex = 1, lty = 1, col = c(1, 4))
  • 34. Time Series Model
    Generally, a single Citibike trip lasts around 30 minutes, and a normal user pays additional charges for a journey over 30 minutes. We therefore focus on predicting the next 30 minutes. Since the data updates every 5 minutes, that means forecasting 6 data points.
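The two numbers used in this part of the talk, the series frequency of 288 and the forecast horizon of 6, both follow from the 5-minute sampling interval. A short sketch of the arithmetic (variable names are illustrative):

```r
interval_min <- 5                            # one scrape every 5 minutes

# Samples in one day: this is the `frequency` of the time series.
points_per_day <- 24 * 60 / interval_min     # 288

# Samples in the next 30 minutes: this is the forecast horizon h.
horizon_min <- 30
h <- horizon_min / interval_min              # 6
```

Changing the scrape interval would change both values together.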
  • 35. Time Series Model
    With the R package 'forecast', we can do time series prediction easily.
        library(forecast)
        # h is the number of periods to forecast
        pred = as.numeric(forecast(fit, h = 6)$mean)
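For intuition about what a forecast from an stl fit is built on, here is a naive baseline in base R. It is not what `forecast()` computes: it simply holds the last trend value constant and repeats the seasonal values from one day earlier. The synthetic series and all variable names are assumptions for illustration.

```r
set.seed(1)
points_per_day <- 288
n   <- points_per_day * 5
idx <- seq_len(n)

# Made-up bike ratio: daily cycle plus noise.
ratio_ts <- ts(0.5 + 0.3 * sin(2 * pi * idx / points_per_day) +
                 rnorm(n, sd = 0.02),
               frequency = points_per_day)

fit      <- stl(ratio_ts, s.window = "periodic")
seasonal <- as.numeric(fit$time.series[, "seasonal"])
trend    <- as.numeric(fit$time.series[, "trend"])

h <- 6
# The seasonal effect for steps n+1 .. n+h equals the one a day earlier.
next_seasonal <- seasonal[(n - points_per_day + 1):(n - points_per_day + h)]

# Naive baseline: last trend level plus the repeated seasonal effect,
# clipped to [0, 1] because the target is a ratio.
baseline <- pmin(pmax(trend[n] + next_seasonal, 0), 1)
baseline
```

A baseline like this is a useful yardstick: a more elaborate model should beat it on held-out data.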
  • 36. Machine Learning Model
    Machine learning can also be applied to time series data. Here we use GBM for demonstration. Before we apply gbm to our data, we need to extract some more time-related features. In particular, we need to use previous values as predictors.
  • 37. Feature extraction
        traindata = data[1:2000, ]
        traindata = traindata[c("station_time", "ratio")]
        names(traindata) <- c("time", "y")
        head(traindata)
        ##                  time       y
        ## 1 2013-08-21 14:10:00 0.02632
        ## 2 2013-08-21 14:15:00 0.05263
        ## 3 2013-08-21 14:20:00 0.05263
        ## 4 2013-08-21 14:25:00 0.05263
        ## 5 2013-08-21 14:30:00 0.05263
        ## 6 2013-08-21 14:35:00 0.07895
  • 38. Feature extraction
    Time points at which to make predictions:
        h = 6
        new_time <- seq(from = traindata$time[nrow(traindata)], by = '5 min',
                        length.out = h + 1)[-1]
        new_time
        ## [1] "2013-08-28 12:50:00 EST" "2013-08-28 12:55:00 EST"
        ## [3] "2013-08-28 13:00:00 EST" "2013-08-28 13:05:00 EST"
        ## [5] "2013-08-28 13:10:00 EST" "2013-08-28 13:15:00 EST"
  • 39. Feature extraction
    Let's combine our train and test data so that further features are computed on both.
        test_id <- seq(nrow(traindata) + 1, by = 1, length.out = h)
        traindata <- rbind(traindata, data.frame(time = new_time, y = NA))
        test_id
        ## [1] 2001 2002 2003 2004 2005 2006
  • 40. Feature extraction
    Of course, this service may be more popular on weekends than on weekdays, so we need a variable to mark the day of the week.
        traindata$weekday <- as.factor(weekdays(traindata$time))
        head(traindata$weekday)
        ## [1] Wednesday Wednesday Wednesday Wednesday Wednesday Wednesday
        ## Levels: Friday Monday Saturday Sunday Thursday Tuesday Wednesday
  • 41. Feature extraction
    The time of day is useful:
        hh <- as.numeric(strftime(traindata$time, format = "%H", tz = "EST"))
        mm <- as.numeric(strftime(traindata$time, format = "%M", tz = "EST"))
        ss <- as.numeric(strftime(traindata$time, format = "%S", tz = "EST"))
        traindata$time_hms <- hh + 60 * mm + 3600 * ss
        head(traindata)
        ##                  time       y   weekday time_hms
        ## 1 2013-08-21 14:10:00 0.02632 Wednesday      614
        ## 2 2013-08-21 14:15:00 0.05263 Wednesday      914
        ## 3 2013-08-21 14:20:00 0.05263 Wednesday     1214
        ## 4 2013-08-21 14:25:00 0.05263 Wednesday     1514
        ## 5 2013-08-21 14:30:00 0.05263 Wednesday     1814
        ## 6 2013-08-21 14:35:00 0.07895 Wednesday     2114
    Note that this formula weights the fields in reverse order; seconds since midnight would be 3600 * hh + 60 * mm + ss. The values shown here and in later slides come from the formula as written.
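A time-of-day feature that grows steadily through the day is usually easier for a model to split on. A minimal sketch of seconds since midnight follows; `seconds_since_midnight` is a hypothetical helper, and the `"UTC"` default is an assumption (the slides use `"EST"`).

```r
# Hypothetical helper: seconds elapsed since midnight in time zone tz.
seconds_since_midnight <- function(time, tz = "UTC") {
  hh <- as.numeric(format(time, "%H", tz = tz))
  mm <- as.numeric(format(time, "%M", tz = tz))
  ss <- as.numeric(format(time, "%S", tz = tz))
  3600 * hh + 60 * mm + ss
}

x <- as.POSIXct(c("2013-08-21 14:10:00", "2013-08-21 14:15:00"), tz = "UTC")
seconds_since_midnight(x)
# → 51000 51300
```

Consecutive 5-minute samples then differ by exactly 300, and the value resets to 0 at midnight.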
  • 42. Feature extraction
    How do we bring in previous information? We need to compute a lagged time series. A lagged time series is a "delayed" time series, as shown below:
        f_lag <- function(x, lag = 0) c(rep(NA, lag), x[1:(length(x) - lag)])
        f_lag(1:10, 1)
        ## [1] NA  1  2  3  4  5  6  7  8  9
        f_lag(1:10, 4)
        ## [1] NA NA NA NA  1  2  3  4  5  6
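The lag helper on this slide works for lags shorter than the series. When the lag equals or exceeds the series length, `1:(length(x) - lag)` counts backwards and returns wrong values. A defensive variant, under the hypothetical name `lag_series`, is sketched below.

```r
# Hypothetical variant of f_lag that also handles lag >= length(x).
lag_series <- function(x, lag = 0) {
  stopifnot(lag >= 0)
  n <- length(x)
  # Pad with at most n NAs, then keep the first n - lag values (if any).
  c(rep(NA, min(lag, n)), head(x, max(n - lag, 0)))
}

lag_series(1:10, 1)
# → NA 1 2 3 4 5 6 7 8 9
lag_series(1:3, 5)
# → NA NA NA
```

The output always has the same length as the input, so it can be stored as a data frame column.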
  • 43. Feature extraction
    To use the information from 12:30 at 12:40, we can use the lagged time series.
        for (lag in 1:12) {
            traindata[[paste("lag_", lag, sep = "")]] <- f_lag(traindata$y, lag)
        }
        traindata[1:3, ]
        ##                  time       y   weekday time_hms   lag_1   lag_2 lag_3
        ## 1 2013-08-21 14:10:00 0.02632 Wednesday      614      NA      NA    NA
        ## 2 2013-08-21 14:15:00 0.05263 Wednesday      914 0.02632      NA    NA
        ## 3 2013-08-21 14:20:00 0.05263 Wednesday     1214 0.05263 0.02632    NA
        ##   lag_4 lag_5 lag_6 lag_7 lag_8 lag_9 lag_10 lag_11 lag_12
        ## 1    NA    NA    NA    NA    NA    NA     NA     NA     NA
        ## 2    NA    NA    NA    NA    NA    NA     NA     NA     NA
        ## 3    NA    NA    NA    NA    NA    NA     NA     NA     NA
  • 44. Feature extraction
    Don't worry about those NAs! They are inevitable in a lagged series.
        traindata[1:10, 5:7]
        ##      lag_1   lag_2   lag_3
        ## 1       NA      NA      NA
        ## 2  0.02632      NA      NA
        ## 3  0.05263 0.02632      NA
        ## 4  0.05263 0.05263 0.02632
        ## 5  0.05263 0.05263 0.05263
        ## 6  0.05263 0.05263 0.05263
        ## 7  0.07895 0.05263 0.05263
        ## 8  0.05263 0.07895 0.05263
        ## 9  0.05263 0.05263 0.07895
        ## 10 0.05263 0.05263 0.05263
  • 45. Feature extraction
    Finally, we have our data:
        test <- traindata[test_id, -1]
        train <- traindata[-test_id, -1]
        # Drop training rows whose target is missing
        train <- train[!is.na(train$y), ]
        head(train)
        ##         y   weekday time_hms   lag_1   lag_2   lag_3   lag_4   lag_5 lag_6
        ## 1 0.02632 Wednesday      614      NA      NA      NA      NA      NA    NA
        ## 2 0.05263 Wednesday      914 0.02632      NA      NA      NA      NA    NA
        ## 3 0.05263 Wednesday     1214 0.05263 0.02632      NA      NA      NA    NA
        ## 4 0.05263 Wednesday     1514 0.05263 0.05263 0.02632      NA      NA    NA
        ## 5 0.05263 Wednesday     1814 0.05263 0.05263 0.05263 0.02632      NA    NA
        ## 6 0.07895 Wednesday     2114 0.05263 0.05263 0.05263 0.05263 0.02632    NA
        ##   lag_7 lag_8 lag_9 lag_10 lag_11 lag_12
        ## 1    NA    NA    NA     NA     NA     NA
        ## 2    NA    NA    NA     NA     NA     NA
        ## 3    NA    NA    NA     NA     NA     NA
        ## 4    NA    NA    NA     NA     NA     NA
        ## 5    NA    NA    NA     NA     NA     NA
        ## 6    NA    NA    NA     NA     NA     NA
  • 46. Machine Learning Model
    Now we can use gbm to make predictions. Wait, what is gbm?
  • 47. Machine Learning Model
    gbm is a supervised learning algorithm that goes by several names:
      · In the original publication, "gbm" is short for "Gradient Boosting Machine".
      · In the R package, it stands for "Generalized Boosted Regression Models".
      · Its wiki page calls it "Gradient boosting".
  • 48. Machine Learning Model
    gbm is derived from a relatively simple principle. Briefly speaking, "hundreds of heads are better than one".
    The algorithm generates many regression trees and combines their results into the final model.
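To make the "many heads" idea concrete, here is a toy sketch of boosting for squared error in base R. It is an illustration under simplifying assumptions, not how the gbm package is implemented: each "tree" is a single-split stump on one feature, each stump is fitted to the residuals of the current ensemble, and its contribution is scaled by a shrinkage factor. All function names and the synthetic data are hypothetical.

```r
# Fit a one-split regression stump to (x, r): choose the threshold that
# minimises the squared error of predicting the mean on each side.
fit_stump <- function(x, r) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2  # midpoints between distinct values
  best <- NULL
  best_sse <- Inf
  for (s in cuts) {
    left <- x <= s
    ml <- mean(r[left])
    mr <- mean(r[!left])
    sse <- sum((r[left] - ml)^2) + sum((r[!left] - mr)^2)
    if (sse < best_sse) {
      best_sse <- sse
      best <- list(split = s, left = ml, right = mr)
    }
  }
  best
}

predict_stump <- function(stump, x) ifelse(x <= stump$split, stump$left, stump$right)

# Boosting loop: start from the mean, then repeatedly fit a stump to the
# current residuals and add a shrunken copy of it to the ensemble.
boost_stumps <- function(x, y, n_trees = 200, shrinkage = 0.1) {
  f0 <- mean(y)
  pred <- rep(f0, length(y))
  stumps <- vector("list", n_trees)
  for (m in seq_len(n_trees)) {
    stumps[[m]] <- fit_stump(x, y - pred)
    pred <- pred + shrinkage * predict_stump(stumps[[m]], x)
  }
  list(f0 = f0, stumps = stumps, shrinkage = shrinkage)
}

# Predict with the first n_trees stumps of the ensemble.
predict_boost <- function(fit, x, n_trees = length(fit$stumps)) {
  pred <- rep(fit$f0, length(x))
  for (m in seq_len(n_trees)) {
    pred <- pred + fit$shrinkage * predict_stump(fit$stumps[[m]], x)
  }
  pred
}

# Synthetic data: a noisy sine curve.
set.seed(1)
x <- seq(0, 2 * pi, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.1)

toy_fit <- boost_stumps(x, y)
rmse_one <- sqrt(mean((predict_boost(toy_fit, x, 1) - y)^2))  # one stump
rmse_all <- sqrt(mean((predict_boost(toy_fit, x) - y)^2))     # all 200 stumps
c(rmse_one, rmse_all)
```

On the training data the error keeps shrinking as stumps are added, which is exactly why the number of trees has to be chosen with some held-out estimate rather than by training error.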
  • 49. Machine Learning Model
    With the following code, we can fit the model:
        library(gbm)
        model <- gbm(formula = y ~ .,
                     data = train[c('y', 'weekday', 'time_hms', paste('lag_', 1:12, sep = ''))],
                     distribution = 'gaussian',
                     n.trees = 2000,
                     interaction.depth = 5,
                     shrinkage = 0.01,
                     cv.folds = 0)
    Here n.trees is the number of "heads" (trees) for this problem.
  • 50. Machine Learning Model
    Using too many trees in prediction may cause overfitting, so we need an error estimate on data the trees have not seen to choose the number of trees.
    gbm provides a convenient tool for this, gbm.perf. Since the model was fitted with cv.folds = 0, we use its out-of-bag (OOB) estimate rather than cross-validation:
        best_ntree <- gbm.perf(model, method = "OOB")
    Note that gbm itself warns that the OOB method generally underestimates the optimal number of trees.
  • 51. Machine Learning Model
    Then we can make the prediction for the first test row:
        best_ntree
        ## [1] 539
        predict(model, test[1, , drop = FALSE], n.trees = best_ntree, type = 'response')
        ## [1] 0.1287
  • 52. Performance testing
    How do we compare these two models? We set up a test.
    With one sample every 5 minutes, each day gives 288 data points. We want to predict the next 6 points using data from the previous week, i.e. 2016 data points.
    We randomly choose 50 time points and predict the following 30 minutes from each. Then we compare performance with RMSE:
        rmse = function(pred, real) sqrt(mean((pred - real)^2))
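The arithmetic behind this setup can be sketched as follows. This is an illustration only: the series length `n`, the seed and the index variable names are assumptions, and no real station data is used.

```r
rmse <- function(pred, real) sqrt(mean((pred - real)^2))

points_per_day <- 24 * 60 / 5   # 288 five-minute samples per day
window <- 7 * points_per_day    # 2016 samples = one week of history
horizon <- 6                    # 6 steps of 5 minutes = 30 minutes ahead

n <- 10000                      # hypothetical length of the full series
set.seed(1)
origins <- sample(window:(n - horizon), 50)  # 50 random forecast origins

# Index ranges for one origin: a week of history and the next 6 targets.
t0 <- origins[1]
train_idx <- (t0 - window + 1):t0
test_idx <- (t0 + 1):(t0 + horizon)

# RMSE on a tiny example: errors are 0 and 0.2, so the result is sqrt(0.02).
rmse(c(0.10, 0.20), c(0.10, 0.40))
```

Sampling origins from `window:(n - horizon)` guarantees that every origin has a full week of history behind it and six observed values ahead of it.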
  • 53. Performance testing
    Here is the result, as RMSE for each of the 6 predicted points (lower is better):
        stl_precision
        ## [1] 0.03496 0.04656 0.05912 0.07045 0.07626 0.08698
        gbm_precision
        ## [1] 0.02011 0.03447 0.04900 0.06536 0.07186 0.08258
    We can see that gbm is slightly better than the time series prediction.
  • 54. Performance testing
    However, our performance is not ideal. Consider a straightforward prediction: assume the value stays unchanged over the next 30 minutes. How does it do?
        y_precision
        ## [1] 0.01903 0.03021 0.02599 0.02401 0.02541 0.03311
    It has a lower RMSE than both models at every one of the 6 points. Why is this happening?
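A "stay the same" baseline of this kind is often called a persistence forecast. The sketch below shows one way to score it per horizon; it is not the authors' evaluation code, the series is synthetic, and `persistence_forecast` is a hypothetical helper.

```r
# Persistence forecast: repeat the last observed value for every horizon.
persistence_forecast <- function(history, horizon = 6) rep(tail(history, 1), horizon)

# Synthetic stand-in for a station series that is unchanged in most steps.
set.seed(42)
y <- cumsum(sample(c(-0.02, 0, 0, 0, 0.02), 3000, replace = TRUE))

horizon <- 6
origins <- sample(2016:(length(y) - horizon), 50)  # 50 random forecast origins

# One column of forecast errors per origin: a 6 x 50 matrix.
errors <- sapply(origins, function(t0) {
  persistence_forecast(y[1:t0], horizon) - y[(t0 + 1):(t0 + horizon)]
})

# RMSE for each of the 6 horizons, pooled over the 50 origins.
baseline_rmse <- sqrt(rowMeans(errors^2))
baseline_rmse
```

Any model worth deploying should beat this baseline on the same origins, so it is a useful first row in a results table.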
  • 55. Performance testing
    This plot of the differences between consecutive observations gives some hints:
        plot(diff(data.ts), type = "l")
  • 56. Performance testing
    We can see that the value tends to stay the same over the next 5 minutes, or even longer. There are many 5-minute intervals in which nobody comes to this station.
        sum(diff(data.ts) == 0)
        ## [1] 6622
    Therefore the most straightforward prediction outperformed the two more advanced methods.
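The same check can be expressed as a share of unchanged steps, which is easier to compare across stations. A minimal sketch on a made-up vector (`bikes` is hypothetical, not the station data from the slides):

```r
# Hypothetical bike counts sampled every 5 minutes.
bikes <- c(5, 5, 5, 4, 4, 4, 4, 6, 6, 6)
changes <- diff(bikes)

sum(changes == 0)   # number of 5-minute steps with no change
mean(changes == 0)  # share of steps with no change

# Lengths of consecutive runs of unchanged / changed steps.
rle(changes == 0)
```

A high share of unchanged steps is a sign that the persistence baseline will be hard to beat at short horizons.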
  • 57. More to do
    There are many things to do in the future:
      · Apply other algorithms to this problem, such as neural networks.
      · Use information from nearby stations: empty nearby stations will lead people to come to this one.
      · Combine with weather records: nobody rides on a rainy day!
      · Path finding: design the whole trip for people.
    The sky is the limit!
  • 58. Our Packages
    We are developing an R package for citibike, including:
      · Data scraping
      · Database interaction and retrieval
      · Time series prediction
      · GBM prediction
    There was an app written in Ruby on Rails offering our prediction service. Our Heroku instance went to sleep because the service didn't get much traffic, but one of our meetup members spent some time to make it live again today and emailed me the link!