Good Enough Analytics
by Kai Xin
The Good Enough Stuff
Analytical Tools
Analytical Tools are like spoons
[Slide visual: usefulness rises with tool size and complexity until it hits a "point of stupidity"]
What is stupid today might not be stupid tomorrow
Good Enough Analytics
Big data analytics using cost efficient tools
The Good Enough Stuff
Ensembles of good enough models
Point of stupidity: The perfect model
[Slide visuals: three slides filled with a wall of random digits, dramatizing the sheer volume of data a single "perfect" model would have to capture]
A "perfect" model is too complex, too costly to build, too hard to maintain, and too inflexible to change.
“There are known knowns;
there are things we know that we know.
There are known unknowns;
there are things that we now know we don't know.
But there are also unknown unknowns;
there are things we do not know we don't know.”
By Donald Rumsfeld, United States Secretary of Defense and Potential Data Scientist
Why the perfect model is stupid
“In statistics and machine
learning, ensemble
methods use multiple
models to obtain better
predictive performance
than could be obtained
from any of the
constituent models”
Good Enough Analytics: Ensembles
[Slide visual: the same wall of digits broken into three smaller blocks joined by "+" signs: each block of data gets its own good enough model, and the models are combined]
scholarpedia.org
Refer to References
The Serious Stuff
…beyond theorycraft
Simple Ensembles – GLM Bootstrap Aggregating (Bagging)
predictions <- foreach(i=1:1000, .combine=cbind) %dopar% {
  training_positions <- sample(nrow(train),
    size=floor(nrow(train)*0.9), replace=TRUE)
  train_pos <- 1:nrow(train) %in% training_positions
  glmMod <- rxLinMod(eqn, train[train_pos,])
  rxPredict(glmMod, test, type="response")
}
result <- rowMeans(predictions)
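The R snippet above relies on Revolution R's rxLinMod, but the bagging idea itself is language-agnostic. Below is a minimal, self-contained Python sketch (all names are illustrative, not from the slides): resample the training set with replacement, fit one model per sample, and average the predictions.

```python
import random
import statistics

def bag_predict(train, test_x, fit, predict, n_models=100, frac=0.9, seed=42):
    """Bootstrap aggregating: fit n_models on resampled subsets of train,
    then average their predictions for test_x."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_models):
        # Sample with replacement, taking frac of the training rows each time.
        sample = [train[rng.randrange(len(train))]
                  for _ in range(int(len(train) * frac))]
        preds.append(predict(fit(sample), test_x))
    return statistics.mean(preds)

# Toy usage: the "model" is just the mean y of the bootstrap sample.
train = [(i, 2.0 * i) for i in range(100)]
fit = lambda sample: statistics.mean(y for _, y in sample)
predict = lambda model, x: model
estimate = bag_predict(train, 0, fit, predict)
```

With a fixed seed, the toy run lands close to the true mean of 99, illustrating how averaging many weak fits stabilises the estimate.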
Simple Ensembles – Gradient Boosting Machines
gbmMod <- gbm(eqn, data=train, n.trees=10000,
  shrinkage=0.002, distribution="gaussian",
  interaction.depth=7,
  bag.fraction=0.9,
  n.minobsinnode=50)
Like bagging, boosting builds an ensemble of classifiers by resampling the data and combining them by majority voting. In boosting, however, resampling is strategically geared to provide the most informative training data for each successive classifier.
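That "strategically geared" resampling can be made concrete with the classic AdaBoost reweighting rule (an illustrative stand-in here; the slides' gbm package uses gradient boosting rather than AdaBoost): after each round, misclassified points are up-weighted so the next classifier focuses on them.

```python
import math

def adaboost_weights(weights, correct, error_rate):
    """One AdaBoost reweighting step: up-weight misclassified points,
    down-weight correctly classified ones, then renormalize to sum to 1."""
    alpha = 0.5 * math.log((1 - error_rate) / error_rate)
    new = [w * math.exp(-alpha if c else alpha)
           for w, c in zip(weights, correct)]
    total = sum(new)
    return [w / total for w in new]

# Four equally weighted points; one is misclassified at error rate 0.25.
w = adaboost_weights([0.25] * 4, [True, True, True, False], error_rate=0.25)
# The single misclassified point now carries half of the total weight,
# so the next classifier is forced to pay attention to it.
```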
Simple Ensembles - Random Forest
rf <- foreach(ntree=rep(333, 3), .combine=combine,
              .packages='randomForest') %dopar%
  randomForest(train[, 3:length(train)],
    train$Act, ntree=ntree, do.trace=1000,
    mtry=round(colNumber/3), replace=FALSE,
    nodesize=5, na.action=na.omit)
Ensemble of Ensembles
1. Mean(RF+GBM+BagGLM)
2. Median(RF+GBM+BagGLM)
3. 0.4*RF+0.4*GBM+0.2*BagGLM
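Each of these three blends is a one-liner in any language. A Python sketch with toy per-row predictions (the numbers are made up for illustration):

```python
import statistics

# Per-row predictions from three models (toy values, two test rows):
rf, gbm, bag_glm = [0.8, 0.1], [0.6, 0.3], [0.7, 0.2]

# 1. Mean of the three models, row by row
mean_blend = [statistics.mean(t) for t in zip(rf, gbm, bag_glm)]
# 2. Median of the three models, row by row
median_blend = [statistics.median(t) for t in zip(rf, gbm, bag_glm)]
# 3. Weighted blend: 0.4*RF + 0.4*GBM + 0.2*BagGLM
weighted = [0.4 * a + 0.4 * b + 0.2 * c
            for a, b, c in zip(rf, gbm, bag_glm)]
```

The median blend is the most robust to one wildly wrong model; the weighted blend lets you favour whichever models validate best.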
Ensembles – Why it matters
Improve accuracy
Ensembles tend to yield better results than their constituent models
when there is significant diversity among the models
Developing multiple simple models is faster
than attempting to develop the perfect model
More resistance to overfitting
Less reliant on any single model
Concurrent development
Different models can be developed and run on different
instances/machines by different data scientists
Ensembles – point of stupidity
Netflix prize 1 million dollar winner:
Ensemble of 107 models for 10% improvement
Too complicated, costly and inflexible to change
Actual deployment:
Ensemble of 2 models for 8.43% improvement
Moral of the story:
Good Enough Ensemble is good enough
Good Enough Analytics
Big data analytics using cost efficient tools
and good enough ensemble of models
The Good Enough Stuff
Data Optimization
Data cleaning vs Data optimization
Data cleaning: important, but I assume you know it
Data optimization: done AFTER data cleaning
Kaggle Medical Drug Competition
15 sets of data
Each data set:
1,000 to 2,000 Attributes
500 to 20,000 Rows
Qn: Identify rogue drugs
Point of stupidity:
Trying to run analysis on all attributes
Drug | Rogue % | Company | Color | Component 1 | Component 2…2000
A    | 0.0400  | XYZ     | Red   | 200         | 30
B    | 0.0002  | XYZ     | Green | 920         | 50
C    | 0.8000  | XYZ     | Blue  | 30          | 1000
D    | ?       | XYZ     | Red   | 340         | 800
Not all attributes are born equal: no variance, irrelevant, or simply too many attributes
Remove no variance / near zero variance attributes

Drug | Rogue % | Company
A    | 0.0400  | XYZ
B    | 0.0002  | XYZ
C    | 0.8000  | XYZ
D    | ?       | XYZ

The Company attribute never varies, so it does not help in differentiating between the drugs.

R code:
library(caret)
healthdata[nearZeroVar(healthdata, freqCut = 95/5,
  uniqueCut = 10)] <- list(NULL)
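caret's nearZeroVar flags columns whose most frequent value dominates the second most frequent (freqCut) and which have few distinct values relative to the number of rows (uniqueCut). A stdlib Python approximation of that rule, for intuition only (not a substitute for the caret implementation):

```python
def near_zero_var_cols(rows, freq_cut=95 / 5, unique_cut=10.0):
    """Flag near-zero-variance columns, loosely mimicking caret::nearZeroVar:
    a column is flagged when its most common value dominates (freq_ratio)
    AND it has few unique values relative to the sample size (pct_unique)."""
    flagged = []
    n = len(rows)
    for col in range(len(rows[0])):
        values = [r[col] for r in rows]
        counts = sorted((values.count(v) for v in set(values)), reverse=True)
        freq_ratio = counts[0] / counts[1] if len(counts) > 1 else float("inf")
        pct_unique = 100.0 * len(counts) / n
        if freq_ratio > freq_cut and pct_unique < unique_cut:
            flagged.append(col)
    return flagged

# Column 0 is constant ("XYZ" everywhere, like Company); column 1 varies.
data = [["XYZ", i] for i in range(100)]
flagged = near_zero_var_cols(data)  # only column 0 is flagged for removal
```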
Remove unimportant attributes

Drug | Rogue % | Color
A    | 0.0400  | Red
B    | 0.0002  | Green
C    | 0.8000  | Blue
D    | ?       | Red

The Color attribute has no relevance to the rogue drug %.

R code for Random Forest:
importanceScore <- importance(myMod)

R code for GBM:
importanceScore <- summary.gbm(myMod, ntree)
Attribute reduction using Principal Component Analysis

Drug | Rogue % | Component 1 | Component 2…2000
A    | 0.0400  | 200         | 30
B    | 0.0002  | 920         | 50
C    | 0.8000  | 30          | 1000
D    | ?       | 340         | 800

With ~2,000 components there are too many attributes, and the analysis takes very long to run.

R code:
pc <- prcomp(train[, 2:length(train)], tol=0.12)
Andrew Ng: Always try analysis without PCA first.
[Slide visuals: a 2D scatter plot on axes Attribute 1 and Attribute 2, with the points collapsed onto a single principal-component line. The 1D red line and points are now representative of the 2D graph.]
Andrew Ng: Machine Learning Course (refer to References)
Data Optimization – Why it matters
Performance improvement (importance, nearZeroVar)
Cut down attributes that are useless or not "good
enough". More accurate and complex models can
be built on the attributes that matter.
Cost savings (PCA)
Less data needs to be processed, so models and
results turn over faster.
Good Enough Analytics
Big data analytics using cost efficient tools
and good enough ensemble of models based
on optimized data
The Good Enough Stuff
Scaling on cloud
Why use Cloud
How often do you really need a
multimillion-dollar machine on standby
24/7 to churn data?
Do you really need real-time analytics, or
is an hourly/daily/weekly/monthly report
good enough?
Cloud – Why it matters
Excellent bang for the buck
<$5/hr to rent a million dollars' worth of computing power.
No need to purchase/maintain hardware. Scale on demand.
Great for ensemble modeling
You can start multiple instances, each instance running one
simple model, and ensemble them.
But beware of data security and privacy laws
Not suitable for all kinds of data/applications.
For example, Amazon Web Services is HIPAA compliant but
Rackspace is not.
Prepare data for the cloud

Name  | Age | Income | Postal
Peter | 23  | $2,000 | 400573
Sally | 11  | $0     | 520028
Paul  | 70  | $500   | 521201
Mark  | 30  | $8,000 | 247392
Prepare data for the cloud

Name  | Age | Age Group | Income | Income Range  | Postal | Postal Area
Peter | 23  | Youth     | $2,000 | $1,000-$3,000 | 400*** | Eunos
Sally | 11  | Child     | $0     | $0            | 520*** | Simei
Paul  | 70  | Senior    | $500   | $1-$1,000     | 521*** | Tampines
Mark  | 30  | Adult     | $8,000 | >$5,000       | 247*** | Tanglin

Techniques: remove identity, roll up to general categories, use range categories, and mask identifiers.
Reference: Dr. Yap Ghim Eng (A*Star)
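The masking and rollup steps in the table can be sketched as a simple record transformation. The bucket boundaries below are assumptions chosen to reproduce the example rows, not rules from the slides:

```python
def anonymize(record):
    """Masking + rollup sketch: drop identity, bucket age and income into
    general categories, and truncate the postal code to its area prefix."""
    age, income, postal = record["age"], record["income"], record["postal"]
    age_group = ("Child" if age < 13 else "Youth" if age < 25
                 else "Adult" if age < 65 else "Senior")
    income_range = ("$0" if income == 0 else "$1-$1,000" if income <= 1000
                    else "$1,000-$3,000" if income <= 3000
                    else "$3,000-$5,000" if income <= 5000 else ">$5,000")
    # Name is dropped entirely; postal keeps only the first 3 digits.
    return {"age_group": age_group, "income_range": income_range,
            "postal_area": postal[:3] + "***"}

row = anonymize({"name": "Peter", "age": 23, "income": 2000,
                 "postal": "400573"})
```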
Good Enough Analytics
Big data analytics using cost efficient tools
and good enough ensemble of models based
on optimized data, scaled on cloud services
The Good Enough Stuff
…that we have no time for
Amazon Web Service
sudo yum install gcc gcc-c++ gcc-gfortran readline-devel python-devel make atlas blas
sudo yum install -y lapack-devel blas-devel
wget http://cran.at.r-project.org/src/base/R-2/R-2.15.2.tar.gz
tar -xf R-2.15.2.tar.gz
cd R-2.15.2
./configure --with-x=no
sudo make
PATH=$PATH:~/R-2.15.2/bin/
cd ..
wget http://sourceforge.net/projects/numpy/files/NumPy/1.6.2/numpy-1.6.2.tar.gz/download
tar -xzf numpy-1.6.2.tar.gz
cd numpy-1.6.2
sudo python setup.py install
cd ..
wget http://sourceforge.net/projects/scipy/files/scipy/0.11.0/scipy-0.11.0.tar.gz/download
tar -xzf scipy-0.11.0.tar.gz
cd scipy-0.11.0
sudo python setup.py install
cd ..
wget http://pypi.python.org/packages/source/n/nose/nose-1.1.2.tar.gz#md5=144f237b615e23f21f6a50b2183aa817
tar -xzf nose-1.1.2.tar.gz
cd nose-1.1.2
sudo python setup.py install
Basic commands to set up an Amazon instance for analytics.

After the setup above, start R and install the modeling packages:
install.packages('gbm')
install.packages('randomForest')

To leave R or Python jobs running while you are not logged on: nohup R CMD BATCH myfile.r &
Amazon EC2 Spot Instance
Cluster Compute Eight Extra Large
60.5 GiB memory, 88 EC2 Compute Units, 3370 GB of
local instance storage, 64-bit platform, 10 Gigabit
Ethernet
$0.27 per hour
High-Memory Quadruple Extra Large Instance
68.4 GiB of memory, 26 EC2 Compute Units (8 virtual
cores with 3.25 EC2 Compute Units each), 1690 GB of
local instance storage, 64-bit platform
$0.14 per hour
Weakness of Spot Instances
Bidding system: if the spot price rises above your bid,
your instance will be terminated.
Solutions:
1) Put the master on a normal (on-demand) instance
and the slaves on spot instances
2) Heartbeat + queue with checkpointing
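Solution 2 can be as simple as persisting completed work to disk after each unit, so a replacement instance resumes where the terminated one left off instead of restarting. A minimal Python sketch with illustrative names:

```python
import json
import os
import tempfile

def run_with_checkpoint(items, work, path):
    """Resume-safe loop: completed results are written to a checkpoint file
    after every unit of work, so a terminated spot instance can pick up."""
    done = {}
    if os.path.exists(path):
        with open(path) as f:
            done = json.load(f)  # resume from the previous run's progress
    for key, item in items:
        if key in done:
            continue  # already processed before the interruption
        done[key] = work(item)
        with open(path, "w") as f:  # checkpoint after each unit of work
            json.dump(done, f)
    return done

path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
out = run_with_checkpoint([("a", 1), ("b", 2)], lambda x: x * x, path)
```

Running the same call again against the same checkpoint file skips both items, which is exactly the behaviour you want after a spot termination.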
The Good Enough Stuff
…that we have no time for
PCA with KNN
Principal Component Analysis with K-Nearest Neighbor

library(FNN)
train <- read.csv("train.csv", header=TRUE)
test <- read.csv("test.csv", header=TRUE)
pc <- prcomp(train[, 2:length(train)], tol=0.12)
mydata <- data.frame(label = train[, "label"], pc$x)
labels <- mydata[, 1]
mydata2 <- mydata[, -1]
test.p <- predict(pc, newdata = test)
results <- (0:9)[knn(mydata2, data.frame(test.p), labels, k = 1,
                     algorithm="cover_tree")]
write(results, file="knn_PCA.csv", ncolumns=1)
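For intuition, here is a dependency-free Python sketch of the same pipeline: find the leading principal component by power iteration, project training and query points onto it, and classify with 1-nearest-neighbour. This is a toy version of what prcomp + FNN::knn do, not a replacement for them.

```python
def first_pc(data, iters=200):
    """Leading principal component via power iteration on X^T X
    (unnormalized covariance of the mean-centered data)."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    v = [1.0] * d
    for _ in range(iters):
        scores = [sum(x * vi for x, vi in zip(row, v)) for row in centered]
        w = [sum(s * row[j] for s, row in zip(scores, centered))
             for j in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return means, v

def project(row, means, v):
    """Project a row onto the principal component (1-D score)."""
    return sum((x - m) * vi for x, m, vi in zip(row, means, v))

def knn1(train_proj, labels, q):
    """1-nearest-neighbour in the projected 1-D space."""
    return min(zip(train_proj, labels), key=lambda t: abs(t[0] - q))[1]

# Toy data along a diagonal: two clusters with labels 0 and 1.
train = [[0, 0], [1, 1], [9, 9], [10, 10]]
labels = [0, 0, 1, 1]
means, v = first_pc(train)
proj = [project(r, means, v) for r in train]
pred = knn1(proj, labels, project([8, 8], means, v))
```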
The Good Enough Stuff
…that we have no time for
Data Chunking
Data Chunking – Revolution R
Revolution R uses a binary format called XDF, loosely
based on NoSQL ideas. The XDF format stores data in
blocks and processes it in chunks (groups of blocks) for
efficient reading of arbitrary columns and contiguous rows.
For more details, visit the Revolution R website.
Data Chunking – Why it matters
# Chunk 6.5GB worth of data onto HDD in XDF format
rxImport(inData = trainFile, outFile = "trainingData.xdf")
# Revolution R provides methods like rxGlm to run a huge
# Poisson regression directly on the XDF file
myPos <- rxGlm(amount2 ~ Mailed + Donated + RR,
               data = "trainingData.xdf", family = poisson())
*This cannot be done in normal R on my laptop, as R tries to load the
entire dataset into memory
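Outside Revolution R, the same chunking idea applies anywhere: stream the data in fixed-size blocks so only one block is in memory at a time. A stdlib Python sketch (the CSV and column names are illustrative):

```python
import csv
import io
import itertools

def chunked_sum(csv_text, column, chunk_rows=2):
    """Stream a CSV in fixed-size chunks so only one chunk is in memory
    at a time — the same idea as processing XDF blocks instead of
    loading the whole dataset."""
    reader = csv.DictReader(io.StringIO(csv_text))
    total = 0.0
    while True:
        chunk = list(itertools.islice(reader, chunk_rows))
        if not chunk:
            break  # exhausted the stream
        total += sum(float(r[column]) for r in chunk)
    return total

data = "amount\n1\n2\n3\n4\n5\n"
total = chunked_sum(data, "amount")  # same answer as an in-memory sum
```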
Data Chunking – Speeding it up using SSD instead of normal HDD
RAM: fast but expensive
SSD: ~4x faster than a normal HDD when chunking
The Good Enough Stuff
…that we have no time for
Multicore
Multicore Processing – Revolution R
library(foreach)
library(doSNOW)
cluster <- makeCluster(3, type = "SOCK")
registerDoSNOW(cluster)
setMKLthreads(1)
predictions <- foreach(i=1:1000, .combine=cbind) %dopar% {
  training_positions <- sample(nrow(train),
    size=floor(nrow(train)*0.9), replace=TRUE)
  train_pos <- 1:nrow(train) %in% training_positions
  glmMod <- rxLinMod(eqn, train[train_pos,])
  rxPredict(glmMod, test, type="response")
}
result <- rowMeans(predictions)
Multicore Processing – Why it matters
License cost (usually charged per CPU)
1 CPU with 4 cores = 1 single-user license
4 distributed CPUs with 1 core each = 4 licenses or a group license
Performance improvement
~2x performance with 3 cores vs 1 core
Visualization
Good Enough References
Random Forest
•Obtaining knowledge from a random forest
•Suggestions for speeding up Random Forests
•Random Forest with classes that are very unbalanced
GBM
•Define boosting
•Generalized Boosted Models: A guide to the gbm package
•What are some useful guidelines for GBM parameters?
•R gbm logistic regression
•How to win the KDD Cup Challenge with R and gbm
Ensembles
•Ensemble learning introduction
•Exploiting Diversity in Ensembles: Improving the Performance on Unbalanced Datasets
•Resources for learning how to implement ensemble methods
•Ensemble methods
•Intro to ensemble learning in R
•Predictive analytics & decision tree
Good Enough References
PCA and NearZero
•Principal Component Analysis in R
•PCA on high dimensional data
•PCA on training and test data
•nearZeroVar (R caret library)
Misc
•Andrew Ng’s Machine Learning Course
•A Few Useful Things to Know about Machine Learning
•Creating HIPAA-Compliant Medical Data Applications With AWS
•Amazon EC2 Spot Instances
•Improve Predictive Performance in R with Bagging
•Kaggle: Visualizing dark world
•Kaggle: Visualizing handwriting
Good Enough Analytics
Big data analytics using cost efficient tools
and good enough ensemble of models based
on optimized data, scaled on cloud services
Questions? Email me at thiakx@gmail.com
LinkedIn Profile
Kaggle Profile
Asia?
Photo Credits
•Slide 2: http://3.bp.blogspot.com/-nkP_UHgebKo/T70GJ3ezCrI/AAAAAAAAAZc/mWD6RsDlz6Y/s1600/IMG_0349.JPG
•Slide 3: http://www.salesmanagementmastery.com/wp-content/uploads/2010/09/money-flying.jpg
•Slide 5: http://www.pachd.com/free-images/household-images/spoon-01.jpg
•Slide 6: http://www.bhmpics.com/view-rice_in_a_wooden_spoon-1440x900.html
•Slide 7: http://2.bp.blogspot.com/-Oj7ji_8CB3Q/TkvdFXAYUcI/AAAAAAAADgQ/XcevbehpPHU/s1600/Big+spoon+3.jpg
•Slide 8: http://familyhelpers.files.wordpress.com/2012/03/spoon.jpg
•Slide 11 (Lemon): http://miamiaromatherapy.com/shopping/images/70//Lemon-2.jpg
•Slide 12 (Bank): http://www.psdgraphics.com/wp-content/uploads/2011/03/bank-icon.jpg
•Slide 11/12 (Logos): http://commons.wikimedia.org/wiki/Main_Page
•Slide 19-21: www.scholarpedia.org
•Slide 23/25: www.wikipedia.org
•Slide 32: http://www.chipandco.com/wp-content/uploads/2012/08/medicine.jpg
•Slide 63: www.kaggle.com

Mais conteúdo relacionado

Mais procurados (6)

MULTIPLICACIONES PARA PRACTICAR CÁLCULO MENTAL
MULTIPLICACIONES PARA PRACTICAR CÁLCULO MENTALMULTIPLICACIONES PARA PRACTICAR CÁLCULO MENTAL
MULTIPLICACIONES PARA PRACTICAR CÁLCULO MENTAL
 
Eficiency and Low Cost: Pro Tips for you to save 50% of your money with Googl...
Eficiency and Low Cost: Pro Tips for you to save 50% of your money with Googl...Eficiency and Low Cost: Pro Tips for you to save 50% of your money with Googl...
Eficiency and Low Cost: Pro Tips for you to save 50% of your money with Googl...
 
Lks siswa
Lks siswaLks siswa
Lks siswa
 
سفارشات وابسته به دورهم اندیشی یهوه پدر من
سفارشات وابسته به دورهم اندیشی یهوه پدر منسفارشات وابسته به دورهم اندیشی یهوه پدر من
سفارشات وابسته به دورهم اندیشی یهوه پدر من
 
How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Mass...
How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Mass...How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Mass...
How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Mass...
 
Effective Management
Effective ManagementEffective Management
Effective Management
 

Destaque (8)

Lecture6 - C4.5
Lecture6 - C4.5Lecture6 - C4.5
Lecture6 - C4.5
 
Decision Tree - C4.5&CART
Decision Tree - C4.5&CARTDecision Tree - C4.5&CART
Decision Tree - C4.5&CART
 
Id3,c4.5 algorithim
Id3,c4.5 algorithimId3,c4.5 algorithim
Id3,c4.5 algorithim
 
Decision tree Using c4.5 Algorithm
Decision tree Using c4.5 AlgorithmDecision tree Using c4.5 Algorithm
Decision tree Using c4.5 Algorithm
 
Algoritma C4.5 Dalam Data Mining
Algoritma C4.5 Dalam Data MiningAlgoritma C4.5 Dalam Data Mining
Algoritma C4.5 Dalam Data Mining
 
Lecture5 - C4.5
Lecture5 - C4.5Lecture5 - C4.5
Lecture5 - C4.5
 
Belajar mudah algoritma data mining c4.5
Belajar mudah algoritma data mining c4.5Belajar mudah algoritma data mining c4.5
Belajar mudah algoritma data mining c4.5
 
Machine Learning and Data Mining: 12 Classification Rules
Machine Learning and Data Mining: 12 Classification RulesMachine Learning and Data Mining: 12 Classification Rules
Machine Learning and Data Mining: 12 Classification Rules
 

Semelhante a Good Enough Analytics

4.4 M. Ángeles Rol
4.4 M. Ángeles Rol4.4 M. Ángeles Rol
4.4 M. Ángeles Rol
brnmomentum
 
Baremo course navette
Baremo course navetteBaremo course navette
Baremo course navette
JAESHUANMAR1
 
Resultados 3 série
Resultados 3 sérieResultados 3 série
Resultados 3 série
guest3492c4
 
Resultados 2 série
Resultados 2 sérieResultados 2 série
Resultados 2 série
guest3492c4
 
Resultados 1 série
Resultados 1 sérieResultados 1 série
Resultados 1 série
guest3492c4
 
Master tabel pengetahuan dan sikap (uji validitas)
Master tabel pengetahuan dan sikap (uji validitas)Master tabel pengetahuan dan sikap (uji validitas)
Master tabel pengetahuan dan sikap (uji validitas)
Chenk Alie Patrician
 
03 partes de la cara
03 partes de la cara03 partes de la cara
03 partes de la cara
b0rg1r
 
División por dos cifras
División por dos cifrasDivisión por dos cifras
División por dos cifras
mariagonper
 

Semelhante a Good Enough Analytics (20)

DSD-INT 2016 Urban water modelling - Meijer
DSD-INT 2016 Urban water modelling - MeijerDSD-INT 2016 Urban water modelling - Meijer
DSD-INT 2016 Urban water modelling - Meijer
 
Tableau for statistical graphic and data visualization
Tableau for statistical graphic and data visualizationTableau for statistical graphic and data visualization
Tableau for statistical graphic and data visualization
 
4.4 M. Ángeles Rol
4.4 M. Ángeles Rol4.4 M. Ángeles Rol
4.4 M. Ángeles Rol
 
Don't Forget This!
Don't Forget This!Don't Forget This!
Don't Forget This!
 
Rants by Mac Columns - 2021 - ALL.pdf
Rants by Mac Columns - 2021 - ALL.pdfRants by Mac Columns - 2021 - ALL.pdf
Rants by Mac Columns - 2021 - ALL.pdf
 
Gone Shopping: detailed retail mapping
Gone Shopping: detailed retail mappingGone Shopping: detailed retail mapping
Gone Shopping: detailed retail mapping
 
Baremo course navette
Baremo course navetteBaremo course navette
Baremo course navette
 
Gráfico ponto cruz em PDF
Gráfico ponto cruz em PDFGráfico ponto cruz em PDF
Gráfico ponto cruz em PDF
 
Resultados 3 série
Resultados 3 sérieResultados 3 série
Resultados 3 série
 
Resultados 2 série
Resultados 2 sérieResultados 2 série
Resultados 2 série
 
Resultados 2 série
Resultados 2 sérieResultados 2 série
Resultados 2 série
 
Resultados 1 série
Resultados 1 sérieResultados 1 série
Resultados 1 série
 
Master tabel pengetahuan dan sikap (uji validitas)
Master tabel pengetahuan dan sikap (uji validitas)Master tabel pengetahuan dan sikap (uji validitas)
Master tabel pengetahuan dan sikap (uji validitas)
 
03 partes de la cara
03 partes de la cara03 partes de la cara
03 partes de la cara
 
City hall final
City hall finalCity hall final
City hall final
 
Parametric and non parametric test
Parametric and non parametric testParametric and non parametric test
Parametric and non parametric test
 
RESULTADOS VIERNES
RESULTADOS VIERNES RESULTADOS VIERNES
RESULTADOS VIERNES
 
More Reliable Delivery with Monte Carlo & Story Mapping
More Reliable Delivery with Monte Carlo & Story MappingMore Reliable Delivery with Monte Carlo & Story Mapping
More Reliable Delivery with Monte Carlo & Story Mapping
 
Tablas de-mulitplicar-con-tapones
Tablas de-mulitplicar-con-taponesTablas de-mulitplicar-con-tapones
Tablas de-mulitplicar-con-tapones
 
División por dos cifras
División por dos cifrasDivisión por dos cifras
División por dos cifras
 

Último

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Último (20)

Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 

Good Enough Analytics

  • 2.
  • 3.
  • 4. The Good Enough Stuff Analytical Tools
  • 5. Analytical Tools are like spoons
  • 6. Analytical Tools are like spoons
  • 7.
  • 8.
  • 13. Point of stupidity What is stupid today, might not be stupid tomorrow
  • 14. Good Enough Analytics Big data analytics using cost efficient tools
  • 15. The Good Enough Stuff Ensembles of good enough models
  • 16. Point of stupidity: The perfect model [slide visual: a dense wall of random digits filling the slide, repeated several times] A “perfect” model is too complex, too costly to build, too hard to maintain and not flexible to change.
  • 17. “There are known knowns; there are things we know that we know. There are known unknowns; that is to say, there are things that we know we don't know. But there are also unknown unknowns; there are things we do not know we don't know.” By Donald Rumsfeld, United States Secretary of Defense and Potential Data Scientist. Why the perfect model is stupid
  • 18. Good Enough Analytics: Ensembles – “In statistics and machine learning, ensemble methods use multiple models to obtain better predictive performance than could be obtained from any of the constituent models.” [slide visual: three walls of random digits joined by “+” signs, illustrating models being combined]
  • 23. Simple Ensembles – GLM: Bootstrap aggregating (bagging)
    predictions <- foreach(1:1000, .combine=cbind) %dopar% {
      training_positions <- sample(nrow(train), size=floor(nrow(train)*0.9), replace=TRUE)
      train_pos <- 1:nrow(train) %in% training_positions
      glmMod <- rxLinMod(eqn, train[train_pos,])
      rxPredict(glmMod, test, type="response")
    }
    result <- rowMeans(predictions)
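The bagging recipe on slide 23 can be sketched in a few lines of plain Python (the deck uses R; this is an illustrative stand-in that uses a trivially simple "model", the sample mean, in place of rxLinMod): fit many models, each on a random 90% sample drawn with replacement, then average their predictions.

```python
import random

def bagged_mean(train, n_models=1000, frac=0.9, seed=42):
    """Average the predictions of n_models, each fit on a bootstrap sample."""
    rng = random.Random(seed)
    size = int(len(train) * frac)
    preds = []
    for _ in range(n_models):
        # sample 90% of the rows with replacement (the bootstrap step)
        sample = [rng.choice(train) for _ in range(size)]
        # one deliberately simple "model": predict the sample mean
        preds.append(sum(sample) / size)
    # bagging: average across all the models, like rowMeans(predictions)
    return sum(preds) / n_models

estimate = bagged_mean([4.0, 7.0, 7.0, 9.0, 8.0, 2.0, 6.0, 5.0])
```

Each individual bootstrap model is noisy, but the averaged prediction settles close to the underlying mean, which is the point of bagging.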
  • 24. Simple Ensembles – Gradient Boosting Machines
    gbmMod <- gbm(eqn, train, n.trees=10000, shrinkage=0.002,
                  distribution="gaussian", interaction.depth=7,
                  bag.fraction=0.9, n.minobsinnode=50)
    Similar to bagging, boosting also creates an ensemble of classifiers by resampling the data, which are then combined by majority voting. However, in boosting, resampling is strategically geared to provide the most informative training data for each consecutive classifier.
  • 25. Simple Ensembles – Random Forest
    rf <- foreach(ntree=rep(333, 3), .combine=combine, .packages='randomForest') %dopar%
      randomForest(train[, 3:length(train)], train$Act, ntree=ntree,
                   do.trace=1000, mtry=round(colNumber/3), replace=FALSE,
                   nodesize=5, na.action=na.omit)
  • 26. Ensemble of Ensembles 1. Mean(RF+GBM+BagGLM) 2. Median(RF+GBM+BagGLM) 3. 0.4*RF+0.4*GBM+0.2*BagGLM
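The three combination rules on slide 26 are simple enough to spell out. Here is a minimal pure-Python sketch (a hypothetical helper, not from the deck) applied to one row of predictions from the three constituent models:

```python
import statistics

def combine(rf, gbm, bag_glm, weights=(0.4, 0.4, 0.2)):
    """Blend one row of predictions the three ways listed on the slide."""
    preds = (rf, gbm, bag_glm)
    return {
        "mean": statistics.mean(preds),
        "median": statistics.median(preds),
        # fixed-weight blend: 0.4*RF + 0.4*GBM + 0.2*BagGLM
        "weighted": sum(w * p for w, p in zip(weights, preds)),
    }

blend = combine(rf=0.8, gbm=0.6, bag_glm=0.9)
```

In practice the weights in option 3 would come from cross-validation rather than being fixed, as the speaker notes mention.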
  • 27. Ensembles – Why it matters
    Improve accuracy: Ensembles tend to yield better results than their constituent models when there is significant diversity among the models, and developing multiple simple models is faster than attempting to develop the perfect model.
    More resistance to overfitting: Less reliant on any single model.
    Concurrent development: Different models can be run and developed on different instances/machines by different data scientists.
  • 28. Ensembles – point of stupidity
    Netflix Prize $1 million winner: an ensemble of 107 models for a 10% improvement. Too complicated, costly and inflexible to change.
    Actual deployment: an ensemble of 2 models for an 8.43% improvement.
    Moral of the story: a good enough ensemble is good enough.
  • 29. Good Enough Analytics Big data analytics using cost efficient tools and good enough ensemble of models
  • 30. The Good Enough Stuff Data Optimization
  • 31. Data cleaning vs Data optimization Important but I assume you know Done AFTER data cleaning
  • 32. Kaggle Medical Drug Competition 15 sets of data Each data set: 1,000 to 2,000 Attributes 500 to 20,000 Rows Qn: Identify rogue drugs
  • 33. Point of stupidity: Trying to run analysis on all attributes
    Drug | Rogue % | Company | Color | Component 1 | Component 2…2000
    A    | 0.0400  | XYZ     | Red   | 200         | 30
    B    | 0.0002  | XYZ     | Green | 920         | 50
    C    | 0.8000  | XYZ     | Blue  | 30          | 1000
    D    | ?       | XYZ     | Red   | 340         | 800
  • 34. (same table) Not all attributes are born equal: no variance, irrelevant, or simply too many attributes.
  • 35. Remove no variance / near zero variance attributes
    Drug | Rogue % | Company
    A    | 0.0400  | XYZ
    B    | 0.0002  | XYZ
    C    | 0.8000  | XYZ
    D    | ?       | XYZ
    R code:
    library(caret)
    healthdata[nearZeroVar(healthdata, freqCut = 95/5, uniqueCut = 10)] <- list(NULL)
    <- the Company attribute does not help in differentiating between the drugs
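The nearZeroVar test can be illustrated outside of caret. Below is a hypothetical pure-Python analogue of the freqCut/uniqueCut logic used above: a column is flagged when its most frequent value dominates the second most frequent one and it has few distinct values relative to its length.

```python
from collections import Counter

def near_zero_var(column, freq_cut=95 / 5, unique_cut=10.0):
    """Return True if the column has zero or near-zero variance."""
    counts = Counter(column).most_common()
    if len(counts) == 1:
        return True  # zero variance: a single repeated value
    freq_ratio = counts[0][1] / counts[1][1]        # most common vs. second
    pct_unique = 100.0 * len(counts) / len(column)  # distinct values, in %
    return freq_ratio > freq_cut and pct_unique < unique_cut

company = ["XYZ"] * 100          # constant column -> drop
maker = ["XYZ"] * 99 + ["ABC"]   # one outlier in a hundred -> drop
comp1 = list(range(100))         # varied, potentially informative -> keep
```

The maker column shows why the outlier caveat from the speaker notes matters: for fraud-style problems you might deliberately keep such columns.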
  • 36. Remove not important attributes
    Drug | Rogue % | Color
    A    | 0.0400  | Red
    B    | 0.0002  | Green
    C    | 0.8000  | Blue
    D    | ?       | Red
    R code for Random Forest: importanceScore <- importance(myMod)
    R code for GBM: importanceScore <- summary.gbm(myMod, ntree)
    <- the Color attribute has no relevance to the % of rogue drugs
  • 37. Attribute reduction using Principal Component Analysis
    Drug | Rogue % | Component 1 | Component 2…2000
    A    | 0.0400  | 200 | 30
    B    | 0.0002  | 920 | 50
    C    | 0.8000  | 30  | 1000
    D    | ?       | 340 | 800
    R code: pc <- prcomp(train[, 2:length(train)], tol=0.12)
    <- too many attributes take very long to run analysis
  • 38. Attribute reduction using Principal Component Analysis. Andrew Ng: Always try analysis without PCA first. [slide visual: scatter of points on Attribute 1 vs Attribute 2 axes] (Andrew Ng: Machine Learning Course – refer to References)
  • 39. [slide visual: the same points with the best-fit Principal Component line drawn through them]
  • 40. [slide visual: a more scattered set of points on Attribute 1 vs Attribute 2 axes]
  • 41. The 1D red line and points are now representative of the 2D graph. [slide visual: the points projected onto the Principal Component line]
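The projection idea in slides 38-41 can be sketched without any library. This pure-Python toy PCA (a hypothetical helper, for illustration only) finds the best-fit direction of 2-D points by power iteration on their covariance matrix, then projects each point onto that direction, compressing two attributes into one.

```python
def pca_1d(points, iters=100):
    """Project 2-D points onto their first principal component."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    # entries of the 2x2 covariance matrix
    cxx = sum(x * x for x, _ in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    cyy = sum(y * y for _, y in centered) / n
    # power iteration converges to the leading eigenvector (the red line)
    vx, vy = 1.0, 1.0
    for _ in range(iters):
        nx = cxx * vx + cxy * vy
        ny = cxy * vx + cyy * vy
        norm = (nx * nx + ny * ny) ** 0.5
        vx, vy = nx / norm, ny / norm
    # dot product = position of each point along the principal component
    return [x * vx + y * vy for x, y in centered]

scores = pca_1d([(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.9)])
```

The four 2-D points collapse into four 1-D scores that preserve their ordering along the best-fit line, which is exactly the compression prcomp performs at scale.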
  • 42. Data Optimization – Why it matters
    Performance improvement (importance, nearZeroVar): Cut down attributes which are useless or not “good enough”. More accurate and complex models can be built on the attributes that matter.
    Cost savings (PCA): Less data needs to be processed, faster turnaround for models and results.
  • 43. Good Enough Analytics Big data analytics using cost efficient tools and good enough ensemble of models based on optimized data
  • 44. The Good Enough Stuff Scaling on cloud
  • 45. Why use Cloud: How often do you really need a multimillion-dollar machine to be on standby 24/7 to churn data? Do you really need real time analytics, or is an hourly/daily/weekly/monthly report good enough?
  • 46. Cloud – Why it matters
    Excellent bang for the buck: <$5/hr to rent a million dollars’ worth of power. No need to purchase/maintain hardware. Scale on demand.
    Great for ensemble modeling: You can start multiple instances, each running one simple model, and ensemble them.
    But beware of data security and privacy laws: Not suitable for all kinds of data/applications. For example, Amazon Web Services is HIPAA compliant but Rackspace is not.
  • 47. Prepare data for the cloud
    Name  | Age | Income | Postal
    Peter | 23  | $2,000 | 400573
    Sally | 11  | $0     | 520028
    Paul  | 70  | $500   | 521201
    Mark  | 30  | $8,000 | 247392
  • 48. Prepare data for the cloud
    Name  | Age | Age Group | Income | Income Range  | Postal | Postal Area
    Peter | 23  | Youth     | $2,000 | $1,000-$3,000 | 400*** | Eunos
    Sally | 11  | Child     | $0     | $0            | 520*** | Simei
    Paul  | 70  | Senior    | $500   | $1-$1,000     | 521*** | Tampines
    Mark  | 30  | Adult     | $8,000 | >$5,000       | 247*** | Tanglin
    Techniques: Remove identity, masking, use general category (rollup), use range category.
    Reference: Dr. Yap Ghim Eng (A*Star)
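The masking and rollup steps above can be sketched as a small Python function. The category boundaries here are assumptions chosen to match the slide's example rows, not a published standard:

```python
def anonymize(record):
    """Drop the identity, bucket age and income, and mask the postal code."""
    age = record["age"]
    age_group = ("Child" if age < 13 else
                 "Youth" if age < 25 else
                 "Adult" if age < 65 else "Senior")
    income = record["income"]
    if income == 0:
        income_range = "$0"
    elif income <= 1000:
        income_range = "$1-$1,000"
    elif income <= 3000:
        income_range = "$1,000-$3,000"
    else:
        income_range = ">$5,000"
    return {
        "age_group": age_group,                       # rollup; name removed
        "income_range": income_range,                 # range category
        "postal_area": record["postal"][:3] + "***",  # masking
    }

row = anonymize({"name": "Peter", "age": 23, "income": 2000, "postal": "400573"})
```

The output keeps enough structure for modeling while no single field uniquely identifies the individual, which is the point of the slide.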
  • 49. Good Enough Analytics Big data analytics using cost efficient tools and good enough ensemble of models based on optimized data, scaled on cloud services
  • 50. The Good Enough Stuff …that we have no time for Amazon Web Service
  • 51. Basic code to set up an Amazon instance for analytics
    sudo yum install gcc gcc-c++ gcc-gfortran readline-devel python-devel make atlas blas
    sudo yum install -y lapack-devel blas-devel
    wget http://cran.at.r-project.org/src/base/R-2/R-2.15.2.tar.gz
    tar -xf R-2.15.2.tar.gz
    cd R-2.15.2
    ./configure --with-x=no
    sudo make
    PATH=$PATH:~/R-2.15.2/bin/
    cd ..
    wget http://sourceforge.net/projects/numpy/files/NumPy/1.6.2/numpy-1.6.2.tar.gz/download
    tar -xzf numpy-1.6.2.tar.gz
    cd numpy-1.6.2
    sudo python setup.py install
    cd ..
    wget http://sourceforge.net/projects/scipy/files/scipy/0.11.0/scipy-0.11.0.tar.gz/download
    tar -xzf scipy-0.11.0.tar.gz
    cd scipy-0.11.0
    sudo python setup.py install
    cd ..
    wget http://pypi.python.org/packages/source/n/nose/nose-1.1.2.tar.gz#md5=144f237b615e23f21f6a50b2183aa817
    tar -xzf nose-1.1.2.tar.gz
    cd nose-1.1.2
    sudo python setup.py install
    After sudo-ing and running R, type:
    install.packages('gbm')
    install.packages('randomForest')
    To leave R or Python jobs running while you are not logged on: nohup R CMD BATCH myfile.r &
  • 52. Amazon EC2 Spot Instance Cluster Compute Eight Extra Large 60.5 GiB memory, 88 EC2 Compute Units, 3370 GB of local instance storage, 64-bit platform, 10 Gigabit Ethernet $0.27 per hour High-Memory Quadruple Extra Large Instance 68.4 GiB of memory, 26 EC2 Compute Units (8 virtual cores with 3.25 EC2 Compute Units each), 1690 GB of local instance storage, 64-bit platform $0.14 per hour
  • 53. Weakness of Spot Instance Bidding system. If your bid < spot instance price, instance will be terminated. Solutions: 1) Put master on normal cloud instance and slave on spot instance 2) Heartbeat + Queue with Checkpoint
  • 54. The Good Enough Stuff …that we have no time for PCA with KNN
  • 55. Principal Component Analysis – With K-Nearest Neighbor
    library(FNN)
    train <- read.csv("train.csv", header=TRUE)
    test <- read.csv("test.csv", header=TRUE)
    pc <- prcomp(train[, 2:length(train)], tol=0.12)
    mydata <- data.frame(label = train[, "label"], pc$x)
    labels <- mydata[,1]
    mydata2 <- mydata[,-1]
    test.p <- predict(pc, newdata = test)
    results <- (0:9)[knn(mydata2, data.frame(test.p), labels, k = 1, algorithm="cover_tree")]
    write(results, file="knn_PCA.csv", ncolumns=1)
  • 56. The Good Enough Stuff …that we have no time for Data Chunking
  • 57. Data Chunking – Revolution R
    Uses a format called XDF, loosely based on NoSQL. The XDF format is a binary file format that stores data in blocks and processes data in chunks (groups of blocks) for efficient reading of arbitrary columns and contiguous rows. For more details, visit the Revolution R website.
  • 58. Data Chunking – Why it matters
    # Chunk 6.5GB worth of data onto HDD in XDF
    rxImport(inData = trainFile, outFile = "trainingData.xdf")
    # RevR provides methods like rxGlm to run a huge Poisson regression directly on the XDF file
    myPos <- rxGlm(amount2 ~ Mailed+Donated+RR, data="trainingData", family=poisson())
    *This cannot be done using normal R on my laptop, as R tries to load the entire dataset into memory
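The principle behind XDF's block-wise processing can be shown language-agnostically. This pure-Python sketch (a hypothetical helper) computes a column mean by streaming fixed-size chunks and keeping only running totals, so the full dataset never has to sit in memory at once:

```python
def chunked_mean(values, chunk_size=1000):
    """Mean of a column computed one chunk at a time, not all at once."""
    total, count = 0.0, 0
    for start in range(0, len(values), chunk_size):
        # read one block; in XDF this would come from disk, not a list slice
        chunk = values[start:start + chunk_size]
        total += sum(chunk)
        count += len(chunk)
    return total / count

avg = chunked_mean(list(range(10000)), chunk_size=256)
```

Only the running total and count live in memory between chunks, which is why a 6.5GB file can be processed on a laptop that could never hold it as a single in-memory data frame.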
  • 59. Data Chunking – Speeding it up using SSD instead of normal HDD. RAM: fast but expensive. SSD: ~4x faster than a normal HDD when chunking.
  • 60. The Good Enough Stuff …that we have no time for Multicore
  • 61. Multicore Processing – Revolution R
    library(foreach)
    library(doSNOW)
    cluster <- makeCluster(3, type = "SOCK")
    registerDoSNOW(cluster)
    setMKLthreads(1)
    predictions <- foreach(1:1000, .combine=cbind) %dopar% {
      training_positions <- sample(nrow(train), size=floor(nrow(train)*0.9), replace=TRUE)
      train_pos <- 1:nrow(train) %in% training_positions
      glmMod <- rxLinMod(eqn, train[train_pos,])
      rxPredict(glmMod, test, type="response")
    }
    result <- rowMeans(predictions)
  • 62. Multicore Processing – Why it matters
    License cost (usually charged per CPU): 1 CPU with 4 cores = 1 single-user license; distributed 4 CPUs with 1 core each = 4 licenses or a group license.
    Performance improvement: ~2x performance for 3 cores vs 1 core.
  • 64. Good Enough References
    Random Forest: Obtaining knowledge from a random forest · Suggestions for speeding up Random Forests · Random Forest with classes that are very unbalanced
    GBM: Define boosting · Generalized Boosted Models: A guide to the gbm package · What are some useful guidelines for GBM parameters? · R gbm logistic regression · How to win the KDD Cup Challenge with R and gbm
    Ensembles: Ensemble learning introduction · Exploiting Diversity in Ensembles: Improving the Performance on Unbalanced Datasets · Resources for learning how to implement ensemble methods · Ensemble methods · Intro to ensemble learning in R · Predictive analytics & decision tree
  • 65. Good Enough References
    PCA and NearZero: Principal Component Analysis in R · PCA on high dimensional data · PCA on training and test data · Nearzero R caret library
    Misc: Andrew Ng’s Machine Learning Course · A Few Useful Things to Know about Machine Learning · Creating HIPAA-Compliant Medical Data Applications With AWS · Amazon EC2 Spot Instances · Improve Predictive Performance in R with Bagging · Kaggle: Visualizing dark world · Kaggle: Visualizing handwriting
  • 66. Good Enough Analytics Big data analytics using cost efficient tools and good enough ensemble of models based on optimized data, scaled on cloud services
  • 67. Qns? Email me @ thiakx@gmail.com LinkedIn Profile Kaggle Profile Good Enough Analytics Big data analytics using cost efficient tools and good enough ensemble of models based on optimized data, scaled on cloud services Asia?
  • 68. Photo Credits
    Slide 2: http://3.bp.blogspot.com/-nkP_UHgebKo/T70GJ3ezCrI/AAAAAAAAAZc/mWD6RsDlz6Y/s1600/IMG_0349.JPG
    Slide 3: http://www.salesmanagementmastery.com/wp-content/uploads/2010/09/money-flying.jpg
    Slide 5: http://www.pachd.com/free-images/household-images/spoon-01.jpg
    Slide 6: http://www.bhmpics.com/view-rice_in_a_wooden_spoon-1440x900.html
    Slide 7: http://2.bp.blogspot.com/-Oj7ji_8CB3Q/TkvdFXAYUcI/AAAAAAAADgQ/XcevbehpPHU/s1600/Big+spoon+3.jpg
    Slide 8: http://familyhelpers.files.wordpress.com/2012/03/spoon.jpg
    Slide 11 (Lemon): http://miamiaromatherapy.com/shopping/images/70//Lemon-2.jpg
    Slide 12 (Bank): http://www.psdgraphics.com/wp-content/uploads/2011/03/bank-icon.jpg
    Slide 11/12 (Logos): http://commons.wikimedia.org/wiki/Main_Page
    Slide 19-21: www.scholarpedia.org
    Slide 23/25: www.wikipedia.org
    Slide 32: http://www.chipandco.com/wp-content/uploads/2012/08/medicine.jpg
    Slide 63: www.kaggle.com

Notas do Editor

  1. Hello, I am here to share a little of what I have learnt so far regarding big data analytics. I do not claim to be a data scientist nor a statistician nor an expert in big data technologies. Which is good, because many experts become too focused on their particular domain, be it data modeling, stats, or big data tools. Instead, I hope I will be able to get everyone to think of big data analytics as a process with numerous components where data scientists, statisticians and big data technologists can work together to achieve the goal of good enough analytics.
  2. Imagine you are a lemonade stand owner. What is good enough analytics for you? Perhaps some simple analytics on excel or open source programs to determine the optimum price, weather condition and position to sell the lemonade? Or perhaps you just set a price and call it a day?
  3. What is good enough analytics if you are working in the fraud detection section of a bank? Well, you will probably need to run some sophisticated fraud pattern detection algorithms, buy some large, expensive analytical tools and hardware, and hire a good team of data scientists. So inherently, we do know roughly what good enough analytics is for different scenarios. What I am trying to do here is to give good enough analytics a clearer structure for us to think about analytical problems.
  4. To begin, I would like to talk about analytical tools
  5. Analytical tools are like spoons. There are established, standard spoons that we use for everyday purpose
  6. There are also niche, special purpose spoons to bring out the best flavor of the food
  7. Sometimes you need a big spoon
  8. Other times you need a baby spoon. The point is, just like you will never swear by a spoon, you should not be too caught up with picking the best analytical tool. It is important for big data people to be open minded; just like spoons, every analytical tool has its different purpose and usage according to the business scenario. Throughout this presentation, I will try to be as platform agnostic as possible, presenting general ideas that will work no matter whether you are using SAS, Greenplum, IBM, Oracle, Revolution R, etc.
  9. Here we have a usefulness vs cost graph of analytical tools. The graph is shaped this way due to the law of diminishing returns. Initially, you see great returns on investment when you start purchasing new analytical tools. As you progress further up the graph, however, you will need more and more expensive tools and experienced data scientists to uncover the less obvious trends/traits in your data. Axes: *Usefulness in the form of helping the business make decisions. *Cost in the form of hardware/software/manpower.
  10. And like all good graphs, there is always a point of stupidity, where you will face rapidly diminishing usefulness for the price you pay.
  11. For the lemonade stand business, the point of stupidity is when you decide to setup a hadoop cluster to counter the other lemonade stalls around you instead of just selling something else or changing location
  12. For the bank, it will be less obvious. Will you buy a supercomputer? Or a 3rd Oracle rack? It all depends. Sometimes it is worth it to push the limits of analytics, other times it is not. It depends on the situation, the company and experience. Nevertheless, it is important to know the point of stupidity exists and always ask yourself: “am I approaching the point of stupidity or do I just need the extra edge for the breakthrough?”
  13. Good enough analytics is simply analytics that lies before the point of stupidity. It is the cost efficient solution. As we saw from the previous example, the point of stupidity for the lemonade stand and the bank is very different, and hence so is their line of good enough analytics. And what is stupid today may not be stupid tomorrow. Things change: more budget comes in, new challengers enter the market, and now we really need to move the point of stupidity further up the curve. That’s normal.
  14. So here we have the first definition of Good Enough Analytics
  15. Moving beyond tools, now I will like to talk about models
  16. The “perfect model” is the kind of model overzealous A+ students come up with while in school. Real world big data, on the other hand, is too complex an animal for anyone to come up with a perfect model for.
  17. You cannot build a perfect model because there are things we simply do not know we don’t know. We will never have perfect data nor perfect understanding of the big data. And knowing that is kind of liberating as the goal will no longer be the impossible goal of seeking perfection, instead, it becomes the constant improvement towards a good enough answer. We should fear perfection because if perfection is attained, data scientists will be out of job.
  18. This is a graphical example. Each model has its own decision boundary and errors. When we combine them into an ensemble, their extremities average out and we obtain a much better result than any individual model. (The final ensemble in the image shows a perfect result, but in the real world we probably won’t be so lucky.)
  19. In this case, this complex 4 shaped data cannot be easily represented/detected by any single model
  20. An ensemble of multiple models, each forming a piece of the 4 sides, might come close in representing the actual data.
  21. Beyond all the theorycraft and babies, I would like to show you some actual code and statistical theories. I will be using R code here but really, the concepts presented here can easily be done in Python, SAS code or whatever favorite language you like. Again, I am no statistician nor the world’s number one data scientist, so I will try my best to give a good enough explanation of the concepts.
  22. Bagging simply means we repeat the model multiple times, each time sampling randomly with replacement a portion of the data to prevent overfitting and get closer to the “truth”. One of the fastest among simple ensemble models, it is also the least accurate unless the data is linear. In the code, what we trying to do is to obtain the mean of 1000 linear models, each linear model built on a random sample of 90% of the data.
  23. GBM has great accuracy but is hard to tweak and understand, and is slower (I can’t find a way to run it concurrently/multinode). It is a stronger version of bagging: at each step of resampling, instead of always picking 90% of the data randomly, it smartly selects the subset of data with the most information gain. In essence, each iteration of boosting creates three weak classifiers: the first classifier C1 is trained with a random subset of the available training data. The training data subset for the second classifier C2 is chosen as the most informative subset, given C1. Specifically, C2 is trained on training data only half of which is correctly classified by C1, and the other half is misclassified. The third classifier C3 is trained with instances on which C1 and C2 disagree. The three classifiers are combined through a three-way majority vote. interaction.depth: the maximum depth of variable interactions; 1 implies an additive model, 2 implies a model with up to 2-way interactions, etc. n.minobsinnode: minimum number of observations in the trees’ terminal nodes; note that this is the actual number of observations, not the total weight. Shrinkage: also known as the learning rate or step-size reduction; it modifies the update rule and regularizes the model. Smaller shrinkage results in an improvement in accuracy but more iterations are needed. Refer to this link for more details: http://www.scholarpedia.org/article/Ensemble_learning#Bagging
  24. Random forest is simply a collection of decision trees. In the code, each decision tree will be based on a random sample of 1/3 of the columns, and the trees will be combined using a voting system or mode across the 999 trees. Random forests have a good balance of speed and accuracy.
  25. There is still a lot to learn, but I got to the top 1% of Kaggle using this. The ensemble of ensembles method is pretty good. The toughest challenge is in the 3rd method. You will need to run multiple cross validations and have a regression model on top of it to determine the optimum ratio of the models to minimize the mean square error. It is quite some work.
  26. One of the key point to note here is the need for diversity among constituent models to have better results. The idea here is to look through the problem from different perspective and methodologies .
  27. An example from the kaggle competition I participated in. While the data itself is not strictly big data per se, the challenge lies in the fairly large number of attributes/columns
  28. These columns are not the actual data set, I added some labeling to better illustrate my point.
  29. The main problem is that these useless columns will be held in RAM/HDD and waste compute power as models run through them and ignore them. The nearZeroVar function in R removes the columns with zero (like the column in yellow) or near-zero variance. The near-zero variance part needs some explanation. For example, if out of 1 million drugs, one drug happens to be manufactured by company ABC and that drug happens to be a rogue drug (poisons the patient instead of curing him), can we generalize that all drugs produced by company ABC are rogue drugs? Of course not. By tweaking freqCut and/or uniqueCut, we can set the cutoff point for such outliers and remove them to increase the accuracy of the model. Although… there are times when we want to keep these outliers, for example to train fraud detection models.
  30. A blue pill or red pill does not determine the potency of the drug (not The Matrix). The idea here is simple. It is just one line of code in R, and from here you can customize to select only the important variables and shave out the less important columns. Doing so correctly should slightly increase the accuracy and greatly reduce the time taken to run analysis. Here are the definitions of the variable importance measures. The first measure is computed from permuting OOB data: for each tree, the prediction error on the out-of-bag portion of the data is recorded (error rate for classification, MSE for regression). Then the same is done after permuting each predictor variable. The difference between the two is then averaged over all trees, and normalized by the standard deviation of the differences.
  31. Next, I want to compress the data, reducing the number of columns without removing the attributes completely. Most of the time, the top 10% of principal components account for 90% of the variance in the data, so you can compress 90% of the data and still keep the data model somewhat intact. Although it is only one line of code in R, PCA is a very powerful way to greatly decrease analysis time and to visualize complex high dimension data. I would like to explain how it works in the following slides. tol: a value indicating the magnitude below which components should be omitted. (Components are omitted if their standard deviations are less than or equal to tol times the standard deviation of the first component.) With the default null setting, no components are omitted. Other settings for tol could be tol = 0 or tol = sqrt(.Machine$double.eps), which would omit essentially constant components.
  32. Andrew Ng: always try the analysis without PCA first. The reason is that by reducing attributes based on 0.90, 0.95, or 0.99 of the variance, we are still losing accuracy, however little. PCA does speed up processing considerably, though, so we can apply stronger models to compensate for the lost accuracy. Also, because PCA merges attributes together, the result is much harder to use and understand than importance(). If I ask you to represent the data with a single line, how would you do it? Probably you would just laugh and draw a best-fit line across the points (the red line)
  33. The red line (the best-fit line) is the principal component that represents the two attributes. We can now remove the original attributes: we have effectively compressed two attributes into one principal component
  34. Now we have slightly more complex data. Again, to best represent it, you would probably draw a best-fit line across the points (the red line). The blue lines are the shortest distances from the points to the best-fit line
  35. We now project the points along the blue lines onto the red line, and here we can see the principal component that represents the two attributes. Again, we have effectively compressed two attributes into one principal component
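The projection in slides 33–35 is exactly what prcomp's scores are. A small sketch with made-up two-attribute data: project onto the first component (the red line), then map back and measure how little was lost.

```r
# Projecting 2-D points onto the first principal component and back, in base R.
# The toy data (columns a and b, strongly correlated) is made up for this demo.
set.seed(1)
x   <- rnorm(100)
pts <- cbind(a = x, b = 2 * x + rnorm(100, sd = 0.3))

pca    <- prcomp(pts, center = TRUE)
scores <- pca$x[, 1, drop = FALSE]          # one number per point: the red line
# Reconstruct the 2-D coordinates from that single component:
recon <- scores %*% t(pca$rotation[, 1, drop = FALSE])
recon <- sweep(recon, 2, pca$center, `+`)
mean((pts - recon)^2)                        # small: the blue-line distances
```

The reconstruction error is just the (squared) blue-line distances from slide 34, which is why points close to the best-fit line compress almost losslessly.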
  36. I will not embarrass myself in front of the cloud experts who presented before me; they have done a great job explaining why and how to do analytics in the cloud. In essence, unless you need real-time analytics, good enough analytics on an hourly/daily basis is much more manageable and fits the on-demand nature of the cloud: we only pay for the time needed to run the analysis.
  37. Here we will talk about preparing data for the cloud. We cannot just ship our entire database onto the cloud and pray that we don’t get sued by our clients.
  38. Here are some ways we can massage the data. The main idea is to remove any feature that can uniquely identify an individual. Of course, the data will also need to be encrypted, both at rest in the cloud database and in flight during data transfer.
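A minimal base-R sketch of that de-identification step before upload: drop direct identifiers and replace the real ID with an opaque study ID, keeping the real-ID mapping on-premise. The table and its column names (name, nric, diagnosis) are hypothetical.

```r
# De-identifying a table before shipping it to the cloud (hypothetical columns).
records <- data.frame(
  name      = c("Alice Tan", "Bob Lim", "Alice Tan"),
  nric      = c("S1234567A", "S7654321B", "S1234567A"),
  diagnosis = c("flu", "asthma", "flu"),
  stringsAsFactors = FALSE
)

pseudonymize <- function(df, id_col, drop_cols) {
  ids <- df[[id_col]]
  df  <- df[, setdiff(names(df), c(id_col, drop_cols)), drop = FALSE]
  # Same person -> same opaque subject number; the id mapping never leaves us.
  df$subject <- match(ids, unique(ids))
  df
}
clean <- pseudonymize(records, id_col = "nric", drop_cols = "name")
clean   # diagnosis plus subject 1, 2, 1 — analysis-ready, identity-free
```

A production pipeline would use a keyed hash rather than sequential numbers, but the principle is the same: nothing uploaded can be traced back to a person without the on-premise mapping.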
  39. Amazon spot instances are very popular among Kaggle players: cheap and powerful. *GiB is gibibytes, the actual usable amount; 60.5 GiB is roughly 64 GB of RAM.
  40. There are limitations to spot instances, and here are some common solutions people use.
  41. Sample workable code for KNN built on PCA.
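The slide's own code is not reproduced in these notes; a minimal sketch of the same idea (KNN on PCA scores) using base R's prcomp and the class package that ships with every standard R distribution might look like this, using the built-in iris data:

```r
# KNN built on PCA scores: fit PCA on training rows, project held-out rows
# into the same space, then classify with class::knn (bundled with R).
library(class)
set.seed(7)
train_idx <- sample(nrow(iris), 100)

pca   <- prcomp(iris[train_idx, 1:4], scale. = TRUE)
ncomp <- 2                                   # first two components cover ~96%
train <- pca$x[, 1:ncomp]
test  <- predict(pca, iris[-train_idx, 1:4])[, 1:ncomp]  # same PCA space

pred <- knn(train, test, cl = iris$Species[train_idx], k = 5)
mean(pred == iris$Species[-train_idx])       # held-out accuracy
```

Note that the test rows are projected with predict() on the training PCA; refitting PCA on the test set separately would put the two sets in different coordinate systems.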
  42. Revolution R's data chunking is powerful and allows big data analytics on small machines such as laptops
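Revolution R's XDF chunking is proprietary, but the underlying idea can be sketched in base R: stream the file in fixed-size blocks and keep running statistics, so only one chunk is ever in RAM. The file here is a temporary single-column CSV generated for the demo.

```r
# Chunked processing sketch in base R: a running mean over a file read in
# blocks of 10,000 rows (self-generated demo file with one numeric column).
csv <- tempfile(fileext = ".csv")
write.csv(data.frame(value = 1:1e5), csv, row.names = FALSE)

con <- file(csv, "r")
invisible(readLines(con, n = 1))          # consume the header line
total <- 0; count <- 0; chunk_rows <- 10000
repeat {
  lines <- readLines(con, n = chunk_rows)
  if (length(lines) == 0) break
  chunk <- as.numeric(lines)              # single column, so no CSV parsing needed
  total <- total + sum(chunk)
  count <- count + length(chunk)
}
close(con)
total / count                             # mean of all 100,000 rows
```

The same accumulate-per-chunk pattern generalizes to variances, cross-tabulations, and model sufficient statistics, which is what makes laptop-scale big data analytics workable.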
  43. Full multicore code in R
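The slide's multicore code is not included in these notes; in modern R the built-in parallel package covers this. A portable sketch using a socket cluster (which also works on Windows, unlike fork-based mclapply):

```r
# Multicore sketch with R's built-in parallel package: run a slow function
# across two worker processes with parLapply.
library(parallel)
cl <- makeCluster(2)                    # two worker processes

slow_square <- function(x) { Sys.sleep(0.01); x^2 }
res <- parLapply(cl, 1:8, slow_square)
stopCluster(cl)                         # always release the workers

unlist(res)                             # 1 4 9 16 25 36 49 64
```

For CPU-bound model fitting (the per-fold or per-tree loops in cross-validation and ensembles), swapping lapply for parLapply is often the entire change needed.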
  44. No time to talk about visualization, but it is important: at the start of the analysis, to help us understand the data; throughout the analysis, to help us understand our models; and at the end, when presenting the data and findings
  45. Some useful references for the various topics covered
  46. I have a sixth sense that good enough analytics will be a great fit for Asia. Most analytics tools and methodologies are developed in the West, mainly the USA, by big companies for big companies. I think there is a large market for applying good enough analytics to big data at the smaller, leaner Asian companies. Things in Asia work differently, and analytics here is, frankly, still in its infancy. Perhaps I will explore this more in future presentations.