Good Enough Analytics
by Kai Xin
The Good Enough Stuff
Analytical Tools
Analytical Tools are like spoons
[Slide visual: usefulness rises with tool size and complexity until it hits a "point of stupidity"]
What is stupid today might not be stupid tomorrow
Good Enough Analytics
Big data analytics using cost efficient tools
The Good Enough Stuff
Ensembles of good enough models
Point of stupidity: The perfect model
[Slide visuals: three slides filled with a wall of random digits, dramatizing the sheer volume of data a single "perfect" model would have to capture]
A "perfect" model is too complex, too costly to build, too hard to maintain, and too inflexible to change.
“There are known knowns;
there are things we know that we know.
There are known unknowns;
there are things that we now know we don't know.
But there are also unknown unknowns;
there are things we do not know we don't know.”
By Donald Rumsfeld, United States Secretary of Defense and Potential Data Scientist
Why the perfect model is stupid
“In statistics and machine
learning, ensemble
methods use multiple
models to obtain better
predictive performance
than could be obtained
from any of the
constituent models”
Good Enough Analytics: Ensembles
[Slide visual: the same wall of digits broken into three smaller blocks joined by "+" signs: each block of data gets its own good enough model, and the models are combined]
scholarpedia.org
Refer to References
The Serious Stuff
…beyond theorycraft
Simple Ensembles – GLM Bootstrap Aggregating (Bagging)
predictions <- foreach(i=1:1000, .combine=cbind) %dopar% {
  training_positions <- sample(nrow(train),
    size=floor(nrow(train)*0.9), replace=TRUE)
  train_pos <- 1:nrow(train) %in% training_positions
  glmMod <- rxLinMod(eqn, train[train_pos,])
  rxPredict(glmMod, test, type="response")
}
result <- rowMeans(predictions)
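The R snippet above relies on Revolution R's rxLinMod, but the bagging idea itself is language-agnostic. Below is a minimal, self-contained Python sketch (all names are illustrative, not from the slides): resample the training set with replacement, fit one model per sample, and average the predictions.

```python
import random
import statistics

def bag_predict(train, test_x, fit, predict, n_models=100, frac=0.9, seed=42):
    """Bootstrap aggregating: fit n_models on resampled subsets of train,
    then average their predictions for test_x."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_models):
        # Sample with replacement, taking frac of the training rows each time.
        sample = [train[rng.randrange(len(train))]
                  for _ in range(int(len(train) * frac))]
        preds.append(predict(fit(sample), test_x))
    return statistics.mean(preds)

# Toy usage: the "model" is just the mean y of the bootstrap sample.
train = [(i, 2.0 * i) for i in range(100)]
fit = lambda sample: statistics.mean(y for _, y in sample)
predict = lambda model, x: model
estimate = bag_predict(train, 0, fit, predict)
```

With a fixed seed, the toy run lands close to the true mean of 99, illustrating how averaging many weak fits stabilises the estimate.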
Simple Ensembles – Gradient Boosting Machines
gbmMod <- gbm(eqn, data=train, n.trees=10000,
  shrinkage=0.002, distribution="gaussian",
  interaction.depth=7,
  bag.fraction=0.9,
  n.minobsinnode=50)
Like bagging, boosting builds an ensemble of classifiers by resampling the data and combining them by majority voting. In boosting, however, resampling is strategically geared to provide the most informative training data for each successive classifier.
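That "strategically geared" resampling can be made concrete with the classic AdaBoost reweighting rule (an illustrative stand-in here; the slides' gbm package uses gradient boosting rather than AdaBoost): after each round, misclassified points are up-weighted so the next classifier focuses on them.

```python
import math

def adaboost_weights(weights, correct, error_rate):
    """One AdaBoost reweighting step: up-weight misclassified points,
    down-weight correctly classified ones, then renormalize to sum to 1."""
    alpha = 0.5 * math.log((1 - error_rate) / error_rate)
    new = [w * math.exp(-alpha if c else alpha)
           for w, c in zip(weights, correct)]
    total = sum(new)
    return [w / total for w in new]

# Four equally weighted points; one is misclassified at error rate 0.25.
w = adaboost_weights([0.25] * 4, [True, True, True, False], error_rate=0.25)
# The single misclassified point now carries half of the total weight,
# so the next classifier is forced to pay attention to it.
```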
Simple Ensembles - Random Forest
rf <- foreach(ntree=rep(333, 3), .combine=combine,
              .packages='randomForest') %dopar%
  randomForest(train[, 3:length(train)],
    train$Act, ntree=ntree, do.trace=1000,
    mtry=round(colNumber/3), replace=FALSE,
    nodesize=5, na.action=na.omit)
Ensemble of Ensembles
1. Mean(RF+GBM+BagGLM)
2. Median(RF+GBM+BagGLM)
3. 0.4*RF+0.4*GBM+0.2*BagGLM
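Each of these three blends is a one-liner in any language. A Python sketch with toy per-row predictions (the numbers are made up for illustration):

```python
import statistics

# Per-row predictions from three models (toy values, two test rows):
rf, gbm, bag_glm = [0.8, 0.1], [0.6, 0.3], [0.7, 0.2]

# 1. Mean of the three models, row by row
mean_blend = [statistics.mean(t) for t in zip(rf, gbm, bag_glm)]
# 2. Median of the three models, row by row
median_blend = [statistics.median(t) for t in zip(rf, gbm, bag_glm)]
# 3. Weighted blend: 0.4*RF + 0.4*GBM + 0.2*BagGLM
weighted = [0.4 * a + 0.4 * b + 0.2 * c
            for a, b, c in zip(rf, gbm, bag_glm)]
```

The median blend is the most robust to one wildly wrong model; the weighted blend lets you favour whichever models validate best.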
Ensembles – Why it matters
Improve accuracy
Ensembles tend to yield better results than their constituent models
when there is significant diversity among the models
Developing multiple simple models is faster
than attempting to develop the perfect model
More resistance to overfitting
Less reliant on any single model
Concurrent development
Different models can be developed and run on different
instances/machines by different data scientists
Ensembles – point of stupidity
Netflix prize 1 million dollar winner:
Ensemble of 107 models for 10% improvement
Too complicated, costly and inflexible to change
Actual deployment:
Ensemble of 2 models for 8.43% improvement
Moral of the story:
Good Enough Ensemble is good enough
Good Enough Analytics
Big data analytics using cost efficient tools
and good enough ensemble of models
The Good Enough Stuff
Data Optimization
Data cleaning vs Data optimization
Data cleaning: important, but I assume you know it
Data optimization: done AFTER data cleaning
Kaggle Medical Drug Competition
15 sets of data
Each data set:
1,000 to 2,000 Attributes
500 to 20,000 Rows
Qn: Identify rogue drugs
Point of stupidity:
Trying to run analysis on all attributes
Drug | Rogue % | Company | Color | Component 1 | Component 2…2000
A    | 0.0400  | XYZ     | Red   | 200         | 30
B    | 0.0002  | XYZ     | Green | 920         | 50
C    | 0.8000  | XYZ     | Blue  | 30          | 1000
D    | ?       | XYZ     | Red   | 340         | 800
Not all attributes are born equal: no variance, irrelevant, or simply too many attributes
Remove no variance / near zero variance attributes

Drug | Rogue % | Company
A    | 0.0400  | XYZ
B    | 0.0002  | XYZ
C    | 0.8000  | XYZ
D    | ?       | XYZ

The Company attribute never varies, so it does not help in differentiating between the drugs.

R code:
library(caret)
healthdata[nearZeroVar(healthdata, freqCut = 95/5,
  uniqueCut = 10)] <- list(NULL)
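caret's nearZeroVar flags columns whose most frequent value dominates the second most frequent (freqCut) and which have few distinct values relative to the number of rows (uniqueCut). A stdlib Python approximation of that rule, for intuition only (not a substitute for the caret implementation):

```python
def near_zero_var_cols(rows, freq_cut=95 / 5, unique_cut=10.0):
    """Flag near-zero-variance columns, loosely mimicking caret::nearZeroVar:
    a column is flagged when its most common value dominates (freq_ratio)
    AND it has few unique values relative to the sample size (pct_unique)."""
    flagged = []
    n = len(rows)
    for col in range(len(rows[0])):
        values = [r[col] for r in rows]
        counts = sorted((values.count(v) for v in set(values)), reverse=True)
        freq_ratio = counts[0] / counts[1] if len(counts) > 1 else float("inf")
        pct_unique = 100.0 * len(counts) / n
        if freq_ratio > freq_cut and pct_unique < unique_cut:
            flagged.append(col)
    return flagged

# Column 0 is constant ("XYZ" everywhere, like Company); column 1 varies.
data = [["XYZ", i] for i in range(100)]
flagged = near_zero_var_cols(data)  # only column 0 is flagged for removal
```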
Remove unimportant attributes

Drug | Rogue % | Color
A    | 0.0400  | Red
B    | 0.0002  | Green
C    | 0.8000  | Blue
D    | ?       | Red

The Color attribute has no relevance to the rogue drug %.

R code for Random Forest:
importanceScore <- importance(myMod)

R code for GBM:
importanceScore <- summary.gbm(myMod, ntree)
Attribute reduction using Principal Component Analysis

Drug | Rogue % | Component 1 | Component 2…2000
A    | 0.0400  | 200         | 30
B    | 0.0002  | 920         | 50
C    | 0.8000  | 30          | 1000
D    | ?       | 340         | 800

With ~2,000 components there are too many attributes, and the analysis takes very long to run.

R code:
pc <- prcomp(train[, 2:length(train)], tol=0.12)
Andrew Ng: Always try analysis without PCA first.
[Slide visuals: a 2D scatter plot on axes Attribute 1 and Attribute 2, with the points collapsed onto a single principal-component line. The 1D red line and points are now representative of the 2D graph.]
Andrew Ng: Machine Learning Course (refer to References)
Data Optimization – Why it matters
Performance improvement (importance, nearZeroVar)
Cut down attributes that are useless or not "good
enough". More accurate and complex models can
be built on the attributes that matter.
Cost savings (PCA)
Less data needs to be processed, so models and
results turn over faster.
Good Enough Analytics
Big data analytics using cost efficient tools
and good enough ensemble of models based
on optimized data
The Good Enough Stuff
Scaling on cloud
Why use Cloud
How often do you really need a
multimillion-dollar machine on standby
24/7 to churn data?
Do you really need real-time analytics, or
is an hourly/daily/weekly/monthly report
good enough?
Cloud – Why it matters
Excellent bang for the buck
<$5/hr to rent a million dollars' worth of computing power.
No need to purchase/maintain hardware. Scale on demand.
Great for ensemble modeling
You can start multiple instances, each instance running one
simple model, and ensemble them.
But beware of data security and privacy laws
Not suitable for all kinds of data/applications.
For example, Amazon Web Services is HIPAA compliant but
Rackspace is not.
Prepare data for the cloud

Name  | Age | Income | Postal
Peter | 23  | $2,000 | 400573
Sally | 11  | $0     | 520028
Paul  | 70  | $500   | 521201
Mark  | 30  | $8,000 | 247392
Prepare data for the cloud

Name  | Age | Age Group | Income | Income Range  | Postal | Postal Area
Peter | 23  | Youth     | $2,000 | $1,000-$3,000 | 400*** | Eunos
Sally | 11  | Child     | $0     | $0            | 520*** | Simei
Paul  | 70  | Senior    | $500   | $1-$1,000     | 521*** | Tampines
Mark  | 30  | Adult     | $8,000 | >$5,000       | 247*** | Tanglin

Techniques: remove identity, roll up to general categories, use range categories, and mask identifiers.
Reference: Dr. Yap Ghim Eng (A*Star)
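The masking and rollup steps in the table can be sketched as a simple record transformation. The bucket boundaries below are assumptions chosen to reproduce the example rows, not rules from the slides:

```python
def anonymize(record):
    """Masking + rollup sketch: drop identity, bucket age and income into
    general categories, and truncate the postal code to its area prefix."""
    age, income, postal = record["age"], record["income"], record["postal"]
    age_group = ("Child" if age < 13 else "Youth" if age < 25
                 else "Adult" if age < 65 else "Senior")
    income_range = ("$0" if income == 0 else "$1-$1,000" if income <= 1000
                    else "$1,000-$3,000" if income <= 3000
                    else "$3,000-$5,000" if income <= 5000 else ">$5,000")
    # Name is dropped entirely; postal keeps only the first 3 digits.
    return {"age_group": age_group, "income_range": income_range,
            "postal_area": postal[:3] + "***"}

row = anonymize({"name": "Peter", "age": 23, "income": 2000,
                 "postal": "400573"})
```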
Good Enough Analytics
Big data analytics using cost efficient tools
and good enough ensemble of models based
on optimized data, scaled on cloud services
The Good Enough Stuff
…that we have no time for
Amazon Web Service
sudo yum install gcc gcc-c++ gcc-gfortran readline-devel python-devel make atlas blas
sudo yum install -y lapack-devel blas-devel
wget http://cran.at.r-project.org/src/base/R-2/R-2.15.2.tar.gz
tar -xf R-2.15.2.tar.gz
cd R-2.15.2
./configure --with-x=no
sudo make
PATH=$PATH:~/R-2.15.2/bin/
cd ..
wget http://sourceforge.net/projects/numpy/files/NumPy/1.6.2/numpy-1.6.2.tar.gz/download
tar -xzf numpy-1.6.2.tar.gz
cd numpy-1.6.2
sudo python setup.py install
cd ..
wget http://sourceforge.net/projects/scipy/files/scipy/0.11.0/scipy-0.11.0.tar.gz/download
tar -xzf scipy-0.11.0.tar.gz
cd scipy-0.11.0
sudo python setup.py install
cd ..
wget http://pypi.python.org/packages/source/n/nose/nose-1.1.2.tar.gz#md5=144f237b615e23f21f6a50b2183aa817
tar -xzf nose-1.1.2.tar.gz
cd nose-1.1.2
sudo python setup.py install
Basic commands to set up an Amazon instance for analytics.

After the setup above, start R and install the modeling packages:
install.packages('gbm')
install.packages('randomForest')

To leave R or Python jobs running while you are not logged on: nohup R CMD BATCH myfile.r &
Amazon EC2 Spot Instance
Cluster Compute Eight Extra Large
60.5 GiB memory, 88 EC2 Compute Units, 3370 GB of
local instance storage, 64-bit platform, 10 Gigabit
Ethernet
$0.27 per hour
High-Memory Quadruple Extra Large Instance
68.4 GiB of memory, 26 EC2 Compute Units (8 virtual
cores with 3.25 EC2 Compute Units each), 1690 GB of
local instance storage, 64-bit platform
$0.14 per hour
Weakness of Spot Instances
Bidding system: if the spot price rises above your bid,
your instance will be terminated.
Solutions:
1) Put the master on a normal (on-demand) instance
and the slaves on spot instances
2) Heartbeat + queue with checkpointing
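Solution 2 can be as simple as persisting completed work to disk after each unit, so a replacement instance resumes where the terminated one left off instead of restarting. A minimal Python sketch with illustrative names:

```python
import json
import os
import tempfile

def run_with_checkpoint(items, work, path):
    """Resume-safe loop: completed results are written to a checkpoint file
    after every unit of work, so a terminated spot instance can pick up."""
    done = {}
    if os.path.exists(path):
        with open(path) as f:
            done = json.load(f)  # resume from the previous run's progress
    for key, item in items:
        if key in done:
            continue  # already processed before the interruption
        done[key] = work(item)
        with open(path, "w") as f:  # checkpoint after each unit of work
            json.dump(done, f)
    return done

path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
out = run_with_checkpoint([("a", 1), ("b", 2)], lambda x: x * x, path)
```

Running the same call again against the same checkpoint file skips both items, which is exactly the behaviour you want after a spot termination.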
The Good Enough Stuff
…that we have no time for
PCA with KNN
Principal Component Analysis with K-Nearest Neighbor

library(FNN)
train <- read.csv("train.csv", header=TRUE)
test <- read.csv("test.csv", header=TRUE)
pc <- prcomp(train[, 2:length(train)], tol=0.12)
mydata <- data.frame(label = train[, "label"], pc$x)
labels <- mydata[, 1]
mydata2 <- mydata[, -1]
test.p <- predict(pc, newdata = test)
results <- (0:9)[knn(mydata2, data.frame(test.p), labels, k = 1,
                     algorithm="cover_tree")]
write(results, file="knn_PCA.csv", ncolumns=1)
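For intuition, here is a dependency-free Python sketch of the same pipeline: find the leading principal component by power iteration, project training and query points onto it, and classify with 1-nearest-neighbour. This is a toy version of what prcomp + FNN::knn do, not a replacement for them.

```python
def first_pc(data, iters=200):
    """Leading principal component via power iteration on X^T X
    (unnormalized covariance of the mean-centered data)."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    v = [1.0] * d
    for _ in range(iters):
        scores = [sum(x * vi for x, vi in zip(row, v)) for row in centered]
        w = [sum(s * row[j] for s, row in zip(scores, centered))
             for j in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return means, v

def project(row, means, v):
    """Project a row onto the principal component (1-D score)."""
    return sum((x - m) * vi for x, m, vi in zip(row, means, v))

def knn1(train_proj, labels, q):
    """1-nearest-neighbour in the projected 1-D space."""
    return min(zip(train_proj, labels), key=lambda t: abs(t[0] - q))[1]

# Toy data along a diagonal: two clusters with labels 0 and 1.
train = [[0, 0], [1, 1], [9, 9], [10, 10]]
labels = [0, 0, 1, 1]
means, v = first_pc(train)
proj = [project(r, means, v) for r in train]
pred = knn1(proj, labels, project([8, 8], means, v))
```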
The Good Enough Stuff
…that we have no time for
Data Chunking
Data Chunking – Revolution R
Revolution R uses a binary format called XDF, loosely
based on NoSQL ideas. The XDF format stores data in
blocks and processes it in chunks (groups of blocks) for
efficient reading of arbitrary columns and contiguous rows.
For more details, visit the Revolution R website.
Data Chunking – Why it matters
# Chunk 6.5GB worth of data onto HDD in XDF format
rxImport(inData = trainFile, outFile = "trainingData.xdf")
# Revolution R provides methods like rxGlm to run a huge
# Poisson regression directly on the XDF file
myPos <- rxGlm(amount2 ~ Mailed + Donated + RR,
               data = "trainingData.xdf", family = poisson())
*This cannot be done in normal R on my laptop, as R tries to load the
entire dataset into memory
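Outside Revolution R, the same chunking idea applies anywhere: stream the data in fixed-size blocks so only one block is in memory at a time. A stdlib Python sketch (the CSV and column names are illustrative):

```python
import csv
import io
import itertools

def chunked_sum(csv_text, column, chunk_rows=2):
    """Stream a CSV in fixed-size chunks so only one chunk is in memory
    at a time — the same idea as processing XDF blocks instead of
    loading the whole dataset."""
    reader = csv.DictReader(io.StringIO(csv_text))
    total = 0.0
    while True:
        chunk = list(itertools.islice(reader, chunk_rows))
        if not chunk:
            break  # exhausted the stream
        total += sum(float(r[column]) for r in chunk)
    return total

data = "amount\n1\n2\n3\n4\n5\n"
total = chunked_sum(data, "amount")  # same answer as an in-memory sum
```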
Data Chunking – Speeding it up using SSD instead of normal HDD
RAM: fast but expensive
SSD: ~4x faster than a normal HDD when chunking
The Good Enough Stuff
…that we have no time for
Multicore
Multicore Processing – Revolution R
library(foreach)
library(doSNOW)
cluster <- makeCluster(3, type = "SOCK")
registerDoSNOW(cluster)
setMKLthreads(1)
predictions <- foreach(i=1:1000, .combine=cbind) %dopar% {
  training_positions <- sample(nrow(train),
    size=floor(nrow(train)*0.9), replace=TRUE)
  train_pos <- 1:nrow(train) %in% training_positions
  glmMod <- rxLinMod(eqn, train[train_pos,])
  rxPredict(glmMod, test, type="response")
}
result <- rowMeans(predictions)
Multicore Processing – Why it matters
License cost (usually charged per CPU)
1 CPU with 4 cores = 1 single-user license
4 distributed CPUs with 1 core each = 4 licenses or a group license
Performance improvement
~2x performance with 3 cores vs 1 core
Visualization
Good Enough References
Random Forest
•Obtaining knowledge from a random forest
•Suggestions for speeding up Random Forests
•Random Forest with classes that are very unbalanced
GBM
•Define boosting
•Generalized Boosted Models: A guide to the gbm package
•What are some useful guidelines for GBM parameters?
•R gbm logistic regression
•How to win the KDD Cup Challenge with R and gbm
Ensembles
•Ensemble learning introduction
•Exploiting Diversity in Ensembles: Improving the Performance on Unbalanced Datasets
•Resources for learning how to implement ensemble methods
•Ensemble methods
•Intro to ensemble learning in R
•Predictive analytics & decision tree
Good Enough References
PCA and NearZero
•Principal Component Analysis in R
•PCA on high dimensional data
•PCA on training and test data
•nearZeroVar (R caret library)
Misc
•Andrew Ng’s Machine Learning Course
•A Few Useful Things to Know about Machine Learning
•Creating HIPAA-Compliant Medical Data Applications With AWS
•Amazon EC2 Spot Instances
•Improve Predictive Performance in R with Bagging
•Kaggle: Visualizing dark world
•Kaggle: Visualizing handwriting
Good Enough Analytics
Big data analytics using cost efficient tools
and good enough ensemble of models based
on optimized data, scaled on cloud services
Questions? Email me at thiakx@gmail.com
LinkedIn Profile
Kaggle Profile
Asia?
Photo Credits
•Slide 2: http://3.bp.blogspot.com/-nkP_UHgebKo/T70GJ3ezCrI/AAAAAAAAAZc/mWD6RsDlz6Y/s1600/IMG_0349.JPG
•Slide 3: http://www.salesmanagementmastery.com/wp-content/uploads/2010/09/money-flying.jpg
•Slide 5: http://www.pachd.com/free-images/household-images/spoon-01.jpg
•Slide 6: http://www.bhmpics.com/view-rice_in_a_wooden_spoon-1440x900.html
•Slide 7: http://2.bp.blogspot.com/-Oj7ji_8CB3Q/TkvdFXAYUcI/AAAAAAAADgQ/XcevbehpPHU/s1600/Big+spoon+3.jpg
•Slide 8: http://familyhelpers.files.wordpress.com/2012/03/spoon.jpg
•Slide 11 (Lemon): http://miamiaromatherapy.com/shopping/images/70//Lemon-2.jpg
•Slide 12 (Bank): http://www.psdgraphics.com/wp-content/uploads/2011/03/bank-icon.jpg
•Slide 11/12 (Logos): http://commons.wikimedia.org/wiki/Main_Page
•Slide 19-21: www.scholarpedia.org
•Slide 23/25: www.wikipedia.org
•Slide 32: http://www.chipandco.com/wp-content/uploads/2012/08/medicine.jpg
•Slide 63: www.kaggle.com

Mais conteúdo relacionado

Mais procurados (6)

MULTIPLICACIONES PARA PRACTICAR CÁLCULO MENTAL
MULTIPLICACIONES PARA PRACTICAR CÁLCULO MENTALMULTIPLICACIONES PARA PRACTICAR CÁLCULO MENTAL
MULTIPLICACIONES PARA PRACTICAR CÁLCULO MENTAL
 
Eficiency and Low Cost: Pro Tips for you to save 50% of your money with Googl...
Eficiency and Low Cost: Pro Tips for you to save 50% of your money with Googl...Eficiency and Low Cost: Pro Tips for you to save 50% of your money with Googl...
Eficiency and Low Cost: Pro Tips for you to save 50% of your money with Googl...
 
Lks siswa
Lks siswaLks siswa
Lks siswa
 
سفارشات وابسته به دورهم اندیشی یهوه پدر من
سفارشات وابسته به دورهم اندیشی یهوه پدر منسفارشات وابسته به دورهم اندیشی یهوه پدر من
سفارشات وابسته به دورهم اندیشی یهوه پدر من
 
How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Mass...
How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Mass...How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Mass...
How Sumo Logic And Anki Build Highly Resilient Services On AWS To Manage Mass...
 
Effective Management
Effective ManagementEffective Management
Effective Management
 

Destaque (8)

Lecture6 - C4.5
Lecture6 - C4.5Lecture6 - C4.5
Lecture6 - C4.5
 
Decision Tree - C4.5&CART
Decision Tree - C4.5&CARTDecision Tree - C4.5&CART
Decision Tree - C4.5&CART
 
Id3,c4.5 algorithim
Id3,c4.5 algorithimId3,c4.5 algorithim
Id3,c4.5 algorithim
 
Decision tree Using c4.5 Algorithm
Decision tree Using c4.5 AlgorithmDecision tree Using c4.5 Algorithm
Decision tree Using c4.5 Algorithm
 
Algoritma C4.5 Dalam Data Mining
Algoritma C4.5 Dalam Data MiningAlgoritma C4.5 Dalam Data Mining
Algoritma C4.5 Dalam Data Mining
 
Lecture5 - C4.5
Lecture5 - C4.5Lecture5 - C4.5
Lecture5 - C4.5
 
Belajar mudah algoritma data mining c4.5
Belajar mudah algoritma data mining c4.5Belajar mudah algoritma data mining c4.5
Belajar mudah algoritma data mining c4.5
 
Machine Learning and Data Mining: 12 Classification Rules
Machine Learning and Data Mining: 12 Classification RulesMachine Learning and Data Mining: 12 Classification Rules
Machine Learning and Data Mining: 12 Classification Rules
 

Semelhante a Good Enough Analytics

4.4 M. Ángeles Rol
4.4 M. Ángeles Rol4.4 M. Ángeles Rol
4.4 M. Ángeles Rol
brnmomentum
 
Baremo course navette
Baremo course navetteBaremo course navette
Baremo course navette
JAESHUANMAR1
 
Resultados 3 série
Resultados 3 sérieResultados 3 série
Resultados 3 série
guest3492c4
 
Resultados 2 série
Resultados 2 sérieResultados 2 série
Resultados 2 série
guest3492c4
 
Resultados 1 série
Resultados 1 sérieResultados 1 série
Resultados 1 série
guest3492c4
 
Master tabel pengetahuan dan sikap (uji validitas)
Master tabel pengetahuan dan sikap (uji validitas)Master tabel pengetahuan dan sikap (uji validitas)
Master tabel pengetahuan dan sikap (uji validitas)
Chenk Alie Patrician
 
03 partes de la cara
03 partes de la cara03 partes de la cara
03 partes de la cara
b0rg1r
 
División por dos cifras
División por dos cifrasDivisión por dos cifras
División por dos cifras
mariagonper
 

Semelhante a Good Enough Analytics (20)

DSD-INT 2016 Urban water modelling - Meijer
DSD-INT 2016 Urban water modelling - MeijerDSD-INT 2016 Urban water modelling - Meijer
DSD-INT 2016 Urban water modelling - Meijer
 
Tableau for statistical graphic and data visualization
Tableau for statistical graphic and data visualizationTableau for statistical graphic and data visualization
Tableau for statistical graphic and data visualization
 
4.4 M. Ángeles Rol
4.4 M. Ángeles Rol4.4 M. Ángeles Rol
4.4 M. Ángeles Rol
 
Don't Forget This!
Don't Forget This!Don't Forget This!
Don't Forget This!
 
Rants by Mac Columns - 2021 - ALL.pdf
Rants by Mac Columns - 2021 - ALL.pdfRants by Mac Columns - 2021 - ALL.pdf
Rants by Mac Columns - 2021 - ALL.pdf
 
Gone Shopping: detailed retail mapping
Gone Shopping: detailed retail mappingGone Shopping: detailed retail mapping
Gone Shopping: detailed retail mapping
 
Baremo course navette
Baremo course navetteBaremo course navette
Baremo course navette
 
Gráfico ponto cruz em PDF
Gráfico ponto cruz em PDFGráfico ponto cruz em PDF
Gráfico ponto cruz em PDF
 
Resultados 3 série
Resultados 3 sérieResultados 3 série
Resultados 3 série
 
Resultados 2 série
Resultados 2 sérieResultados 2 série
Resultados 2 série
 
Resultados 2 série
Resultados 2 sérieResultados 2 série
Resultados 2 série
 
Resultados 1 série
Resultados 1 sérieResultados 1 série
Resultados 1 série
 
Master tabel pengetahuan dan sikap (uji validitas)
Master tabel pengetahuan dan sikap (uji validitas)Master tabel pengetahuan dan sikap (uji validitas)
Master tabel pengetahuan dan sikap (uji validitas)
 
03 partes de la cara
03 partes de la cara03 partes de la cara
03 partes de la cara
 
City hall final
City hall finalCity hall final
City hall final
 
Parametric and non parametric test
Parametric and non parametric testParametric and non parametric test
Parametric and non parametric test
 
RESULTADOS VIERNES
RESULTADOS VIERNES RESULTADOS VIERNES
RESULTADOS VIERNES
 
More Reliable Delivery with Monte Carlo & Story Mapping
More Reliable Delivery with Monte Carlo & Story MappingMore Reliable Delivery with Monte Carlo & Story Mapping
More Reliable Delivery with Monte Carlo & Story Mapping
 
Tablas de-mulitplicar-con-tapones
Tablas de-mulitplicar-con-taponesTablas de-mulitplicar-con-tapones
Tablas de-mulitplicar-con-tapones
 
División por dos cifras
División por dos cifrasDivisión por dos cifras
División por dos cifras
 

Último

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Último (20)

Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 

Good Enough Analytics

  • 2.
  • 3.
  • 4. The Good Enough Stuff Analytical Tools
  • 5. Analytical Tools are like spoons
  • 6. Analytical Tools are like spoons
  • 7.
  • 8.
  • 13. Point of stupidity What is stupid today, might not be stupid tomorrow
  • 14. Good Enough Analytics Big data analytics using cost efficient tools
  • 15. The Good Enough Stuff Ensembles of good enough models
  • 16. Point of stupidity: The perfect model [slide visual: a dense wall of random digits filling the slide, repeated several times] A “perfect” model is too complex, too costly to build, too hard to maintain and not flexible to change.
  • 17. “There are known knowns; there are things we know that we know. There are known unknowns; that is to say, there are things that we know we don't know. But there are also unknown unknowns; there are things we do not know we don't know.” By Donald Rumsfeld, United States Secretary of Defense and Potential Data Scientist. Why the perfect model is stupid
  • 18. Good Enough Analytics: Ensembles – “In statistics and machine learning, ensemble methods use multiple models to obtain better predictive performance than could be obtained from any of the constituent models.” [slide visual: three walls of random digits joined by “+” signs, illustrating models being combined]
  • 23. Simple Ensembles – GLM: Bootstrap aggregating (bagging)
    predictions <- foreach(1:1000, .combine=cbind) %dopar% {
      training_positions <- sample(nrow(train), size=floor(nrow(train)*0.9), replace=TRUE)
      train_pos <- 1:nrow(train) %in% training_positions
      glmMod <- rxLinMod(eqn, train[train_pos,])
      rxPredict(glmMod, test, type="response")
    }
    result <- rowMeans(predictions)
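The bagging recipe on slide 23 can be sketched in a few lines of plain Python (the deck uses R; this is an illustrative stand-in that uses a trivially simple "model", the sample mean, in place of rxLinMod): fit many models, each on a random 90% sample drawn with replacement, then average their predictions.

```python
import random

def bagged_mean(train, n_models=1000, frac=0.9, seed=42):
    """Average the predictions of n_models, each fit on a bootstrap sample."""
    rng = random.Random(seed)
    size = int(len(train) * frac)
    preds = []
    for _ in range(n_models):
        # sample 90% of the rows with replacement (the bootstrap step)
        sample = [rng.choice(train) for _ in range(size)]
        # one deliberately simple "model": predict the sample mean
        preds.append(sum(sample) / size)
    # bagging: average across all the models, like rowMeans(predictions)
    return sum(preds) / n_models

estimate = bagged_mean([4.0, 7.0, 7.0, 9.0, 8.0, 2.0, 6.0, 5.0])
```

Each individual bootstrap model is noisy, but the averaged prediction settles close to the underlying mean, which is the point of bagging.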
  • 24. Simple Ensembles – Gradient Boosting Machines
    gbmMod <- gbm(eqn, train, n.trees=10000, shrinkage=0.002,
                  distribution="gaussian", interaction.depth=7,
                  bag.fraction=0.9, n.minobsinnode=50)
    Similar to bagging, boosting also creates an ensemble of classifiers by resampling the data, which are then combined by majority voting. However, in boosting, resampling is strategically geared to provide the most informative training data for each consecutive classifier.
  • 25. Simple Ensembles – Random Forest
    rf <- foreach(ntree=rep(333, 3), .combine=combine, .packages='randomForest') %dopar%
      randomForest(train[, 3:length(train)], train$Act, ntree=ntree,
                   do.trace=1000, mtry=round(colNumber/3), replace=FALSE,
                   nodesize=5, na.action=na.omit)
  • 26. Ensemble of Ensembles 1. Mean(RF+GBM+BagGLM) 2. Median(RF+GBM+BagGLM) 3. 0.4*RF+0.4*GBM+0.2*BagGLM
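The three combination rules on slide 26 are simple enough to spell out. Here is a minimal pure-Python sketch (a hypothetical helper, not from the deck) applied to one row of predictions from the three constituent models:

```python
import statistics

def combine(rf, gbm, bag_glm, weights=(0.4, 0.4, 0.2)):
    """Blend one row of predictions the three ways listed on the slide."""
    preds = (rf, gbm, bag_glm)
    return {
        "mean": statistics.mean(preds),
        "median": statistics.median(preds),
        # fixed-weight blend: 0.4*RF + 0.4*GBM + 0.2*BagGLM
        "weighted": sum(w * p for w, p in zip(weights, preds)),
    }

blend = combine(rf=0.8, gbm=0.6, bag_glm=0.9)
```

In practice the weights in option 3 would come from cross-validation rather than being fixed, as the speaker notes mention.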
  • 27. Ensembles – Why it matters
    Improve accuracy: Ensembles tend to yield better results than their constituent models when there is significant diversity among the models, and developing multiple simple models is faster than attempting to develop the perfect model.
    More resistance to overfitting: Less reliant on any single model.
    Concurrent development: Different models can be run and developed on different instances/machines by different data scientists.
  • 28. Ensembles – point of stupidity
    Netflix Prize $1 million winner: an ensemble of 107 models for a 10% improvement. Too complicated, costly and inflexible to change.
    Actual deployment: an ensemble of 2 models for an 8.43% improvement.
    Moral of the story: a good enough ensemble is good enough.
  • 29. Good Enough Analytics Big data analytics using cost efficient tools and good enough ensemble of models
  • 30. The Good Enough Stuff Data Optimization
  • 31. Data cleaning vs Data optimization Important but I assume you know Done AFTER data cleaning
  • 32. Kaggle Medical Drug Competition 15 sets of data Each data set: 1,000 to 2,000 Attributes 500 to 20,000 Rows Qn: Identify rogue drugs
  • 33. Point of stupidity: Trying to run analysis on all attributes
    Drug | Rogue % | Company | Color | Component 1 | Component 2…2000
    A    | 0.0400  | XYZ     | Red   | 200         | 30
    B    | 0.0002  | XYZ     | Green | 920         | 50
    C    | 0.8000  | XYZ     | Blue  | 30          | 1000
    D    | ?       | XYZ     | Red   | 340         | 800
  • 34. (same table) Not all attributes are born equal: no variance, irrelevant, or simply too many attributes.
  • 35. Remove no variance / near zero variance attributes
    Drug | Rogue % | Company
    A    | 0.0400  | XYZ
    B    | 0.0002  | XYZ
    C    | 0.8000  | XYZ
    D    | ?       | XYZ
    R code:
    library(caret)
    healthdata[nearZeroVar(healthdata, freqCut = 95/5, uniqueCut = 10)] <- list(NULL)
    <- the Company attribute does not help in differentiating between the drugs
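The nearZeroVar test can be illustrated outside of caret. Below is a hypothetical pure-Python analogue of the freqCut/uniqueCut logic used above: a column is flagged when its most frequent value dominates the second most frequent one and it has few distinct values relative to its length.

```python
from collections import Counter

def near_zero_var(column, freq_cut=95 / 5, unique_cut=10.0):
    """Return True if the column has zero or near-zero variance."""
    counts = Counter(column).most_common()
    if len(counts) == 1:
        return True  # zero variance: a single repeated value
    freq_ratio = counts[0][1] / counts[1][1]        # most common vs. second
    pct_unique = 100.0 * len(counts) / len(column)  # distinct values, in %
    return freq_ratio > freq_cut and pct_unique < unique_cut

company = ["XYZ"] * 100          # constant column -> drop
maker = ["XYZ"] * 99 + ["ABC"]   # one outlier in a hundred -> drop
comp1 = list(range(100))         # varied, potentially informative -> keep
```

The maker column shows why the outlier caveat from the speaker notes matters: for fraud-style problems you might deliberately keep such columns.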
  • 36. Remove not important attributes
    Drug | Rogue % | Color
    A    | 0.0400  | Red
    B    | 0.0002  | Green
    C    | 0.8000  | Blue
    D    | ?       | Red
    R code for Random Forest: importanceScore <- importance(myMod)
    R code for GBM: importanceScore <- summary.gbm(myMod, ntree)
    <- the Color attribute has no relevance to the % of rogue drugs
  • 37. Attribute reduction using Principal Component Analysis
    Drug | Rogue % | Component 1 | Component 2…2000
    A    | 0.0400  | 200 | 30
    B    | 0.0002  | 920 | 50
    C    | 0.8000  | 30  | 1000
    D    | ?       | 340 | 800
    R code: pc <- prcomp(train[, 2:length(train)], tol=0.12)
    <- too many attributes take very long to run analysis
  • 38. Attribute reduction using Principal Component Analysis. Andrew Ng: Always try analysis without PCA first. [slide visual: scatter of points on Attribute 1 vs Attribute 2 axes] (Andrew Ng: Machine Learning Course – refer to References)
  • 39. [slide visual: the same points with the best-fit Principal Component line drawn through them]
  • 40. [slide visual: a more scattered set of points on Attribute 1 vs Attribute 2 axes]
  • 41. The 1D red line and points are now representative of the 2D graph. [slide visual: the points projected onto the Principal Component line]
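The projection idea in slides 38-41 can be sketched without any library. This pure-Python toy PCA (a hypothetical helper, for illustration only) finds the best-fit direction of 2-D points by power iteration on their covariance matrix, then projects each point onto that direction, compressing two attributes into one.

```python
def pca_1d(points, iters=100):
    """Project 2-D points onto their first principal component."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    # entries of the 2x2 covariance matrix
    cxx = sum(x * x for x, _ in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    cyy = sum(y * y for _, y in centered) / n
    # power iteration converges to the leading eigenvector (the red line)
    vx, vy = 1.0, 1.0
    for _ in range(iters):
        nx = cxx * vx + cxy * vy
        ny = cxy * vx + cyy * vy
        norm = (nx * nx + ny * ny) ** 0.5
        vx, vy = nx / norm, ny / norm
    # dot product = position of each point along the principal component
    return [x * vx + y * vy for x, y in centered]

scores = pca_1d([(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.9)])
```

The four 2-D points collapse into four 1-D scores that preserve their ordering along the best-fit line, which is exactly the compression prcomp performs at scale.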
  • 42. Data Optimization – Why it matters
    Performance improvement (importance, nearZeroVar): Cut down attributes which are useless or not “good enough”. More accurate and complex models can be built on the attributes that matter.
    Cost savings (PCA): Less data needs to be processed, faster turnaround for models and results.
  • 43. Good Enough Analytics Big data analytics using cost efficient tools and good enough ensemble of models based on optimized data
  • 44. The Good Enough Stuff Scaling on cloud
  • 45. Why use Cloud: How often do you really need a multimillion-dollar machine to be on standby 24/7 to churn data? Do you really need real time analytics, or is an hourly/daily/weekly/monthly report good enough?
  • 46. Cloud – Why it matters
    Excellent bang for the buck: <$5/hr to rent a million dollars’ worth of power. No need to purchase/maintain hardware. Scale on demand.
    Great for ensemble modeling: You can start multiple instances, each running one simple model, and ensemble them.
    But beware of data security and privacy laws: Not suitable for all kinds of data/applications. For example, Amazon Web Services is HIPAA compliant but Rackspace is not.
  • 47. Prepare data for the cloud
    Name  | Age | Income | Postal
    Peter | 23  | $2,000 | 400573
    Sally | 11  | $0     | 520028
    Paul  | 70  | $500   | 521201
    Mark  | 30  | $8,000 | 247392
  • 48. Prepare data for the cloud
    Name  | Age | Age Group | Income | Income Range  | Postal | Postal Area
    Peter | 23  | Youth     | $2,000 | $1,000-$3,000 | 400*** | Eunos
    Sally | 11  | Child     | $0     | $0            | 520*** | Simei
    Paul  | 70  | Senior    | $500   | $1-$1,000     | 521*** | Tampines
    Mark  | 30  | Adult     | $8,000 | >$5,000       | 247*** | Tanglin
    Techniques: Remove identity, masking, use general category (rollup), use range category.
    Reference: Dr. Yap Ghim Eng (A*Star)
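The masking and rollup steps above can be sketched as a small Python function. The category boundaries here are assumptions chosen to match the slide's example rows, not a published standard:

```python
def anonymize(record):
    """Drop the identity, bucket age and income, and mask the postal code."""
    age = record["age"]
    age_group = ("Child" if age < 13 else
                 "Youth" if age < 25 else
                 "Adult" if age < 65 else "Senior")
    income = record["income"]
    if income == 0:
        income_range = "$0"
    elif income <= 1000:
        income_range = "$1-$1,000"
    elif income <= 3000:
        income_range = "$1,000-$3,000"
    else:
        income_range = ">$5,000"
    return {
        "age_group": age_group,                       # rollup; name removed
        "income_range": income_range,                 # range category
        "postal_area": record["postal"][:3] + "***",  # masking
    }

row = anonymize({"name": "Peter", "age": 23, "income": 2000, "postal": "400573"})
```

The output keeps enough structure for modeling while no single field uniquely identifies the individual, which is the point of the slide.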
  • 49. Good Enough Analytics Big data analytics using cost efficient tools and good enough ensemble of models based on optimized data, scaled on cloud services
  • 50. The Good Enough Stuff …that we have no time for Amazon Web Service
  • 51. Basic code to set up an Amazon instance for analytics
    sudo yum install gcc gcc-c++ gcc-gfortran readline-devel python-devel make atlas blas
    sudo yum install -y lapack-devel blas-devel
    wget http://cran.at.r-project.org/src/base/R-2/R-2.15.2.tar.gz
    tar -xf R-2.15.2.tar.gz
    cd R-2.15.2
    ./configure --with-x=no
    sudo make
    PATH=$PATH:~/R-2.15.2/bin/
    cd ..
    wget http://sourceforge.net/projects/numpy/files/NumPy/1.6.2/numpy-1.6.2.tar.gz/download
    tar -xzf numpy-1.6.2.tar.gz
    cd numpy-1.6.2
    sudo python setup.py install
    cd ..
    wget http://sourceforge.net/projects/scipy/files/scipy/0.11.0/scipy-0.11.0.tar.gz/download
    tar -xzf scipy-0.11.0.tar.gz
    cd scipy-0.11.0
    sudo python setup.py install
    cd ..
    wget http://pypi.python.org/packages/source/n/nose/nose-1.1.2.tar.gz#md5=144f237b615e23f21f6a50b2183aa817
    tar -xzf nose-1.1.2.tar.gz
    cd nose-1.1.2
    sudo python setup.py install
    After sudo-ing and running R, type:
    install.packages('gbm')
    install.packages('randomForest')
    To leave R or Python jobs running while you are not logged on: nohup R CMD BATCH myfile.r &
  • 52. Amazon EC2 Spot Instance Cluster Compute Eight Extra Large 60.5 GiB memory, 88 EC2 Compute Units, 3370 GB of local instance storage, 64-bit platform, 10 Gigabit Ethernet $0.27 per hour High-Memory Quadruple Extra Large Instance 68.4 GiB of memory, 26 EC2 Compute Units (8 virtual cores with 3.25 EC2 Compute Units each), 1690 GB of local instance storage, 64-bit platform $0.14 per hour
  • 53. Weakness of Spot Instance Bidding system. If your bid < spot instance price, instance will be terminated. Solutions: 1) Put master on normal cloud instance and slave on spot instance 2) Heartbeat + Queue with Checkpoint
  • 54. The Good Enough Stuff …that we have no time for PCA with KNN
  • 55. Principal Component Analysis – With K-Nearest Neighbor
    library(FNN)
    train <- read.csv("train.csv", header=TRUE)
    test <- read.csv("test.csv", header=TRUE)
    pc <- prcomp(train[, 2:length(train)], tol=0.12)
    mydata <- data.frame(label = train[, "label"], pc$x)
    labels <- mydata[,1]
    mydata2 <- mydata[,-1]
    test.p <- predict(pc, newdata = test)
    results <- (0:9)[knn(mydata2, data.frame(test.p), labels, k = 1, algorithm="cover_tree")]
    write(results, file="knn_PCA.csv", ncolumns=1)
  • 56. The Good Enough Stuff …that we have no time for Data Chunking
  • 57. Data Chunking – Revolution R
    Uses a format called XDF, loosely based on NoSQL. The XDF format is a binary file format that stores data in blocks and processes data in chunks (groups of blocks) for efficient reading of arbitrary columns and contiguous rows. For more details, visit the Revolution R website.
  • 58. Data Chunking – Why it matters
    # Chunk 6.5GB worth of data onto HDD in XDF
    rxImport(inData = trainFile, outFile = "trainingData.xdf")
    # RevR provides methods like rxGlm to run a huge Poisson regression directly on the XDF file
    myPos <- rxGlm(amount2 ~ Mailed+Donated+RR, data="trainingData", family=poisson())
    *This cannot be done using normal R on my laptop, as R tries to load the entire dataset into memory
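The principle behind XDF's block-wise processing can be shown language-agnostically. This pure-Python sketch (a hypothetical helper) computes a column mean by streaming fixed-size chunks and keeping only running totals, so the full dataset never has to sit in memory at once:

```python
def chunked_mean(values, chunk_size=1000):
    """Mean of a column computed one chunk at a time, not all at once."""
    total, count = 0.0, 0
    for start in range(0, len(values), chunk_size):
        # read one block; in XDF this would come from disk, not a list slice
        chunk = values[start:start + chunk_size]
        total += sum(chunk)
        count += len(chunk)
    return total / count

avg = chunked_mean(list(range(10000)), chunk_size=256)
```

Only the running total and count live in memory between chunks, which is why a 6.5GB file can be processed on a laptop that could never hold it as a single in-memory data frame.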
  • 59. Data Chunking – Speeding it up using SSD instead of normal HDD. RAM: fast but expensive. SSD: ~4x faster than a normal HDD when chunking.
  • 60. The Good Enough Stuff …that we have no time for Multicore
  • 61. Multicore Processing – Revolution R
    library(foreach)
    library(doSNOW)
    cluster <- makeCluster(3, type = "SOCK")
    registerDoSNOW(cluster)
    setMKLthreads(1)
    predictions <- foreach(1:1000, .combine=cbind) %dopar% {
      training_positions <- sample(nrow(train), size=floor(nrow(train)*0.9), replace=TRUE)
      train_pos <- 1:nrow(train) %in% training_positions
      glmMod <- rxLinMod(eqn, train[train_pos,])
      rxPredict(glmMod, test, type="response")
    }
    result <- rowMeans(predictions)
  • 62. Multicore Processing – Why it matters
    License cost (usually charged per CPU): 1 CPU with 4 cores = 1 single-user license; distributed 4 CPUs with 1 core each = 4 licenses or a group license.
    Performance improvement: ~2x performance for 3 cores vs 1 core.
  • 64. Good Enough References
    Random Forest: Obtaining knowledge from a random forest · Suggestions for speeding up Random Forests · Random Forest with classes that are very unbalanced
    GBM: Define boosting · Generalized Boosted Models: A guide to the gbm package · What are some useful guidelines for GBM parameters? · R gbm logistic regression · How to win the KDD Cup Challenge with R and gbm
    Ensembles: Ensemble learning introduction · Exploiting Diversity in Ensembles: Improving the Performance on Unbalanced Datasets · Resources for learning how to implement ensemble methods · Ensemble methods · Intro to ensemble learning in R · Predictive analytics & decision tree
  • 65. Good Enough References
    PCA and NearZero: Principal Component Analysis in R · PCA on high dimensional data · PCA on training and test data · Nearzero R caret library
    Misc: Andrew Ng’s Machine Learning Course · A Few Useful Things to Know about Machine Learning · Creating HIPAA-Compliant Medical Data Applications With AWS · Amazon EC2 Spot Instances · Improve Predictive Performance in R with Bagging · Kaggle: Visualizing dark world · Kaggle: Visualizing handwriting
  • 66. Good Enough Analytics Big data analytics using cost efficient tools and good enough ensemble of models based on optimized data, scaled on cloud services
  • 67. Qns? Email me @ thiakx@gmail.com LinkedIn Profile Kaggle Profile Good Enough Analytics Big data analytics using cost efficient tools and good enough ensemble of models based on optimized data, scaled on cloud services Asia?
  • 68. Photo Credits
    Slide 2: http://3.bp.blogspot.com/-nkP_UHgebKo/T70GJ3ezCrI/AAAAAAAAAZc/mWD6RsDlz6Y/s1600/IMG_0349.JPG
    Slide 3: http://www.salesmanagementmastery.com/wp-content/uploads/2010/09/money-flying.jpg
    Slide 5: http://www.pachd.com/free-images/household-images/spoon-01.jpg
    Slide 6: http://www.bhmpics.com/view-rice_in_a_wooden_spoon-1440x900.html
    Slide 7: http://2.bp.blogspot.com/-Oj7ji_8CB3Q/TkvdFXAYUcI/AAAAAAAADgQ/XcevbehpPHU/s1600/Big+spoon+3.jpg
    Slide 8: http://familyhelpers.files.wordpress.com/2012/03/spoon.jpg
    Slide 11 (Lemon): http://miamiaromatherapy.com/shopping/images/70//Lemon-2.jpg
    Slide 12 (Bank): http://www.psdgraphics.com/wp-content/uploads/2011/03/bank-icon.jpg
    Slide 11/12 (Logos): http://commons.wikimedia.org/wiki/Main_Page
    Slide 19-21: www.scholarpedia.org
    Slide 23/25: www.wikipedia.org
    Slide 32: http://www.chipandco.com/wp-content/uploads/2012/08/medicine.jpg
    Slide 63: www.kaggle.com

Notas do Editor

  1. Hello, I am here to share a little of what I have learnt so far regarding big data analytics. I do not claim to be a data scientist nor a statistician nor an expert in big data technologies. Which is good, because many experts become too focused on their particular domain, be it data modeling, stats, or big data tools. Instead, I hope I will be able to get everyone to think of big data analytics as a process with numerous components where data scientists, statisticians and big data technologists can work together to achieve the goal of good enough analytics.
  2. Imagine you are a lemonade stand owner. What is good enough analytics for you? Perhaps some simple analytics on excel or open source programs to determine the optimum price, weather condition and position to sell the lemonade? Or perhaps you just set a price and call it a day?
  3. What is good enough analytics if you are working in the fraud detection section of a bank? Well, you will probably need to run some sophisticated fraud pattern detection algorithms, buy some large, expensive analytical tools and hardware, and hire a good team of data scientists. So inherently, we do know roughly what good enough analytics is for different scenarios. What I am trying to do here is to give good enough analytics a clearer structure for us to think about analytical problems.
  4. To begin, I would like to talk about analytical tools
  5. Analytical tools are like spoons. There are established, standard spoons that we use for everyday purpose
  6. There are also niche, special purpose spoons to bring out the best flavor of the food
  7. Sometimes you need a big spoon
  8. Other times you need a baby spoon. The point is, just like you will never swear by a spoon, you should not be too caught up with picking the best analytical tool. It is important for big data people to be open minded; just like spoons, every analytical tool has its different purpose and usage according to the business scenario. Throughout this presentation, I will try to be as platform agnostic as possible, presenting general ideas that will work no matter whether you are using SAS, Greenplum, IBM, Oracle, Revolution R, etc.
  9. Here we have a usefulness vs cost graph of analytical tools. The graph is shaped this way due to the law of diminishing returns. Initially, you see great returns on investment when you start purchasing new analytical tools. As you progress further up the graph, however, you will need more and more expensive tools and experienced data scientists to uncover the less obvious trends/traits in your data. Axes: *Usefulness in the form of helping the business make decisions. *Cost in the form of hardware/software/manpower.
  10. And like all good graphs, there is always a point of stupidity, where you will face rapidly diminishing usefulness for the price you pay.
  11. For the lemonade stand business, the point of stupidity is when you decide to setup a hadoop cluster to counter the other lemonade stalls around you instead of just selling something else or changing location
  12. For the bank, it will be less obvious. Will you buy a supercomputer? Or a 3rd Oracle rack? It all depends. Sometimes it is worth it to push the limits of analytics, other times it is not. It depends on the situation, the company and experience. Nevertheless, it is important to know the point of stupidity exists and always ask yourself: “am I approaching the point of stupidity or do I just need the extra edge for the breakthrough?”
  13. Good enough analytics is simply analytics that lies before the point of stupidity. It is the cost efficient solution. As we saw from the previous example, the point of stupidity for the lemonade stand and the bank is very different, and hence so is their line of good enough analytics. And what is stupid today may not be stupid tomorrow. Things change: more budget comes in, new challengers enter the market, and now we really need to move the point of stupidity further up the curve. That’s normal.
  14. So here we have the first definition of Good Enough Analytics
  15. Moving beyond tools, now I will like to talk about models
  16. The “perfect model” is the kind of model overzealous A+ students come up with while in school. Real world big data, on the other hand, is too complex an animal for anyone to come up with a perfect model for.
  17. You cannot build a perfect model because there are things we simply do not know we don’t know. We will never have perfect data nor perfect understanding of the big data. And knowing that is kind of liberating as the goal will no longer be the impossible goal of seeking perfection, instead, it becomes the constant improvement towards a good enough answer. We should fear perfection because if perfection is attained, data scientists will be out of job.
  18. This is a graphical example. Each model has its own decision boundary and errors. When we combine them into an ensemble, their extremities average out and we obtain a much better result than any individual model. (The final ensemble in the image shows a perfect result, but in the real world we probably won’t be so lucky.)
  19. In this case, this complex 4 shaped data cannot be easily represented/detected by any single model
  20. An ensemble of multiple models, each forming a piece of the 4 sides, might come close in representing the actual data.
  21. Beyond all the theorycraft and babies, I would like to show you some actual code and statistical theories. I will be using R code here but really, the concepts presented here can easily be done in Python, SAS code or whatever favorite language you like. Again, I am no statistician nor the world’s number one data scientist, so I will try my best to give a good enough explanation of the concepts.
  22. Bagging simply means we repeat the model multiple times, each time sampling randomly with replacement a portion of the data to prevent overfitting and get closer to the “truth”. One of the fastest among simple ensemble models, it is also the least accurate unless the data is linear. In the code, what we trying to do is to obtain the mean of 1000 linear models, each linear model built on a random sample of 90% of the data.
  23. GBM has great accuracy but is hard to tweak and understand, and is slower (I can’t find a way to run it concurrently/multinode). It is a stronger version of bagging: at each step of resampling, instead of always picking 90% of the data randomly, it smartly selects the subset of data with the most information gain. In essence, each iteration of boosting creates three weak classifiers: the first classifier C1 is trained with a random subset of the available training data. The training data subset for the second classifier C2 is chosen as the most informative subset, given C1. Specifically, C2 is trained on training data only half of which is correctly classified by C1, and the other half is misclassified. The third classifier C3 is trained with instances on which C1 and C2 disagree. The three classifiers are combined through a three-way majority vote. interaction.depth: the maximum depth of variable interactions; 1 implies an additive model, 2 implies a model with up to 2-way interactions, etc. n.minobsinnode: minimum number of observations in the trees’ terminal nodes; note that this is the actual number of observations, not the total weight. Shrinkage: also known as the learning rate or step-size reduction; it modifies the update rule and regularizes the model. Smaller shrinkage results in an improvement in accuracy but more iterations are needed. Refer to this link for more details: http://www.scholarpedia.org/article/Ensemble_learning#Bagging
  24. Random forest is simply a collection of decision trees. In the code, each decision tree will be based on a random sample of 1/3 of the columns, and the trees will be combined using a voting system or mode across the 999 trees. Random forests have a good balance of speed and accuracy.
  25. There is still a lot to learn, but I got to the top 1% of Kaggle using this. The ensemble of ensembles method is pretty good. The toughest challenge is in the 3rd method. You will need to run multiple cross validations and have a regression model on top of it to determine the optimum ratio of the models to minimize the mean square error. It is quite some work.
  26. One of the key point to note here is the need for diversity among constituent models to have better results. The idea here is to look through the problem from different perspective and methodologies .
  27. An example from the kaggle competition I participated in. While the data itself is not strictly big data per se, the challenge lies in the fairly large number of attributes/columns
  28. These columns are not the actual data set, I added some labeling to better illustrate my point.
  29. The main problem is that these useless columns will be held in RAM/HDD and waste compute power as models run through them and ignore them. The nearZeroVar function in R removes the columns with zero (like the column in yellow) or near-zero variance. The near-zero variance part needs some explanation. For example, if out of 1 million drugs, one drug happens to be manufactured by company ABC and that drug happens to be a rogue drug (poisons the patient instead of curing him), can we generalize that all drugs produced by company ABC are rogue drugs? Of course not. By tweaking freqCut and/or uniqueCut, we can set the cutoff point for such outliers and remove them to increase the accuracy of the model. Although… there are times when we want to keep these outliers, for example to train fraud detection models.
  30. A blue pill or red pill does not determine the potency of the drug (not The Matrix). The idea here is simple. It is just one line of code in R, and from here you can customize to select only the important variables and shave out the less important columns. Doing so correctly should slightly increase the accuracy and greatly reduce the time taken to run analysis. Here are the definitions of the variable importance measures. The first measure is computed from permuting OOB data: for each tree, the prediction error on the out-of-bag portion of the data is recorded (error rate for classification, MSE for regression). Then the same is done after permuting each predictor variable. The difference between the two is then averaged over all trees, and normalized by the standard deviation of the differences.
  31. Next, I want to compress the data, reducing the number of columns without removing the attributes completely. Most of the time, the top 10% of principal components account for 90% of the variance in the data, so you can compress 90% of the data and still keep the data model somewhat intact. Although it is only one line of code in R, PCA is a very powerful way to greatly decrease analysis time and to visualize complex high dimension data. I would like to explain how it works in the following slides. tol: a value indicating the magnitude below which components should be omitted. (Components are omitted if their standard deviations are less than or equal to tol times the standard deviation of the first component.) With the default null setting, no components are omitted. Other settings for tol could be tol = 0 or tol = sqrt(.Machine$double.eps), which would omit essentially constant components.
  32. Andrew Ng: always try the analysis without PCA first. The reason is that by reducing attributes based on 0.90, 0.95, or 0.99 of the variance, we are still losing accuracy, however little. PCA does speed up processing considerably, though, so we can apply stronger models to compensate for the lost accuracy. Also, because PCA merges attributes together, the result is much harder to use and understand than importance(). If I ask you to represent the data with a single line, how would you do it? Probably you would just laugh and draw a best-fit line across the points (the red line)
  33. The red line (the best-fit line) is the principal component that represents the two attributes. We can now remove the original attributes: we have effectively compressed two attributes into one principal component
  34. Now we have slightly more complex data. Again, to best represent it, you would probably draw a best-fit line across the points (the red line). The blue lines are the shortest distances from the points to the best-fit line
  35. We now project the points along the blue lines onto the red line, and here we can see the principal component that represents the two attributes. Again, we have effectively compressed two attributes into one principal component
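The projection in slides 33–35 is exactly what prcomp's scores are. A small sketch with made-up two-attribute data: project onto the first component (the red line), then map back and measure how little was lost.

```r
# Projecting 2-D points onto the first principal component and back, in base R.
# The toy data (columns a and b, strongly correlated) is made up for this demo.
set.seed(1)
x   <- rnorm(100)
pts <- cbind(a = x, b = 2 * x + rnorm(100, sd = 0.3))

pca    <- prcomp(pts, center = TRUE)
scores <- pca$x[, 1, drop = FALSE]          # one number per point: the red line
# Reconstruct the 2-D coordinates from that single component:
recon <- scores %*% t(pca$rotation[, 1, drop = FALSE])
recon <- sweep(recon, 2, pca$center, `+`)
mean((pts - recon)^2)                        # small: the blue-line distances
```

The reconstruction error is just the (squared) blue-line distances from slide 34, which is why points close to the best-fit line compress almost losslessly.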
  36. I will not embarrass myself in front of the cloud experts who presented before me; they have done a great job explaining why and how to do analytics in the cloud. In essence, unless you need real-time analytics, good enough analytics on an hourly/daily basis is much more manageable and fits the on-demand nature of the cloud: we only pay for the time needed to run the analysis.
  37. Here we will talk about preparing data for the cloud. We cannot just ship our entire database onto the cloud and pray that we don’t get sued by our clients.
  38. Here are some ways we can massage the data. The main idea is to remove any feature that can uniquely identify an individual. Of course, the data will also need to be encrypted, both at rest in the cloud database and in flight during data transfer.
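A minimal base-R sketch of that de-identification step before upload: drop direct identifiers and replace the real ID with an opaque study ID, keeping the real-ID mapping on-premise. The table and its column names (name, nric, diagnosis) are hypothetical.

```r
# De-identifying a table before shipping it to the cloud (hypothetical columns).
records <- data.frame(
  name      = c("Alice Tan", "Bob Lim", "Alice Tan"),
  nric      = c("S1234567A", "S7654321B", "S1234567A"),
  diagnosis = c("flu", "asthma", "flu"),
  stringsAsFactors = FALSE
)

pseudonymize <- function(df, id_col, drop_cols) {
  ids <- df[[id_col]]
  df  <- df[, setdiff(names(df), c(id_col, drop_cols)), drop = FALSE]
  # Same person -> same opaque subject number; the id mapping never leaves us.
  df$subject <- match(ids, unique(ids))
  df
}
clean <- pseudonymize(records, id_col = "nric", drop_cols = "name")
clean   # diagnosis plus subject 1, 2, 1 — analysis-ready, identity-free
```

A production pipeline would use a keyed hash rather than sequential numbers, but the principle is the same: nothing uploaded can be traced back to a person without the on-premise mapping.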
  39. Amazon spot instances are very popular among Kaggle players: cheap and powerful. *GiB is gibibytes, the actual usable amount; 60.5 GiB is roughly 64 GB of RAM.
  40. There are limitations to spot instances, and here are some common solutions people use.
  41. Sample workable code for KNN built on PCA.
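The slide's own code is not reproduced in these notes; a minimal sketch of the same idea (KNN on PCA scores) using base R's prcomp and the class package that ships with every standard R distribution might look like this, using the built-in iris data:

```r
# KNN built on PCA scores: fit PCA on training rows, project held-out rows
# into the same space, then classify with class::knn (bundled with R).
library(class)
set.seed(7)
train_idx <- sample(nrow(iris), 100)

pca   <- prcomp(iris[train_idx, 1:4], scale. = TRUE)
ncomp <- 2                                   # first two components cover ~96%
train <- pca$x[, 1:ncomp]
test  <- predict(pca, iris[-train_idx, 1:4])[, 1:ncomp]  # same PCA space

pred <- knn(train, test, cl = iris$Species[train_idx], k = 5)
mean(pred == iris$Species[-train_idx])       # held-out accuracy
```

Note that the test rows are projected with predict() on the training PCA; refitting PCA on the test set separately would put the two sets in different coordinate systems.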
  42. Revolution R's data chunking is powerful and allows big data analytics on small machines such as laptops
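Revolution R's XDF chunking is proprietary, but the underlying idea can be sketched in base R: stream the file in fixed-size blocks and keep running statistics, so only one chunk is ever in RAM. The file here is a temporary single-column CSV generated for the demo.

```r
# Chunked processing sketch in base R: a running mean over a file read in
# blocks of 10,000 rows (self-generated demo file with one numeric column).
csv <- tempfile(fileext = ".csv")
write.csv(data.frame(value = 1:1e5), csv, row.names = FALSE)

con <- file(csv, "r")
invisible(readLines(con, n = 1))          # consume the header line
total <- 0; count <- 0; chunk_rows <- 10000
repeat {
  lines <- readLines(con, n = chunk_rows)
  if (length(lines) == 0) break
  chunk <- as.numeric(lines)              # single column, so no CSV parsing needed
  total <- total + sum(chunk)
  count <- count + length(chunk)
}
close(con)
total / count                             # mean of all 100,000 rows
```

The same accumulate-per-chunk pattern generalizes to variances, cross-tabulations, and model sufficient statistics, which is what makes laptop-scale big data analytics workable.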
  43. Full multicore code in R
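The slide's multicore code is not included in these notes; in modern R the built-in parallel package covers this. A portable sketch using a socket cluster (which also works on Windows, unlike fork-based mclapply):

```r
# Multicore sketch with R's built-in parallel package: run a slow function
# across two worker processes with parLapply.
library(parallel)
cl <- makeCluster(2)                    # two worker processes

slow_square <- function(x) { Sys.sleep(0.01); x^2 }
res <- parLapply(cl, 1:8, slow_square)
stopCluster(cl)                         # always release the workers

unlist(res)                             # 1 4 9 16 25 36 49 64
```

For CPU-bound model fitting (the per-fold or per-tree loops in cross-validation and ensembles), swapping lapply for parLapply is often the entire change needed.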
  44. No time to talk about visualization, but it is important: at the start of the analysis, to help us understand the data; throughout the analysis, to help us understand our models; and at the end, when presenting the data and findings
  45. Some useful references for the various topics covered
  46. I have a sixth sense that good enough analytics will be a great fit for Asia. Most analytics tools and methodologies are developed in the West, mainly the USA, by big companies for big companies. I think there is a large market for applying good enough analytics to big data at the smaller, leaner Asian companies. Things in Asia work differently, and analytics here is, frankly, still in its infancy. Perhaps I will explore this more in future presentations.