This document discusses data science and provides examples. It begins by asking what data science is besides an excuse for food and drinks. It then discusses hacking skills like bash/awk/sed and statistics concepts like probability. Domain expertise and intelligence are also important. The document presents an "intelligence cookbook" approach of first making a solution valuable, then possible, then beautiful, and finally smart. Common machine learning problems and approaches are listed. The document concludes that real data science is hard but should focus on the business problem, not just the science. It provides contact information for questions.
6. Hacking
“Good data scientists understand, in a deep way, that the heavy lifting of cleanup and preparation isn’t something that gets in the way of solving the problem… it is the problem.”
DJ Patil
bash/awk/sed
7. Statistics
What’s the probability that 2 people in the front 2 rows share a birthday?
1. ~10%
2. ~20%
3. ~50%
4. ~90%
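The answer depends on how many people sit in the front two rows, which the slide does not state. A minimal sketch of the calculation, assuming 365 equally likely birthdays and ignoring leap years (the function name `shared_birthday_probability` is illustrative, not from the talk):

```python
def shared_birthday_probability(n_people: int, days: int = 365) -> float:
    """Probability that at least two of n_people share a birthday.

    Assumes every day of the year is equally likely (a simplification).
    """
    if n_people > days:
        return 1.0  # pigeonhole: a shared birthday is guaranteed
    p_all_distinct = 1.0
    for i in range(n_people):
        # The (i+1)-th person must avoid the i birthdays already taken.
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct


# Try a few plausible head counts for two rows of seats.
for n in (10, 15, 23, 40):
    print(n, round(shared_birthday_probability(n), 3))
```

With 23 people the probability is already just over 50%, which is why the question tends to surprise an audience.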
What’s the probability that a patient who tests positive on a 99% accurate test actually has a disease that affects 1 in 1,000 people?
1. ~10%
2. ~50%
3. ~90%
4. ~99%
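This question can be worked through with Bayes' rule. A minimal sketch, assuming "99% accurate" means both the sensitivity and the specificity of the test are 99%, and that 1/1000 is the prevalence of the disease (both readings are assumptions, the slide does not spell them out):

```python
def posterior_given_positive(prevalence: float,
                             sensitivity: float,
                             specificity: float) -> float:
    """P(disease | positive test) via Bayes' rule."""
    true_positives = prevalence * sensitivity
    false_positives = (1.0 - prevalence) * (1.0 - specificity)
    return true_positives / (true_positives + false_positives)


p = posterior_given_positive(prevalence=0.001,
                             sensitivity=0.99,
                             specificity=0.99)
print(f"{p:.1%}")  # prints 9.0%
```

Under these assumptions a positive result means the disease is present only about 9% of the time, because false positives among the 999 healthy people outnumber the true positives.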
12. Make it valuable
Find a KPI that is correlated with bottom-line revenue
e.g. the number of products the visitor browses through
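One simple way to check whether a candidate KPI tracks revenue is a Pearson correlation over per-visitor data. The sketch below is only an illustration: the numbers are made-up toy values, and the variable names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy, made-up data: one entry per visitor.
products_browsed = [1, 3, 2, 8, 5, 13, 4]
revenue = [0.0, 12.0, 5.0, 40.0, 22.0, 75.0, 9.0]

r = pearson(products_browsed, revenue)
print(f"correlation between KPI and revenue: {r:.2f}")
```

A high correlation does not prove that moving the KPI moves revenue, but it is a cheap first filter before investing in anything smarter.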
13. Make it possible
Develop the simplest heuristic
e.g. show the visitor one of the top 10 selling products
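The top-sellers heuristic fits in a few lines. A minimal sketch, assuming purchases arrive as a flat list of product IDs, one per sale (the function names and the input shape are assumptions for illustration):

```python
import random
from collections import Counter


def top_sellers(purchases, k=10):
    """Return the k most frequently purchased product IDs, best first."""
    counts = Counter(purchases)
    return [product for product, _ in counts.most_common(k)]


def recommend(purchases, k=10, rng=random):
    """Pick one of the top-k sellers at random, or None if there is no data."""
    candidates = top_sellers(purchases, k)
    return rng.choice(candidates) if candidates else None
```

For example, `recommend(["a", "a", "b", "c", "a", "b"], k=2)` returns either `"a"` or `"b"`. No model and no training are needed, which is the point of this step.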
14. Make it beautiful
Create a method to quickly test new algorithms against old ones
e.g. create a framework that split-tests two models and reports which one is better
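A split-testing framework can start very small. The sketch below assumes each model is a callable that takes some context and returns a recommendation, and that the KPI is recorded per visit. The class name `SplitTest` and its methods are hypothetical, and it compares plain means with no significance test, which a real framework would need.

```python
import hashlib
from statistics import mean


class SplitTest:
    """Route visitors to one of two models and compare a recorded KPI."""

    def __init__(self, model_a, model_b):
        self.models = {"A": model_a, "B": model_b}
        self.kpi = {"A": [], "B": []}

    def arm_for(self, visitor_id: str) -> str:
        # Hash the visitor ID so the same visitor always sees the same model.
        digest = hashlib.sha256(visitor_id.encode("utf-8")).digest()
        return "A" if digest[0] % 2 == 0 else "B"

    def recommend(self, visitor_id: str, context):
        arm = self.arm_for(visitor_id)
        return arm, self.models[arm](context)

    def record(self, arm: str, kpi_value: float) -> None:
        self.kpi[arm].append(kpi_value)

    def report(self):
        # Mean KPI per arm; None when an arm has no observations yet.
        means = {arm: (mean(values) if values else None)
                 for arm, values in self.kpi.items()}
        if None in means.values() or means["A"] == means["B"]:
            winner = None
        else:
            winner = max(means, key=means.get)
        return {"means": means, "winner": winner}
```

Once something like this exists, swapping the top-sellers heuristic for a new algorithm becomes a matter of passing a different callable as `model_b`.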
15. Make it smart
Figure out which field your problem belongs to and choose an off-the-shelf algorithm
e.g. recognize that the problem is product recommendation and use collaborative filtering
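To make the collaborative filtering idea concrete, here is a minimal item-based sketch in plain Python. It assumes implicit feedback (a set of products per user) and uses cosine similarity between items. In practice you would reach for an existing library rather than this toy, and all names here are illustrative.

```python
from collections import defaultdict
from math import sqrt


def item_similarities(baskets):
    """Cosine similarity between items, from a dict of user -> set of items."""
    item_users = defaultdict(set)
    for user, items in baskets.items():
        for item in items:
            item_users[item].add(user)

    sims = defaultdict(dict)
    items = list(item_users)
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            overlap = len(item_users[a] & item_users[b])
            if overlap:
                s = overlap / sqrt(len(item_users[a]) * len(item_users[b]))
                sims[a][b] = s
                sims[b][a] = s
    return sims


def recommend_for(user, baskets, sims, k=3):
    """Rank unseen items by summed similarity to the items the user has."""
    seen = baskets[user]
    scores = defaultdict(float)
    for item in seen:
        for other, s in sims.get(item, {}).items():
            if other not in seen:
                scores[other] += s
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda p: (-scores[p], p))[:k]


# Toy, made-up baskets for illustration only.
baskets = {
    "u1": {"a", "b"},
    "u2": {"a", "b", "c"},
    "u3": {"a"},
    "u4": {"c", "d"},
}
sims = item_similarities(baskets)
print(recommend_for("u3", baskets, sims))
```

Here user `u3`, who only has product `a`, is offered `b` before `c`, because more of the users who have `a` also have `b`.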
17. To sum it all up
Real data science is hard
but …
Real data science is the last step in data science, not the first
and besides …
The most important thing in data science is the business, not the science