6. Data Scientist?
❖ “Someone who knows more statistics than a computer scientist
and more computer science than a statistician”
❖ Someone who extracts insights from messy data
9. 2. Tools / Languages
People are still crazy about Python after twenty-five years, which I find hard
to believe.
—Michael Palin
10. Tools / Languages
❖ R
❖ Python
❖ MATLAB
❖ SQL
❖ Excel
❖ Java
❖ SAS (Statistical Analysis System)
❖ SPSS (Modeler and Analytics)
❖ Hadoop (distributed storage and processing)
11. Python
❖ Easy
❖ Python 2.7
❖ Libraries for data mining:
NumPy
SciPy
pandas
Matplotlib
scikit-learn
12. 3. Getting Data
To write it, it took three months; to conceive it, three minutes;
to collect the data in it, all my life.
—F. Scott Fitzgerald
13. Different ways of getting data
◉ stdin and stdout
◉ Reading files
◉ Scraping the web
◉ Using APIs
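
The first two options above (stdin and files) can share the same code, because Python treats both as streams of lines. A minimal sketch, assuming a hypothetical helper named count_matching_lines and made-up sample text:

```python
import io
import re

def count_matching_lines(stream, pattern):
    """Count the lines in a text stream that match a regular expression."""
    regex = re.compile(pattern)
    return sum(1 for line in stream if regex.search(line))

# Any file-like object works here: sys.stdin, an open file,
# or (as in this sketch) an in-memory buffer with made-up lines.
sample = io.StringIO(u"alice@example.com\nnot an address\nbob@example.org\n")
matches = count_matching_lines(sample, r"@")
print(matches)
```

To read piped input instead, pass sys.stdin; to read a file, pass the object returned by open(path).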
14. Using Twitter API
◉ Python 2.7
◉ Python Twitter libraries (Birdy, TwitterAPI, TwitterSearch, Twython)
◉ Twython (used here), installed with:
pip install twython
◉ Go to https://apps.twitter.com/.
◉ Click Create New App.
◉ Click “Create my access token.”
◉ Run SearchAPI.py
15. 4. Linear Algebra
Is there anything more useless or less useful than Algebra?
—Billy Connolly
16. Vectors
❖ Vectors are points in some finite-dimensional space
❖ A good way to represent numeric data
❖ Simplest from-scratch approach is to represent vectors as lists of
numbers
Example: if you have the heights, weights, and ages of a large number of
people, you can treat your data as three-dimensional vectors
(height, weight, age)
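
A minimal from-scratch sketch of the list-of-numbers representation. The function names and the two sample people are illustrative assumptions, not part of the slides:

```python
def vector_add(v, w):
    """Add two vectors component by component."""
    assert len(v) == len(w), "vectors must have the same length"
    return [v_i + w_i for v_i, w_i in zip(v, w)]

def scalar_multiply(c, v):
    """Multiply every component of v by the number c."""
    return [c * v_i for v_i in v]

def dot(v, w):
    """Sum of the component-wise products of v and w."""
    assert len(v) == len(w), "vectors must have the same length"
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

# Hypothetical (height in cm, weight in kg, age in years) vectors.
person_a = [170, 70, 40]
person_b = [180, 80, 30]

total = vector_add(person_a, person_b)
average = scalar_multiply(0.5, total)
print(total, average)
```

Plain lists are easy to follow but slow for large data; libraries such as NumPy provide the same operations far more efficiently.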
17. Matrices
❖ A matrix is a two-dimensional collection of numbers.
❖ We can represent matrices as lists of lists
❖ We can use a matrix to represent a data set consisting of multiple
vectors
Example: if you had the heights, weights, and ages of 1,000 people, you could put
them in a 1,000 × 3 matrix
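
A minimal sketch of the list-of-lists representation, where each inner list is one person's vector. The helper names and the three sample rows are assumptions for illustration:

```python
def shape(matrix):
    """Return (number of rows, number of columns)."""
    num_rows = len(matrix)
    num_cols = len(matrix[0]) if matrix else 0
    return num_rows, num_cols

def get_row(matrix, i):
    """Row i is simply the i-th inner list."""
    return matrix[i]

def get_column(matrix, j):
    """Column j collects the j-th entry of every row."""
    return [row[j] for row in matrix]

# Hypothetical data: one row per person, columns are height, weight, age.
people = [[170, 70, 40],
          [180, 80, 30],
          [160, 55, 25]]

print(shape(people))
print(get_column(people, 0))  # all heights
```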
18. Linear Algebra + Data Science
Some data mining applications use huge matrices to extract useful
information from large, often unstructured, sets of data.
Example: search engines extract information from the Web pages available
on the Internet, and the core of the Google search engine is a matrix
computation
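
The slides do not name the computation. One well-known example is a PageRank-style power iteration, which repeatedly redistributes each page's score along its outgoing links (equivalent to multiplying a score vector by a link matrix). The sketch below is a simplified illustration on a made-up three-page web, not Google's actual implementation; the damping value and iteration count are assumptions:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links[i] lists the pages that page i links to.
    Assumes every page has at least one outgoing link.
    """
    n = len(links)
    ranks = [1.0 / n] * n
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_ranks = [(1.0 - damping) / n] * n
        for page, outgoing in enumerate(links):
            # A page splits its damped score evenly among its links.
            share = damping * ranks[page] / len(outgoing)
            for target in outgoing:
                new_ranks[target] += share
        ranks = new_ranks
    return ranks

# Hypothetical web: page 0 links to 1 and 2, page 1 links to 2,
# page 2 links back to 0.
links = [[1, 2], [2], [0]]
ranks = pagerank(links)
print(ranks)
```

Page 2 ends up ranked above page 1 because it is linked from both other pages, while page 1 receives only half of page 0's score.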
23. Statistics
Statistics refers to the mathematics and techniques with which we
understand data.
Mean
Median
Range
Variance
Standard Deviation
…
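
The measures listed above can be sketched from scratch as follows. The sample heights are made up, and the variance shown is the sample variance (dividing by n - 1), which is one common convention rather than the only one:

```python
import math

def mean(xs):
    return sum(xs) / float(len(xs))

def median(xs):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2.0

def data_range(xs):
    return max(xs) - min(xs)

def variance(xs):
    """Sample variance; needs at least two values."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def standard_deviation(xs):
    return math.sqrt(variance(xs))

# Hypothetical heights in cm.
heights = [160, 165, 170, 175, 180]
print(mean(heights), median(heights), data_range(heights))
print(variance(heights), standard_deviation(heights))
```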
24. Statistics
Framing questions statistically allows us to leverage data resources to
extract knowledge and obtain better answers.
A statistical framework allows researchers to distinguish between
causation and correlation, and thus to identify interventions that will
cause changes in outcomes.
It also establishes methods for prediction and estimation, and quantifies
their degree of certainty.
25. Probability
It is hard to do data science without some understanding of probability
and its mathematics.
Conditional Probability
Bayes’s Theorem
Random Variables
Continuous Distributions
Normal Distribution
…
In an uncertain world, it is of immense help to know and understand the
chances of various events, so that you can plan accordingly.
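
As a worked illustration of conditional probability and Bayes's Theorem, the sketch below computes P(H | E) from P(H), P(E | H), and P(E | not H). The function name and the disease-test numbers are hypothetical:

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """Return P(H | E) given P(H), P(E | H), and P(E | not H)."""
    # Total probability of seeing the evidence at all.
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical test: 1 person in 10,000 has the disease, the test is
# positive for 99% of people who have it and for 1% of people who do not.
posterior = bayes_posterior(prior=0.0001,
                            likelihood=0.99,
                            false_positive_rate=0.01)
print(posterior)
```

Even with a positive result, the chance of having the disease under these assumed numbers is just under 1%, because people without the disease vastly outnumber those with it.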
26. 6. Visualizing Data
I believe that visualization is one of the most powerful means of achieving
personal goals.
—Harvey Mackay
27. Visual comprehension speed: the brain receives 8.96 megabits of data
from the eye every second.
Reading comprehension speed: the average person comprehends 120 words per
minute when reading.
31. Current Examples
A Day in the Life, NYC Taxis
http://chriswhong.github.io/nyctaxi/
U.S. Gun Deaths in 2013
http://www.guns.periscopic.com/?year=2013
32. Tools for Data Visualization
❖ Matplotlib
❖ Seaborn
❖ D3.js
❖ Bokeh
❖ Ggplot
❖ R
33. Example with R
◉ Iris data set
◉ Iris is a data frame with 150 cases (rows) and 5 variables (columns)
named Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and
Species.