DATA SCIENCE 101: A Layman’s Tour of Data Science
with Todd Cioffi
OPEN DATA SCIENCE CONFERENCE
BOSTON 2015
@opendatasci
opendatascicon.com
GOALS FOR THE SESSION:
 Introduce Terminology
 Explain Concepts
 Get You Comfortable
– Understand the conversation
– Even if you don’t know how to do it
TODD CIOFFI - DATA SCIENCE 101: A LAYMAN’S TOUR OF DATA SCIENCE - OPEN DATA SCIENCE CONFERENCE -
#ODSC - BOSTON 2015
BIG PICTURE
Infrastructure
Big Data: “The 3 (or 4…) Vs”
 Volume
 Velocity
 Variety
Internet of Things (IoT)
Cloud: A Business Model, not a Technology
 NIST in a nutshell
 Requestable
 Available
 Shareable
 Scalable
 Measurable
 IaaS / PaaS / SaaS (vs. SAS) / *aaS
 Plan for Failure
Math
Business Intelligence (BI)
Business Analytics
Data Analytics
xxx Analytics**
Code
Machine Learning
Data Mining
Deep Learning
Data Visualization
DATA
Traditional (‘70s) - RDBMS
 Controlled Input
 Controlled Structure
 SQL: Structured Query Language
 ACID
 Atomic
 Consistent
 Isolated
 Durable
 “Real Time”
 A fiction
Today
 Democratized Input
 Flexible Structure
 NoSQL
 MongoDB / Cassandra / …
 Text
 XML / JSON / XBRL / …
 Multimedia: Images, Audio, Video
 Hadoop: MapReduce^ / Pig / Hive / Flume / …
 Spark / Storm / Kafka / …
 Graph DBs, Semantic Web, …
 CAP Theorem
 Consistency, Availability, Partition tolerance
 BASE
 Basically Available, Soft state, Eventually consistent
 Idempotence: once or many = same resultant state
 Plan for Failure
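Idempotence is easiest to see in a small sketch. The function names and the dictionary "store" below are hypothetical, not from the talk; the point is only that replaying an idempotent operation after a failure leaves the same state, while replaying a non-idempotent one does not.

```python
# Hypothetical in-memory "account store"; all names are illustrative only.

def set_balance(store, account, amount):
    """Idempotent: applying it once or many times gives the same state."""
    store[account] = amount
    return store

def add_to_balance(store, account, amount):
    """Not idempotent: every retry changes the state again."""
    store[account] = store.get(account, 0) + amount
    return store

# Apply the idempotent write once, then three times in a row.
once = set_balance({}, "alice", 100)
thrice = {}
for _ in range(3):
    set_balance(thrice, "alice", 100)

# Apply the increment once, then twice (as a blind retry would).
add_once = add_to_balance({}, "alice", 100)
add_twice = add_to_balance(add_to_balance({}, "alice", 100), "alice", 100)

print("set is retry-safe:", once == thrice)
print("add is retry-safe:", add_once == add_twice)
```

When planning for failure, an operation that may be retried blindly is much easier to reason about if it looks like `set_balance` rather than `add_to_balance`.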
STAGES OF ANALYTICS
Descriptive
 What happened?
Predictive
 What is going to happen?
Prescriptive
 How do we influence what is going to happen?
 What do we do?
SUMMARY
ANALYTICS DEFINITIONS
“Analytics is defined as the extensive use of data, statistical and quantitative
analysis, exploratory and predictive models, and fact-based management to
drive decisions and actions.” - Tom Davenport, Competing on Analytics
“Analytics is the discovery and communication of meaningful patterns in
data. … analytics relies on the simultaneous application of statistics,
computer programming and operations research to quantify performance. …
Analytics is a multi-dimensional discipline. There is extensive use of
mathematics and statistics, the use of descriptive techniques and predictive
models to gain valuable knowledge from data - data analysis. The insights
from data are used to recommend action or to guide decision making rooted
in business context. Thus, analytics is not so much concerned with
individual analyses or analysis steps, but with the entire methodology.” –
Wikipedia
“By any definition, analytics uses quantitative methods to explore data and
reveal patterns within. Useful patterns can be formulated into reusable
models. Applied to business, these models are then used to derive insight,
prompting data-driven action.” – Todd Cioffi, RMU1
ANALYTICS TOOLS: A SAMPLE
Enterprise (Scale and Cost)
 SAS
 SPSS
 STATA
 MATLAB
 BlueMix (IBM Watson)
Open Source
 R
 Python
 Weka
 Octave
 RapidMiner*, Knime, …
Freemium (Hybrid)
 Dozens (Gartner, KDnuggets, …)
DATA VIZ: TYPES AND TOOLS
Scatter: x, y (z)
Beyond Bar, Pie, Stacked Bar, …
 Histogram (not a Bar)
 Box & Whisker, Violin
 Heatmap
 Bubble
 “Spider”
How many axes are you trying to
represent?
What kinds of info do people
understand?
R
 ggplot2
Python
 matplotlib
 seaborn
D3.js
Plot.ly
Tableau
TIBCO Spotfire
Qlikview
FAMOUS DATA VIZ THRU HISTORY
Snow and Cholera
Nightingale and the Crimea
Minard and Napoleon
Edward Tufte
CRISP-DM
CRoss Industry Standard Process for Data Mining
“CRISP”
CRISP: DRILL DOWN
Business Understanding:
 Business Objectives
Why are we doing this?
What are we trying to achieve?
 Data Mining Goals
 Definition of success criteria
CRISP: DRILL DOWN
Data Understanding:
We need to understand the data that we will be using:
 EDA: Exploratory Data Analysis
 What attributes did we collect as data? Customers? Patients?
Events? …
 How are those attributes coded? What do our data points mean?
 How is our data quality?
 How, where, why, and by whom our data was collected may be
important.
 The data that we didn’t collect may also be relevant.
 Data exploration might reveal unexpected, even surprising,
properties.
 Relative importance of various attributes
CRISP: DRILL DOWN
Data Preparation:
Once we have a handle on our data, we need to prepare it for the
Modeling step. This is where we shape and transform our data into
the appropriate usable format. This includes: selecting columns,
sampling rows, deriving new or compound variables, filtering data,
and merging data sources.
• The representation of data is a key to success. The wrong
representation can hide important patterns.
• Different Modeling approaches need different data representations.
• As we learn more, and/or try new models, we might come back to
this step.
• Expect to spend time on this phase: it almost always takes more than half,
and sometimes as much as 90%, of total analysis time.
CRISP: DRILL DOWN
Modeling:
This is where we search for patterns in our data. These patterns
winnow out unnecessary data and characterize the influence of
attributes that matter.
From these patterns, we can create a model that is not only
descriptive, but predictive.
• There are many different kinds of models, each looking at the data
from a different perspective.
• We may want to try different models, and different parameters
within algorithms, to find our best results.
CRISP: DRILL DOWN
The Evaluation phase looks in two directions:
We need to validate our model from the prior CRISP-DM step.
 Precision, applicability, and understandability are all parts of a trade-off
 Understandable models giving deeper insights are often preferred over more
accurate models.
We also need to evaluate our progress towards our business goals.
 Does this model help us meet our success criteria?
 Does new insight here funnel back into our business understanding?
 Should we loop through CRISP-DM again with our new information?
CRISP: DRILL DOWN
Deployment:
Once we have results that meet our goals, we need to put them into
use, otherwise the effort is lost.
• At any point in the process, we could take our results and gain new
Business Understanding, creating an opportunity to cycle through the
CRISP-DM model again, gaining even more value from our data.
Models age…
MODELING: THE FUN BITS
We want to find patterns in our data, then use these patterns to predict outcomes.
How does that happen?
By analyzing our data, we can derive a set of “rules” or a “formula” that describes
some behavior.
 Examples like “this” tended to fall into this pile. Examples like “that” tended to fall into that pile.
Collectively, the rules we assemble are called a model.
The process of finding and deriving the model is called training.
The data used for training is called training data.
Once we have established our pattern - or model - we can run similar examples
through our rules and predict where they would fall. This is called model application
or applying the model.
Example: based on this customer’s profile, knowing what we know, do we expect
churn or no churn? We could then take that answer and decide whether to take
action in order to hold them.
There are many different approaches used to search for patterns in data. We will see
a handful of them in this session.
When any approach gets developed to the point where it can be described with a
SO LET’S GET STARTED WITH
MODELING…
WHAT IS A COYOTE?
Your six-year-old nephew thinks that there are only five kinds of
animals:
1) Kitty
2) Puppy
3) Horsey
4) Birdie
5) Fishie
What does he think a coyote is?
Why?
K-NEAREST NEIGHBOR
k-Nearest Neighbor (k-NN) is a very intuitive approach:
 To find out what something is like, see what the things closest to it are like.
Two key questions:
What is “near”?
 Euclidean Distance
 Cosine Similarity
 Manhattan Distance
Which neighbors? How many?
 K many…
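The two questions above ("what is near?" and "how many neighbors?") map directly onto code. The sketch below is a minimal illustration, not the speaker's implementation: the animal measurements are invented, and the coyote is classified by a majority vote of its k nearest labeled examples. Note that cosine similarity is a similarity, so larger means closer, unlike the two distances.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City-block distance: sum of absolute differences per axis."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Angle-based similarity: 1.0 means same direction (larger = closer)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn_predict(training, query, k=3, distance=euclidean):
    """training is a list of (features, label); return the majority label
    among the k training examples closest to the query."""
    nearest = sorted(training, key=lambda ex: distance(ex[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Invented examples described by (weight in kg, height in cm).
animals = [
    ((4, 25), "kitty"), ((5, 23), "kitty"),
    ((20, 55), "puppy"), ((30, 60), "puppy"), ((25, 58), "puppy"),
    ((500, 160), "horsey"), ((450, 150), "horsey"),
]
coyote = (15, 60)

print(knn_predict(animals, coyote, k=3))
```

With only the labels the nephew knows, the coyote lands in whichever known pile its nearest neighbors belong to.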
WHICH DOT IS CLOSER?
10^-3
10^6
How about now?
NORMALIZATION
Orders of Magnitude
 Also consider significant digits
Range
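Z-Transform (standardization: subtract the mean, divide by the standard deviation)
Leaking data: normalization is also a model (fit it on the training data only)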
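A minimal sketch of range and Z normalization follows, assuming invented income figures. It also illustrates the leakage point: the minimum, maximum, mean, and standard deviation are learned from the training data, and the same learned parameters are then reused on new data.

```python
import statistics

def fit_range(values):
    """Learn min and max, return a function that rescales to 0..1."""
    lo, hi = min(values), max(values)
    return lambda v: (v - lo) / (hi - lo)

def fit_z(values):
    """Learn mean and standard deviation, return a z-score function."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return lambda v: (v - mean) / sd

# Invented figures, for illustration only.
train_income = [30_000, 45_000, 60_000, 90_000]
test_income = [50_000]

# The normalization parameters come from the training data only.
scale = fit_range(train_income)
z = fit_z(train_income)

train_scaled = [scale(v) for v in train_income]
test_scaled = [scale(v) for v in test_income]   # reuses training min and max
train_z = [z(v) for v in train_income]

print(train_scaled, test_scaled)
```

Fitting the scaler on training and test data together would let information about the test set leak into the model.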
K-NN IN YOUR HEAD…
K = 1
Train on full data set
How accurate?
What did we learn?
Why?
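The thought experiment can be run as a sketch. The data below is random, with coin-flip labels, so there is no pattern to find; all names are illustrative. With K = 1, every training point is its own nearest neighbor, so scoring on the training set says nothing about new data.

```python
import math
import random

def nearest_label(training, query):
    """1-NN: return the label of the single closest training example."""
    return min(training, key=lambda ex: math.dist(ex[0], query))[1]

def accuracy(training, examples):
    """Fraction of examples whose 1-NN prediction matches their label."""
    hits = sum(nearest_label(training, x) == y for x, y in examples)
    return hits / len(examples)

def random_examples(rng, n):
    # Labels are coin flips, so there is no real pattern to learn.
    return [((rng.random(), rng.random()), rng.choice("AB")) for _ in range(n)]

rng = random.Random(0)
train = random_examples(rng, 50)
fresh = random_examples(rng, 50)

train_acc = accuracy(train, train)   # each point finds itself
fresh_acc = accuracy(train, fresh)   # unseen points from the same source

print("accuracy on training data:", train_acc)
print("accuracy on fresh data:   ", fresh_acc)
```

A perfect score on the training data here reflects memorization, not learning.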
OVERFIT
The purpose of modeling is
to find a generalizable
pattern that will tell you
about new data.
If your model fits your
current data too closely, it
loses general utility.
Kaggle Titanic
 what about “new” passengers?
TESTING & VALIDATION
So how do we plan for “new” data when we’re working with one set of
current data?
Hold-Out or Split validation
Cross-Validation
Leave One Out
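The three schemes differ only in how the examples are divided. A minimal sketch, with hypothetical function names and a stand-in list of ten examples:

```python
import random

def hold_out_split(examples, test_fraction=0.3, seed=42):
    """Shuffle once, then set aside a fixed fraction for testing."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def k_fold(examples, k):
    """Yield k (train, test) pairs; every example is tested exactly once."""
    folds = [examples[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        yield train, test

def leave_one_out(examples):
    """Cross-validation taken to the limit: one example per fold."""
    return k_fold(examples, k=len(examples))

examples = list(range(10))          # stand-in for ten labeled examples
train, test = hold_out_split(examples)
folds = list(k_fold(examples, k=5))

print(len(train), len(test), len(folds))
```

In each scheme the model is trained on the `train` part and scored on the `test` part, which stands in for "new" data.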
CONFUSION MATRIX
Performance Measures
 Accuracy / Error
 What is the value of knowing the ratio of the number right (or wrong) to the total?
 Precision / Recall
 “You have cancer...”
 Precision: how many with positive tests actually have cancer?
 Recall: how many with cancer tested positive?
 Sensitivity / Specificity
 “You have cancer...”
 Sensitivity: how many with cancer tested positive? (see: recall)
 Specificity: how many without cancer tested negative?
 Here is a handy URL to know:
http://www.damienfrancois.be/blog/files/modelperfcheatsheet.pdf
+ -
+’ A B
-’ C D
CONFUSION MATRIX, ARRANGED
              Reality
               +    -
Predicted  +’  A    B
           -’  C    D
Accuracy = (A+D) / (A+B+C+D)
Error = (B+C) / (A+B+C+D)
      = 1 - (A+D) / (A+B+C+D)
Precision = A / (A+B)
Recall = A / (A+C) = Sensitivity
Specificity = D / (D+B)
You have Cancer...
http://www.damienfrancois.be/blog/files/modelperfcheatsheet.pdf
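The measures on this slide can be computed directly from the four cells. The sketch below uses the slide's A, B, C, D layout (predicted in rows, reality in columns); the screening counts are invented to show why accuracy alone can mislead when the condition is rare.

```python
def performance(A, B, C, D):
    """A = predicted + and really +   (true positives)
    B = predicted + but really -   (false positives)
    C = predicted - but really +   (false negatives)
    D = predicted - and really -   (true negatives)"""
    total = A + B + C + D
    return {
        "accuracy": (A + D) / total,
        "error": (B + C) / total,
        "precision": A / (A + B),
        "recall": A / (A + C),          # also called sensitivity
        "specificity": D / (D + B),
    }

# Invented screening of 1,000 people, 10 of whom actually have the disease.
m = performance(A=8, B=99, C=2, D=891)

for name, value in m.items():
    print(f"{name:12s} {value:.3f}")
```

In this invented case the test is right about 90% of the time overall, yet most people who test positive do not have the disease, which is what precision measures.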
CORRELATION
Meaning:
 Do things tend to move together?
Range
 To what degree?
 Same or opposite?
 -1 … 1
Not meaning
 “Correlation does not equal Causation”
 http://www.tylervigen.com/
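A minimal sketch of the Pearson correlation coefficient, the usual measure behind the -1 … 1 range, follows. The number lists are invented. The last case shows a limit of the measure: a clear U-shaped relationship can still have zero linear correlation.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: +1 moves together, -1 moves opposite, 0 no linear trend."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

x = [1, 2, 3, 4, 5]
same = [2, 4, 6, 8, 10]        # rises with x
opposite = [10, 8, 6, 4, 2]    # falls as x rises
u_shape = [2, 1, 0, 1, 2]      # related to x, but not linearly

print(pearson_r(x, same), pearson_r(x, opposite), pearson_r(x, u_shape))
```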
LINEAR REGRESSION AND OTHER
“LINES”
Y = MX + B
Height / Weight of Dog
y = m_1 x_1 + m_2 x_2 + ... + m_n x_n + b
Dependent / independent variable
 Cigs / cancer, but not cancer v cigs
SVM: Support Vector Machine
 Line > Plane > Hyperplane
FUNNY THING ABOUT LINES:
ANSCOMBE’S QUARTET
I II III IV
x y x y x y x y
10.0 8.04 10.0 9.14 10.0 7.46 8.0 6.58
8.0 6.95 8.0 8.14 8.0 6.77 8.0 5.76
13.0 7.58 13.0 8.74 13.0 12.74 8.0 7.71
9.0 8.81 9.0 8.77 9.0 7.11 8.0 8.84
11.0 8.33 11.0 9.26 11.0 7.81 8.0 8.47
14.0 9.96 14.0 8.10 14.0 8.84 8.0 7.04
6.0 7.24 6.0 6.13 6.0 6.08 8.0 5.25
4.0 4.26 4.0 3.10 4.0 5.39 19.0 12.50
12.0 10.84 12.0 9.13 12.0 8.15 8.0 5.56
7.0 4.82 7.0 7.26 7.0 6.42 8.0 7.91
5.0 5.68 5.0 4.74 5.0 5.73 8.0 6.89
Property (in each case)        Value
Mean of x                      9 (exact)
Sample variance of x           11 (exact)
Mean of y                      7.50 (to 2 places)
Sample variance of y           4.122 or 4.127 (to 3 places)
Correlation between x and y    0.816 (to 3 places)
Linear regression line         y = 3.00 + 0.500x (to 2 and 3 places, respectively)
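The summary values can be checked from the table itself. The sketch below copies the four data sets from the slide and computes each property with an ordinary least-squares fit; the helper name `summarize` is illustrative.

```python
import statistics

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]   # shared by sets I, II, III
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(xs, ys):
    """Means, sample variances, correlation, and least-squares line y = b + m*x."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "var_x": statistics.variance(xs),
        "mean_y": mean_y,
        "var_y": statistics.variance(ys),
        "r": sxy / (sxx * syy) ** 0.5,
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }

summaries = {name: summarize(xs, ys) for name, (xs, ys) in quartet.items()}
for name, s in summaries.items():
    print(name, {key: round(value, 3) for key, value in s.items()})
```

The four sets produce nearly identical numbers, which is why plotting them matters.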
ANSCOMBE’S
QUARTET
DATA TYPES
Numerical
 Integer
 Real
 Date-time
Nominal
 Binominal (either / or)
 Polynominal (categorical)
 Corpus
Scalar, Ordinal, Categorical
Dummy coding
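Dummy coding turns a nominal attribute into numeric 0/1 columns so that distance- and line-based models can use it. This minimal sketch keeps one column per category (often called one-hot coding); some tools drop one column as a reference level. The function name and sample values are illustrative.

```python
def dummy_code(values):
    """Replace each nominal value with a row of 0/1 indicator columns."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]

animals = ["kitty", "puppy", "horsey", "puppy"]
coded = dummy_code(animals)

for original, row in zip(animals, coded):
    print(original, row)
```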
NAIVE BAYES
Bayes: Simple probabilistic counting
          Pop    Smoke   Joint
Men       0.65   0.12    0.0780  +
                 0.88    0.5720  -
Women     0.35   0.07    0.0245  +
                 0.93    0.3255  -
          1      1       1
Smokers                  0.1025
P(W|+)                   0.2390
M or N/S                 0.9755
Sun, Wind, Precip > play outside
Example contains a given word
What does that mean about future
examples with same word (or word
combo)?
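The numbers in the smoking table follow from simple multiplication and Bayes' rule. The sketch below reproduces them, assuming the columns mean population share and smoking rate within each group (an interpretation of the slide, not stated on it); the variable names are illustrative.

```python
# Assumed reading of the table: share of population, then smoking rate per group.
p_men, p_women = 0.65, 0.35
p_smoke_given_men, p_smoke_given_women = 0.12, 0.07

# Joint probabilities: P(group and smoker) = P(group) * P(smoker | group)
men_smokers = p_men * p_smoke_given_men          # slide: 0.0780
women_smokers = p_women * p_smoke_given_women    # slide: 0.0245

# Total probability of being a smoker
p_smoker = men_smokers + women_smokers           # slide: 0.1025

# Bayes' rule: given a smoker, how likely is it a woman?
p_woman_given_smoker = women_smokers / p_smoker  # slide: 0.2390

# Man OR non-smoker = everyone except women who smoke
p_man_or_nonsmoker = 1 - women_smokers           # slide: 0.9755

print(round(p_smoker, 4), round(p_woman_given_smoker, 4), round(p_man_or_nonsmoker, 4))
```

The same counting applies to words in text: the share of past examples containing a word, per class, is used to estimate the class of a future example with that word.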
RULES AND TREES
Rule Induction
 +++++++ ------
 ++-- + ++ -+ ---
Decision Trees
Random Forest
SAMPLING
Rows, Records, Documents, Examples
Spreadsheets think of data in rows. They are using a two-dimensional ledger
(worksheets = 3-D).
Databases use the term records (or documents) to identify the storage of one
item. The display might seem linear, but the metaphor relating to real life is
capturing more. Think of a medical record, a personnel file, or other such
documents. These are even potentially multi-dimensional.
Data scientists use the term examples. Whether a research biologist, a
marketer, or political scientist, they are thinking in terms of populations –
cohorts, customers, voters. Out of a given population, each individual is an
example. From those examples, we find patterns.
Linear, Shuffled, Stratified
Kennard-Stone
Over / Under
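Linear, shuffled, and stratified sampling can be contrasted in a few lines. This is a minimal sketch with an invented population of 20 "churn" and 80 "stay" examples stored in order; the function names are illustrative, and Kennard-Stone and over/under-sampling are not covered here.

```python
import random
from collections import defaultdict

def linear_sample(examples, n):
    """Take the first n examples in stored order."""
    return examples[:n]

def shuffled_sample(examples, n, seed=0):
    """Take n examples at random, ignoring stored order."""
    return random.Random(seed).sample(examples, n)

def stratified_sample(examples, fraction, label_of, seed=0):
    """Sample the same fraction from each class, preserving class proportions."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for ex in examples:
        groups[label_of(ex)].append(ex)
    sample = []
    for members in groups.values():
        sample.extend(rng.sample(members, round(len(members) * fraction)))
    return sample

# Invented population, sorted by label as a database export might be.
population = [("churn", i) for i in range(20)] + [("stay", i) for i in range(80)]

lin = linear_sample(population, 10)
shuf = shuffled_sample(population, 10)
strat = stratified_sample(population, 0.1, label_of=lambda ex: ex[0])

print([label for label, _ in lin])
print([label for label, _ in strat])
```

On sorted data the linear sample sees only one class, while the stratified sample keeps the 20/80 mix of the population.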
CHECKERBOARD SET
CHECKERBOARD SET: SAMPLE 0.05
CHECKERBOARD SET: SAMPLE 0.05
K-S
CHECKERBOARD SET: OVER-/UNDER-SAMPLE
FEATURE SELECTION
In the same way that spreadsheets use 2-D
columns, and databases use data fields
to make up a record, each example in our
population is described by some number
of attributes - also called properties,
variables, or features.
Forward Selection
Backward Elimination
Evolutionary
DIMENSIONALITY REDUCTION
Ht/Wt graph
 Food/mo (lbs)
 Toy purchases ($)
 Leash width (mm)
 Property damage ($)
 Stool volume (ml?)
Helmets
Clothes
SUPERVISED LEARNING
What does it mean?
Target variable / feature / attribute / label
What else could one do?
Unsupervised learning
AKA Classification and Clustering
Not the same thing, but one can feed the other
K-MEANS CLUSTERING
Clustering modeler
Iterative distance-based assessment
• Start w/ Random Seeds
• Assign each point to closest seed
• Move seed to center of cluster
• Lather, rinse, repeat until mean doesn’t move (or oscillates) and clusters don’t
change.
How many clusters?
 k many
Then what happens?
• Could turn cluster assignments into classification labels
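The loop described on this slide can be sketched in plain Python. This is a minimal illustration under simple assumptions (Euclidean distance, seeds picked from the data points, invented 2-D points), not a production implementation.

```python
import math
import random

def k_means(points, k, seed=0, max_iter=100):
    """Return (centers, clusters) after iterating assign-then-move."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)               # start with random seeds
    clusters = []
    for _ in range(max_iter):
        # Assign each point to its closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            closest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[closest].append(p)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        new_centers = [
            tuple(sum(axis) / len(cl) for axis in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:                # means stopped moving
            break
        centers = new_centers
    return centers, clusters

# Two invented, well-separated groups of points.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
centers, clusters = k_means(points, k=2)

# Cluster membership can then serve as a classification label.
labels = {p: i for i, cl in enumerate(clusters) for p in cl}
print(centers)
```

The final dictionary shows the last bullet in practice: each point's cluster index becomes a label that a supervised model could be trained on.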
OUTLIER DETECTION
Distance
Density
LOF: Local Outlier Factor (local density)
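A distance-based outlier score can be sketched as the average distance from each point to its k nearest neighbors: the larger the score, the more isolated the point. This is a simplified stand-in, not LOF itself, and the points are invented.

```python
import math

def knn_distance_scores(points, k=2):
    """Score each point by its mean distance to its k nearest other points."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

points = [(1, 1), (1, 2), (2, 1), (2, 2), (9, 9)]
scores = knn_distance_scores(points)
outlier = points[scores.index(max(scores))]

print(outlier, [round(s, 2) for s in scores])
```

Density-based methods such as LOF refine this idea by comparing each point's score with the scores of its neighbors.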
SCALE
Began with Big Data
PA (predictive analytics) at scale – how are algorithms impacted?
Memory and Calculation constraints
NO FREE LUNCH
No single algorithm is the “best” for all data sets
Different algorithms are often used in different situations
 Naïve Bayes is common in Spam filters
 Outlier Detection is helpful with Fraud
 Clustering works well for Recommendation engines and identifying other marketing
demos
TODD CIOFFI - DATA SCIENCE 101: A LAYMAN’S TOUR OF DATA SCIENCE - OPEN DATA SCIENCE CONFERENCE -
#ODSC - BOSTON 2015
49

Mais conteúdo relacionado

Mais procurados

Keynote - An overview on Big Data & Data Science - Dr Gregory Piatetsky-Shapiro
Keynote -  An overview on Big Data & Data Science - Dr Gregory Piatetsky-ShapiroKeynote -  An overview on Big Data & Data Science - Dr Gregory Piatetsky-Shapiro
Keynote - An overview on Big Data & Data Science - Dr Gregory Piatetsky-ShapiroData ScienceTech Institute
 
Data science e machine learning
Data science e machine learningData science e machine learning
Data science e machine learningGiuseppe Manco
 
Creating a Data Science Ecosystem for Scientific, Societal and Educational Im...
Creating a Data Science Ecosystem for Scientific, Societal and Educational Im...Creating a Data Science Ecosystem for Scientific, Societal and Educational Im...
Creating a Data Science Ecosystem for Scientific, Societal and Educational Im...Ilkay Altintas, Ph.D.
 
2015 data-science-salary-survey
2015 data-science-salary-survey2015 data-science-salary-survey
2015 data-science-salary-surveyAdam Rabinovitch
 
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Data Science London
 
A Practical-ish Introduction to Data Science
A Practical-ish Introduction to Data ScienceA Practical-ish Introduction to Data Science
A Practical-ish Introduction to Data ScienceMark West
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data scienceSampath Kumar
 
Data science vs. Data scientist by Jothi Periasamy
Data science vs. Data scientist by Jothi PeriasamyData science vs. Data scientist by Jothi Periasamy
Data science vs. Data scientist by Jothi PeriasamyPeter Kua
 
What data scientists really do, according to 50 data scientists
What data scientists really do, according to 50 data scientistsWhat data scientists really do, according to 50 data scientists
What data scientists really do, according to 50 data scientistsHugo Bowne-Anderson
 
Intro to Data Science for Non-Data Scientists
Intro to Data Science for Non-Data ScientistsIntro to Data Science for Non-Data Scientists
Intro to Data Science for Non-Data ScientistsSri Ambati
 
1. introduction to data science —
1. introduction to data science —1. introduction to data science —
1. introduction to data science —swethaT16
 
Data science presentation 2nd CI day
Data science presentation 2nd CI dayData science presentation 2nd CI day
Data science presentation 2nd CI dayMohammed Barakat
 
Data Science: Not Just For Big Data
Data Science: Not Just For Big DataData Science: Not Just For Big Data
Data Science: Not Just For Big DataRevolution Analytics
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceANOOP V S
 
Introduction to Data Science and Analytics
Introduction to Data Science and AnalyticsIntroduction to Data Science and Analytics
Introduction to Data Science and AnalyticsSrinath Perera
 
The Other 99% of a Data Science Project
The Other 99% of a Data Science ProjectThe Other 99% of a Data Science Project
The Other 99% of a Data Science ProjectEugene Mandel
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceNiko Vuokko
 
Data Science Tutorial | Introduction To Data Science | Data Science Training ...
Data Science Tutorial | Introduction To Data Science | Data Science Training ...Data Science Tutorial | Introduction To Data Science | Data Science Training ...
Data Science Tutorial | Introduction To Data Science | Data Science Training ...Edureka!
 

Mais procurados (20)

Keynote - An overview on Big Data & Data Science - Dr Gregory Piatetsky-Shapiro
Keynote -  An overview on Big Data & Data Science - Dr Gregory Piatetsky-ShapiroKeynote -  An overview on Big Data & Data Science - Dr Gregory Piatetsky-Shapiro
Keynote - An overview on Big Data & Data Science - Dr Gregory Piatetsky-Shapiro
 
Data science e machine learning
Data science e machine learningData science e machine learning
Data science e machine learning
 
Hadoop Meets Scrum
Hadoop Meets ScrumHadoop Meets Scrum
Hadoop Meets Scrum
 
Creating a Data Science Ecosystem for Scientific, Societal and Educational Im...
Creating a Data Science Ecosystem for Scientific, Societal and Educational Im...Creating a Data Science Ecosystem for Scientific, Societal and Educational Im...
Creating a Data Science Ecosystem for Scientific, Societal and Educational Im...
 
2015 data-science-salary-survey
2015 data-science-salary-survey2015 data-science-salary-survey
2015 data-science-salary-survey
 
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
 
A Practical-ish Introduction to Data Science
A Practical-ish Introduction to Data ScienceA Practical-ish Introduction to Data Science
A Practical-ish Introduction to Data Science
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 
Data science vs. Data scientist by Jothi Periasamy
Data science vs. Data scientist by Jothi PeriasamyData science vs. Data scientist by Jothi Periasamy
Data science vs. Data scientist by Jothi Periasamy
 
Intro to Data Science by DatalentTeam at Data Science Clinic#11
Intro to Data Science by DatalentTeam at Data Science Clinic#11Intro to Data Science by DatalentTeam at Data Science Clinic#11
Intro to Data Science by DatalentTeam at Data Science Clinic#11
 
What data scientists really do, according to 50 data scientists
What data scientists really do, according to 50 data scientistsWhat data scientists really do, according to 50 data scientists
What data scientists really do, according to 50 data scientists
 
Intro to Data Science for Non-Data Scientists
Intro to Data Science for Non-Data ScientistsIntro to Data Science for Non-Data Scientists
Intro to Data Science for Non-Data Scientists
 
1. introduction to data science —
1. introduction to data science —1. introduction to data science —
1. introduction to data science —
 
Data science presentation 2nd CI day
Data science presentation 2nd CI dayData science presentation 2nd CI day
Data science presentation 2nd CI day
 
Data Science: Not Just For Big Data
Data Science: Not Just For Big DataData Science: Not Just For Big Data
Data Science: Not Just For Big Data
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Introduction to Data Science and Analytics
Introduction to Data Science and AnalyticsIntroduction to Data Science and Analytics
Introduction to Data Science and Analytics
 
The Other 99% of a Data Science Project
The Other 99% of a Data Science ProjectThe Other 99% of a Data Science Project
The Other 99% of a Data Science Project
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Data Science Tutorial | Introduction To Data Science | Data Science Training ...
Data Science Tutorial | Introduction To Data Science | Data Science Training ...Data Science Tutorial | Introduction To Data Science | Data Science Training ...
Data Science Tutorial | Introduction To Data Science | Data Science Training ...
 

Destaque

DataFinder: A Python Application for Scientific Data Management
DataFinder: A Python Application for Scientific Data ManagementDataFinder: A Python Application for Scientific Data Management
DataFinder: A Python Application for Scientific Data ManagementAndreas Schreiber
 
Machine Learning and Data Mining: 14 Evaluation and Credibility
Machine Learning and Data Mining: 14 Evaluation and CredibilityMachine Learning and Data Mining: 14 Evaluation and Credibility
Machine Learning and Data Mining: 14 Evaluation and CredibilityPier Luca Lanzi
 
Building data flows with Celery and SQLAlchemy
Building data flows with Celery and SQLAlchemyBuilding data flows with Celery and SQLAlchemy
Building data flows with Celery and SQLAlchemyRoger Barnes
 
Cubes - Lightweight Python OLAP (EuroPython 2012 talk)
Cubes - Lightweight Python OLAP (EuroPython 2012 talk)Cubes - Lightweight Python OLAP (EuroPython 2012 talk)
Cubes - Lightweight Python OLAP (EuroPython 2012 talk)Stefan Urbanek
 
Cubes – pluggable model explained
Cubes – pluggable model explainedCubes – pluggable model explained
Cubes – pluggable model explainedStefan Urbanek
 
Data Mining - Classification Of Breast Cancer Dataset using Decision Tree Ind...
Data Mining - Classification Of Breast Cancer Dataset using Decision Tree Ind...Data Mining - Classification Of Breast Cancer Dataset using Decision Tree Ind...
Data Mining - Classification Of Breast Cancer Dataset using Decision Tree Ind...Sunil Nair
 
Bubbles – Virtual Data Objects
Bubbles – Virtual Data ObjectsBubbles – Virtual Data Objects
Bubbles – Virtual Data ObjectsStefan Urbanek
 
A Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiA Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiGrowth Intelligence
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupRafal Kwasny
 
Building a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache SparkBuilding a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache SparkDataWorks Summit
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceLivePerson
 
Building a Real-Time Data Pipeline: Apache Kafka at LinkedIn
Building a Real-Time Data Pipeline: Apache Kafka at LinkedInBuilding a Real-Time Data Pipeline: Apache Kafka at LinkedIn
Building a Real-Time Data Pipeline: Apache Kafka at LinkedInAmy W. Tang
 
Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakHakka Labs
 

Destaque (14)

DataFinder: A Python Application for Scientific Data Management
DataFinder: A Python Application for Scientific Data ManagementDataFinder: A Python Application for Scientific Data Management
DataFinder: A Python Application for Scientific Data Management
 
Data Science 101

  • 1. DATA SCIENCE 101: A Layman’s Tour of Data Science, with Todd Cioffi
      Open Data Science Conference, Boston 2015
      @opendatasci | opendatascicon.com
  • 2. GOALS FOR THE SESSION:
      - Introduce Terminology
      - Explain Concepts
      - Get You Comfortable
          - Understand the conversation
          - Even if you don’t know how to do it
      TODD CIOFFI - DATA SCIENCE 101: A LAYMAN’S TOUR OF DATA SCIENCE - OPEN DATA SCIENCE CONFERENCE - #ODSC - BOSTON 2015
  • 3. BIG PICTURE
      Infrastructure
      - Big Data: “The 3 (or 4…) Vs”
          - Volume
          - Velocity
          - Variety
      - Internet of Things (IoT)
      - Cloud
          - NIST in a nutshell: Requestable, Available, Shareable, Scalable, Measurable
          - IaaS / PaaS / SaaS (vs. SAS) / *aaS
          - Plan for Failure
      Math
      - Business Intelligence (BI)
      - Business Analytics
      - Data Analytics
      - xxx Analytics**
      Code
      - Machine Learning
      - Data Mining
      - Deep Learning
      - Data Visualization
      : A Business Model, not a Technology
  • 4. DATA
      Traditional (’70s): RDBMS
      - Controlled Input
      - Controlled Structure
      - SQL: Structured Query Language
      - ACID: Atomic, Consistent, Isolated, Durable
      - “Real Time”: a fiction
      Today
      - Democratized Input
      - Flexible Structure
      - NoSQL: MongoDB / Cassandra / …
      - Text: XML / JSON / XBRL / …
      - Multimedia: Images, Audio, Video
      - Hadoop: MapReduce^ / Pig / Hive / Flume / …
      - Spark / Storm / Kafka / …
      - Graph DBs, Semantic Web, …
      - CAP Theorem: Consistency, Availability, Partition tolerance
      - BASE: Basically Available, Soft state, Eventually consistent
      - Idempotence: once or many = same resultant state
      - Plan for Failure
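The idempotence bullet on the DATA slide ("once or many = same resultant state") can be illustrated with a minimal Python sketch. The function names, the order and account identifiers, and the amounts are hypothetical and do not come from the talk.

```python
def set_status(orders, order_id, status):
    """Idempotent update: repeating it leaves the same resulting state."""
    updated = dict(orders)
    updated[order_id] = status
    return updated


def add_to_balance(balances, account, amount):
    """Non-idempotent update: every repeat changes the state again."""
    updated = dict(balances)
    updated[account] = updated.get(account, 0) + amount
    return updated


# Applying the idempotent operation once or twice gives the same state.
orders_once = set_status({}, "order-1", "shipped")
orders_twice = set_status(orders_once, "order-1", "shipped")

# Applying the non-idempotent operation twice double-counts the payment.
balance_once = add_to_balance({}, "acct-1", 50)
balance_twice = add_to_balance(balance_once, "acct-1", 50)

print(orders_once == orders_twice)    # True
print(balance_once == balance_twice)  # False
```

This is one reason idempotence sits next to "Plan for Failure": if a failed request is retried and ends up being applied twice, an idempotent operation still leaves the data in the intended state.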
  • 5. STAGES OF ANALYTICS
      Descriptive
      - What happened?
      Predictive
      - What is going to happen?
      Prescriptive
      - How do we influence what is going to happen?
      - What do we do?
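As a rough illustration of the three stages, the sketch below answers each question for a made-up sales series. The numbers, the straight-line forecast, and the capacity rule are all assumptions chosen for brevity, not methods from the talk.

```python
monthly_sales = [100, 110, 120, 130]  # hypothetical units sold per month

# Descriptive: what happened?
average_sales = sum(monthly_sales) / len(monthly_sales)

# Predictive: what is going to happen? (naive straight-line trend)
average_change = (monthly_sales[-1] - monthly_sales[0]) / (len(monthly_sales) - 1)
forecast_next_month = monthly_sales[-1] + average_change

# Prescriptive: what do we do about it?
warehouse_capacity = 135  # hypothetical constraint
action = "add capacity" if forecast_next_month > warehouse_capacity else "no change"

print(average_sales, forecast_next_month, action)
```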
  • 6. SUMMARY
  • 7. ANALYTICS DEFINITIONS
      “Analytics is defined as the extensive use of data, statistical and quantitative analysis, exploratory and predictive models, and fact based management to drive decisions and actions.”
      - Tom Davenport, Competing on Analytics

      “Analytics is the discovery and communication of meaningful patterns in data. … analytics relies on the simultaneous application of statistics, computer programming and operations research to quantify performance. … Analytics is a multi-dimensional discipline. There is extensive use of mathematics and statistics, the use of descriptive techniques and predictive models to gain valuable knowledge from data - data analysis. The insights from data are used to recommend action or to guide decision making rooted in business context. Thus, analytics is not so much concerned with individual analyses or analysis steps, but with the entire methodology.”
      - Wikipedia

      “By any definition, analytics uses quantitative methods to explore data and reveal patterns within. Useful patterns can be formulated into reusable models. Applied to business, these models are then used to derive insight, prompting data-driven action.”
      - Todd Cioffi, RMU1
  • 8. ANALYTICS TOOLS: A SAMPLE
      Enterprise (Scale and Cost)
      - SAS
      - SPSS
      - STATA
      - MATLAB
      - BlueMix (IBM Watson)
      Open Source
      - R
      - Python
      - Weka
      - Octave
      - RapidMiner*, Knime, …
      Freemium (Hybrid)
      - Dozens (Gartner, KDnuggets, …)
  • 9. DATA VIZ: TYPES AND TOOLS
      Types
      - Scatter: x, y (z)
      - Beyond Bar, Pie, Stacked Bar, …
          - Histogram (not a Bar)
          - Box & Whisker, Violin
          - Heatmap
          - Bubble
          - “Spider”
      - How many axes are you trying to represent?
      - What kinds of info do people understand?
      Tools
      - R: ggplot2
      - Python: matplotlib, seaborn
      - D3.js
      - Plot.ly
      - Tableau
      - TIBCO Spotfire
      - Qlikview
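The "Histogram (not a Bar)" point can be made concrete with a small sketch: a bar chart compares separate categories, while a histogram counts how many values of one numeric variable fall into each range. The ages and the bin width below are hypothetical, and the plot is drawn as text so that no plotting library is needed.

```python
from collections import Counter

ages = [23, 27, 31, 35, 36, 38, 41, 44, 52, 67]  # hypothetical customer ages
bin_width = 10

# Each value falls into the bin that starts at its lower multiple of bin_width.
counts = Counter((age // bin_width) * bin_width for age in ages)

for start in range(min(counts), max(counts) + bin_width, bin_width):
    # Empty bins would still be printed: gaps in a histogram carry information.
    print(f"{start}-{start + bin_width - 1}: {'#' * counts[start]}")
```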
  • 10. FAMOUS DATA VIZ THRU HISTORY
      - Snow and Cholera
      - Nightingale and the Crimea
      - Minard and Napoleon
      - Edward Tufte
  • 11. CRISP-DM
      CRoss Industry Standard Process for Data Mining (“CRISP”)
  • 12. CRISP: DRILL DOWN
      Business Understanding:
      - Business Objectives
          - Why are we doing this?
          - What are we trying to achieve?
      - Data Mining Goals
      - Definition of success criteria
  • 13. CRISP: DRILL DOWN
      Data Understanding: We need to understand the data that we will be using:
      - EDA: Exploratory Data Analysis
      - What attributes did we collect as data? Customers? Patients? Events? …
      - How are those attributes coded? What do our data points mean?
      - How is our data quality?
      - How, where, why, and by whom our data was collected may be important.
      - The data that we didn’t collect may also be relevant.
      - Data exploration might reveal unexpected, even surprising, properties.
      - Relative importance of various attributes
  • 14. CRISP: DRILL DOWN
      Data Preparation: Once we have a handle on our data, we need to prepare it for the Modeling step. This is where we shape and transform our data into the appropriate usable format. This includes: selecting columns, sampling rows, deriving new or compound variables, filtering data, and merging data sources.
      - The representation of data is a key to success. The wrong representation can hide important patterns.
      - Different Modeling approaches need different data representations.
      - As we learn more, and/or try new models, we might come back to this step.
      - Expect to spend time on this phase: it almost always takes more than half, and sometimes as much as 90%, of total analysis time.
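The preparation steps named on this slide (selecting columns, deriving variables, filtering, and merging sources) can be sketched with plain Python lists and dictionaries. The two sources, their field names, and the reference year 2015 are hypothetical; sampling rows is left out to keep the example short.

```python
customers = [  # hypothetical source 1
    {"id": 1, "name": "Ann", "signup_year": 2012, "fax": None},
    {"id": 2, "name": "Bob", "signup_year": 2014, "fax": None},
    {"id": 3, "name": "Cy", "signup_year": 2015, "fax": None},
]
orders = [  # hypothetical source 2
    {"customer_id": 1, "total": 40.0},
    {"customer_id": 1, "total": 60.0},
    {"customer_id": 2, "total": 25.0},
]

# Merge data sources: compute total spend per customer from the orders.
spend = {}
for order in orders:
    spend[order["customer_id"]] = spend.get(order["customer_id"], 0.0) + order["total"]

prepared = []
for row in customers:
    if row["id"] not in spend:  # filter: keep only customers with orders
        continue
    prepared.append({
        "id": row["id"],                            # select columns (name, fax dropped)
        "tenure_years": 2015 - row["signup_year"],  # derive a new variable
        "total_spend": spend[row["id"]],            # merged from the orders source
    })

print(prepared)
```

In practice a library such as pandas would usually handle these steps, but the shape of the work is the same: many small decisions about which rows and columns the model gets to see.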
  • 15. CRISP: DRILL DOWN
      Modeling: This is where we search for patterns in our data. These patterns winnow out unnecessary data and characterize the influence of attributes that matter. From these patterns, we can create a model that is not only descriptive, but predictive.
      - There are many different kinds of models, each looking at the data from a different perspective.
      - We may want to try different models, and different parameters within algorithms, to find our best results.
  • 16. CRISP: DRILL DOWN
      The Evaluation phase looks in two directions:
      We need to validate our model from the prior CRISP-DM step.
      - Precision, applicability, and understandability are all parts of a trade-off.
      - Understandable models giving deeper insights are often preferred over more accurate models.
      We also need to evaluate our progress towards our business goals.
      - Does this model help us meet our success criteria?
      - Does new insight here funnel back into our business understanding?
      - Should we loop through CRISP-DM again with our new information?
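Validating a model usually starts by comparing its predictions with known outcomes. The sketch below computes three common measures on a made-up set of eight churn predictions; the labels are hypothetical, and the slide's "precision" may be meant more loosely than the formal definition used here.

```python
actual    = ["churn", "stay", "churn", "stay", "stay", "churn", "stay", "stay"]
predicted = ["churn", "stay", "stay", "stay", "churn", "churn", "stay", "stay"]

pairs = list(zip(actual, predicted))
true_pos = sum(a == "churn" and p == "churn" for a, p in pairs)
false_pos = sum(a == "stay" and p == "churn" for a, p in pairs)
false_neg = sum(a == "churn" and p == "stay" for a, p in pairs)
correct = sum(a == p for a, p in pairs)

accuracy = correct / len(pairs)                # share of all predictions that were right
precision = true_pos / (true_pos + false_pos)  # of predicted churners, how many churned
recall = true_pos / (true_pos + false_neg)     # of actual churners, how many we caught

print(accuracy, precision, recall)
```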
  • 17. CRISP: DRILL DOWN
    Deployment: Once we have results that meet our goals, we need to put them into use; otherwise the effort is lost.
    • At any point in the process, we could take our results and gain new Business Understanding, creating an opportunity to cycle through the CRISP-DM model again and gain even more value from our data.
    Models age…
  • 18. MODELING: THE FUN BITS
    We want to find patterns in our data, then use these patterns to predict outcomes. How does that happen?
    By analyzing our data, we can derive a set of “rules” or a “formula” that describes some behavior.
    • Examples like “this” tended to fall into this pile. Examples like “that” tended to fall into that pile.
    Collectively, the rules we assemble are called a model. The process of finding and deriving the model is called training. The data used for training is called training data.
    Once we have established our pattern - or model - we can run similar examples through our rules and predict where they would fall. This is called model application or applying the model.
    Example: based on this customer’s profile, knowing what we know, do we expect churn or no churn? We could then take that answer and decide whether to take action in order to hold them.
    There are many different approaches used to search for patterns in data. We will see a handful of them in this session. When any approach gets developed to the point where it can be described with a
  • 19. SO LET’S GET STARTED WITH MODELING…
  • 20. WHAT IS A COYOTE?
    Your six-year-old nephew thinks that there are only five kinds of animals:
    1) Kitty
    2) Puppy
    3) Horsey
    4) Birdie
    5) Fishie
    What does he think a coyote is? Why?
  • 21. K-NEAREST NEIGHBOR
    k-Nearest Neighbor (k-NN) is a very intuitive approach:
    • To find out what something is like, see what the things closest to it are like.
    Two key questions:
    What is “near”?
    • Euclidean Distance
    • Cosine Similarity
    • Manhattan Distance
    Which neighbors? How many?
    • k many…
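    To make the two key questions concrete, here is a minimal k-NN sketch. The three distance functions follow their standard definitions; the animal measurements, the coyote's numbers, and the function names are made up for illustration and are not from the talk.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """City-block distance: sum of absolute differences per attribute."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 minus cosine similarity; compares direction, ignores magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def knn_predict(train, query, k=3, distance=euclidean):
    """Label the query by majority vote among its k nearest training examples."""
    neighbors = sorted(train, key=lambda ex: distance(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical examples: (weight in kg, shoulder height in cm) -> label.
animals = [
    ((4.0, 25.0), "kitty"),
    ((5.0, 23.0), "kitty"),
    ((3.5, 24.0), "kitty"),
    ((20.0, 55.0), "puppy"),
    ((25.0, 60.0), "puppy"),
    ((18.0, 50.0), "puppy"),
]
coyote = (14.0, 58.0)
guess = knn_predict(animals, coyote, k=3)
print(guess)
```

    Swapping `distance=manhattan` into the call changes what “near” means without touching the voting logic, which is why the choice of distance and the choice of k are separate questions.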
  • 22. WHICH DOT IS CLOSER?
    [Figure: dots plotted against scales of 10^-3 and 10^6]
    How about now?
  • 23. NORMALIZATION
    Orders of Magnitude
    • Also consider significant digits
    Range
    Z-Transform
    Leaking data: Norm is also a model
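    A short sketch of range and z-transform normalization follows. It also illustrates one reading of the “leaking data” point: the scaling parameters are themselves learned from data, so here they are fitted on the training examples only and then reused on new data. The function names and numbers are illustrative assumptions.

```python
import statistics


def fit_range(values):
    """Learn min and max from the training data."""
    return min(values), max(values)


def apply_range(values, params):
    """Rescale so the training minimum maps to 0 and the maximum to 1."""
    lo, hi = params
    return [(v - lo) / (hi - lo) for v in values]


def fit_z(values):
    """Learn mean and sample standard deviation from the training data."""
    return statistics.mean(values), statistics.stdev(values)


def apply_z(values, params):
    """Z-transform: express each value in standard deviations from the mean."""
    mean, sd = params
    return [(v - mean) / sd for v in values]


train = [2.0, 4.0, 6.0, 8.0]
test = [10.0]

# Fit on training data only, then reuse the same parameters for the test data.
range_params = fit_range(train)
z_params = fit_z(train)

train_scaled = apply_range(train, range_params)
test_scaled = apply_range(test, range_params)
train_z = apply_z(train, z_params)
test_z = apply_z(test, z_params)
print(train_scaled, test_scaled)
```

    Note that the test value lands outside the 0 to 1 range. That is expected: the scaler only knows what the training data told it.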
  • 24. K-NN IN YOUR HEAD…
    k = 1
    Train on full data set
    How accurate?
    What did we learn?
    Why?
  • 25. OVERFIT
    The purpose of modeling is to find a generalizable pattern that will tell you about new data.
    If your model fits your current data too closely, it loses general utility.
    Kaggle Titanic
    • What about “new” passengers?
  • 26. TESTING & VALIDATION
    So how do we plan for “new” data when we’re working with one set of current data?
    Hold-Out or Split validation
    Cross-Validation
    Leave One Out
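    The three validation schemes can be compared on a toy problem. The sketch below uses a 1-nearest-neighbor classifier on hypothetical one-feature data with one deliberately noisy label; the data, the systematic (unshuffled) splits, and the function names are all assumptions made for illustration.

```python
def nearest_label(train, query):
    """1-NN: copy the label of the single closest training example."""
    closest = min(train, key=lambda ex: abs(ex[0] - query))
    return closest[1]


def accuracy(train, test):
    """Fraction of test examples whose predicted label is correct."""
    hits = sum(1 for x, label in test if nearest_label(train, x) == label)
    return hits / len(test)


def leave_one_out(data):
    """Hold out each example in turn and train on all the others."""
    hits = 0
    for i, (x, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += nearest_label(rest, x) == label
    return hits / len(data)


def k_fold(data, k):
    """Cross-validation: every example is tested exactly once across k folds."""
    scores = []
    for fold in range(k):
        test = data[fold::k]
        train = [ex for i, ex in enumerate(data) if i % k != fold]
        scores.append(accuracy(train, test))
    return sum(scores) / k


# Hypothetical one-feature data with one noisy label at x = 2.5.
data = [(1.0, "a"), (2.0, "a"), (2.5, "b"), (3.0, "a"),
        (7.0, "b"), (8.0, "b"), (9.0, "b"), (10.0, "b")]

resubstitution = accuracy(data, data)        # train and test on the same data
hold_out = accuracy(data[::2], data[1::2])   # split: even rows train, odd rows test
loo = leave_one_out(data)
cv = k_fold(data, 4)
print(resubstitution, hold_out, loo, cv)
```

    With k = 1 and testing on the training data, every example finds itself as its own nearest neighbor, so the score is perfect and tells us nothing. The held-out scores are lower and more honest about how the model would treat “new” data.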
  • 27. CONFUSION MATRIX
    Performance Measures
    • Accuracy / Error
      - What is the value of knowing the ratio of the number right (or wrong) to the total?
    • Precision / Recall
      - “You have cancer...”
      - Precision: how many with positive tests actually have cancer?
      - Recall: how many with cancer tested positive?
    • Sensitivity / Specificity
      - “You have cancer...”
      - Sensitivity: how many with cancer tested positive? (see: recall)
      - Specificity: how many without cancer tested negative?
    • Here is a handy URL to know: http://www.damienfrancois.be/blog/files/modelperfcheatsheet.pdf
           +    -
     +’    A    B
     -’    C    D
  • 29. CONFUSION MATRIX, ARRANGED
                     Reality
                     +    -
    Predicted  +’    A    B
               -’    C    D
    Accuracy = (A+D) / (A+B+C+D)
    Error = (B+C) / (A+B+C+D) or 1 – ( (A+D) / (A+B+C+D) )
    Precision = A / (A+B)
    Recall = A / (A+C) = Sensitivity
    Specificity = D / (D+B)
    “You have Cancer...”
    http://www.damienfrancois.be/blog/files/modelperfcheatsheet.pdf
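    The formulas translate directly into code. In this sketch, A is predicted positive and really positive, B is predicted positive but really negative, C is predicted negative but really positive, and D is predicted negative and really negative. The screening counts are invented to show why accuracy alone can mislead when the positive class is rare.

```python
def performance(a, b, c, d):
    """Compute the slide's measures from the four confusion-matrix cells."""
    total = a + b + c + d
    return {
        "accuracy": (a + d) / total,
        "error": (b + c) / total,
        "precision": a / (a + b),
        "recall": a / (a + c),       # also called sensitivity
        "specificity": d / (d + b),
    }


# Hypothetical screening of 1,000 people, 10 of whom have cancer.
scores = performance(a=8, b=90, c=2, d=900)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

    In this made-up case accuracy is above 90%, yet fewer than one in ten positive tests is a true case. That is the question precision answers and accuracy hides.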
  • 30. CORRELATION
    Meaning:
    • Do things tend to move together?
    Range
    • To what degree?
    • Same or opposite?
    • -1 … 1
    Not meaning
    • “Correlation does not equal Causation”
    • http://www.tylervigen.com/
  • 31. LINEAR REGRESSION AND OTHER “LINES”
    y = mx + b
    • Height / Weight of Dog
    y = m_1 x_1 + m_2 x_2 + ... + m_n x_n + b
    Dependent / independent variable
    • Cigs / cancer, but not cancer vs. cigs
    SVM: Support Vector Machine
    • Line > Plane > Hyperplane
  • 32. FUNNY THING ABOUT LINES: ANSCOMBE’S QUARTET
          I             II            III            IV
       x      y      x      y      x      y      x      y
    10.0   8.04   10.0   9.14   10.0   7.46    8.0   6.58
     8.0   6.95    8.0   8.14    8.0   6.77    8.0   5.76
    13.0   7.58   13.0   8.74   13.0  12.74    8.0   7.71
     9.0   8.81    9.0   8.77    9.0   7.11    8.0   8.84
    11.0   8.33   11.0   9.26   11.0   7.81    8.0   8.47
    14.0   9.96   14.0   8.10   14.0   8.84    8.0   7.04
     6.0   7.24    6.0   6.13    6.0   6.08    8.0   5.25
     4.0   4.26    4.0   3.10    4.0   5.39   19.0  12.50
    12.0  10.84   12.0   9.13   12.0   8.15    8.0   5.56
     7.0   4.82    7.0   7.26    7.0   6.42    8.0   7.91
     5.0   5.68    5.0   4.74    5.0   5.73    8.0   6.89

    Property (in each case)        Value
    Mean of x                      9 (exact)
    Sample variance of x           11 (exact)
    Mean of y                      7.50 (to 2 places)
    Sample variance of y           4.122 or 4.127 (to 3 places)
    Correlation between x and y    0.816 (to 3 places)
    Linear regression line         y = 3.00 + 0.500x (to 2 and 3 places, respectively)
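    The summary properties can be recomputed from the four data sets. This sketch uses only the standard library and the usual least-squares formulas (slope = Sxy / Sxx, intercept = mean of y minus slope times mean of x); the variable and function names are illustrative.

```python
import statistics

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # sets I to III share the same x values
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Means, sample variances, Pearson correlation, and least-squares line."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "var_x": statistics.variance(xs),
        "mean_y": mean_y,
        "var_y": statistics.variance(ys),
        "correlation": sxy / (sxx * syy) ** 0.5,
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


summaries = {name: summarize(xs, ys) for name, (xs, ys) in quartet.items()}
for name, s in summaries.items():
    print(name, {key: round(value, 3) for key, value in s.items()})
```

    All four sets print nearly identical numbers even though their scatter plots look nothing alike, which is the point of the quartet: summary statistics and a fitted line are no substitute for looking at the data.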
  • 33. ANSCOMBE’S QUARTET
  • 34. DATA TYPES
    Numerical
    • Integer
    • Real
    • Date-time
    Nominal
    • Binominal (either / or)
    • Polynominal (categorical)
    • Corpus
    Scalar, Ordinal, Categorical
    Dummy coding
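    Dummy coding turns a nominal attribute into 0/1 numerical columns so that distance-based and line-based models can use it. The sketch below emits one column per category (often called one-hot coding); strict dummy coding drops one of those columns as a reference level. The "is_" column naming is an assumption for illustration.

```python
def dummy_code(values):
    """Replace each nominal value with a dict of 0/1 indicator columns."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]


animals = ["kitty", "puppy", "birdie", "puppy"]
coded = dummy_code(animals)
for original, row in zip(animals, coded):
    print(original, row)
```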
  • 35. NAIVE BAYES
    Bayes: Simple probabilistic counting
               Pop     Smoke
    Men        0.65    0.12    0.0780  +
                       0.88    0.5720  -
    Women      0.35    0.07    0.0245  +
                       0.93    0.3255  -
               1       1       1
    Smokers    0.1025
    P(W|+)     0.2390
    M or N/S   0.9755
    Sun, Wind, Precip > play outside
    Example contains a given word
    • What does that mean about future examples with same word (or word combo)?
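    The slide's numbers can be reproduced by straightforward multiplication and division. This sketch assumes the reading that 65% of the population are men, 12% of men smoke, and 7% of women smoke, and that the final figure is the probability of being a man or a non-smoker; the variable names are illustrative.

```python
p_men, p_women = 0.65, 0.35
p_smoke_given_men, p_smoke_given_women = 0.12, 0.07

# Joint probabilities: population share times smoking rate within the group.
p_men_and_smoke = p_men * p_smoke_given_men          # about 0.0780
p_women_and_smoke = p_women * p_smoke_given_women    # about 0.0245

# All smokers, regardless of group.
p_smoke = p_men_and_smoke + p_women_and_smoke        # about 0.1025

# Bayes: given that someone smokes, how likely is it that they are a woman?
p_women_given_smoke = p_women_and_smoke / p_smoke    # about 0.2390

# Everyone except women who smoke.
p_men_or_nonsmoker = 1.0 - p_women_and_smoke         # about 0.9755

print(round(p_smoke, 4), round(p_women_given_smoke, 4), round(p_men_or_nonsmoker, 4))
```

    Women are 35% of the population but only about 24% of smokers: learning that someone smokes shifts our estimate. A spam filter applies the same counting to words instead of smoking.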
  • 36. RULES AND TREES
    Rule Induction
    • +++++++ ------
    • ++-- + ++ -+ ---
    Decision Trees
    Random Forest
  • 37. SAMPLING
    Rows, Records, Documents, Examples
    Spreadsheets think of data in rows. They use a two-dimensional ledger (worksheets = 3-D).
    Databases use the term records (or documents) to identify the storage of one item. The display might seem linear, but the real-life metaphor captures more. Think of a medical record, a personnel file, or other such documents. These are even potentially multi-dimensional.
    Data Scientists use the term examples. Whether a research biologist, a marketer, or a political scientist, they are thinking in terms of populations – cohorts, customers, voters. Out of a given population, each individual is an example. From those examples, we find patterns.
    Linear, Shuffled, Stratified
    Kennard-Stone
    Over / Under
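    Linear, shuffled, and stratified sampling differ in how they pick examples from the population. The sketch below is one simple interpretation: linear takes rows in stored order, shuffled draws at random, and stratified draws the same fraction from each label group. The churn population, the seed, and the function names are hypothetical.

```python
import random
from collections import defaultdict


def shuffled_sample(examples, fraction, seed=0):
    """Draw a random subset without regard to labels."""
    rng = random.Random(seed)
    size = round(len(examples) * fraction)
    return rng.sample(examples, size)


def stratified_sample(examples, fraction, label_of, seed=0):
    """Draw the same fraction from every label group to preserve proportions."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for ex in examples:
        groups[label_of(ex)].append(ex)
    sample = []
    for members in groups.values():
        size = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, size))
    return sample


# Hypothetical population: 90 "no churn" examples followed by 10 "churn" examples.
population = [(i, "no churn") for i in range(90)] + [(i, "churn") for i in range(90, 100)]

linear = population[:20]                     # first 20 rows, in stored order
shuffled = shuffled_sample(population, 0.2)
stratified = stratified_sample(population, 0.2, label_of=lambda ex: ex[1])

for name, sample in [("linear", linear), ("shuffled", shuffled), ("stratified", stratified)]:
    churners = sum(1 for _, label in sample if label == "churn")
    print(name, len(sample), "examples,", churners, "churn")
```

    Because this made-up population is stored sorted by label, the linear sample contains no churners at all, while the stratified sample keeps the 90/10 balance exactly.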
  • 39. CHECKERBOARD SET
  • 40. CHECKERBOARD SET: SAMPLE 0.05
  • 41. CHECKERBOARD SET: SAMPLE 0.05 K-S
  • 42. CHECKERBOARD SET: OVER- / UNDER-SAMPLE
  • 43. FEATURE SELECTION
    In the same way that spreadsheets use 2-D columns, and databases use data fields to make up a record, each example in our population is described by some number of attributes, also called properties, variables, or features.
    Forward Selection
    Backward Elimination
    Evolutionary
  • 44. DIMENSIONALITY REDUCTION
    Ht/Wt graph
    • Food/mo (lbs)
    • Toy purchases ($)
    • Leash width (mm)
    • Property damage ($)
    • Stool volume (ml?)
    Helmets
    Clothes
  • 45. SUPERVISED LEARNING
    What does it mean?
    Target variable / feature / attribute / label
    What else could one do?
    Unsupervised learning
    AKA Classification and Clustering
    Not the same thing, but one can feed the other
  • 46. K-MEANS CLUSTERING
    Clustering modeler
    Iterative distance-based assessment
    • Start w/ Random Seeds
    • Assign each point to closest seed
    • Move seed to center of cluster
    • Lather, rinse, repeat until mean doesn’t move (or oscillates) and clusters don’t change.
    How many clusters?
    • k many
    Then what happens?
    • Could turn cluster assignments into classification labels
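    The loop described on this slide can be sketched as follows. This is a minimal version assuming Euclidean distance, seeds drawn from the data points, and a stop as soon as assignments no longer change; the points, the seed value, and the function name are illustrative. `math.dist` requires Python 3.8 or later.

```python
import math
import random


def k_means(points, k, seed=0, max_iter=100):
    """Return (centers, assignment) after iterating until clusters stop changing."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)               # start with random seeds
    assignment = None
    for _ in range(max_iter):
        # Assign each point to the closest seed.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:          # clusters did not change: stop
            break
        assignment = new_assignment
        # Move each seed to the center (mean) of its cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                           # keep the old seed if a cluster is empty
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centers, assignment


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels = k_means(points, k=2)
print(centers)
print(labels)
```

    The returned `labels` list is what the last bullet refers to: each cluster number can be attached to its examples and used as a classification label in a later supervised step.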
  • 47. OUTLIER DETECTION
    Distance
    Density
    LOF: Local Outlier Factor (localized density)
  • 48. SCALE
    Began with Big Data
    PA at scale – how are algorithms impacted?
    Memory and Calculation constraints
  • 49. NO FREE LUNCH
    No single algorithm is the “best” for all data sets
    Different algorithms are often used in different situations
    • Naïve Bayes is common in Spam filters
    • Outlier Detection is helpful with Fraud
    • Clustering works well for Recommendation engines and identifying other marketing demos