4. Data Science Pipeline
• Analytic Data
• Analytic Code
• Documentation
• Distribution
• Elements of Reproducible Research
Report Writing for Data Science in R, Roger D. Peng, 2016
5. Epicycle of Analysis
1. Stating and refining the question
2. Exploring the data
3. Building formal statistical models
4. Interpreting the results
5. Communicating the results
The Art of Data Science, A Guide for Anyone Who Works with Data, Roger D. Peng and Elizabeth Matsui, 2016
6.
• Descriptive: summarize the measurements in a single data set without further interpretation
• Exploratory: search for discoveries, trends, correlations, or relationships between multiple variables to generate ideas or hypotheses
• Inferential: quantify whether an observed pattern will likely hold beyond the data set in hand
• Predictive: use a subset of measurements (the features) to predict another measurement (the outcome)
• Causal: determine what happens to one measurement if you make another measurement change
• Deterministic: changing one measurement always and exclusively leads to a specific, deterministic behavior in another
The Elements of Data Analytic Style, A guide for people who want to analyze data, Jeff Leek, 2015
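A minimal sketch of how the first, second, and fourth question types differ in practice. The data values and the helper names (`mean`, `pearson`, `fit_line`) are made up for illustration and do not come from the slides or the book; a single-feature least-squares line stands in for "predictive" here.

```python
from math import sqrt

def mean(xs):
    """Descriptive: one-number summary of a single variable."""
    return sum(xs) / len(xs)

def pearson(xs, ys):
    """Exploratory: strength of the linear relationship between two variables."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def fit_line(xs, ys):
    """Predictive: least-squares line mapping a feature to an outcome."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical measurements (not real data): distance vs. throughput.
distance_m = [10, 20, 30, 40, 50]
rate_mbps = [9.0, 8.0, 7.0, 6.0, 5.0]

print("descriptive  mean rate:", mean(rate_mbps))
print("exploratory  correlation:", pearson(distance_m, rate_mbps))
slope, intercept = fit_line(distance_m, rate_mbps)
print("predictive   rate at 60 m:", slope * 60 + intercept)
```

Inferential, causal, and deterministic questions need more than this: a model of sampling uncertainty, a controlled intervention, or a known physical law, respectively.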
8. Why use EDA - Summary
NIST:
• Maximize insight into a data set
• Uncover underlying structure
• Extract important variables
• Detect outliers and anomalies
• Test underlying assumptions
• Develop parsimonious models
• Determine optimal factor settings
JH University:
• Show comparisons
• Show causality, mechanism, explanation
• Show multivariate data
• Integrate multiple modes of evidence
• Describe and document the evidence
• Content is king
9. Answer to initial questions
What is a typical value for a certain feature?
What is the uncertainty for a typical value of a feature?
What is a good distributional fit for a feature?
What is the percentile distribution?
Does modifying one variable have an effect on another variable?
Does a factor have an effect on performance metrics?
What are the most important factors?
What is the best function for relating a response variable to other variables?
What are the best settings for factors (i.e. levels)?
Can we separate signal from noise?
Can we extract any structure from multivariate data?
Does the data have outliers?
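Several of these questions can be answered with a few lines of code. The sketch below, using only the Python standard library, covers four of them: typical value (median), its uncertainty (a percentile bootstrap interval), the percentile distribution (deciles), and outliers (the 1.5 × IQR rule). The sample values and the function names `bootstrap_ci` and `iqr_outliers` are assumptions made for illustration; the slides do not prescribe these particular methods.

```python
import random
import statistics

def bootstrap_ci(values, stat=statistics.median, n_resamples=2000,
                 alpha=0.05, seed=0):
    """Percentile bootstrap interval for `stat` (uncertainty of a typical value)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(stat(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo = estimates[int((alpha / 2) * n_resamples)]
    hi = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

def iqr_outliers(values, k=1.5):
    """Values farther than k * IQR outside the first or third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]

# Hypothetical throughput samples in Mbps (made up, with one suspicious value).
rates = [5.1, 4.8, 5.3, 5.0, 4.9, 5.2, 5.0, 4.7, 5.1, 12.0]

print("typical value (median):", statistics.median(rates))
print("95% bootstrap interval:", bootstrap_ci(rates))
print("deciles:", statistics.quantiles(rates, n=10))
print("outliers:", iqr_outliers(rates))
```

The median is used instead of the mean because a single extreme value, such as the 12.0 above, shifts the mean far more than the median.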
12. Practical Steps
• Before performing any measurements or simulations
• Identify
• Performance metrics
• Performance factors and levels
• Caution: sometimes you have to guess the ranges for the levels
• Use an educated guess
• Don't run tons of simulations / experiments (as previously discussed)
• Plot quick and dirty graphs
• No need for titles, labels
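One way to see why factors and levels should be fixed before running anything is to enumerate the experiment plan up front. The sketch below builds a full factorial design with `itertools.product`; the factor names, the levels, and the helper `full_factorial` are hypothetical choices for illustration, not a plan taken from the slides.

```python
from itertools import product

def full_factorial(factors):
    """Return one dict per run, covering every combination of factor levels."""
    names = list(factors)
    return [dict(zip(names, combo))
            for combo in product(*(factors[name] for name in names))]

# Hypothetical factors and levels (educated guesses for the ranges).
factors = {
    "distance_m": [50, 100],
    "ber": [4, 5, 6, 8],
    "users_max_rate_mbps": [1.6, 7.0],  # only the two extremes, to keep runs low
}

runs = full_factorial(factors)
print(len(runs))   # 2 * 4 * 2 = 16 runs
print(runs[0])
```

The run count is the product of the level counts, so each extra factor or level multiplies the cost. Starting with a few well-chosen levels keeps the number of simulations manageable.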
13. Some examples of EDA Graphs - WiFi Data (simulated)
Features (Observation Variables):
• "Vendor" - factor / levels: LinkSys, …
• "Model" - factor / levels: GST200, …
• "Users_Max_Rate" - factor (background traffic) / levels: 1.6, 1.8, …, 7.0 Mbps
• "Year" - factor / levels: 1999, 2008
• "BER" - factor / levels: 4, 5, 6, and 8
• "Type" - factor (type of user) / levels: 4, f, r
• "Rate" - performance metric (Mbps)
• "Distance" - factor (distance from the AP) / levels: 50, 100 m
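A first look at such a data set is often a per-level summary of the performance metric. The sketch below generates a small synthetic table with the "Distance", "BER", and "Rate" columns and averages Rate per level of a factor. The generator `simulate_wifi`, the helper `mean_by_level`, and the effect sizes are invented for illustration; they are not the data set behind the slides.

```python
import random
from collections import defaultdict
from statistics import mean

def simulate_wifi(n=200, seed=1):
    """Generate made-up rows; the distance effect on Rate is an assumption."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        distance = rng.choice([50, 100])
        ber = rng.choice([4, 5, 6, 8])
        base = 6.0 if distance == 50 else 4.0   # invented effect size (Mbps)
        rate = max(0.0, rng.gauss(base, 0.5))   # noisy, never negative
        rows.append({"Distance": distance, "BER": ber, "Rate": rate})
    return rows

def mean_by_level(rows, factor, metric="Rate"):
    """Average the metric separately for each level of the given factor."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[factor]].append(row[metric])
    return {level: mean(vals) for level, vals in sorted(groups.items())}

rows = simulate_wifi()
print("mean Rate by Distance:", mean_by_level(rows, "Distance"))
print("mean Rate by BER:", mean_by_level(rows, "BER"))
```

In this synthetic table only Distance was given an effect, so its level means separate clearly while the BER means stay close together. That contrast is the pattern a quick EDA graph is meant to reveal.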
27. References
• NIST/SEMATECH Engineering Statistics Handbook (online)
• Report Writing for Data Science in R, Roger D. Peng, 2016
• The Art of Data Science, A Guide for Anyone Who Works with Data, Roger D. Peng and Elizabeth Matsui, 2016
• The Elements of Data Analytic Style, A guide for people who want to analyze data, Jeff Leek, 2015
Editor's Notes
Left figure: Report Writing for Data Science in R, Roger D. Peng, 2016
Left figure: The Art of Data Science, A Guide for Anyone Who Works with Data, Roger D. Peng and Elizabeth Matsui, 2016
Figure and Text: The Elements of Data Analytic Style, A guide for people who want to analyze data, Jeff Leek, 2015