Data science workflow
Andrew Gelman
Dept of Statistics and Dept of Political Science
Columbia University, New York
PyData, New York, 28 Nov 2017
The (abridged) model in Stan
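parameters {
  real b;
  real<lower=0> sigma_a;
  real<lower=0> sigma_y;
  vector[nteams] a;   // one parameter per team
}
model {
  // prior for the team-level parameters, centered at b*prior_score
  a ~ normal(b*prior_score, sigma_a);
  // game-level outcome depends on the difference between the two teams
  sqrt_dif ~ normal(a[team1] - a[team2], sigma_y);
}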
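One way to read the abridged model is as a recipe for generating data. The sketch below, in Python with only the standard library, draws the team parameters `a` and the game outcomes `sqrt_dif` as the two sampling statements describe. The number of teams, the `prior_score` values, the game pairings, and the parameter values are all made up for illustration; none of them come from the talk.

```python
import random

def simulate_worldcup(prior_score, games, b, sigma_a, sigma_y, seed=0):
    """Draw fake data from the two sampling statements of the abridged model.

    prior_score: per-team prior scores (hypothetical values).
    games: (team1, team2) index pairs, 0-based (hypothetical schedule).
    Returns (a, sqrt_dif).
    """
    rng = random.Random(seed)
    # a ~ normal(b*prior_score, sigma_a)
    a = [rng.gauss(b * s, sigma_a) for s in prior_score]
    # sqrt_dif ~ normal(a[team1] - a[team2], sigma_y)
    sqrt_dif = [rng.gauss(a[t1] - a[t2], sigma_y) for t1, t2 in games]
    return a, sqrt_dif

# Hypothetical inputs: 4 teams with made-up prior scores, 3 games.
prior_score = [1.0, 0.3, -0.3, -1.0]
games = [(0, 1), (2, 3), (0, 3)]
a, sqrt_dif = simulate_worldcup(prior_score, games,
                                b=0.5, sigma_a=0.15, sigma_y=0.4)
print("a        =", [round(v, 2) for v in a])
print("sqrt_dif =", [round(v, 2) for v in sqrt_dif])
```

Fitting the model to data simulated this way, and checking that the known parameter values are recovered, is one form of the fake-data simulation listed among the workflow ideas at the end of the talk.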
Fit the model
Inference for Stan model: worldcup_first_try.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
        mean se_mean   sd  25%  50%  75% n_eff Rhat
b       0.46    0.00 0.09 0.40 0.46 0.52  1039 1.00
sigma_a 0.14    0.00 0.07 0.09 0.13 0.19   203 1.01
sigma_y 0.42    0.00 0.05 0.38 0.42 0.46   956 1.00
a[1]    0.35    0.00 0.13 0.27 0.36 0.44  4000 1.00
a[2]    0.39    0.00 0.12 0.31 0.38 0.46  4000 1.00
a[3]    0.43    0.01 0.15 0.33 0.42 0.52   756 1.00
a[4]    0.20    0.01 0.16 0.11 0.22 0.31   966 1.00
a[5]    0.29    0.00 0.13 0.21 0.29 0.36  4000 1.00
. . .
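The `se_mean` column is the Monte Carlo standard error of the posterior mean, which is approximately `sd / sqrt(n_eff)`. As a rough check, the Python sketch below recomputes it from the `sd` and `n_eff` values printed in the summary. Because those inputs are already rounded to two decimals, agreement is only expected at the printed precision.

```python
import math

# (sd, n_eff, printed se_mean) copied from the summary table.
rows = {
    "b":       (0.09, 1039, 0.00),
    "sigma_a": (0.07,  203, 0.00),
    "sigma_y": (0.05,  956, 0.00),
    "a[1]":    (0.13, 4000, 0.00),
    "a[3]":    (0.15,  756, 0.01),
    "a[4]":    (0.16,  966, 0.01),
}

def mcse(sd, n_eff):
    """Monte Carlo standard error of the mean: sd / sqrt(n_eff)."""
    return sd / math.sqrt(n_eff)

for name, (sd, n_eff, printed) in rows.items():
    print(f"{name:8s} recomputed {mcse(sd, n_eff):.4f}  printed {printed:.2f}")
```

The parameter with the smallest `n_eff` here, `sigma_a` at 203, is the one whose posterior summaries are least precisely estimated by the simulation draws.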
Graph the estimates
Compare to model fit without prior rankings
Compare model to predictions
After finding and fixing a bug
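Data on putts in pro golf
[Figure: scatterplot of probability of success (y-axis, 0 to 1) against distance from hole in feet (x-axis, 0 to 20). Each point is labeled with successes/attempts, listed below in the order they appear on the slide.]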
1346/1443
577/694
337/455
208/353
149/272
136/256
111/240
69/217
67/200
75/237
52/202
46/192
54/174
28/167
27/201
31/195
33/191
20/147
24/152
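The successes/attempts labels are enough to recompute the plotted proportions. The Python sketch below does so and adds a binomial standard error for each point. The slide text does not say which distance goes with which label; the assumption here is that the 19 labels correspond to distances of 2, 3, ..., 20 feet in order.

```python
import math

# successes (y) and attempts (n) copied from the labels on the slide.
y = [1346, 577, 337, 208, 149, 136, 111, 69, 67, 75,
     52, 46, 54, 28, 27, 31, 33, 20, 24]
n = [1443, 694, 455, 353, 272, 256, 240, 217, 200, 237,
     202, 192, 174, 167, 201, 195, 191, 147, 152]
# ASSUMPTION: distances of 2, 3, ..., 20 feet, one per label in order.
x = list(range(2, 21))

def proportion_and_se(successes, attempts):
    """Empirical success rate and its binomial standard error sqrt(p(1-p)/n)."""
    p = successes / attempts
    return p, math.sqrt(p * (1 - p) / attempts)

for xi, yi, ni in zip(x, y, n):
    p, se = proportion_and_se(yi, ni)
    print(f"{xi:2d} ft: {yi:4d}/{ni:4d} = {p:.3f} +/- {se:.3f}")
```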
What's the probability of making a golf putt?
[Figure: the putting data (probability of success against distance from hole in feet) with a fitted curve: logistic regression, a = 2.2, b = −0.3.]
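The logistic regression curve can be traced from the two coefficients on the slide. The Python sketch below assumes the usual parameterization, Pr(success) = logit^-1(a + b x) with x the distance in feet. The slide gives only the coefficient values, so the parameterization is an assumption, and the coefficients are rounded to one decimal.

```python
import math

def inv_logit(z):
    """Inverse logit (logistic) function."""
    return 1.0 / (1.0 + math.exp(-z))

def p_logistic(x, a=2.2, b=-0.3):
    """Success probability at distance x (feet), using the slide's rounded estimates."""
    return inv_logit(a + b * x)

for x in (2, 5, 10, 20):
    print(f"{x:2d} ft: {p_logistic(x):.3f}")
```

One property of this functional form is that the predicted probability at distance zero is inv_logit(2.2), about 0.90, rather than 1.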
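Geometry-based model
[Figure: diagram of a ball of radius r at distance x from a hole of radius R, and a normal curve with its axis marked at −2σ, 0, 2σ.]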
Stan code
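data {
  int J;        // number of distances
  int n[J];     // attempts at each distance
  real x[J];    // distance from hole (feet)
  int y[J];     // successes at each distance
  real r;       // ball radius (feet)
  real R;       // hole radius (feet)
}
parameters {
  real<lower=0> sigma;   // sd of the shot angle (radians)
}
model {
  real p[J];
  // x is an array, so the formula is applied one distance at a time
  for (j in 1:J)
    p[j] = 2*Phi(asin((R-r)/x[j]) / sigma) - 1;
  y ~ binomial(n, p);
}
generated quantities {
  // assumed definition; the fit summary reports a quantity with this name
  real sigma_degrees;
  sigma_degrees = (180/pi())*sigma;
}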
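The expression in the model block follows from the geometry. The ball goes in if the angle of the shot is within asin((R − r)/x) of the line to the center of the hole. If that angle is modeled as normal with mean 0 and standard deviation sigma, then Pr(success) = 2Φ(asin((R − r)/x)/σ) − 1. Below is a minimal Python sketch of that formula. It uses the r and R values from the R script and, as an illustrative choice, the rounded estimate sigma_degrees = 1.53 from the fit summary.

```python
import math

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def p_geometry(x, sigma, r=(1.68 / 2) / 12, R=(4.25 / 2) / 12):
    """Success probability at distance x (feet) for angle sd sigma (radians).

    Requires x >= R - r so that the asin argument is at most 1.
    """
    threshold = math.asin((R - r) / x)   # largest angular error that still goes in
    return 2.0 * Phi(threshold / sigma) - 1.0

sigma = math.radians(1.53)  # rounded posterior mean, converted from degrees
for x in (2, 5, 10, 20):
    print(f"{x:2d} ft: {p_geometry(x, sigma):.3f}")
```

Unlike a two-coefficient regression, this curve has a single free parameter, and its shape is fixed by the sizes of the ball and the hole.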
Fit the model
golf <- read.table("golf.txt", header=TRUE, skip=2)
x <- golf$x
y <- golf$y
n <- golf$n
J <- length(y)
r <- (1.68/2)/12
R <- (4.25/2)/12
fit1 <- stan("golf1.stan")
Check convergence
> print(fit1)
Inference for Stan model: golf1.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
              mean se_mean   sd  25%  50%  75% n_eff Rhat
sigma         0.03    0.00 0.00 0.03 0.03 0.03  1692    1
sigma_degrees 1.53    0.00 0.02 1.51 1.53 1.54  1692    1
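The two rows are presumably the same quantity in different units: `sigma` in radians and `sigma_degrees` in degrees, so that sigma_degrees = sigma × 180/π. The printed `sigma` of 0.03 is rounded too coarsely to be useful on its own. A quick Python check recovers more digits from the degrees row.

```python
import math

sigma_degrees = 1.53                             # posterior mean from the summary
sigma_radians = math.radians(sigma_degrees)
print(f"sigma = {sigma_radians:.4f} radians")    # rounds to the printed 0.03

# The same conversion applied to the posterior sd of sigma_degrees (0.02).
sd_radians = math.radians(0.02)
print(f"sd    = {sd_radians:.5f} radians")       # rounds to the printed 0.00
```

On this reading, the "sigma = 1.5" shown on the plots of the geometry-based model is in degrees.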
What's the probability of making a golf putt?
[Figure: the putting data (probability of success against distance from hole in feet) with a fitted curve: geometry-based model, sigma = 1.5.]
Two models fit to the golf putting data
[Figure: the putting data (probability of success against distance from hole in feet) with two fitted curves: logistic regression, a = 2.2, b = −0.3; and geometry-based model, sigma = 1.5.]
Birthdays!
The published graphs show data from 30 days in the year
[Figure: relative number of births, time axis 1970 to 1988, in four panels (y-axis 60 to 120):
  Trends: slow trend, fast non-periodic component, mean
  Day of week effect (Mon to Sun): 1972, 1976, 1980, 1984, 1988
  Seasonal effect (Jan to Dec): 1972, 1976, 1980, 1984, 1988
  Day of year effect (Jan to Dec), labeled days: New year, Valentine's day, Leap day, April 1st, Memorial day, Independence day, Labor day, Halloween, Thanksgiving, Christmas]
[Figure: relative number of births, time axis 2000 to 2014, in four panels (y-axis 60 to 120):
  Day of week effect (Mon to Sun): 2002, 2006, 2010, 2014
  Seasonal effect (Jan to Dec): 2002, 2006, 2010, 2014
  Day of year effect (Jan to Dec), labeled days: New year, Valentine's day, Leap day, April 1st, Memorial day, Independence day, Labor day, 9/11, Halloween, Thanksgiving, Christmas
  Trends: slow trend, fast non-periodic component, mean]
[Figure: day of year effect (Jan to Dec, y-axis 50 to 120), labeled days: New year, Valentine's day, Leap day, April 1st, Memorial day, Independence day, Labor day, 9/11, Halloween, Thanksgiving, Christmas, 13th day of month]
Xbox estimates, adjusting for demographics
Xbox estimates, adjusting for demographics and partisanship
Data from 2016
Some ideas in data science workflow
Data and information
Replication
Fake-data simulation (or statistical theory)
Comparing predictions to data
The network of models
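To make the fake-data simulation idea concrete, here is a minimal Python sketch applied to the geometry-based putting model: pick a known sigma, simulate successes from the model, then check that a fit recovers it. The true value, the distances, the attempt counts, and the grid-search maximum-likelihood fit (used here in place of Stan to keep the example dependency-free) are all illustrative choices, not the procedure from the talk.

```python
import math
import random

R_MINUS_R = (4.25 / 2 - 1.68 / 2) / 12   # hole radius minus ball radius, in feet

def p_success(x, sigma):
    """Geometry-based success probability at distance x (feet), sigma in radians."""
    z = math.asin(R_MINUS_R / x) / sigma
    return math.erf(z / math.sqrt(2.0))   # equals 2*Phi(z) - 1

def simulate(xs, ns, sigma, rng):
    """Fake data: number of successes out of n attempts at each distance."""
    return [sum(rng.random() < p_success(x, sigma) for _ in range(n))
            for x, n in zip(xs, ns)]

def log_lik(sigma, xs, ns, ys):
    """Binomial log-likelihood, dropping the constant binomial coefficient."""
    total = 0.0
    for x, n, y in zip(xs, ns, ys):
        # Clamp away from 0 and 1 so the logs stay finite.
        p = min(max(p_success(x, sigma), 1e-12), 1 - 1e-12)
        total += y * math.log(p) + (n - y) * math.log(1 - p)
    return total

def fit_grid(xs, ns, ys, lo=0.005, hi=0.10, steps=2000):
    """Maximum-likelihood sigma by brute-force search over an even grid."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return max(grid, key=lambda s: log_lik(s, xs, ns, ys))

rng = random.Random(2017)
xs = list(range(2, 21))          # hypothetical distances in feet
ns = [300] * len(xs)             # hypothetical attempts per distance
sigma_true = 0.0267              # hypothetical "true" value, about 1.53 degrees
ys = simulate(xs, ns, sigma_true, rng)
sigma_hat = fit_grid(xs, ns, ys)
print(f"true sigma = {sigma_true:.4f}, recovered = {sigma_hat:.4f}")
```

If the recovered value were far from the one used to simulate, that would point to a problem in the model code or the fitting procedure before any real data are involved.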

Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 

Data Science Workflow

  • 1. Data science workflow
       Andrew Gelman
       Dept of Statistics and Dept of Political Science, Columbia University, New York
       PyData, New York, 28 Nov 2017
  • 2.
  • 3.
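  • 4. The (abridged) model in Stan
       parameters {
         real b;
         real<lower=0> sigma_a;
         real<lower=0> sigma_y;
         vector[nteams] a;   // one latent ability per team
       }
       model {
         a ~ normal(b*prior_score, sigma_a);   // abilities centered on the scaled prior score
         sqrt_dif ~ normal(a[team1] - a[team2], sigma_y);   // game outcome given the ability difference
       }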
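The model block can also be read as a recipe for generating data. The R sketch below simulates team abilities and game outcomes from it. The number of teams, the standardized `prior_score`, the random schedule, and the parameter values are all assumptions made for illustration; none of them comes from the talk's data files.

```r
# Fake-data simulation from the abridged World Cup model (illustrative values only)
set.seed(123)

nteams <- 32                                             # assumed number of teams
prior_score <- as.vector(scale(rev(seq_len(nteams))))    # assumed: standardized prior ranking score
b <- 0.46; sigma_a <- 0.14; sigma_y <- 0.42              # assumed "true" parameter values

# a ~ normal(b * prior_score, sigma_a)
a <- rnorm(nteams, mean = b * prior_score, sd = sigma_a)

# Assumed schedule: 64 games, each between two distinct randomly chosen teams
ngames <- 64
pairs <- t(replicate(ngames, sample(nteams, 2)))
team1 <- pairs[, 1]
team2 <- pairs[, 2]

# sqrt_dif ~ normal(a[team1] - a[team2], sigma_y)
sqrt_dif <- rnorm(ngames, mean = a[team1] - a[team2], sd = sigma_y)

sim <- data.frame(team1, team2, sqrt_dif)
head(sim)
```

Fitting the Stan model to `sim` and checking that the known values of `b`, `sigma_a`, and `sigma_y` are recovered is one way to test the code before trusting it on real data.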
  • 5. Fit the model
       Inference for Stan model: worldcup_first_try.
       4 chains, each with iter=2000; warmup=1000; thin=1;
       post-warmup draws per chain=1000, total post-warmup draws=4000.

                mean se_mean   sd  25%  50%  75% n_eff Rhat
       b        0.46    0.00 0.09 0.40 0.46 0.52  1039 1.00
       sigma_a  0.14    0.00 0.07 0.09 0.13 0.19   203 1.01
       sigma_y  0.42    0.00 0.05 0.38 0.42 0.46   956 1.00
       a[1]     0.35    0.00 0.13 0.27 0.36 0.44  4000 1.00
       a[2]     0.39    0.00 0.12 0.31 0.38 0.46  4000 1.00
       a[3]     0.43    0.01 0.15 0.33 0.42 0.52   756 1.00
       a[4]     0.20    0.01 0.16 0.11 0.22 0.31   966 1.00
       a[5]     0.29    0.00 0.13 0.21 0.29 0.36  4000 1.00
       . . .
  • 7. Compare to model fit without prior rankings
  • 8. Compare model to predictions
  • 9. After finding and fixing a bug
  • 10. Data on putts in pro golf
        [Figure: scatterplot of 19 points. x-axis: Distance from hole (feet), 0 to 20; y-axis: Probability of success, 0 to 1.]
        Point labels (successes/attempts): 1346/1443, 577/694, 337/455, 208/353, 149/272, 136/256, 111/240, 69/217, 67/200, 75/237, 52/202, 46/192, 54/174, 28/167, 27/201, 31/195, 33/191, 20/147, 24/152
  • 11. What's the probability of making a golf putt?
        [Figure: the golf data with a fitted curve. x-axis: Distance from hole (feet), 0 to 20; y-axis: Probability of success, 0 to 1.]
        Logistic regression, a = 2.2, b = −0.3
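The curve on this slide is the inverse logit of a + b·x with the rounded estimates a = 2.2 and b = −0.3. A minimal R sketch of that curve follows; the distances at which it is evaluated are chosen only for illustration.

```r
# Logistic regression curve using the rounded estimates reported on the slide
a <- 2.2
b <- -0.3

# Pr(success) = invlogit(a + b * x); plogis() is the inverse logit in base R
p_logistic <- function(x) plogis(a + b * x)

x <- c(2, 5, 10, 20)   # example distances in feet (chosen for illustration)
round(p_logistic(x), 2)
```

With these coefficients the predicted probability crosses 0.5 where a + b·x = 0, that is at x = 2.2/0.3, a little over 7 feet.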
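  • 13. Stan code
        data {
          int J;          // number of distances
          int n[J];       // attempts at each distance
          real x[J];      // distance from hole (feet)
          int y[J];       // successes at each distance
          real r;         // ball radius (feet)
          real R;         // hole radius (feet)
        }
        parameters {
          real<lower=0> sigma;   // sd of the angular error (radians)
        }
        model {
          real p[J];
          for (j in 1:J)         // compute the probability one distance at a time
            p[j] = 2*Phi(asin((R-r)/x[j]) / sigma) - 1;
          y ~ binomial(n, p);
        }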
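  • 14. Fit the model
        library("rstan")
        golf <- read.table("golf.txt", header=TRUE, skip=2)
        x <- golf$x
        y <- golf$y
        n <- golf$n
        J <- length(y)
        r <- (1.68/2)/12   # ball radius in feet (ball diameter 1.68 inches)
        R <- (4.25/2)/12   # hole radius in feet (hole diameter 4.25 inches)
        fit1 <- stan("golf1.stan")   # data objects are picked up from the workspace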
  • 15. Check convergence
        > print(fit1)
        Inference for Stan model: golf1.
        4 chains, each with iter=2000; warmup=1000; thin=1;
        post-warmup draws per chain=1000, total post-warmup draws=4000.

                      mean se_mean   sd  25%  50%  75% n_eff Rhat
        sigma         0.03    0.00 0.00 0.03 0.03 0.03  1692    1
        sigma_degrees 1.53    0.00 0.02 1.51 1.53 1.54  1692    1
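The two rows of the printed fit are the same quantity in different units: `sigma` is in radians and `sigma_degrees` is `sigma * 180 / pi`. The R sketch below converts the posterior mean of 1.53 degrees back to radians and evaluates the geometry-based success probability from the Stan code, using base R's `pnorm()` in place of Stan's `Phi()`. The function name `p_geometry` and the example distances are assumptions introduced here for illustration.

```r
# Ball and hole radii in feet, as defined on slide 14
r <- (1.68 / 2) / 12
R <- (4.25 / 2) / 12

# Posterior mean of sigma_degrees from the printed fit, converted to radians
sigma_degrees <- 1.53
sigma <- sigma_degrees * pi / 180

# Success probability under the geometry-based model:
# the putt goes in when the angular error is smaller than asin((R - r) / x)
p_geometry <- function(x, sigma) 2 * pnorm(asin((R - r) / x) / sigma) - 1

x <- c(2, 5, 10, 20)   # example distances in feet (chosen for illustration)
round(sigma, 3)
round(p_geometry(x, sigma), 2)
```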
  • 16. What's the probability of making a golf putt?
        [Figure: the golf data with a fitted curve. x-axis: Distance from hole (feet), 0 to 20; y-axis: Probability of success, 0 to 1.]
        Geometry-based model, sigma = 1.5
  • 17. Two models fit to the golf putting data
        [Figure: the golf data with both fitted curves. x-axis: Distance from hole (feet), 0 to 20; y-axis: Probability of success, 0 to 1.]
        Logistic regression, a = 2.2, b = −0.3
        Geometry-based model, sigma = 1.5
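The visual comparison on this slide can be approximated numerically. The R sketch below evaluates both curves at the observed distances and computes a root mean squared error against the observed success rates. The counts are copied from the point labels on slide 10; the distances `x = 2:20` feet are an assumption (the transcript lists only the counts), and unweighted RMSE is a simple summary chosen here, not a comparison made in the talk.

```r
# Observed data: successes (y) out of attempts (n), from the slide-10 point labels
y <- c(1346, 577, 337, 208, 149, 136, 111, 69, 67, 75,
       52, 46, 54, 28, 27, 31, 33, 20, 24)
n <- c(1443, 694, 455, 353, 272, 256, 240, 217, 200, 237,
       202, 192, 174, 167, 201, 195, 191, 147, 152)
x <- 2:20   # assumed distances in feet, one per point

p_obs <- y / n

# Logistic regression with the rounded estimates shown in the legend
p_logistic <- plogis(2.2 - 0.3 * x)

# Geometry-based model with sigma = 1.5 degrees, as shown in the legend
r <- (1.68 / 2) / 12
R <- (4.25 / 2) / 12
sigma <- 1.5 * pi / 180
p_geometry <- 2 * pnorm(asin((R - r) / x) / sigma) - 1

# Root mean squared error of each curve against the observed proportions
rmse <- function(p) sqrt(mean((p - p_obs)^2))
c(logistic = rmse(p_logistic), geometry = rmse(p_geometry))
```

Note that the geometry-based model reaches its fit with one free parameter, while the logistic regression uses two.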
  • 19. The published graphs show data from 30 days in the year
  • 20.
  • 21.
  • 22. [Figure: Relative Number of Births, 1970 to 1988, in four panels, each on a scale of 60 to 120.]
        Trends: slow trend, fast non-periodic component, mean
        Day of week effect (Mon to Sun); years 1972, 1976, 1980, 1984, 1988
        Seasonal effect (Jan to Dec); years 1972, 1976, 1980, 1984, 1988
        Day of year effect (Jan to Dec); labeled days: New year, Valentine's day, Leap day, April 1st, Memorial day, Independence day, Labor day, Halloween, Thanksgiving, Christmas
  • 23. [Figure: Relative Number of Births, 2000 to 2014, in four panels, each on a scale of 60 to 120.]
        Day of week effect (Mon to Sun); years 2002, 2006, 2010, 2014
        Seasonal effect (Jan to Dec); years 2002, 2006, 2010, 2014
        Day of year effect (Jan to Dec); labeled days: New year, Valentine's day, Leap day, April 1st, Memorial day, Independence day, Labor day, 9/11, Halloween, Thanksgiving, Christmas
        Trends: slow trend, fast non-periodic component, mean
  • 24. [Figure: Day of year effect (Jan to Dec), on a scale of 50 to 120.]
        Labeled days: New year, Valentine's day, Leap day, April 1st, Memorial day, Independence day, Labor day, 9/11, Halloween, Thanksgiving, Christmas, 13th day of month
  • 25.
  • 26.
  • 27.
  • 28.
  • 29.
  • 30. Xbox estimates, adjusting for demographics
  • 31. Xbox estimates, adjusting for demographics and partisanship
  • 33. Some ideas in data science workflow
          • Data and information
          • Replication
          • Fake-data simulation (or statistical theory)
          • Comparing predictions to data
          • The network of models
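Fake-data simulation can be sketched with the golf model from earlier in the deck: pick a known `sigma`, simulate putts from the geometry-based model, fit the model to the simulated data, and check that the known value is recovered. To keep the sketch runnable in base R, it swaps Stan for a maximum-likelihood fit with `optimize()`, which is not the method used in the talk. The true `sigma`, the distances, the number of attempts per distance, and the search interval are all assumptions.

```r
set.seed(2017)

# Ball and hole radii in feet, as on slide 14
r <- (1.68 / 2) / 12
R <- (4.25 / 2) / 12
p_geometry <- function(x, sigma) 2 * pnorm(asin((R - r) / x) / sigma) - 1

# Assumed simulation settings (not from the talk)
x <- 2:20                     # distances in feet
n <- rep(200, length(x))      # attempts at each distance
sigma_true <- 0.03            # known "true" angular sd, in radians

# Step 1: simulate fake data from the model with the known parameter
y_fake <- rbinom(length(x), size = n, prob = p_geometry(x, sigma_true))

# Step 2: fit the same model to the fake data (maximum likelihood, not Stan)
neg_log_lik <- function(sigma) {
  -sum(dbinom(y_fake, size = n, prob = p_geometry(x, sigma), log = TRUE))
}
fit <- optimize(neg_log_lik, interval = c(0.01, 0.2))

# Step 3: compare the estimate with the value used to generate the data
c(true = sigma_true, estimate = fit$minimum)
```

If the estimate lands far from `sigma_true`, the problem is in the model code or the fitting procedure rather than in the data, which is the point of running the check before fitting real data.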