[PositConf 2023] How Data Scientists Broke A/B Testing (and How We Can Fix It)

Carl Vogel, Data Scientist
How Data Scientists Broke A/B Testing
(and how we can fix it)
Questions?
pos.it/slido-A
A Completely
True Story
Launch on
Neutral
(But thanks anyway)
Existential Dread
(Get used to it)
A real PM
“If it’s something we really believe
in, I’ll launch on a flat result … if
it’s part of a broader strategy.”
“My features are hard as shit to build,
but easy to tweak, so I’m not always
worried about statistical significance.”
Another real PM
Not just NHST
Features aren’t IID
Path dependencies in
feature roadmaps
We develop experiences by
building up features over
time and it’s helpful to
launch them incrementally
MDE is basically zero
Feature costs are nearly all
sunk before the test
Any lift pays off
Not just NHST
Risk is mismeasured
Decision makers don’t
think about Type I and II
error rates, per se
They just want to make
more money than they lose
Can I make good decisions about small to moderate effects quickly?
You can’t make reliable inferences about small to moderate effects quickly.
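To put rough numbers on that claim, here is a minimal sketch of the textbook sample-size approximation for a two-sided, two-proportion z-test. The function name, the 10% baseline rate, and the lift values are hypothetical, and the formula uses the baseline variance for both arms, which is only an approximation.

```python
from math import ceil
from statistics import NormalDist

def users_per_arm(baseline_rate, absolute_lift, alpha=0.05, power=0.8):
    """Approximate users needed per arm to detect `absolute_lift`.

    Assumes equal arm sizes and approximates both arms' variance
    with the baseline conversion rate.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = 2 * baseline_rate * (1 - baseline_rate)
    return ceil((z_alpha + z_power) ** 2 * variance / absolute_lift ** 2)

# Hypothetical 10% baseline: shrinking the lift 10x costs about 100x the sample.
for lift in (0.01, 0.005, 0.001):
    print(f"lift={lift:.3f}  users per arm={users_per_arm(0.10, lift):,}")
```

Required sample size scales with the inverse square of the effect, so "small" and "quickly" pull in opposite directions.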
Did they misuse the tool?
Or did we hand them the wrong one?
Non-Inferiority
Designs
Non-inferiority designs
Let’s try not to wreck the place
[Figure: Superiority vs. Non-Inferiority]
Non-inferiority designs
Let’s try not to wreck the place
• Inferiority margins (Δ) prompt us to ask:
• How much do we believe in this feature?
• How quickly will we improve on it?
• Stakeholders can give meaningful answers to these questions
• Compare to MDE/minimal lift, which is often made up
• Avoid meaningless minimum effect estimates
• Can power against a “no effect” alternative
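As an illustration of the idea, here is a minimal sketch of a one-sided non-inferiority z-test on conversion rates. The null hypothesis is that variant B is worse than A by at least the margin Δ; rejecting it supports launching. The function name, the counts, and the 1-point margin are hypothetical, and the unpooled Wald standard error is a simplification.

```python
from math import sqrt
from statistics import NormalDist

def non_inferiority_test(conv_a, n_a, conv_b, n_b, margin, alpha=0.05):
    """Test H0: p_b - p_a <= -margin against H1: p_b - p_a > -margin."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    diff = p_b - p_a
    # Unpooled (Wald) standard error of the difference in proportions.
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = (diff + margin) / se
    p_value = 1 - NormalDist().cdf(z)
    return {"diff": diff, "z": z, "p_value": p_value,
            "non_inferior": p_value < alpha}

# Hypothetical flat result: B converts 9.9% vs. A's 10.0%.
flat = dict(conv_a=1_000, n_a=10_000, conv_b=990, n_b=10_000)
with_margin = non_inferiority_test(**flat, margin=0.01)
no_margin = non_inferiority_test(**flat, margin=0.0)  # plain superiority test
print(with_margin["non_inferior"], no_margin["non_inferior"])
```

With `margin=0.0` the same function reduces to a one-sided superiority test, so the two calls show how a flat result can fail superiority yet still clear a margin the stakeholders agreed to.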
What’s
the rush?
The costs of long experiments
Time is money, folks
• Opportunity cost of time:
• Experimental features live on a roadmap; waiting for launch decisions delays development of subsequent features
• Opportunity cost of sampling:
• As long as the experiment runs, many users aren’t getting the best variant
• Maintenance costs:
• More experiments running means more complexity in the codebase, more effort, etc.
Value of
Information
Designs
When is data worth it?
Good things are worth waiting for
• Waiting is costly, but data is valuable.
• We should keep going as long as the value of more data exceeds the cost of more time
• Quantify our impatience as part of test design
Expected Value vs. Cost of Data
[Figure: dollar value ($0 to $80,000) against Test Length (0 to 60); series: Exp. Value, Cost, Net Exp. Value]
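A curve like this can be sketched under simple assumptions: a Normal prior on the lift, a Normal sampling model, a launch-or-hold decision whose payoff is linear in the lift, and a fixed cost per day of testing. Under those assumptions the expected value of the data has a closed form. Every name and number below is hypothetical and is not taken from the talk.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def expected_positive_part(mean, sd):
    """E[max(0, X)] for X ~ Normal(mean, sd)."""
    if sd == 0:
        return max(0.0, mean)
    z = mean / sd
    return mean * STD_NORMAL.cdf(z) + sd * STD_NORMAL.pdf(z)

def evsi(days, users_per_day, prior_mean, prior_sd, noise_sd, value_per_unit_lift):
    """Expected value of the data from `days` of testing (launch vs. hold)."""
    n = days * users_per_day
    if n == 0:
        return 0.0
    # Assumption: the lift estimate from n users has variance noise_sd**2 / n.
    sampling_var = noise_sd ** 2 / n
    # Before seeing data, the future posterior mean is
    # Normal(prior_mean, preposterior_sd).
    preposterior_sd = prior_sd ** 2 / sqrt(prior_sd ** 2 + sampling_var)
    value_with_data = expected_positive_part(prior_mean, preposterior_sd)
    value_without_data = max(0.0, prior_mean)
    return value_per_unit_lift * (value_with_data - value_without_data)

# Hypothetical inputs: prior centered on no lift, $500 per day of waiting.
params = dict(users_per_day=2_000, prior_mean=0.0, prior_sd=0.01,
              noise_sd=0.6, value_per_unit_lift=5_000_000)
cost_per_day = 500

net_value = {d: evsi(d, **params) - cost_per_day * d for d in range(0, 61)}
best_day = max(net_value, key=net_value.get)
print(best_day, round(net_value[best_day]))
```

The value of data rises with diminishing returns while cost grows linearly, so the net curve peaks at an interior test length. That peak is one way to quantify impatience at design time.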
Why is data valuable?
How dumb am I, in dollars?
• Before we have data, our range of potential lifts is wide
• Our best guess could be way off; we could make a big mistake
• Observing data narrows the range; even if our new guess is wrong, it won’t be wrong by as much.
• If the value of being less wrong (in expectation) exceeds the cost of waiting for the data, LFG!
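One way to put "how dumb am I, in dollars?" into numbers is a small Monte Carlo sketch: draw plausible lifts, then compare the value of acting on the best guess with the value of acting with hindsight. The two Normal distributions below stand in for beliefs before and after data; their parameters and the dollar value per unit of lift are hypothetical.

```python
import random

def expected_regret(lift_draws, value_per_unit_lift):
    """Dollars we expect to lose by acting on our best guess, not the truth."""
    mean_lift = sum(lift_draws) / len(lift_draws)
    # Best action now: launch if expected lift is positive, else hold (worth 0).
    value_acting_now = max(0.0, mean_lift)
    # With hindsight we would launch only when the true lift is positive.
    value_with_hindsight = sum(max(0.0, x) for x in lift_draws) / len(lift_draws)
    return value_per_unit_lift * (value_with_hindsight - value_acting_now)

rng = random.Random(2023)
# Hypothetical beliefs about the lift: wide before data, narrower after.
before = [rng.gauss(0.002, 0.010) for _ in range(100_000)]
after = [rng.gauss(0.0036, 0.0045) for _ in range(100_000)]

regret_before = expected_regret(before, 5_000_000)
regret_after = expected_regret(after, 5_000_000)
print(round(regret_before), round(regret_after))
```

The drop from `regret_before` to `regret_after` is what the data bought in this one scenario. Averaging that drop over the data we might see, before running the test, gives the expected value of sample information.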
Expected Value of Sample Information
[Figure: dollar-value labels $0, $10K, $200K]
Sequential testing decisions
Don’t stop ’til you get enough
• We can do this again after collecting some data
• This changes the core decision from: “is B > A?” to “should I stop or continue testing?”
• Good fit for A/B tests, where we collect data passively just by waiting
• Once more data isn’t worth it, launch the best observed variant; the inference problem is irrelevant (Claxton ’96)
• This is our best information, and it’s not worth getting more
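The stop-or-continue loop can be sketched as follows, assuming a Normal prior on the lift, Normal batch estimates with a known standard error, and a myopic rule that only looks one batch ahead. A full sequential design would look further ahead, so this is a simplification. All names and numbers are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist

Z = NormalDist()

def evsi_next_batch(mean, sd, batch_se, value_per_unit_lift):
    """Expected gain from one more batch before choosing launch vs. hold."""
    pre_sd = sd ** 2 / sqrt(sd ** 2 + batch_se ** 2)  # sd of next posterior mean
    z = mean / pre_sd
    value_with_batch = mean * Z.cdf(z) + pre_sd * Z.pdf(z)  # E[max(0, next mean)]
    return value_per_unit_lift * (value_with_batch - max(0.0, mean))

def update(mean, sd, observed, batch_se):
    """Conjugate Normal update of the belief about the lift."""
    precision = 1 / sd ** 2 + 1 / batch_se ** 2
    new_mean = (mean / sd ** 2 + observed / batch_se ** 2) / precision
    return new_mean, sqrt(1 / precision)

def run_test(true_lift, prior_mean, prior_sd, batch_se, batch_cost,
             value_per_unit_lift, max_batches=60, seed=0):
    rng = random.Random(seed)
    mean, sd = prior_mean, prior_sd
    batches = 0
    # Continue only while the next batch is expected to be worth its cost.
    while (batches < max_batches and
           evsi_next_batch(mean, sd, batch_se, value_per_unit_lift) > batch_cost):
        observed = rng.gauss(true_lift, batch_se)  # simulated batch estimate
        mean, sd = update(mean, sd, observed, batch_se)
        batches += 1
    # Once more data is not worth it, launch the best variant as currently believed.
    return {"batches": batches, "launch": mean > 0,
            "posterior_mean": mean, "posterior_sd": sd}

result = run_test(true_lift=0.004, prior_mean=0.0, prior_sd=0.01,
                  batch_se=0.0134, batch_cost=500, value_per_unit_lift=5_000_000)
print(result)
```

No significance threshold appears anywhere in the loop: the test ends when the next batch costs more than it is expected to return, and the launch decision follows the sign of the current belief.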
Lessons
What’s the Problem?
Going back to basics
There’s no silver bullet
You may have other problems; you’ll need
other solutions
Misuse of tools should prompt us to
rethink the problem
What are we actually trying to solve?
What are the costs, benefits, and risks?
What’s the Problem?
Going back to basics
Are we solving the problem, or treating
symptoms?
Launch-on-neutral, run-til-significant, peeking,
etc. are symptoms, not the root problem
Lots of advanced techniques speed up tests, but
don’t actually address reasons for impatience
Here, there, and everywhere
You’re soaking in it
This isn’t just about A/B testing
But it’s a domain where we have very
familiar tools close at hand
What are we here for?
People who solve problems for people are the luckiest people in the world
This is the fun stuff
This is where we add value as data
scientists
These problems aren’t solved
Try new stuff!
Carl Vogel
Principal Data Scientist
carl.vogel@babylist.com
Thanks!