SlideShare uma empresa Scribd logo
1 de 26
Counteracting Selection
Bias in Machine Learning
1
Noam Finkelstein
MLConf SF
November 8th 2019
Overview
2
➣ Data are collected in all kinds of ways
➣ We pretend they are collected “At Random”
➣ This creates poor predictive performance in important
regions of the input space
➣ We can model the collection process to improve
performance
Takeaways
3
➣ Understand the importance of selection bias in ML
○ Not discussed as much in ML as in statistics
➣ Be able to identify when our data might have this problem.
➣ Learn how to model data collection.
➣ Learn how use our data to learn about selection bias when
possible.
Data Collection Step 1:
Things happen
4
Data Collection Step 2:
Some of them get recorded
5
Bias in Data Collection Step 1
Things happen
6
➣ Selection bias: Correlation between how likely we are to
see a data point (X, Y), and the outcome Y
➣ Example 1:
○ We are asked to create a tool to help project managers
predict profit of software projects
○ Our data include all software projects previously
undertaken at the company
○ PMs are good at their jobs, so projects that lose money
are not in the data much . They just don’t happen.
Bias in Data Collection Step 1
Things happen
7
Project Complexity
Profit
Approved Projects
Bias in Data Collection Step 1
Some Things Don’t Happen
8
Project Complexity
Profit
Approved Projects
Rejected Projects
Bias in Data Collection
99
➣ No ML model can learn about the
“complexity boundary”, even
though we have access to all the
projects that were undertaken.
Nothing is “missing”.
➣ This is a very bad way to fail!
Our model will do badly specifically
where we want it to protect us from
poor decisions.
Modeling the Data Collection Process
1010
➣ We know proposals that are
unlikely to be profitable are unlikely
to occur in the data.
➣ We can incorporate that
knowledge about the data
collection process into our model
to address this problem.
Bias in Data Collection Step 2
We don’t see everything
Weeks
WhiteBloodCellCount
➣ We want to know how patients are doing when they’re away from the clinic
➣ Patients come in when they’re feeling unwell, elevated WBC
➣ We’ll generally predict that they’re worse off than they are
Prediction in Machine Learning
➣ We generally model
➣ g is our favourite class of functions for regression or
classification, parameterized by
➣ “Easy” to do because Y is one dimensional, and
expectations are summary statistics
Modeling Data Collection
➣ Modeling the probability of observing some data,
is too hard (w/ finite data)!
➣ X is high dimensional
➣ Densities are complicated
Modeling Data Collection
➣ In many problems we care about, the probability of making
an observation is a function only of the outcome.
➣ Then the probability of making on observation is:
➣ Which, for (X, Y) pairs we don’t see, can be approximated:
Incorporating Knowledge on Data Collection
➣ If we’re being frequentists, we can define a loss function
that captures both how well we do on prediction outcome,
and how well we do on predicting observation:
Modeling Data Collection
➣ We can now learn from what we don’t see.
➣ We know there are regions of the input space w/ no data
➣ We know we’re less likely to see data w/ low profit
➣ Therefore: profit must be low in those regions
Project Complexity
Profit
Approved Projects
What if we don’t know the data
collection process?
17
➣ We can’t learn p entirely from data - would require us to
know the outcome specifically where we don’t observe it
(in most cases).
➣ If we have beliefs about p and g, we can be Bayesian about
things.
➣ If we have a few data points collected “at random” - i.e. not
according to p - then we can learn p
A Worked Example
18
➣ We have data collected according to some unknown,
non-random process p
WhiteBloodCellCount
Weeks
A Worked Example
19
➣ Functions compatible with this data will have different
behavior in unobserved regions
WhiteBloodCellCount
Weeks
A Worked Example
20
➣ We assume all data are “observed at random”, as usual. Fit
looks good!
➣ Validation data collected by the same process will not help!
WhiteBloodCellCount
Weeks
A Worked Example
21
➣ But it turns out the data was not collected at random -
we’re systematically way off in unobserved regions!
WhiteBloodCellCount
Weeks
A Worked Example
22
➣ What if we know how much more likely we are to make an
observation when the outcome is high?
WhiteBloodCellCount
Weeks
A Worked Example
23
➣ What if we don’t know anything about data collection, but
get a few observations “at random”?
WhiteBloodCellCount
Weeks
A Worked Example
24
➣ What if we don’t know anything about data collection, but
get a few observations “at random”?
WhiteBloodCellCount
Weeks
Conclusions
25
➣ Selection bias hurts us in ML in ways we can’t detect
through normal validation procedures
➣ If we know something about the data collection process
we can incorporate it into our model to improve prediction.
➣ If we happen to have some data collected “at random”, we
can use it to learn about selection bias elsewhere in our
data.
Thank you!
Get in touch
noam@jhu.edu
@nsfinkelstein
26

Mais conteúdo relacionado

Semelhante a Noam Finkelstein - The Importance of Modeling Data Collection

Module 1 introduction to machine learning
Module 1  introduction to machine learningModule 1  introduction to machine learning
Module 1 introduction to machine learningSara Hooker
 
Getting Started with Big Data and Splunk
Getting Started with Big Data and SplunkGetting Started with Big Data and Splunk
Getting Started with Big Data and SplunkTom Chavez
 
"What we learned from 5 years of building a data science software that actual...
"What we learned from 5 years of building a data science software that actual..."What we learned from 5 years of building a data science software that actual...
"What we learned from 5 years of building a data science software that actual...Dataconomy Media
 
Module 1.2 data preparation
Module 1.2  data preparationModule 1.2  data preparation
Module 1.2 data preparationSara Hooker
 
housepriceprediction-180915174356.pdf
housepriceprediction-180915174356.pdfhousepriceprediction-180915174356.pdf
housepriceprediction-180915174356.pdfVinayShekarReddy
 
Advanced sampling part 1 presentation notes
Advanced sampling part 1   presentation notesAdvanced sampling part 1   presentation notes
Advanced sampling part 1 presentation notesAnthony Shingleton
 
U5 a1 stages in the decision making process
U5 a1 stages in the decision making processU5 a1 stages in the decision making process
U5 a1 stages in the decision making processPeter R Breach
 
Analysing The Data
Analysing The DataAnalysing The Data
Analysing The DataAngel Evans
 
Introduction to Machine learning
Introduction to Machine learningIntroduction to Machine learning
Introduction to Machine learningKnoldus Inc.
 
Understanding randomness
Understanding randomnessUnderstanding randomness
Understanding randomnesssuncil0071
 
Data Science & AI Road Map by Python & Computer science tutor in Malaysia
Data Science  & AI Road Map by Python & Computer science tutor in MalaysiaData Science  & AI Road Map by Python & Computer science tutor in Malaysia
Data Science & AI Road Map by Python & Computer science tutor in MalaysiaAhmed Elmalla
 
Getting to Know Your Data with R
Getting to Know Your Data with RGetting to Know Your Data with R
Getting to Know Your Data with RStephen Withington
 
Narrated Version Dallas MPUG
Narrated Version Dallas MPUGNarrated Version Dallas MPUG
Narrated Version Dallas MPUGGlen Alleman
 
Module 1.3 data exploratory
Module 1.3  data exploratoryModule 1.3  data exploratory
Module 1.3 data exploratorySara Hooker
 
Making sense of community engagement, impacts and outcomes
Making sense of community engagement, impacts and outcomesMaking sense of community engagement, impacts and outcomes
Making sense of community engagement, impacts and outcomesMetroWater
 
Data Science Folk Knowledge
Data Science Folk KnowledgeData Science Folk Knowledge
Data Science Folk KnowledgeKrishna Sankar
 

Semelhante a Noam Finkelstein - The Importance of Modeling Data Collection (20)

Module 1 introduction to machine learning
Module 1  introduction to machine learningModule 1  introduction to machine learning
Module 1 introduction to machine learning
 
Getting Started with Big Data and Splunk
Getting Started with Big Data and SplunkGetting Started with Big Data and Splunk
Getting Started with Big Data and Splunk
 
Housing price prediction
Housing price predictionHousing price prediction
Housing price prediction
 
"What we learned from 5 years of building a data science software that actual...
"What we learned from 5 years of building a data science software that actual..."What we learned from 5 years of building a data science software that actual...
"What we learned from 5 years of building a data science software that actual...
 
Module 1.2 data preparation
Module 1.2  data preparationModule 1.2  data preparation
Module 1.2 data preparation
 
housepriceprediction-180915174356.pdf
housepriceprediction-180915174356.pdfhousepriceprediction-180915174356.pdf
housepriceprediction-180915174356.pdf
 
Advanced sampling part 1 presentation notes
Advanced sampling part 1   presentation notesAdvanced sampling part 1   presentation notes
Advanced sampling part 1 presentation notes
 
U5 a1 stages in the decision making process
U5 a1 stages in the decision making processU5 a1 stages in the decision making process
U5 a1 stages in the decision making process
 
Analysing The Data
Analysing The DataAnalysing The Data
Analysing The Data
 
Unit 2.pptx
Unit 2.pptxUnit 2.pptx
Unit 2.pptx
 
Challenges of Big Data Research
Challenges of Big Data ResearchChallenges of Big Data Research
Challenges of Big Data Research
 
Introduction to Machine learning
Introduction to Machine learningIntroduction to Machine learning
Introduction to Machine learning
 
Understanding randomness
Understanding randomnessUnderstanding randomness
Understanding randomness
 
Data Science & AI Road Map by Python & Computer science tutor in Malaysia
Data Science  & AI Road Map by Python & Computer science tutor in MalaysiaData Science  & AI Road Map by Python & Computer science tutor in Malaysia
Data Science & AI Road Map by Python & Computer science tutor in Malaysia
 
Getting to Know Your Data with R
Getting to Know Your Data with RGetting to Know Your Data with R
Getting to Know Your Data with R
 
Narrated Version Dallas MPUG
Narrated Version Dallas MPUGNarrated Version Dallas MPUG
Narrated Version Dallas MPUG
 
Module 1.3 data exploratory
Module 1.3  data exploratoryModule 1.3  data exploratory
Module 1.3 data exploratory
 
Making sense of community engagement, impacts and outcomes
Making sense of community engagement, impacts and outcomesMaking sense of community engagement, impacts and outcomes
Making sense of community engagement, impacts and outcomes
 
Machine Learning for dummies!
Machine Learning for dummies!Machine Learning for dummies!
Machine Learning for dummies!
 
Data Science Folk Knowledge
Data Science Folk KnowledgeData Science Folk Knowledge
Data Science Folk Knowledge
 

Mais de MLconf

Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...
Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...
Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...MLconf
 
Ted Willke - The Brain’s Guide to Dealing with Context in Language Understanding
Ted Willke - The Brain’s Guide to Dealing with Context in Language UnderstandingTed Willke - The Brain’s Guide to Dealing with Context in Language Understanding
Ted Willke - The Brain’s Guide to Dealing with Context in Language UnderstandingMLconf
 
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...MLconf
 
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold RushIgor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold RushMLconf
 
Josh Wills - Data Labeling as Religious Experience
Josh Wills - Data Labeling as Religious ExperienceJosh Wills - Data Labeling as Religious Experience
Josh Wills - Data Labeling as Religious ExperienceMLconf
 
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...MLconf
 
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...MLconf
 
Meghana Ravikumar - Optimized Image Classification on the Cheap
Meghana Ravikumar - Optimized Image Classification on the CheapMeghana Ravikumar - Optimized Image Classification on the Cheap
Meghana Ravikumar - Optimized Image Classification on the CheapMLconf
 
June Andrews - The Uncanny Valley of ML
June Andrews - The Uncanny Valley of MLJune Andrews - The Uncanny Valley of ML
June Andrews - The Uncanny Valley of MLMLconf
 
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection TasksSneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection TasksMLconf
 
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...MLconf
 
Vito Ostuni - The Voice: New Challenges in a Zero UI World
Vito Ostuni - The Voice: New Challenges in a Zero UI WorldVito Ostuni - The Voice: New Challenges in a Zero UI World
Vito Ostuni - The Voice: New Challenges in a Zero UI WorldMLconf
 
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...MLconf
 
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...MLconf
 
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...MLconf
 
Neel Sundaresan - Teaching a machine to code
Neel Sundaresan - Teaching a machine to codeNeel Sundaresan - Teaching a machine to code
Neel Sundaresan - Teaching a machine to codeMLconf
 
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...MLconf
 
Soumith Chintala - Increasing the Impact of AI Through Better Software
Soumith Chintala - Increasing the Impact of AI Through Better SoftwareSoumith Chintala - Increasing the Impact of AI Through Better Software
Soumith Chintala - Increasing the Impact of AI Through Better SoftwareMLconf
 
Roy Lowrance - Predicting Bond Prices: Regime Changes
Roy Lowrance - Predicting Bond Prices: Regime ChangesRoy Lowrance - Predicting Bond Prices: Regime Changes
Roy Lowrance - Predicting Bond Prices: Regime ChangesMLconf
 
Madalina Fiterau - Hybrid Machine Learning Methods for the Interpretation and...
Madalina Fiterau - Hybrid Machine Learning Methods for the Interpretation and...Madalina Fiterau - Hybrid Machine Learning Methods for the Interpretation and...
Madalina Fiterau - Hybrid Machine Learning Methods for the Interpretation and...MLconf
 

Mais de MLconf (20)

Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...
Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...
Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...
 
Ted Willke - The Brain’s Guide to Dealing with Context in Language Understanding
Ted Willke - The Brain’s Guide to Dealing with Context in Language UnderstandingTed Willke - The Brain’s Guide to Dealing with Context in Language Understanding
Ted Willke - The Brain’s Guide to Dealing with Context in Language Understanding
 
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
 
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold RushIgor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
 
Josh Wills - Data Labeling as Religious Experience
Josh Wills - Data Labeling as Religious ExperienceJosh Wills - Data Labeling as Religious Experience
Josh Wills - Data Labeling as Religious Experience
 
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...
 
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
 
Meghana Ravikumar - Optimized Image Classification on the Cheap
Meghana Ravikumar - Optimized Image Classification on the CheapMeghana Ravikumar - Optimized Image Classification on the Cheap
Meghana Ravikumar - Optimized Image Classification on the Cheap
 
June Andrews - The Uncanny Valley of ML
June Andrews - The Uncanny Valley of MLJune Andrews - The Uncanny Valley of ML
June Andrews - The Uncanny Valley of ML
 
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection TasksSneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
 
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
 
Vito Ostuni - The Voice: New Challenges in a Zero UI World
Vito Ostuni - The Voice: New Challenges in a Zero UI WorldVito Ostuni - The Voice: New Challenges in a Zero UI World
Vito Ostuni - The Voice: New Challenges in a Zero UI World
 
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
 
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
 
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
 
Neel Sundaresan - Teaching a machine to code
Neel Sundaresan - Teaching a machine to codeNeel Sundaresan - Teaching a machine to code
Neel Sundaresan - Teaching a machine to code
 
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
 
Soumith Chintala - Increasing the Impact of AI Through Better Software
Soumith Chintala - Increasing the Impact of AI Through Better SoftwareSoumith Chintala - Increasing the Impact of AI Through Better Software
Soumith Chintala - Increasing the Impact of AI Through Better Software
 
Roy Lowrance - Predicting Bond Prices: Regime Changes
Roy Lowrance - Predicting Bond Prices: Regime ChangesRoy Lowrance - Predicting Bond Prices: Regime Changes
Roy Lowrance - Predicting Bond Prices: Regime Changes
 
Madalina Fiterau - Hybrid Machine Learning Methods for the Interpretation and...
Madalina Fiterau - Hybrid Machine Learning Methods for the Interpretation and...Madalina Fiterau - Hybrid Machine Learning Methods for the Interpretation and...
Madalina Fiterau - Hybrid Machine Learning Methods for the Interpretation and...
 

Último

Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Bhuvaneswari Subramani
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 

Último (20)

Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 

Noam Finkelstein - The Importance of Modeling Data Collection

  • 1. Counteracting Selection Bias in Machine Learning 1 Noam Finkelstein MLConf SF November 8th 2019
  • 2. Overview 2 ➣ Data are collected in all kinds of ways ➣ We pretend they are collected “At Random” ➣ This creates poor predictive performance in important regions of the input space ➣ We can model the collection process to improve performance
  • 3. Takeaways 3 ➣ Understand the importance of selection bias in ML ○ Not discussed as much in ML as in statistics ➣ Be able to identify when our data might have this problem. ➣ Learn how to model data collection. ➣ Learn how use our data to learn about selection bias when possible.
  • 4. Data Collection Step 1: Things happen 4
  • 5. Data Collection Step 2: Some of them get recorded 5
  • 6. Bias in Data Collection Step 1 Things happen 6 ➣ Selection bias: Correlation between how likely we are to see a data point (X, Y), and the outcome Y ➣ Example 1: ○ We are asked to create a tool to help project managers predict profit of software projects ○ Our data include all software projects previously undertaken at the company ○ PMs are good at their jobs, so projects that lose money are not in the data much . They just don’t happen.
  • 7. Bias in Data Collection Step 1 Things happen 7 Project Complexity Profit Approved Projects
  • 8. Bias in Data Collection Step 1 Some Things Don’t Happen 8 Project Complexity Profit Approved Projects Rejected Projects
  • 9. Bias in Data Collection 99 ➣ No ML model can learn about the “complexity boundary”, even though we have access to all the projects that were undertaken. Nothing is “missing”. ➣ This is a very bad way to fail! Our model will do badly specifically where we want it to protect us from poor decisions.
  • 10. Modeling the Data Collection Process 1010 ➣ We know proposals that are unlikely to be profitable are unlikely to occur in the data. ➣ We can incorporate that knowledge about the data collection process into our model to address this problem.
  • 11. Bias in Data Collection Step 2 We don’t see everything Weeks WhiteBloodCellCount ➣ We want to know how patients are doing when they’re away from the clinic ➣ Patients come in when they’re feeling unwell, elevated WBC ➣ We’ll generally predict that they’re worse off than they are
  • 12. Prediction in Machine Learning ➣ We generally model ➣ g is our favourite class of functions for regression or classification, parameterized by ➣ “Easy” to do because Y is one dimensional, and expectations are summary statistics
  • 13. Modeling Data Collection ➣ Modeling the probability of observing some data, is too hard (w/ finite data)! ➣ X is high dimensional ➣ Densities are complicated
  • 14. Modeling Data Collection ➣ In many problems we care about, the probability of making an observation is a function only of the outcome. ➣ Then the probability of making on observation is: ➣ Which, for (X, Y) pairs we don’t see, can be approximated:
  • 15. Incorporating Knowledge on Data Collection ➣ If we’re being frequentists, we can define a loss function that captures both how well we do on prediction outcome, and how well we do on predicting observation:
  • 16. Modeling Data Collection ➣ We can now learn from what we don’t see. ➣ We know there are regions of the input space w/ no data ➣ We know we’re less likely to see data w/ low profit ➣ Therefore: profit must be low in those regions Project Complexity Profit Approved Projects
  • 17. What if we don’t know the data collection process? 17 ➣ We can’t learn p entirely from data - would require us to know the outcome specifically where we don’t observe it (in most cases). ➣ If we have beliefs about p and g, we can be Bayesian about things. ➣ If we have a few data points collected “at random” - i.e. not according to p - then we can learn p
  • 18. A Worked Example 18 ➣ We have data collected according to some unknown, non-random process p WhiteBloodCellCount Weeks
  • 19. A Worked Example 19 ➣ Functions compatible with this data will have different behavior in unobserved regions WhiteBloodCellCount Weeks
  • 20. A Worked Example 20 ➣ We assume all data are “observed at random”, as usual. Fit looks good! ➣ Validation data collected by the same process will not help! WhiteBloodCellCount Weeks
  • 21. A Worked Example 21 ➣ But it turns out the data was not collected at random - we’re systematically way off in unobserved regions! WhiteBloodCellCount Weeks
  • 22. A Worked Example 22 ➣ What if we know how much more likely we are to make an observation when the outcome is high? WhiteBloodCellCount Weeks
  • 23. A Worked Example 23 ➣ What if we don’t know anything about data collection, but get a few observations “at random”? WhiteBloodCellCount Weeks
  • 24. A Worked Example 24 ➣ What if we don’t know anything about data collection, but get a few observations “at random”? WhiteBloodCellCount Weeks
  • 25. Conclusions 25 ➣ Selection bias hurts us in ML in ways we can’t detect through normal validation procedures ➣ If we know something about the data collection process we can incorporate it into our model to improve prediction. ➣ If we happen to have some data collected “at random”, we can use it to learn about selection bias elsewhere in our data.
  • 26. Thank you! Get in touch noam@jhu.edu @nsfinkelstein 26