Learning with Exploration
Alina Beygelzimer
Yahoo Labs, New York
(based on work by many)
Interactive Learning
Repeatedly:
1 A user comes to Yahoo
2 Yahoo chooses content to present (URLs, ads, news stories)
3 The user reacts to the presented information (clicks on something)
Making good content decisions requires learning from user feedback.
Abstracting the Setting
For t = 1, . . . , T:
1 The world produces some context x ∈ X
2 The learner chooses an action a ∈ A
3 The world reacts with reward r(a, x)
Goal: Learn a good policy for choosing actions given context
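To make the protocol concrete, here is a minimal sketch of the loop in Python. The toy contexts, actions, and click rates are illustrative assumptions (loosely matching the example that follows), not part of the slides:

```python
import random

# A toy world with two contexts (cities) and two actions (foods).
CONTEXTS = ["NY", "Chicago"]
ACTIONS = ["bagel", "pizza"]
CTR = {("bagel", "NY"): 0.6, ("pizza", "NY"): 1.0,
       ("bagel", "Chicago"): 0.7, ("pizza", "Chicago"): 0.4}

def policy(context):
    # Some current rule for choosing actions given context.
    return "bagel" if context == "NY" else "pizza"

for t in range(10):
    x = random.choice(CONTEXTS)             # 1. world produces context x
    a = policy(x)                           # 2. learner chooses action a
    r = int(random.random() < CTR[(a, x)])  # 3. world reacts with reward r(a, x)
```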
Dominant Solution
1 Deploy some initial system
2 Collect data using this system
3 Use machine learning to build a reward predictor r̂(a, x) from
collected data
4 Evaluate the new system, π(x) = arg max_a r̂(a, x):
   offline evaluation on past data
   bucket test
5 If metrics improve, switch to this new system and repeat
Example: Bagels vs. Pizza for New York and Chicago users
Initial system: NY gets bagels, Chicago gets pizza.

Observed CTR:
              Pizza    Bagels
New York        ?       0.6
Chicago        0.4       ?

Train a reward model on the collected data (Observed / Estimated CTR):
              Pizza     Bagels
New York     ?/0.5     0.6/0.6
Chicago     0.4/0.4     ?/0.5

Bagels win. Switch to serving bagels for all and update the model
based on the new data. Chicago bagels come in at an observed CTR of
0.7, and the model again fits the collected data:
              Pizza       Bagels
New York    ?/0.4595     0.6/0.6
Chicago     0.4/0.4      0.7/0.7

Now reveal the truth (Observed / Estimated / True CTR):
              Pizza          Bagels
New York    ?/0.4595/1    0.6/0.6/0.6
Chicago    0.4/0.4/0.4    0.7/0.7/0.7

Yikes! Missed out big in NY: pizza there has a true CTR of 1, but the
deployed system never tried it.
Basic Observations
1 Standard machine learning is not enough. The model fits collected
data perfectly.
2 More data doesn’t help: Observed = True where data was collected.
3 Better data helps! Exploration is required.
4 Prediction errors are not a proxy for controlled exploration.
Attempt to fix
New policy: bagels in the morning, pizza at night for both cities.
This will overestimate the CTR for both!
Solution: The deployed system should be randomized, with probabilities
recorded.
Offline Evaluation
Evaluating a new system on data collected by the deployed system may
mislead badly (Observed / Estimated / True CTR):
              Pizza         Bagels
New York     ?/1/1       0.6/0.6/0.5
Chicago    0.4/0.4/0.4   0.7/0.7/0.7

The new system appears worse than the deployed system on the
collected data, although its true loss may be much lower.
The Evaluation Problem
Given a new policy, how do we evaluate it?
One possibility: Deploy it in the world.
Very expensive! Need a bucket for every candidate policy.
A/B testing for evaluating two policies
Policy 1: Use the “pizza for New York, bagels for Chicago” rule
Policy 2: Use the “bagels for everyone” rule
Segment users randomly into Policy 1 and Policy 2 groups, and serve
each arriving user according to their group’s policy:

Policy 2    Policy 1    Policy 2    Policy 1    . . .
no click    no click    click       no click

Two weeks later, evaluate which is better.
Instead randomize every transaction
(at least for transactions you plan to use for learning and/or evaluation)

Simplest strategy: ε-greedy. Go with the empirically best policy, but
always choose a random action with probability ε > 0.

Log every event as a tuple (context, action, reward, probability);
here b = bagel and p = pizza, with selection probabilities p_b, p_p:

no click        no click        click           no click        · · ·
(x, b, 0, p_b)  (x, p, 0, p_p)  (x, p, 1, p_p)  (x, b, 0, p_b)

Offline evaluation
Later evaluate any policy using the same events. Each evaluation is
cheap and immediate.
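A minimal sketch of ε-greedy exploration with propensity logging, in Python. The action set, reward placeholder, and current best policy are hypothetical stand-ins; the point is that each logged tuple records the probability with which the chosen action was selected:

```python
import random

EPSILON = 0.1
ACTIONS = ["bagel", "pizza"]

def epsilon_greedy(context, best_policy):
    """Pick an action ε-greedily and return it together with the
    probability with which it was chosen (the logged propensity)."""
    greedy = best_policy(context)
    explore = random.random() < EPSILON
    a = random.choice(ACTIONS) if explore else greedy
    if a == greedy:
        p = (1 - EPSILON) + EPSILON / len(ACTIONS)
    else:
        p = EPSILON / len(ACTIONS)
    return a, p

log = []
for x in ["NY", "Chicago", "NY"]:
    a, p = epsilon_greedy(x, lambda c: "bagel")  # hypothetical current policy
    r = 0  # plug in the observed reward here (click = 1, no click = 0)
    log.append((x, a, r, p))  # exploration sample (x, a, r_a, p_a)
```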
The Importance Weighting Trick
Let π : X → A be a policy. How do we evaluate it?
Collect exploration samples of the form
(x, a, r_a, p_a),
where
x = context
a = action
r_a = reward for the action
p_a = probability of choosing action a
then evaluate
Value(π) = Average[ r_a · 1(π(x) = a) / p_a ]
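A sketch of this inverse-propensity estimator over logged samples. The function and variable names are assumptions, matching the tuple format above:

```python
def ips_value(policy, samples):
    """Importance-weighted estimate of a policy's value from
    exploration samples of the form (x, a, r_a, p_a)."""
    total = 0.0
    for x, a, r, p in samples:
        if policy(x) == a:       # indicator 1(π(x) = a)
            total += r / p       # importance-weight the observed reward
    return total / len(samples)  # the average keeps the zero terms
```

Only samples where the evaluated policy agrees with the logged action contribute; dividing by p_a compensates for how often the logging policy chose that action.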
The Importance Weighting Trick

Theorem: Value(π) is an unbiased estimate of the expected reward of π:
E_{(x,r)∼D}[ r_π(x) ] = E[ Value(π) ],
with deviations bounded by O(1 / √(T · min_x p_π(x))).

Example:
              Action 1    Action 2
Reward          0.5          1
Probability     1/4         3/4
Estimate       2 or 0     0 or 4/3

Evaluating the policy that always plays action 1: the estimate is
0.5/(1/4) = 2 when action 1 was logged (probability 1/4) and 0
otherwise, so its expectation is 0.5. Likewise for action 2: the
estimate is 1/(3/4) = 4/3 with probability 3/4 and 0 otherwise, with
expectation 1.
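A quick simulation of this two-action example (a sanity check, not from the slides) shows the estimator averaging to the true rewards:

```python
import random

REWARD = {1: 0.5, 2: 1.0}    # rewards from the table above
PROB   = {1: 0.25, 2: 0.75}  # logging probabilities from the table

def one_estimate(target_action):
    # Draw one logged action, then importance-weight its reward.
    a = 1 if random.random() < PROB[1] else 2
    return REWARD[a] / PROB[a] if a == target_action else 0.0

T = 100_000
for target in (1, 2):
    avg = sum(one_estimate(target) for _ in range(T)) / T
    print(f"policy 'always {target}': estimate {avg:.3f}, truth {REWARD[target]}")
```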
Can we do better?
Suppose we have a (possibly bad) reward estimator r̂(a, x). How can we
use it?

Value′(π) = Average[ (r_a − r̂(a, x)) · 1(π(x) = a) / p_a + r̂(π(x), x) ]

Why does this work?

E_{a∼p}[ r̂(a, x) · 1(π(x) = a) / p_a ] = r̂(π(x), x),

so subtracting r̂ inside the importance weight and adding it back
outside keeps the estimate unbiased. It helps because a small residual
r_a − r̂(a, x) reduces variance.
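A sketch of this doubly robust estimator, extending the importance-weighted version above; the reward model r_hat is an assumed stand-in for any estimator r̂(a, x):

```python
def dr_value(policy, samples, r_hat):
    """Doubly robust estimate of a policy's value from exploration
    samples (x, a, r_a, p_a), given any reward estimator r_hat(a, x)."""
    total = 0.0
    for x, a, r, p in samples:
        estimate = r_hat(policy(x), x)         # model term r̂(π(x), x)
        if policy(x) == a:
            estimate += (r - r_hat(a, x)) / p  # importance-weighted residual
        total += estimate
    return total / len(samples)
```

If r_hat is accurate the residual nearly vanishes and with it most of the importance-weighting variance; if r_hat is bad, the correction term still keeps the estimate unbiased.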
How do you directly optimize based on past exploration data?
1 Learn r̂(a, x).
2 Compute for each x and a′ ∈ A the doubly robust reward estimate
  (r_a − r̂(a, x)) · 1(a′ = a) / p_a + r̂(a′, x)
3 Learn π using a cost-sensitive multiclass classifier (see the
sketch below).
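A sketch of this reduction, under the same assumed names as above; each logged sample becomes one cost-sensitive multiclass example with a reward estimate per candidate action:

```python
def make_cost_sensitive_examples(samples, actions, r_hat):
    """Turn exploration samples (x, a, r_a, p_a) into cost-sensitive
    multiclass examples: one reward estimate per candidate action a'."""
    examples = []
    for x, a, r, p in samples:
        rewards = {}
        for a_prime in actions:
            est = r_hat(a_prime, x)           # model term r̂(a', x)
            if a_prime == a:                  # indicator 1(a' = a)
                est += (r - r_hat(a, x)) / p  # doubly robust correction
            rewards[a_prime] = est
        examples.append((x, rewards))         # cost of a' = -rewards[a']
    return examples
```

A cost-sensitive multiclass learner then trains π to pick, for each x, the action with the highest estimated reward (lowest cost).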
Take home summary
Using exploration data
1 There are techniques for using past exploration data to
evaluate any policy.
2 You can reliably measure performance offline, and hence
experiment much faster, shifting from guess-and-check (A/B
testing) to direct optimization.
Doing exploration
1 There has been much recent progress on practical
regret-optimal algorithms.
2 ε-greedy has suboptimal regret but is a reasonable choice in
practice.
Comparison of Approaches

              Supervised             ε-greedy                      Optimal CB algorithms
Feedback      full                   bandit                        bandit
Regret        O(√(ln(|Π|/δ) / T))    O((|A| ln(|Π|/δ) / T)^(1/3))  O(√(|A| ln(|Π|/δ) / T))
Running time  O(T)                   O(T)                          O(T^1.5)

A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, R. Schapire: Taming
the Monster: A Fast and Simple Algorithm for Contextual Bandits, 2014.
M. Dudik, D. Hsu, S. Kale, N. Karampatziakis, J. Langford, L. Reyzin,
T. Zhang: Efficient Optimal Learning for Contextual Bandits, 2011.
A. Beygelzimer, J. Langford, L. Li, L. Reyzin, R. Schapire: Contextual
Bandit Algorithms with Supervised Learning Guarantees, 2011.