Learning through exploration: I will talk about interactive learning applied to several core problems at Yahoo. Solving these problems well requires learning from user feedback. The difficulty is that only the feedback for what is actually shown to the user is observed. The need for exploration makes these problems fundamentally different from standard supervised learning problems—if a choice is not explored, we can’t optimize for it. Through examples, I will discuss the importance of gathering the right data. I will then discuss how to reuse data collected by production systems for offline evaluation and direct optimization. Being able to reliably measure performance offline allows for much faster experimentation, shifting from guess-and-check with A/B testing to direct optimization.
Interactive Learning

Repeatedly:
1. A user comes to Yahoo.
2. Yahoo chooses content to present (URLs, ads, news stories).
3. The user reacts to the presented information (e.g., clicks on something).

Making good content decisions requires learning from user feedback.
Abstracting the Setting

For t = 1, ..., T:
1. The world produces some context x ∈ X.
2. The learner chooses an action a ∈ A.
3. The world reacts with reward r(a, x).

Goal: learn a good policy for choosing actions given context.
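As a concrete illustration, here is a minimal runnable sketch of this loop (not from the talk): the true click-through rates are borrowed from the bagels-vs-pizza example below, and the uniform-random learner is just a placeholder.

```python
import random

# True CTRs taken from the example below; only r(a, x) for the
# chosen action is ever revealed to the learner.
TRUE_CTR = {("NY", "pizza"): 1.0,      ("NY", "bagel"): 0.6,
            ("Chicago", "pizza"): 0.4, ("Chicago", "bagel"): 0.7}
ACTIONS = ["pizza", "bagel"]

T, total = 10000, 0
for t in range(T):
    x = random.choice(["NY", "Chicago"])          # 1. world produces context x
    a = random.choice(ACTIONS)                    # 2. learner chooses action a
    r = int(random.random() < TRUE_CTR[(x, a)])   # 3. world reveals r(a, x) only
    total += r
print("average reward:", total / T)
```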
Dominant Solution

1. Deploy some initial system.
2. Collect data using this system.
3. Use machine learning to build a reward predictor r̂(a, x) from the collected data.
4. Evaluate the new system argmax_a r̂(a, x), via
   - offline evaluation on past data, and
   - a bucket test.
5. If metrics improve, switch to the new system and repeat.
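A minimal sketch of steps 3 and 4, under stated assumptions: rewards are clicks in {0, 1} and r̂ is a simple per-(context, action) empirical mean. A real system would instead fit a model that generalizes across cells, which is how the example below produces estimates like 0.4595 for unobserved pairs.

```python
from collections import defaultdict

def fit_reward_predictor(logged):
    """Step 3: estimate r_hat(a, x) from logged (x, a, r) triples,
    here as an empirical mean per (context, action) pair."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, a, r in logged:
        sums[(x, a)] += r
        counts[(x, a)] += 1
    # Unobserved (x, a) pairs fall back to a default guess of 0.5.
    return lambda a, x: sums[(x, a)] / counts[(x, a)] if counts[(x, a)] else 0.5

def greedy_policy(r_hat, actions):
    """Step 4: the new system chooses argmax_a r_hat(a, x)."""
    return lambda x: max(actions, key=lambda a: r_hat(a, x))
```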
Example: Bagels vs. Pizza for New York and Chicago users

Initial system: NY gets bagels, Chicago gets pizza.

Observed CTR:

            Pizza    Bagel
New York    ?        0.6
Chicago     0.4      ?

Fit a model to the collected data (Observed CTR/Estimated CTR):

            Pizza      Bagel
New York    ?/0.5      0.6/0.6
Chicago     0.4/0.4    ?/0.5

Bagels win. Switch to serving bagels for all and update the model based on the new data. First the new observations arrive:

            Pizza      Bagel
New York    ?/0.5      0.6/0.6
Chicago     0.4/0.4    0.7/0.5

Then the model is updated:

            Pizza       Bagel
New York    ?/0.4595    0.6/0.6
Chicago     0.4/0.4     0.7/0.7

Now compare to reality (Observed CTR/Estimated CTR/True CTR):

            Pizza          Bagel
New York    ?/0.4595/1     0.6/0.6/0.6
Chicago     0.4/0.4/0.4    0.7/0.7/0.7

Yikes! Missed out big in NY!
Basic Observations

1. Standard machine learning is not enough: the model fits the collected data perfectly.
2. More data doesn't help: Observed = True wherever data was collected.
3. Better data helps! Exploration is required.
4. Prediction errors are not a proxy for controlled exploration.
Attempt to Fix

New policy: bagels in the morning, pizza at night, for both cities.

This will overestimate the CTR for both! (Each item is now shown only at the time of day when it is presumably most appealing, so the logged CTRs are inflated relative to all-day performance.)

Solution: the deployed system should be randomized, with probabilities recorded.
Offline Evaluation

Evaluating a new system on data collected by the deployed system may mislead badly (Observed CTR/Estimated CTR/True CTR):

            Pizza          Bagel
New York    ?/1/1          0.6/0.6/0.5
Chicago     0.4/0.4/0.4    0.7/0.7/0.7

The new system appears worse than the deployed system on the collected data, although its true loss may be much lower.
The Evaluation Problem

Given a new policy, how do we evaluate it?

One possibility: deploy it in the world. Very expensive! You need a bucket for every candidate policy.
A/B Testing for Evaluating Two Policies

Policy 1: use the pizza-for-New-York, bagels-for-Chicago rule.
Policy 2: use the bagels-for-everyone rule.

Segment users randomly into Policy 1 and Policy 2 groups (one way to implement the split is sketched below), serve each user according to their group's policy, and record the outcomes:

Group:     Policy 2    Policy 1    Policy 2    Policy 1    ...
Outcome:   no click    no click    click       no click    ...

Two weeks later, evaluate which policy is better.
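One common way to implement the random segmentation is hash-based bucketing, so each user's assignment stays stable across visits. A minimal sketch; the function name and hashing scheme are illustrative, not from the talk:

```python
import hashlib

def assign_policy(user_id: str, n_policies: int = 2) -> int:
    """Deterministically map a user to a policy bucket: the hash output
    is spread roughly uniformly over buckets, and the same user always
    lands in the same bucket."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % n_policies
```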
Instead, Randomize Every Transaction

(at least for transactions you plan to use for learning and/or evaluation)

Simplest strategy: ε-greedy. Go with the empirically best policy, but always choose a random action with probability ε > 0.

Log each transaction as a tuple (x, a, r_a, p_a): the context, the action taken, the observed reward, and the probability with which the action was chosen. With b = bagel and p = pizza, a stream of logged transactions looks like:

no click          no click          click             no click          ...
(x, b, 0, p_b)    (x, p, 0, p_p)    (x, p, 1, p_p)    (x, b, 0, p_b)    ...

Offline evaluation: later, evaluate any policy using the same events. Each evaluation is cheap and immediate.
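A minimal sketch of one ε-greedy transaction with the action probability recorded for the log (names are illustrative assumptions):

```python
import random

def epsilon_greedy_log(x, best_policy, actions, epsilon=0.1):
    """One ε-greedy transaction: explore uniformly with probability ε,
    exploit otherwise, and return the probability the chosen action had
    of being selected. `best_policy` maps a context to the empirically
    best action."""
    greedy = best_policy(x)
    a = random.choice(actions) if random.random() < epsilon else greedy
    # Any action gets ε/|A| from the exploration draw; the greedy action
    # additionally gets the (1 - ε) exploitation mass.
    p_a = epsilon / len(actions) + (1 - epsilon) * (a == greedy)
    return a, p_a

# Usage: serve action a, observe reward r, and log the tuple (x, a, r, p_a).
```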
The Importance Weighting Trick

Let π : X → A be a policy. How do we evaluate it?

Collect exploration samples of the form (x, a, r_a, p_a), where

  x   = context
  a   = action taken
  r_a = reward observed for the action
  p_a = probability with which action a was chosen

then evaluate

  Value(π) = Average( r_a · 1(π(x) = a) / p_a )
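A direct translation of this estimator into code (a sketch; `samples` holds the exploration tuples defined above):

```python
def ips_value(policy, samples):
    """Importance-weighted (inverse propensity score) estimate of
    Value(pi) from exploration tuples (x, a, r_a, p_a)."""
    return sum(r * (policy(x) == a) / p
               for x, a, r, p in samples) / len(samples)
```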
Theorem: Value(π) is an unbiased estimate of the expected reward of π,

  E_{(x,r)∼D}[ r_{π(x)} ] = E[ Value(π) ],

with deviations bounded by O( 1 / √( T · min_x p_{π(x)} ) ).

Example:

Action         1        2
Reward         0.5      1
Probability    1/4      3/4
Estimate       2 | 0    0 | 4/3

The Estimate row reads "if this action was logged | if the other was": when action 1 is drawn (probability 1/4), a policy that picks action 1 receives the estimate 0.5/(1/4) = 2 and a policy that picks action 2 receives 0; when action 2 is drawn (probability 3/4), the estimates are 0 and 1/(3/4) = 4/3. In expectation each policy's estimate equals its true reward: (1/4)·2 = 0.5 and (3/4)·(4/3) = 1.
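A quick numerical check of this example (not from the talk): simulate the logging distribution and average the importance-weighted estimates.

```python
import random

REWARD = {1: 0.5, 2: 1.0}   # true rewards of the two actions
PROB = {1: 0.25, 2: 0.75}   # logging probabilities

def ips_estimate(logged_action, policy_action):
    """Importance-weighted estimate from a single logged action."""
    return REWARD[logged_action] * (policy_action == logged_action) / PROB[logged_action]

n = 100_000
for policy_action in (1, 2):
    avg = sum(ips_estimate(random.choices((1, 2), weights=(1, 3))[0],
                           policy_action) for _ in range(n)) / n
    print(policy_action, round(avg, 3))  # ≈ 0.5 and ≈ 1.0: unbiased
```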
Can We Do Better?

Suppose we have a (possibly bad) reward estimator r̂(a, x). How can we use it?

  Value′(π) = Average( (r_a − r̂(a, x)) · 1(π(x) = a) / p_a + r̂(π(x), x) )

Why does this work? Because

  E_{a∼p}[ r̂(a, x) · 1(π(x) = a) / p_a ] = r̂(π(x), x),

the importance-weighted r̂ term subtracted inside the average cancels, in expectation, the r̂(π(x), x) term added outside, so the estimate stays unbiased. It helps because whenever r_a − r̂(a, x) is small, the variance of the importance-weighted term is reduced.
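The same estimator as a code sketch (this is the doubly-robust form; `r_hat` is any reward-estimator function and `samples` the exploration tuples from before):

```python
def dr_value(policy, samples, r_hat):
    """Doubly-robust estimate of Value(pi): the reward model r_hat
    supplies a baseline, and importance weighting corrects its error
    on the actions that were actually tried."""
    total = 0.0
    for x, a, r, p in samples:
        total += (r - r_hat(a, x)) * (policy(x) == a) / p + r_hat(policy(x), x)
    return total / len(samples)
```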
How Do You Directly Optimize Based on Past Exploration Data?

1. Learn r̂(a, x).
2. For each x, compute a reward estimate for every a′ ∈ A:

   (r_a − r̂(a, x)) · 1(a′ = a) / p_a + r̂(a′, x)

3. Learn π using a cost-sensitive multiclass classifier (see the sketch below).
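A sketch of step 2 under the same assumptions as above; producing the actual classifier (step 3) is left to any cost-sensitive multiclass learner.

```python
def make_cost_sensitive_examples(samples, r_hat, actions):
    """For each logged (x, a, r_a, p_a), estimate the reward of every
    action a2 via the doubly-robust formula; a cost-sensitive learner
    then trains pi to pick the argmax over these per-action estimates."""
    examples = []
    for x, a, r, p in samples:
        estimates = {a2: (r - r_hat(a, x)) * (a2 == a) / p + r_hat(a2, x)
                     for a2 in actions}
        examples.append((x, estimates))
    return examples
```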
Take-Home Summary

Using exploration data:
1. There are techniques for using past exploration data to evaluate any policy.
2. You can reliably measure performance offline, and hence experiment much faster, shifting from guess-and-check (A/B testing) to direct optimization.

Doing exploration:
1. There has been much recent progress on practical regret-optimal algorithms.
2. ε-greedy has suboptimal regret but is a reasonable choice in practice.
Comparison of Approaches

                Supervised             ε-greedy                        Optimal CB algorithms
Feedback        full                   bandit                          bandit
Regret          O(√(ln(|Π|/δ) / T))    O((|A| ln(|Π|/δ) / T)^(1/3))    O(√(|A| ln(|Π|/δ) / T))
Running time    O(T)                   O(T)                            O(T^1.5)
A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, and R. Schapire. Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits. 2014.
M. Dudik, D. Hsu, S. Kale, N. Karampatziakis, J. Langford, L. Reyzin, and T. Zhang. Efficient Optimal Learning for Contextual Bandits. 2011.
A. Beygelzimer, J. Langford, L. Li, L. Reyzin, and R. Schapire. Contextual Bandit Algorithms with Supervised Learning Guarantees. 2011.