1. The document discusses fairness constraints in contextual bandit problems and classic bandit problems.
2. It shows that for classic bandits, $\Theta(k^3)$ rounds are necessary and sufficient to achieve non-trivial regret under fairness constraints.
3. For contextual bandits, it establishes a tight relationship between achieving fairness and Knows What It Knows (KWIK) learning: KWIK learnability implies the existence of fair learning algorithms, and conversely fair learnability implies KWIK learnability.
1. Introduction of "Fairness in Learning: Classic and Contextual Bandits"
authored by Matthew Joseph, Michael Kearns, Jamie Morgenstern, and Aaron Roth
NIPS2016-Yomi
January 19, 2017
Presenter: Kazuto Fukuchi
2. Fairness in Machine Learning
Consequential decisions made with machine learning may lead to unfair treatment
E.g., Google's ad suggestion system [Sweeney 13]
[Figure: searches for African-descent names triggered negative "Arrested?" ads, while searches for European-descent names triggered neutral "Located" ads]
This talk: fairness in the contextual bandit problem
3. Individual Fairness
$k$ persons
• Choose one person to receive an action
• E.g., a loan, a job offer, admission, etc.
When can we preferentially choose one person?
Only if that person has the largest ability; there must be no other reason for a preferential choice
[Figure: two loan applicants, one with payback rate 90% > one with 60%]
4. Contextual Bandit Problem
Each round $t$:
1. Obtain a context $x_j^t$ for each arm $j$
2. Choose one arm $i^t \in [k]$
3. Observe reward $r_{i^t}^t$, where $\mathbb{E}[r_j^t] = f_j(x_j^t)$ and $r_j^t \in [0,1]$ a.s.
[Figure: $k$ arms with reward functions $f_1, \dots, f_5$, unknown to the learner]
Goal: maximize the expected cumulative reward
$\mathbb{E}\left[\sum_t r_{i^t}^t\right] = \mathbb{E}\left[\sum_t f_{i^t}(x_{i^t}^t)\right]$
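The protocol above can be sketched as a simulation loop. This is a minimal sketch, not the paper's algorithm: the `policy` signature and the choice of Bernoulli rewards (which satisfy $r_j^t \in [0,1]$ and $\mathbb{E}[r_j^t] = f_j(x_j^t)$) are assumptions for illustration.

```python
import random

def run_bandit(f, contexts, policy, rng):
    """Simulate the contextual bandit protocol.

    f:        list of k mean-reward functions f_j (unknown to the learner)
    contexts: contexts[t][j] is the context x_j^t for arm j at round t
    policy:   maps (x^t, history) to a probability vector over the k arms
    Rewards are drawn as Bernoulli(f_j(x_j^t)), so each reward lies in
    [0, 1] and has mean f_j(x_j^t), as the protocol requires.
    """
    history, total = [], 0.0
    for x_t in contexts:
        probs = policy(x_t, history)                      # pi^t
        i = rng.choices(range(len(f)), weights=probs)[0]  # draw arm i^t
        r = 1.0 if rng.random() < f[i](x_t[i]) else 0.0   # observe r_{i^t}^t
        history.append((x_t, i, r))                       # extend h^t
        total += r
    return total, history
```

With a uniform policy such as `lambda x_t, h: [1 / len(f)] * len(f)` this runs the protocol without any learning; a learning algorithm would supply a smarter policy.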
5. Example: Linear Contextual Bandit
Define
$C = \{ f_\theta : f_\theta(x) = \langle \theta, x \rangle,\ \theta \in \mathbb{R}^d,\ \|\theta\| \le 1 \}$
$\mathcal{X} = \{ x \in \mathbb{R}^d : \|x\| \le 1 \}$
• Suppose $f_j = f_{\theta_j} \in C$ and $x_j^t \in \mathcal{X}$
E.g., online recommendation:
• $\theta_j$: feature vector of product $j$
• $x_j^t$: feature vector of user $t$ regarding product $j$
• The score of user $t$ for product $j$ is the inner product $\langle x_j^t, \theta_j \rangle$
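The two definitions above are easy to make concrete. The helper names below are hypothetical; note that for unit vectors $\langle \theta, x \rangle \in [-1, 1]$, so a real instance would rescale the score into $[0,1]$, e.g. via $(1 + \langle \theta, x \rangle)/2$.

```python
import math

def unit_clip(v):
    """Project a vector onto the unit ball, so it is a valid theta or x."""
    norm = math.sqrt(sum(c * c for c in v))
    return list(v) if norm <= 1 else [c / norm for c in v]

def linear_score(theta, x):
    """f_theta(x) = <theta, x>: the score of user context x for product theta."""
    return sum(t_i * x_i for t_i, x_i in zip(theta, x))
```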
6. Example: Classic Bandit
• Expected reward is $\mathbb{E}[r_j^t] = \mu_j$
• Set $f_j(x_j^t) = \mu_j$ for any $x_j^t$
• Then the contextual bandit reduces to the classic bandit
[Figure: $k$ arms with fixed means $\mu_1, \dots, \mu_5$]
7. Regret
• History $h^t$: a record of the first $t-1$ experiences (contexts, arms chosen, and rewards observed)
• A policy $\pi$: a mapping from $x^t$ and $h^t$ to a distribution over the arms $[k]$
• $\pi_{j|h^t}^t$: the probability of choosing arm $j$ at round $t$ given $h^t$
Regret: reward lost compared to the optimal policy
$\mathrm{Regret}(x^1, \dots, x^T) = \sum_t \left( \max_j f_j(x_j^t) - \mathbb{E}_{i^t \sim \pi^t}\left[ f_{i^t}(x_{i^t}^t) \right] \right)$
$\pi$ has regret bound $R(T)$ if $\max_{x^1, \dots, x^T} \mathrm{Regret}(x^1, \dots, x^T) \le R(T)$
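The regret definition above translates directly into code. A sketch, assuming the per-round distributions $\pi^t$ are given explicitly so the expectation over $i^t \sim \pi^t$ is a weighted sum:

```python
def regret(f, contexts, policies):
    """Regret(x^1, ..., x^T): for each round, the best achievable mean
    reward minus the policy's expected mean reward, summed over rounds.

    policies[t] is the distribution pi^t actually used at round t."""
    total = 0.0
    for x_t, pi_t in zip(contexts, policies):
        best = max(f_j(x_j) for f_j, x_j in zip(f, x_t))
        expected = sum(p * f_j(x_j) for p, f_j, x_j in zip(pi_t, f, x_t))
        total += best - expected
    return total
```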
8. Fairness Constraint
It is unfair to preferentially choose one individual without an acceptable reason.
A policy $\pi$ is $\delta$-fair if, with probability $1 - \delta$, for all rounds $t$ and all pairs of arms $j, j'$:
$\pi_{j|h^t}^t > \pi_{j'|h^t}^t$ only if $f_j(x_j^t) > f_{j'}(x_{j'}^t)$
I.e., an arm may be chosen with higher probability only if its quality is strictly larger than the other's.
9. Intuition of the Fairness Constraint
• The optimal policy is fair
• But we can't use the optimal policy because $f_1, \dots, f_k$ are unknown
[Figure: two groups of arms — within the left group, the learner can't distinguish which arm has the higher expected reward; the right group's expected rewards are lower than the left group's with high probability]
The fairness constraint enforces choosing an arm from the left group uniformly at random
10. Fairness in Classic Bandit
• Maintain confidence intervals for the expected rewards
• Chain together arms whose confidence intervals overlap, and choose uniformly from the top chained group
[Figure: confidence intervals for arms 1–5; arms whose intervals lie entirely below the chained group are excluded, since their expected rewards are lower than those of arms in the chained group]
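The chaining step above can be sketched in a few lines. A sketch under the assumption that confidence intervals are given as `(lower, upper)` pairs; the single-pass scan in decreasing order of upper bound is an implementation choice, not taken from the paper:

```python
def chained_top_group(intervals):
    """Indices of the top 'chained' group of arms.

    intervals: one (lower, upper) confidence bound per arm.  Starting from
    the arm with the highest upper bound, repeatedly add any arm whose
    interval overlaps an interval already in the group (scanning arms in
    decreasing order of upper bound makes this a single pass).  A fair
    policy then plays uniformly over the returned group; every excluded
    arm's interval lies strictly below the chain's."""
    order = sorted(range(len(intervals)), key=lambda j: -intervals[j][1])
    group = [order[0]]
    lowest = intervals[order[0]][0]   # smallest lower bound in the chain
    for j in order[1:]:
        lo, up = intervals[j]
        if up >= lowest:              # overlaps the chain: link it
            group.append(j)
            lowest = min(lowest, lo)
        else:                         # this arm and all later ones are dominated
            break
    return sorted(group)
```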
12. Regret Upper Bound
If $\delta < \frac{1}{T}$, then FairBandits has regret
$R(T) = O\left(\sqrt{k^3 T \ln \frac{Tk}{\delta}}\right)$
• $T = \Omega(k^3)$ rounds are required to obtain non-trivial regret, i.e., $\frac{R(T)}{T} \ll 1$
• Without the fairness constraint: $O(\sqrt{kT})$
• The fairness constraint inflates the factor $k$ to $k^3$
• The dependence on $T$ is optimal
13. Regret Lower Bound
Any fair algorithm experiences constant per-round regret for at least
$T = \Omega\left(k^3 \ln \frac{1}{\delta}\right)$ rounds
• Constant per-round regret means the regret is still trivial: $\frac{R(T)}{T} = \Omega(1)$
• So to achieve non-trivial regret, at least $\Omega(k^3)$ rounds are needed
• Thus, $\Theta(k^3)$ rounds are necessary and sufficient
14. Fairness in Contextual Bandit
KWIK learnable = fair bandit learnable
KWIK (Knows What It Knows) learning:
• Online regression
• At each round the learner receives a feature $x^t$ and outputs either a prediction $\hat{y}^t \in [0,1]$ or $\hat{y}^t = \bot$
• $\bot$ denotes "I don't know"
• Only when $\hat{y}^t = \bot$ does the learner observe feedback $y^t$ s.t. $\mathbb{E}[y^t] = f(x^t)$
[Figure: feature $x^t$ → learner → either an accurate prediction $\hat{y}^t \in [0,1]$ or "I don't know"]
15. KWIK Learnable
$C$ is $(\epsilon, \delta)$-KWIK learnable with bound $m(\epsilon, \delta)$ if, for every $f \in C$, w.p. $1 - \delta$:
1. $\hat{y}^t \in \{\bot\} \cup [f(x^t) - \epsilon, f(x^t) + \epsilon]$ for all $t$
2. $\sum_{t=1}^{\infty} \mathbb{I}[\hat{y}^t = \bot] \le m(\epsilon, \delta)$
Intuition:
• The prediction is accurate whenever $\hat{y}^t \ne \bot$
• The learner answers $\bot$ only a small number of times, at most $m(\epsilon, \delta)$
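A standard toy example of a KWIK learner (the classic "memorization" learner, not one from this paper) makes the two conditions concrete: it never predicts inaccurately, and it says $\bot$ at most once per input.

```python
class MemorizationKWIK:
    """A minimal KWIK learner for the memorization class: f is an
    arbitrary deterministic function on a finite input space X.  The
    first time an input appears the learner answers bottom (None plays
    the role of ⊥) and receives the label as feedback; afterwards it
    predicts that input exactly.  So condition 1 holds with eps = 0,
    and the number of ⊥ answers is at most |X|, i.e. m(eps, delta) = |X|."""

    def __init__(self):
        self.memory = {}

    def predict(self, x):
        return self.memory.get(x)   # None == ⊥

    def observe(self, x, y):
        self.memory[x] = y          # feedback arrives only after ⊥
```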
16. KWIK Learnability Implies Fair Bandit Learnability
Suppose $C$ is $(\epsilon, \delta)$-KWIK learnable with bound $m(\epsilon, \delta)$.
Then, for $\delta \le \frac{1}{T}$, there is a $\delta$-fair algorithm for $f_j \in C$ s.t.
$R(T) = O\left( \max\left\{ k^2\, m\!\left(\epsilon^*, \frac{\min\{\delta, 1/T\}}{T^2 k}\right),\ k^3 \ln \frac{k}{\delta} \right\} \right)$
where
$\epsilon^* = \arg\min_\epsilon \max\left\{ \epsilon T,\ k\, m\!\left(\epsilon, \frac{\min\{\delta, 1/T\}}{T^2 k}\right) \right\}$
19. Intuition of KWIKToFair
• Predict the expected reward of each arm using one KWIK algorithm per arm
• If no KWIK algorithm outputs $\bot$, the same strategy as in the classic bandit applies: chain arms whose predicted rewards $f_j(x_j^t)$ are within $2\epsilon^*$ and choose uniformly from the top chained group
[Figure: predicted expected rewards $f_j(x_j^t)$ for arms 1–5, each with an uncertainty band of width $2\epsilon^*$]
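One round of this idea can be sketched as follows. This is a simplified stand-in, not the paper's KWIKToFair: it uses a fixed $2\epsilon^*$ band below the leading prediction rather than full transitive chaining, and it falls back to the uniform distribution over all arms whenever any learner says $\bot$.

```python
import random

def kwik_to_fair_round(predictions, eps_star, rng):
    """One round of the KWIKToFair idea.

    predictions[j] is the KWIK prediction of f_j(x_j^t), or None when
    that arm's learner says ⊥.  If any learner says ⊥ the arms cannot
    be safely ordered, so play uniformly over ALL arms (these rounds,
    where the ⊥-learners then get feedback, are the exploration cost).
    Otherwise, treat predictions within 2*eps_star of the leader as
    indistinguishable and play uniformly over that top group."""
    k = len(predictions)
    if any(p is None for p in predictions):
        return rng.randrange(k)
    top = max(predictions)
    group = [j for j, p in enumerate(predictions) if p >= top - 2 * eps_star]
    return group[rng.randrange(len(group))]
```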
20. Fair Bandit Learnability Implies KWIK Learnability
Suppose
• There is a $\delta$-fair algorithm for $f_j \in C$ with regret $R(T, \delta)$
• There exist $f \in C$ and $x^{(\ell)} \in \mathcal{X}$ s.t. $f(x^{(\ell)}) = \ell \epsilon$ for $\ell = 1, \dots, \frac{1}{\epsilon}$
Then there is an $(\epsilon, \delta)$-KWIK algorithm for $C$ whose bound $m(\epsilon, \delta)$ is the solution of
$\frac{m(\epsilon, \delta)\, \epsilon}{4} = R\left(m(\epsilon, \delta),\ \frac{\epsilon \delta}{2T}\right)$
21. An Exponential Separation Between Fair and Unfair Learning
• Boolean conjunctions: let $x \in \{0,1\}^d$,
$C = \{ f \mid f(x) = x_{i_1} \wedge \dots \wedge x_{i_k},\ 0 \le k \le d,\ i_1, \dots, i_k \in [d] \}$
• Boolean conjunctions without the fairness constraint: $R(T) = O(k^2 d)$
• For this $C$, the KWIK bound is at least $m(\epsilon, \delta) = \Omega(2^d)$
• Hence, for $\delta < \frac{1}{2T}$, the worst-case regret of any fair algorithm is $R(T) = \Omega(2^d)$
23. Intuition of FairToKWIK
• Divide the range of $f(x^t)$ into cells of width $\epsilon^*$, with boundary points $x^{(0)}, x^{(1)}, x^{(2)}, \dots$ s.t. $f(x^{(\ell)}) = \ell \epsilon^*$
• Using the fair algorithm, compare $x^t$ with each boundary point $x^{(\ell)}$: let $p_{\ell,1}$ and $p_{\ell,2}$ be the probabilities of choosing the left arm ($x^{(\ell)}$) and the right arm ($x^t$)
• A fair algorithm may set $p_{\ell,1} \ne p_{\ell,2}$ only when it can order $f(x^{(\ell)})$ and $f(x^t)$
• E.g., if $p_{\ell,1} \ne p_{\ell,2}$ for all $\ell \ne 3$, then $f(x^t)$ lies in the cell around $3\epsilon^*$: output $3\epsilon^*$
• Otherwise, output $\bot$
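The decoding step in this picture can be sketched as a toy rule. This is an illustration only, under the strong assumption that the fair algorithm's exact choice probabilities are observable; the paper's FairToKWIK reduction is more involved.

```python
def fair_to_kwik_output(pair_probs, eps_star):
    """Decode one FairToKWIK query.

    pair_probs[l] = (p_l1, p_l2): the fair algorithm's probabilities of
    choosing the grid point x^(l) vs. the query point x^t in a two-arm
    instance.  Fairness forbids unequal probabilities unless the
    algorithm truly knows which of f(x^(l)) and f(x^t) is larger, so if
    exactly one cell l is played uniformly (p_l1 == p_l2), f(x^t) must
    lie there: output l * eps_star.  If several cells are still
    ambiguous, output None (⊥)."""
    unsure = [l for l, (p1, p2) in enumerate(pair_probs) if p1 == p2]
    return unsure[0] * eps_star if len(unsure) == 1 else None
```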
24. Conclusions
• Fairness in the contextual bandit problem and the classic bandit problem
• $\delta$-fair: with probability $1 - \delta$,
$\pi_{j|h^t}^t > \pi_{j'|h^t}^t$ only if $f_j(x_j^t) > f_{j'}(x_{j'}^t)$
Results
• Classic bandits: the number of rounds necessary and sufficient to achieve non-trivial regret is $\Theta(k^3)$
• Contextual bandits: a tight relationship with Knows What It Knows (KWIK) learning