1
Maximum Entropy Reinforcement Learning
Stochastic Control
T. Haarnoja, et al., “Reinforcement Learning with Deep Energy-Based Policies”, ICML 2017
T. Haarnoja, et al., “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor”, ICML 2018
T. Haarnoja, et al., “Soft Actor-Critic Algorithms and Applications”, arXiv preprint 2018
Presented by Dongmin Lee
August 2019
2
Outline
1. Introduction
2. Reinforcement Learning
• Markov Decision Processes (MDPs)
• Bellman Equation
3. Exploration methods
4. Maximum Entropy Reinforcement Learning
• Soft MDPs
• Soft Bellman Equation
5. From Soft Policy Iteration to Soft Actor-Critic
• Soft Policy Iteration
• Soft Actor-Critic
6. Results
3
Introduction
How do we build intelligent systems / machines?
• Data → Information
• Information → Decision
(both arrows in the diagram are labeled “intelligence”)
Intelligent systems must be able to adapt to uncertainty
• Uncertainty:
• Other cars
• Weather
• Road construction
• Passenger
4
Introduction
RL provides a formalism for behaviors
• Problem of a goal-directed agent interacting with an uncertain environment
• Interaction → adaptation
5
feedback & decision
Introduction
6
Introduction
Distinct Features of RL
1. There is no supervisor, only a reward signal
2. Feedback is delayed, not instantaneous
3. Time really matters (sequential, non-i.i.d. data)
4. Tradeoff between exploration and exploitation
What is deep RL?
Deep RL = RL + Deep learning
• RL: sequential decision making interacting with environments
• Deep learning: representation of policies and value functions
7
source: http://rail.eecs.berkeley.edu/deeprlcourse-fa17/f17docs/lecture_2.pdf
Introduction
What are the challenges of deep RL?
• Huge # of samples: millions
• Fast, stable learning
• Hyperparameter tuning
• Exploration
• Sparse reward signals
• Safety / reliability
• Simulator
8
Introduction
• Soft Actor-Critic on Minitaur – Training
9
Introduction
source : https://sites.google.com/view/sac-and-applications
• Soft Actor-Critic on Minitaur – Testing
10
Introduction
source : https://sites.google.com/view/sac-and-applications
• Interfered rollouts from a Soft Actor-Critic policy trained for the Dynamixel Claw task from vision
11
Introduction
source : https://sites.google.com/view/sac-and-applications
• State of the environment / system
• Action determined by the agent: affects the state
• Reward: evaluates the benefit of state and action
12
Reinforcement Learning
• State: s_t ∈ S
• Action: a_t ∈ A
• Policy (mapping from state to action): π
• Deterministic
  π(s_t) = a_t
• Stochastic (randomized)
  π(a|s) = Prob(a_t = a | s_t = s)
→ Policy is what we want to optimize!
13
Reinforcement Learning
A Markov decision process (MDP) is a tuple <S, A, p, r, γ>, consisting of
• S: set of states (state space)
  e.g., S = {1, …, n} (discrete), S = ℝ^n (continuous)
• A: set of actions (action space)
  e.g., A = {1, …, m} (discrete), A = ℝ^m (continuous)
• p: state transition probability
  p(s′|s, a) ≜ Prob(s_{t+1} = s′ | s_t = s, a_t = a)
• r: reward function
  r(s_t, a_t) = r_t
• γ ∈ (0, 1]: discount factor
14
Markov Decision Processes (MDPs)
The MDP problem
• To find an optimal policy that maximizes the expected cumulative reward:
  max_{π∈Π} E_π[ Σ_{t=0}^∞ γ^t r(s_t, a_t) ]
Value function
• The state value function V^π(s) of a policy π is the expected return starting from state s while
  executing π:
  V^π(s) ≜ E_π[ Σ_{k=t}^∞ γ^{k−t} r(s_k, a_k) | s_t = s ]
Q-function
• The action value function Q^π(s, a) of a policy π is the expected return starting from state s, taking
  action a, and then following π:
  Q^π(s, a) ≜ E_π[ Σ_{k=t}^∞ γ^{k−t} r(s_k, a_k) | s_t = s, a_t = a ]
15
Markov Decision Processes (MDPs)
The MDP problem
• To find an optimal policy that maximizes the expected cumulative reward:
  max_{π∈Π} E_π[ Σ_{t=0}^∞ γ^t r(s_t, a_t) ]
Optimal value function
• The optimal value function V*(s) is the maximum value function over all policies:
  V*(s) ≜ max_{π∈Π} V^π(s) = max_{π∈Π} E_π[ Σ_{t=0}^∞ γ^t r(s_t, a_t) | s_0 = s ]
Optimal Q-function
• The optimal Q-function Q*(s, a) is the maximum Q-function over all policies:
  Q*(s, a) ≜ max_{π∈Π} Q^π(s, a)
16
Markov Decision Processes (MDPs)
Bellman expectation equation for V^π
• The value function V^π is the unique solution to the following Bellman equation:
  V^π(s) = E[ r(s_t, a_t) + γ V^π(s_{t+1}) | s_t = s ]
         = Σ_{a∈A} π(a|s) [ r(s, a) + γ Σ_{s′∈S} p(s′|s, a) V^π(s′) ]
         = Σ_{a∈A} π(a|s) Q^π(s, a)
• In operator form
  V^π = T^π V^π
Bellman expectation equation for Q^π
• The Q-function Q^π satisfies:
  Q^π(s, a) = E[ r(s_t, a_t) + γ V^π(s_{t+1}) | s_t = s, a_t = a ]
            = r(s, a) + γ Σ_{s′∈S} p(s′|s, a) V^π(s′)
            = r(s, a) + γ Σ_{s′∈S} p(s′|s, a) Σ_{a′∈A} π(a′|s′) Q^π(s′, a′)
17
Bellman Equation
Bellman optimality equation for V*
• The optimal value function V* must satisfy the self-consistency condition given by the Bellman
  equation for V^π:
  V*(s) = max_{a∈A} E[ r(s_t, a_t) + γ V*(s_{t+1}) | s_t = s, a_t = a ]
        = max_{a∈A} [ r(s, a) + γ Σ_{s′∈S} p(s′|s, a) V*(s′) ] = max_{a∈A} Q*(s, a)
• In operator form
  V* = T V*
Bellman optimality equation for Q*
• The optimal Q-function Q* satisfies:
  Q*(s, a) = E[ r(s_t, a_t) + γ V*(s_{t+1}) | s_t = s, a_t = a ]
           = r(s, a) + γ Σ_{s′∈S} p(s′|s, a) V*(s′)
           = r(s, a) + γ Σ_{s′∈S} p(s′|s, a) max_{a′∈A} Q*(s′, a′)
18
Bellman Equation
The MDP problem
• To find an optimal policy that maximizes the expected cumulative reward:
  max_{π∈Π} E_π[ Σ_{t=0}^∞ γ^t r(s_t, a_t) ]
Known model:
• Policy iteration
• Value iteration
Unknown model:
• Temporal-difference learning
• Q-learning
19
Bellman Equation
Policy Iteration
Initialize a policy π_0;
1. (Policy Evaluation) Compute the value V^{π_k} of π_k
   by solving the Bellman equation V^{π_k} = T^{π_k} V^{π_k};
2. (Policy Improvement) Update the policy to π_{k+1} so that
   π_{k+1}(s) ∈ arg max_a { r(s, a) + γ Σ_{s′∈S} p(s′|s, a) V^{π_k}(s′) }
3. Set k ← k + 1; Repeat until convergence;
• Converged policy is optimal! (a minimal tabular sketch follows after this slide)
• Q) Can we do something even simpler?
20
Bellman Equation
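To make the policy-iteration loop above concrete, here is a minimal tabular sketch in Python/NumPy. The transition tensor P, reward matrix R, and the helper name policy_iteration are illustrative assumptions for a toy finite MDP, not anything defined in the slides.

```python
# A minimal tabular policy-iteration sketch (assumed toy setup):
# P[s, a, s'] are transition probabilities, R[s, a] are rewards, gamma is the discount factor.
import numpy as np

def policy_iteration(P, R, gamma=0.99):
    n_states, n_actions, _ = P.shape
    pi = np.zeros(n_states, dtype=int)           # initialize a policy pi_0
    while True:
        # 1. Policy evaluation: solve V = R_pi + gamma * P_pi V (the Bellman equation V^pi = T^pi V^pi)
        P_pi = P[np.arange(n_states), pi]        # (n_states, n_states)
        R_pi = R[np.arange(n_states), pi]        # (n_states,)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # 2. Policy improvement: pi_{k+1}(s) = argmax_a [ r(s,a) + gamma * sum_s' p(s'|s,a) V(s') ]
        Q = R + gamma * P @ V                    # (n_states, n_actions)
        new_pi = Q.argmax(axis=1)
        # 3. Repeat until the policy no longer changes
        if np.array_equal(new_pi, pi):
            return pi, V
        pi = new_pi
```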
Value Iteration
Initialize a value function V_0;
1. Update the value function by
   V_{k+1} ← T V_k;
2. Set k ← k + 1; Repeat until convergence;
• Converged value function is optimal! (a minimal tabular sketch follows after this slide)
21
Bellman Equation
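For comparison with policy iteration, a minimal value-iteration sketch under the same assumed toy arrays P and R; here only the value function is iterated, and a greedy policy is extracted once at the end.

```python
# A minimal tabular value-iteration sketch (same assumed P, R, gamma as above).
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-8):
    n_states = P.shape[0]
    V = np.zeros(n_states)                           # initialize V_0
    while True:
        # V_{k+1} <- T V_k, i.e. V_{k+1}(s) = max_a [ r(s,a) + gamma * sum_s' p(s'|s,a) V_k(s') ]
        V_new = (R + gamma * P @ V).max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:          # repeat until convergence
            V = V_new
            break
        V = V_new
    greedy_pi = (R + gamma * P @ V).argmax(axis=1)   # the converged value function yields an optimal greedy policy
    return greedy_pi, V
```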
Common issue: exploration
• Is there value in exploring unknown regions of the environment?
• Which action should we try in order to explore unknown regions of the environment?
22
Exploration methods
How can we algorithmically handle the tradeoff between exploration and exploitation?
1. 𝜖-Greedy
2. Optimism in the face of uncertainty
3. Thompson (posterior) sampling
4. Noisy networks
5. Information theoretic exploration
23
Exploration methods
Method 1: 𝜖-Greedy
Idea:
• Choose a greedy action (arg max_a Q^π(s, a)) with probability (1 − ε)
• Choose a random action with probability 𝜖
Advantages:
1. Very simple
2. Light computation
Disadvantages:
1. Requires fine-tuning of ε
2. Not quite systematic: No focus on unexplored regions
3. Inefficient
24
Exploration methods
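A minimal sketch of the ε-greedy rule described above, assuming a tabular Q array indexed by (state, action); the function name and the use of NumPy's random generator are illustrative choices.

```python
# Epsilon-greedy action selection over an assumed tabular Q of shape (n_states, n_actions).
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng=None):
    rng = rng or np.random.default_rng()
    n_actions = Q.shape[1]
    if rng.random() < epsilon:                 # with probability epsilon: random action
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))            # with probability 1 - epsilon: greedy action argmax_a Q(s, a)
```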
Method 2: Optimism in the face of uncertainty (OFU)
Idea:
1. Construct a confidence set for MDP parameters using collected samples
2. Choose MDP parameters that give the highest rewards
3. Construct an optimal policy of the optimistically chosen MDP
Advantages:
1. Regret optimal
2. Systematic: More focus on unexplored regions
Disadvantages:
1. Complicated
2. Computation intensive
25
Exploration methods
Method 3: Thompson (posterior) sampling
Idea:
1. Sample MDP parameters from posterior distribution
2. Construct an optimal policy of the sampled MDP parameters
3. Update the posterior distribution
Advantages:
1. Regret optimal
2. Systematic: More focus on unexplored regions
Disadvantages:
1. Somewhat complicated
26
Exploration methods
Method 4: Noisy networks
Idea:
• Inject noise to the weights of Neural Network
Advantages:
1. Simple
2. Good empirical performance
Disadvantages:
1. Not systematic: No focus on unexplored regions
27
Exploration methods
Method 5: Information theoretic exploration
Idea:
• Use high entropy policy to explore more
Advantages:
1. Simple
2. Good empirical performance
Disadvantages:
1. More theoretical analysis needed
28
Exploration methods
What is entropy?
• Average rate at which information is produced by a stochastic source of data:
  H(p) ≜ −Σ_i p_i log p_i = E[ −log p_i ]
• More generally, entropy refers to disorder
29
Maximum Entropy Reinforcement Learning
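As a quick illustration of the definition above, a few lines of Python computing H(p) for a discrete distribution (natural logarithm; the eps guard is an implementation convenience, not part of the definition).

```python
# Entropy of a discrete distribution: H(p) = -sum_i p_i log p_i.
import numpy as np

def entropy(p, eps=1e-12):
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p + eps)))   # eps guards against log(0)

print(entropy([0.5, 0.5]))    # ~0.693: a uniform coin has maximal entropy over two outcomes
print(entropy([0.99, 0.01]))  # ~0.056: a nearly deterministic distribution has low entropy
```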
What’s the benefit of a high-entropy policy?
• Entropy of policy:
  H(π(⋅|s)) ≜ −Σ_a π(a|s) log π(a|s) = E_{a∼π(⋅|s)}[ −log π(a|s) ]
• Higher disorder in policy 𝜋
• Try new risky behaviors: Potentially explore unexplored regions
30
Maximum Entropy Reinforcement Learning
Maximum entropy stochastic control
Standard MDP problem:
  max_π E_π[ Σ_t γ^t r(s_t, a_t) ]
Maximum entropy MDP problem:
  max_π E_π[ Σ_t γ^t ( r(s_t, a_t) + H(π(⋅|s_t)) ) ]
31
Soft MDPs
Theorem 1 (Soft value functions and Optimal policy)
Soft Q-function:
  Q_soft^π(s_t, a_t) ≜ r(s_t, a_t) + E_π[ Σ_{l=1}^∞ γ^l ( r(s_{t+l}, a_{t+l}) + H(π(⋅|s_{t+l})) ) ]
Soft value function:
  V_soft^π(s_t) ≜ log ∫ exp( Q_soft^π(s_t, a) ) da
Optimal value functions:
  Q_soft^*(s_t, a_t) ≜ max_π Q_soft^π(s_t, a_t)
  V_soft^*(s_t) ≜ log ∫ exp( Q_soft^*(s_t, a) ) da
32
Soft Bellman Equation
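For a discrete action space, the integral in the soft value function becomes a log-sum-exp over actions, V_soft(s) = log Σ_a exp Q_soft(s, a). A small numerically stable sketch (the example Q values are illustrative):

```python
# Soft value from a vector of Q_soft(s, a) values using the log-sum-exp trick.
import numpy as np

def soft_value(q):
    q = np.asarray(q, dtype=float)
    q_max = q.max()
    return q_max + np.log(np.sum(np.exp(q - q_max)))   # log-sum-exp, stable even for large Q values

q = np.array([4.3, 3.3])
print(soft_value(q))      # slightly above max(q): a "soft" maximum
print(q.max())            # compare with the hard maximum used in the standard Bellman equation
```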
Proof.
Maximum entropy MDP problem:
  max_π E_π[ Σ_t γ^t ( r(s_t, a_t) + H(π(⋅|s_t)) ) ]
Soft value function:
  V_soft^π(s) = E_π[ Σ_{k=t}^∞ γ^{k−t} ( r(s_k, a_k) − log π(a_k|s_k) ) | s_t = s ]
              = E_π[ r(s_t, a_t) − log π(a_t|s_t) + γ V_soft^π(s_{t+1}) | s_t = s ]
Soft Q-function:
  Q_soft^π(s, a) = E_π[ r(s_t, a_t) + Σ_{k=t+1}^∞ γ^{k−t} ( r(s_k, a_k) − log π(a_k|s_k) ) | s_t = s, a_t = a ]
                 = E_π[ r(s_t, a_t) + γ V_soft^π(s_{t+1}) | s_t = s, a_t = a ]
(note that Q_soft^π(s, a) has the same form as the standard Q-function)
33
Soft Bellman Equation
Proof.
Soft value function:
  V_soft^π(s) = E_π[ Σ_{k=t}^∞ γ^{k−t} ( r(s_k, a_k) − log π(a_k|s_k) ) | s_t = s ]
              = E_π[ r(s_t, a_t) − log π(a_t|s_t) + γ V_soft^π(s_{t+1}) | s_t = s ]
Soft Q-function:
  Q_soft^π(s, a) = E_π[ r(s_t, a_t) + Σ_{k=t+1}^∞ γ^{k−t} ( r(s_k, a_k) − log π(a_k|s_k) ) | s_t = s, a_t = a ]
                 = E_π[ r(s_t, a_t) + γ V_soft^π(s_{t+1}) | s_t = s, a_t = a ]
Soft value function (rewrite):
  V_soft^π(s) = E_π[ r(s_t, a_t) − log π(a_t|s_t) + γ V_soft^π(s_{t+1}) | s_t = s ]
              = E_π[ Q_soft^π(s_t, a_t) − log π(a_t|s_t) | s_t = s ]
34
Soft Bellman Equation
Proof.
Given a policy π_old, define a new policy π_new as:
  π_new(a|s) ∝ exp( Q_soft^{π_old}(s, a) )
  π_new(a|s) = exp( Q_soft^{π_old}(s, a) ) / ∫ exp( Q_soft^{π_old}(s, a′) ) da′
             = softmax_{a∈A} Q_soft^{π_old}(s, a)
             = exp( Q_soft^{π_old}(s, a) − V_soft^{π_old}(s) ) = exp( A_soft^{π_old}(s, a) )
Substitute π_new(a|s) into V_soft^π(s):
  V_soft^π(s) = E_{a∼π}[ Q_soft^π(s, a) − log ( exp( Q_soft^π(s, a) ) / ∫ exp( Q_soft^π(s, a′) ) da′ ) ]
              = E_{a∼π}[ Q_soft^π(s, a) − log exp( Q_soft^π(s, a) ) + log ∫ exp( Q_soft^π(s, a′) ) da′ ]
35
Soft Bellman Equation
Proof.
Substitute π_new(a|s) into V_soft^π(s):
  V_soft^π(s) = E_{a∼π}[ Q_soft^π(s, a) − log exp( Q_soft^π(s, a) ) + log ∫ exp( Q_soft^π(s, a′) ) da′ ]
              = E_{a∼π}[ log ∫ exp( Q_soft^π(s, a′) ) da′ ]   (the remaining term is independent of a)
              = log ∫ exp( Q_soft^π(s, a′) ) da′
  ∴ V_soft^π(s) = log ∫ exp( Q_soft^π(s, a) ) da
36
Soft Bellman Equation
Theorem 1 (Soft value functions and Optimal policy)
Soft Q-function:
  Q_soft^π(s_t, a_t) ≜ r(s_t, a_t) + E_π[ Σ_{l=1}^∞ γ^l ( r(s_{t+l}, a_{t+l}) + H(π(⋅|s_{t+l})) ) ]
Soft value function:
  V_soft^π(s_t) ≜ log ∫ exp( Q_soft^π(s_t, a) ) da
Optimal value functions:
  Q_soft^*(s_t, a_t) ≜ max_π Q_soft^π(s_t, a_t)
  V_soft^*(s_t) ≜ log ∫ exp( Q_soft^*(s_t, a) ) da
37
Soft Bellman Equation
Theorem 2 (Soft policy improvement theorem)
Given a policy π_old, define a new policy π_new as:
  π_new(a|s) ∝ exp( Q_soft^{π_old}(s, a) )
Assume that throughout our computation, Q is bounded and ∫ exp( Q(s, a) ) da is bounded for any s (for
both π_old and π_new).
Then,
  Q_soft^{π_new}(s, a) ≥ Q_soft^{π_old}(s, a)  ∀s, a
→ Monotonic policy improvement
38
Soft Bellman Equation
39
Soft Bellman Equation
Proof.
If one greedily maximizes the sum of entropy and value with a one-step look-ahead, then π_new is obtained
from π_old:
  H(π_old(⋅|s)) + E_{a∼π_old}[ Q_soft^{π_old}(s, a) ] ≤ H(π_new(⋅|s)) + E_{a∼π_new}[ Q_soft^{π_old}(s, a) ]
Example
(Figure: a depth-two tree MDP with rewards r(s, a) and policy π(a|s). From the root s_0, both actions give reward +1 and lead to s_1 (left) or s_2 (right); from s_1, left gives +2 and right gives +4; from s_2, left gives +3 and right gives +1; the remaining states s_3–s_6 are terminal. π_old takes each action with probability 0.5.)
40
Soft Bellman Equation
Proof.
If one greedily maximizes the sum of entropy and value with a one-step look-ahead, then π_new is obtained
from π_old:
  H(π_old(⋅|s)) + E_{a∼π_old}[ Q_soft^{π_old}(s, a) ] ≤ H(π_new(⋅|s)) + E_{a∼π_new}[ Q_soft^{π_old}(s, a) ]
Example (the same tree MDP as on the previous slide, π_old with probability 0.5 on each action)
  Q_soft^{π_old}(s_1, left) = ?   Q_soft^{π_old}(s_1, right) = ?
  Q_soft^{π_old}(s_2, left) = ?   Q_soft^{π_old}(s_2, right) = ?
  Q_soft^{π_old}(s_0, left) = ?   Q_soft^{π_old}(s_0, right) = ?
Soft Q-function:
  Q_soft^π(s, a) ≜ r(s_t, a_t) + E_π[ Σ_{l=1}^∞ γ^l ( r(s_{t+l}, a_{t+l}) + H(π(⋅|s_{t+l})) ) ]
                = r(s, a) + γ p(s′|s, a) V_soft^π(s′)
Soft value function:
  V_soft^π(s) = Σ_a π(a|s) [ Q_soft^π(s, a) − log π(a|s) ]
Entropy of policy:
  H(π(⋅|s)) ≜ −Σ_a π(a|s) log π(a|s)
41
Soft Bellman Equation
Proof.
If one greedily maximizes the sum of entropy and value with a one-step look-ahead, then π_new is obtained
from π_old:
  H(π_old(⋅|s)) + E_{a∼π_old}[ Q_soft^{π_old}(s, a) ] ≤ H(π_new(⋅|s)) + E_{a∼π_new}[ Q_soft^{π_old}(s, a) ]
Example (the same tree MDP, π_old; suppose γ = 1 and p(s′|s, a) = 1)
  Q_soft^{π_old}(s_1, left) = 2    Q_soft^{π_old}(s_1, right) = 4
  Q_soft^{π_old}(s_2, left) = 3    Q_soft^{π_old}(s_2, right) = 1
  Q_soft^{π_old}(s_0, left) = 4.3  Q_soft^{π_old}(s_0, right) = 3.3
Soft Q-function:
  Q_soft^π(s, a) ≜ r(s_t, a_t) + E_π[ Σ_{l=1}^∞ γ^l ( r(s_{t+l}, a_{t+l}) + H(π(⋅|s_{t+l})) ) ]
                = r(s, a) + γ p(s′|s, a) V_soft^π(s′)
Soft value function:
  V_soft^π(s) = Σ_a π(a|s) [ Q_soft^π(s, a) − log π(a|s) ]
Entropy of policy:
  H(π(⋅|s)) ≜ −Σ_a π(a|s) log π(a|s)
42
Soft Bellman Equation
Proof.
If one greedily maximizes the sum of entropy and value with a one-step look-ahead, then π_new is obtained
from π_old:
  H(π_old(⋅|s)) + E_{a∼π_old}[ Q_soft^{π_old}(s, a) ] ≤ H(π_new(⋅|s)) + E_{a∼π_new}[ Q_soft^{π_old}(s, a) ]
Example (the same tree MDP, π_old with probability 0.5 on each action)
  Q_soft^{π_old}(s_1, left) = 2    Q_soft^{π_old}(s_1, right) = 4
  Q_soft^{π_old}(s_2, left) = 3    Q_soft^{π_old}(s_2, right) = 1
  Q_soft^{π_old}(s_0, left) = 4.3  Q_soft^{π_old}(s_0, right) = 3.3
A new policy π_new:
  π_new(a|s) ∝ exp( Q_soft^{π_old}(s, a) )
  π_new(a|s) = exp( Q_soft^{π_old}(s, a) ) / Σ_{a′} exp( Q_soft^{π_old}(s, a′) )
             = softmax_{a∈A} Q_soft^{π_old}(s, a)
             = exp( Q_soft^{π_old}(s, a) − V_soft^{π_old}(s) )
43
Soft Bellman Equation
Proof.
If one greedily maximizes the sum of entropy and value with a one-step look-ahead, then π_new is obtained
from π_old:
  H(π_old(⋅|s)) + E_{a∼π_old}[ Q_soft^{π_old}(s, a) ] ≤ H(π_new(⋅|s)) + E_{a∼π_new}[ Q_soft^{π_old}(s, a) ]
Example (the same tree MDP)
Under π_old:
  Q_soft^{π_old}(s_1, left) = 2    Q_soft^{π_old}(s_1, right) = 4
  Q_soft^{π_old}(s_2, left) = 3    Q_soft^{π_old}(s_2, right) = 1
  Q_soft^{π_old}(s_0, left) = 4.3  Q_soft^{π_old}(s_0, right) = 3.3
Under π_new(a|s) = exp( Q_soft^{π_old}(s, a) ) / Σ_{a′} exp( Q_soft^{π_old}(s, a′) ):
  π_new(left|s_0) = 0.73   π_new(right|s_0) = 0.27
  π_new(left|s_1) = 0.12   π_new(right|s_1) = 0.88
  π_new(left|s_2) = 0.88   π_new(right|s_2) = 0.12
44
Soft Bellman Equation
Proof.
If one greedily maximizes the sum of entropy and value with a one-step look-ahead, then π_new is obtained
from π_old:
  H(π_old(⋅|s)) + E_{a∼π_old}[ Q_soft^{π_old}(s, a) ] ≤ H(π_new(⋅|s)) + E_{a∼π_new}[ Q_soft^{π_old}(s, a) ]
So, substituting s_0 for s in the example:
  H(π_old(⋅|s_0)) + E_{a∼π_old}[ Q_soft^{π_old}(s_0, a) ] ≤ H(π_new(⋅|s_0)) + E_{a∼π_new}[ Q_soft^{π_old}(s_0, a) ]
  0.3 + 3.8 ≤ 0.25 + 4.03
  4.1 ≤ 4.28
45
Soft Bellman Equation
Proof.
If one greedily maximizes the sum of entropy and value with a one-step look-ahead, then π_new is obtained
from π_old:
  H(π_old(⋅|s)) + E_{a∼π_old}[ Q_soft^{π_old}(s, a) ] ≤ H(π_new(⋅|s)) + E_{a∼π_new}[ Q_soft^{π_old}(s, a) ]
Then, we can show that:
  Q_soft^{π_old}(s, a)
    = E_{s_1}[ r_0 + γ ( H(π_old(⋅|s_1)) + E_{a_1∼π_old}[ Q_soft^{π_old}(s_1, a_1) ] ) ]
    ≤ E_{s_1}[ r_0 + γ ( H(π_new(⋅|s_1)) + E_{a_1∼π_new}[ Q_soft^{π_old}(s_1, a_1) ] ) ]
    = E_{s_1}[ r_0 + γ ( H(π_new(⋅|s_1)) + r_1 ) + γ² E_{s_2}[ H(π_old(⋅|s_2)) + E_{a_2∼π_old}[ Q_soft^{π_old}(s_2, a_2) ] ] ]
    ≤ E_{s_1}[ r_0 + γ ( H(π_new(⋅|s_1)) + r_1 ) + γ² E_{s_2}[ H(π_new(⋅|s_2)) + E_{a_2∼π_new}[ Q_soft^{π_old}(s_2, a_2) ] ] ]
    = E_{s_1, a_1∼π_new, s_2}[ r_0 + γ ( H(π_new(⋅|s_1)) + r_1 ) + γ² ( H(π_new(⋅|s_2)) + r_2 ) + γ³ E_{s_3}[ H(π_old(⋅|s_3)) + E_{a_3∼π_old}[ Q_soft^{π_old}(s_3, a_3) ] ] ]
    ⋮
    ≤ E_{τ∼π_new}[ r_0 + Σ_{t=1}^∞ γ^t ( H(π_new(⋅|s_t)) + r_t ) ]
    = Q_soft^{π_new}(s, a)
Theorem 2 (Soft policy improvement theorem)
Given a policy π_old, define a new policy π_new as:
  π_new(a|s) ∝ exp( Q_soft^{π_old}(s, a) )
Assume that throughout our computation, Q is bounded and ∫ exp( Q(s, a) ) da is bounded for any s (for
both π_old and π_new).
Then,
  Q_soft^{π_new}(s, a) ≥ Q_soft^{π_old}(s, a)  ∀s, a
→ Monotonic policy improvement
46
Soft Bellman Equation
Comparison to standard MDP: Bellman expectation equation
Standard:
  Q^π(s, a) = E[ r(s_t, a_t) + γ V^π(s_{t+1}) | s_t = s, a_t = a ]
            = r(s, a) + γ Σ_{s′∈S} p(s′|s, a) V^π(s′)
  V^π(s) = Σ_{a∈A} π(a|s) Q^π(s, a)
Maximum entropy:
  Q_soft^π(s, a) = E[ r(s_t, a_t) + γ V_soft^π(s_{t+1}) | s_t = s, a_t = a ]
                 = r(s, a) + γ Σ_{s′∈S} p(s′|s, a) V_soft^π(s′)
  V_soft^π(s) = E_π[ Q_soft^π(s_t, a_t) − log π(a_t|s_t) | s_t = s ]
              = log ∫ exp( Q_soft^π(s, a) ) da
Maximum entropy variant:
• Add an explicit temperature parameter α
  V_soft^π(s) = E_π[ Q_soft^π(s_t, a_t) − α log π(a_t|s_t) | s_t = s ]
              = α log ∫ exp( (1/α) Q_soft^π(s, a) ) da
47
Soft Bellman Equation
Comparison to standard MDP: Bellman optimality equation
Standard:
  Q*(s, a) = E[ r(s_t, a_t) + γ V*(s_{t+1}) | s_t = s, a_t = a ]
           = r(s, a) + γ Σ_{s′∈S} p(s′|s, a) V*(s′)
  V*(s) = max_{a∈A} Q*(s, a)
Maximum entropy variant:
  Q_soft^*(s, a) = E[ r(s_t, a_t) + γ V_soft^*(s_{t+1}) | s_t = s, a_t = a ]
                 = r(s, a) + γ Σ_{s′∈S} p(s′|s, a) V_soft^*(s′)
  V_soft^*(s) = E_π[ Q_soft^*(s_t, a_t) − α log π(a_t|s_t) | s_t = s ]
              = α log ∫ exp( (1/α) Q_soft^*(s, a) ) da
Connection:
• Note that as α → 0, exp( (1/α) Q_soft^*(s, a) ) increasingly emphasizes max_{a∈A} Q*(s, a)
  (a small numerical check follows after this slide):
  α log ∫ exp( (1/α) Q_soft^*(s, a) ) da → max_{a∈A} Q*(s, a)  as α → 0
48
Soft Bellman Equation
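A quick numerical check of the connection above, using a discrete action space so the integral becomes a sum; the Q vector is an illustrative assumption.

```python
# alpha * log sum_a exp(Q(s,a)/alpha) approaches max_a Q(s,a) as alpha -> 0.
import numpy as np

def soft_max_backup(q, alpha):
    q = np.asarray(q, dtype=float) / alpha
    q_max = q.max()
    return alpha * (q_max + np.log(np.sum(np.exp(q - q_max))))   # alpha * logsumexp(Q / alpha)

q = [1.0, 2.0, 3.0]
for alpha in (1.0, 0.1, 0.01):
    print(alpha, soft_max_backup(q, alpha))   # approaches max(q) = 3.0 as alpha shrinks
```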
Comparison to standard MDP: Policy
Standard:
  π_new(a|s) ∈ arg max_a Q^{π_old}(s, a)
→ Deterministic policy
Maximum entropy:
  π_new(a|s) = exp( (1/α) Q_soft^{π_old}(s, a) ) / ∫ exp( (1/α) Q_soft^{π_old}(s, a′) ) da′
             = exp( (1/α) ( Q_soft^{π_old}(s, a) − V_soft^{π_old}(s) ) )
→ Stochastic policy (a small sketch contrasting the two policies follows after this slide)
49
Soft Bellman Equation
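A small sketch contrasting the two policies above for a discrete action space; the Q values are taken from the earlier tree example, and the function names are illustrative.

```python
# Greedy (deterministic) policy vs. maximum-entropy (Boltzmann) policy pi(a|s) = exp((Q(s,a) - V(s)) / alpha).
import numpy as np

def greedy_policy(q):
    probs = np.zeros_like(q)
    probs[np.argmax(q)] = 1.0                 # all probability mass on argmax_a Q(s, a)
    return probs

def max_entropy_policy(q, alpha=1.0):
    z = (q - q.max()) / alpha                 # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()                        # softmax over Q/alpha, i.e. exp((Q - V_soft)/alpha)

q = np.array([4.3, 3.3])
print(greedy_policy(q))                       # [1. 0.]      -> deterministic
print(max_entropy_policy(q, alpha=1.0))       # ~[0.73 0.27] -> stochastic, as in the tree example
```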
So, what’s the advantage of maximum entropy?
• Computation:
No maximization involved
• Exploration:
High entropy policy
• Structural similarity:
Can combine it with many RL methods for standard MDP
50
Soft Bellman Equation
Idea:
• Maximum entropy + off-policy actor-critic
Advantages:
• Exploration
• Sample efficiency
• Stable convergence
• Little hyperparameter tuning
51
From Soft Policy Iteration to Soft Actor-Critic
Version 1 (Jan 4 2018) Version 2 (Dec 13 2018)
Soft policy evaluation
Modified Bellman operator:
  T^π Q(s, a) ≜ r(s, a) + γ Σ_{s′∈S} p(s′|s, a) V_soft^π(s′)
  where V_soft^π(s) = E_π[ Q_soft^π(s, a) − α log π(a|s) ]
Repeatedly apply T^π to Q:
  Q_{k+1} ← T^π Q_k
• Result: Q converges to Q^π! (a minimal tabular sketch follows after this slide)
52
Soft Policy Iteration
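A minimal tabular sketch of soft policy evaluation under assumed toy arrays P, R and a fixed stochastic policy pi; this illustrates the operator above and is not the authors' code.

```python
# Repeatedly apply the soft Bellman operator T^pi to a tabular Q.
import numpy as np

def soft_policy_evaluation(P, R, pi, gamma=0.99, alpha=1.0, n_iters=1000):
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        # V_soft(s) = E_{a~pi}[ Q(s,a) - alpha * log pi(a|s) ]
        V = np.sum(pi * (Q - alpha * np.log(pi + 1e-12)), axis=1)
        # T^pi Q(s,a) = r(s,a) + gamma * sum_s' p(s'|s,a) V_soft(s')
        Q = R + gamma * P @ V
    return Q
```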
Soft policy improvement
If there is no constraint on policies,
  π_new(a|s) = exp( (1/α) ( Q_soft^{π_old}(s, a) − V_soft^{π_old}(s) ) )
When a constraint needs to be satisfied (i.e., π ∈ Π), use an information projection onto the target policy:
  π_new(⋅|s) ∈ arg min_{π′(⋅|s)∈Π} D_KL( π′(⋅|s) ∥ exp( (1/α) ( Q_soft^{π_old}(s, ⋅) − V_soft^{π_old}(s) ) ) )
• Result: Monotonic improvement on the policy! (π_new is better than π_old)
53
Soft Policy Iteration
Model-free version of soft policy iteration: soft actor-critic
Soft policy iteration: maximum entropy variant of policy iteration
Soft actor-critic (SAC): maximum entropy variant of actor-critic
• Soft critic:
  evaluates the soft Q-function Q_θ of the policy π_φ
• Soft actor:
  improves the maximum entropy policy using the critic’s evaluation
• Temperature parameter α:
  automatically adjusts the entropy of the maximum entropy policy
54
Soft Actor-Critic
Soft critic: evaluates the soft Q-function Q_θ of the policy π_φ
Idea: DQN + Maximum entropy
• Training the soft Q-function:
  min_θ J_Q(θ) ≜ E_{(s_t, a_t)}[ ½ ( Q_θ(s_t, a_t) − ( r(s_t, a_t) + γ E_{s_{t+1}}[ V_θ̄(s_{t+1}) ] ) )² ]
  (the inner term r(s_t, a_t) + γ E_{s_{t+1}}[ V_θ̄(s_{t+1}) ] is the target; θ̄ denotes target network parameters)
• Stochastic gradient:
  ∇̂_θ J_Q(θ) = ∇_θ Q_θ(s_t, a_t) ( Q_θ(s_t, a_t) − ( r(s_t, a_t) + γ ( Q_θ̄(s_{t+1}, a_{t+1}) − α log π_φ(a_{t+1}|s_{t+1}) ) ) )
  (Q_θ̄(s_{t+1}, a_{t+1}) − α log π_φ(a_{t+1}|s_{t+1}) is a sample estimate of E_{s_{t+1}}[ V_θ̄(s_{t+1}) ])
55
Soft Actor-Critic
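A hedged PyTorch-style sketch of the critic objective above. The networks q_net and q_target_net, the policy object with a sample() method, and the batch layout are all assumed components, not the authors' implementation.

```python
# Soft critic loss: J_Q(theta) = E[ 1/2 * (Q_theta(s,a) - (r + gamma * V_target(s')))^2 ].
import torch
import torch.nn.functional as F

def critic_loss(q_net, q_target_net, policy, batch, gamma, alpha):
    s, a, r, s_next = batch                                  # tensors sampled from a replay buffer
    with torch.no_grad():
        a_next, logp_next = policy.sample(s_next)            # a_{t+1} ~ pi_phi(.|s_{t+1}) with log-probability
        # sample estimate of V(s_{t+1}) = Q_target(s_{t+1}, a_{t+1}) - alpha * log pi(a_{t+1}|s_{t+1})
        v_next = q_target_net(s_next, a_next) - alpha * logp_next
        target = r + gamma * v_next                          # TD target for the soft Q-function
    q = q_net(s, a)
    return 0.5 * F.mse_loss(q, target)
```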
Soft actor: updates the policy using the critic’s evaluation
Idea: Policy gradient + Maximum entropy
• Minimizing the expected KL-divergence to the target policy:
  min_φ D_KL( π_φ(⋅|s_t) ∥ exp( (1/α) ( Q_θ(s_t, ⋅) − V_θ(s_t) ) ) )
which is equivalent to
  min_φ J_π(φ) ≜ E_{s_t}[ E_{a_t∼π_φ(⋅|s_t)}[ α log π_φ(a_t|s_t) − Q_θ(s_t, a_t) ] ]
56
Soft Actor-Critic
Reparameterization trick
Reparameterize the policy as
  a_t ≜ f_φ(ε_t; s_t)
where ε_t is an input noise vector drawn from some fixed distribution (e.g., Gaussian)
Benefit: lower variance
Rewrite the policy optimization problem as
  min_φ J_π(φ) ≜ E_{s_t, ε_t}[ α log π_φ( f_φ(ε_t; s_t) | s_t ) − Q_θ( s_t, f_φ(ε_t; s_t) ) ]
Stochastic gradient:
  ∇̂_φ J_π(φ) = ∇_φ α log π_φ(a_t|s_t) + ( ∇_{a_t} α log π_φ(a_t|s_t) − ∇_{a_t} Q_θ(s_t, a_t) ) ∇_φ f_φ(ε_t; s_t)
where a_t is evaluated at f_φ(ε_t; s_t)
57
Soft Actor-Critic
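A hedged PyTorch-style sketch of the reparameterized actor objective above, assuming a diagonal Gaussian policy head (mu_net, log_std_net) and a critic q_net; SAC additionally squashes actions with tanh, which is omitted here for brevity.

```python
# Actor loss with the reparameterization trick: a_t = f_phi(eps_t; s_t).
import torch

def actor_loss(mu_net, log_std_net, q_net, s, alpha):
    mu, log_std = mu_net(s), log_std_net(s)
    std = log_std.exp()
    dist = torch.distributions.Normal(mu, std)
    a = dist.rsample()                               # reparameterized sample: mu + std * eps, eps ~ N(0, I)
    logp = dist.log_prob(a).sum(dim=-1)              # log pi_phi(a_t | s_t) for a diagonal Gaussian
    q = q_net(s, a)
    # J_pi(phi) = E[ alpha * log pi_phi(f_phi(eps;s) | s) - Q_theta(s, f_phi(eps;s)) ]
    return (alpha * logp - q).mean()
```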
Automating entropy adjustment
Why do we have to adjust the temperature parameter α automatically?
• Choosing the optimal temperature α_t* is non-trivial, and the temperature needs to be tuned for each task.
• Since the entropy H(π_t) can vary unpredictably both across tasks and during training as the policy becomes
  better, this makes temperature adjustment particularly difficult.
• Thus, simply forcing the entropy to a fixed value is a poor solution, since the policy should be free to explore
  more in regions where the optimal action is uncertain.
58
Soft Actor-Critic
Automating entropy adjustment
Constrained optimization problem (suppose γ = 1):
  max_{π_0,…,π_T} E[ Σ_{t=0}^T r(s_t, a_t) ]   s.t.  H(π_t) ≥ H_0  ∀t
where H_0 is a desired minimum expected entropy threshold
Rewrite the objective as an iterated maximization using an (approximate) dynamic programming approach:
  max_{π_0} E[ r(s_0, a_0) + max_{π_1} E[ … + max_{π_T} E[ r(s_T, a_T) ] ] ]
subject to the constraint on entropy
Start from the last time step T:
  max_{π_T} E_{(s_T, a_T)∼ρ_π}[ r(s_T, a_T) ]
subject to E_{(s_T, a_T)∼ρ_π}[ −log π_T(a_T|s_T) ] − H_0 ≥ 0
59
Soft Actor-Critic
Automating entropy adjustment
Change the constrained maximization to the dual problem using a Lagrangian:
1. Create the following function:
  f(π_T) = E_{(s_T, a_T)∼ρ_π}[ r(s_T, a_T) ] if h(π_T) ≥ 0, and −∞ otherwise,
  where h(π_T) = H(π_T) − H_0 = E_{(s_T, a_T)∼ρ_π}[ −log π_T(a_T|s_T) ] − H_0
2. So, rewrite the objective as
  max_{π_T} f(π_T)   s.t.  h(π_T) ≥ 0
3. Use a Lagrange multiplier:
  L(π_T, α_T) = f(π_T) + α_T h(π_T), minimized over α_T ≥ 0,
  where α_T is the dual variable
4. Rewrite the objective using the Lagrangian as
  max_{π_T} f(π_T) = min_{α_T ≥ 0} max_{π_T} L(π_T, α_T) = min_{α_T ≥ 0} max_{π_T} [ f(π_T) + α_T h(π_T) ]
60
Soft Actor-Critic
Automating entropy adjustment
Change the constrained maximization to the dual problem using the Lagrangian:
  max_{π_T} E_{(s_T, a_T)∼ρ_π}[ r(s_T, a_T) ] = max_{π_T} f(π_T)
    = min_{α_T ≥ 0} max_{π_T} L(π_T, α_T) = min_{α_T ≥ 0} max_{π_T} [ f(π_T) + α_T h(π_T) ]
    = min_{α_T ≥ 0} max_{π_T} [ E_{(s_T, a_T)∼ρ_π}[ r(s_T, a_T) ] + α_T ( E_{(s_T, a_T)∼ρ_π}[ −log π_T(a_T|s_T) ] − H_0 ) ]
    = min_{α_T ≥ 0} max_{π_T} [ E_{(s_T, a_T)∼ρ_π}[ r(s_T, a_T) − α_T log π_T(a_T|s_T) ] − α_T H_0 ]
    = min_{α_T ≥ 0} max_{π_T} [ E_{(s_T, a_T)∼ρ_π}[ r(s_T, a_T) ] + α_T H(π_T) − α_T H_0 ]
We can solve for
• the optimal policy π_T* as
  π_T* = arg max_{π_T} E_{(s_T, a_T)∼ρ_π}[ r(s_T, a_T) ] + α_T H(π_T) − α_T H_0
• the optimal dual variable α_T* as
  α_T* = arg min_{α_T ≥ 0} E_{(s_T, a_T)∼ρ_{π*}}[ α_T H(π_T*) − α_T H_0 ]
Thus,
  max_{π_T} E[ r(s_T, a_T) ] = E_{(s_T, a_T)∼ρ_{π*}}[ r(s_T, a_T) ] + α_T* H(π_T*) − α_T* H_0
61
Soft Actor-Critic
Automating entropy adjustment
Then, the soft Q-function can be computed as
  Q_{T−1}(s_{T−1}, a_{T−1}) = r(s_{T−1}, a_{T−1}) + E[ Q_T(s_T, a_T) − α_T log π_T(a_T|s_T) ]
                            = r(s_{T−1}, a_{T−1}) + E[ r(s_T, a_T) + α_T H(π_T) ]
  Q_{T−1}*(s_{T−1}, a_{T−1}) = r(s_{T−1}, a_{T−1}) + max_{π_T} E[ r(s_T, a_T) + α_T* H(π_T*) ]
Now, given the time step T − 1, we have:
  max_{π_{T−1}} [ E[ r(s_{T−1}, a_{T−1}) ] + max_{π_T} E[ r(s_T, a_T) ] ]
    = max_{π_{T−1}} [ Q_{T−1}*(s_{T−1}, a_{T−1}) − α_T* H(π_T*) ]
    = min_{α_{T−1} ≥ 0} max_{π_{T−1}} [ Q_{T−1}*(s_{T−1}, a_{T−1}) − α_T* H(π_T*) + α_{T−1} ( H(π_{T−1}) − H_0 ) ]
    = min_{α_{T−1} ≥ 0} max_{π_{T−1}} [ Q_{T−1}*(s_{T−1}, a_{T−1}) + α_{T−1} H(π_{T−1}) − α_{T−1} H_0 − α_T* H(π_T*) ]
62
Soft Actor-Critic
Automating entropy adjustment
We can solve for
• the optimal policy π_{T−1}* as
  π_{T−1}* = arg max_{π_{T−1}} E_{(s_{T−1}, a_{T−1})∼ρ_π}[ Q_{T−1}*(s_{T−1}, a_{T−1}) + α_{T−1} H(π_{T−1}) − α_{T−1} H_0 ]
• the optimal dual variable α_{T−1}* as
  α_{T−1}* = arg min_{α_{T−1} ≥ 0} E_{(s_{T−1}, a_{T−1})∼ρ_{π*}}[ α_{T−1} H(π_{T−1}*) − α_{T−1} H_0 ]
In this way, we can proceed backwards in time and recursively solve the constrained optimization
problem. At each time step, after solving for Q_t* and π_t*, we can solve for the optimal dual variable α_t*:
  α_t* = arg min_{α_t} E_{a_t∼π_t*}[ −α_t log π_t*(a_t|s_t) − α_t H_0 ]
63
Soft Actor-Critic
Temperature parameter α: automatically adjusts the entropy of the maximum entropy policy
Idea: Constrained optimization problem + dual problem
• Minimize the dual objective by approximate dual gradient descent (a short sketch follows after this slide):
  min_α E_{a_t∼π_t}[ −α log π_t(a_t|s_t) − α H_0 ]
64
Soft Actor-Critic
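A hedged PyTorch-style sketch of this dual update: gradient descent on α (parameterized as exp(log_alpha) to keep it positive) using log-probabilities from the current batch. The target_entropy value and learning rate are illustrative assumptions.

```python
# Temperature update: minimize J(alpha) = E_{a~pi}[ -alpha * log pi(a|s) - alpha * H_0 ].
import torch

log_alpha = torch.zeros(1, requires_grad=True)        # alpha = exp(log_alpha) > 0
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)
target_entropy = -1.0                                  # H_0; minus the action dimension is a common heuristic

def update_alpha(logp):                                # logp: log pi_phi(a_t|s_t) from the current batch
    alpha_loss = -(log_alpha.exp() * (logp + target_entropy).detach()).mean()
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().item()                      # current temperature alpha
```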
Putting everything together: Soft Actor-Critic (SAC)
65
Soft Actor-Critic
66
Results
• Soft actor-critic (blue and yellow) performs consistently across all tasks, outperforming both
on-policy and off-policy methods on the most challenging tasks.
67
References
Papers
• T. Haarnoja, et al., “Reinforcement Learning with Deep Energy-Based Policies”, ICML 2017
• T. Haarnoja, et al., “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning
with a Stochastic Actor”, ICML 2018
• T. Haarnoja, et al., “Soft Actor-Critic Algorithms and Applications”, arXiv preprint 2018
Theoretical analysis
• “Induction of Q-learning, Soft Q-learning, and Sparse Q-learning” written by Sungjoon Choi
• “Reinforcement Learning with Deep Energy-Based Policies” reviewed by Kyungjae Lee
• “Entropy Regularization in Reinforcement Learning (Stochastic Policy)” reviewed by Kyungjae Lee
• Reframing Control as an Inference Problem, CS 294-112: Deep Reinforcement Learning
• Constrained optimization problem and dual problem for SAC with automatically adjusted
temperature (in Korean)
68
Any Questions?
69
Thank You
