Reinforcement Learning in Configurable Environments:
an Information Theoretic Approach

Politecnico di Milano
School of Industrial and Information Engineering

Supervisor: Prof. Marcello Restelli
Co-supervisor: Dott. Alberto Maria Metelli

Emanuele Ghelfi, 875550
20 Dec, 2018
Contents

1. Reinforcement Learning
2. Configurable MDPs
3. Contribution: Relative Entropy Model Policy Search
4. Finite-Sample Analysis
5. Experiments
Motivations - Configurable Environments

Configure environmental parameters in a principled way.
Reinforcement Learning

Reinforcement Learning (Sutton and Barto, 1998) considers sequential decision-making problems.

[Figure: agent-environment interaction loop - the agent (policy) observes state S_t and reward R_t, takes action A_t, and the environment returns R_{t+1} and S_{t+1}.]
Policy Search

Definition (Policy)
A policy is a function \pi_\theta : S \to \Delta(A) that maps states to probability distributions over actions.

Definition (Performance)
The performance of a policy is defined as:

J^{\pi} = J(\theta) = \mathbb{E}_{a_t \sim \pi_\theta(\cdot|s_t),\ s_{t+1} \sim P(\cdot|s_t, a_t)} \left[ \sum_{t=0}^{H-1} \gamma^t R(s_t, a_t, s_{t+1}) \right].   (1)

Definition (MDP solution)
An MDP is solved when we find the best-performing policy:

\theta^* \in \arg\max_{\theta \in \Theta} J(\theta).   (2)
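The expectation in (1) is typically estimated by Monte Carlo rollouts. Below is a minimal sketch of such an estimator (not from the thesis), assuming a generic environment with `reset()`/`step(a)` returning `(s, r, done)` and a sampler `policy(theta, s)`:

```python
import numpy as np

def estimate_J(env, policy, theta, gamma=0.99, H=200, n_episodes=100):
    """Monte Carlo estimate of J(theta): mean discounted return over rollouts."""
    returns = []
    for _ in range(n_episodes):
        s = env.reset()
        G, discount = 0.0, 1.0
        for _ in range(H):
            a = policy(theta, s)        # a_t ~ pi_theta(.|s_t)
            s, r, done = env.step(a)    # s_{t+1} ~ P(.|s_t, a_t), r = R(...)
            G += discount * r
            discount *= gamma
            if done:
                break
        returns.append(G)
    return np.mean(returns)
```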
Gradient vs Trust Region

Gradient methods optimize the performance through gradient updates:

\theta_{t+1} = \theta_t + \lambda \nabla_\theta J(\theta_t).

Trust-region methods perform a constrained optimization:

\max_{\theta} J(\theta) \quad \text{s.t.} \quad \theta \in I(\theta_t).

[Figure: two plots of J^{\pi} over \theta with points \theta_l, \theta_t, \theta^*: an unconstrained gradient step versus a step restricted to the trust region \theta \in I(\theta_t).]
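As a toy illustration (not on the slide), the two update rules can be contrasted on a one-dimensional objective; the box-shaped trust region and the grid search are illustrative assumptions standing in for I(\theta_t):

```python
import numpy as np

def gradient_step(theta, grad_J, lr=0.1):
    # gradient ascent: theta_{t+1} = theta_t + lambda * grad J(theta_t)
    return theta + lr * grad_J(theta)

def trust_region_step(theta, J, delta=0.1, n_grid=201):
    # crude constrained maximization of J over a box trust region around theta
    candidates = theta + np.linspace(-delta, delta, n_grid)
    return candidates[int(np.argmax([J(c) for c in candidates]))]
```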
Trust Region Methods

Relative Entropy Policy Search (REPS) (Peters et al., 2010).
Trust Region Policy Optimization (TRPO) (Schulman et al., 2015).
Proximal Policy Optimization (PPO) (Schulman et al., 2017).
Policy Optimization via Importance Sampling (POIS) (Metelli et al., 2018b).
Configurable MDP

A Configurable Markov Decision Process (CMDP) (Metelli et al., 2018a) is an extension of the MDP in which the environment configuration can also be controlled.

[Figure: the agent-environment loop as before, but the environment now exposes a configuration alongside the agent's policy.]
Configurable MDP

Definition (CMDP performance)
The performance of a model-policy pair is:

J^{P,\pi} = J(\omega, \theta) = \mathbb{E}_{a_t \sim \pi_\theta(\cdot|s_t),\ s_{t+1} \sim P_\omega(\cdot|s_t, a_t)} \left[ \sum_{t=0}^{H-1} \gamma^t R(s_t, a_t, s_{t+1}) \right].   (3)

Definition (CMDP solution)
The CMDP solution is the model-policy pair maximizing the performance:

\omega^*, \theta^* \in \arg\max_{\omega \in \Omega,\ \theta \in \Theta} J(\omega, \theta).   (4)
State of the Art

Safe Policy Model Iteration (Metelli et al., 2018a): a safe approach (Pirotta et al., 2013) for solving CMDPs:

\underbrace{J^{P',\pi'} - J^{P,\pi}}_{\text{performance improvement}} \ \ge\ B(P', \pi') \ =\ \underbrace{\frac{A^{P',\pi}_{P,\pi,\mu} + A^{P,\pi'}_{P,\pi,\mu}}{1-\gamma}}_{\text{advantage term}} \ -\ \underbrace{\frac{\gamma\, \Delta q^{P,\pi}\, D}{2(1-\gamma)^2}}_{\text{dissimilarity penalization}}.   (5)

Limitations:
Finite state-action spaces.
Full knowledge of the environment dynamics.
High sample complexity.
Relative Entropy Model Policy Search

We present REMPS, a novel algorithm for CMDPs:
Information-theoretic approach.
Optimization and Projection.
Approximated models.
Continuous state and action spaces.

We optimize the average reward:

J^{P,\pi} = \liminf_{H \to +\infty} \mathbb{E}_{a_t \sim \pi(\cdot|s_t),\ s_{t+1} \sim P(\cdot|s_t, a_t)} \left[ \frac{1}{H} \sum_{t=0}^{H-1} R(s_t, a_t, s_{t+1}) \right].
REMPS - Optimization

We define the following constrained optimization problem:

Primal:
\max_{d}\ \mathbb{E}_{d}[R(s, a, s')]   (objective function)
subject to:
D_{KL}(d \,\|\, d^{P,\pi}) \le \varepsilon   (trust region)
\mathbb{E}_{d}[1] = 1   (d is well formed)

Dual:
\min_{\eta}\ \eta \log \mathbb{E}_{d^{P,\pi}}\left[ e^{\varepsilon + R(s, a, s')/\eta} \right]
subject to:
\eta \ge 0.

d^{P,\pi}: sampling distribution.
\varepsilon: trust-region size.
d: stationary distribution over state, action, and next state.
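On a batch of sampled transitions, the dual is a one-dimensional problem in \eta and can be solved numerically. A minimal sketch, assuming `rewards` is a NumPy array of sampled R(s, a, s') values (the bounds and the log-sum-exp shift are standard numerical choices, not from the slide):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def dual_objective(eta, rewards, eps):
    """REMPS dual: g(eta) = eta * log E_{d^{P,pi}}[exp(eps + R/eta)],
    with the expectation replaced by a sample mean (log-sum-exp shifted)."""
    z = eps + rewards / eta
    m = np.max(z)
    return eta * (m + np.log(np.mean(np.exp(z - m))))

def solve_dual(rewards, eps):
    res = minimize_scalar(dual_objective, bounds=(1e-6, 1e4),
                          args=(rewards, eps), method="bounded")
    return res.x  # eta*, the minimizer of the dual
```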
REMPS - Optimization Solution

Theorem (REMPS solution)
The solution of the REMPS problem is:

d(s, a, s') = \frac{d^{P,\pi}(s, a, s')\, \exp\left( R(s, a, s')/\eta \right)}{\int_S \int_A \int_S d^{P,\pi}(s, a, s')\, \exp\left( R(s, a, s')/\eta \right) ds\, da\, ds'}   (6)

where \eta is the minimizer of the dual problem.

The probability of (s, a, s') is reweighted exponentially with the reward.
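On a finite batch drawn from d^{P,\pi}, the solution (6) reduces to self-normalized exponential weights over the sampled transitions; a minimal sketch (the max-shift is a standard stability trick, not in the slide):

```python
import numpy as np

def remps_weights(rewards, eta):
    """Sample version of eq. (6): weight each sampled (s, a, s')
    proportionally to exp(R(s, a, s') / eta), then normalize."""
    w = np.exp((rewards - np.max(rewards)) / eta)
    return w / np.sum(w)
```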
REMPS - Projection

[Figure (built up over several slides): the sampling distribution d^{P,\pi} lies inside the set D^{P,\Pi} of distributions realizable by parametric model-policy pairs; the optimization step, constrained to the trust region D_{KL}(d \,\|\, d^{P,\pi}) \le \varepsilon, may return a distribution d outside D^{P,\Pi}; the projection step maps d back to the closest element of D^{P,\Pi}.]
REMPS - Projection

d-Projection (discrete state-action spaces):
\arg\min_{\theta \in \Theta,\ \omega \in \Omega} D_{KL}\big(d \,\|\, d^{\omega,\theta}\big).

State-Kernel Projection (discrete action spaces):
\arg\min_{\theta \in \Theta,\ \omega \in \Omega} D_{KL}\big(P^{\pi} \,\|\, P^{\pi_\theta}_{\omega}\big).

Independent Projection (continuous state-action spaces):
\arg\min_{\theta \in \Theta} D_{KL}(\pi \,\|\, \pi_\theta), \qquad \arg\min_{\omega \in \Omega} D_{KL}(P \,\|\, P_\omega).
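With the weights above, the independent projection of the policy becomes a weighted maximum-likelihood fit (cf. eq. (14) in the backup slides). A minimal sketch, assuming an illustrative linear-Gaussian policy \pi_\theta(a|s) = N(\theta^T s, \sigma^2) with fixed \sigma, for which weighted ML reduces to weighted least squares:

```python
import numpy as np

def project_policy(states, actions, weights, reg=1e-8):
    """Weighted ML: argmax_theta sum_i w_i log pi_theta(a_i|s_i).
    states: (N, d) array, actions: (N,) array, weights: (N,) array."""
    W = np.diag(weights)
    A = states.T @ W @ states + reg * np.eye(states.shape[1])
    b = states.T @ W @ actions
    return np.linalg.solve(A, b)  # theta of the projected policy
```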
Algorithm

Algorithm 1: Relative Entropy Model Policy Search
1: for t = 0, 1, ... until convergence do
2:   Collect samples from \pi_{\theta_t}, P_{\omega_t}
3:   Obtain \eta^*, the minimizer of the dual problem.   [Optimization]
4:   Project d according to the projection strategy:   [Projection]
       a. d-Projection;
       b. State-Kernel Projection;
       c. Independent Projection.
5:   Update the policy.
6:   Update the model.
7: end for
8: return the policy-model pair (\pi_{\theta_t}, P_{\omega_t})
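A schematic Python outline of Algorithm 1 under the independent-projection strategy, reusing the sketches above; `collect_transitions` and `project_model` are hypothetical helpers, not thesis code:

```python
def remps(env, theta, omega, eps, n_iters=100, batch_size=5000):
    for t in range(n_iters):
        # lines 1-2: collect samples from (pi_theta, P_omega)
        states, actions, rewards, next_states = collect_transitions(
            env, theta, omega, batch_size)
        # line 3: optimization - minimize the dual in eta
        eta = solve_dual(rewards, eps)
        # line 4: projection - reweight samples, then fit the parametric pair
        w = remps_weights(rewards, eta)
        theta = project_policy(states, actions, w)               # line 5
        omega = project_model(states, actions, next_states, w)   # line 6
    return theta, omega
```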
Finite-Sample Analysis

How much can the ideal performance differ from the approximated one?

d: solution of the optimization step with infinitely many samples.
\hat{d}: solution of the optimization step with N samples.
\tilde{d}: solution of optimization and projection with N samples.

J_{d} - J_{\tilde{d}} = \underbrace{J_{d} - J_{\hat{d}}}_{\text{OPT}} + \underbrace{J_{\hat{d}} - J_{\tilde{d}}}_{\text{PROJ}}.   (7)
Finite-Sample Analysis

J_{d} - J_{\tilde{d}} \le r_{\max}\, \psi(N)\, \sqrt{\frac{8 v \log \frac{2eN}{v} + 8 \log \frac{8}{\delta}}{N}}   (8)
 + r_{\max}\, \phi\left( \sqrt[4]{\frac{8 v \log \frac{2eN}{v} + 8 \log \frac{8}{\delta}}{N}} \right)   (9)
 + \sqrt{2}\, r_{\max} \sup_{d \in \mathcal{D}^{d^{P,\pi}}_{\varepsilon}} \inf_{\tilde{d} \in \mathcal{D}^{P,\Pi}} \sqrt{D_{KL}(d \,\|\, \tilde{d})}.   (10)
Experimental Results - Chain

Motivations:
Visualize the behaviour of REMPS;
Overcome local minima;
Configure the transition function.

[Figure: two-state chain MDP (S0, S1) with edges labeled by action, policy parameter (\theta or 1-\theta), configuration (\omega or 1-\omega), and reward; surface and contour plots of the performance J over (\theta, \omega).]
Experimental Results - Cartpole

Standard RL benchmark;
Configure the cart acceleration;
Approximated model.

Minimize the acceleration while balancing the pole:

R(s, a, s') = 10 - \frac{\omega^2}{20} - 20 \cdot (1 - \cos\gamma),

where \gamma denotes the pole angle. [Figure: cartpole with state (x, \dot{x}, \gamma, \dot{\gamma}).]
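For concreteness, the configurable reward reads as follows in code (a direct transcription of the formula; `gamma_angle` is the pole angle, `omega` the configurable acceleration):

```python
import numpy as np

def cartpole_reward(omega, gamma_angle):
    # R = 10 - omega^2/20 - 20*(1 - cos(gamma)): penalizes large accelerations
    # and deviations of the pole from the upright position
    return 10.0 - omega**2 / 20.0 - 20.0 * (1.0 - np.cos(gamma_angle))
```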
Experimental Results - Cartpole

[Figure: return vs. iteration (up to 2,000) with the ideal model (left panel) and the approximated model (right panel), comparing REMPS against G(PO)MDP.]
Experimental Results - TORCS

TORCS: car racing simulation (Bernhard Wymann, 2013; Loiacono et al., 2010).
Autonomous driving and configuration.
Continuous control.
Approximated model.
We configure the rear wing and the brake system.
Conclusions

Contributions:
REMPS, an algorithm able to solve the joint model-policy learning problem.
Finite-sample analysis.
Experimental evaluation.

Future research directions:
Adaptive KL constraint.
Other divergences.
Finite-time analysis.
Plan to submit to ICML 2019.
Thank you for your attention.
Questions?
Bibliography

[Bernhard Wymann 2013] Wymann, Bernhard; Guionneau, Christophe; Dimitrakakis, Christos; Coulom, Rémi; Andrew S.: TORCS, The Open Racing Car Simulator. http://www.torcs.org, 2013.

[Loiacono et al. 2010] Loiacono, D.; Prete, A.; Lanzi, P. L.; Cardamone, L.: Learning to overtake in TORCS using simple reinforcement learning. In: IEEE Congress on Evolutionary Computation, July 2010, pp. 1-8.

[Metelli et al. 2018a] Metelli, Alberto M.; Mutti, Mirco; Restelli, Marcello: Configurable Markov Decision Processes. In: Proceedings of the 35th International Conference on Machine Learning, vol. 80. PMLR, 2018, pp. 3488-3497.

[Metelli et al. 2018b] Metelli, Alberto M.; Papini, Matteo; Faccio, Francesco; Restelli, Marcello: Policy Optimization via Importance Sampling. arXiv:1809.06098 [cs, stat], September 2018.

[Peters et al. 2010] Peters, Jan; Mülling, Katharina; Altun, Yasemin: Relative Entropy Policy Search. 2010.

[Pirotta et al. 2013] Pirotta, Matteo; Restelli, Marcello; Pecorino, Alessio; Calandriello, Daniele: Safe Policy Iteration. In: Proceedings of the 30th International Conference on Machine Learning, vol. 28. PMLR, 2013, pp. 307-315.

[Schulman et al. 2015] Schulman, John; Levine, Sergey; Moritz, Philipp; Jordan, Michael I.; Abbeel, Pieter: Trust Region Policy Optimization. arXiv:1502.05477 [cs], February 2015.

[Schulman et al. 2017] Schulman, John; Wolski, Filip; Dhariwal, Prafulla; Radford, Alec; Klimov, Oleg: Proximal Policy Optimization Algorithms. arXiv:1707.06347 [cs], July 2017.

[Sutton and Barto 1998] Sutton, Richard S.; Barto, Andrew G.: Introduction to Reinforcement Learning. 1st ed. MIT Press, Cambridge, MA, USA, 1998.
d-Projection

Projection of the stationary distribution:

\theta, \omega = \arg\min_{\theta \in \Theta,\ \omega \in \Omega} D_{KL}\big(d(s, a, s') \,\|\, d^{\omega,\theta}(s, a, s')\big)
\quad \text{s.t.} \quad d^{\omega,\theta}(s') = \int_S \int_A d^{\omega,\theta}(s)\, \pi_\theta(a|s)\, P_\omega(s'|s, a)\, da\, ds.
Projection of the State Kernel

Projection of the state kernel:

\theta, \omega = \arg\min_{\theta \in \Theta,\ \omega \in \Omega} \int_S d(s)\, D_{KL}\big(P^{\pi}(\cdot|s) \,\|\, P^{\pi_\theta}_{\omega}(\cdot|s)\big)\, ds
= \arg\max_{\theta \in \Theta,\ \omega \in \Omega} \int_S d(s) \int_S P^{\pi}(s'|s) \log P^{\pi_\theta}_{\omega}(s'|s)\, ds'\, ds
= \arg\max_{\theta \in \Theta,\ \omega \in \Omega} \int_S d(s) \int_S \int_A \pi(a|s)\, P(s'|s, a) \log P^{\pi_\theta}_{\omega}(s'|s)\, da\, ds'\, ds
= \arg\max_{\theta \in \Theta,\ \omega \in \Omega} \int_S \int_A \int_S d(s, a, s') \log \int_A P_\omega(s'|s, a')\, \pi_\theta(a'|s)\, da'\, ds\, da\, ds'.
Independent Projection

Independent projection of policy and model (the policy case is shown; the model case is analogous):

\theta = \arg\min_{\theta \in \Theta} \int_S d(s)\, D_{KL}\big(\pi(\cdot|s) \,\|\, \pi_\theta(\cdot|s)\big)\, ds   (11)
= \arg\min_{\theta \in \Theta} \int_S d(s) \int_A \pi(a|s) \log \frac{\pi(a|s)}{\pi_\theta(a|s)}\, da\, ds   (12)
= \arg\min_{\theta \in \Theta} \int_S \int_A \int_S d(s, a, s') \log \frac{\pi(a|s)}{\pi_\theta(a|s)}\, ds\, da\, ds'   (13)
= \arg\max_{\theta \in \Theta} \int_S \int_A \int_S d(s, a, s') \log \pi_\theta(a|s)\, ds\, da\, ds'.   (14)
Projection

Theorem (Joint bound)
Let d^{P,\pi} denote the stationary distribution induced by the model P and policy \pi, and d^{P',\pi'} the stationary distribution induced by the model P' and policy \pi'. Assume the reward is uniformly bounded, i.e., |R(s, a, s')| < R_{\max} for all s, s' \in S, a \in A. The absolute difference of the performances can be upper bounded as:

|J^{P,\pi} - J^{P',\pi'}| \le R_{\max} \sqrt{2\, D_{KL}(d^{P,\pi} \,\|\, d^{P',\pi'})}.   (15)
Projection

Theorem (Disjoint bound)
Let d^{P,\pi} denote the stationary distribution induced by the model P and policy \pi, and d^{P',\pi'} the stationary distribution induced by the model P' and policy \pi'. Assume the reward is uniformly bounded, i.e., |R(s, a, s')| < R_{\max} for all s, s' \in S, a \in A. If (P', \pi') admits a group-invertible state kernel P^{\pi'}, the absolute difference of the performances can be upper bounded as:

|J^{P,\pi} - J^{P',\pi'}| \le R_{\max}\, c_1\, \mathbb{E}_{s,a \sim d^{P,\pi}}\left[ \sqrt{2\, D_{KL}(\pi \,\|\, \pi')} + \sqrt{2\, D_{KL}(P \,\|\, P')} \right],   (16)

where c_1 = 1 + \|A^{\#}\|_{\infty} and A^{\#} is the group inverse associated with the state kernel P^{\pi'}.
G(PO)MDP - Model Extension

J^{P,\pi} = \int p_{\theta,\omega}(\tau)\, G(\tau)\, d\tau

\nabla_\omega J^{P,\pi} = \int p_{\theta,\omega}(\tau)\, \nabla_\omega \log p_{\theta,\omega}(\tau)\, G(\tau)\, d\tau   (17)
= \int p_{\theta,\omega}(\tau) \left( \sum_{k=0}^{H} \nabla_\omega \log P_\omega(s_{k+1}|s_k, a_k) \right) G(\tau)\, d\tau   (18)

\nabla_\omega J^{P,\pi}_{\text{G(PO)MDP}} = \left\langle \sum_{l=0}^{H} \left( \sum_{k=0}^{l} \nabla_\omega \log P_\omega(s_{k+1}|s_k, a_k) \right) \gamma^l R(s_l, a_l, s_{l+1}) \right\rangle_N   (19)
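A sample-based sketch of estimator (19), assuming a hypothetical `grad_log_P(s, a, s_next, omega)` that returns \nabla_\omega \log P_\omega(s'|s, a) as a NumPy array:

```python
import numpy as np

def gpomdp_model_gradient(trajectories, omega, gamma, grad_log_P):
    """Estimate grad_omega J as in eq. (19): each discounted reward is
    weighted by the cumulative model score up to that step, averaged
    over the N sampled trajectories."""
    grads = []
    for traj in trajectories:            # traj: list of (s, a, r, s_next)
        g, cum_score = 0.0, 0.0
        for l, (s, a, r, s_next) in enumerate(traj):
            cum_score = cum_score + grad_log_P(s, a, s_next, omega)
            g = g + (gamma ** l) * r * cum_score
        grads.append(g)
    return np.mean(grads, axis=0)
```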