Reinforcement Learning in Configurable Environments:
an Information Theoretic Approach

Politecnico di Milano
School of Industrial and Information Engineering

Supervisor: Prof. Marcello Restelli
Co-supervisor: Dott. Alberto Maria Metelli

Emanuele Ghelfi, 875550
20 Dec, 2018
Contents

1. Reinforcement Learning
2. Configurable MDPs
3. Contribution: Relative Entropy Model Policy Search
4. Finite-Sample Analysis
5. Experiments
Motivations - Configurable Environments

Configure environmental parameters in a principled way.
Reinforcement Learning

Reinforcement Learning (Sutton and Barto, 1998) considers sequential decision-making problems.

[Figure: agent-environment interaction loop - the agent (policy) observes state S_t and reward R_t, takes action A_t, and the environment returns R_{t+1} and S_{t+1}.]
Policy Search

Definition (Policy)
A policy is a function \pi_\theta : S \to \Delta(A) that maps states to probability distributions over actions.

Definition (Performance)
The performance of a policy is defined as:

J^{\pi} = J(\theta) = \mathbb{E}_{a_t \sim \pi_\theta(\cdot|s_t),\ s_{t+1} \sim P(\cdot|s_t, a_t)} \left[ \sum_{t=0}^{H-1} \gamma^t R(s_t, a_t, s_{t+1}) \right].   (1)

Definition (MDP solution)
An MDP is solved when we find the best-performing policy:

\theta^* \in \arg\max_{\theta \in \Theta} J(\theta).   (2)
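The expectation in (1) is typically estimated by Monte Carlo rollouts. Below is a minimal sketch of such an estimator (not from the thesis), assuming a generic environment with `reset()`/`step(a)` returning `(s, r, done)` and a sampler `policy(theta, s)`:

```python
import numpy as np

def estimate_J(env, policy, theta, gamma=0.99, H=200, n_episodes=100):
    """Monte Carlo estimate of J(theta): mean discounted return over rollouts."""
    returns = []
    for _ in range(n_episodes):
        s = env.reset()
        G, discount = 0.0, 1.0
        for _ in range(H):
            a = policy(theta, s)        # a_t ~ pi_theta(.|s_t)
            s, r, done = env.step(a)    # s_{t+1} ~ P(.|s_t, a_t), r = R(...)
            G += discount * r
            discount *= gamma
            if done:
                break
        returns.append(G)
    return np.mean(returns)
```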
Gradient vs Trust Region

Gradient methods optimize the performance through gradient updates:

\theta_{t+1} = \theta_t + \lambda \nabla_\theta J(\theta_t).

Trust-region methods perform a constrained optimization:

\max_{\theta} J(\theta) \quad \text{s.t.} \quad \theta \in I(\theta_t).

[Figure: two plots of J^{\pi} over \theta with points \theta_l, \theta_t, \theta^*: an unconstrained gradient step versus a step restricted to the trust region \theta \in I(\theta_t).]
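As a toy illustration (not on the slide), the two update rules can be contrasted on a one-dimensional objective; the box-shaped trust region and the grid search are illustrative assumptions standing in for I(\theta_t):

```python
import numpy as np

def gradient_step(theta, grad_J, lr=0.1):
    # gradient ascent: theta_{t+1} = theta_t + lambda * grad J(theta_t)
    return theta + lr * grad_J(theta)

def trust_region_step(theta, J, delta=0.1, n_grid=201):
    # crude constrained maximization of J over a box trust region around theta
    candidates = theta + np.linspace(-delta, delta, n_grid)
    return candidates[int(np.argmax([J(c) for c in candidates]))]
```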
Trust Region Methods

Relative Entropy Policy Search (REPS) (Peters et al., 2010).
Trust Region Policy Optimization (TRPO) (Schulman et al., 2015).
Proximal Policy Optimization (PPO) (Schulman et al., 2017).
Policy Optimization via Importance Sampling (POIS) (Metelli et al., 2018b).
Configurable MDP

A Configurable Markov Decision Process (CMDP) (Metelli et al., 2018a) is an extension of the MDP in which the environment configuration can also be controlled.

[Figure: the agent-environment loop as before, but the environment now exposes a configuration alongside the agent's policy.]
Configurable MDP

Definition (CMDP performance)
The performance of a model-policy pair is:

J^{P,\pi} = J(\omega, \theta) = \mathbb{E}_{a_t \sim \pi_\theta(\cdot|s_t),\ s_{t+1} \sim P_\omega(\cdot|s_t, a_t)} \left[ \sum_{t=0}^{H-1} \gamma^t R(s_t, a_t, s_{t+1}) \right].   (3)

Definition (CMDP solution)
The CMDP solution is the model-policy pair maximizing the performance:

\omega^*, \theta^* \in \arg\max_{\omega \in \Omega,\ \theta \in \Theta} J(\omega, \theta).   (4)
State of the Art

Safe Policy Model Iteration (Metelli et al., 2018a): a safe approach (Pirotta et al., 2013) for solving CMDPs:

\underbrace{J^{P',\pi'} - J^{P,\pi}}_{\text{performance improvement}} \ \ge\ B(P', \pi') \ =\ \underbrace{\frac{A^{P',\pi}_{P,\pi,\mu} + A^{P,\pi'}_{P,\pi,\mu}}{1-\gamma}}_{\text{advantage term}} \ -\ \underbrace{\frac{\gamma\, \Delta q^{P,\pi}\, D}{2(1-\gamma)^2}}_{\text{dissimilarity penalization}}.   (5)

Limitations:
Finite state-action spaces.
Full knowledge of the environment dynamics.
High sample complexity.
Relative Entropy Model Policy Search

We present REMPS, a novel algorithm for CMDPs:
Information-theoretic approach.
Optimization and Projection.
Approximated models.
Continuous state and action spaces.

We optimize the average reward:

J^{P,\pi} = \liminf_{H \to +\infty} \mathbb{E}_{a_t \sim \pi(\cdot|s_t),\ s_{t+1} \sim P(\cdot|s_t, a_t)} \left[ \frac{1}{H} \sum_{t=0}^{H-1} R(s_t, a_t, s_{t+1}) \right].
REMPS - Optimization

We define the following constrained optimization problem:

Primal:
\max_{d}\ \mathbb{E}_{d}[R(s, a, s')]   (objective function)
subject to:
D_{KL}(d \,\|\, d^{P,\pi}) \le \varepsilon   (trust region)
\mathbb{E}_{d}[1] = 1   (d is well formed)

Dual:
\min_{\eta}\ \eta \log \mathbb{E}_{d^{P,\pi}}\left[ e^{\varepsilon + R(s, a, s')/\eta} \right]
subject to:
\eta \ge 0.

d^{P,\pi}: sampling distribution.
\varepsilon: trust-region size.
d: stationary distribution over state, action, and next state.
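On a batch of sampled transitions, the dual is a one-dimensional problem in \eta and can be solved numerically. A minimal sketch, assuming `rewards` is a NumPy array of sampled R(s, a, s') values (the bounds and the log-sum-exp shift are standard numerical choices, not from the slide):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def dual_objective(eta, rewards, eps):
    """REMPS dual: g(eta) = eta * log E_{d^{P,pi}}[exp(eps + R/eta)],
    with the expectation replaced by a sample mean (log-sum-exp shifted)."""
    z = eps + rewards / eta
    m = np.max(z)
    return eta * (m + np.log(np.mean(np.exp(z - m))))

def solve_dual(rewards, eps):
    res = minimize_scalar(dual_objective, bounds=(1e-6, 1e4),
                          args=(rewards, eps), method="bounded")
    return res.x  # eta*, the minimizer of the dual
```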
REMPS - Optimization Solution

Theorem (REMPS solution)
The solution of the REMPS problem is:

d(s, a, s') = \frac{d^{P,\pi}(s, a, s')\, \exp\left( R(s, a, s')/\eta \right)}{\int_S \int_A \int_S d^{P,\pi}(s, a, s')\, \exp\left( R(s, a, s')/\eta \right) ds\, da\, ds'}   (6)

where \eta is the minimizer of the dual problem.

The probability of (s, a, s') is reweighted exponentially with the reward.
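On a finite batch drawn from d^{P,\pi}, the solution (6) reduces to self-normalized exponential weights over the sampled transitions; a minimal sketch (the max-shift is a standard stability trick, not in the slide):

```python
import numpy as np

def remps_weights(rewards, eta):
    """Sample version of eq. (6): weight each sampled (s, a, s')
    proportionally to exp(R(s, a, s') / eta), then normalize."""
    w = np.exp((rewards - np.max(rewards)) / eta)
    return w / np.sum(w)
```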
REMPS - Projection

[Figure (built up over several slides): the sampling distribution d^{P,\pi} lies inside the set D^{P,\Pi} of distributions realizable by parametric model-policy pairs; the optimization step, constrained to the trust region D_{KL}(d \,\|\, d^{P,\pi}) \le \varepsilon, may return a distribution d outside D^{P,\Pi}; the projection step maps d back to the closest element of D^{P,\Pi}.]
REMPS - Projection

d-Projection (discrete state-action spaces):
\arg\min_{\theta \in \Theta,\ \omega \in \Omega} D_{KL}\big(d \,\|\, d^{\omega,\theta}\big).

State-Kernel Projection (discrete action spaces):
\arg\min_{\theta \in \Theta,\ \omega \in \Omega} D_{KL}\big(P^{\pi} \,\|\, P^{\pi_\theta}_{\omega}\big).

Independent Projection (continuous state-action spaces):
\arg\min_{\theta \in \Theta} D_{KL}(\pi \,\|\, \pi_\theta), \qquad \arg\min_{\omega \in \Omega} D_{KL}(P \,\|\, P_\omega).
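With the weights above, the independent projection of the policy becomes a weighted maximum-likelihood fit (cf. eq. (14) in the backup slides). A minimal sketch, assuming an illustrative linear-Gaussian policy \pi_\theta(a|s) = N(\theta^T s, \sigma^2) with fixed \sigma, for which weighted ML reduces to weighted least squares:

```python
import numpy as np

def project_policy(states, actions, weights, reg=1e-8):
    """Weighted ML: argmax_theta sum_i w_i log pi_theta(a_i|s_i).
    states: (N, d) array, actions: (N,) array, weights: (N,) array."""
    W = np.diag(weights)
    A = states.T @ W @ states + reg * np.eye(states.shape[1])
    b = states.T @ W @ actions
    return np.linalg.solve(A, b)  # theta of the projected policy
```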
Algorithm

Algorithm 1: Relative Entropy Model Policy Search
1: for t = 0, 1, ... until convergence do
2:   Collect samples from \pi_{\theta_t}, P_{\omega_t}
3:   Obtain \eta^*, the minimizer of the dual problem.   [Optimization]
4:   Project d according to the projection strategy:   [Projection]
       a. d-Projection;
       b. State-Kernel Projection;
       c. Independent Projection.
5:   Update the policy.
6:   Update the model.
7: end for
8: return the policy-model pair (\pi_{\theta_t}, P_{\omega_t})
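A schematic Python outline of Algorithm 1 under the independent-projection strategy, reusing the sketches above; `collect_transitions` and `project_model` are hypothetical helpers, not thesis code:

```python
def remps(env, theta, omega, eps, n_iters=100, batch_size=5000):
    for t in range(n_iters):
        # lines 1-2: collect samples from (pi_theta, P_omega)
        states, actions, rewards, next_states = collect_transitions(
            env, theta, omega, batch_size)
        # line 3: optimization - minimize the dual in eta
        eta = solve_dual(rewards, eps)
        # line 4: projection - reweight samples, then fit the parametric pair
        w = remps_weights(rewards, eta)
        theta = project_policy(states, actions, w)               # line 5
        omega = project_model(states, actions, next_states, w)   # line 6
    return theta, omega
```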
Finite-Sample Analysis

How much can the ideal performance differ from the approximated one?

d: solution of the optimization step with infinitely many samples.
\hat{d}: solution of the optimization step with N samples.
\tilde{d}: solution of optimization and projection with N samples.

J_{d} - J_{\tilde{d}} = \underbrace{J_{d} - J_{\hat{d}}}_{\text{OPT}} + \underbrace{J_{\hat{d}} - J_{\tilde{d}}}_{\text{PROJ}}.   (7)
Finite-Sample Analysis

J_{d} - J_{\tilde{d}} \le r_{\max}\, \psi(N)\, \sqrt{\frac{8 v \log \frac{2eN}{v} + 8 \log \frac{8}{\delta}}{N}}   (8)
 + r_{\max}\, \phi\left( \sqrt[4]{\frac{8 v \log \frac{2eN}{v} + 8 \log \frac{8}{\delta}}{N}} \right)   (9)
 + \sqrt{2}\, r_{\max} \sup_{d \in \mathcal{D}^{d^{P,\pi}}_{\varepsilon}} \inf_{\tilde{d} \in \mathcal{D}^{P,\Pi}} \sqrt{D_{KL}(d \,\|\, \tilde{d})}.   (10)
Experimental Results - Chain

Motivations:
Visualize the behaviour of REMPS;
Overcome local minima;
Configure the transition function.

[Figure: two-state chain MDP (S0, S1) with edges labeled by action, policy parameter (\theta or 1-\theta), configuration (\omega or 1-\omega), and reward; surface and contour plots of the performance J over (\theta, \omega).]
Experimental Results - Cartpole

Standard RL benchmark;
Configure the cart acceleration;
Approximated model.

Minimize the acceleration while balancing the pole:

R(s, a, s') = 10 - \frac{\omega^2}{20} - 20 \cdot (1 - \cos\gamma),

where \gamma denotes the pole angle. [Figure: cartpole with state (x, \dot{x}, \gamma, \dot{\gamma}).]
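For concreteness, the configurable reward reads as follows in code (a direct transcription of the formula; `gamma_angle` is the pole angle, `omega` the configurable acceleration):

```python
import numpy as np

def cartpole_reward(omega, gamma_angle):
    # R = 10 - omega^2/20 - 20*(1 - cos(gamma)): penalizes large accelerations
    # and deviations of the pole from the upright position
    return 10.0 - omega**2 / 20.0 - 20.0 * (1.0 - np.cos(gamma_angle))
```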
Experimental Results - Cartpole

[Figure: return vs. iteration (up to 2,000) with the ideal model (left panel) and the approximated model (right panel), comparing REMPS against G(PO)MDP.]
Experimental Results - TORCS

TORCS: car racing simulation (Bernhard Wymann, 2013; Loiacono et al., 2010).
Autonomous driving and configuration.
Continuous control.
Approximated model.
We configure the rear wing and the brake system.
Conclusions

Contributions:
REMPS, an algorithm able to solve the joint model-policy learning problem.
Finite-sample analysis.
Experimental evaluation.

Future research directions:
Adaptive KL constraint.
Other divergences.
Finite-time analysis.
Plan to submit to ICML 2019.
Thank you for your attention.
Questions?
Bibliography

[Bernhard Wymann 2013] Wymann, Bernhard; Guionneau, Christophe; Dimitrakakis, Christos; Coulom, Rémi; Andrew S.: TORCS, The Open Racing Car Simulator. http://www.torcs.org, 2013.

[Loiacono et al. 2010] Loiacono, D.; Prete, A.; Lanzi, P. L.; Cardamone, L.: Learning to overtake in TORCS using simple reinforcement learning. In: IEEE Congress on Evolutionary Computation, July 2010, pp. 1-8.

[Metelli et al. 2018a] Metelli, Alberto M.; Mutti, Mirco; Restelli, Marcello: Configurable Markov Decision Processes. In: Proceedings of the 35th International Conference on Machine Learning, vol. 80. PMLR, 2018, pp. 3488-3497.

[Metelli et al. 2018b] Metelli, Alberto M.; Papini, Matteo; Faccio, Francesco; Restelli, Marcello: Policy Optimization via Importance Sampling. arXiv:1809.06098 [cs, stat], September 2018.

[Peters et al. 2010] Peters, Jan; Mülling, Katharina; Altun, Yasemin: Relative Entropy Policy Search. 2010.

[Pirotta et al. 2013] Pirotta, Matteo; Restelli, Marcello; Pecorino, Alessio; Calandriello, Daniele: Safe Policy Iteration. In: Proceedings of the 30th International Conference on Machine Learning, vol. 28. PMLR, 2013, pp. 307-315.

[Schulman et al. 2015] Schulman, John; Levine, Sergey; Moritz, Philipp; Jordan, Michael I.; Abbeel, Pieter: Trust Region Policy Optimization. arXiv:1502.05477 [cs], February 2015.

[Schulman et al. 2017] Schulman, John; Wolski, Filip; Dhariwal, Prafulla; Radford, Alec; Klimov, Oleg: Proximal Policy Optimization Algorithms. arXiv:1707.06347 [cs], July 2017.

[Sutton and Barto 1998] Sutton, Richard S.; Barto, Andrew G.: Introduction to Reinforcement Learning. 1st ed. MIT Press, Cambridge, MA, USA, 1998.
d-Projection

Projection of the stationary distribution:

\theta, \omega = \arg\min_{\theta \in \Theta,\ \omega \in \Omega} D_{KL}\big(d(s, a, s') \,\|\, d^{\omega,\theta}(s, a, s')\big)
\quad \text{s.t.} \quad d^{\omega,\theta}(s') = \int_S \int_A d^{\omega,\theta}(s)\, \pi_\theta(a|s)\, P_\omega(s'|s, a)\, da\, ds.
Projection of the State Kernel

Projection of the state kernel:

\theta, \omega = \arg\min_{\theta \in \Theta,\ \omega \in \Omega} \int_S d(s)\, D_{KL}\big(P^{\pi}(\cdot|s) \,\|\, P^{\pi_\theta}_{\omega}(\cdot|s)\big)\, ds
= \arg\max_{\theta \in \Theta,\ \omega \in \Omega} \int_S d(s) \int_S P^{\pi}(s'|s) \log P^{\pi_\theta}_{\omega}(s'|s)\, ds'\, ds
= \arg\max_{\theta \in \Theta,\ \omega \in \Omega} \int_S d(s) \int_S \int_A \pi(a|s)\, P(s'|s, a) \log P^{\pi_\theta}_{\omega}(s'|s)\, da\, ds'\, ds
= \arg\max_{\theta \in \Theta,\ \omega \in \Omega} \int_S \int_A \int_S d(s, a, s') \log \int_A P_\omega(s'|s, a')\, \pi_\theta(a'|s)\, da'\, ds\, da\, ds'.
Independent Projection

Independent projection of policy and model (the policy case is shown; the model case is analogous):

\theta = \arg\min_{\theta \in \Theta} \int_S d(s)\, D_{KL}\big(\pi(\cdot|s) \,\|\, \pi_\theta(\cdot|s)\big)\, ds   (11)
= \arg\min_{\theta \in \Theta} \int_S d(s) \int_A \pi(a|s) \log \frac{\pi(a|s)}{\pi_\theta(a|s)}\, da\, ds   (12)
= \arg\min_{\theta \in \Theta} \int_S \int_A \int_S d(s, a, s') \log \frac{\pi(a|s)}{\pi_\theta(a|s)}\, ds\, da\, ds'   (13)
= \arg\max_{\theta \in \Theta} \int_S \int_A \int_S d(s, a, s') \log \pi_\theta(a|s)\, ds\, da\, ds'.   (14)
Projection

Theorem (Joint bound)
Let d^{P,\pi} denote the stationary distribution induced by the model P and policy \pi, and d^{P',\pi'} the stationary distribution induced by the model P' and policy \pi'. Assume the reward is uniformly bounded, i.e., |R(s, a, s')| < R_{\max} for all s, s' \in S, a \in A. The absolute difference of the performances can be upper bounded as:

|J^{P,\pi} - J^{P',\pi'}| \le R_{\max} \sqrt{2\, D_{KL}(d^{P,\pi} \,\|\, d^{P',\pi'})}.   (15)
Projection

Theorem (Disjoint bound)
Let d^{P,\pi} denote the stationary distribution induced by the model P and policy \pi, and d^{P',\pi'} the stationary distribution induced by the model P' and policy \pi'. Assume the reward is uniformly bounded, i.e., |R(s, a, s')| < R_{\max} for all s, s' \in S, a \in A. If (P', \pi') admits a group-invertible state kernel P^{\pi'}, the absolute difference of the performances can be upper bounded as:

|J^{P,\pi} - J^{P',\pi'}| \le R_{\max}\, c_1\, \mathbb{E}_{s,a \sim d^{P,\pi}}\left[ \sqrt{2\, D_{KL}(\pi \,\|\, \pi')} + \sqrt{2\, D_{KL}(P \,\|\, P')} \right],   (16)

where c_1 = 1 + \|A^{\#}\|_{\infty} and A^{\#} is the group inverse associated with the state kernel P^{\pi'}.
G(PO)MDP - Model Extension

J^{P,\pi} = \int p_{\theta,\omega}(\tau)\, G(\tau)\, d\tau

\nabla_\omega J^{P,\pi} = \int p_{\theta,\omega}(\tau)\, \nabla_\omega \log p_{\theta,\omega}(\tau)\, G(\tau)\, d\tau   (17)
= \int p_{\theta,\omega}(\tau) \left( \sum_{k=0}^{H} \nabla_\omega \log P_\omega(s_{k+1}|s_k, a_k) \right) G(\tau)\, d\tau   (18)

\nabla_\omega J^{P,\pi}_{\text{G(PO)MDP}} = \left\langle \sum_{l=0}^{H} \left( \sum_{k=0}^{l} \nabla_\omega \log P_\omega(s_{k+1}|s_k, a_k) \right) \gamma^l R(s_l, a_l, s_{l+1}) \right\rangle_N   (19)
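A sample-based sketch of estimator (19), assuming a hypothetical `grad_log_P(s, a, s_next, omega)` that returns \nabla_\omega \log P_\omega(s'|s, a) as a NumPy array:

```python
import numpy as np

def gpomdp_model_gradient(trajectories, omega, gamma, grad_log_P):
    """Estimate grad_omega J as in eq. (19): each discounted reward is
    weighted by the cumulative model score up to that step, averaged
    over the N sampled trajectories."""
    grads = []
    for traj in trajectories:            # traj: list of (s, a, r, s_next)
        g, cum_score = 0.0, 0.0
        for l, (s, a, r, s_next) in enumerate(traj):
            cum_score = cum_score + grad_log_P(s, a, s_next, omega)
            g = g + (gamma ** l) * r * cum_score
        grads.append(g)
    return np.mean(grads, axis=0)
```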