POLITECNICO DI MILANO
SCHOOL OF INDUSTRIAL AND INFORMATION ENGINEERING
Reinforcement Learning
in Configurable Environments:
an Information Theoretic approach
Supervisor: Prof. Marcello Restelli
Co-supervisor: Dr. Alberto Maria Metelli
Emanuele Ghelfi, 875550
20 Dec, 2018
Contents
1 Reinforcement Learning
2 Configurable MDP
3 Contribution - Relative Entropy Model Policy Search
4 Finite–Sample Analysis
5 Experiments
Motivations - Configurable Environments
Configure environmental parameters in a principled way.
Reinforcement Learning
Reinforcement Learning (Sutton and Barto, 1998) considers sequential decision-making problems.
[Diagram: agent-environment loop. The agent (policy) observes state S_t and reward R_t and takes action A_t; the environment returns R_{t+1} and S_{t+1}.]
Policy Search
Definition (Policy)
A policy is a function \pi_\theta : \mathcal{S} \to \Delta(\mathcal{A}) that maps states to probability distributions over actions.
Definition (Performance)
The performance of a policy is defined as:
J^{\pi} = J(\theta) = \mathbb{E}_{a_t \sim \pi(\cdot|s_t),\ s_{t+1} \sim P(\cdot|s_t, a_t)} \left[ \sum_{t=0}^{H-1} \gamma^t R(s_t, a_t, s_{t+1}) \right]. (1)
Definition (MDP solution)
An MDP is solved when we find the best-performing policy:
\theta^* \in \operatorname*{arg\,max}_{\theta \in \Theta} J(\theta). (2)
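As a concrete illustration (not part of the slides), the performance J(θ) in Eq. (1) can be estimated by Monte Carlo rollouts. The sketch below assumes a Gym-style environment with the classic 4-tuple step API and a hypothetical policy(theta, state) sampler.

import numpy as np

def estimate_performance(env, policy, theta, n_episodes=100, horizon=100, gamma=0.99):
    # Monte Carlo estimate of J(theta): average discounted return over rollouts.
    returns = []
    for _ in range(n_episodes):
        state = env.reset()
        ret, discount = 0.0, 1.0
        for _ in range(horizon):
            action = policy(theta, state)              # sample a_t ~ pi_theta(.|s_t)
            state, reward, done, _ = env.step(action)
            ret += discount * reward                   # accumulate gamma^t * R
            discount *= gamma
            if done:
                break
        returns.append(ret)
    return np.mean(returns)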
Gradient vs Trust Region
Gradient methods optimize the performance through gradient updates:
\theta_{t+1} = \theta_t + \lambda \nabla_\theta J(\theta_t).
Trust-region methods perform a constrained optimization:
\max_{\theta} J(\theta) \quad \text{s.t.} \quad \theta \in I(\theta_t).
[Plots: the performance J^{\pi} as a function of \theta, showing a local optimum \theta_l, the current iterate \theta_t, and the global optimum \theta^*; for the trust-region method, the constraint set \theta \in I(\theta_t) around \theta_t is highlighted.]
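A minimal numerical illustration of the two update rules (not from the slides), on a hypothetical one-dimensional quadratic performance:

import numpy as np

J = lambda th: 10.0 - (th - 2.0) ** 2       # hypothetical performance J(theta)
grad_J = lambda th: -2.0 * (th - 2.0)       # its gradient

theta_t, lam, eps = 0.5, 0.3, 0.3

# Gradient update: theta_{t+1} = theta_t + lambda * grad J(theta_t)
theta_grad = theta_t + lam * grad_J(theta_t)

# Trust-region update: maximize J over the interval I(theta_t) = [theta_t - eps, theta_t + eps]
candidates = np.linspace(theta_t - eps, theta_t + eps, 1001)
theta_tr = candidates[np.argmax(J(candidates))]

# The gradient step is governed by the step size, the trust-region step by the constraint radius.
print(theta_grad, theta_tr)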
Trust Region methods
Relative Entropy Policy Search (REPS) (Peters et al., 2010).
Trust Region Policy Optimization (TRPO) (Schulman et al., 2015).
Proximal Policy Optimization (PPO) (Schulman et al., 2017).
Policy Optimization via Importance Sampling (POIS) (Metelli et al., 2018b).
Configurable MDP
A Configurable Markov Decision Process (Metelli et al., 2018a) (CMDP) is an MDP extension.
[Diagram: agent-environment loop as in the MDP case, but the environment now exposes a configuration. The agent (policy) observes state S_t and reward R_t and takes action A_t; the configurable environment returns R_{t+1} and S_{t+1}.]
Configurable MDP
Definition (CMDP performance)
The performance of a model-policy pair is:
J^{P,\pi} = J(\omega, \theta) = \mathbb{E}_{a_t \sim \pi(\cdot|s_t),\ s_{t+1} \sim P(\cdot|s_t, a_t)} \left[ \sum_{t=0}^{H-1} \gamma^t R(s_t, a_t, s_{t+1}) \right]. (3)
Definition (CMDP Solution)
The CMDP solution is the model-policy pair maximizing the performance:
(\omega^*, \theta^*) \in \operatorname*{arg\,max}_{\omega \in \Omega,\ \theta \in \Theta} J(\omega, \theta). (4)
State of the Art
Safe Policy Model Iteration (Metelli et al., 2018a): a safe approach (Pirotta et al., 2013) to solving CMDPs:
J^{P',\pi'} - J^{P,\pi}\ (\text{performance improvement}) \;\ge\; B(P', \pi') = \underbrace{\frac{\mathbb{A}^{P',\pi}_{P,\pi,\mu} + \mathbb{A}^{P,\pi'}_{P,\pi,\mu}}{1-\gamma}}_{\text{advantage term}} - \underbrace{\frac{\gamma\, \Delta q^{P,\pi}\, D}{2(1-\gamma)^2}}_{\text{dissimilarity penalization}}. (5)
Limitations:
Finite state-action spaces.
Full knowledge of the environment dynamics.
High sample complexity.
Relative Entropy Model Policy Search
We present REMPS, a novel algorithm for CMDPs:
Information Theoretic approach.
Optimization and Projection.
Approximate models.
Continuous state and action spaces.
We optimize the average reward:
J^{P,\pi} = \liminf_{H \to +\infty} \mathbb{E}_{a_t \sim \pi(\cdot|s_t),\ s_{t+1} \sim P(\cdot|s_t, a_t)} \left[ \frac{1}{H} \sum_{t=0}^{H-1} R(s_t, a_t, s_{t+1}) \right].
REMPS - Optimization
We define the following constrained optimization problem:
Primal
\max_{d}\ \mathbb{E}_d[R(s, a, s')] (objective function)
subject to:
D_{KL}(d \,\|\, d^{P,\pi}) \le \varepsilon (trust region)
\mathbb{E}_d[1] = 1 (d is well formed).
Dual
\min_{\eta}\ \eta \log \mathbb{E}_{d^{P,\pi}}\!\left[ e^{\varepsilon + R(s,a,s')/\eta} \right]
subject to:
\eta \ge 0.
d^{P,\pi}: sampling distribution.
\varepsilon: trust-region size.
d: stationary distribution over state, action and next state.
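A minimal sample-based sketch of the dual minimization (not from the slides), assuming an array rewards of per-transition rewards R(s, a, s') collected under d^{P,π} and eps equal to the trust-region size ε; the bounds on η are arbitrary placeholders.

import numpy as np
from scipy.optimize import minimize_scalar

def dual(eta, rewards, eps):
    # Sample-based dual: eta * log E[exp(eps + R/eta)], computed via a log-sum-exp for stability.
    z = eps + rewards / eta
    m = np.max(z)
    return eta * (m + np.log(np.mean(np.exp(z - m))))

def solve_dual(rewards, eps):
    # Minimize over eta > 0 on a bounded interval (hypothetical bounds).
    res = minimize_scalar(dual, bounds=(1e-6, 1e3), args=(rewards, eps), method="bounded")
    return res.x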
REMPS - Optimization Solution
Theorem (REMPS solution)
The solution of the REMPS problem is:
d'(s, a, s') = \frac{ d^{P,\pi}(s, a, s')\, \exp\!\big( R(s,a,s') / \eta^* \big) }{ \int_{\mathcal{S}} \int_{\mathcal{A}} \int_{\mathcal{S}} d^{P,\pi}(s, a, s')\, \exp\!\big( R(s,a,s') / \eta^* \big)\, \mathrm{d}s\, \mathrm{d}a\, \mathrm{d}s' } (6)
where \eta^* is the minimizer of the dual problem.
The probability of each (s, a, s') is reweighted exponentially in the reward.
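In practice (a sketch, not from the slides), the solution of Eq. (6) can be represented by reweighting the collected samples: each transition gets a weight proportional to exp(R(s, a, s')/η*).

import numpy as np

def remps_weights(rewards, eta):
    # Normalized weights proportional to exp(R/eta); subtracting the max keeps exp stable.
    z = rewards / eta
    w = np.exp(z - np.max(z))
    return w / np.sum(w)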
REMPS - Projection
[Figure: the optimized distribution d' lies inside the KL trust region D_{KL}(d' \,\|\, d^{P,\pi}) \le \varepsilon around the sampling distribution d^{P,\pi}, but in general outside the set D_{P,\Pi} of distributions induced by the parametric policy and configuration spaces; it is therefore projected onto D_{P,\Pi}.]
Three projection strategies:
d-Projection (discrete state-action spaces):
\operatorname*{arg\,min}_{\theta \in \Theta,\ \omega \in \Omega} D_{KL}\big( d' \,\|\, d^{\omega,\theta} \big).
State-Kernel Projection (discrete action spaces):
\operatorname*{arg\,min}_{\theta \in \Theta,\ \omega \in \Omega} D_{KL}\big( P^{\pi'} \,\|\, P^{\pi_\theta}_{\omega} \big).
Independent Projection (continuous state-action spaces):
\operatorname*{arg\,min}_{\theta \in \Theta} D_{KL}(\pi' \,\|\, \pi_\theta), \qquad \operatorname*{arg\,min}_{\omega \in \Omega} D_{KL}(P' \,\|\, P_\omega).
Algorithm
Algorithm 1 Relative Entropy Model Policy Search
1: for t = 0, 1, ... until convergence do
2:   Collect samples from \pi_{\theta_t}, P_{\omega_t}.
3:   Obtain \eta^*, the minimizer of the dual problem. (Optimization)
4:   Project d' according to the projection strategy: (Projection)
       a. d-Projection;
       b. State-Kernel Projection;
       c. Independent Projection.
5:   Update the policy.
6:   Update the model.
7: end for
8: return the policy-model pair (\pi_{\theta_t}, P_{\omega_t})
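A high-level sketch of Algorithm 1 with the independent projection (not the authors' implementation). collect_samples, fit_policy and fit_model are hypothetical helpers: the first gathers transitions under (π_θ, P_ω), the other two perform the weighted maximum-likelihood fits of the policy and of the model; solve_dual and remps_weights are the sketches given earlier.

def remps(theta, omega, eps, n_iterations):
    for t in range(n_iterations):
        states, actions, next_states, rewards = collect_samples(theta, omega)
        eta = solve_dual(rewards, eps)                        # Optimization step
        w = remps_weights(rewards, eta)                       # exponential reweighting of samples
        theta = fit_policy(states, actions, w)                # Projection: update the policy
        omega = fit_model(states, actions, next_states, w)    # Projection: update the model
    return theta, omega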
Finite–Sample Analysis
How much can the ideal performance differ from the approximate one?
d': solution of the Optimization step with infinitely many samples.
\hat{d}: solution of the Optimization step with N samples.
\widetilde{d}: solution of Optimization and Projection with N samples.
J_{d'} - J_{\widetilde{d}} = \underbrace{J_{d'} - J_{\hat{d}}}_{\text{OPT}} + \underbrace{J_{\hat{d}} - J_{\widetilde{d}}}_{\text{PROJ}}. (7)
Finite–Sample Analysis
How much can the ideal performance differ from the approximate one?
J_{d'} - J_{\widetilde{d}} \le r_{\max}\, \psi(N) \sqrt{ \frac{ 8 v \log\frac{2eN}{v} + 8 \log\frac{8}{\delta} }{ N } } (8)
+ r_{\max}\, \varphi\!\left( \sqrt[4]{ \frac{ 8 v \log\frac{2eN}{v} + 8 \log\frac{8}{\delta} }{ N } } \right) (9)
+ r_{\max} \sup_{d \in \mathcal{D}_{d^{P,\pi}}} \inf_{\bar{d} \in D_{P,\Pi}} \sqrt{ 2\, D_{KL}(d \,\|\, \bar{d}) }. (10)
Experimental Results - Chain
Motivations:
Visualize the behaviour of REMPS;
Overcome local minima;
Configure the transition function.
[Diagram: two-state chain MDP (S0, S1); each transition is labeled with the action (a0 or a1), its probability under the policy (θ or 1 − θ), the transition probability under the configuration (ω or 1 − ω), and the reward collected.]
[Plots: performance J as a function of the policy parameter θ and the configuration parameter ω, as a 3D surface and as a contour map.]
Experimental Results - Chain
[Plots: average reward per iteration, approaching the optimal J(π∗, P∗), and configuration parameter ω per iteration, approaching ω∗, comparing G(PO)MDP with REMPS for ε ∈ {1e−4, 1e−2, 1e−1, 10}.]
Experimental Results - Cartpole
Standard RL benchmark;
Configure the cart acceleration;
Approximate model.
Minimize the acceleration while balancing the pole:
R(s, a, s') = 10 - \frac{\omega^2}{20} - 20\,(1 - \cos\gamma).
[Diagram: cart-pole system with state variables x, \dot{x}, \gamma, \dot{\gamma}.]
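A direct transcription of the reward above (a sketch, not the author's code), where omega is the configured cart acceleration and angle is the pole angle denoted γ in the slide:

import numpy as np

def cartpole_reward(omega, angle):
    # Penalize large configured accelerations and deviations of the pole from the upright position.
    return 10.0 - omega ** 2 / 20.0 - 20.0 * (1.0 - np.cos(angle))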
Experimental Results - Cartpole
[Plots: return per iteration for REMPS and G(PO)MDP, with the ideal model (left) and the approximate model (right).]
Experimental Results - TORCS
TORCS: Car Racing simulation (Bernhard Wymann, 2013; Loiacono et al., 2010).
Autonomous driving and Configuration.
Continuous Control.
Approximate model.
We configure the rear wing and the brake system.
Experimental Results - TORCS
[Plots: average reward, rear wing configuration, and front-rear brake repartition per iteration (×10^2), comparing REMPS and REPS.]
Conclusions
Contributions:
REMPS, an algorithm able to solve the model-policy learning problem.
Finite-sample analysis.
Experimental evaluation.
Future research directions:
Adaptive KL constraint.
Other divergences.
Finite-time analysis.
Planned submission to ICML 2019.
Conclusions
Thank you for your attention
Questions?
Bibliography I
[Bernhard Wymann 2013] Bernhard Wymann; Christophe Guionneau; Christos Dimitrakakis; Rémi Coulom; Andrew S.: TORCS, The Open Racing Car Simulator. http://www.torcs.org, 2013.
[Loiacono et al. 2010] Loiacono, D.; Prete, A.; Lanzi, P. L.; Cardamone, L.: Learning to overtake in TORCS using simple reinforcement learning. In: IEEE Congress on Evolutionary Computation, July 2010, pp. 1–8.
[Metelli et al. 2018a] Metelli, Alberto M.; Mutti, Mirco; Restelli, Marcello: Configurable Markov Decision Processes. In: Dy, Jennifer (Ed.); Krause, Andreas (Ed.): Proceedings of the 35th International Conference on Machine Learning, vol. 80. Stockholmsmässan, Stockholm, Sweden: PMLR, 10–15 Jul 2018, pp. 3488–3497.
[Metelli et al. 2018b] Metelli, Alberto M.; Papini, Matteo; Faccio, Francesco; Restelli, Marcello: Policy Optimization via Importance Sampling. In: arXiv:1809.06098 [cs, stat], September 2018.
Bibliography II
[Peters et al. 2010] Peters, Jan; Mulling, Katharina; Altun, Yasemin: Relative Entropy Policy Search. 2010.
[Pirotta et al. 2013] Pirotta, Matteo; Restelli, Marcello; Pecorino, Alessio; Calandriello, Daniele: Safe Policy Iteration. In: Dasgupta, Sanjoy (Ed.); McAllester, David (Ed.): Proceedings of the 30th International Conference on Machine Learning, vol. 28. Atlanta, Georgia, USA: PMLR, 17–19 Jun 2013, pp. 307–315.
[Schulman et al. 2015] Schulman, John; Levine, Sergey; Moritz, Philipp; Jordan, Michael I.; Abbeel, Pieter: Trust Region Policy Optimization. In: arXiv:1502.05477 [cs], February 2015.
[Schulman et al. 2017] Schulman, John; Wolski, Filip; Dhariwal, Prafulla; Radford, Alec; Klimov, Oleg: Proximal Policy Optimization Algorithms. In: arXiv:1707.06347 [cs], July 2017.
[Sutton and Barto 1998] Sutton, Richard S.; Barto, Andrew G.: Introduction to Reinforcement Learning. 1st ed. Cambridge, MA, USA: MIT Press, 1998.
d-projection
Projection of the stationary distribution over (s, a, s'):
(\widetilde{\theta}, \widetilde{\omega}) = \operatorname*{arg\,min}_{\theta \in \Theta,\ \omega \in \Omega} D_{KL}\big( d'(s, a, s') \,\|\, d^{\omega,\theta}(s, a, s') \big)
\text{s.t.}\quad d^{\omega,\theta}(s') = \int_{\mathcal{S}} \int_{\mathcal{A}} d^{\omega,\theta}(s)\, \pi_\theta(a|s)\, P_\omega(s'|s, a)\, \mathrm{d}a\, \mathrm{d}s.
Projection of the State Kernel
Projection of the state kernel:
(\widetilde{\theta}, \widetilde{\omega}) = \operatorname*{arg\,min}_{\theta \in \Theta,\ \omega \in \Omega} \int_{\mathcal{S}} d(s)\, D_{KL}\big( P^{\pi'}(\cdot|s) \,\|\, P^{\pi_\theta}_{\omega}(\cdot|s) \big)\, \mathrm{d}s
= \operatorname*{arg\,max}_{\theta \in \Theta,\ \omega \in \Omega} \int_{\mathcal{S}} d(s) \int_{\mathcal{S}} P^{\pi'}(s'|s) \log P^{\pi_\theta}_{\omega}(s'|s)\, \mathrm{d}s'\, \mathrm{d}s
= \operatorname*{arg\,max}_{\theta \in \Theta,\ \omega \in \Omega} \int_{\mathcal{S}} d(s) \int_{\mathcal{S}} \int_{\mathcal{A}} \pi'(a|s)\, P'(s'|s, a) \log P^{\pi_\theta}_{\omega}(s'|s)\, \mathrm{d}a\, \mathrm{d}s'\, \mathrm{d}s
= \operatorname*{arg\,max}_{\theta \in \Theta,\ \omega \in \Omega} \int_{\mathcal{S}} \int_{\mathcal{A}} \int_{\mathcal{S}} d(s, a, s') \log \left( \int_{\mathcal{A}} P_\omega(s'|s, a')\, \pi_\theta(a'|s)\, \mathrm{d}a' \right) \mathrm{d}s'\, \mathrm{d}a\, \mathrm{d}s.
Independent Projection
Independent Projection of policy and model:
\widetilde{\theta} = \operatorname*{arg\,min}_{\theta \in \Theta} \int_{\mathcal{S}} d(s)\, D_{KL}\big( \pi'(\cdot|s) \,\|\, \pi_\theta(\cdot|s) \big)\, \mathrm{d}s (11)
= \operatorname*{arg\,min}_{\theta \in \Theta} \int_{\mathcal{S}} d(s) \int_{\mathcal{A}} \pi'(a|s) \log \frac{\pi'(a|s)}{\pi_\theta(a|s)}\, \mathrm{d}a\, \mathrm{d}s (12)
= \operatorname*{arg\,min}_{\theta \in \Theta} \int_{\mathcal{S}} \int_{\mathcal{A}} \int_{\mathcal{S}} d(s, a, s') \log \frac{\pi'(a|s)}{\pi_\theta(a|s)}\, \mathrm{d}s'\, \mathrm{d}a\, \mathrm{d}s (13)
= \operatorname*{arg\,max}_{\theta \in \Theta} \int_{\mathcal{S}} \int_{\mathcal{A}} \int_{\mathcal{S}} d(s, a, s') \log \pi_\theta(a|s)\, \mathrm{d}s'\, \mathrm{d}a\, \mathrm{d}s. (14)
The model projection \widetilde{\omega} is obtained analogously.
Projection
Theorem (Joint bound)
Let d^{P,\pi} denote the stationary distribution induced by model P and policy \pi, and d^{P',\pi'} the stationary distribution induced by model P' and policy \pi'. Assume the reward is uniformly bounded, i.e., for all s, s' \in \mathcal{S}, a \in \mathcal{A} it holds that |R(s, a, s')| < R_{\max}. Then the difference of performance can be upper bounded as:
|J^{P,\pi} - J^{P',\pi'}| \le R_{\max} \sqrt{ 2\, D_{KL}(d^{P,\pi} \,\|\, d^{P',\pi'}) }. (15)
Projection
Theorem (Disjoint bound)
Let d^{P,\pi} denote the stationary distribution induced by model P and policy \pi, and d^{P',\pi'} the one induced by model P' and policy \pi'. Assume the reward is uniformly bounded, i.e., for all s, s' \in \mathcal{S}, a \in \mathcal{A} it holds that |R(s, a, s')| < R_{\max}. If (P', \pi') admits a group-invertible state kernel P^{\pi'}, the difference of performance can be upper bounded as:
|J^{P,\pi} - J^{P',\pi'}| \le R_{\max}\, c_1\, \mathbb{E}_{s,a \sim d^{P,\pi}} \left[ \sqrt{2\, D_{KL}(\pi' \,\|\, \pi)} + \sqrt{2\, D_{KL}(P' \,\|\, P)} \right], (16)
where c_1 = 1 + \|A^{\#}\|_\infty and A^{\#} is the group inverse of the state kernel P^{\pi'}.
G(PO)MDP - Model extension
J^{P,\pi} = \int p_{\theta,\omega}(\tau)\, G(\tau)\, \mathrm{d}\tau
\nabla_\omega J^{P,\pi} = \int p_{\theta,\omega}(\tau)\, \nabla_\omega \log p_{\theta,\omega}(\tau)\, G(\tau)\, \mathrm{d}\tau (17)
= \int p_{\theta,\omega}(\tau) \left( \sum_{k=0}^{H} \nabla_\omega \log P_\omega(s_{k+1}|s_k, a_k) \right) G(\tau)\, \mathrm{d}\tau (18)
\nabla_\omega J^{P,\pi}_{G(PO)MDP} = \frac{1}{N} \sum_{n=1}^{N} \sum_{l=0}^{H} \left( \sum_{k=0}^{l} \nabla_\omega \log P_\omega(s^n_{k+1}|s^n_k, a^n_k) \right) \gamma^l R(s^n_l, a^n_l, s^n_{l+1}) (19)
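A sketch of the estimator in Eq. (19) (not the authors' code), under the hypothetical assumption that grad_log_P(omega, s, a, s_next) returns the gradient of log P_omega(s'|s, a) with respect to omega:

import numpy as np

def gpomdp_model_gradient(trajectories, omega, grad_log_P, gamma=0.99):
    # trajectories: list of episodes, each a list of (s, a, s_next, r) transitions.
    grads = []
    for traj in trajectories:
        g = 0.0
        cum_score = 0.0                                    # running sum of grad_omega log P_omega up to step l
        for l, (s, a, s_next, r) in enumerate(traj):
            cum_score = cum_score + grad_log_P(omega, s, a, s_next)
            g = g + (gamma ** l) * r * cum_score
        grads.append(g)
    return np.mean(grads, axis=0)                          # average over the N trajectories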