Deep Reinforcement Learning Through
Policy Optimization
John Schulman
October 20, 2016
Introduction and Overview
What is Reinforcement Learning?
Branch of machine learning concerned with taking
sequences of actions
Usually described in terms of agent interacting with a
previously unknown environment, trying to maximize
cumulative reward
[Figure: agent–environment loop — the agent sends an action; the environment returns an observation and reward]
What Is Deep Reinforcement Learning?
Reinforcement learning using neural networks to approximate
functions
Policies (select next action)
Value functions (measure goodness of states or
state-action pairs)
Models (predict next states and rewards)
Motor Control and Robotics
Robotics:
Observations: camera images, joint angles
Actions: joint torques
Rewards: stay balanced, navigate to target locations,
serve and protect humans
Business Operations
Inventory Management
Observations: current inventory levels
Actions: number of units of each item to purchase
Rewards: profit
In Other ML Problems
Hard Attention1
Observation: current image window
Action: where to look
Reward: classification
Sequential/structured prediction, e.g., machine
translation2
Observations: words in source language
Actions: emit word in target language
Rewards: sentence-level metric, e.g. BLEU score
1 V. Mnih et al. “Recurrent models of visual attention”. In: Advances in Neural Information Processing Systems. 2014, pp. 2204–2212.
2 H. Daumé III, J. Langford, and D. Marcu. “Search-based structured prediction”. In: Machine Learning 75.3 (2009), pp. 297–325; S. Ross, G. J. Gordon, and D. Bagnell. “A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning.” In: AISTATS. Vol. 1. 2. 2011, p. 6; M. Ranzato et al. “Sequence level training with recurrent neural networks”. In: arXiv preprint arXiv:1511.06732 (2015).
How Does RL Relate to Other ML Problems?
How Does RL Relate to Other ML Problems?
Supervised learning:
Environment samples input-output pair (xt, yt) ∼ ρ
Agent predicts ŷt = f(xt)
Agent receives loss ℓ(yt, ŷt)
Environment asks agent a question, and then tells her the
right answer
How Does RL Relate to Other ML Problems?
Contextual bandits:
Environment samples input xt ∼ ρ
Agent takes action ŷt = f(xt)
Agent receives cost ct ∼ P(ct | xt, ŷt) where P is an
unknown probability distribution
Environment asks agent a question, and gives her a noisy
score on her answer
Application: personalized recommendations
How Does RL Relate to Other ML Problems?
Reinforcement learning:
Environment samples input xt ∼ P(xt | xt−1, yt−1)
Input depends on your previous actions!
Agent takes action ŷt = f(xt)
Agent receives cost ct ∼ P(ct | xt, ŷt), where P is a
probability distribution unknown to the agent.
How Does RL Relate to Other Machine Learning
Problems?
Summary of differences between RL and supervised learning:
You don’t have full access to the function you’re trying to
optimize—must query it through interaction.
Interacting with a stateful world: inputs xt depend on your
previous actions
Should I Use Deep RL On My Practical Problem?
Might be overkill
Other methods worth investigating first
Derivative-free optimization (simulated annealing, cross
entropy method, SPSA)
Is it a contextual bandit problem?
Non-deep RL methods developed by Operations
Research community3
3 W. B. Powell. Approximate Dynamic Programming: Solving the curses of dimensionality. Vol. 703. John Wiley & Sons, 2007.
Recent Success Stories in Deep RL
ATARI using deep Q-learning4, policy gradients5, DAGGER6
Superhuman Go using supervised learning + policy
gradients + Monte Carlo tree search + value functions7
Robotic manipulation using guided policy search8
Robotic locomotion using policy gradients9
3D games using policy gradients10
4 V. Mnih et al. “Playing Atari with Deep Reinforcement Learning”. In: arXiv preprint arXiv:1312.5602 (2013).
5 J. Schulman et al. “Trust Region Policy Optimization”. In: arXiv preprint arXiv:1502.05477 (2015).
6 X. Guo et al. “Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning”. In: Advances in Neural Information Processing Systems. 2014, pp. 3338–3346.
7 D. Silver et al. “Mastering the game of Go with deep neural networks and tree search”. In: Nature 529.7587 (2016), pp. 484–489.
8 S. Levine et al. “End-to-end training of deep visuomotor policies”. In: arXiv preprint arXiv:1504.00702 (2015).
9 J. Schulman et al. “High-dimensional continuous control using generalized advantage estimation”. In: arXiv preprint arXiv:1506.02438 (2015).
10 V. Mnih et al. “Asynchronous Methods for Deep Reinforcement Learning”. In: arXiv preprint arXiv:1602.01783 (2016).
Markov Decision Processes
Definition
Markov Decision Process (MDP) defined by (S, A, P),
where
S: state space
A: action space
P(r, s′ | s, a): transition + reward probability distribution
Extra objects defined depending on problem setting
µ: Initial state distribution
Optimization problem: maximize expected cumulative
reward
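To make these objects concrete, here is a minimal tabular-MDP sketch in Python. The two-state dynamics and all names are invented for illustration; the slides do not specify any implementation.

```python
import numpy as np

# A minimal tabular MDP (hypothetical toy example, not from the slides).
# States and actions are integers; P[(s, a)] lists (probability, next_state, reward).
P = {
    (0, 0): [(1.0, 0, 0.0)],                  # stay in state 0, no reward
    (0, 1): [(0.8, 1, 1.0), (0.2, 0, 0.0)],   # try to move to state 1
    (1, 0): [(1.0, 0, 0.0)],
    (1, 1): [(1.0, 1, 1.0)],                  # stay in state 1, reward 1
}
mu = np.array([1.0, 0.0])  # initial state distribution: always start in state 0

def step(s, a, rng):
    """Sample (next_state, reward) from P(r, s' | s, a)."""
    probs, next_states, rewards = zip(*P[(s, a)])
    i = rng.choice(len(probs), p=probs)
    return next_states[i], rewards[i]

rng = np.random.default_rng(0)
s = rng.choice(2, p=mu)     # s0 ~ mu
print(step(s, 1, rng))      # e.g. (1, 1.0)
```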
Episodic Setting
In each episode, the initial state is sampled from µ, and
the agent acts until the terminal state is reached. For
example:
Taxi robot reaches its destination (termination = good)
Waiter robot finishes a shift (fixed time)
Walking robot falls over (termination = bad)
Goal: maximize expected reward per episode
Policies
Deterministic policies: a = π(s)
Stochastic policies: a ∼ π(a | s)
Episodic Setting
s0 ∼ µ(s0)
a0 ∼ π(a0 | s0)
s1, r0 ∼ P(s1, r0 | s0, a0)
a1 ∼ π(a1 | s1)
s2, r1 ∼ P(s2, r1 | s1, a1)
. . .
aT−1 ∼ π(aT−1 | sT−1)
sT , rT−1 ∼ P(sT , rT−1 | sT−1, aT−1)
Objective:
maximize η(π), where
η(π) = E[r0 + r1 + · · · + rT−1 | π]
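The sampling process above maps directly onto a rollout loop. A minimal sketch, assuming a hypothetical env object with reset() → s0 and step(a) → (s′, r, done), and a policy function that samples a ∼ π(· | s):

```python
def rollout(env, policy, max_steps=1000):
    """Sample one episode: s0 ~ mu, a_t ~ pi(. | s_t), (s_{t+1}, r_t) ~ P."""
    states, actions, rewards = [], [], []
    s = env.reset()                    # s0 ~ mu
    for _ in range(max_steps):
        a = policy(s)                  # a_t ~ pi(a_t | s_t)
        s_next, r, done = env.step(a)  # (s_{t+1}, r_t) ~ P(. | s_t, a_t)
        states.append(s)
        actions.append(a)
        rewards.append(r)
        s = s_next
        if done:                       # terminal state s_T reached
            break
    return states, actions, rewards    # sum(rewards) estimates eta(pi)
```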
Episodic Setting
[Figure: agent–environment interaction — µ produces s0; the agent’s π produces a0, . . . , aT−1; the environment’s P produces s1, . . . , sT and rewards r0, . . . , rT−1]
Objective:
maximize η(π), where
η(π) = E[r0 + r1 + · · · + rT−1 | π]
Parameterized Policies
A family of policies indexed by parameter vector θ ∈ Rd
Deterministic: a = π(s, θ)
Stochastic: π(a | s, θ)
Analogous to classification or regression with input s,
output a.
Discrete action space: network outputs vector of
probabilities
Continuous action space: network outputs mean and
diagonal covariance of Gaussian
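A sketch of the two output parameterizations just described, using plain numpy and linear function approximators; the architecture and all names are illustrative assumptions, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Discrete action space: the policy outputs a vector of action probabilities.
# A linear "network" W is used here for brevity.
def discrete_policy(s, W):                  # W: (n_actions, state_dim)
    probs = softmax(W @ s)
    return rng.choice(len(probs), p=probs)  # a ~ pi(a | s, theta)

# Continuous action space: the policy outputs the mean and (log) diagonal
# standard deviation of a Gaussian over actions.
def gaussian_policy(s, W_mu, log_std):      # W_mu: (action_dim, state_dim)
    mean = W_mu @ s
    std = np.exp(log_std)                   # covariance = diag(std**2)
    return mean + std * rng.standard_normal(mean.shape)
```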
Black-Box Optimization Methods
Derivative Free Optimization Approach
Objective:
maximize E[R | π(·, θ)]
View θ → [episode rollout] → R as a black box
Ignore all information other than R collected during the episode
Cross-Entropy Method
Evolutionary algorithm
Works embarrassingly well
I. Szita and A. Lőrincz. “Learning Tetris using
the noisy cross-entropy method”. In: Neural
computation 18.12 (2006), pp. 2936–2941
V. Gabillon, M. Ghavamzadeh, and
B. Scherrer. “Approximate Dynamic
Programming Finally Performs Well in the
Game of Tetris”. In: Advances in Neural
Information Processing Systems. 2013
Cross-Entropy Method
Evolutionary algorithm
Works embarrassingly well
A similar algorithm, Covariance Matrix Adaptation, has
become standard in graphics.
Cross-Entropy Method
Initialize µ ∈ Rd, σ ∈ Rd
for iteration = 1, 2, . . . do
    Collect n samples of θi ∼ N(µ, diag(σ))
    Perform a noisy evaluation Ri ∼ θi
    Select the top p% of samples (e.g. p = 20), which we’ll call the elite set
    Fit a Gaussian distribution, with diagonal covariance, to the elite set, obtaining a new µ, σ.
end for
Return the final µ.
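A direct numpy translation of the pseudocode above (a sketch, treating σ as a vector of standard deviations; the σ noise floor is a common practical tweak, and all hyperparameter values are arbitrary):

```python
import numpy as np

def cross_entropy_method(f, dim, n_samples=100, elite_frac=0.2,
                         n_iters=300, noise=0.1, seed=0):
    """Maximize a noisy black-box f(theta) -> R."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim)
    n_elite = int(n_samples * elite_frac)
    for _ in range(n_iters):
        thetas = mu + sigma * rng.standard_normal((n_samples, dim))  # theta_i ~ N(mu, diag(sigma^2))
        returns = np.array([f(th) for th in thetas])                 # noisy evaluations R_i
        elite = thetas[np.argsort(returns)[-n_elite:]]               # top p% of samples
        mu = elite.mean(axis=0)                                      # refit diagonal Gaussian
        sigma = elite.std(axis=0) + noise                            # keep a little exploration
    return mu

# Usage on a toy objective with optimum at theta = (3, 3, 3), plus evaluation noise:
rng = np.random.default_rng(1)
f = lambda th: -np.sum((th - 3.0) ** 2) + 0.1 * rng.standard_normal()
print(cross_entropy_method(f, dim=3))   # roughly [3, 3, 3]
```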
Cross-Entropy Method
Analysis: a very similar algorithm is a minorization-maximization (MM) algorithm, guaranteed to monotonically increase the expected reward
Recall that the Monte-Carlo EM algorithm collects samples, reweights them, and then maximizes their logprob
We can derive an MM algorithm where at each iteration you maximize ∑i Ri log p(θi)
Policy Gradient Methods
Policy Gradient Methods: Overview
Problem:
maximize E[R | πθ]
Intuitions: collect a bunch of trajectories, and ...
1. Make the good trajectories more probable
2. Make the good actions more probable
3. Push the actions towards good actions (DPG11, SVG12)
11 D. Silver et al. “Deterministic policy gradient algorithms”. In: ICML. 2014.
12 N. Heess et al. “Learning continuous control policies by stochastic value gradients”. In: Advances in Neural Information Processing Systems. 2015, pp. 2926–2934.
Score Function Gradient Estimator
Consider an expectation Ex∼p(x | θ)[f(x)]. We want to compute the gradient with respect to θ:
∇θ Ex[f(x)] = ∇θ ∫ dx p(x | θ) f(x)
            = ∫ dx ∇θ p(x | θ) f(x)
            = ∫ dx p(x | θ) (∇θ p(x | θ) / p(x | θ)) f(x)
            = ∫ dx p(x | θ) ∇θ log p(x | θ) f(x)
            = Ex[f(x) ∇θ log p(x | θ)]
The last expression gives us an unbiased gradient estimator. Just sample xi ∼ p(x | θ), and compute ĝi = f(xi) ∇θ log p(xi | θ).
We need to be able to compute and differentiate the density p(x | θ) with respect to θ.
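A quick numerical sanity check of the estimator (illustrative, not from the slides): for x ∼ N(θ, 1) the score is ∇θ log p(x | θ) = x − θ, and we can compare the Monte Carlo estimate against the exact gradient of E[f(x)].

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n = 1.0, 200_000
f = lambda x: (x > 2.0).astype(float)   # works even though f is discontinuous

x = theta + rng.standard_normal(n)      # x_i ~ p(x | theta) = N(theta, 1)
g_hat = np.mean(f(x) * (x - theta))     # score function estimator

# Exact answer: E[f(x)] = P(x > 2) = 1 - Phi(2 - theta), whose derivative
# wrt theta is the standard normal pdf evaluated at (2 - theta).
exact = np.exp(-0.5 * (2 - theta) ** 2) / np.sqrt(2 * np.pi)
print(g_hat, exact)                     # both roughly 0.242
```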
Score Function Gradient Estimator: Intuition
ĝi = f(xi) ∇θ log p(xi | θ)
Let’s say that f(x) measures how good the sample x is.
Moving in the direction ĝi pushes up the logprob of the sample, in proportion to how good it is
Valid even if f(x) is discontinuous or unknown, or the sample space (containing x) is a discrete set
Score Function Gradient Estimator: Intuition
ĝi = f(xi) ∇θ log p(xi | θ)
Score Function Gradient Estimator for Policies
Now the random variable x is a whole trajectory
τ = (s0, a0, r0, s1, a1, r1, . . . , sT−1, aT−1, rT−1, sT )
∇θ Eτ[R(τ)] = Eτ[∇θ log p(τ | θ) R(τ)]
Just need to write out p(τ | θ):
p(τ | θ) = µ(s0) ∏_{t=0}^{T−1} [π(at | st, θ) P(st+1, rt | st, at)]
log p(τ | θ) = log µ(s0) + ∑_{t=0}^{T−1} [log π(at | st, θ) + log P(st+1, rt | st, at)]
∇θ log p(τ | θ) = ∇θ ∑_{t=0}^{T−1} log π(at | st, θ)
∇θ Eτ[R] = Eτ[R ∇θ ∑_{t=0}^{T−1} log π(at | st, θ)]
Interpretation: using good trajectories (high R) as supervised
examples in classification / regression
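As a sketch of that interpretation, here is the estimator ĝ = R(τ) ∑t ∇θ log π(at | st, θ) for a linear softmax policy; the parameterization is an assumption made for illustration.

```python
import numpy as np

def grad_log_softmax(W, s, a):
    """Gradient wrt W of log pi(a | s) for pi = softmax(W s): (onehot(a) - pi) s^T."""
    z = W @ s
    pi = np.exp(z - z.max())
    pi /= pi.sum()
    onehot = np.zeros_like(pi)
    onehot[a] = 1.0
    return np.outer(onehot - pi, s)

def trajectory_gradient(W, states, actions, rewards):
    """g_hat = R(tau) * sum_t grad log pi(a_t | s_t) -- the estimator above."""
    R = sum(rewards)
    return R * sum(grad_log_softmax(W, s, a) for s, a in zip(states, actions))
```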
Policy Gradient: Use Temporal Structure
Previous slide:
∇θ Eτ[R] = Eτ[(∑_{t=0}^{T−1} rt) (∑_{t=0}^{T−1} ∇θ log π(at | st, θ))]
We can repeat the same argument to derive the gradient estimator for a single reward term rt:
∇θ E[rt] = E[rt ∑_{t′=0}^{t} ∇θ log π(at′ | st′, θ)]
Summing this formula over t, we obtain
∇θ E[R] = E[∑_{t=0}^{T−1} rt ∑_{t′=0}^{t} ∇θ log π(at′ | st′, θ)]
        = E[∑_{t=0}^{T−1} ∇θ log π(at | st, θ) ∑_{t′=t}^{T−1} rt′]
Policy Gradient: Introduce Baseline
Further reduce variance by introducing a baseline b(s):
∇θ Eτ[R] = Eτ[∑_{t=0}^{T−1} ∇θ log π(at | st, θ) (∑_{t′=t}^{T−1} rt′ − b(st))]
For any choice of b, the gradient estimator is unbiased.
A near-optimal choice is the expected return,
b(st) ≈ E[rt + rt+1 + rt+2 + · · · + rT−1]
Interpretation: increase the logprob of action at proportionally to how much the returns ∑_{t′=t}^{T−1} rt′ are better than expected
Discounts for Variance Reduction
Introduce discount factor γ, which ignores delayed effects between actions and rewards:
∇θ Eτ[R] ≈ Eτ[∑_{t=0}^{T−1} ∇θ log π(at | st, θ) (∑_{t′=t}^{T−1} γ^{t′−t} rt′ − b(st))]
Now, we want
b(st) ≈ E[rt + γ rt+1 + γ² rt+2 + · · · + γ^{T−1−t} rT−1]
Write the gradient estimator more generally as
∇θ Eτ[R] ≈ Eτ[∑_{t=0}^{T−1} ∇θ log π(at | st, θ) Ât]
where Ât is the advantage estimate.
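The discounted return-to-go ∑_{t′=t} γ^{t′−t} rt′ that feeds into Ât is typically computed with a single backward pass over an episode; a minimal sketch:

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """R_t = sum_{t'=t}^{T-1} gamma^(t'-t) r_t', via one backward pass."""
    R = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        R[t] = running
    return R

# Advantage estimates given any baseline values b(s_t): A_hat_t = R_t - b(s_t).
print(discounted_returns([0.0, 0.0, 1.0], gamma=0.99))  # [0.9801, 0.99, 1.0]
```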
“Vanilla” Policy Gradient Algorithm
Initialize policy parameter θ, baseline b
for iteration = 1, 2, . . . do
    Collect a set of trajectories by executing the current policy
    At each timestep in each trajectory, compute
        the return Rt = ∑_{t′=t}^{T−1} γ^{t′−t} rt′, and
        the advantage estimate Ât = Rt − b(st).
    Re-fit the baseline, by minimizing ‖b(st) − Rt‖², summed over all trajectories and timesteps.
    Update the policy, using a policy gradient estimate ĝ, which is a sum of terms ∇θ log π(at | st, θ) Ât
end for
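Putting the pieces together, here is a self-contained numpy sketch of the algorithm on an invented two-state MDP. Everything below — the environment, the tabular softmax policy, and the hyperparameters — is a toy assumption for illustration, not the slides’ reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MDP (hypothetical): action 1 moves toward state 1, which pays reward 1.
def env_step(s, a):
    s_next = 1 if (a == 1 and rng.random() < 0.9) else 0
    return s_next, float(s_next == 1)

def policy_probs(W, s):            # tabular softmax policy pi(a | s, theta)
    z = W[:, s]
    p = np.exp(z - z.max())
    return p / p.sum()

gamma, lr = 0.99, 0.1
W = np.zeros((2, 2))               # policy parameters theta: W[a, s]
b = np.zeros(2)                    # baseline b(s)

for iteration in range(200):
    grad = np.zeros_like(W)
    returns_by_state = {0: [], 1: []}
    for _ in range(10):            # collect trajectories under the current policy
        s, traj = 0, []
        for _ in range(20):
            a = rng.choice(2, p=policy_probs(W, s))
            s_next, r = env_step(s, a)
            traj.append((s, a, r))
            s = s_next
        R = 0.0
        for s_t, a_t, r_t in reversed(traj):   # returns R_t, then A_t = R_t - b(s_t)
            R = r_t + gamma * R
            A = R - b[s_t]
            pi = policy_probs(W, s_t)
            grad[:, s_t] -= pi * A             # grad log pi(a_t | s_t) = onehot(a_t) - pi
            grad[a_t, s_t] += A
            returns_by_state[s_t].append(R)
    for s_t, Rs in returns_by_state.items():   # re-fit baseline: minimize (b(s_t) - R_t)^2
        if Rs:
            b[s_t] = np.mean(Rs)
    W += lr * grad / 10                        # ascend the policy gradient estimate g_hat
```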
Extension: Step Sizes and Trust Regions
Why are step sizes a big deal in RL?
Supervised learning
Step too far → next update will fix it
Reinforcement learning
Step too far → bad policy
Next batch: collected under bad policy
Can’t recover, collapse in performance!
Extension: Step Sizes and Trust Regions
Trust Region Policy Optimization: limit the KL divergence between the action distributions of the pre-update and post-update policy13
Es[DKL(πold(· | s) ‖ π(· | s))] ≤ δ
Closely related to previous natural policy gradient methods14
13 J. Schulman et al. “Trust Region Policy Optimization”. In: arXiv preprint arXiv:1502.05477 (2015).
14 S. Kakade. “A Natural Policy Gradient.” In: NIPS. Vol. 14. 2001, pp. 1531–1538; J. A. Bagnell and J. Schneider. “Covariant policy search”. In: IJCAI. 2003; J. Peters and S. Schaal. “Natural actor-critic”. In: Neurocomputing 71.7 (2008), pp. 1180–1190.
Extension: Further Variance Reduction
Use value functions for more variance reduction (at the
cost of bias): actor-critic methods15
Reparameterization trick: instead of increasing the
probability of the good actions, push the actions towards
(hopefully) better actions16
15 J. Schulman et al. “High-dimensional continuous control using generalized advantage estimation”. In: arXiv preprint arXiv:1506.02438 (2015); V. Mnih et al. “Asynchronous Methods for Deep Reinforcement Learning”. In: arXiv preprint arXiv:1602.01783 (2016).
16 D. Silver et al. “Deterministic policy gradient algorithms”. In: ICML. 2014; N. Heess et al. “Learning continuous control policies by stochastic value gradients”. In: Advances in Neural Information Processing Systems. 2015, pp. 2926–2934.
Demo
Fin
Thank you. Questions?