SlideShare uma empresa Scribd logo
1 de 36
Baixar para ler offline
Proximal Policy Optimization Algorithms,
Schulman et al, 2017
옥찬호
utilForever@gmail.com
Proximal Policy Optimization Algorithms, Schulman et al, 2017Introduction
• In recent years, several different approaches have been
proposed for reinforcement learning with neural network
function approximators. The leading contenders are deep Q-
learning, “vanilla” policy gradient methods, and trust region /
natural policy gradient methods.
• However, there is room for improvement in developing a
method that is scalable (to large models and parallel
implementations), data efficient, and robust (i.e., successful
on a variety of problems without hyperparameter tuning).
Proximal Policy Optimization Algorithms, Schulman et al, 2017Introduction
• DQN
• fails on many simple problems and is poorly understood
• A3C – “Vanilla” policy gradient methods
• have poor data efficiency and robustness
• TRPO
• relatively complicated
• is not compatible with architectures that include noise (such as dropout) or
parameter sharing (between the policy and value function, or with auxiliary
tasks)
Proximal Policy Optimization Algorithms, Schulman et al, 2017Introduction
• We propose a novel objective with clipped probability ratios,
which forms a pessimistic estimate (i.e., lower bound) of the
performance of the policy.
• This paper seeks to improve the current state of affairs by
introducing an algorithm that attains the data efficiency and
reliable performance of TRPO, while using only first-order
optimization.
Proximal Policy Optimization Algorithms, Schulman et al, 2017Introduction
• To optimize policies, we alternate between sampling data from
the policy and performing several epochs of optimization on
the sampled data.
Proximal Policy Optimization Algorithms, Schulman et al, 2017Introduction
• Our experiments compare the performance of various different
versions of the surrogate objective, and find that the version
with the clipped probability ratios performs best.
• We also compare PPO to several previous algorithms from the
literature.
• On continuous control tasks, it performs better than the algorithms we
compare against.
• On Atari, it performs significantly better (in terms of sample complexity) than
A2C and similarly to ACER though it is much simpler.
PPO, Schulman et al, 2017Background: Policy Optimization
• Policy gradient methods work by computing an estimator of
the policy gradient and plugging it into a stochastic gradient
ascent algorithm. The most commonly used gradient estimator
has the form
ො𝑔 = ෡𝔼 𝑡 ∇ 𝜃 log 𝜋 𝜃 𝑎 𝑡 𝑠𝑡
መ𝐴 𝑡
• where 𝜋 𝜃 is a stochastic policy and መ𝐴 𝑡 is an estimator of the advantage
function at timestep 𝑡. Here, the expectation ෡𝔼 𝑡 … indicates the empirical
average over a finite batch of samples, in an algorithm that alternates
between sampling and optimization.
PPO, Schulman et al, 2017Background: Policy Optimization
• Implementations that use automatic differentiation software
work by constructing an objective function whose gradient is
the policy gradient estimator; the estimator ො𝑔 is obtained by
differentiating the objective
𝐿 𝑃𝐺
𝜃 = ෡𝔼 𝑡 log 𝜋 𝜃 𝑎 𝑡 𝑠𝑡
መ𝐴 𝑡
PPO, Schulman et al, 2017Background: Policy Optimization
• While it is appealing to perform multiple steps of optimization
on this loss 𝐿 𝑃𝐺
using the same trajectory, doing so is not well-
justified, and empirically it often leads to destructively large
policy updates.
PPO, Schulman et al, 2017Background: Policy Optimization
• In TRPO, an objective function (the “surrogate” objective) is
maximized subject to a constraint on the size of the policy
update. Specifically,
maximize
𝜃
෡𝔼 𝑡
𝜋 𝜃(𝑎 𝑡|𝑠𝑡)
𝜋 𝜃old
(𝑎 𝑡|𝑠𝑡)
መ𝐴 𝑡
subject to ෡𝔼 𝑡 𝐾𝐿 𝜋 𝜃old
∙ 𝑠𝑡 , 𝜋 𝜃 ∙ 𝑠𝑡 ≤ 𝛿
• Here, 𝜃old is the vector of policy parameters before the update.
PPO, Schulman et al, 2017Background: Policy Optimization
• This problem can efficiently be approximately solved using the
conjugate gradient algorithm, after making a linear
approximation to the objective and a quadratic approximation
to the constraint.
PPO, Schulman et al, 2017Background: Policy Optimization
• The theory justifying TRPO actually suggests using a penalty
instead of a constraint, i.e., solving the unconstrained
optimization problem for some coefficient 𝛽.
maximize
𝜃
෡𝔼 𝑡
𝜋 𝜃(𝑎 𝑡|𝑠𝑡)
𝜋 𝜃old
(𝑎 𝑡|𝑠𝑡)
መ𝐴 𝑡 − 𝛽𝐾𝐿 𝜋 𝜃old
∙ 𝑠𝑡 , 𝜋 𝜃 ∙ 𝑠𝑡
PPO, Schulman et al, 2017Background: Policy Optimization
• This follows from the fact that a certain surrogate objective
(which computes the max KL over states instead of the mean)
forms a lower bound (i.e., a pessimistic bound) on the
performance of the policy 𝜋.
• TRPO uses a hard constraint rather than a penalty because it is
hard to choose a single value of 𝜷 that performs well across
different problems—or even within a single problem, where the
characteristics change over the course of learning.
PPO, Schulman et al, 2017Clipped Surrogate Objective
• Let 𝑟𝑡(𝜃) denote the probability ratio 𝑟𝑡 𝜃 =
𝜋 𝜃(𝑎 𝑡|𝑠 𝑡)
𝜋 𝜃old
(𝑎 𝑡|𝑠 𝑡)
,
so 𝑟𝑡 𝜃old = 1. TRPO maximizes a “surrogate” objective
𝐿 𝐶𝑃𝐼
𝜃 = ෡𝔼 𝑡
𝜋 𝜃(𝑎 𝑡|𝑠𝑡)
𝜋 𝜃old
(𝑎 𝑡|𝑠𝑡)
መ𝐴 𝑡 = ෡𝔼 𝑡 𝑟𝑡 𝜃 መ𝐴 𝑡
• The superscript CPI refers to conservative policy iteration,
where this objective was proposed.
PPO, Schulman et al, 2017Clipped Surrogate Objective
• Without a constraint, maximization of 𝐿 𝐶𝑃𝐼
would lead to an
excessively large policy update; hence, we now consider how
to modify the objective, to penalize changes to the policy that
move 𝑟𝑡(𝜃) away from 1.
• The main objective we propose is the following:
𝐿 𝐶𝐿𝐼𝑃
𝜃 = ෡𝔼 𝑡 min(𝑟𝑡 𝜃 መ𝐴 𝑡, clip(𝑟𝑡 𝜃 , 1 − 𝜖, 1 + 𝜖) መ𝐴 𝑡)
• where epsilon is a hyperparameter, say, 𝜖 = 0.2.
PPO, Schulman et al, 2017Clipped Surrogate Objective
• The motivation for this objective is as follows.
• The first term inside the min is 𝐿 𝐶𝑃𝐼 𝜃 .
• The second term, clip(𝑟𝑡 𝜃 , 1 − 𝜖, 1 + 𝜖) መ𝐴 𝑡, modifies the surrogate objective
by clipping the probability ratio, which removes the incentive for moving
𝑟𝑡 𝜃 outside of the interval 1 − 𝜖, 1 + 𝜖 .
• Finally, we take the minimum of the clipped and unclipped objective, so the
final objective is a lower bound (i.e., a pessimistic bound) on the unclipped
objective.
PPO, Schulman et al, 2017Clipped Surrogate Objective
Figure 1: Plots showing one term (i.e., a single timestep) of the surrogate function
𝐿 𝐶𝐿𝐼𝑃
as a function of the probability ratio 𝑟, for positive advantages (left) and negative
advantages (right). The red circle on each plot shows the starting point for the
optimization, i.e., 𝑟 = 1. Note that 𝐿 𝐶𝐿𝐼𝑃
sums many of these terms.
PPO, Schulman et al, 2017Clipped Surrogate Objective
Figure 2: Surrogate objectives, as we interpolate between the initial policy parameter
𝜃old, and the updated policy parameter, which we compute after one iteration of PPO.
The updated policy has a KL divergence of about 0.02 from the initial policy, and this is
the point at which 𝐿 𝐶𝐿𝐼𝑃
is maximal.
PPO, Schulman et al, 2017Adaptive KL Penalty Coefficient
• Another approach, which can be used as an alternative to the
clipped surrogate objective, or in addition to it, is to use a
penalty on KL divergence, and to adapt the penalty coefficient
so that we achieve some target value of the KL divergence
𝑑targ each policy update.
• In our experiments, we found that the KL penalty performed worse than the
clipped surrogate objective, however, we’ve included it here because it’s an
important baseline.
PPO, Schulman et al, 2017Adaptive KL Penalty Coefficient
• In the simplest instantiation of this algorithm, we perform the
following steps in each policy update:
• Using several epochs of minibatch SGD, optimize the KL-penalized objective
𝐿 𝐾𝐿𝑃𝐸𝑁 𝜃 = ෡𝔼 𝑡
𝜋 𝜃(𝑎 𝑡|𝑠𝑡)
𝜋 𝜃old
(𝑎 𝑡|𝑠𝑡)
መ𝐴 𝑡 − 𝛽𝐾𝐿 𝜋 𝜃old
∙ 𝑠𝑡 , 𝜋 𝜃 ∙ 𝑠𝑡
• Compute 𝒅 = ෡𝔼 𝒕 𝑲𝑳 𝝅 𝜽 𝒐𝒍𝒅
∙ 𝒔 𝒕 , 𝝅 𝜽 ∙ 𝒔 𝒕
• If 𝑑 < Τ𝑑 𝑡𝑎𝑟𝑔 1.5 , 𝛽 ← Τ𝛽 2
• If 𝑑 > 𝑑 𝑡𝑎𝑟𝑔 × 1.5, 𝛽 ← 𝛽 × 2
• The updated 𝛽 is used for the next policy update.
Proximal Policy Optimization Algorithms, Schulman et al, 2017Algorithm
• Most techniques for computing variance-reduced advantage-
function estimators make use a learned state-value function 𝑉(𝑠)
• Generalized advantage estimation
• The finite-horizon estimators
Proximal Policy Optimization Algorithms, Schulman et al, 2017Algorithm
• If using a neural network architecture that shares parameters
between the policy and value function, we must use a loss function
that combines the policy surrogate and a value function error term.
• This objective can further be augmented by adding an entropy bonus
to ensure sufficient exploration, as suggested in past work (A3C).
Proximal Policy Optimization Algorithms, Schulman et al, 2017Algorithm
• Combining these terms, we obtain the following objective,
which is (approximately) maximized each iteration:
𝐿 𝑡
𝐶𝐿𝐼𝑃+𝑉𝐹+𝑆
𝜃 = ෡𝔼 𝑡 𝐿 𝑡
𝐶𝐿𝐼𝑃
𝜃 − 𝑐1 𝐿 𝑡
𝑉𝐹
𝜃 + 𝑐2 𝑆 𝜋 𝜃 𝑠𝑡
• where 𝑐1, 𝑐2 are coefficients, and 𝑆 denotes an entropy bonus,
and 𝐿 𝑡
𝑉𝐹
is a squared-error loss 𝑉𝜃 𝑠𝑡 − 𝑉𝑡
targ 2
.
Proximal Policy Optimization Algorithms, Schulman et al, 2017Algorithm
• One style of policy gradient implementation, popularized in
A3C and well-suited for use with recurrent neural networks,
runs the policy for 𝑻 timesteps (where 𝑻 is much less than the
episode length), and uses the collected samples for an update.
Proximal Policy Optimization Algorithms, Schulman et al, 2017Algorithm
• This style requires an advantage estimator that does not look
beyond timestep 𝑻. The estimator used by A3C is
መ𝐴 𝑡 = −𝑉 𝑠𝑡 + 𝑟𝑡 + 𝛾𝑟𝑡+1 + ⋯ + 𝛾 𝑇−𝑡+1
𝑟 𝑇−1 + 𝛾 𝑇−𝑡
𝑉 𝑠 𝑇
• where 𝑡 specifies the time index in 0, 𝑇 ,
within a given length-𝑇 trajectory segment.
Proximal Policy Optimization Algorithms, Schulman et al, 2017Algorithm
• Generalizing this choice, we can use a truncated version of
generalized advantage estimation, which reduces to previous
equation when 𝜆 = 1:
መ𝐴 𝑡 = 𝛿𝑡 + 𝛾𝜆 𝛿𝑡+1 + ⋯ + ⋯ + 𝛾𝜆 𝑇−𝑡+1
𝛿 𝑇−1
where 𝛿𝑡 = 𝑟𝑡 + 𝛾𝑉 𝑠𝑡+1 − 𝑉 𝑠𝑡
Proximal Policy Optimization Algorithms, Schulman et al, 2017Algorithm
• A proximal policy optimization (PPO) algorithm that uses
fixed-length trajectory segments is shown below.
• Each iteration, each of 𝑁 (parallel) actors collect 𝑇 timesteps of data.
• Then we construct the surrogate loss on these 𝑁𝑇 timesteps of data, and
optimize it with minibatch SGD (or usually for better performance, Adam),
for 𝐾 epochs.
Proximal Policy Optimization Algorithms, Schulman et al, 2017Experiments
• First, we compare several different surrogate objectives under
different hyperparameters. Here, we compare the surrogate
objective 𝐿 𝐶𝐿𝐼𝑃
to several natural variations and ablated
versions.
• No clipping or penalty: 𝐿 𝑡(𝜃) = 𝑟𝑡 𝜃 መ𝐴 𝑡
• Clipping: 𝐿 𝑡(𝜃) = min(𝑟𝑡 𝜃 መ𝐴 𝑡, clip(𝑟𝑡 𝜃 , 1 − 𝜖, 1 + 𝜖) መ𝐴 𝑡)
• KL penalty (fixed or adaptive): 𝐿 𝑡 𝜃 = 𝑟𝑡 𝜃 መ𝐴 𝑡 − 𝛽𝐾𝐿 𝜋 𝜃old
, 𝜋 𝜃
Proximal Policy Optimization Algorithms, Schulman et al, 2017Experiments
Proximal Policy Optimization Algorithms, Schulman et al, 2017Experiments
Proximal Policy Optimization Algorithms, Schulman et al, 2017Experiments
Proximal Policy Optimization Algorithms, Schulman et al, 2017Experiments
Proximal Policy Optimization Algorithms, Schulman et al, 2017Conclusion
• We have introduced proximal policy optimization, a family of
policy optimization methods that use multiple epochs of
stochastic gradient ascent to perform each policy update.
Proximal Policy Optimization Algorithms, Schulman et al, 2017Conclusion
• These methods have the stability and reliability of trust-region
methods but are much simpler to implement, requiring only
few lines of code change to a vanilla policy gradient (A3C)
implementation, applicable in more general settings (for
example, when using a joint architecture for the policy and
value function), and have better overall performance.
References
• https://reinforcement-learning-kr.github.io/2018/06/22/7_ppo/
• https://lynnn.tistory.com/73
• https://jay.tech.blog/2018/10/09/trpo%EC%99%80-ppo/
• https://talkingaboutme.tistory.com/entry/RL-Policy-Gradient-
Algorithms
Proximal Policy Optimization Algorithms, Schulman et al, 2017
Thank you!

Mais conteúdo relacionado

Mais procurados

강화학습 알고리즘의 흐름도 Part 2
강화학습 알고리즘의 흐름도 Part 2강화학습 알고리즘의 흐름도 Part 2
강화학습 알고리즘의 흐름도 Part 2Dongmin Lee
 
Feature selection
Feature selectionFeature selection
Feature selectionDong Guo
 
Linear Regression and Logistic Regression in ML
Linear Regression and Logistic Regression in MLLinear Regression and Logistic Regression in ML
Linear Regression and Logistic Regression in MLKumud Arora
 
강화학습의 흐름도 Part 1
강화학습의 흐름도 Part 1강화학습의 흐름도 Part 1
강화학습의 흐름도 Part 1Dongmin Lee
 
Logistic Regression | Logistic Regression In Python | Machine Learning Algori...
Logistic Regression | Logistic Regression In Python | Machine Learning Algori...Logistic Regression | Logistic Regression In Python | Machine Learning Algori...
Logistic Regression | Logistic Regression In Python | Machine Learning Algori...Simplilearn
 
Maximum Entropy Reinforcement Learning (Stochastic Control)
Maximum Entropy Reinforcement Learning (Stochastic Control)Maximum Entropy Reinforcement Learning (Stochastic Control)
Maximum Entropy Reinforcement Learning (Stochastic Control)Dongmin Lee
 
Feature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive modelsFeature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive modelsGabriel Moreira
 
강화 학습 기초 Reinforcement Learning an introduction
강화 학습 기초 Reinforcement Learning an introduction강화 학습 기초 Reinforcement Learning an introduction
강화 학습 기초 Reinforcement Learning an introductionTaehoon Kim
 
Deep Reinforcement Learning
Deep Reinforcement LearningDeep Reinforcement Learning
Deep Reinforcement LearningUsman Qayyum
 
Parametric & Non-Parametric Machine Learning (Supervised ML)
Parametric & Non-Parametric Machine Learning (Supervised ML)Parametric & Non-Parametric Machine Learning (Supervised ML)
Parametric & Non-Parametric Machine Learning (Supervised ML)Rehan Guha
 
Multi-armed Bandits
Multi-armed BanditsMulti-armed Bandits
Multi-armed BanditsDongmin Lee
 
Reinforcement Learning with Deep Energy-Based Policies
Reinforcement Learning with Deep Energy-Based PoliciesReinforcement Learning with Deep Energy-Based Policies
Reinforcement Learning with Deep Energy-Based PoliciesSangwoo Mo
 
Reinforcement learning:policy gradient (part 1)
Reinforcement learning:policy gradient (part 1)Reinforcement learning:policy gradient (part 1)
Reinforcement learning:policy gradient (part 1)Bean Yen
 
은닉 마르코프 모델, Hidden Markov Model(HMM)
은닉 마르코프 모델, Hidden Markov Model(HMM)은닉 마르코프 모델, Hidden Markov Model(HMM)
은닉 마르코프 모델, Hidden Markov Model(HMM)찬희 이
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature EngineeringHJ van Veen
 
Wrapper feature selection method
Wrapper feature selection methodWrapper feature selection method
Wrapper feature selection methodAmir Razmjou
 
분산 강화학습 논문(DeepMind IMPALA) 구현
분산 강화학습 논문(DeepMind IMPALA) 구현분산 강화학습 논문(DeepMind IMPALA) 구현
분산 강화학습 논문(DeepMind IMPALA) 구현정주 김
 
딥러닝과 강화 학습으로 나보다 잘하는 쿠키런 AI 구현하기
딥러닝과 강화 학습으로 나보다 잘하는 쿠키런 AI 구현하기딥러닝과 강화 학습으로 나보다 잘하는 쿠키런 AI 구현하기
딥러닝과 강화 학습으로 나보다 잘하는 쿠키런 AI 구현하기NAVER D2
 
Reinforcement learning, Q-Learning
Reinforcement learning, Q-LearningReinforcement learning, Q-Learning
Reinforcement learning, Q-LearningKuppusamy P
 

Mais procurados (20)

강화학습 알고리즘의 흐름도 Part 2
강화학습 알고리즘의 흐름도 Part 2강화학습 알고리즘의 흐름도 Part 2
강화학습 알고리즘의 흐름도 Part 2
 
Feature selection
Feature selectionFeature selection
Feature selection
 
Linear Regression and Logistic Regression in ML
Linear Regression and Logistic Regression in MLLinear Regression and Logistic Regression in ML
Linear Regression and Logistic Regression in ML
 
강화학습의 흐름도 Part 1
강화학습의 흐름도 Part 1강화학습의 흐름도 Part 1
강화학습의 흐름도 Part 1
 
Logistic Regression | Logistic Regression In Python | Machine Learning Algori...
Logistic Regression | Logistic Regression In Python | Machine Learning Algori...Logistic Regression | Logistic Regression In Python | Machine Learning Algori...
Logistic Regression | Logistic Regression In Python | Machine Learning Algori...
 
Maximum Entropy Reinforcement Learning (Stochastic Control)
Maximum Entropy Reinforcement Learning (Stochastic Control)Maximum Entropy Reinforcement Learning (Stochastic Control)
Maximum Entropy Reinforcement Learning (Stochastic Control)
 
Feature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive modelsFeature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive models
 
강화 학습 기초 Reinforcement Learning an introduction
강화 학습 기초 Reinforcement Learning an introduction강화 학습 기초 Reinforcement Learning an introduction
강화 학습 기초 Reinforcement Learning an introduction
 
Deep Reinforcement Learning
Deep Reinforcement LearningDeep Reinforcement Learning
Deep Reinforcement Learning
 
Parametric & Non-Parametric Machine Learning (Supervised ML)
Parametric & Non-Parametric Machine Learning (Supervised ML)Parametric & Non-Parametric Machine Learning (Supervised ML)
Parametric & Non-Parametric Machine Learning (Supervised ML)
 
Multi-armed Bandits
Multi-armed BanditsMulti-armed Bandits
Multi-armed Bandits
 
Reinforcement Learning with Deep Energy-Based Policies
Reinforcement Learning with Deep Energy-Based PoliciesReinforcement Learning with Deep Energy-Based Policies
Reinforcement Learning with Deep Energy-Based Policies
 
Reinforcement learning:policy gradient (part 1)
Reinforcement learning:policy gradient (part 1)Reinforcement learning:policy gradient (part 1)
Reinforcement learning:policy gradient (part 1)
 
Module 4 part_1
Module 4 part_1Module 4 part_1
Module 4 part_1
 
은닉 마르코프 모델, Hidden Markov Model(HMM)
은닉 마르코프 모델, Hidden Markov Model(HMM)은닉 마르코프 모델, Hidden Markov Model(HMM)
은닉 마르코프 모델, Hidden Markov Model(HMM)
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature Engineering
 
Wrapper feature selection method
Wrapper feature selection methodWrapper feature selection method
Wrapper feature selection method
 
분산 강화학습 논문(DeepMind IMPALA) 구현
분산 강화학습 논문(DeepMind IMPALA) 구현분산 강화학습 논문(DeepMind IMPALA) 구현
분산 강화학습 논문(DeepMind IMPALA) 구현
 
딥러닝과 강화 학습으로 나보다 잘하는 쿠키런 AI 구현하기
딥러닝과 강화 학습으로 나보다 잘하는 쿠키런 AI 구현하기딥러닝과 강화 학습으로 나보다 잘하는 쿠키런 AI 구현하기
딥러닝과 강화 학습으로 나보다 잘하는 쿠키런 AI 구현하기
 
Reinforcement learning, Q-Learning
Reinforcement learning, Q-LearningReinforcement learning, Q-Learning
Reinforcement learning, Q-Learning
 

Semelhante a Proximal Policy Optimization Algorithms, Schulman et al, 2017

Trust Region Policy Optimization, Schulman et al, 2015
Trust Region Policy Optimization, Schulman et al, 2015Trust Region Policy Optimization, Schulman et al, 2015
Trust Region Policy Optimization, Schulman et al, 2015Chris Ohk
 
A brief introduction to Searn Algorithm
A brief introduction to Searn AlgorithmA brief introduction to Searn Algorithm
A brief introduction to Searn AlgorithmSupun Abeysinghe
 
Artificial Intelligence Course: Linear models
Artificial Intelligence Course: Linear models Artificial Intelligence Course: Linear models
Artificial Intelligence Course: Linear models ananth
 
AUTOMATIC TRANSFER RATE ADJUSTMENT FOR TRANSFER REINFORCEMENT LEARNING
AUTOMATIC TRANSFER RATE ADJUSTMENT FOR TRANSFER REINFORCEMENT LEARNINGAUTOMATIC TRANSFER RATE ADJUSTMENT FOR TRANSFER REINFORCEMENT LEARNING
AUTOMATIC TRANSFER RATE ADJUSTMENT FOR TRANSFER REINFORCEMENT LEARNINGgerogepatton
 
Continuous Control with Deep Reinforcement Learning, lillicrap et al, 2015
Continuous Control with Deep Reinforcement Learning, lillicrap et al, 2015Continuous Control with Deep Reinforcement Learning, lillicrap et al, 2015
Continuous Control with Deep Reinforcement Learning, lillicrap et al, 2015Chris Ohk
 
Reinforcement Learning Guide For Beginners
Reinforcement Learning Guide For BeginnersReinforcement Learning Guide For Beginners
Reinforcement Learning Guide For Beginnersgokulprasath06
 
safe and efficient off policy reinforcement learning
safe and efficient off policy reinforcement learningsafe and efficient off policy reinforcement learning
safe and efficient off policy reinforcement learningRyo Iwaki
 
1118_Seminar_Continuous_Deep Q-Learning with Model based acceleration
1118_Seminar_Continuous_Deep Q-Learning with Model based acceleration1118_Seminar_Continuous_Deep Q-Learning with Model based acceleration
1118_Seminar_Continuous_Deep Q-Learning with Model based accelerationHye-min Ahn
 
How to formulate reinforcement learning in illustrative ways
How to formulate reinforcement learning in illustrative waysHow to formulate reinforcement learning in illustrative ways
How to formulate reinforcement learning in illustrative waysYasutoTamura1
 
Evolving Reinforcement Learning Algorithms, JD. Co-Reyes et al, 2021
Evolving Reinforcement Learning Algorithms, JD. Co-Reyes et al, 2021Evolving Reinforcement Learning Algorithms, JD. Co-Reyes et al, 2021
Evolving Reinforcement Learning Algorithms, JD. Co-Reyes et al, 2021Chris Ohk
 
Adapted Branch-and-Bound Algorithm Using SVM With Model Selection
Adapted Branch-and-Bound Algorithm Using SVM With Model SelectionAdapted Branch-and-Bound Algorithm Using SVM With Model Selection
Adapted Branch-and-Bound Algorithm Using SVM With Model SelectionIJECEIAES
 
Deep Reinforcement learning
Deep Reinforcement learningDeep Reinforcement learning
Deep Reinforcement learningCairo University
 
ngboost.pptx
ngboost.pptxngboost.pptx
ngboost.pptxHadrian7
 
Analysis and Design of Algorithms
Analysis and Design of AlgorithmsAnalysis and Design of Algorithms
Analysis and Design of AlgorithmsBulbul Agrawal
 
COMPARISON BETWEEN THE GENETIC ALGORITHMS OPTIMIZATION AND PARTICLE SWARM OPT...
COMPARISON BETWEEN THE GENETIC ALGORITHMS OPTIMIZATION AND PARTICLE SWARM OPT...COMPARISON BETWEEN THE GENETIC ALGORITHMS OPTIMIZATION AND PARTICLE SWARM OPT...
COMPARISON BETWEEN THE GENETIC ALGORITHMS OPTIMIZATION AND PARTICLE SWARM OPT...IAEME Publication
 

Semelhante a Proximal Policy Optimization Algorithms, Schulman et al, 2017 (20)

Trust Region Policy Optimization, Schulman et al, 2015
Trust Region Policy Optimization, Schulman et al, 2015Trust Region Policy Optimization, Schulman et al, 2015
Trust Region Policy Optimization, Schulman et al, 2015
 
LFA-NPG-Paper.pdf
LFA-NPG-Paper.pdfLFA-NPG-Paper.pdf
LFA-NPG-Paper.pdf
 
A brief introduction to Searn Algorithm
A brief introduction to Searn AlgorithmA brief introduction to Searn Algorithm
A brief introduction to Searn Algorithm
 
ngboost.pptx
ngboost.pptxngboost.pptx
ngboost.pptx
 
Artificial Intelligence Course: Linear models
Artificial Intelligence Course: Linear models Artificial Intelligence Course: Linear models
Artificial Intelligence Course: Linear models
 
AUTOMATIC TRANSFER RATE ADJUSTMENT FOR TRANSFER REINFORCEMENT LEARNING
AUTOMATIC TRANSFER RATE ADJUSTMENT FOR TRANSFER REINFORCEMENT LEARNINGAUTOMATIC TRANSFER RATE ADJUSTMENT FOR TRANSFER REINFORCEMENT LEARNING
AUTOMATIC TRANSFER RATE ADJUSTMENT FOR TRANSFER REINFORCEMENT LEARNING
 
Continuous Control with Deep Reinforcement Learning, lillicrap et al, 2015
Continuous Control with Deep Reinforcement Learning, lillicrap et al, 2015Continuous Control with Deep Reinforcement Learning, lillicrap et al, 2015
Continuous Control with Deep Reinforcement Learning, lillicrap et al, 2015
 
Summary of BRAC
Summary of BRACSummary of BRAC
Summary of BRAC
 
Policy gradient
Policy gradientPolicy gradient
Policy gradient
 
Reinforcement Learning Guide For Beginners
Reinforcement Learning Guide For BeginnersReinforcement Learning Guide For Beginners
Reinforcement Learning Guide For Beginners
 
safe and efficient off policy reinforcement learning
safe and efficient off policy reinforcement learningsafe and efficient off policy reinforcement learning
safe and efficient off policy reinforcement learning
 
1118_Seminar_Continuous_Deep Q-Learning with Model based acceleration
1118_Seminar_Continuous_Deep Q-Learning with Model based acceleration1118_Seminar_Continuous_Deep Q-Learning with Model based acceleration
1118_Seminar_Continuous_Deep Q-Learning with Model based acceleration
 
How to formulate reinforcement learning in illustrative ways
How to formulate reinforcement learning in illustrative waysHow to formulate reinforcement learning in illustrative ways
How to formulate reinforcement learning in illustrative ways
 
Evolving Reinforcement Learning Algorithms, JD. Co-Reyes et al, 2021
Evolving Reinforcement Learning Algorithms, JD. Co-Reyes et al, 2021Evolving Reinforcement Learning Algorithms, JD. Co-Reyes et al, 2021
Evolving Reinforcement Learning Algorithms, JD. Co-Reyes et al, 2021
 
Adapted Branch-and-Bound Algorithm Using SVM With Model Selection
Adapted Branch-and-Bound Algorithm Using SVM With Model SelectionAdapted Branch-and-Bound Algorithm Using SVM With Model Selection
Adapted Branch-and-Bound Algorithm Using SVM With Model Selection
 
Deep Reinforcement learning
Deep Reinforcement learningDeep Reinforcement learning
Deep Reinforcement learning
 
ngboost.pptx
ngboost.pptxngboost.pptx
ngboost.pptx
 
Pydata presentation
Pydata presentationPydata presentation
Pydata presentation
 
Analysis and Design of Algorithms
Analysis and Design of AlgorithmsAnalysis and Design of Algorithms
Analysis and Design of Algorithms
 
COMPARISON BETWEEN THE GENETIC ALGORITHMS OPTIMIZATION AND PARTICLE SWARM OPT...
COMPARISON BETWEEN THE GENETIC ALGORITHMS OPTIMIZATION AND PARTICLE SWARM OPT...COMPARISON BETWEEN THE GENETIC ALGORITHMS OPTIMIZATION AND PARTICLE SWARM OPT...
COMPARISON BETWEEN THE GENETIC ALGORITHMS OPTIMIZATION AND PARTICLE SWARM OPT...
 

Mais de Chris Ohk

인프콘 2022 - Rust 크로스 플랫폼 프로그래밍
인프콘 2022 - Rust 크로스 플랫폼 프로그래밍인프콘 2022 - Rust 크로스 플랫폼 프로그래밍
인프콘 2022 - Rust 크로스 플랫폼 프로그래밍Chris Ohk
 
고려대학교 컴퓨터학과 특강 - 대학생 때 알았더라면 좋았을 것들
고려대학교 컴퓨터학과 특강 - 대학생 때 알았더라면 좋았을 것들고려대학교 컴퓨터학과 특강 - 대학생 때 알았더라면 좋았을 것들
고려대학교 컴퓨터학과 특강 - 대학생 때 알았더라면 좋았을 것들Chris Ohk
 
Momenti Seminar - 5 Years of RosettaStone
Momenti Seminar - 5 Years of RosettaStoneMomenti Seminar - 5 Years of RosettaStone
Momenti Seminar - 5 Years of RosettaStoneChris Ohk
 
선린인터넷고등학교 2021 알고리즘 컨퍼런스 - Rust로 알고리즘 문제 풀어보기
선린인터넷고등학교 2021 알고리즘 컨퍼런스 - Rust로 알고리즘 문제 풀어보기선린인터넷고등학교 2021 알고리즘 컨퍼런스 - Rust로 알고리즘 문제 풀어보기
선린인터넷고등학교 2021 알고리즘 컨퍼런스 - Rust로 알고리즘 문제 풀어보기Chris Ohk
 
Momenti Seminar - A Tour of Rust, Part 2
Momenti Seminar - A Tour of Rust, Part 2Momenti Seminar - A Tour of Rust, Part 2
Momenti Seminar - A Tour of Rust, Part 2Chris Ohk
 
Momenti Seminar - A Tour of Rust, Part 1
Momenti Seminar - A Tour of Rust, Part 1Momenti Seminar - A Tour of Rust, Part 1
Momenti Seminar - A Tour of Rust, Part 1Chris Ohk
 
Adversarially Guided Actor-Critic, Y. Flet-Berliac et al, 2021
Adversarially Guided Actor-Critic, Y. Flet-Berliac et al, 2021Adversarially Guided Actor-Critic, Y. Flet-Berliac et al, 2021
Adversarially Guided Actor-Critic, Y. Flet-Berliac et al, 2021Chris Ohk
 
Agent57: Outperforming the Atari Human Benchmark, Badia, A. P. et al, 2020
Agent57: Outperforming the Atari Human Benchmark, Badia, A. P. et al, 2020Agent57: Outperforming the Atari Human Benchmark, Badia, A. P. et al, 2020
Agent57: Outperforming the Atari Human Benchmark, Badia, A. P. et al, 2020Chris Ohk
 
GDG Gwangju DevFest 2019 - <하스스톤> 강화학습 환경 개발기
GDG Gwangju DevFest 2019 - <하스스톤> 강화학습 환경 개발기GDG Gwangju DevFest 2019 - <하스스톤> 강화학습 환경 개발기
GDG Gwangju DevFest 2019 - <하스스톤> 강화학습 환경 개발기Chris Ohk
 
[RLKorea] <하스스톤> 강화학습 환경 개발기
[RLKorea] <하스스톤> 강화학습 환경 개발기[RLKorea] <하스스톤> 강화학습 환경 개발기
[RLKorea] <하스스톤> 강화학습 환경 개발기Chris Ohk
 
[NDC 2019] 하스스톤 강화학습 환경 개발기
[NDC 2019] 하스스톤 강화학습 환경 개발기[NDC 2019] 하스스톤 강화학습 환경 개발기
[NDC 2019] 하스스톤 강화학습 환경 개발기Chris Ohk
 
C++20 Key Features Summary
C++20 Key Features SummaryC++20 Key Features Summary
C++20 Key Features SummaryChris Ohk
 
[델리만주] 대학원 캐슬 - 석사에서 게임 프로그래머까지
[델리만주] 대학원 캐슬 - 석사에서 게임 프로그래머까지[델리만주] 대학원 캐슬 - 석사에서 게임 프로그래머까지
[델리만주] 대학원 캐슬 - 석사에서 게임 프로그래머까지Chris Ohk
 
디미고 특강 - 개발을 시작하려는 여러분에게
디미고 특강 - 개발을 시작하려는 여러분에게디미고 특강 - 개발을 시작하려는 여러분에게
디미고 특강 - 개발을 시작하려는 여러분에게Chris Ohk
 
청강대 특강 - 프로젝트 제대로 해보기
청강대 특강 - 프로젝트 제대로 해보기청강대 특강 - 프로젝트 제대로 해보기
청강대 특강 - 프로젝트 제대로 해보기Chris Ohk
 
[NDC 2018] 유체역학 엔진 개발기
[NDC 2018] 유체역학 엔진 개발기[NDC 2018] 유체역학 엔진 개발기
[NDC 2018] 유체역학 엔진 개발기Chris Ohk
 
My Way, Your Way
My Way, Your WayMy Way, Your Way
My Way, Your WayChris Ohk
 
Re:Zero부터 시작하지 않는 오픈소스 개발
Re:Zero부터 시작하지 않는 오픈소스 개발Re:Zero부터 시작하지 않는 오픈소스 개발
Re:Zero부터 시작하지 않는 오픈소스 개발Chris Ohk
 
[9XD] Introduction to Computer Graphics
[9XD] Introduction to Computer Graphics[9XD] Introduction to Computer Graphics
[9XD] Introduction to Computer GraphicsChris Ohk
 
C++17 Key Features Summary - Ver 2
C++17 Key Features Summary - Ver 2C++17 Key Features Summary - Ver 2
C++17 Key Features Summary - Ver 2Chris Ohk
 

Mais de Chris Ohk (20)

인프콘 2022 - Rust 크로스 플랫폼 프로그래밍
인프콘 2022 - Rust 크로스 플랫폼 프로그래밍인프콘 2022 - Rust 크로스 플랫폼 프로그래밍
인프콘 2022 - Rust 크로스 플랫폼 프로그래밍
 
고려대학교 컴퓨터학과 특강 - 대학생 때 알았더라면 좋았을 것들
고려대학교 컴퓨터학과 특강 - 대학생 때 알았더라면 좋았을 것들고려대학교 컴퓨터학과 특강 - 대학생 때 알았더라면 좋았을 것들
고려대학교 컴퓨터학과 특강 - 대학생 때 알았더라면 좋았을 것들
 
Momenti Seminar - 5 Years of RosettaStone
Momenti Seminar - 5 Years of RosettaStoneMomenti Seminar - 5 Years of RosettaStone
Momenti Seminar - 5 Years of RosettaStone
 
선린인터넷고등학교 2021 알고리즘 컨퍼런스 - Rust로 알고리즘 문제 풀어보기
선린인터넷고등학교 2021 알고리즘 컨퍼런스 - Rust로 알고리즘 문제 풀어보기선린인터넷고등학교 2021 알고리즘 컨퍼런스 - Rust로 알고리즘 문제 풀어보기
선린인터넷고등학교 2021 알고리즘 컨퍼런스 - Rust로 알고리즘 문제 풀어보기
 
Momenti Seminar - A Tour of Rust, Part 2
Momenti Seminar - A Tour of Rust, Part 2Momenti Seminar - A Tour of Rust, Part 2
Momenti Seminar - A Tour of Rust, Part 2
 
Momenti Seminar - A Tour of Rust, Part 1
Momenti Seminar - A Tour of Rust, Part 1Momenti Seminar - A Tour of Rust, Part 1
Momenti Seminar - A Tour of Rust, Part 1
 
Adversarially Guided Actor-Critic, Y. Flet-Berliac et al, 2021
Adversarially Guided Actor-Critic, Y. Flet-Berliac et al, 2021Adversarially Guided Actor-Critic, Y. Flet-Berliac et al, 2021
Adversarially Guided Actor-Critic, Y. Flet-Berliac et al, 2021
 
Agent57: Outperforming the Atari Human Benchmark, Badia, A. P. et al, 2020
Agent57: Outperforming the Atari Human Benchmark, Badia, A. P. et al, 2020Agent57: Outperforming the Atari Human Benchmark, Badia, A. P. et al, 2020
Agent57: Outperforming the Atari Human Benchmark, Badia, A. P. et al, 2020
 
GDG Gwangju DevFest 2019 - <하스스톤> 강화학습 환경 개발기
GDG Gwangju DevFest 2019 - <하스스톤> 강화학습 환경 개발기GDG Gwangju DevFest 2019 - <하스스톤> 강화학습 환경 개발기
GDG Gwangju DevFest 2019 - <하스스톤> 강화학습 환경 개발기
 
[RLKorea] <하스스톤> 강화학습 환경 개발기
[RLKorea] <하스스톤> 강화학습 환경 개발기[RLKorea] <하스스톤> 강화학습 환경 개발기
[RLKorea] <하스스톤> 강화학습 환경 개발기
 
[NDC 2019] 하스스톤 강화학습 환경 개발기
[NDC 2019] 하스스톤 강화학습 환경 개발기[NDC 2019] 하스스톤 강화학습 환경 개발기
[NDC 2019] 하스스톤 강화학습 환경 개발기
 
C++20 Key Features Summary
C++20 Key Features SummaryC++20 Key Features Summary
C++20 Key Features Summary
 
[델리만주] 대학원 캐슬 - 석사에서 게임 프로그래머까지
[델리만주] 대학원 캐슬 - 석사에서 게임 프로그래머까지[델리만주] 대학원 캐슬 - 석사에서 게임 프로그래머까지
[델리만주] 대학원 캐슬 - 석사에서 게임 프로그래머까지
 
디미고 특강 - 개발을 시작하려는 여러분에게
디미고 특강 - 개발을 시작하려는 여러분에게디미고 특강 - 개발을 시작하려는 여러분에게
디미고 특강 - 개발을 시작하려는 여러분에게
 
청강대 특강 - 프로젝트 제대로 해보기
청강대 특강 - 프로젝트 제대로 해보기청강대 특강 - 프로젝트 제대로 해보기
청강대 특강 - 프로젝트 제대로 해보기
 
[NDC 2018] 유체역학 엔진 개발기
[NDC 2018] 유체역학 엔진 개발기[NDC 2018] 유체역학 엔진 개발기
[NDC 2018] 유체역학 엔진 개발기
 
My Way, Your Way
My Way, Your WayMy Way, Your Way
My Way, Your Way
 
Re:Zero부터 시작하지 않는 오픈소스 개발
Re:Zero부터 시작하지 않는 오픈소스 개발Re:Zero부터 시작하지 않는 오픈소스 개발
Re:Zero부터 시작하지 않는 오픈소스 개발
 
[9XD] Introduction to Computer Graphics
[9XD] Introduction to Computer Graphics[9XD] Introduction to Computer Graphics
[9XD] Introduction to Computer Graphics
 
C++17 Key Features Summary - Ver 2
C++17 Key Features Summary - Ver 2C++17 Key Features Summary - Ver 2
C++17 Key Features Summary - Ver 2
 

Último

Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 

Último (20)

Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 

Proximal Policy Optimization Algorithms, Schulman et al, 2017

  • 1. Proximal Policy Optimization Algorithms, Schulman et al, 2017 옥찬호 utilForever@gmail.com
  • 2. Proximal Policy Optimization Algorithms, Schulman et al, 2017Introduction • In recent years, several different approaches have been proposed for reinforcement learning with neural network function approximators. The leading contenders are deep Q- learning, “vanilla” policy gradient methods, and trust region / natural policy gradient methods. • However, there is room for improvement in developing a method that is scalable (to large models and parallel implementations), data efficient, and robust (i.e., successful on a variety of problems without hyperparameter tuning).
  • 3. Proximal Policy Optimization Algorithms, Schulman et al, 2017Introduction • DQN • fails on many simple problems and is poorly understood • A3C – “Vanilla” policy gradient methods • have poor data efficiency and robustness • TRPO • relatively complicated • is not compatible with architectures that include noise (such as dropout) or parameter sharing (between the policy and value function, or with auxiliary tasks)
  • 4. Proximal Policy Optimization Algorithms, Schulman et al, 2017Introduction • We propose a novel objective with clipped probability ratios, which forms a pessimistic estimate (i.e., lower bound) of the performance of the policy. • This paper seeks to improve the current state of affairs by introducing an algorithm that attains the data efficiency and reliable performance of TRPO, while using only first-order optimization.
  • 5. Proximal Policy Optimization Algorithms, Schulman et al, 2017Introduction • To optimize policies, we alternate between sampling data from the policy and performing several epochs of optimization on the sampled data.
  • 6. Proximal Policy Optimization Algorithms, Schulman et al, 2017Introduction • Our experiments compare the performance of various different versions of the surrogate objective, and find that the version with the clipped probability ratios performs best. • We also compare PPO to several previous algorithms from the literature. • On continuous control tasks, it performs better than the algorithms we compare against. • On Atari, it performs significantly better (in terms of sample complexity) than A2C and similarly to ACER though it is much simpler.
  • 7. PPO, Schulman et al, 2017Background: Policy Optimization • Policy gradient methods work by computing an estimator of the policy gradient and plugging it into a stochastic gradient ascent algorithm. The most commonly used gradient estimator has the form ො𝑔 = ෡𝔼 𝑡 ∇ 𝜃 log 𝜋 𝜃 𝑎 𝑡 𝑠𝑡 መ𝐴 𝑡 • where 𝜋 𝜃 is a stochastic policy and መ𝐴 𝑡 is an estimator of the advantage function at timestep 𝑡. Here, the expectation ෡𝔼 𝑡 … indicates the empirical average over a finite batch of samples, in an algorithm that alternates between sampling and optimization.
  • 8. PPO, Schulman et al, 2017Background: Policy Optimization • Implementations that use automatic differentiation software work by constructing an objective function whose gradient is the policy gradient estimator; the estimator ො𝑔 is obtained by differentiating the objective 𝐿 𝑃𝐺 𝜃 = ෡𝔼 𝑡 log 𝜋 𝜃 𝑎 𝑡 𝑠𝑡 መ𝐴 𝑡
  • 9. PPO, Schulman et al, 2017Background: Policy Optimization • While it is appealing to perform multiple steps of optimization on this loss 𝐿 𝑃𝐺 using the same trajectory, doing so is not well- justified, and empirically it often leads to destructively large policy updates.
  • 10. PPO, Schulman et al, 2017Background: Policy Optimization • In TRPO, an objective function (the “surrogate” objective) is maximized subject to a constraint on the size of the policy update. Specifically, maximize 𝜃 ෡𝔼 𝑡 𝜋 𝜃(𝑎 𝑡|𝑠𝑡) 𝜋 𝜃old (𝑎 𝑡|𝑠𝑡) መ𝐴 𝑡 subject to ෡𝔼 𝑡 𝐾𝐿 𝜋 𝜃old ∙ 𝑠𝑡 , 𝜋 𝜃 ∙ 𝑠𝑡 ≤ 𝛿 • Here, 𝜃old is the vector of policy parameters before the update.
  • 11. PPO, Schulman et al, 2017Background: Policy Optimization • This problem can efficiently be approximately solved using the conjugate gradient algorithm, after making a linear approximation to the objective and a quadratic approximation to the constraint.
  • 12. PPO, Schulman et al, 2017Background: Policy Optimization • The theory justifying TRPO actually suggests using a penalty instead of a constraint, i.e., solving the unconstrained optimization problem for some coefficient 𝛽. maximize 𝜃 ෡𝔼 𝑡 𝜋 𝜃(𝑎 𝑡|𝑠𝑡) 𝜋 𝜃old (𝑎 𝑡|𝑠𝑡) መ𝐴 𝑡 − 𝛽𝐾𝐿 𝜋 𝜃old ∙ 𝑠𝑡 , 𝜋 𝜃 ∙ 𝑠𝑡
  • 13. PPO, Schulman et al, 2017Background: Policy Optimization • This follows from the fact that a certain surrogate objective (which computes the max KL over states instead of the mean) forms a lower bound (i.e., a pessimistic bound) on the performance of the policy 𝜋. • TRPO uses a hard constraint rather than a penalty because it is hard to choose a single value of 𝜷 that performs well across different problems—or even within a single problem, where the characteristics change over the course of learning.
  • 14. PPO, Schulman et al, 2017Clipped Surrogate Objective • Let 𝑟𝑡(𝜃) denote the probability ratio 𝑟𝑡 𝜃 = 𝜋 𝜃(𝑎 𝑡|𝑠 𝑡) 𝜋 𝜃old (𝑎 𝑡|𝑠 𝑡) , so 𝑟𝑡 𝜃old = 1. TRPO maximizes a “surrogate” objective 𝐿 𝐶𝑃𝐼 𝜃 = ෡𝔼 𝑡 𝜋 𝜃(𝑎 𝑡|𝑠𝑡) 𝜋 𝜃old (𝑎 𝑡|𝑠𝑡) መ𝐴 𝑡 = ෡𝔼 𝑡 𝑟𝑡 𝜃 መ𝐴 𝑡 • The superscript CPI refers to conservative policy iteration, where this objective was proposed.
  • 15. PPO, Schulman et al, 2017Clipped Surrogate Objective • Without a constraint, maximization of 𝐿 𝐶𝑃𝐼 would lead to an excessively large policy update; hence, we now consider how to modify the objective, to penalize changes to the policy that move 𝑟𝑡(𝜃) away from 1. • The main objective we propose is the following: 𝐿 𝐶𝐿𝐼𝑃 𝜃 = ෡𝔼 𝑡 min(𝑟𝑡 𝜃 መ𝐴 𝑡, clip(𝑟𝑡 𝜃 , 1 − 𝜖, 1 + 𝜖) መ𝐴 𝑡) • where epsilon is a hyperparameter, say, 𝜖 = 0.2.
  • 16. PPO, Schulman et al, 2017Clipped Surrogate Objective • The motivation for this objective is as follows. • The first term inside the min is 𝐿 𝐶𝑃𝐼 𝜃 . • The second term, clip(𝑟𝑡 𝜃 , 1 − 𝜖, 1 + 𝜖) መ𝐴 𝑡, modifies the surrogate objective by clipping the probability ratio, which removes the incentive for moving 𝑟𝑡 𝜃 outside of the interval 1 − 𝜖, 1 + 𝜖 . • Finally, we take the minimum of the clipped and unclipped objective, so the final objective is a lower bound (i.e., a pessimistic bound) on the unclipped objective.
  • 17. PPO, Schulman et al, 2017Clipped Surrogate Objective Figure 1: Plots showing one term (i.e., a single timestep) of the surrogate function 𝐿 𝐶𝐿𝐼𝑃 as a function of the probability ratio 𝑟, for positive advantages (left) and negative advantages (right). The red circle on each plot shows the starting point for the optimization, i.e., 𝑟 = 1. Note that 𝐿 𝐶𝐿𝐼𝑃 sums many of these terms.
  • 18. PPO, Schulman et al, 2017Clipped Surrogate Objective Figure 2: Surrogate objectives, as we interpolate between the initial policy parameter 𝜃old, and the updated policy parameter, which we compute after one iteration of PPO. The updated policy has a KL divergence of about 0.02 from the initial policy, and this is the point at which 𝐿 𝐶𝐿𝐼𝑃 is maximal.
  • 19. PPO, Schulman et al, 2017Adaptive KL Penalty Coefficient • Another approach, which can be used as an alternative to the clipped surrogate objective, or in addition to it, is to use a penalty on KL divergence, and to adapt the penalty coefficient so that we achieve some target value of the KL divergence 𝑑targ each policy update. • In our experiments, we found that the KL penalty performed worse than the clipped surrogate objective, however, we’ve included it here because it’s an important baseline.
  • 20. PPO, Schulman et al, 2017Adaptive KL Penalty Coefficient • In the simplest instantiation of this algorithm, we perform the following steps in each policy update: • Using several epochs of minibatch SGD, optimize the KL-penalized objective 𝐿 𝐾𝐿𝑃𝐸𝑁 𝜃 = ෡𝔼 𝑡 𝜋 𝜃(𝑎 𝑡|𝑠𝑡) 𝜋 𝜃old (𝑎 𝑡|𝑠𝑡) መ𝐴 𝑡 − 𝛽𝐾𝐿 𝜋 𝜃old ∙ 𝑠𝑡 , 𝜋 𝜃 ∙ 𝑠𝑡 • Compute 𝒅 = ෡𝔼 𝒕 𝑲𝑳 𝝅 𝜽 𝒐𝒍𝒅 ∙ 𝒔 𝒕 , 𝝅 𝜽 ∙ 𝒔 𝒕 • If 𝑑 < Τ𝑑 𝑡𝑎𝑟𝑔 1.5 , 𝛽 ← Τ𝛽 2 • If 𝑑 > 𝑑 𝑡𝑎𝑟𝑔 × 1.5, 𝛽 ← 𝛽 × 2 • The updated 𝛽 is used for the next policy update.
  • 21. Proximal Policy Optimization Algorithms, Schulman et al, 2017Algorithm • Most techniques for computing variance-reduced advantage- function estimators make use a learned state-value function 𝑉(𝑠) • Generalized advantage estimation • The finite-horizon estimators
  • 22. Proximal Policy Optimization Algorithms, Schulman et al, 2017Algorithm • If using a neural network architecture that shares parameters between the policy and value function, we must use a loss function that combines the policy surrogate and a value function error term. • This objective can further be augmented by adding an entropy bonus to ensure sufficient exploration, as suggested in past work (A3C).
  • 23. Proximal Policy Optimization Algorithms, Schulman et al, 2017Algorithm • Combining these terms, we obtain the following objective, which is (approximately) maximized each iteration: 𝐿 𝑡 𝐶𝐿𝐼𝑃+𝑉𝐹+𝑆 𝜃 = ෡𝔼 𝑡 𝐿 𝑡 𝐶𝐿𝐼𝑃 𝜃 − 𝑐1 𝐿 𝑡 𝑉𝐹 𝜃 + 𝑐2 𝑆 𝜋 𝜃 𝑠𝑡 • where 𝑐1, 𝑐2 are coefficients, and 𝑆 denotes an entropy bonus, and 𝐿 𝑡 𝑉𝐹 is a squared-error loss 𝑉𝜃 𝑠𝑡 − 𝑉𝑡 targ 2 .
  • 24. Proximal Policy Optimization Algorithms, Schulman et al, 2017Algorithm • One style of policy gradient implementation, popularized in A3C and well-suited for use with recurrent neural networks, runs the policy for 𝑻 timesteps (where 𝑻 is much less than the episode length), and uses the collected samples for an update.
  • 25. Proximal Policy Optimization Algorithms, Schulman et al, 2017Algorithm • This style requires an advantage estimator that does not look beyond timestep 𝑻. The estimator used by A3C is መ𝐴 𝑡 = −𝑉 𝑠𝑡 + 𝑟𝑡 + 𝛾𝑟𝑡+1 + ⋯ + 𝛾 𝑇−𝑡+1 𝑟 𝑇−1 + 𝛾 𝑇−𝑡 𝑉 𝑠 𝑇 • where 𝑡 specifies the time index in 0, 𝑇 , within a given length-𝑇 trajectory segment.
  • 26. Proximal Policy Optimization Algorithms, Schulman et al, 2017Algorithm • Generalizing this choice, we can use a truncated version of generalized advantage estimation, which reduces to previous equation when 𝜆 = 1: መ𝐴 𝑡 = 𝛿𝑡 + 𝛾𝜆 𝛿𝑡+1 + ⋯ + ⋯ + 𝛾𝜆 𝑇−𝑡+1 𝛿 𝑇−1 where 𝛿𝑡 = 𝑟𝑡 + 𝛾𝑉 𝑠𝑡+1 − 𝑉 𝑠𝑡
  • 27. Proximal Policy Optimization Algorithms, Schulman et al, 2017Algorithm • A proximal policy optimization (PPO) algorithm that uses fixed-length trajectory segments is shown below. • Each iteration, each of 𝑁 (parallel) actors collect 𝑇 timesteps of data. • Then we construct the surrogate loss on these 𝑁𝑇 timesteps of data, and optimize it with minibatch SGD (or usually for better performance, Adam), for 𝐾 epochs.
  • 28. Proximal Policy Optimization Algorithms, Schulman et al, 2017Experiments • First, we compare several different surrogate objectives under different hyperparameters. Here, we compare the surrogate objective 𝐿 𝐶𝐿𝐼𝑃 to several natural variations and ablated versions. • No clipping or penalty: 𝐿 𝑡(𝜃) = 𝑟𝑡 𝜃 መ𝐴 𝑡 • Clipping: 𝐿 𝑡(𝜃) = min(𝑟𝑡 𝜃 መ𝐴 𝑡, clip(𝑟𝑡 𝜃 , 1 − 𝜖, 1 + 𝜖) መ𝐴 𝑡) • KL penalty (fixed or adaptive): 𝐿 𝑡 𝜃 = 𝑟𝑡 𝜃 መ𝐴 𝑡 − 𝛽𝐾𝐿 𝜋 𝜃old , 𝜋 𝜃
  • 29. Proximal Policy Optimization Algorithms, Schulman et al, 2017Experiments
  • 30. Proximal Policy Optimization Algorithms, Schulman et al, 2017Experiments
  • 31. Proximal Policy Optimization Algorithms, Schulman et al, 2017Experiments
  • 32. Proximal Policy Optimization Algorithms, Schulman et al, 2017Experiments
  • 33. Proximal Policy Optimization Algorithms, Schulman et al, 2017Conclusion • We have introduced proximal policy optimization, a family of policy optimization methods that use multiple epochs of stochastic gradient ascent to perform each policy update.
  • 34. Proximal Policy Optimization Algorithms, Schulman et al, 2017Conclusion • These methods have the stability and reliability of trust-region methods but are much simpler to implement, requiring only few lines of code change to a vanilla policy gradient (A3C) implementation, applicable in more general settings (for example, when using a joint architecture for the policy and value function), and have better overall performance.
  • 35. References • https://reinforcement-learning-kr.github.io/2018/06/22/7_ppo/ • https://lynnn.tistory.com/73 • https://jay.tech.blog/2018/10/09/trpo%EC%99%80-ppo/ • https://talkingaboutme.tistory.com/entry/RL-Policy-Gradient- Algorithms Proximal Policy Optimization Algorithms, Schulman et al, 2017