SlideShare a Scribd company logo
1 of 45
Download to read offline
Chapter 13: Policy Gradient Methods
Seungjae Ryan Lee
Preview: Policy Gradients
● Action-value Methods
○ Learn values of actions and select actions with estimated action values
○ Policy derived from action-value estimates
● Policy Gradient Methods
○ Learn parameterized policy that can select action without a value function
○ Can still use value function to learn the policy parameter
Policy Gradient Methods
● Define a performance measure J(θ) to maximize
● Learn policy parameter θ through approximate gradient ascent
Stochastic estimate of J(θ)
Soft-max in Action Preferences
● Numerical preference h(s, a, θ) for each state-action pair
● Action selection through soft-max
Soft-max in Action Preferences: Advantages
1. Approximate policy can approach deterministic policy
○ No “limit” like ε-greedy methods
○ Using soft-max on action values cannot approach deterministic policy
Soft-max in Action Preferences: Advantages
2. Allow stochastic policy
○ Best approximate policy can be stochastic in problems with significant function approximation
● Consider small corridor with -1 reward on each step
○ States are indistinguishable
○ Action transition is reversed in the second state
Soft-max in Action Preferences: Advantages
2. Allow stochastic policy
○ ε-greedy methods (ε=0.1) cannot find optimal policy
Soft-max in Action Preferences: Advantages
3. Policy may be simpler to approximate
○ Differs among problems
http://proceedings.mlr.press/v48/simsek16.pdf
Theoretical Advantage of Policy Gradient Methods
● Smooth transition of policy for parameter changes
● Allows for stronger convergence guarantees
h = 1 h = 0.9
0.4750.525
h = 1 h = 1.1
0.475 0.525
Q = 1 Q = 0.9
0.050.95
Q = 1 Q = 1.1
0.5 0.95
Policy Gradient Theorem
● Define performance measure as value of the start state
● Want to compute w.r.t. policy parameter
Policy Gradient Theorem
Episodic: Average episode length
Continuing: 1
On-policy state distribution
Policy Gradient Theorem: Proof
Policy Gradient Theorem: Proof
Product rule
Policy Gradient Theorem: Proof
Policy Gradient Theorem: Proof
Policy Gradient Theorem: Proof
Unrolling
Policy Gradient Theorem: Proof
Policy Gradient Theorem: Proof
Policy Gradient Theorem: Proof
Policy Gradient Theorem: Proof
Policy Gradient Theorem: Proof
Stochastic Gradient Descent
● Need samples with expectation
Stochastic Gradient Descent
is an on-policy state distribution of
Stochastic Gradient Descent
Stochastic Gradient Descent
Stochastic Gradient Descent: REINFORCE
REINFORCE (1992)
● Sample return like Monte Carlo
● Increment proportional to return
● Increment inverse proportional to action probability
○ Prevent frequent actions dominating due to frequent updates
REINFORCE: Pseudocode
Eligibility vector
REINFORCE: Results
REINFORCE with Baseline
● REINFORCE
○ Good theoretical convergence
○ Bad convergence speed due to high variance
Baseline
REINFORCE with Baseline: Pseudocode
Learn state value with MC to use as baseline
REINFORCE with Baseline: Results
● Learn approximations for both policy (Actor) and value function (Critic)
● Critic vs Baseline in REINFORCE
○ Critic is used for bootstrapping
○ Bootstrapping introduces bias and relies on state representation
○ Bootstrapping reduces variance and accelerates learning
Actor-Critic Methods
Actor Critic
One-step Actor Critic
● Replace return with one-step return
● Replace baseline with approximated value function (Critic)
○ Learned with semi-gradient TD(0)
REINFORCE:
One-step AC:
One-step Actor-Critic
Update Critic (value function) parameters
Update Actor (policy) parameters
Actor-Critic with Eligibility Traces
Average Reward for Continuing Problems
● Measure performance in terms of average reward
Actor-Critic for Continuing Problems
Policy Gradient Theorem Proof (Continuing Case)
1. Same procedure:
2. Rearrange equation:
Policy Gradient Theorem Proof (Continuing Case)
3. Sum over all states weighted by state-distribution
a. Nothing changes since neither side depend on s and
Policy Gradient Theorem Proof (Continuing Case)
4. Use ergodicity:
● Policy based methods can handle continuous action spaces
● Learn statistics of the probability distribution
○ ex) mean and variance of Gaussian
● Choose action from the learned distribution
○ ex) Gaussian distribution
Policy Parameterization for Continuous Actions
Policy Parametrization to Gaussian Distribution
● Divide policy parameter vector into mean and variance:
● Approximate mean with linear function:
● Approximate variance with exponential of linear function:
○ Guaranteed positive
● All PG algorithms can be applied to the parameter vector
Summary
● Policy Gradient methods have many advantages over action-value methods
○ Represent stochastic policy and approach deterministic policies
○ Learn appropriate levels of exploration
○ Handle continuous action spaces
○ Compute effect of policy parameter on performance with Policy Gradient Theorem
● Actor-Critic estimates value function for bootstrapping
○ Introduces bias but is often desirable due to lower variance
○ Similar to preferring TD over MC
“Policy Gradient methods provide a significantly different set of strengths and
weaknesses than action-value methods.”
Thank you!
Original content from
● Reinforcement Learning: An Introduction by Sutton and Barto
You can find more content in
● github.com/seungjaeryanlee
● www.endtoend.ai

More Related Content

Recently uploaded

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Recently uploaded (20)

A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 

Featured

How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
ThinkNow
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
Kurio // The Social Media Age(ncy)
 

Featured (20)

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

Reinforcement Learning 13. Policy Gradient Methods