SlideShare a Scribd company logo
1 of 30
Download to read offline
Chapter 7: n-step Bootstrapping
Seungjae Ryan Lee
● Monte Carlo: wait until end of episode
● 1-step TD / TD(0): wait until next time step
Bootstrapping target
TD error
MC error
Recap: MC vs TD
Return
n-step Bootstrapping
● Perform update based on intermediate number of rewards
● Freed from the “tyranny of the time step” of TD
○ Different time step for action selection (1) and bootstrapping interval (n)
● Called n-step TD since they still bootstrap
n-step Bootstrapping
n-step TD Prediction
● Use truncated n-step return as target
○ Use n rewards and bootstrap
● Needs future rewards not available at timestep
● cannot be updated until timestep
n-step TD Prediction: Pseudocode
Compute n-step return
Update V
n-step TD Prediction: Convergence
● The n-step return has the error reduction property
○ Expectation of n-step return is a better estimate of than in the worst-state sense
● Converges to true value under appropriate technical conditions
Random Walk Example
● Rewards only on exit (-1 on left exit, 1 on right exit)
● n-step return: propagate reward up to n latest states
S17 S18 S19S1 S2 S3
R = -1 R = 1
Sample
trajectory
1-step
2-step
Random Walk Example: n-step TD Prediction
● Intermediate n does best
n-step Sarsa
● Extend n-step TD Prediction to Control (Sarsa)
○ Need to use Q instead of V
○ Use ε-greedy policy
● Redefine n-step return with Q
● Naturally extend to Sarsa
n-step Sarsa vs. Sarsa(0)
● Gridworld with nonzero reward only at the end
● n-step can learn much more from one episode
n-step Sarsa: Pseudocode
n-step Expected Sarsa
● Same update as Sarsa except the last element
○ Consider all possible actions in the last step
● Same n-step return as Sarsa except the last step
● Same update as Sarsa
Off-policy n-step Learning
● Need importance sampling
● Update target policy’s values with behavior policy’s returns
● Generalizes the on-policy case
○ If , then
Off-policy n-step Sarsa
● Update Q instead of V
● Importance sampling ratio starts one step later for Q values
○ is already chosen
Off-policy n-step Sarsa: Pseudocode
Off-policy n-step Expected Sarsa
● Importance sampling ratio ends one step earlier for Expected Sarsa
● Use expected n-step return
Per-decision Off-policy Methods: Intuition*
● More efficient off-policy n-step method
● Write returns recursively:
● Naive importance sampling
○ If ,
○ Estimate shrinks, higher variance
Per-decision Off-policy Methods*
● Better: If , leave the estimate unchanged
● Expected update is unchanged since
● Used with TD update without importance sampling
Control Variate
Per-decision Off-policy Methods: Q*
● Use Expected Sarsa’s n-step return
● Off-policy form with control variate:
● Analogous to Expected Sarsa after combining with TD update algorithm
http://auai.org/uai2018/proceedings/papers/282.pdf
n-step Tree Backup Algorithm
● Off-policy without importance sampling
● Update from entire tree of estimated action values
○ Leaf action nodes (not selected) contribute to the target
○ Selected action nodes does not contribute but weighs all next-level action values
n-step Tree Backup Algorithm: n-step Return
● 1-step return
● 2-step return
n-step Tree Backup Algorithm: n-step Return
● 2-step return
● n-step return
n-step Tree Backup Algorithm: Pseudocode
A Unifying Algorithm: n-step *
● Unify Sarsa, Tree Backup and Expected Sarsa
○ Decide on each step to use sample action (Sarsa) or expectation of all actions (Tree Backup)
A Unifying Algorithm: n-step : Equations*
● : degree of sampling on timestep t
● Slide linearly between two weights:
○ Sarsa: Importance sampling ratio
○ Tree Backup: Policy probability
A Unifying Algorithm: n-step : Pseudocode*
Summary
● n-step: Look ahead to the next n rewards, states, and actions
+ Perform better than either MC or TD
+ Escapes the tyranny of the single time step
- Delay of n steps before learning
- More memory and computation per timestep
● Extended to Eligibility Traces (Ch. 12)
+ Minimize additional memory and computation
- More complex
● Two approaches to off-policy n-step learning
○ Importance sampling: high variance
○ Tree backup: limited to few-step bootstrapping if policies are very different (even if n is large)
Summary
Thank you!
Original content from
● Reinforcement Learning: An Introduction by Sutton and Barto
You can find more content in
● github.com/seungjaeryanlee
● www.endtoend.ai

More Related Content

What's hot

Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement LearningSalem-Kabbani
 
An introduction to reinforcement learning
An introduction to  reinforcement learningAn introduction to  reinforcement learning
An introduction to reinforcement learningJie-Han Chen
 
Reinforcement learning, Q-Learning
Reinforcement learning, Q-LearningReinforcement learning, Q-Learning
Reinforcement learning, Q-LearningKuppusamy P
 
Markov decision process
Markov decision processMarkov decision process
Markov decision processHamed Abdi
 
25 introduction reinforcement_learning
25 introduction reinforcement_learning25 introduction reinforcement_learning
25 introduction reinforcement_learningAndres Mendez-Vazquez
 
Multi-armed Bandits
Multi-armed BanditsMulti-armed Bandits
Multi-armed BanditsDongmin Lee
 
Reinforcement Learning 4. Dynamic Programming
Reinforcement Learning 4. Dynamic ProgrammingReinforcement Learning 4. Dynamic Programming
Reinforcement Learning 4. Dynamic ProgrammingSeung Jae Lee
 
Planning and Learning with Tabular Methods
Planning and Learning with Tabular MethodsPlanning and Learning with Tabular Methods
Planning and Learning with Tabular MethodsDongmin Lee
 
Deep reinforcement learning from scratch
Deep reinforcement learning from scratchDeep reinforcement learning from scratch
Deep reinforcement learning from scratchJie-Han Chen
 
Reinforcement Learning 10. On-policy Control with Approximation
Reinforcement Learning 10. On-policy Control with ApproximationReinforcement Learning 10. On-policy Control with Approximation
Reinforcement Learning 10. On-policy Control with ApproximationSeung Jae Lee
 
Reinforcement Learning Q-Learning
Reinforcement Learning   Q-Learning Reinforcement Learning   Q-Learning
Reinforcement Learning Q-Learning Melaku Eneayehu
 
An introduction to deep reinforcement learning
An introduction to deep reinforcement learningAn introduction to deep reinforcement learning
An introduction to deep reinforcement learningBig Data Colombia
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement LearningDongHyun Kwak
 
DQN (Deep Q-Network)
DQN (Deep Q-Network)DQN (Deep Q-Network)
DQN (Deep Q-Network)Dong Guo
 
An introduction to reinforcement learning
An introduction to reinforcement learningAn introduction to reinforcement learning
An introduction to reinforcement learningSubrat Panda, PhD
 
Reinforcement learning 7313
Reinforcement learning 7313Reinforcement learning 7313
Reinforcement learning 7313Slideshare
 
Continuous Control with Deep Reinforcement Learning, lillicrap et al, 2015
Continuous Control with Deep Reinforcement Learning, lillicrap et al, 2015Continuous Control with Deep Reinforcement Learning, lillicrap et al, 2015
Continuous Control with Deep Reinforcement Learning, lillicrap et al, 2015Chris Ohk
 
Multi-Armed Bandit and Applications
Multi-Armed Bandit and ApplicationsMulti-Armed Bandit and Applications
Multi-Armed Bandit and ApplicationsSangwoo Mo
 
Multi armed bandit
Multi armed banditMulti armed bandit
Multi armed banditJie-Han Chen
 

What's hot (20)

Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
An introduction to reinforcement learning
An introduction to  reinforcement learningAn introduction to  reinforcement learning
An introduction to reinforcement learning
 
Reinforcement learning, Q-Learning
Reinforcement learning, Q-LearningReinforcement learning, Q-Learning
Reinforcement learning, Q-Learning
 
Markov decision process
Markov decision processMarkov decision process
Markov decision process
 
25 introduction reinforcement_learning
25 introduction reinforcement_learning25 introduction reinforcement_learning
25 introduction reinforcement_learning
 
Multi-armed Bandits
Multi-armed BanditsMulti-armed Bandits
Multi-armed Bandits
 
Reinforcement Learning 4. Dynamic Programming
Reinforcement Learning 4. Dynamic ProgrammingReinforcement Learning 4. Dynamic Programming
Reinforcement Learning 4. Dynamic Programming
 
Planning and Learning with Tabular Methods
Planning and Learning with Tabular MethodsPlanning and Learning with Tabular Methods
Planning and Learning with Tabular Methods
 
Deep reinforcement learning from scratch
Deep reinforcement learning from scratchDeep reinforcement learning from scratch
Deep reinforcement learning from scratch
 
Reinforcement Learning 10. On-policy Control with Approximation
Reinforcement Learning 10. On-policy Control with ApproximationReinforcement Learning 10. On-policy Control with Approximation
Reinforcement Learning 10. On-policy Control with Approximation
 
Reinforcement Learning Q-Learning
Reinforcement Learning   Q-Learning Reinforcement Learning   Q-Learning
Reinforcement Learning Q-Learning
 
An introduction to deep reinforcement learning
An introduction to deep reinforcement learningAn introduction to deep reinforcement learning
An introduction to deep reinforcement learning
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
DQN (Deep Q-Network)
DQN (Deep Q-Network)DQN (Deep Q-Network)
DQN (Deep Q-Network)
 
An introduction to reinforcement learning
An introduction to reinforcement learningAn introduction to reinforcement learning
An introduction to reinforcement learning
 
Reinforcement learning 7313
Reinforcement learning 7313Reinforcement learning 7313
Reinforcement learning 7313
 
Continuous Control with Deep Reinforcement Learning, lillicrap et al, 2015
Continuous Control with Deep Reinforcement Learning, lillicrap et al, 2015Continuous Control with Deep Reinforcement Learning, lillicrap et al, 2015
Continuous Control with Deep Reinforcement Learning, lillicrap et al, 2015
 
Multi-Armed Bandit and Applications
Multi-Armed Bandit and ApplicationsMulti-Armed Bandit and Applications
Multi-Armed Bandit and Applications
 
Policy gradient
Policy gradientPolicy gradient
Policy gradient
 
Multi armed bandit
Multi armed banditMulti armed bandit
Multi armed bandit
 

Similar to Reinforcement Learning 7. n-step Bootstrapping

Temporal difference learning
Temporal difference learningTemporal difference learning
Temporal difference learningJie-Han Chen
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learningDing Li
 
Discrete sequential prediction of continuous actions for deep RL
Discrete sequential prediction of continuous actions for deep RLDiscrete sequential prediction of continuous actions for deep RL
Discrete sequential prediction of continuous actions for deep RLJie-Han Chen
 
Reinforcement learning:policy gradient (part 1)
Reinforcement learning:policy gradient (part 1)Reinforcement learning:policy gradient (part 1)
Reinforcement learning:policy gradient (part 1)Bean Yen
 
Deep Reinforcement learning
Deep Reinforcement learningDeep Reinforcement learning
Deep Reinforcement learningCairo University
 
Intro to Reinforcement learning - part II
Intro to Reinforcement learning - part IIIntro to Reinforcement learning - part II
Intro to Reinforcement learning - part IIMikko Mäkipää
 
reinforcement-learning-141009013546-conversion-gate02.pptx
reinforcement-learning-141009013546-conversion-gate02.pptxreinforcement-learning-141009013546-conversion-gate02.pptx
reinforcement-learning-141009013546-conversion-gate02.pptxMohibKhan79
 
Introduction to reinforcement learning
Introduction to reinforcement learningIntroduction to reinforcement learning
Introduction to reinforcement learningMarsan Ma
 
Reinfrocement Learning
Reinfrocement LearningReinfrocement Learning
Reinfrocement LearningNatan Katz
 
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...Universitat Politècnica de Catalunya
 
Aaa ped-24- Reinforcement Learning
Aaa ped-24- Reinforcement LearningAaa ped-24- Reinforcement Learning
Aaa ped-24- Reinforcement LearningAminaRepo
 
Structured prediction with reinforcement learning
Structured prediction with reinforcement learningStructured prediction with reinforcement learning
Structured prediction with reinforcement learningguruprasad110
 
Training DNN Models - II.pptx
Training DNN Models - II.pptxTraining DNN Models - II.pptx
Training DNN Models - II.pptxPrabhuSelvaraj15
 
Head First Reinforcement Learning
Head First Reinforcement LearningHead First Reinforcement Learning
Head First Reinforcement Learningazzeddine chenine
 
GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Rec...
GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Rec...GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Rec...
GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Rec...Balázs Hidasi
 

Similar to Reinforcement Learning 7. n-step Bootstrapping (20)

Temporal difference learning
Temporal difference learningTemporal difference learning
Temporal difference learning
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learning
 
Discrete sequential prediction of continuous actions for deep RL
Discrete sequential prediction of continuous actions for deep RLDiscrete sequential prediction of continuous actions for deep RL
Discrete sequential prediction of continuous actions for deep RL
 
Reinforcement learning:policy gradient (part 1)
Reinforcement learning:policy gradient (part 1)Reinforcement learning:policy gradient (part 1)
Reinforcement learning:policy gradient (part 1)
 
Deep Reinforcement learning
Deep Reinforcement learningDeep Reinforcement learning
Deep Reinforcement learning
 
Intro to Reinforcement learning - part II
Intro to Reinforcement learning - part IIIntro to Reinforcement learning - part II
Intro to Reinforcement learning - part II
 
reinforcement-learning-141009013546-conversion-gate02.pptx
reinforcement-learning-141009013546-conversion-gate02.pptxreinforcement-learning-141009013546-conversion-gate02.pptx
reinforcement-learning-141009013546-conversion-gate02.pptx
 
Introduction to reinforcement learning
Introduction to reinforcement learningIntroduction to reinforcement learning
Introduction to reinforcement learning
 
Reinfrocement Learning
Reinfrocement LearningReinfrocement Learning
Reinfrocement Learning
 
Machine learning
Machine learningMachine learning
Machine learning
 
Introduction to Deep Reinforcement Learning
Introduction to Deep Reinforcement LearningIntroduction to Deep Reinforcement Learning
Introduction to Deep Reinforcement Learning
 
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
 
Aaa ped-24- Reinforcement Learning
Aaa ped-24- Reinforcement LearningAaa ped-24- Reinforcement Learning
Aaa ped-24- Reinforcement Learning
 
Structured prediction with reinforcement learning
Structured prediction with reinforcement learningStructured prediction with reinforcement learning
Structured prediction with reinforcement learning
 
Training DNN Models - II.pptx
Training DNN Models - II.pptxTraining DNN Models - II.pptx
Training DNN Models - II.pptx
 
Head First Reinforcement Learning
Head First Reinforcement LearningHead First Reinforcement Learning
Head First Reinforcement Learning
 
Deep Q-Learning
Deep Q-LearningDeep Q-Learning
Deep Q-Learning
 
Deep Learning for Computer Vision: Optimization (UPC 2016)
Deep Learning for Computer Vision: Optimization (UPC 2016)Deep Learning for Computer Vision: Optimization (UPC 2016)
Deep Learning for Computer Vision: Optimization (UPC 2016)
 
Making Complex Decisions(Artificial Intelligence)
Making Complex Decisions(Artificial Intelligence)Making Complex Decisions(Artificial Intelligence)
Making Complex Decisions(Artificial Intelligence)
 
GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Rec...
GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Rec...GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Rec...
GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Rec...
 

Recently uploaded

FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 

Recently uploaded (20)

FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 

Reinforcement Learning 7. n-step Bootstrapping

  • 1. Chapter 7: n-step Bootstrapping Seungjae Ryan Lee
  • 2. ● Monte Carlo: wait until end of episode ● 1-step TD / TD(0): wait until next time step Bootstrapping target TD error MC error Recap: MC vs TD Return
  • 3. n-step Bootstrapping ● Perform update based on intermediate number of rewards ● Freed from the “tyranny of the time step” of TD ○ Different time step for action selection (1) and bootstrapping interval (n) ● Called n-step TD since they still bootstrap
  • 5. n-step TD Prediction ● Use truncated n-step return as target ○ Use n rewards and bootstrap ● Needs future rewards not available at timestep ● cannot be updated until timestep
  • 6. n-step TD Prediction: Pseudocode Compute n-step return Update V
  • 7. n-step TD Prediction: Convergence ● The n-step return has the error reduction property ○ Expectation of n-step return is a better estimate of than in the worst-state sense ● Converges to true value under appropriate technical conditions
  • 8. Random Walk Example ● Rewards only on exit (-1 on left exit, 1 on right exit) ● n-step return: propagate reward up to n latest states S17 S18 S19S1 S2 S3 R = -1 R = 1 Sample trajectory 1-step 2-step
  • 9. Random Walk Example: n-step TD Prediction ● Intermediate n does best
  • 10. n-step Sarsa ● Extend n-step TD Prediction to Control (Sarsa) ○ Need to use Q instead of V ○ Use ε-greedy policy ● Redefine n-step return with Q ● Naturally extend to Sarsa
  • 11. n-step Sarsa vs. Sarsa(0) ● Gridworld with nonzero reward only at the end ● n-step can learn much more from one episode
  • 13. n-step Expected Sarsa ● Same update as Sarsa except the last element ○ Consider all possible actions in the last step ● Same n-step return as Sarsa except the last step ● Same update as Sarsa
  • 14. Off-policy n-step Learning ● Need importance sampling ● Update target policy’s values with behavior policy’s returns ● Generalizes the on-policy case ○ If , then
  • 15. Off-policy n-step Sarsa ● Update Q instead of V ● Importance sampling ratio starts one step later for Q values ○ is already chosen
  • 17. Off-policy n-step Expected Sarsa ● Importance sampling ratio ends one step earlier for Expected Sarsa ● Use expected n-step return
  • 18. Per-decision Off-policy Methods: Intuition* ● More efficient off-policy n-step method ● Write returns recursively: ● Naive importance sampling ○ If , ○ Estimate shrinks, higher variance
  • 19. Per-decision Off-policy Methods* ● Better: If , leave the estimate unchanged ● Expected update is unchanged since ● Used with TD update without importance sampling Control Variate
  • 20. Per-decision Off-policy Methods: Q* ● Use Expected Sarsa’s n-step return ● Off-policy form with control variate: ● Analogous to Expected Sarsa after combining with TD update algorithm http://auai.org/uai2018/proceedings/papers/282.pdf
  • 21. n-step Tree Backup Algorithm ● Off-policy without importance sampling ● Update from entire tree of estimated action values ○ Leaf action nodes (not selected) contribute to the target ○ Selected action nodes does not contribute but weighs all next-level action values
  • 22. n-step Tree Backup Algorithm: n-step Return ● 1-step return ● 2-step return
  • 23. n-step Tree Backup Algorithm: n-step Return ● 2-step return ● n-step return
  • 24. n-step Tree Backup Algorithm: Pseudocode
  • 25. A Unifying Algorithm: n-step * ● Unify Sarsa, Tree Backup and Expected Sarsa ○ Decide on each step to use sample action (Sarsa) or expectation of all actions (Tree Backup)
  • 26. A Unifying Algorithm: n-step : Equations* ● : degree of sampling on timestep t ● Slide linearly between two weights: ○ Sarsa: Importance sampling ratio ○ Tree Backup: Policy probability
  • 27. A Unifying Algorithm: n-step : Pseudocode*
  • 28. Summary ● n-step: Look ahead to the next n rewards, states, and actions + Perform better than either MC or TD + Escapes the tyranny of the single time step - Delay of n steps before learning - More memory and computation per timestep ● Extended to Eligibility Traces (Ch. 12) + Minimize additional memory and computation - More complex ● Two approaches to off-policy n-step learning ○ Importance sampling: high variance ○ Tree backup: limited to few-step bootstrapping if policies are very different (even if n is large)
  • 30. Thank you! Original content from ● Reinforcement Learning: An Introduction by Sutton and Barto You can find more content in ● github.com/seungjaeryanlee ● www.endtoend.ai