A summary of Chapter 8: Planning and Learning with Tabular Methods of the book 'Reinforcement Learning: An Introduction' by Sutton and Barto. You can find the full book on Professor Sutton's website: http://incompleteideas.net/book/the-book-2nd.html
Check my website for more slides on books and papers!
https://www.endtoend.ai
2. Major Themes of Chapter 8
1. Unifying planning and learning methods
• Model-based methods rely on planning
• Model-free methods rely on learning
2. Benefits of planning in small, incremental steps
• Key to efficient combination of planning and learning
3. Model of the Environment
• Anything the agent can use to predict how the environment responds to its actions
• Used to mimic or simulate experience
1. Distribution models
• Describe all possible outcomes and their probabilities
• Stronger but more difficult to obtain
2. Sample models
• Produce one possibility, sampled according to the probabilities
• Weaker but easier to obtain
4. Planning
• Process of using a model to produce or improve a policy
• State-space planning
• Search through the state space for an optimal policy
• Includes the RL approaches introduced so far
• Plan-space planning
• Search through the space of plans
• Not efficient in RL
• ex) Evolutionary methods
5. State-space Planning
• All state-space planning methods share a common structure
• They involve computing value functions
• Value functions are computed by update or backup operations
• ex) Dynamic Programming
(Diagram: Dynamic Programming as state-space planning: a distribution model swept through the state space, alternating policy evaluation and policy improvement)
6. Relationship between Planning and Learning
• Both estimate value functions by backup (update) operations
• Planning uses simulated experience from a model
• Learning uses real experience from the environment
7. Random-sample one-step tabular Q-planning
• Planning and learning are similar enough that algorithms can be transferred between them
• Same convergence guarantees as one-step tabular Q-learning
• All states and actions must be selected infinitely many times
• Step sizes must decrease appropriately over time
• Needs only a sample model
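Below is a minimal Python sketch of random-sample one-step tabular Q-planning under these assumptions; the 5-state chain sample_model, the state and action sets, and the parameter values are illustrative inventions, not from the book or the slides.

```python
import random
from collections import defaultdict

# Hypothetical toy sample model: a 5-state chain where action 1 moves right,
# action 0 moves left, and reaching state 4 yields reward +1.
def sample_model(s, a):
    s_next = min(s + 1, 4) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == 4 else 0.0
    return s_next, reward

def q_planning(n_iterations=10_000, alpha=0.1, gamma=0.95):
    states, actions = range(5), (0, 1)
    Q = defaultdict(float)
    for _ in range(n_iterations):
        # 1. Select a state and an action uniformly at random
        s, a = random.choice(states), random.choice(actions)
        # 2. Obtain a simulated next state and reward from the sample model
        s_next, r = sample_model(s, a)
        # 3. One-step tabular Q-learning update applied to the simulated transition
        best_next = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q

Q = q_planning()
```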
9. On-line Planning
• New experience must be incorporated while planning is in progress
• Computational resources must be divided between decision making and model learning
10. Roles of Real Experience
1. Model learning: Improve the model to increase accuracy
2. Direct RL: Directly improve the value function and policy
11. Direct Reinforcement Learning
• Improve value functions and policy with real experience
• Simpler, and unaffected by bias in the model design
12. Indirect Reinforcement Learning
• Improve value functions and policy by improving the model
• Can achieve a better policy with fewer environmental interactions
13. Dyna-Q
• A simple on-line planning agent that includes all of these processes
• Planning: random-sample one-step tabular Q-planning
• Direct RL: one-step tabular Q-learning
• Model learning: return the last observed next state and reward as the prediction
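A minimal sketch of one tabular Dyna-Q episode along these lines, assuming a deterministic environment exposed through reset() and step(s, a) returning (next_state, reward, done); that interface, Q as a defaultdict(float), and the parameter values are illustrative assumptions.

```python
import random

def dyna_q_episode(env, Q, model, actions, n_planning=10,
                   alpha=0.1, gamma=0.95, epsilon=0.1):
    """One episode of tabular Dyna-Q; Q maps (s, a) -> value, model maps (s, a) -> (s', r)."""
    s = env.reset()
    done = False
    while not done:
        # (a) Acting: epsilon-greedy action selection
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda b: Q[(s, b)])
        s_next, r, done = env.step(s, a)

        # (b) Direct RL: one-step tabular Q-learning on the real transition
        best_next = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

        # (c) Model learning: store the last observed next state and reward
        model[(s, a)] = (s_next, r)

        # (d) Planning: n iterations of random-sample one-step tabular Q-planning
        for _ in range(n_planning):
            ps, pa = random.choice(list(model.keys()))
            ps_next, pr = model[(ps, pa)]
            best = max(Q[(ps_next, b)] for b in actions)
            Q[(ps, pa)] += alpha * (pr + gamma * best - Q[(ps, pa)])

        s = s_next
    return Q
```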
14. Dyna-Q: Implementation
• Order of the processes on a serial computer
• Can be parallelized into a reactive, deliberative agent
• Reactive: responding instantly to the latest information
• Deliberative: always planning in the background
• Planning is the most computationally intensive process
• Complete n iterations of the Q-planning algorithm on each timestep
(Diagram: Dyna processes: acting, direct RL, model learning, and planning)
18. Dyna Maze Example: Intuition
• Without planning, each episode adds only one step to the policy
• With planning, an extensive policy is developed by halfway through the second episode
(Figure: policies found with and without planning, halfway through the second episode; the black square marks the agent's location)
19. Possibility of a Wrong Model
• The model can be wrong for various reasons:
• A stochastic environment observed with a limited number of samples
• Function approximation
• The environment has changed
• A wrong model can lead to a suboptimal policy
20. Optimistic Wrong Model: Blocking Maze Example
• The agent can correct modelling errors when the model is optimistic
• The agent attempts to "exploit" false opportunities
• In doing so, it discovers that they do not exist
Environment changes after 1000 timesteps
22. Pessimistic Wrong Model: Shortcut Maze Example
• The agent may never discover the modelling error when the model is pessimistic
• The agent never realizes that a better path exists
• Unlikely to be discovered even with an ε-greedy policy
Environment changes after 3000 timesteps
24. Dyna-Q+
• Need to balance exploration and exploitation
• Keep track of the elapsed time τ since the last visit to each state-action pair
• Add a bonus reward of κ√τ during planning
• Allow untried state-action pairs to be considered in planning
• Costly, but this computational curiosity is worth the cost
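A small sketch of how a planning update could incorporate the Dyna-Q+ bonus; tau (timesteps since a pair was last tried in the real environment) and kappa are bookkeeping names assumed here for illustration.

```python
import math

def planning_update_with_bonus(Q, model, tau, s, a, actions,
                               alpha=0.1, gamma=0.95, kappa=1e-3):
    """One planning update with the Dyna-Q+ exploration bonus."""
    s_next, r = model[(s, a)]
    # Add a bonus of kappa * sqrt(tau) to the modeled reward, so that
    # long-untried state-action pairs look more attractive during planning.
    r_bonus = r + kappa * math.sqrt(tau[(s, a)])
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r_bonus + gamma * best_next - Q[(s, a)])
```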
25. Prioritized Sweeping: Intuition
• Focusing simulated transitions and updates can make planning more efficient
• At the start of the second episode, only the penultimate state (the one leading into the goal) has a nonzero value estimate
• Almost all updates do nothing useful
26. Prioritized Sweeping: Backward Focusing
• Intuition: work backward from "goal states"
• Work back from any state whose value has changed
• A change in one state's value typically implies that other states' values should change too
• Update the predecessor states of the changed state
(Diagram: when V(s') changes, the predecessor pairs (s, a) leading into s' should also be updated)
27. Prioritized Sweeping
• Not all changes are equally useful; usefulness depends on:
• The magnitude of the change in value
• The transition probabilities
• Prioritize updates with a priority queue
• Pop the highest-priority pair and update it
• Insert all predecessor pairs whose predicted change is above some small threshold
• (If a pair is already in the queue, keep only the entry with higher priority)
• Repeat until quiescence (the queue is empty)
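A sketch of the prioritized-sweeping planning loop for a deterministic model, assuming Q is a defaultdict(float), model maps (s, a) to the last observed (s', r), and predecessors maps a state to the (s, a) pairs predicted to lead into it; the structure and parameter values are illustrative.

```python
import heapq

def prioritized_sweeping(Q, model, predecessors, actions,
                         theta=1e-4, alpha=0.5, gamma=0.95, max_updates=10_000):
    pqueue, in_queue = [], {}        # max-priority queue via negated priorities

    def push(pair, priority):
        # Only queue changes above the threshold; keep the higher priority if already queued
        if priority > theta and priority > in_queue.get(pair, 0.0):
            in_queue[pair] = priority
            heapq.heappush(pqueue, (-priority, pair))

    # Seed the queue with every modeled pair whose update would matter
    for (s, a), (s_next, r) in model.items():
        target = r + gamma * max(Q[(s_next, b)] for b in actions)
        push((s, a), abs(target - Q[(s, a)]))

    updates = 0
    while pqueue and updates < max_updates:
        neg_p, (s, a) = heapq.heappop(pqueue)
        if in_queue.get((s, a), 0.0) != -neg_p:
            continue                              # stale entry, skip it
        del in_queue[(s, a)]
        # Update the popped pair
        s_next, r = model[(s, a)]
        target = r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        updates += 1
        # Insert predecessors of s whose predicted change exceeds the threshold
        for (s_bar, a_bar) in predecessors.get(s, ()):
            _, r_bar = model[(s_bar, a_bar)]
            p_target = r_bar + gamma * max(Q[(s, b)] for b in actions)
            push((s_bar, a_bar), abs(p_target - Q[(s_bar, a_bar)]))
    return Q
```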
30. Prioritized Sweeping: Rod Maneuvering Example
• Maneuver a rod around obstacles
• 14,400 potential states, 4 actions (translate, rotate)
• Too large to be solved without prioritization
31. Prioritized Sweeping: Limitations
• Stochastic environments
• Prioritized sweeping uses expected updates rather than sample updates
• Can waste computation on low-probability transitions
• What about using sample updates instead?
32. Expected vs Sample Updates
• A recurring theme throughout RL
• Either kind of update can be used for planning
33. Expected vs Sample Updates
• Expected updates yield better estimates
• Sample updates require less computation
34. Expected vs Sample Updates: Computation
• Branching factor b: the number of possible next states
• Determines the computation needed for an expected update
• T(Expected update) ≈ b × T(Sample update)
• If there is enough time for an expected update, the resulting estimate is usually better
• No sampling error
• If not, a sample update can at least somewhat improve the estimate
• Smaller steps
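A sketch contrasting the two update forms for action values, assuming a distribution model dist_model(s, a) that returns (probability, next_state, reward) triples and a sample model sample_model(s, a) that returns one (next_state, reward) pair; both names are illustrative.

```python
def expected_update(Q, dist_model, s, a, actions, gamma=0.95):
    # Sums over all possible next states, so its cost grows with the branching factor b
    Q[(s, a)] = sum(
        p * (r + gamma * max(Q[(s_next, b)] for b in actions))
        for p, s_next, r in dist_model(s, a)
    )

def sample_update(Q, sample_model, s, a, actions, alpha=0.1, gamma=0.95):
    # Uses a single sampled next state, so its cost is independent of b
    s_next, r = sample_model(s, a)
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```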
35. Expected vs Sample Updates: Analysis
• Sample updates reduce the error according to √((b − 1) / (b·t)), where t is the number of sample updates performed
• This analysis does not account for sample updates yielding better estimates of successor states sooner
36. Distributing Updates
1. Dynamic Programming
• Sweeps through the entire state space (or state-action space)
2. Dyna-Q
• Samples uniformly
• Both suffer from spending most of their updates on irrelevant states
37. Trajectory Sampling
• Gather experience by simulating explicit individual trajectories
• Sample state transitions and rewards from the model
• Sample actions from the current policy
• Effects of the on-policy distribution:
• Can ignore vast, uninteresting parts of the state space
• Significantly advantageous when function approximation is used
(Diagram: a simulated trajectory S1 → S2 → S3 with rewards R2 = +1 and R3 = -1, generated by alternating the policy and the model)
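A minimal sketch of on-policy trajectory sampling for planning, assuming a sample_model(s, a), a Q table (e.g. a defaultdict(float)), and an ε-greedy policy over Q; the goal-state handling and parameter values are illustrative assumptions.

```python
import random

def simulate_trajectory(Q, sample_model, start_state, actions, goal_states,
                        epsilon=0.1, alpha=0.1, gamma=0.95, max_steps=100):
    """Generate one simulated trajectory, sampling actions on-policy and
    updating Q only along the states the trajectory actually visits."""
    s = start_state
    for _ in range(max_steps):
        # Sample the action from the current (epsilon-greedy) policy
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda b: Q[(s, b)])
        # Sample the next state and reward from the model
        s_next, r = sample_model(s, a)
        # One-step update of the visited pair
        best_next = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        if s_next in goal_states:
            break
        s = s_next
    return Q
```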
38. Trajectory Sampling: On-policy Distribution Example
• Faster planning initially, but slower in the long run
• The advantage is greater when the branching factor b is small (planning can focus on just a few states)
39. Trajectory Sampling: On-policy Distribution Example
• The advantage is long-lasting when the state space is large
• Focusing on relevant states has a bigger impact when the state space is large
40. Real-time Dynamic Programming (RTDP)
• An on-policy trajectory-sampling value iteration algorithm
• Gathers real or simulated trajectories
• A form of asynchronous DP: nonsystematic sweeps
• Updates the visited states with value iteration (expected updates)
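A sketch of the core RTDP loop, assuming a distribution model dist_model(s, a) returning (probability, next_state, reward) triples and V as a defaultdict(float); the trajectory termination and parameter values are illustrative assumptions.

```python
import random

def rtdp_trajectory(V, dist_model, start_state, actions, goal_states,
                    gamma=1.0, max_steps=1000):
    """One RTDP trajectory: full value-iteration (expected) updates
    applied only to the states actually visited, acting greedily."""
    def q_value(s, a):
        return sum(p * (r + gamma * V[s_next])
                   for p, s_next, r in dist_model(s, a))

    s = start_state
    for _ in range(max_steps):
        if s in goal_states:
            break
        # Expected (value-iteration) update of the current state
        V[s] = max(q_value(s, a) for a in actions)
        # Act greedily on the updated values
        a = max(actions, key=lambda b: q_value(s, b))
        # The next state is sampled according to the model's probabilities
        triples = dist_model(s, a)
        _, s, _ = random.choices(triples, weights=[p for p, _, _ in triples], k=1)[0]
    return V
```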
41. RTDP: Prediction and Control
• Prediction problem: can skip any state not reachable under the policy
• Control problem: find an optimal partial policy
• A policy that is optimal for the relevant states but can be arbitrary elsewhere
• Finding such a policy generally requires visiting all (s, a) pairs infinitely many times
42. Stochastic Optimal Path Problems
• Conditions:
• Undiscounted episodic task with absorbing goal states
• The initial value of every goal state is 0
• At least one policy is guaranteed to reach a goal state from any starting state
• All rewards for transitions from non-goal states are negative
• All initial values are optimistic (equal to or greater than their optimal values)
• Guarantee:
• RTDP finds an optimal partial policy without visiting all (s, a) pairs infinitely many times
43. Stochastic Optimal Path Problem: Racetrack
• 9,115 reachable states, of which 599 are relevant
• RTDP needs about half the updates of DP
• Visits almost all states at least once
• But quickly focuses on the relevant states
44. DP vs. RTDP: Checking Convergence
• DP: update with exhaustive sweeps until Δv is sufficiently small
• Unaware of the policy's performance until the value function has converged
• Could lead to overcomputation
• RTDP: update along actual or simulated trajectories
• Policy performance can be checked on the trajectories themselves
• Can detect convergence earlier than DP
45. Background Planning vs. Decision-Time Planning
• Background planning
• Gradually improves a policy or value function
• Not focused on the current state
• Better when low-latency action selection is required
• Decision-time planning
• Selects a single action through planning
• Focused on the current state
• Typically discards the values / policy used in planning after each action selection
• Most useful when fast responses are not required
46. Heuristic Search
• Generate a tree of possible continuations from each encountered state
• Compute the best action from the search tree
• More computation is needed per decision
• Slower response time
• The value function can be held constant or updated
• Works best with a perfect model and an imperfect action-value function
47. Heuristic Search: Focusing States
• Focus on states that might immediately follow the current state
• Computation: generate a tree with the current state as its root
• Memory: store estimates only for the relevant states
• Particularly effective when the state space is large (ex. chess)
48. Heuristic Search: Focusing Updates
• Focus the distribution of updates on the current state
• Construct a search tree
• Perform one-step updates from the bottom up
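A minimal sketch of decision-time heuristic search, assuming a distribution model dist_model(s, a) of (probability, next_state, reward) triples and a stored evaluation function V used at the leaves; the depth parameter and all names are illustrative.

```python
def backed_up_value(s, depth, dist_model, actions, V, gamma=0.95):
    """Depth-limited search from s: back up one-step expected values
    from the leaves (evaluated with V) toward the root."""
    if depth == 0:
        return V[s]                      # leaf: use the stored evaluation
    return max(
        sum(p * (r + gamma * backed_up_value(s_next, depth - 1,
                                             dist_model, actions, V, gamma))
            for p, s_next, r in dist_model(s, a))
        for a in actions
    )

def select_action(s, depth, dist_model, actions, V, gamma=0.95):
    """Decision-time planning: choose the action with the best backed-up value."""
    def q(a):
        return sum(p * (r + gamma * backed_up_value(s_next, depth - 1,
                                                    dist_model, actions, V, gamma))
                   for p, s_next, r in dist_model(s, a))
    return max(actions, key=q)
```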
49. Rollout Algorithms
• Estimate action values for the current state by averaging the returns of trajectories simulated with a rollout policy
• Choose the action with the highest estimated value
• Does not compute Q for all states / actions (unlike MC control)
• Not a learning algorithm, since values and policies are not stored between decisions
http://www.wbc.poznan.pl/Content/351255/Holfeld_Denise_Simroth_Axel_A_Monte_Carlo_Rollout_algorithm_for_stock_control.pdf
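A sketch of a rollout algorithm for a single decision, assuming a sample_model(s, a), a rollout_policy(s) function, and a set of terminal goal_states; all names and parameter values are illustrative.

```python
def simulated_return(s, a, sample_model, rollout_policy, goal_states,
                     gamma=1.0, max_steps=200):
    """Return of one trajectory that starts with (s, a), then follows the rollout policy."""
    s_next, r = sample_model(s, a)
    ret, discount, steps = r, gamma, 0
    while s_next not in goal_states and steps < max_steps:
        a_next = rollout_policy(s_next)
        s_next, r = sample_model(s_next, a_next)
        ret += discount * r
        discount *= gamma
        steps += 1
    return ret

def rollout_action(s, actions, sample_model, rollout_policy, goal_states,
                   n_trajectories=50):
    """Estimate each action's value by averaging simulated returns, then act
    greedily; nothing is stored between decisions."""
    def q(a):
        returns = [simulated_return(s, a, sample_model, rollout_policy, goal_states)
                   for _ in range(n_trajectories)]
        return sum(returns) / len(returns)
    return max(actions, key=q)
```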
50. Rollout Algorithms: Policy
• The resulting policy satisfies the policy improvement theorem with respect to the rollout policy
• Equivalent to one step of the policy iteration algorithm
• Rollout seeks to improve upon the default (rollout) policy, not to find the optimal policy
• Better rollout policy → better estimates → better policy from the rollout algorithm
51. Rollout Algorithms: Time Constraints
• Good action-value estimates require many simulated trajectories
• Rollout algorithms often run under strict time constraints
• Possible mitigations:
1. Run trajectories on separate processors
2. Truncate simulated trajectories before termination
• Use a stored evaluation function at the truncation point
3. Prune unlikely candidate actions
52. Monte Carlo Tree Search (MCTS)
• A rollout algorithm with directed simulations
• Accumulates value estimates in a tree structure
• Directs simulations toward trajectories with higher estimated returns
• Behind the major successes in computer Go
53. Monte Carlo Tree Search: Algorithm
• Repeat while time remains:
a. Selection: select the start of the trajectory by descending the tree
b. Expansion: expand the tree
c. Simulation: simulate an episode
d. Backup: update the values
• Then select an action by some criterion:
a. Largest action value
b. Largest visit count
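A compact sketch of the four MCTS steps, assuming a deterministic sample_model(s, a) returning (next_state, reward), a terminal-state test, UCB1 as the tree policy, and a uniformly random rollout policy; these choices and names are illustrative, not tied to any particular Go program.

```python
import math
import random

class Node:
    """One state node of the search tree (deterministic transitions assumed)."""
    def __init__(self, state):
        self.state = state
        self.children = {}        # action -> (reward, child Node)
        self.visits, self.value = 0, 0.0

def ucb1(parent, child, c):
    if child.visits == 0:
        return float("inf")
    return child.value + c * math.sqrt(math.log(parent.visits) / child.visits)

def mcts(root_state, actions, sample_model, is_terminal,
         n_iterations=1000, c=1.4, rollout_len=100):
    root = Node(root_state)
    for _ in range(n_iterations):
        node, ret = root, 0.0
        path = [(root, 0.0)]      # (node, reward accumulated before reaching it)

        # a. Selection: descend with the tree policy while nodes are fully expanded
        while node.children and len(node.children) == len(actions):
            a = max(actions, key=lambda b: ucb1(node, node.children[b][1], c))
            r, node = node.children[a]
            ret += r
            path.append((node, ret))

        # b. Expansion: add one unexplored child (unless the node is terminal)
        if not is_terminal(node.state):
            a = random.choice([b for b in actions if b not in node.children])
            s_next, r = sample_model(node.state, a)
            child = Node(s_next)
            node.children[a] = (r, child)
            ret += r
            node = child
            path.append((node, ret))

        # c. Simulation: random rollout from the new node, beyond the tree
        s = node.state
        for _ in range(rollout_len):
            if is_terminal(s):
                break
            s, r = sample_model(s, random.choice(actions))
            ret += r

        # d. Backup: each tree node tracks the mean return observed from it onward
        for n, prefix in path:
            n.visits += 1
            n.value += ((ret - prefix) - n.value) / n.visits

    # Act: choose the most-visited action at the root
    return max(root.children, key=lambda b: root.children[b][1].visits)
```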
54. Monte Carlo Tree Search: Selection
• Start at the root node (the current state)
• Traverse down the tree with the tree policy to select a leaf node
55. Monte Carlo Tree Search: Expansion
• Expand the selected leaf node
• Add one or more child nodes via unexplored actions
56. Monte Carlo Tree Search: Simulation
• From the leaf node (or a newly added child node), simulate a complete episode
• This generates a Monte Carlo trial
• Actions are selected first by the tree policy
• Beyond the tree, actions are selected by the rollout policy
57. Monte Carlo Tree Search: Backup
• Update, or initialize, the values of the nodes traversed by the tree policy
• No values are saved for the part of the trajectory beyond the tree
60. Monte Carlo Tree Search: Insight
• Can use online, incremental, sample-based methods
• Can focus Monte Carlo trials on the initial segments of previously high-returning trajectories
• Can efficiently grow a partial value table
• Does not need to save all values
• Does not need function approximation
61. Summary of Chapter 8
• Planning requires a model of the environment
• A distribution model consists of transition probabilities
• A sample model produces single transitions and rewards
• Planning and learning share many similarities
• Any learning method can be converted into a planning method
• Planning can vary in the size of updates
• ex) 1-step sample updates
• Planning can vary in the distribution of updates
• ex) Prioritized sweeping, on-policy trajectory sampling, RTDP
• Planning can focus forward from pertinent states
• Decision-time planning
• ex) Heuristic search, rollout algorithms, MCTS
62. Summary of Part I
• Three underlying ideas:
a. Estimate value functions
b. Back up values along actual or possible state trajectories
c. Use Generalized Policy Iteration (GPI)
• Keep an approximate value function and an approximate policy
• Use each to improve the other
63. Summary of Part I: Dimensions
• Three important dimensions:
a. Sample updates vs. expected updates
• Sample update: based on a sample trajectory
• Expected update: based on the distribution model
b. Depth of updates: degree of bootstrapping
c. On-policy vs. off-policy methods
• One important dimension left undiscussed:
Function approximation
64. Summary of Part I: Other Dimensions
• Episodic vs. continuing returns
• Discounted vs. undiscounted returns
• Action values vs. state values vs. afterstate values
• Exploration methods
• ε-greedy, optimistic initialization, softmax, UCB
• Synchronous vs. asynchronous updates
• Real vs. simulated experience
• Location, timing, and memory of updates
• Which state or state-action pair to update in model-based methods?
• Should updates be done as part of selecting actions, or only afterward?
• How long should the updated values be retained?
65. Thank you!
Original content from
• Reinforcement Learning: An Introduction by Sutton and Barto
You can find more content at
• github.com/seungjaeryanlee
• www.endtoend.ai