These slides present Dreamer, a reinforcement-learning agent that learns a world model from image observations of its experience. Dreamer then learns behaviors by imagining future sequences with the world model and backpropagating value gradients through those imagined sequences. Experiments show that Dreamer outperforms prior model-free and model-based methods on a variety of visual control tasks, demonstrating that behaviors learned purely from latent imagination can solve challenging problems.
1. RL Comparison
Model-Free RL
• No Model
• Learn a value function (and/or policy) from real experience
Model-Based RL
• Learn a model from real experience
• Plan a value function (and/or policy) from simulated experience
RL comparison, from Sergey Levine's slides
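To make the contrast concrete, here is a small self-contained Python toy (my own illustration, not code from the paper or the slides): a model-free Q-learning agent and a model-based Dyna-Q-style agent on a 5-state chain, where the model-based agent additionally plans from simulated transitions.

import random
from collections import defaultdict

N, GAMMA, ALPHA = 5, 0.9, 0.5  # chain length, discount, learning rate

def step(s, a):
    # Toy deterministic chain: a=1 moves right, a=0 moves left; reward at the right end.
    s2 = min(s + 1, N - 1) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == N - 1 else 0.0)

def q_update(Q, s, a, r, s2):
    # One-step Q-learning backup.
    Q[(s, a)] += ALPHA * (r + GAMMA * max(Q[(s2, b)] for b in (0, 1)) - Q[(s, a)])

Q_free, Q_based = defaultdict(float), defaultdict(float)
model = {}  # learned deterministic model: (s, a) -> (s2, r)

for episode in range(50):
    s = 0
    for _ in range(20):
        a = random.choice((0, 1))
        s2, r = step(s, a)
        q_update(Q_free, s, a, r, s2)      # model-free: learn from real experience only
        q_update(Q_based, s, a, r, s2)     # model-based: same real update...
        model[(s, a)] = (s2, r)            # ...plus learn a model from real experience...
        for ms, ma in random.sample(list(model), min(5, len(model))):
            ms2, mr = model[(ms, ma)]      # ...and plan from simulated experience
            q_update(Q_based, ms, ma, mr, ms2)
        s = s2

print("Q_free(0, right)  =", round(Q_free[(0, 1)], 3))
print("Q_based(0, right) =", round(Q_based[(0, 1)], 3))

With the extra planning updates, Q_based approaches the optimal values noticeably faster than Q_free for the same amount of real experience, which is the data-efficiency argument for model-based RL.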
2. World Model
“Intelligent agents can achieve goals in complex environments
even though they never encounter the exact same situation twice.”
“This ability requires building representations of the world from past
experience that enable generalization to novel situations.”
“World models offer an explicit way to represent an agent’s knowledge
about the world in a parametric model that can make predictions about the
future.”
A World Model, from Scott McCloud’s Understanding Comics.
3. Visual Control
“While sensory inputs are high-dimensional images, latent dynamics models can
abstract the observations to predict forward in compact state spaces.”
→ latent states have a small memory footprint
“Behaviors can be derived from dynamics models in many ways.”
→ Considering only rewards within a fixed imagination horizon results in shortsighted behaviors
→ Prior work commonly resorts to derivative-free optimization for robustness
4. PlaNet
An RL agent that learns the environment dynamics from
images and chooses actions through fast online planning in
latent space.
Learning Latent Dynamics for Planning from Pixels (Danijar Hafner et al., 2019)
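PlaNet's online planner is the cross-entropy method (CEM), run entirely in latent space. The sketch below is a minimal runnable illustration of that idea; the `transition` and `reward` functions are toy stand-ins for the learned latent dynamics and reward models, not PlaNet's actual networks.

import numpy as np

def transition(s, a):
    # Toy stand-in for the learned latent dynamics model.
    return 0.9 * s + a

def reward(s):
    # Toy stand-in for the learned reward model: prefer latents near 1.0.
    return -np.abs(s - 1.0)

def cem_plan(state, horizon=12, candidates=1000, elites=100, iters=10):
    # Search over action sequences by iteratively refitting a Gaussian to the elites.
    mean, std = np.zeros(horizon), np.ones(horizon)
    for _ in range(iters):
        actions = mean + std * np.random.randn(candidates, horizon)
        s = np.full(candidates, state)
        returns = np.zeros(candidates)
        for t in range(horizon):          # roll the model forward, never the environment
            s = transition(s, actions[:, t])
            returns += reward(s)
        top = actions[np.argsort(returns)[-elites:]]
        mean, std = top.mean(axis=0), top.std(axis=0)
    return mean[0]                        # execute the first action, then re-plan

print("first planned action:", cem_plan(state=0.0))

Because this search is repeated at every environment step, planning dominates the cost of acting; Dreamer's motivation, next, is to remove that search.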
5. Dreamer
An RL agent that learns long-horizon behaviors from images purely by latent imagination.
The three processes of the Dreamer agent.
1. The world model is learned from past experience.
2. From predictions of this model, the agent then learns a value network
to predict future rewards and an actor network to select actions.
3. The actor network is used to interact with the environment.
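The shape of this loop, as a structural Python skeleton (every object and method name below is a hypothetical placeholder for exposition, not the paper's API; the three numbered comments mirror the list above):

def dreamer_iteration(env, world_model, actor, critic, buffer, horizon=15):
    # 1. Learn the world model from past experience.
    batch = buffer.sample()
    world_model.update(batch)             # image reconstruction + reward prediction

    # 2. Learn behaviors from imagined rollouts of the model.
    states = world_model.encode(batch)    # start imagination from real states
    imagined = world_model.imagine(states, actor, horizon)
    critic.update(imagined)               # value network: predict future rewards
    actor.update(imagined)                # actor network: select actions

    # 3. Use the actor network to interact with the environment.
    state = world_model.encode_step(env.observation)
    buffer.add(env.step(actor.act(state)))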
1. Learning the World Model
Dreamer learns a world model from experience. Using past images 𝑜1 ~ 𝑜3 and actions 𝑎1 ~
𝑎2, it computes a sequence of compact model states (green circles) from which it reconstructs
the images 𝑜1 ~ 𝑜3 and predicts the rewards 𝑟1 ~ 𝑟3. → Reuses the world model from PlaNet (a recurrent state-space model, RSSM)
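In code terms, the world-model update combines image reconstruction, reward prediction, and a KL term that keeps the prior transition model close to the posterior states. A minimal PyTorch-flavored sketch, assuming hypothetical `encoder`, `rssm`, `decoder`, and `reward_head` modules with the shown signatures (this mirrors the PlaNet/Dreamer objective in spirit, not the exact implementation):

import torch
import torch.nn.functional as F

def world_model_loss(obs, actions, rewards,
                     encoder, rssm, decoder, reward_head, kl_scale=1.0):
    embed = encoder(obs)                          # o_1..o_T -> embeddings
    post, prior = rssm.observe(embed, actions)    # posterior and prior state distributions
    states = post.rsample()                       # reparameterized compact model states
    recon_loss = F.mse_loss(decoder(states), obs)            # reconstruct o_1..o_T
    reward_loss = F.mse_loss(reward_head(states), rewards)   # predict r_1..r_T
    kl_loss = torch.distributions.kl_divergence(post, prior).mean()
    return recon_loss + reward_loss + kl_scale * kl_loss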
2. Learning Behavior in Imagination
Dreamer learns long-sighted behaviors from predicted sequences of model states. It first
learns the long-term values 𝑣2 ~ 𝑣3 of each state, and then predicts actions 𝑎1 ~ 𝑎2 that lead to
high rewards and values by backpropagating reward and value gradients through the state
sequence to the actor network.
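A minimal sketch of that update, again with hypothetical `world_model`, `actor`, and `critic` modules (the actor is assumed to return a reparameterizable torch distribution). The essential point is that the bootstrapped return stays differentiable through the imagined states, so its gradient reaches the actor; for brevity this uses a plain H-step return rather than the paper's full λ-return.

import torch

def imagine_and_learn(start_states, world_model, actor, critic,
                      actor_opt, critic_opt, horizon=15, gamma=0.99):
    states, s = [], start_states
    for _ in range(horizon):                   # roll out the model with the actor
        a = actor(s).rsample()                 # reparameterized sample keeps gradients
        s = world_model.transition(s, a)
        states.append(s)

    ret = critic(states[-1])                   # value beyond the imagination horizon
    for s in reversed(states):
        ret = world_model.reward(s) + gamma * ret

    actor_opt.zero_grad()
    (-ret.mean()).backward()                   # backprop value gradients to the actor
    actor_opt.step()

    critic_opt.zero_grad()                     # critic regresses toward the return
    value = critic(states[0].detach())
    ((value - ret.detach()) ** 2).mean().backward()
    critic_opt.step()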
PlaNet vs Dreamer
• For a given situation in the environment, PlaNet searches for the best action among many predictions for different
action sequences.
• Dreamer sidesteps this expensive search by decoupling planning and acting. Once its actor network has been
trained on predicted sequences, it computes the actions for interacting with the environment without additional search.
In addition, Dreamer considers rewards beyond the planning horizon using a value function and leverages
backpropagation for efficient planning.
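At decision time the contrast is a one-liner each (`cem_plan` is the planner sketched in the PlaNet section above; `actor` is a placeholder for the trained actor network):

def planet_act(state):
    # PlaNet: search over many imagined action sequences at every single step.
    return cem_plan(state)                 # expensive online search

def dreamer_act(state, actor):
    # Dreamer: the trained actor network amortizes that search away.
    return actor(state).sample()           # one forward pass, no search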
3. Act in the Environment
The agent encodes the history of the episode to compute the current model state and the next
action to execute in the environment.
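A minimal sketch of this interaction loop: the agent carries a recurrent model state across steps, which summarizes the episode history, rather than re-encoding the whole history each time. As before, `encoder`, `rssm`, `actor`, and the `env` interface are hypothetical placeholders.

def run_episode(env, encoder, rssm, actor):
    state, action = rssm.initial_state(), None
    obs, done = env.reset(), False
    while not done:
        state = rssm.posterior(state, action, encoder(obs))  # fold in the newest observation
        action = actor(state).sample()     # next action from the current model state
        obs, reward, done = env.step(action)
    return state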
4. Explained
1. Transition Model
2. Reward Model
3. Policy
4. Objective
(Imagined Rewards)
Actor-Critic Method
From a paper on deep multi-agent RL (Taiki Fuji et al.)
5. Actor-Critic Model
(Parametrized by 𝜙 & 𝜓, respectively)
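For reference, these five components as the Dreamer paper defines them (my transcription of the paper's notation, with θ the world-model parameters, φ the actor, and ψ the critic):

\begin{align*}
&\text{1. Transition model:} && q_\theta(s_\tau \mid s_{\tau-1}, a_{\tau-1}) \\
&\text{2. Reward model:} && q_\theta(r_\tau \mid s_\tau) \\
&\text{3. Policy (action model):} && a_\tau \sim q_\phi(a_\tau \mid s_\tau) \\
&\text{4. Objective (imagined rewards):} && \max \; \mathrm{E}\Big[\textstyle\sum_{\tau=t}^{t+H} \gamma^{\tau-t}\, r_\tau\Big] \\
&\text{5. Value model:} && v_\psi(s_\tau) \approx \mathrm{E}\Big[\textstyle\sum_{\tau'=\tau}^{t+H} \gamma^{\tau'-\tau}\, r_{\tau'}\Big]
\end{align*}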
4. Explained (cont.)
6. Value Estimation
7. Learning Objectives
Flow of the actor-critic method, from Sergey Levine's Deep RL slides
w/ Imagined Trajectories
↖ Exponentially decaying weights over the k-step returns
↖ Rewards beyond k steps are estimated with the learned value model
↖ Actor
↖ Critic
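The formulas the arrows annotate, as defined in the Dreamer paper: the k-step estimate V_N^k bootstraps with the learned value model at horizon h, the λ-return V_λ mixes all k-step estimates with exponentially decaying weights, and the actor and critic optimize it from opposite sides:

\begin{align*}
\mathrm{V}_N^k(s_\tau) &= \mathrm{E}\Big[\textstyle\sum_{n=\tau}^{h-1} \gamma^{n-\tau}\, r_n + \gamma^{h-\tau}\, v_\psi(s_h)\Big], \qquad h = \min(\tau + k,\; t + H) \\
\mathrm{V}_\lambda(s_\tau) &= (1-\lambda) \textstyle\sum_{n=1}^{H-1} \lambda^{n-1}\, \mathrm{V}_N^n(s_\tau) + \lambda^{H-1}\, \mathrm{V}_N^H(s_\tau) \\
\text{Actor:} \quad & \max_\phi \; \mathrm{E}\Big[\textstyle\sum_{\tau=t}^{t+H} \mathrm{V}_\lambda(s_\tau)\Big] \\
\text{Critic:} \quad & \min_\psi \; \mathrm{E}\Big[\textstyle\sum_{\tau=t}^{t+H} \tfrac{1}{2}\, \big\|\, v_\psi(s_\tau) - \mathrm{V}_\lambda(s_\tau) \big\|^2\Big]
\end{align*}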
1. Control Tasks
• Dreamer learns to solve 20 challenging continuous control tasks with
image inputs, 5 of which are displayed here.
The tasks are designed to pose a variety of challenges to the RL agent, including difficult-to-predict collisions, sparse rewards, chaotic dynamics, small but relevant objects, high degrees of freedom, and 3D perspectives.
• The visualizations show the same 64x64 images that the agent receives
from the environment.
2. Comparison
Dreamer outperforms the previous best model-free
(D4PG) and model-based (PlaNet) methods on the
benchmark of 20 tasks in terms of final performance,
data efficiency, and computation time.
3. Atari Games
Dreamer learns successful behaviors on Atari games and DeepMind Lab
levels, which feature discrete actions and visually more diverse scenes,
including 3D environments with multiple objects.
Conclusion
1. Learning behaviors from sequences predicted by world models alone can solve
challenging visual control tasks from image inputs, surpassing the performance
of previous model-free approaches.
2. Dreamer demonstrates that learning behaviors by backpropagating value gradients through
predicted sequences of compact model states is successful and robust, solving a diverse
collection of continuous and discrete control tasks.
My Questions
1. Is there any relation between world models and the “common sense” that Yann LeCun talks about?
2. Is there any evidence for the mechanism of human prediction and dreaming?
What we see is based on our brain’s prediction of the future. (A. Kitaoka, Kanzen, 2002.)
Thank You for Listening!
Learning Behaviors by Latent Imagination
“DQN model” image generated with the text-to-image tool PixRay