This document discusses deep reinforcement learning and concept network reinforcement learning. It begins with an introduction to reinforcement learning concepts like Markov decision processes and value-based methods. It then describes Concept-Network Reinforcement Learning which decomposes complex tasks into high-level concepts or actions. This allows composing existing solutions to sub-problems without retraining. The document provides examples of using concept networks for lunar lander and robot pick-and-place tasks. It concludes by discussing how concept networks can improve sample efficiency, especially for sparse reward problems.
3. A Reinforcement Learning Example
Rocket Trajectory Optimization:
OpenAI Gym’s LunarLander Simulator
4. A Reinforcement Learning Example
State:
x_position
y_position
x_velocity
y_velocity
angle
angular velocity
left_leg
right_leg
Action (Discrete):
do nothing (0)
fire left engine (1)
fire main engine (2)
fire right engine (3)
Action (Continuous):
main engine power
left/right engine power
Reward: Moving from the top of the screen to the landing pad with zero speed is worth about
100-140 points. The episode finishes if the lander crashes or comes to rest, with an additional
-100 or +100 respectively. Each leg-ground contact is +10. Firing the main engine costs -0.3
points per frame.
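As a concrete illustration, a minimal interaction loop with this simulator might look like the following sketch (assuming the classic gym API and the LunarLander-v2 environment id; newer gymnasium releases split done into terminated/truncated):

import gym

# Random agent on LunarLander-v2, using the discrete actions 0-3 listed above.
env = gym.make("LunarLander-v2")
obs = env.reset()                       # 8-dimensional state vector
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()  # do nothing / left / main / right engine
    obs, reward, done, info = env.step(action)
    total_reward += reward              # leg-contact bonuses, crash/landing bonus, fuel cost
print("episode return:", total_reward)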
5. Basic RL Concepts
Reward Hypothesis
Goals can be described by maximizing the expected cumulative reward.
Sequential Decision Making
Actions may have long-term consequences.
Rewards may be delayed, like a financial investment.
Sometimes the agent sacrifices instant rewards to maximize long-term reward (just like life!)
State Data
Sequential and non-i.i.d.
Agent’s actions affect the next data samples.
6. Definitions
Policy
Dictates agent’s behavior, and maps from state to action:
Deterministic policy: a = π(s)
Stochastic policy: π(a|s) = P(A_t = a | S_t = s)
Value function
Determines how good each state (and action) is:
V_π(s) = E_π[ R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... | S_t = s ]
Q_π(s, a)
Model
Predicts what the environment will do next (simulator’s job for instance)
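To make the policy and value-function definitions above concrete, here is a purely illustrative sketch (the hand-written decision rules and the LunarLander-style state indexing are assumptions, not part of the slides):

import random

GAMMA = 0.99  # discount factor

def deterministic_policy(state):
    # a = pi(s): the same state always maps to the same action
    return 2 if state[3] < -0.5 else 0   # e.g. fire main engine when falling fast (illustrative rule)

def stochastic_policy(state):
    # pi(a|s): a probability distribution over actions, sampled at each step
    probs = [0.7, 0.1, 0.1, 0.1] if state[3] >= -0.5 else [0.1, 0.1, 0.7, 0.1]
    return random.choices([0, 1, 2, 3], weights=probs)[0]

def discounted_return(rewards):
    # R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...  (the quantity V_pi averages)
    return sum((GAMMA ** k) * r for k, r in enumerate(rewards))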
7. Agent and Environment
At each time step, the agent:
Receives observation
Receives reward
Takes action
The environment:
Receives action
Sends next observation
Sends next reward
8. Markov Decision Processes (MDP)
Mathematical framework for sequential decision making.
An environment in which all states are Markovian: the future is independent of the past given the present, P[S_{t+1} | S_t] = P[S_{t+1} | S_1, ..., S_t].
A Markov Decision Process is a tuple ⟨S, A, P, R, γ⟩: a state space, an action space, transition probabilities, a reward function, and a discount factor.
Pictures from David Silver’s Slides
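For intuition, such a tuple can be written out explicitly for a toy problem. The sketch below is purely illustrative (a hypothetical two-state battery/recharging example, not taken from the slides):

# A toy MDP <S, A, P, R, gamma> spelled out as Python data.
# P[s][a] lists (probability, next_state) pairs; R[s][a] is the expected immediate reward.
S = ["low_battery", "high_battery"]
A = ["wait", "search"]
P = {
    "high_battery": {"wait":   [(1.0, "high_battery")],
                     "search": [(0.8, "high_battery"), (0.2, "low_battery")]},
    "low_battery":  {"wait":   [(1.0, "low_battery")],
                     "search": [(0.6, "low_battery"), (0.4, "high_battery")]},
}
R = {
    "high_battery": {"wait": 0.0, "search": 1.0},
    "low_battery":  {"wait": 0.0, "search": -1.0},
}
GAMMA = 0.9  # discount factor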
9. Exploration vs. Exploitation
Exploration vs. Exploitation Dilemma
● Reinforcement learning (especially model-free) is like trial-and-error learning.
● The agent should find a good policy that maximizes future rewards from its experience
of the environment, in a potentially very large state space.
● Exploration finds more information about the environment, while exploitation uses
known information to maximize reward.
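A common way to balance the two is an epsilon-greedy rule, sketched below (illustrative only; q_values is assumed to hold the current action-value estimates for one state):

import random

def epsilon_greedy(q_values, epsilon=0.1):
    # With probability epsilon explore (random action); otherwise exploit the current estimate.
    if random.random() < epsilon:
        return random.randrange(len(q_values))                   # exploration
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploitation (greedy)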
10. Value Based Methods: Q-Learning
What are the problems?
● The iterative update is not scalable:
● Computing Q(s, a) for every state-action pair is not feasible most of the time.
Solution:
● Use a function approximator, such as a (differentiable) neural network, to estimate Q(s, a).
Using the Bellman equation as an iterative update to find the optimal policy:
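In tabular form this update is Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ]; a minimal sketch, assuming discrete, hashable states and actions:

from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.99
Q = defaultdict(float)   # Q[(state, action)] -> current estimate of the action value

def q_learning_step(s, a, r, s_next, actions):
    # Move Q(s, a) toward the Bellman target r + gamma * max_a' Q(s', a').
    target = r + GAMMA * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])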
11. Value Based Methods: Q-Learning
Use a function approximator to estimate the action-value function:
Q(s, a; θ) ≈ Q*(s, a)
θ is the function parameter (the weights of the neural network).
The function approximator can be a deep neural network: DQN
Loss Function:
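The standard DQN loss is the squared Bellman error, L(θ) = E[ ( r + γ max_{a'} Q(s', a'; θ⁻) − Q(s, a; θ) )² ]. A minimal PyTorch-style sketch (q_net, target_net, and the batch layout are assumptions; the separate frozen target network θ⁻ is one common stabilization):

import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    # batch: (states, actions, rewards, next_states, done_flags) as tensors
    s, a, r, s2, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a; theta)
    with torch.no_grad():                                  # target uses frozen parameters theta-
        target = r + gamma * (1 - done) * target_net(s2).max(dim=1).values
    return F.mse_loss(q_sa, target)                        # squared Bellman error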
12. Value Based Methods: DQN
Learning from batches of consecutive samples is problematic and costly:
- Sample correlation: consecutive samples are correlated, which in turn makes learning inefficient.
- Bad feedback loops: the current Q-network parameters dictate the next training samples and can
lead to bad feedback loops (e.g. if the maximizing action is to move left, training samples will
be dominated by samples from the left-hand side).
To solve both, use Experience Replay:
- Continually update a replay memory table of transitions (s_t, a_t, r_t, s_{t+1}).
- Train the Q-network on random mini-batches of transitions from the replay memory.
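A replay memory can be as simple as a bounded deque sampled uniformly at random; a minimal, framework-independent sketch:

import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)        # oldest transitions are evicted automatically

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform random mini-batches break the correlation between consecutive samples.
        return random.sample(self.buffer, batch_size)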
13. Concept Network Reinforcement Learning
● Solving complex tasks by decomposing them into high-level actions or "concepts".
● A "multi-level hierarchical RL" approach, inspired by Sutton's options framework:
○ enables efficient exploration through abstraction over low-level actions,
○ improving sample efficiency significantly,
○ especially in "sparse reward" problems.
● Allows existing solutions to sub-problems to be composed into an overall solution
without requiring re-training.
14. Temporal Abstractions
● At each time t, for each state s_t, a higher-level "selector" chooses a concept c_t among all
concepts available to the selector.
● Each concept remains active for some time, until a predefined terminal state is reached.
● An internal critic evaluates how close the agent is to satisfying the terminal condition of c_t,
and sends a reward r_c(t) to the selector.
● Similar to baseline RL, except that an extra layer of abstraction is defined on the set of
"primitive" actions, forming a concept, so that from the selector's point of view executing a
concept plays the role of taking a single (temporally extended) action.
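That selector/concept loop might be sketched as follows (purely illustrative; the selector and the concepts dictionary of pre-trained sub-policies, each with its own terminal condition and internal critic, are hypothetical interfaces):

def run_episode(env, selector, concepts, max_steps=1000):
    # The selector picks a concept c_t; the concept then issues primitive actions
    # until its own terminal condition is met, and the selector is rewarded for it.
    state, t, done = env.reset(), 0, False
    while t < max_steps and not done:
        c = selector.choose(state)                    # high-level choice of concept c_t
        concept = concepts[c]
        concept_reward = 0.0
        while t < max_steps and not done and not concept.is_terminal(state):
            action = concept.act(state)               # low-level primitive action
            state, reward, done, _ = env.step(action)
            concept_reward += concept.critic(state)   # internal critic produces r_c(t)
            t += 1
        selector.update(c, concept_reward, state)     # selector learns from concept-level reward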
19. Robotics Pick and Place with Concepts
Deep Reinforcement Learning for Dexterous Manipulation with Concept Networks
https://arxiv.org/abs/1709.06977
22. Definitions
State
The agent’s internal representation in the environment.
Information the agent uses to pick the next action.
Policy
Dictates agent’s behavior, and maps from state to action:
Deterministic policy: a = π(s)
Stochastic policy: π(a|s) = P(A_t = a | S_t = s)
Value function
Determines how good each state (and action) is:
V_π(s) = E_π[ R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... | S_t = s ]
Q_π(s, a)
Model
Predicts what the environment will do next (simulator’s job for instance)
25. Learning vs Planning
Learning (Model-Free Reinforcement Learning):
The environment is initially unknown
The agent interacts with the environment without knowing its dynamics
The agent improves its policy based on previous interactions
Planning (Model-based Reinforcement Learning):
A model of the environment is known or acquired
The agent performs computations with the model, without any external interaction
The agent improves its policy based on those computations with the model
27. Introduction to RL: Challenges
Playing Atari with Deep Reinforcement Learning, Mnih et al., DeepMind
28. Policy-Based Methods
● The Q-function can be complex and unnecessary: all we want is the best action!
● Example: in a very high-dimensional state space, it is wasteful and costly to learn the exact value
of every (state, action) pair.
● Define parameterized policies: π_θ(a|s), with parameters θ (e.g. the weights of a neural network).
● For each policy, define its value: J(θ) = E[ Σ_t γ^t r_t | π_θ ].
● Run gradient ascent on the policy parameters θ to find the optimal policy!
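The simplest instantiation of this idea is the REINFORCE update; a minimal PyTorch-style sketch (policy_net and the pre-computed discounted returns are assumptions, not the specific algorithm from the slides):

import torch

def reinforce_update(policy_net, optimizer, states, actions, returns):
    # Monte-Carlo policy gradient: raise the log-probability of actions
    # in proportion to the discounted return that followed them.
    log_probs = torch.log_softmax(policy_net(states), dim=-1)   # log pi_theta(a|s)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = -(chosen * returns).mean()                           # minimizing this ascends J(theta)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()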