Deep Reinforcement Learning (DRL) has made strong progress in many tasks, such as board games, robotics, navigation, neural architecture search, etc. I will present our recently open-sourced DRL frameworks that facilitate game research and development. Our framework is scalable: we reproduced AlphaGoZero and AlphaZero using 2000 GPUs, achieving a super-human Go AI that beat four top-30 professional players. We also demonstrate the usability of our platform by training agents in real-time strategy games, where they show interesting behaviors with a small amount of resources.
2. AI works in a lot of situations
Medical
Translation
Personalization
Surveillance
Object Recognition
Smart Design
Speech Recognition
Board Games
3. What AI still needs to improve
Very little supervised data
Complicated environments
Lots of corner cases
Home Robotics
Autonomous Driving
ChatBot
StarCraft
Question Answering
Common Sense
Exponential space to explore
4. What AI still needs to improve
[Figure: the hype curve of performance vs. efforts. Initial enthusiasm (“It really works! All in AI!”) gives way to despair (“No way, it doesn’t work”, “How can that be…”), then to trying all possible hacks (“Man, we need more data”), and finally to a scary trend of slowing down just below human level.]
5. What AI still needs to improve
[Same hype-curve figure as slide 4.]
We need novel algorithms
9. AlphaGo Zero Strength
• 3-day version
• 4.9M games, 1600 rollouts/move
• 20-block ResNet
• Defeats AlphaGo Lee.
• 40-day version
• 29M games, 1600 rollouts/move
• 40-block ResNet
• Defeats AlphaGo Master by 89:11
10. ELF: Extensive, Lightweight and Flexible Framework for Game Research
Larry Zitnick, Qucheng Gong, Wenling Shang, Yuxin Wu, Yuandong Tian
https://github.com/facebookresearch/ELF
[Y. Tian et al, ELF: An Extensive, Lightweight and Flexible Research Platform for Real-time Strategy Games, NIPS 2017]
11. ELF: A simple for-loop
[Diagram: the standard agent-environment loop. The environment (C++ side) sends the state and reward to the agent (Python side), which returns an action.]
14. ELF OpenGo
• The system can be trained with 2000 GPUs in 2 weeks.
• Decent performance against professional players and strong bots.
• Abundant ablation analysis.
• Decoupled design; code highly reusable for other games.
We open-source the code and the pre-trained model for the Go and ML community:
https://github.com/pytorch/ELF
Simple tutorials in the experimental branches (tutorial, tutorial_distri).
15. ELF OpenGo Performance
Vs top professional players (single GPU, 80k rollouts, 50 seconds per move; players were offered unlimited thinking time): 20-0 overall.

Name            ELO (world rank)   Result
Kim Ji-seok     3590 (#3)          5-0
Shin Jin-seo    3570 (#5)          5-0
Park Yeonghun   3481 (#23)         5-0
Choi Cheolhan   3466 (#30)         5-0

Vs strong bot (LeelaZero [158603eb, 192x15, Apr. 25, 2018]): 980 wins, 18 losses (98.2%).
Vs professional players: single GPU, 2k rollouts, 27-0 against Taiwanese pros.
16. Planning is how new knowledge is created
[Figure, left: a timeline from game start to game end. Early in training, moves are random except near the end of the game, where the reward signal is; over iterations, meaningful moves propagate backwards toward the opening. The bot is already dan level even if the opening doesn’t make much sense.]
[Figure, right: win rate against a bot without planning grows from 50% toward 100% as the number of planning rollouts increases. Training is almost always constrained by model capacity (why 40 blocks > 20 blocks).]
17. Planning is how new knowledge is created
One-step looking forward; T-step looking forward; tree search; Monte Carlo sampling.
Learning a neural network that directly predicts the optimal value/policy.
Temporal Difference (TD) in Reinforcement Learning:
V(s_t) ← V(s_t) + α [r_t + γ V(s_{t+1}) − V(s_t)]
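As a minimal concrete sketch of the TD(0) update above, here is a toy chain environment; the chain, learning rate, and discount are illustrative, not from the talk:

```python
import random

# Minimal TD(0) sketch on a toy 5-state chain; all parameters are illustrative.
N_STATES, ALPHA, GAMMA = 5, 0.1, 0.9
V = [0.0] * N_STATES                              # value estimates

for episode in range(1000):
    s = 0
    while s < N_STATES - 1:
        s_next = s + random.choice([0, 1])          # random walk to the right
        r = 1.0 if s_next == N_STATES - 1 else 0.0  # reward only at the goal
        # TD(0): V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))
        V[s] += ALPHA * (r + GAMMA * V[s_next] - V[s])
        s = s_next
```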
18. Planning is how game AI is created
[Figure: from the current game situation (a position from Lufei Ruan vs. Yifan Hou, 2010), extensive search/planning expands the game tree and evaluates the consequence of each branch (Black wins / White wins) at the leaves.]
Classic approaches: AlphaBeta Pruning + Iterative Deepening; Monte Carlo Tree Search.
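To make the classic approach concrete, here is a minimal alpha-beta pruning sketch over an explicit toy game tree; the tree and leaf values are made up for illustration:

```python
# Minimal alpha-beta pruning sketch over an explicit toy game tree.
# Leaves are numbers (positive = good for the maximizing player).
def alphabeta(node, depth, alpha, beta, maximizing):
    if depth == 0 or not isinstance(node, list):   # leaf: evaluate consequence
        return node
    if maximizing:
        value = float('-inf')
        for child in node:
            value = max(value, alphabeta(child, depth - 1, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:          # opponent will never allow this branch
                break
        return value
    else:
        value = float('inf')
        for child in node:
            value = min(value, alphabeta(child, depth - 1, alpha, beta, True))
            beta = min(beta, value)
            if alpha >= beta:
                break
        return value

# Tiny example tree: two moves for us, two replies each.
tree = [[3, 5], [-1, 7]]
print(alphabeta(tree, 2, float('-inf'), float('inf'), True))  # -> 3
```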
19. How to plan without a known model?
If you don’t have a ground-truth dynamics model …
• Only (human/expert) trajectories, no world model.
• Limited access to the world model:
• Cannot restart the model, cannot query arbitrary (s, a) pairs.
• Noisy signals from the world model.
• …
If you have a ground-truth dynamics model …
• Infinite access to the exact world model.
• May query any (s, a) pair.
• …
[Diagram: a dynamics model maps (current state, action to take) to the next state.]
Answer: build one.
21. Build a semantic model
Task: find “oven”.
[Diagram: a semantic graph over outdoor, living room, dining room, and kitchen, with objects (sofa, car, chair, oven) attached to rooms.]
An incomplete model of the environment.
[Y. Wu et al, Learning and Planning with a Semantic Model, submitted to ICLR 2019]
22. Build a semantic model
Bayesian inference: 𝑃(𝑧|𝑌), given the learning experience 𝑌.
[Diagram: the same semantic graph (outdoor, living room, dining room, kitchen; car, chair, sofa, oven), now annotated with inferred connection probabilities (0.7, 0.95, 0.8, 0.5, 0.6, 0.7). Next step: “kitchen”.]
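A minimal sketch of the kind of Bayesian update involved, for a single binary edge variable z (whether two rooms are connected) given noisy navigation outcomes Y; the likelihood values are made-up numbers:

```python
# Bayesian update of P(z) for one latent edge (e.g., living room <-> kitchen).
# z = 1 means the rooms are connected. Likelihoods below are illustrative.
def update(prior_z, observed_success,
           p_success_if_connected=0.9, p_success_if_not=0.05):
    like_1 = p_success_if_connected if observed_success else 1 - p_success_if_connected
    like_0 = p_success_if_not if observed_success else 1 - p_success_if_not
    return like_1 * prior_z / (like_1 * prior_z + like_0 * (1 - prior_z))

p = 0.5                                  # prior P(z)
for outcome in [True, True, False]:      # learning experience Y
    p = update(p, outcome)
    print(f"P(z|Y) = {p:.2f}")
```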
23. LEAPS
LEArning and Planning with a Semantic model
[Diagram: a graph over living room, kitchen, dining room, chair, sofa, with latent edge variables 𝑃(𝑧kitchen,living room), 𝑃(𝑧sofa,living room), 𝑃(𝑧chair,living room), 𝑃(𝑧dining,living room). After the learning experience 𝑌, some edges become certain, e.g. 𝑃(𝑧sofa,living room|𝑌) = 1; the agent then plans the trajectory, does more exploration, and repeats.]
24. House3D
SUNCG dataset, 45K scenes, all objects are fully labeled.
https://github.com/facebookresearch/House3D
[Figure: example observations: RGB image, depth, segmentation mask, top-down map.]
27. Case Study
• Go to “outdoor”
[Diagram: prior 𝑃(𝑧) over a graph of Birth, Outdoor, Living room, Garage, with edge probabilities 0.12, 0.38, 0.76, 0.73, 0.28.]
Sub-goal: Outdoor
28. Case Study
• Go to “outdoor”
Sub-goal: Outdoor. Failed!
[Diagram: posterior 𝑃(𝑧|𝐸); the edge toward Outdoor drops from 0.76 to 0.01 (remaining edges: 0.12, 0.38, 0.73, 0.28).]
29. Case Study
• Go to “outdoor”
Sub-goal: Garage. Failed!
[Diagram: posterior 𝑃(𝑧|𝐸); the edge toward Garage drops from 0.38 to 0.08 (remaining edges: 0.12, 0.01, 0.73, 0.28).]
30. Case Study
• Go to “outdoor”
Sub-goal: Living Room. Success
[Diagram: posterior 𝑃(𝑧|𝐸); the edge toward Living room rises from 0.73 to 0.99 (remaining edges: 0.12, 0.08, 0.01, 0.28).]
31. Case Study
• Go to “outdoor”
Sub-goal: Outdoor (from the Living room). Success
[Diagram: posterior 𝑃(𝑧|𝐸); the Living room to Outdoor edge rises from 0.12 to 0.99 (edges now 0.99, 0.99, 0.08, 0.01, 0.28).]
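Putting the case study together, here is a hedged sketch of the outer plan-act-update loop; the graph, probabilities, update rules, and try_subgoal below are hypothetical stand-ins for the learned components:

```python
import random

# Sketch of a LEAPS-style replanning loop as in the case study.
# Edge probabilities and try_subgoal() are hypothetical stand-ins.
P = {('Birth', 'Outdoor'): 0.76, ('Birth', 'Garage'): 0.38,
     ('Birth', 'Living room'): 0.73, ('Living room', 'Outdoor'): 0.12}

def try_subgoal(src, dst):               # stand-in for running a sub-policy
    return random.random() < 0.5

state, target = 'Birth', 'Outdoor'
for attempt in range(10):
    # Pick the most promising next room reachable from the current state.
    candidates = [(p, dst) for (src, dst), p in P.items() if src == state]
    p, subgoal = max(candidates)
    if try_subgoal(state, subgoal):
        P[(state, subgoal)] = 0.99       # success: edge almost surely exists
        state = subgoal
        if state == target:
            print('Success')
            break
    else:
        P[(state, subgoal)] *= 0.1       # failure: sharply lower the posterior
```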
33. Iterative LQR
What if the dynamics are nonlinear? Linearize around the current trajectory, solve the resulting LQR problem to get a new policy, then sample according to the new policy and repeat.
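Here is a heavily simplified iterative-LQR sketch for a scalar system: roll out, linearize the dynamics around the current trajectory, run a backward Riccati pass, and apply the new time-varying feedback policy. A real iLQR adds feedforward terms and a line search; the dynamics, costs, and horizon below are made up:

```python
import numpy as np

# Heavily simplified iLQR sketch: regulate a scalar nonlinear system to 0.
dt, T = 0.1, 20
def f(x, u):                              # nonlinear dynamics (illustrative)
    return x + dt * (np.sin(x) + u)

Q, R = 1.0, 0.1                           # quadratic state / control costs
x0, us = 2.0, np.zeros(T)                 # initial state, initial controls

for it in range(20):
    # Roll out the current control sequence.
    xs = [x0]
    for u in us:
        xs.append(f(xs[-1], u))
    # Linearize dynamics around the trajectory (finite differences).
    A = [(f(x + 1e-5, u) - f(x, u)) / 1e-5 for x, u in zip(xs[:-1], us)]
    B = [(f(x, u + 1e-5) - f(x, u)) / 1e-5 for x, u in zip(xs[:-1], us)]
    # Backward Riccati pass on the linearized system.
    P, Ks = Q, [0.0] * T
    for t in reversed(range(T)):
        Ks[t] = B[t] * P * A[t] / (R + B[t] * P * B[t])
        P = Q + A[t] * P * (A[t] - B[t] * Ks[t])
    # Forward pass: sample according to the new time-varying policy u = -K x.
    x, new_us = x0, []
    for t in range(T):
        u = -Ks[t] * x
        new_us.append(u)
        x = f(x, u)
    us = np.array(new_us)

print("final state:", x)
```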
34. Non-differentiable Plans
• Directly predicting combinatorial solutions.
Convex hull via a seq2seq model [O. Vinyals et al, Pointer Networks, NIPS 2015].
Scheduling a job to the i-th slot via policy gradient [H. Mao et al, Resource Management with Deep Reinforcement Learning, ACM Workshop on Hot Topics in Networks, 2016].
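A minimal policy-gradient (REINFORCE) sketch in PyTorch for a scheduling-style decision of which slot gets the job; the toy reward and network sizes are illustrative, not from the cited papers:

```python
import torch
import torch.nn as nn

# Minimal REINFORCE sketch: pick one of 4 slots for a job; the slot whose
# index matches a feature of the job gives reward 1. All sizes are toy.
policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 4))
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

for step in range(500):
    target = torch.randint(4, (1,)).item()
    job = torch.eye(4)[target].unsqueeze(0)        # one-hot job feature
    dist = torch.distributions.Categorical(logits=policy(job))
    slot = dist.sample()                           # non-differentiable decision
    reward = 1.0 if slot.item() == target else 0.0
    # Policy gradient: -log pi(slot) * reward
    loss = -(dist.log_prob(slot) * reward).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
```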
35. Neural network rewriter
Given the current solution 𝒔_𝒕:
sample 𝒈_𝒕 ∼ 𝑺𝑷(𝒈_𝒕), 𝒈_𝒕 ⊂ 𝒔_𝒕 (the Score Predictor picks a region of 𝒔_𝒕);
sample 𝒂_𝒕 ∼ 𝑹𝑺(𝒈_𝒕) (the Rule Selector picks a rewriting rule);
𝒔_𝒕+𝟏 = 𝒇(𝒔_𝒕, 𝒈_𝒕, 𝒂_𝒕).
Components: Input Encoder, Score Predictor, Rule Selector.
[X. Chen and Y. Tian, Learning to Progressively Plan, submitted to ICLR 2019]
Is it simpler to improve the solution progressively?
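Here is a toy sketch of the progressive-rewriting loop s_{t+1} = f(s_t, g_t, a_t), with hand-coded stand-ins for the learned Score Predictor (SP) and Rule Selector (RS); the “solution” is a job order and the only rule is an adjacent swap:

```python
# Toy progressive rewriter: state s_t is a job order; region g_t is an index;
# rule a_t is "swap with the next job". SP and RS are hand-coded stand-ins
# for the learned Score Predictor and Rule Selector.
def SP(s):                         # score each region: how out-of-order is it?
    return [max(0, s[i] - s[i + 1]) for i in range(len(s) - 1)]

def RS(g):                         # rule selector: here only one rule, "swap"
    return 'swap'

def f(s, g, a):                    # apply rewriting rule a at region g
    s = list(s)
    if a == 'swap':
        s[g], s[g + 1] = s[g + 1], s[g]
    return s

s = [5, 2, 4, 1, 3]
for t in range(20):
    scores = SP(s)
    if max(scores) == 0:           # nothing left to improve
        break
    g = scores.index(max(scores))  # pick g_t from SP (greedy here)
    s = f(s, g, RS(g))
print(s)                           # progressively improved toward sorted order
```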
RL is one of the methods that might come to the rescue. The basic idea of RL is very simple: an agent perceives the state from the environment, takes an action, and receives a reward; the environment receives the action, changes its internal state, and the loop repeats.
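In code, the textbook loop looks roughly like this (a schematic with a stub environment; the interface is illustrative):

```python
import random

# Schematic agent-environment loop with a stub environment.
class ToyEnv:
    def reset(self):
        self.pos = 0
        return self.pos                      # initial state
    def step(self, action):
        self.pos += action                   # environment changes its state
        reward = 1.0 if self.pos >= 5 else 0.0
        done = self.pos >= 5
        return self.pos, reward, done        # next state, reward, done

env = ToyEnv()
state, done = env.reset(), False
while not done:
    action = random.choice([0, 1])           # agent picks an action
    state, reward, done = env.step(action)   # environment responds
```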
With virtual environments, we can potentially get an infinite amount of data to train our models.
Since then, deep reinforcement learning has made substantial progress in many kinds of games, including Atari games, Go, Dota 2, etc.
Then the question is, what is the next step?
In this talk, I am going to cover a few recent works that mainly explore the power of planning in the reinforcement learning setting.
I will start with one example of why planning is important.
We all know that last year DeepMind published a Nature paper on AlphaGoZero, which learns a super-human Go bot without any human knowledge.
The idea is very simple: starting from a randomly initialized model, we use a planning algorithm like MCTS to find the best move at each step, and save the self-play games into a replay buffer. Then we update the model based on the replay buffer, and repeat.
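Schematically, the loop looks like this; every helper below is a stub standing in for MCTS, the Go engine, and the training step, not the real implementation:

```python
import random

# Schematic AlphaGoZero-style loop; every helper is a stub, not the real thing.
replay_buffer = []

def mcts_move(model, state):       # stub: planning (MCTS) picks the best move
    return random.choice(legal_moves(state))

def legal_moves(state): return [0, 1]               # stub game
def apply(state, move): return state + [move]       # stub game
def finished(state): return len(state) >= 10        # stub game
def outcome(state): return random.choice([-1, 1])   # stub result
def update(model, batch): return model              # stub gradient step

model = None                        # stands in for a randomly initialized net
for iteration in range(10):
    # 1. Self-play with planning (MCTS) guided by the current model.
    state = []
    while not finished(state):
        state = apply(state, mcts_move(model, state))
    replay_buffer.append((state, outcome(state)))
    # 2. Update the model from the replay buffer, then repeat.
    batch = random.sample(replay_buffer, min(8, len(replay_buffer)))
    model = update(model, batch)
```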
This approach is surprisingly simple, yet it gives very strong performance: after 3 days with thousands of TPUs, the model can already beat AlphaGo Lee, and after 40 days it defeats AlphaGo Master.
Inspired by these interesting and exciting results, we tried reproducing AlphaGoZero with our recently published ELF platform. The goal is to understand why such a simple approach yields such strong performance.
ELF is an Extensive, Lightweight and Flexible framework that makes building a practical RL system easy and manageable.
For this, ELF puts all the implementation and design details on the C++ side, leaving the Python side a simple for-loop, as advertised in the textbook.
Moreover, each iteration the Python side receives a batch for the neural network model to operate on, improving efficiency.
The key idea of ELF is to achieve dynamic batching from multiple game instances. In this platform, many games run simultaneously. From time to time, each has requests to call a deep learning API (like PyTorch) to compute the next states and actions, or to store the game history into replay buffers. ELF provides a dynamic batching interface so that requests can be automatically batched for high throughput.
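The core idea can be sketched in Python with a queue: many game threads post requests, a worker collects them into a batch, runs the model once, and scatters the results back. This is a simplification; ELF implements it on the C++ side:

```python
import threading, queue

# Simplified dynamic batching: many "games" post requests; one worker
# batches them for a single model call. The model here is a stub.
requests = queue.Queue()

def game_thread(gid):
    for step in range(5):
        reply = queue.Queue(maxsize=1)
        requests.put((gid, f"state-{gid}-{step}", reply))
        action = reply.get()                  # wait for the batched answer

def model(batch_of_states):                   # stub for a PyTorch forward pass
    return [f"action-for-{s}" for s in batch_of_states]

def batch_worker(batch_size=4):
    while True:
        batch = [requests.get()]              # block until one request arrives
        while len(batch) < batch_size and not requests.empty():
            batch.append(requests.get())      # grab whatever else is waiting
        actions = model([s for _, s, _ in batch])
        for (_, _, reply), a in zip(batch, actions):
            reply.put(a)                      # scatter results back to games

threading.Thread(target=batch_worker, daemon=True).start()
games = [threading.Thread(target=game_thread, args=(g,)) for g in range(8)]
for t in games: t.start()
for t in games: t.join()
```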
We improved our framework to support the distributed setting, putting AGZ and AZ together.
We then released ELF OpenGo, which reproduces AZ and AGZ.
One question is: OK, what can we learn from it?
One interesting observation from this experiment is that knowledge is propagated backwards. You start to see meaningful moves at the end of the game, where the reward signal is. Over iterations, the meaningful moves are backpropagated all the way to the opening of the game.
The right figure shows that even with well-trained models, additional planning still yields much higher strength. In fact, the strength of the bot is always constrained by the capacity of the model.
We are working on an arXiv paper to discuss the training details.
Not only does training AGZ/AZ require planning; for general reinforcement learning, planning is also very important during training. This not only includes one-step look-ahead, but also T-step look-ahead as well as more complicated planning mechanisms like tree search. A large portion of DRL is about pushing the results given by planning into neural networks.
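For instance, pushing MCTS results into the network often means fitting the policy head to the search's visit-count distribution, as in this hedged sketch (the tiny network, feature size, and batch are illustrative):

```python
import torch
import torch.nn.functional as F

# Sketch: distill planning results into the network. `visit_counts` would
# come from MCTS; the tiny policy net and board encoding are illustrative.
policy_net = torch.nn.Linear(64, 362)        # 361 moves + pass; toy sizes
opt = torch.optim.SGD(policy_net.parameters(), lr=0.01)

features = torch.randn(16, 64)               # batch of encoded positions
visit_counts = torch.rand(16, 362)           # MCTS visit counts per move
target = visit_counts / visit_counts.sum(dim=1, keepdim=True)

log_probs = F.log_softmax(policy_net(features), dim=1)
loss = -(target * log_probs).sum(dim=1).mean()   # cross-entropy to MCTS policy
opt.zero_grad()
loss.backward()
opt.step()
```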
Planning is also very important for general game AI; chess and other games are important examples.