Deep Reinforcement Learning (DRL) has made strong progress in many tasks, such as board games, robotics, navigation, neural architecture search, etc. I will present our recently open-sourced DRL frameworks that facilitate game research and development. Our framework is scalable: we reproduced AlphaGoZero and AlphaZero using 2000 GPUs, achieving a super-human Go AI that beat four top-30 professional players. We also demonstrate the usability of our platform by training agents in real-time strategy games, where they show interesting behaviors with a small amount of resources.
2. AI works in a lot of situations
Medical
Translation
Personalization
Surveillance
Object Recognition
Smart Design
Speech Recognition
Board Games
3. What AI still needs to improve
Very little supervised data
Complicated environments
Lots of corner cases
Home Robotics
Autonomous Driving
ChatBot
StarCraft
Question Answering
Common Sense
Exponential space to explore
4. What AI still needs to improve
[Figure: the hype curve of performance vs. efforts. Initial enthusiasm (“It really works! All in AI!”) gives way to despair (“No way, it doesn’t work”, “How can that be…”), then to trying all possible hacks (“Man, we need more data”), and finally to a scary trend of slowing down just below human level.]
5. What AI still needs to improve
[Same hype-curve figure as slide 4.]
We need novel algorithms
9. AlphaGo Zero Strength
• 3-day version
• 4.9M games, 1600 rollouts/move
• 20-block ResNet
• Defeats AlphaGo Lee.
• 40-day version
• 29M games, 1600 rollouts/move
• 40-block ResNet
• Defeats AlphaGo Master by 89:11
10. ELF: Extensive, Lightweight and Flexible Framework for Game Research
Larry Zitnick, Qucheng Gong, Wenling Shang, Yuxin Wu, Yuandong Tian
https://github.com/facebookresearch/ELF
[Y. Tian et al, ELF: An Extensive, Lightweight and Flexible Research Platform for Real-time Strategy Games, NIPS 2017]
11. ELF: A simple for-loop
[Diagram: the standard agent-environment loop. The environment (C++ side) sends the state and reward to the agent (Python side), which returns an action.]
14. ELF OpenGo
• The system can be trained with 2000 GPUs in 2 weeks.
• Decent performance against professional players and strong bots.
• Abundant ablation analysis.
• Decoupled design; code highly reusable for other games.
We open-source the code and the pre-trained model for the Go and ML community:
https://github.com/pytorch/ELF
Simple tutorials in the experimental branches (tutorial, tutorial_distri).
15. ELF OpenGo Performance
Vs top professional players (single GPU, 80k rollouts, 50 seconds per move; players were offered unlimited thinking time): 20-0 overall.

Name            ELO (world rank)   Result
Kim Ji-seok     3590 (#3)          5-0
Shin Jin-seo    3570 (#5)          5-0
Park Yeonghun   3481 (#23)         5-0
Choi Cheolhan   3466 (#30)         5-0

Vs strong bot (LeelaZero [158603eb, 192x15, Apr. 25, 2018]): 980 wins, 18 losses (98.2%).
Vs professional players: single GPU, 2k rollouts, 27-0 against Taiwanese pros.
16. Planning is how new knowledge is created
[Figure, left: a timeline from game start to game end. Early in training, moves are random except near the end of the game, where the reward signal is; over iterations, meaningful moves propagate backwards toward the opening. The bot is already dan level even if the opening doesn’t make much sense.]
[Figure, right: win rate against a bot without planning grows from 50% toward 100% as the number of planning rollouts increases. Training is almost always constrained by model capacity (why 40 blocks > 20 blocks).]
17. Planning is how new knowledge is created
One-step looking forward; T-step looking forward; tree search; Monte Carlo sampling.
Learning a neural network that directly predicts the optimal value/policy.
Temporal Difference (TD) in Reinforcement Learning:
V(s_t) ← V(s_t) + α [r_t + γ V(s_{t+1}) − V(s_t)]
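As a minimal concrete sketch of the TD(0) update above, here is a toy chain environment; the chain, learning rate, and discount are illustrative, not from the talk:

```python
import random

# Minimal TD(0) sketch on a toy 5-state chain; all parameters are illustrative.
N_STATES, ALPHA, GAMMA = 5, 0.1, 0.9
V = [0.0] * N_STATES                              # value estimates

for episode in range(1000):
    s = 0
    while s < N_STATES - 1:
        s_next = s + random.choice([0, 1])          # random walk to the right
        r = 1.0 if s_next == N_STATES - 1 else 0.0  # reward only at the goal
        # TD(0): V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))
        V[s] += ALPHA * (r + GAMMA * V[s_next] - V[s])
        s = s_next
```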
18. Planning is how game AI is created
[Figure: from the current game situation (a position from Lufei Ruan vs. Yifan Hou, 2010), extensive search/planning expands the game tree and evaluates the consequence of each branch (Black wins / White wins) at the leaves.]
Classic approaches: AlphaBeta Pruning + Iterative Deepening; Monte Carlo Tree Search.
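To make the classic approach concrete, here is a minimal alpha-beta pruning sketch over an explicit toy game tree; the tree and leaf values are made up for illustration:

```python
# Minimal alpha-beta pruning sketch over an explicit toy game tree.
# Leaves are numbers (positive = good for the maximizing player).
def alphabeta(node, depth, alpha, beta, maximizing):
    if depth == 0 or not isinstance(node, list):   # leaf: evaluate consequence
        return node
    if maximizing:
        value = float('-inf')
        for child in node:
            value = max(value, alphabeta(child, depth - 1, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:          # opponent will never allow this branch
                break
        return value
    else:
        value = float('inf')
        for child in node:
            value = min(value, alphabeta(child, depth - 1, alpha, beta, True))
            beta = min(beta, value)
            if alpha >= beta:
                break
        return value

# Tiny example tree: two moves for us, two replies each.
tree = [[3, 5], [-1, 7]]
print(alphabeta(tree, 2, float('-inf'), float('inf'), True))  # -> 3
```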
19. How to plan without a known model?
If you don’t have a ground-truth dynamics model …
• Only (human/expert) trajectories, no world model.
• Limited access to the world model:
• Cannot restart the model, cannot query arbitrary (s, a) pairs.
• Noisy signals from the world model.
• …
If you have a ground-truth dynamics model …
• Infinite access to the exact world model.
• May query any (s, a) pair.
• …
[Diagram: a dynamics model maps (current state, action to take) to the next state.]
Answer: build one.
21. Build a semantic model
Task: find “oven”.
[Diagram: a semantic graph over outdoor, living room, dining room, and kitchen, with objects (sofa, car, chair, oven) attached to rooms.]
An incomplete model of the environment.
[Y. Wu et al, Learning and Planning with a Semantic Model, submitted to ICLR 2019]
22. Build a semantic model
Bayesian inference: 𝑃(𝑧|𝑌), given the learning experience 𝑌.
[Diagram: the same semantic graph (outdoor, living room, dining room, kitchen; car, chair, sofa, oven), now annotated with inferred connection probabilities (0.7, 0.95, 0.8, 0.5, 0.6, 0.7). Next step: “kitchen”.]
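A minimal sketch of the kind of Bayesian update involved, for a single binary edge variable z (whether two rooms are connected) given noisy navigation outcomes Y; the likelihood values are made-up numbers:

```python
# Bayesian update of P(z) for one latent edge (e.g., living room <-> kitchen).
# z = 1 means the rooms are connected. Likelihoods below are illustrative.
def update(prior_z, observed_success,
           p_success_if_connected=0.9, p_success_if_not=0.05):
    like_1 = p_success_if_connected if observed_success else 1 - p_success_if_connected
    like_0 = p_success_if_not if observed_success else 1 - p_success_if_not
    return like_1 * prior_z / (like_1 * prior_z + like_0 * (1 - prior_z))

p = 0.5                                  # prior P(z)
for outcome in [True, True, False]:      # learning experience Y
    p = update(p, outcome)
    print(f"P(z|Y) = {p:.2f}")
```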
23. LEAPS
LEArning and Planning with a Semantic model
[Diagram: a graph over living room, kitchen, dining room, chair, sofa, with latent edge variables 𝑃(𝑧kitchen,living room), 𝑃(𝑧sofa,living room), 𝑃(𝑧chair,living room), 𝑃(𝑧dining,living room). After the learning experience 𝑌, some edges become certain, e.g. 𝑃(𝑧sofa,living room|𝑌) = 1; the agent then plans the trajectory, does more exploration, and repeats.]
24. House3D
SUNCG dataset, 45K scenes, all objects are fully labeled.
https://github.com/facebookresearch/House3D
[Figure: example observations: RGB image, depth, segmentation mask, top-down map.]
27. Case Study
• Go to “outdoor”
[Diagram: prior 𝑃(𝑧) over a graph of Birth, Outdoor, Living room, Garage, with edge probabilities 0.12, 0.38, 0.76, 0.73, 0.28.]
Sub-goal: Outdoor
28. Case Study
• Go to “outdoor”
Sub-goal: Outdoor. Failed!
[Diagram: posterior 𝑃(𝑧|𝐸); the edge toward Outdoor drops from 0.76 to 0.01 (remaining edges: 0.12, 0.38, 0.73, 0.28).]
29. Case Study
• Go to “outdoor”
Sub-goal: Garage. Failed!
[Diagram: posterior 𝑃(𝑧|𝐸); the edge toward Garage drops from 0.38 to 0.08 (remaining edges: 0.12, 0.01, 0.73, 0.28).]
30. Case Study
• Go to “outdoor”
Sub-goal: Living Room. Success
[Diagram: posterior 𝑃(𝑧|𝐸); the edge toward Living room rises from 0.73 to 0.99 (remaining edges: 0.12, 0.08, 0.01, 0.28).]
31. Case Study
• Go to “outdoor”
Sub-goal: Outdoor (from the Living room). Success
[Diagram: posterior 𝑃(𝑧|𝐸); the Living room to Outdoor edge rises from 0.12 to 0.99 (edges now 0.99, 0.99, 0.08, 0.01, 0.28).]
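Putting the case study together, here is a hedged sketch of the outer plan-act-update loop; the graph, probabilities, update rules, and try_subgoal below are hypothetical stand-ins for the learned components:

```python
import random

# Sketch of a LEAPS-style replanning loop as in the case study.
# Edge probabilities and try_subgoal() are hypothetical stand-ins.
P = {('Birth', 'Outdoor'): 0.76, ('Birth', 'Garage'): 0.38,
     ('Birth', 'Living room'): 0.73, ('Living room', 'Outdoor'): 0.12}

def try_subgoal(src, dst):               # stand-in for running a sub-policy
    return random.random() < 0.5

state, target = 'Birth', 'Outdoor'
for attempt in range(10):
    # Pick the most promising next room reachable from the current state.
    candidates = [(p, dst) for (src, dst), p in P.items() if src == state]
    p, subgoal = max(candidates)
    if try_subgoal(state, subgoal):
        P[(state, subgoal)] = 0.99       # success: edge almost surely exists
        state = subgoal
        if state == target:
            print('Success')
            break
    else:
        P[(state, subgoal)] *= 0.1       # failure: sharply lower the posterior
```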
33. Iterative LQR
What if the dynamics are nonlinear? Linearize around the current trajectory, solve the resulting LQR problem to get a new policy, then sample according to the new policy and repeat.
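Here is a heavily simplified iterative-LQR sketch for a scalar system: roll out, linearize the dynamics around the current trajectory, run a backward Riccati pass, and apply the new time-varying feedback policy. A real iLQR adds feedforward terms and a line search; the dynamics, costs, and horizon below are made up:

```python
import numpy as np

# Heavily simplified iLQR sketch: regulate a scalar nonlinear system to 0.
dt, T = 0.1, 20
def f(x, u):                              # nonlinear dynamics (illustrative)
    return x + dt * (np.sin(x) + u)

Q, R = 1.0, 0.1                           # quadratic state / control costs
x0, us = 2.0, np.zeros(T)                 # initial state, initial controls

for it in range(20):
    # Roll out the current control sequence.
    xs = [x0]
    for u in us:
        xs.append(f(xs[-1], u))
    # Linearize dynamics around the trajectory (finite differences).
    A = [(f(x + 1e-5, u) - f(x, u)) / 1e-5 for x, u in zip(xs[:-1], us)]
    B = [(f(x, u + 1e-5) - f(x, u)) / 1e-5 for x, u in zip(xs[:-1], us)]
    # Backward Riccati pass on the linearized system.
    P, Ks = Q, [0.0] * T
    for t in reversed(range(T)):
        Ks[t] = B[t] * P * A[t] / (R + B[t] * P * B[t])
        P = Q + A[t] * P * (A[t] - B[t] * Ks[t])
    # Forward pass: sample according to the new time-varying policy u = -K x.
    x, new_us = x0, []
    for t in range(T):
        u = -Ks[t] * x
        new_us.append(u)
        x = f(x, u)
    us = np.array(new_us)

print("final state:", x)
```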
34. Non-differentiable Plans
• Directly predicting combinatorial solutions.
Convex hull via a seq2seq model [O. Vinyals et al, Pointer Networks, NIPS 2015].
Scheduling a job to the i-th slot via policy gradient [H. Mao et al, Resource Management with Deep Reinforcement Learning, ACM Workshop on Hot Topics in Networks, 2016].
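A minimal policy-gradient (REINFORCE) sketch in PyTorch for a scheduling-style decision of which slot gets the job; the toy reward and network sizes are illustrative, not from the cited papers:

```python
import torch
import torch.nn as nn

# Minimal REINFORCE sketch: pick one of 4 slots for a job; the slot whose
# index matches a feature of the job gives reward 1. All sizes are toy.
policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 4))
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

for step in range(500):
    target = torch.randint(4, (1,)).item()
    job = torch.eye(4)[target].unsqueeze(0)        # one-hot job feature
    dist = torch.distributions.Categorical(logits=policy(job))
    slot = dist.sample()                           # non-differentiable decision
    reward = 1.0 if slot.item() == target else 0.0
    # Policy gradient: -log pi(slot) * reward
    loss = -(dist.log_prob(slot) * reward).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
```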
35. Neural network rewriter
Given the current solution 𝒔_𝒕:
sample 𝒈_𝒕 ∼ 𝑺𝑷(𝒈_𝒕), 𝒈_𝒕 ⊂ 𝒔_𝒕 (the Score Predictor picks a region of 𝒔_𝒕);
sample 𝒂_𝒕 ∼ 𝑹𝑺(𝒈_𝒕) (the Rule Selector picks a rewriting rule);
𝒔_𝒕+𝟏 = 𝒇(𝒔_𝒕, 𝒈_𝒕, 𝒂_𝒕).
Components: Input Encoder, Score Predictor, Rule Selector.
[X. Chen and Y. Tian, Learning to Progressively Plan, submitted to ICLR 2019]
Is it simpler to improve the solution progressively?
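Here is a toy sketch of the progressive-rewriting loop s_{t+1} = f(s_t, g_t, a_t), with hand-coded stand-ins for the learned Score Predictor (SP) and Rule Selector (RS); the “solution” is a job order and the only rule is an adjacent swap:

```python
# Toy progressive rewriter: state s_t is a job order; region g_t is an index;
# rule a_t is "swap with the next job". SP and RS are hand-coded stand-ins
# for the learned Score Predictor and Rule Selector.
def SP(s):                         # score each region: how out-of-order is it?
    return [max(0, s[i] - s[i + 1]) for i in range(len(s) - 1)]

def RS(g):                         # rule selector: here only one rule, "swap"
    return 'swap'

def f(s, g, a):                    # apply rewriting rule a at region g
    s = list(s)
    if a == 'swap':
        s[g], s[g + 1] = s[g + 1], s[g]
    return s

s = [5, 2, 4, 1, 3]
for t in range(20):
    scores = SP(s)
    if max(scores) == 0:           # nothing left to improve
        break
    g = scores.index(max(scores))  # pick g_t from SP (greedy here)
    s = f(s, g, RS(g))
print(s)                           # progressively improved toward sorted order
```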
RL is one of the methods that might come to the rescue. The basic idea of RL is very simple: an agent perceives the state from the environment, takes an action, and receives a reward; the environment receives the action, changes its internal state, and the loop repeats.
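In code, the textbook loop looks roughly like this (a schematic with a stub environment; the interface is illustrative):

```python
import random

# Schematic agent-environment loop with a stub environment.
class ToyEnv:
    def reset(self):
        self.pos = 0
        return self.pos                      # initial state
    def step(self, action):
        self.pos += action                   # environment changes its state
        reward = 1.0 if self.pos >= 5 else 0.0
        done = self.pos >= 5
        return self.pos, reward, done        # next state, reward, done

env = ToyEnv()
state, done = env.reset(), False
while not done:
    action = random.choice([0, 1])           # agent picks an action
    state, reward, done = env.step(action)   # environment responds
```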
With virtual environments, we can potentially get an infinite amount of data to train our models.
Since then, deep reinforcement learning has made substantial progress in many kinds of games, including Atari games, Go, Dota 2, etc.
Then the question is, what is the next step?
In this talk, I am going to cover a few recent works that mainly explore the power of planning in the reinforcement learning setting.
I will start with one example of why planning is important.
We all know that last year DeepMind published a Nature paper on AlphaGoZero, which learns a super-human Go bot without any human knowledge.
The idea is very simple: starting from a randomly initialized model, we use a planning algorithm like MCTS to find the best move at each step, and save the self-play games into a replay buffer. Then we update the model based on the replay buffer, and repeat.
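Schematically, the loop looks like this; every helper below is a stub standing in for MCTS, the Go engine, and the training step, not the real implementation:

```python
import random

# Schematic AlphaGoZero-style loop; every helper is a stub, not the real thing.
replay_buffer = []

def mcts_move(model, state):       # stub: planning (MCTS) picks the best move
    return random.choice(legal_moves(state))

def legal_moves(state): return [0, 1]               # stub game
def apply(state, move): return state + [move]       # stub game
def finished(state): return len(state) >= 10        # stub game
def outcome(state): return random.choice([-1, 1])   # stub result
def update(model, batch): return model              # stub gradient step

model = None                        # stands in for a randomly initialized net
for iteration in range(10):
    # 1. Self-play with planning (MCTS) guided by the current model.
    state = []
    while not finished(state):
        state = apply(state, mcts_move(model, state))
    replay_buffer.append((state, outcome(state)))
    # 2. Update the model from the replay buffer, then repeat.
    batch = random.sample(replay_buffer, min(8, len(replay_buffer)))
    model = update(model, batch)
```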
This approach is surprisingly simple, yet it gives very strong performance: after 3 days with thousands of TPUs, the model can already beat AlphaGo Lee, and after 40 days it defeats AlphaGo Master.
Inspired by these interesting and exciting results, we tried reproducing AlphaGoZero with our recently published ELF platform. The goal is to understand why such a simple approach yields such strong performance.
ELF is an Extensive, Lightweight and Flexible framework that makes building a practical RL system easy and manageable.
For this, ELF puts all the implementation and design details on the C++ side, leaving the Python side a simple for-loop, as advertised in the textbook.
Moreover, each iteration the Python side receives a batch for the neural network model to operate on, improving efficiency.
The key idea of ELF is to achieve dynamic batching from multiple game instances. In this platform, many games run simultaneously. From time to time, each has requests to call a deep learning API (like PyTorch) to compute the next states and actions, or to store the game history into replay buffers. ELF provides a dynamic batching interface so that requests can be automatically batched for high throughput.
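The core idea can be sketched in Python with a queue: many game threads post requests, a worker collects them into a batch, runs the model once, and scatters the results back. This is a simplification; ELF implements it on the C++ side:

```python
import threading, queue

# Simplified dynamic batching: many "games" post requests; one worker
# batches them for a single model call. The model here is a stub.
requests = queue.Queue()

def game_thread(gid):
    for step in range(5):
        reply = queue.Queue(maxsize=1)
        requests.put((gid, f"state-{gid}-{step}", reply))
        action = reply.get()                  # wait for the batched answer

def model(batch_of_states):                   # stub for a PyTorch forward pass
    return [f"action-for-{s}" for s in batch_of_states]

def batch_worker(batch_size=4):
    while True:
        batch = [requests.get()]              # block until one request arrives
        while len(batch) < batch_size and not requests.empty():
            batch.append(requests.get())      # grab whatever else is waiting
        actions = model([s for _, s, _ in batch])
        for (_, _, reply), a in zip(batch, actions):
            reply.put(a)                      # scatter results back to games

threading.Thread(target=batch_worker, daemon=True).start()
games = [threading.Thread(target=game_thread, args=(g,)) for g in range(8)]
for t in games: t.start()
for t in games: t.join()
```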
We improved our framework to support the distributed setting, putting AGZ and AZ together.
We then released ELF OpenGo, which reproduces AZ and AGZ.
One question is: OK, what can we learn from it?
One interesting observation from this experiment is that knowledge is propagated backwards. You start to see meaningful moves at the end of the game, where the reward signal is. Over iterations, the meaningful moves are backpropagated all the way to the opening of the game.
The right figure shows that even with well-trained models, additional planning still yields much higher strength. In fact, the strength of the bot is always constrained by the capacity of the model.
We are working on an arXiv paper to discuss the training details.
Not only does training AGZ/AZ require planning; for general reinforcement learning, planning is also very important during training. This not only includes one-step look-ahead, but also T-step look-ahead as well as more complicated planning mechanisms like tree search. A large portion of DRL is about pushing the results given by planning into neural networks.
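For instance, pushing MCTS results into the network often means fitting the policy head to the search's visit-count distribution, as in this hedged sketch (the tiny network, feature size, and batch are illustrative):

```python
import torch
import torch.nn.functional as F

# Sketch: distill planning results into the network. `visit_counts` would
# come from MCTS; the tiny policy net and board encoding are illustrative.
policy_net = torch.nn.Linear(64, 362)        # 361 moves + pass; toy sizes
opt = torch.optim.SGD(policy_net.parameters(), lr=0.01)

features = torch.randn(16, 64)               # batch of encoded positions
visit_counts = torch.rand(16, 362)           # MCTS visit counts per move
target = visit_counts / visit_counts.sum(dim=1, keepdim=True)

log_probs = F.log_softmax(policy_net(features), dim=1)
loss = -(target * log_probs).sum(dim=1).mean()   # cross-entropy to MCTS policy
opt.zero_grad()
loss.backward()
opt.step()
```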
Planning is also very important for general game AI; chess and other games are important examples.