The document discusses how AlphaGo, a computer program developed by DeepMind, was able to defeat world champion Lee Sedol at the game of Go. It achieved this through a combination of deep learning and tree search techniques. Four deep neural networks were used: three convolutional networks to reduce the action space and search depth through imitation learning, self-play reinforcement learning, and value prediction; and a smaller network for faster simulations. This combination of deep learning and search allowed AlphaGo to master the complex game of Go, demonstrating the capabilities of modern AI.
2. ABOUT MYSELF
M.Sc. in Computer Science, HUJI
Research interests: Deep Learning in Computer
Vision, NLP, Reinforcement Learning.
Also DL theory and other ML topics.
Works at a DL start-up (Imubit)
Contact: mangate@gmail.com
3. CREDITS
A lot of slides were taken from the following publicly
available slideshows:
https://www.slideshare.net/ShaneSeungwhanMoon/how-alphago-works
https://www.slideshare.net/ckmarkohchang/alphago-in-depth
https://www.slideshare.net/KarelHa1/alphago-mastering-the-game-of-go-with-deep-neural-networks-and-tree-search
Original AlphaGo article:
Silver, David, et al. "Mastering the game of Go with
deep neural networks and tree search." Nature 529.7587
(2016): 484-489.
Available here:
http://web.iitd.ac.in/~sumeet/Silver16.pdf
4. DEEP LEARNING IS CHANGING OUR LIVES
Search Engine (also for images and audio)
Spam filters
Recommender systems (Netflix, Youtube)
Self-Driving Cars
Cyber security (and physical security, via computer
vision)
Machine translation.
Speech to text, audio recognition.
Image recognition, smart shopping
And more and more and more…
5. AI VERSUS HUMAN
In 1997, a supercomputer called Deep Blue (IBM) defeated Garry
Kasparov.
This was the first defeat of a reigning world chess champion
by a computer under tournament conditions.
6. AI VERSUS HUMAN
In 2011 Watson, another supercomputer by IBM, "crushed"
the 2 best players in Jeopardy!, a popular question-answering
TV show.
7. GO
An ancient Chinese Game
(2,500 years old!)
Despite its relatively simple
rules, Go is very complex,
even more so than chess.
Winning at Go requires a
great deal of intuition, and mastering it was
therefore considered unachievable by
computers for at least another 30
years.
8. AI VERSUS HUMAN
In 2016 AlphaGo, a computer program by
DeepMind (part of Google), played a 5-game Go
match against Lee Sedol.
Lee Sedol:
professional 9-Dan (highest ranking in Go) considered
among top 3 players in the world.
2nd in international titles.
Won 97 out of 100 games against European Go
champion Fan Hui.
9. AI VERSUS HUMAN
“I’m confident that I can win, at least this time” – Lee Sedol
AlphaGo won 4-1.
“I kind of felt powerless… misjudged the capabilities of
AlphaGo” – Lee Sedol
How is it possible? Deep Learning.
10. AI IN GAME PLAYING
Almost every game can be "simulated" with a tree search.
A move is chosen if it has the best chance of ending in a victory.
11. AI IN GAMES
More formally: an optimal value function V*(s)
determines the outcome of the game:
From every board position (state=s)
Under perfect play by all players.
This is done by going over the tree of
possible move sequences, where:
b is the game's breadth (the number of legal moves in each
position)
d is the game's depth (the game length in moves)
so exhaustive search visits roughly b^d positions.
Tic-Tac-Toe: b ≈ 4, d ≈ 4
Chess: b ≈ 35, d ≈ 80
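The "optimal value under perfect play" idea can be made concrete with a plain exhaustive search. Below is a minimal sketch on Tic-Tac-Toe, which is small enough to search completely; the board encoding and helper names here are my own, not from the slides:

```python
# Minimal sketch of exhaustive tree search for the optimal value V*(s).
# Board: a tuple of 9 cells, each 'X', 'O' or None; 'X' moves first.

WINS = [(0,1,2), (3,4,5), (6,7,8), (0,3,6), (1,4,7), (2,5,8), (0,4,8), (2,4,6)]

def winner(board):
    for a, b, c in WINS:
        if board[a] is not None and board[a] == board[b] == board[c]:
            return board[a]
    return None

def optimal_value(board, player):
    """V*(s): +1 if 'X' wins under perfect play, -1 if 'O' wins, 0 for a draw."""
    w = winner(board)
    if w is not None:
        return 1 if w == 'X' else -1
    moves = [i for i, cell in enumerate(board) if cell is None]
    if not moves:
        return 0  # board full with no winner: draw
    nxt = 'O' if player == 'X' else 'X'
    values = [optimal_value(board[:m] + (player,) + board[m+1:], nxt)
              for m in moves]
    # 'X' maximises the value, 'O' minimises it (perfect play by both)
    return max(values) if player == 'X' else min(values)

print(optimal_value((None,) * 9, 'X'))  # 0: Tic-Tac-Toe is a draw under perfect play
```

This full enumeration is exactly what becomes infeasible for chess and Go, which motivates the reductions in the following slides.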
12. TREE SEARCH IN GO
However, in Go: b ≈ 250, d ≈ 150, so b^d ≈ 250^150, far beyond a googol (10^100).
This is more than the number of atoms in the entire universe!
Go is more complex than chess!
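These sizes are easy to check directly with Python's arbitrary-precision integers:

```python
# Rough game-tree sizes b**d versus a googol (10**100).
chess = 35 ** 80
go = 250 ** 150
googol = 10 ** 100

print(len(str(chess)))  # 124: the chess tree is roughly a 124-digit number
print(len(str(go)))     # 360: the Go tree is roughly a 360-digit number
print(go > googol)      # True: Go's tree dwarfs a googol
```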
13. KEY: REDUCE THE SEARCH SPACE
Reducing b (the possible action space)
14. KEY: REDUCE THE SEARCH SPACE
Reducing d – Position evaluation ahead of time
Instead of simulating all the way to the end, evaluate the position directly.
Both reductions are done with Deep Learning.
15. SOME CONCEPTS
Supervised Learning (classification)
Given data, predict a class (i.e., choose one option
out of a known set of options)
19. REDUCING ACTION CANDIDATES
Done by learning to "imitate" expert moves.
Data: online Go experts, 160K games, ~30M positions.
This is supervised classification (for a given position, predict the
expert's move out of all possible ones)
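Move imitation as classification can be sketched with a toy linear softmax model trained by cross-entropy. This is only an illustration of the learning setup: the random boards and "expert" moves below are stand-ins for the real data, and the actual network was a deep CNN over 19x19 board planes, not a linear model:

```python
import numpy as np

# Toy sketch of move prediction as supervised classification: a linear
# softmax policy trained with cross-entropy on (board, expert_move) pairs.
# Random stand-in data replaces the real 160K expert games.
rng = np.random.default_rng(0)
N_POS, FEATS, MOVES = 512, 81, 81            # toy 9x9 board, one move per cell

X = rng.normal(size=(N_POS, FEATS))          # flattened board features
y = rng.integers(0, MOVES, size=N_POS)       # "expert" move indices

W = np.zeros((FEATS, MOVES))
losses = []
for _ in range(200):                          # plain gradient descent
    logits = X @ W
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)         # softmax over all moves
    losses.append(-np.log(p[np.arange(N_POS), y]).mean())
    grad = p.copy()
    grad[np.arange(N_POS), y] -= 1            # d(cross-entropy)/d(logits)
    W -= 0.1 * (X.T @ grad) / N_POS

print(f"loss {losses[0]:.3f} -> {losses[-1]:.3f}")  # cross-entropy drops
```

The initial loss is exactly log(81), the entropy of a uniform guess over moves, and training drives it down; the real policy network does the same thing at vastly larger scale.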
20. REDUCING ACTION CANDIDATES
This deep CNN achieved 55% test accuracy on predicting
expert moves.
Imitators without Deep Learning reached only 22% accuracy.
A small improvement in accuracy led to a big improvement in
playing ability.
21. ROLLOUT NETWORK
Train an additional, smaller network
(p_π) for imitation.
This network achieves only 24.2%
accuracy,
but runs about 1,000 times faster (2 μs
compared to 3 ms).
This network is used for rollouts
(explained later).
22. IMPROVING THE NETWORK
Improve the imitator network through self-play
(Reinforcement Learning).
An entire game is played and the parameters are
updated according to the result.
23. IMPROVING THE NETWORK
Keep generating better models by self-playing newer models
against older ones.
The final network also won 85% of games against the best Go software
(the model without self-play won only 11%).
However, this model was eventually not used during the
games. It was used to generate the value function.
24. REDUCING SEARCH DEPTH - DATASET
Self-play with the imitator model for some steps (0
to 450).
Make some random move. This is the starting
position ‘s’.
Self play until the end with the RL network (latest
model).
If black won, z = 1; otherwise z = 0.
Save (s,z) to the dataset.
Generated 30M (s,z) pairs from 30M games.
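The steps above can be sketched as one data-generation loop. Everything game-specific below is a stand-in: random numbers replace the SL (imitator) and RL policy networks and the Go rules, since only the shape of the pipeline, one (s, z) pair per game, is the point here:

```python
import random

# Sketch of the (s, z) dataset pipeline. Random stand-ins replace the real
# SL (imitator) policy, RL policy, and Go engine.
random.seed(0)

def sl_policy_selfplay(n_moves):
    """Stand-in for self-play with the imitator (SL) network."""
    return [random.random() for _ in range(n_moves)]   # fake move history

def rl_policy_playout(position):
    """Stand-in for finishing the game with the RL network; returns winner."""
    return "black" if random.random() < 0.5 else "white"

def generate_pair():
    sl_policy_selfplay(random.randrange(0, 451))   # 0-450 moves with SL policy
    position = [random.random() for _ in range(9)] # random move -> position s
    winner = rl_policy_playout(position)           # self-play to the end (RL)
    z = 1 if winner == "black" else 0              # label: 1 if black won
    return position, z

dataset = [generate_pair() for _ in range(1000)]   # the paper: 30M pairs, one per game
print(len(dataset))  # 1000
```

Taking only one position per game keeps the (s, z) pairs nearly independent, which is why 30M games yield 30M pairs rather than one pair per move.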
25. REDUCING SEARCH DEPTH –
VALUE FUNCTION
A regression task: for a given position s, output a number between
0 and 1.
Now, for each possible position we have an evaluation of
how "good" it is for the black player.
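The regression setup can be sketched with a toy logistic model mapping position features to a win probability in [0, 1]. The features and outcomes below are synthetic stand-ins; the real value network was a deep CNN trained on the 30M self-play pairs:

```python
import numpy as np

# Toy sketch of the value network as regression: map a position s to a
# number in [0, 1] (black's win probability) via a sigmoid output,
# trained on stand-in (s, z) pairs.
rng = np.random.default_rng(1)
N, FEATS = 1000, 32

S = rng.normal(size=(N, FEATS))                  # stand-in position features
true_w = rng.normal(size=FEATS)                  # hidden "true" evaluation
z = (S @ true_w + rng.normal(size=N) > 0).astype(float)  # outcomes in {0, 1}

w = np.zeros(FEATS)
for _ in range(300):                             # gradient descent on log loss
    v = 1.0 / (1.0 + np.exp(-(S @ w)))           # predicted win probability
    w -= 0.5 * S.T @ (v - z) / N

v = 1.0 / (1.0 + np.exp(-(S @ w)))
print(((v > 0.5) == (z == 1)).mean())            # well above 0.5: positions are rankable
```

The sigmoid output guarantees every evaluation lands in [0, 1], matching the "how good for black" interpretation above.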
27. PUTTING IT ALL TOGETHER - MCTS
During game time a method called Monte Carlo
Tree Search (MCTS) is applied.
This method has 4 steps:
Selection
Expansion
Evaluation
Backup (update)
For each move in the game this process is repeated
about 10K times.
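The four steps can be sketched as one MCTS loop on a toy game (take 1 or 2 stones; whoever takes the last stone wins). This is a generic sketch, not AlphaGo's implementation: a uniform prior stands in for the imitator network's P(s, a) and a random rollout stands in for the value network.

```python
import random

random.seed(0)

class Node:
    def __init__(self, stones, to_play):
        self.stones, self.to_play = stones, to_play
        self.children = {}   # action -> child Node
        self.N = {}          # visit counts N(s, a)
        self.Q = {}          # mean values Q(s, a), 0 at the start

    def actions(self):
        return [a for a in (1, 2) if a <= self.stones]

def rollout(stones, to_play, me):
    """Evaluation stand-in: play randomly to the end, 1 if `me` wins."""
    while stones > 0:
        stones -= random.choice([a for a in (1, 2) if a <= stones])
        to_play = 1 - to_play
    return 1 if (1 - to_play) == me else 0   # last mover took the last stone

def simulate(root, me):
    node, path = root, []
    while True:
        if node.stones == 0:                         # terminal: last mover won
            value = 1 if (1 - node.to_play) == me else 0
            break
        prior = 1.0 / len(node.actions())            # uniform stand-in for P(s, a)
        def score(a):
            q = node.Q.get(a, 0.0)
            if node.to_play != me:                   # opponent minimises our value
                q = 1.0 - q
            return q + prior / (1 + node.N.get(a, 0))       # Q + u
        a = max(node.actions(), key=score)           # 1) selection
        path.append((node, a))
        if a not in node.children:                   # 2) expansion (once per leaf)
            child = Node(node.stones - a, 1 - node.to_play)
            node.children[a] = child
            value = rollout(child.stones, child.to_play, me)  # 3) evaluation
            break
        node = node.children[a]
    for n, a in path:                                # 4) backup: update N and Q
        n.N[a] = n.N.get(a, 0) + 1
        n.Q[a] = n.Q.get(a, 0.0) + (value - n.Q.get(a, 0.0)) / n.N[a]

root = Node(stones=5, to_play=0)
for _ in range(3000):
    simulate(root, me=0)
print(max(root.N, key=root.N.get))  # 2: taking 2 leaves a losing pile of 3
```

Note the final move is chosen by visit count, not by Q, exactly as described in the "choosing an action" slide below.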
28. MCTS - SELECTION
At each step we have a starting
position (the board at this point).
An action is selected using a combination of the imitator
network's prior P and a value Q, which is set to 0 at the start:
a = argmax_a [ Q(s, a) + u(s, a) ],  where  u(s, a) ∝ P(s, a) / (1 + N(s, a))
Dividing by N(s, a), the number of times a state/action pair was
visited, encourages diversity in the search.
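A minimal reading of the selection rule, with made-up numbers for the priors, counts, and values:

```python
# Selection score for each action a from state s: Q(s, a) + u(s, a),
# where u(s, a) = c * P(s, a) / (1 + N(s, a)). All numbers are made up.
c = 1.0
P = {"a1": 0.6, "a2": 0.3, "a3": 0.1}    # imitator-network prior P(s, a)
N = {"a1": 20, "a2": 2, "a3": 0}         # visit counts N(s, a)
Q = {"a1": 0.55, "a2": 0.50, "a3": 0.0}  # running mean values Q(s, a)

score = {a: Q[a] + c * P[a] / (1 + N[a]) for a in P}
best = max(score, key=score.get)
print(best)  # a2: a1's bonus has shrunk after 20 visits, so a2 overtakes it
```

This shows the diversity effect: the heavily visited a1 loses its exploration bonus, so the search spreads to a2 even though a1 has the highest prior.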
29. MCTS - EXPANSION
When building the tree, a
position can be expanded once
(creating new leaves in the tree)
with the imitator network.
This provides the priors, and thus
u(P), for the next searches.
30. MCTS - EVALUATION
After simulating 3-4 steps
with the imitator network
we evaluate the board
position.
This is done in two ways:
The value network's prediction.
Using the smaller rollout
network to self-play to the end
(a rollout), saving the result
(1 for a black win, 0 for a white win).
Both evaluations are combined
to give this board position a
number between 0 and 1.
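The combination is a simple mixture; the paper uses a mixing parameter λ = 0.5, weighting the value-network prediction and the rollout outcome equally:

```python
# Leaf evaluation combining the two signals:
# V(s) = (1 - lam) * value_net(s) + lam * z_rollout, with lam = 0.5 in the paper.
def leaf_value(value_net_pred, rollout_result, lam=0.5):
    return (1 - lam) * value_net_pred + lam * rollout_result

print(leaf_value(0.7, 1))  # 0.85: rollout win pulls the estimate up
print(leaf_value(0.7, 0))  # 0.35: rollout loss pulls it down
```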
31. MCTS – BACKUP (UPDATE)
After the simulation we
update the tree.
Update Q (which was
0 in the beginning) with
the value computed from
the value network and the
rollouts.
Update N(s,a): Increase
by one for each
state/action pair visited.
32. CHOOSING AN ACTION
For each move during the game, MCTS is run about
10K times.
In the end, the action visited the most
times from the root position (the current board) is
taken.
Notes:
Since this process is long, the smaller
network had to be used for rollouts to keep it feasible
(otherwise each move would have taken the computer
several days to compute).
The imitator network was better than the RL network at
choosing the first actions, probably because
humans take more diverse actions.
33. ALPHA GO WEAKNESSES
In the 4th game, Lee Sedol steered the board into a
position that was not in AlphaGo's search tree,
causing the program to choose worse actions and
eventually lose the game.
Most assumptions made for AlphaGo are not
relevant in real-life RL problems. See:
https://medium.com/@karpathy/alphago-in-context-c47718cb95a5
34. RETIREMENT
In May 2017 AlphaGo defeated Ke Jie, the world's
top-ranked player, 3-0.
Google's DeepMind unit announced that this would be the
AI's last competitive match.
35. SUMMARY
To this day, AlphaGo is considered one of the greatest AI
achievements in recent history.
This achievement was made by combining Deep
Learning with standard methods (like MCTS) to "simplify"
the very complex game of Go.
4 deep neural networks were used:
3 almost identical Convolutional Neural Networks:
Imitating network for action space reduction.
RL network created through self-play, for generating the dataset
for the value network.
Value network for search depth reduction.
1 small network for rollouts.
Deep Learning keeps achieving amazing new goals
every day, and is one of the fastest growing fields in
both academia and industry.