1. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Principal Solutions Architect – AWS Deep Learning
Amazon Web Services
Game Playing RL Agent
2. Inspiration From Nature
https://newatlas.com/bae-smartskin/33458/ https://www.nasa.gov/ames/feature/go-go-green-wing-mighty-morphing-materials-in-aircraft-design
3. Hardware of Learning
http://www.biologyreference.com/Mo-Nu/Neuron.html
4. Hardware of Learning
http://www.biologyreference.com/Mo-Nu/Neuron.html
Inputs I1 and I2 (with weights w1, w2) and a bias B feed the output O:

f(w, x) = Φ(B + Σ_i (x_i · w_i))

Φ(x) = 1 if x ≥ 0.5, 0 if x < 0.5

Truth table for P ∧ Q:

P Q | P ∧ Q
T T |   T
T F |   F
F T |   F
F F |   F
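A minimal sketch of this threshold unit in Python; the weights w1 = w2 = 0.3 and bias B = 0 are assumed values (not from the slide) that happen to realize P ∧ Q:

```python
def phi(x):
    # Step activation from the slide: fires when the weighted sum reaches 0.5
    return 1 if x >= 0.5 else 0

def perceptron(inputs, weights, bias):
    # f(w, x) = phi(B + sum(x_i * w_i))
    return phi(bias + sum(x * w for x, w in zip(inputs, weights)))

# Assumed weights/bias (not from the slide) that realize P AND Q
w, b = [0.3, 0.3], 0.0
for p in (0, 1):
    for q in (0, 1):
        # Prints the AND truth table row by row
        print(p, q, perceptron([p, q], w, b))
```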
5. Process of Learning
6. Process of Learning
Agent–environment interaction loop (Sutton and Barto): at each step t the agent receives state S_t and reward R_t from the environment and takes action A_t; the environment responds with S_{t+1} and R_{t+1}.
7. Markov State
An information state (a.k.a. Markov state) contains all
useful information from the history.
A state S_t is Markov if and only if:

P[S_{t+1} | S_t] = P[S_{t+1} | S_1, …, S_t]
8. Expected Return
• Expected return G_t: the sum of the sequence of rewards, potentially discounted by a factor γ, where γ ∈ [0, 1]:

G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + … = Σ_{k=0}^{∞} γ^k R_{t+k+1}
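The discounted return can be transcribed directly (an illustrative sketch, not from the deck):

```python
def expected_return(rewards, gamma):
    # G_t = sum over k of gamma^k * R_{t+k+1}
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Example: rewards [1, 1, 1] with gamma = 0.9 -> 1 + 0.9 + 0.81 = 2.71
print(expected_return([1, 1, 1], 0.9))
```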
9. Bellman Expectation Equations
v_π(s) = E_π[ G_t | S_t = s ] = E_π[ R_{t+1} + γ v_π(S_{t+1}) | S_t = s ]

The value of s is the expected return when starting in state s and following policy π thereafter.

This is the Bellman Expectation Equation; it can also be expressed as an action-value function for policy π:

q_π(s, a) = E_π[ R_{t+1} + γ q_π(S_{t+1}, A_{t+1}) | S_t = s, A_t = a ]
          = R_s^a + γ Σ_{s′∈S} P_{ss′}^a v_π(s′)

This is the value of taking action a in state s under policy π.

Backup diagram: s → v_π(s); (s, a) → q_π(s, a); s′ → v_π(s′).
10. Bellman Equations - Example
One-step lookahead from a state s whose successors are valued 10, 5, and −3, reached with probabilities .5, .25, and .25:

v(s) = 10 × .5 + 5 × .25 + (−3) × .25 = 5.5

With a reward R = 5 received with probability .5, and otherwise a transition into states valued 2, 5, and 4.4 with probabilities .4, .5, and .1:

v(s) = 5 × .5 + .5 × [ .4 × 2 + .5 × 5 + .1 × 4.4 ] = 4.4
11. Optimal Policy
From state s, each action a yields a reward r and leads to a successor s′; the optimal value takes the max over actions:

v*(s) = max{ −1 + 10, +2 + 5, +3 − 3 } = 9

(the three actions carry rewards R = −1, R = 2, R = 3 and lead to successors valued 10, 5, and −3)

A policy π is better than π′ if v_π(s) ≥ v_π′(s) ∀ s ∈ S.

v*(s) ≡ max_π v_π(s) ∀ s ∈ S
12. Iterative Policy Evaluation
A 4×4 gridworld; the two corner states are terminal and the remaining states are numbered 1–14:

  T  1  2  3
  4  5  6  7
  8  9 10 11
 12 13 14  T

Every transition yields R_t = −1, and the random policy moves in each direction with equal probability:

π(→|·) = π(↑|·) = π(↓|·) = π(←|·) = .25
13. Iterative Policy Evaluation
k = 0: all state values are initialized to zero (the terminal states stay fixed at 0):

 0.00  0.00  0.00  0.00
 0.00  0.00  0.00  0.00
 0.00  0.00  0.00  0.00
 0.00  0.00  0.00  0.00

R_t = −1; π: random policy, π(→|·) = π(↑|·) = π(↓|·) = π(←|·) = .25
14. Iterative Policy Evaluation
k = 0 → k = 1. A full backup for state 1 (right → 2, up → wall so it stays in 1, down → 5, left → terminal):

v_{k+1}(1) = .25×(−1 + v_k(2)) + .25×(−1 + v_k(1)) + .25×(−1 + v_k(5)) + .25×(−1 + v_k(T))
           = −.25 − .25 − .25 − .25 = −1

Likewise for state 7 (right → wall, up → 3, down → 11, left → 6):

v_{k+1}(7) = .25×(−1 + v_k(7)) + .25×(−1 + v_k(3)) + .25×(−1 + v_k(11)) + .25×(−1 + v_k(6)) = −1

With v_0 = 0 everywhere, every non-terminal state becomes −1.00 at k = 1:

 0.00 -1.00 -1.00 -1.00
-1.00 -1.00 -1.00 -1.00
-1.00 -1.00 -1.00 -1.00
-1.00 -1.00 -1.00  0.00
15. Iterative Policy Evaluation
k = 1 → k = 2. For state 1 (three neighbors valued −1.00, the terminal to the left valued 0):

v_2(1) = .25×(−1 + (−1.00)) + .25×(−1 + (−1.00)) + .25×(−1 + (−1.00)) + .25×(−1 + 0)
       = .25×(−2 − 2 − 2 − 1) = −1.75

For state 7 (all four neighbors valued −1.00):

v_2(7) = .25×(−2 − 2 − 2 − 2) = −2.00

Grid at k = 2 (the k = 1 grid was −1.00 everywhere):

 0.00 -1.75 -2.00 -2.00
-1.75 -2.00 -2.00 -2.00
-2.00 -2.00 -2.00 -1.75
-2.00 -2.00 -1.75  0.00
16. Iterative Policy Evaluation
k = 2 → k = 3. For state 1:

v_3(1) = .25×(−1 + (−2.00)) + .25×(−1 + (−1.75)) + .25×(−1 + (−2.00)) + .25×(−1 + 0)
       = .25×(−3 − 2.75 − 3 − 1) = −2.4375 ≈ −2.43

For state 7:

v_3(7) = .25×(−3 − 3 − 2.75 − 3) = −2.9375 ≈ −2.93

Grid at k = 3:

 0.00 -2.43 -2.93 -3.00
-2.43 -2.93 -3.00 -2.93
-2.93 -3.00 -2.93 -2.43
-3.00 -2.93 -2.43  0.00
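The policy-evaluation sweeps on the 4×4 gridworld can be reproduced with a short script (a sketch assuming the standard layout with terminals at the two opposite corners, indexed 0 and 15):

```python
def policy_evaluation_step(v):
    # One synchronous sweep of iterative policy evaluation on the 4x4
    # gridworld: reward -1 per move, equiprobable random policy,
    # terminal states at linear indices 0 and 15.
    new_v = v[:]
    for s in range(16):
        if s in (0, 15):
            continue
        r, c = divmod(s, 4)
        total = 0.0
        for dr, dc in ((0, 1), (-1, 0), (1, 0), (0, -1)):  # right, up, down, left
            nr, nc = r + dr, c + dc
            if not (0 <= nr < 4 and 0 <= nc < 4):
                nr, nc = r, c          # bumping a wall leaves the state unchanged
            total += 0.25 * (-1 + v[nr * 4 + nc])
        new_v[s] = total
    return new_v

v = [0.0] * 16
for k in range(3):
    v = policy_evaluation_step(v)
# -2.4375 and -2.9375; the slide truncates these to -2.43 and -2.93
print(round(v[1], 4), round(v[7], 4))
```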
17. Policy Improvement and Control
Alternate evaluation and improvement until convergence:

Evaluation:  π → v_π
Improvement: π → greedy(v_π)

The alternation converges to the optimal policy and value function, π* and v*.
18. Policy Improvement and Control
19. GridWorld Demo
https://github.com/rlcode/reinforcement-learning
20. Limitation of Dynamic Programming
• Assumes full knowledge of the MDP
• DP uses full-width backups.
• The number of states can grow rapidly.
• Suitable only for medium-sized problems of up to a few million states.
21. Monte Carlo Learning
• Model-free learning
• Learns from episodes of experience
• All episodes must have a terminal state
22. Temporal Difference (TD) Learning
• Learns from episodes of experience.
• Model-free
• TD learns from incomplete episodes.
• TD updates an estimate towards another estimate (bootstrapping).
23. Exploration and Exploitation
• Exploitation is maximizing reward using known information about a system.
  • Going to school, applying to college, choosing a degree such as engineering or medicine that has a better comparative yield, graduating as quickly as possible by taking all the recommended degree courses, getting a job, putting money into retirement schemes, retiring comfortably in a middle-class house.
• Always following a system based on known information means missing out on the potential for better results.
  • Going to school, applying to college, choosing a degree such as engineering or medicine that has a better comparative yield, taking a course in Neural Networks out of curiosity, changing subject, graduating, starting an AI company, growing the company, becoming a billionaire, never retiring ☺
24. Q-Learning
25. Q-Learning
• The Q-learning update moves the Q value slightly in the direction of the best possible next Q value:

Q(s, a) ← Q(s, a) + α ( r + γ max_{a′} Q(s′, a′) − Q(s, a) )
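A minimal tabular sketch of this update rule; the states, actions, and step sizes are hypothetical toy values:

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

Q = defaultdict(float)                 # unseen state-action pairs start at 0
actions = ["left", "right"]
q_learning_update(Q, s=0, a="right", r=1.0, s_next=1, actions=actions)
print(Q[(0, "right")])  # 0.1 * (1.0 + 0.9*0 - 0) = 0.1
```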
26. Q-Learning Properties
• Model-free
• Change of task (reinforcement) requires re-training
• A special kind of Temporal Difference learning
• Convergence assured only for Markov states
• Tabular approach requires every observed state-action
pair to have an entry
27. Action Selection
• Greedy – always pick the action with the highest value
  • Break ties randomly
• ε-greedy – choose a random action with a low probability ε, otherwise act greedily
• Softmax – always choose randomly, weighted by the respective Q-values
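The selection rules above can be sketched as small helpers (illustrative, not from the deck):

```python
import math
import random

def epsilon_greedy(q_values, epsilon):
    # With probability epsilon pick a random action, otherwise the greedy one
    # (ties broken randomly).
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    best = max(q_values)
    return random.choice([i for i, q in enumerate(q_values) if q == best])

def softmax_action(q_values, temperature=1.0):
    # Always choose randomly, weighted by exp(Q / temperature).
    prefs = [math.exp(q / temperature) for q in q_values]
    total = sum(prefs)
    return random.choices(range(len(q_values)),
                          weights=[p / total for p in prefs])[0]
```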
28. Reinforcement Function
• Implicitly supplies the goal to the agent
• Designing the function is an art
• Mistakes result in agent learning wrong behavior
• When the agent must learn the behavior with the shortest duration, penalize every action a little for "wasting time".
29. Q-Learning Demos
Grid World Demo: https://github.com/rlcode/reinforcement-learning
Rocket Lander Demo: https://github.com/dbatalov/reinforcement-learning
30. Tabular Approach and its Limitation
31. DQN
32. Universal Function Approximation Theorem
• Let φ be a nonconstant, bounded, and monotonically increasing continuous function.
• Let I_m denote the m-dimensional unit hypercube [0, 1]^m.
• The space of continuous functions on I_m is denoted by C(I_m).

Then, given ε > 0 and any function f ∈ C(I_m), there exist
• an integer N,
• real constants v_i, b_i ∈ ℝ,
• real vectors w_i ∈ ℝ^m,
where i = 1, 2, …, N, such that we may define

F(x) = Σ_{i=1}^{N} v_i φ(w_i^T x + b_i)

as an approximate realization of the function f, where f is independent of φ; that is,

|F(x) − f(x)| < ε for all x in I_m
33. Deep Reinforcement Learning
• An Artificial Neural Network is a Universal Function Approximator.
• We can therefore use an ANN as the agent's approximator for choosing which action to take to maximize reward.

Check this link for a proof of the theorem:
https://en.wikipedia.org/wiki/Universal_approximation_theorem
David Silver
34. DQN Network
https://web.stanford.edu/class/psych209/Readings/MnihEtAlHassibis15NatureControlDeepRL.pdf
• The DQN agent achieves >75% of the human score in 29 out of 49 games.
• The DQN agent beats the human score (>100%) in 22 games.

score% = ( Agent score − Random play score ) / ( Human score − Random play score ) × 100
35. DQN for Breakout
https://github.com/apache/incubator-mxnet/tree/master/example/reinforcement-learning/dqn
36. DQN Algorithm
• Techniques that increase stability and improve convergence:
• ε-greedy exploration
  • Technique: choose the action given by the optimal policy with probability (1 − ε) and a random action with probability ε
  • Advantage: minimizes overfitting of the network
• Experience (s_t, a_t, r_t, s_{t+1}) replay
  • Technique: store the agent's experiences and use samples from them to update the Q-network
  • Advantage: removes correlations in the observation sequence
• Periodic update of Q towards the target
  • Technique: every C updates, clone the Q-network and use the clone (Q̂) for generating targets for the following C updates to the Q-network
  • Advantage: reduces correlations with the target
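The experience-replay idea above can be sketched as a minimal fixed-size buffer (an illustrative sketch, not the deck's implementation):

```python
import random
from collections import deque

class ReplayBuffer:
    # Fixed-size experience replay: once full, old transitions fall off
    # the far end of the deque.
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the correlation between consecutive frames
        return random.sample(list(self.buffer), batch_size)

buf = ReplayBuffer(capacity=1000)
for t in range(10):
    buf.add(t, 0, -1.0, t + 1, False)   # toy transitions
batch = buf.sample(4)
```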
37. DQN Algorithm
The flow diagram reduces to the following loop, with an online network Q(s, a; θ) and a cloned target network Q̂(s, a; θ⁻):

Initialize replay memory D (N = 1M) with random play
Initialize the DQN with random weights θ; clone it for the target network: θ⁻ = θ
For each episode m: select the start state and get s_1
  For each time step t:
    a_t = random action with probability ε, else argmax_a Q(s_t, a; θ)
    Observe reward r_t and move to s_{t+1}
    Add (s_t, a_t, r_t, s_{t+1}) to D
    Generate training data: U(D) = random sample of D
    For each (s_j, a_j, r_j, s_{j+1}) ∈ U(D):
      y_j = r_j                                  if the episode terminates at j + 1
      y_j = r_j + γ max_{a′} Q̂(s_{j+1}, a′; θ⁻)  otherwise
    Update the DQN using U(D) with targets y_j
    Every 10K steps, clone the DQN: θ⁻ = θ
38. Function Approximation
Bellman equation:

Q*(s, a) = E_{s′}[ r + γ max_{a′} Q*(s′, a′) | s, a ]

Iterative update:

Q_i(s, a) = E_{s′}[ r + γ max_{a′} Q_{i−1}(s′, a′) | s, a ],  with Q_i → Q* as i → ∞

Function approximation:

Q*(s, a) ≈ Q(s, a; θ)

Modified iterative update:

Q(s, a; θ_i) ≈ E_{s′}[ r + γ max_{a′} Q(s′, a′; θ_i⁻) | s, a ]

Loss function to minimize:

L_i(θ_i) = E_{s,a,r}[ ( E_{s′}[ y | s, a ] − Q(s, a; θ_i) )² ],  where y = r + γ max_{a′} Q(s′, a′; θ_i⁻)
39. Network Architecture
https://web.stanford.edu/class/psych209/Readings/MnihEtAlHassibis15NatureControlDeepRL.pdf
Input: a stack of four preprocessed 84 × 84 frames, φ(s); output: Q(s, a; θ), one value per action.
40. Deep Convolutional Network - Nature
from mxnet import gluon

num_action = 4  # number of valid actions, e.g. 4 for Breakout

DQN = gluon.nn.Sequential()
with DQN.name_scope():
    # first layer
    DQN.add(gluon.nn.Conv2D(channels=32, kernel_size=8, strides=4, padding=0))
    DQN.add(gluon.nn.BatchNorm(axis=1, momentum=0.1, center=True))
    DQN.add(gluon.nn.Activation('relu'))
    # second layer
    DQN.add(gluon.nn.Conv2D(channels=64, kernel_size=4, strides=2))
    DQN.add(gluon.nn.BatchNorm(axis=1, momentum=0.1, center=True))
    DQN.add(gluon.nn.Activation('relu'))
    # third layer
    DQN.add(gluon.nn.Conv2D(channels=64, kernel_size=3, strides=1))
    DQN.add(gluon.nn.BatchNorm(axis=1, momentum=0.1, center=True))
    DQN.add(gluon.nn.Activation('relu'))
    DQN.add(gluon.nn.Flatten())
    # fourth layer
    DQN.add(gluon.nn.Dense(512, activation='relu'))
    # fifth layer: Q-values are unbounded, so the output layer is linear
    DQN.add(gluon.nn.Dense(num_action))
41. Issues with DQN
• Q-Learning overestimates action values, because of the maximization over estimated values.
• Overestimation has been attributed to noise and to insufficiently flexible function approximation.
• DQN provides a flexible function approximator.
• The deterministic nature of Atari games eliminates noise.
• Yet DQN still significantly overestimates action values.
42. Double Q-Learning and DDQN
43. Double Q-Learning
• The max operator uses the same values both to select and to evaluate an action. This leads to over-optimism.
• Decoupling selection from evaluation can prevent this over-optimism. That is the idea behind Double Q-Learning.
• In Double Q-Learning, two value functions are learned by randomly assigning each experience to update one of the two, resulting in two sets of weights, θ and θ′.
• For each update, one set of weights determines the greedy policy and the other determines its value.
Y_t^Q ≡ R_{t+1} + γ max_a Q(S_{t+1}, a; θ_t)

Y_t^{DQN} ≡ R_{t+1} + γ max_a Q(S_{t+1}, a; θ_t⁻)
44. Untangling Evaluation and Selection
• For action selection we use θ.
• For evaluation we use θ′.

Rewriting the max as an explicit argmax exposes the two roles, which can then be decoupled:

Y_t^Q ≡ R_{t+1} + γ max_a Q(S_{t+1}, a; θ_t)
      → Y_t^Q ≡ R_{t+1} + γ Q( S_{t+1}, argmax_a Q(S_{t+1}, a; θ_t); θ_t )

Y_t^{DQN} ≡ R_{t+1} + γ max_a Q(S_{t+1}, a; θ_t⁻)
      → Y_t^{DoubleQ} ≡ R_{t+1} + γ Q( S_{t+1}, argmax_a Q(S_{t+1}, a; θ_t); θ′_t )
45. Over-optimism and Error Estimation – upper bound
• Thrun and Schwartz showed that when the errors in the action-value estimates are uniformly distributed in an interval [−ε, ε], the error due to over-optimism is bounded above by γ ε (m − 1)/(m + 1), where m is the number of actions.
46. Over-optimism and Error Estimation – lower bound
• Consider a state s in which all true action values are equal: Q*(s, a) = V*(s) for every action a.
• Let Q_t be arbitrary value estimates that are on the whole unbiased, so that Σ_a ( Q_t(s, a) − V*(s) ) = 0, but not all correct, so that (1/m) Σ_a ( Q_t(s, a) − V*(s) )² = C for some C > 0, where m ≥ 2 is the number of actions in s.
• Then max_a Q_t(s, a) ≥ V*(s) + √( C / (m − 1) ).
• This lower bound is tight. The corresponding lower bound on the absolute error of the Double Q-Learning estimate is zero.
47. Number of Actions and Bias
48. Bias in Q-Learning vs Double Q-Learning
49. DDQN
• DDQN uses DQN's online network to select the greedy action, and DQN's target network to estimate that action's value:

Y_t^{DoubleDQN} ≡ R_{t+1} + γ Q( S_{t+1}, argmax_a Q(S_{t+1}, a; θ_t); θ_t⁻ )
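The decoupled target can be computed directly; a small sketch with plain Python lists standing in for the two networks' outputs (the helper name and values are hypothetical):

```python
def ddqn_target(r, q_online_next, q_target_next, gamma=0.99, done=False):
    # Online network selects the greedy action; target network evaluates it:
    # Y = r + gamma * Q_target(s', argmax_a Q_online(s', a))
    if done:
        return r
    a_star = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return r + gamma * q_target_next[a_star]

# Online net prefers action 1; the target net scores that action as 2.0
print(ddqn_target(1.0, q_online_next=[0.5, 3.0], q_target_next=[4.0, 2.0]))
# -> 1.0 + 0.99 * 2.0 = 2.98
```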
50. Results
51. Bayesian DQN or BDQN
52. Focusing on Efficient Exploration
• The central claim is that mechanisms such as ε-greedy exploration are inefficient.
• Thompson Sampling allows for targeted exploration in high dimensions but is computationally too expensive.
• BDQN aims to implement Thompson Sampling at scale through function approximation.
• BDQN combines DQN with a Bayesian Linear Regression (BLR) model on the last layer.
53. Thompson Sampling
• Thompson Sampling maintains a prior distribution over environment models (reward and/or dynamics).
• The distribution is updated as observations are made.
• To choose an action, a sample is drawn from the posterior belief, and the action that maximizes the expected return under the sampled belief is selected.
• For more information, refer to "A Tutorial on Thompson Sampling" by Daniel Russo et al.: https://arxiv.org/abs/1707.02038
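Thompson Sampling is easiest to see on a Bernoulli bandit; a toy sketch with assumed success rates (the arm probabilities 0.2 and 0.8 are made-up values, not from the deck):

```python
import random

random.seed(0)  # makes the sketch reproducible

def thompson_step(successes, failures):
    # Beta(1+s, 1+f) posterior per arm; draw one sampled belief per arm
    # and act greedily with respect to the sampled beliefs.
    samples = [random.betavariate(s + 1, f + 1)
               for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda i: samples[i])

truth = [0.2, 0.8]             # assumed true success rates (toy values)
wins, losses = [0, 0], [0, 0]
for _ in range(500):
    arm = thompson_step(wins, losses)
    if random.random() < truth[arm]:
        wins[arm] += 1
    else:
        losses[arm] += 1
# As the posterior concentrates, pulls shift to the better arm.
```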
54. TS vs ε-Greedy
• ε-greedy focuses on the greedy action.
• TS explores actions with higher estimated returns with higher probability.
• A TS-based strategy improves the exploration/exploitation balance by trading off expected returns against their uncertainties, while the ε-greedy strategy ignores all of this information.
55. TS vs ε-Greedy
56. TS vs ε-Greedy
• TS finds the optimal Q-function faster.
• It randomizes over Q-functions with highly promising returns and high uncertainty.
• When the true Q-function is selected, its posterior probability increases.
• When other functions are selected, wrong values are estimated and their posterior probability is driven to zero.
• An ε-greedy agent keeps randomizing its actions with probability ε even after having found the true Q-function; it can therefore take exponentially many trials to reach the target.
57. BDQN Algorithm
58. Network Architecture
59. BDQN Performance
60. Closing Words
61. Value Alignment
62. References
• DQN: https://www.nature.com/articles/nature14236
• DDQN: https://arxiv.org/abs/1509.06461
• BDQN: https://arxiv.org/abs/1802.04412
• DQN MXNet code: https://github.com/apache/incubator-mxnet/tree/master/example/reinforcement-learning/dqn
• DQN MXNet/Gluon code: https://github.com/zackchase/mxnet-the-straight-dope/blob/master/chapter17_deep-reinforcement-learning/DQN.ipynb
• DDQN MXNet/Gluon code: https://github.com/zackchase/mxnet-the-straight-dope/blob/master/chapter17_deep-reinforcement-learning/DDQN.ipynb
• BDQN MXNet/Gluon code: https://github.com/kazizzad/BDQN-MxNet-Gluon