© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Principal Solutions Architect – AWS Deep Learning
Amazon Web Services
Game Playing RL Agent
Inspiration From Nature
https://newatlas.com/bae-smartskin/33458/ https://www.nasa.gov/ames/feature/go-go-green-wing-mighty-morphing-materials-in-aircraft-design
Hardware of Learning
http://www.biologyreference.com/Mo-Nu/Neuron.html
Diagram: a single perceptron with inputs I1 and I2, bias B, weights w1, w2, w3, and output O.
$f(I_i, w_i) = \Phi(B + \Sigma_i (w_i \cdot I_i))$
$\Phi(x) = 1$ if $x \ge 0.5$; $\Phi(x) = 0$ if $x < 0.5$
Truth table for $P \wedge Q$:
P Q P∧Q
1 1 1
1 0 0
0 1 0
0 0 0
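A minimal NumPy sketch of this perceptron computing P ∧ Q; the weight and bias values below are one choice that satisfies the step rule above, not values from the talk.

import numpy as np

def step(x):
    # Step activation from the slide: 1 if x >= 0.5, else 0
    return 1 if x >= 0.5 else 0

def perceptron(inputs, weights, bias):
    # f(I, w) = Phi(B + sum_i w_i * I_i)
    return step(bias + np.dot(weights, inputs))

w = np.array([0.3, 0.3])   # illustrative weights realizing AND
b = 0.0                    # illustrative bias
for p in (0, 1):
    for q in (0, 1):
        print(p, q, perceptron(np.array([p, q]), w, b))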
Process of Learning
Diagram (Sutton and Barto): the agent-environment interaction loop. At time t the agent receives state $S_t$ and reward $R_t$ from the environment, takes action $A_t$, and the environment responds with $S_{t+1}$ and $R_{t+1}$.
Markov State
An information state (a.k.a. Markov state) contains all
useful information from the history.
A state $S_t$ is Markov if and only if:
$\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \dots, S_t]$
Expected Return
• Expected return $G_t$: the sequence of rewards, potentially discounted by a factor $\gamma$ where $\gamma \in [0,1]$:
$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
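A small Python sketch of this sum, computing the discounted return of a finite reward sequence (the rewards and gamma below are made up for illustration):

def discounted_return(rewards, gamma=0.9):
    # G_t = sum_k gamma^k * R_{t+k+1}
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

print(discounted_return([1, 0, 0, 5]))  # 1 + 0.9**3 * 5 = 4.645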
Bellman Expectation Equations
$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi[R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s]$
The value of s is the expected return when starting in state s and following policy $\pi$ thereafter.
This is the Bellman Expectation Equation; it can also be expressed as an action-value function for policy $\pi$:
$q_\pi(s, a) = \mathbb{E}_\pi[R_{t+1} + \gamma\, q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a] = \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^{a}\, v_\pi(s')$
$q_\pi(s, a)$ is the value of taking action a in state s under policy $\pi$.
Backup diagram: $s \rightarrow v_\pi(s)$; $(s, a) \rightarrow q_\pi(s, a)$; $s' \rightarrow v_\pi(s')$.
Bellman Equations - Example
Diagram: two one-step backup examples. In the first, state s leads to rewards 10, 5, and -3 with probabilities .5, .25, and .25. In the second, one branch yields reward 5 with probability .5, and the other branch (probability .5) backs up successor outcomes 2, 5, and 4.4 with probabilities .4, .5, and .1.
$v(s) = 10 \times .5 + 5 \times .25 + (-3) \times .25 = 5.5$
$v(s) = 5 \times .5 + .5\,[\,.4 \times 2 + .5 \times 5 + .1 \times 4.4\,] = 4.4$
Optimal Policy
Diagram: from state s, three actions with immediate rewards R = -1, R = 2, and R = 3 lead to successor states with values 10, 5, and -3; the optimal value takes the max over actions.
$v_*(s) = \max\{-1 + 10,\; 2 + 5,\; 3 - 3\} = 9$
A policy $\pi$ is better than $\pi'$ if $v_\pi(s) \ge v_{\pi'}(s)\;\; \forall s \in S$.
$v_*(s) \equiv \max_\pi v_\pi(s)\;\; \forall s \in S$
Iterative Policy Evaluation
Diagram: a 4x4 gridworld with non-terminal states numbered 1 to 14 and two terminal states in the top-left and bottom-right corners.
$R_t = -1$ on all transitions
$\pi(\rightarrow \mid \cdot) = \pi(\uparrow \mid \cdot) = \pi(\downarrow \mid \cdot) = \pi(\leftarrow \mid \cdot) = .25$
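A compact Python sketch of iterative policy evaluation for this 4x4 gridworld under the equiprobable random policy (undiscounted, R = -1 per step); the grid indexing and helper names are my own illustration, not code from the talk:

import numpy as np

GRID = 4
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def is_terminal(r, c):
    return (r, c) in [(0, 0), (GRID - 1, GRID - 1)]

def step_to(r, c, dr, dc):
    # Moves that would leave the grid keep the agent in place
    nr, nc = r + dr, c + dc
    return (nr, nc) if 0 <= nr < GRID and 0 <= nc < GRID else (r, c)

def policy_evaluation(sweeps, gamma=1.0):
    V = np.zeros((GRID, GRID))
    for _ in range(sweeps):
        new_V = np.zeros_like(V)
        for r in range(GRID):
            for c in range(GRID):
                if is_terminal(r, c):
                    continue
                # v_{k+1}(s) = sum_a pi(a|s) * (R + gamma * v_k(s'))
                new_V[r, c] = sum(0.25 * (-1 + gamma * V[step_to(r, c, dr, dc)])
                                  for dr, dc in ACTIONS)
        V = new_V
    return V

print(np.round(policy_evaluation(3), 2))   # reproduces the k = 3 grid shown below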
Iterative Policy Evaluation
k = 0: all state values are initialized to 0.00; the terminal states stay at 0.
$R_t = -1$; $\pi$: random policy, $\pi(\rightarrow \mid \cdot) = \pi(\uparrow \mid \cdot) = \pi(\downarrow \mid \cdot) = \pi(\leftarrow \mid \cdot) = .25$
Iterative Policy Evaluation
k = 0 → k = 1:
$v_1(1) = .25(-1 + v_0(2)) + .25(-1 + v_0(1)) + .25(-1 + v_0(5)) + .25(-1 + v_0(0)) = -.25 - .25 - .25 - .25 = -1$
$v_1(7) = .25(-1 + v_0(7)) + .25(-1 + v_0(3)) + .25(-1 + v_0(11)) + .25(-1 + v_0(6)) = -1$
Grids: at k = 0 every value is 0.00; at k = 1 every non-terminal value is -1.00 and the terminal states remain 0.
$R_t = -1$; $\pi(\rightarrow \mid \cdot) = \pi(\uparrow \mid \cdot) = \pi(\downarrow \mid \cdot) = \pi(\leftarrow \mid \cdot) = .25$
Iterative Policy Evaluation
k = 1 → k = 2:
$v_2(1) = .25(-1 + v_1(2)) + .25(-1 + v_1(1)) + .25(-1 + v_1(5)) + .25(-1 + v_1(0)) = .25(-2 - 2 - 2 - 1) = -1.75$
$v_2(7) = .25(-1 + v_1(7)) + .25(-1 + v_1(3)) + .25(-1 + v_1(11)) + .25(-1 + v_1(6)) = .25(-2 - 2 - 2 - 2) = -2.00$
Grids: at k = 1 every non-terminal value is -1.00; at k = 2 the states next to a terminal state are -1.75 and all other non-terminal states are -2.00.
$R_t = -1$; $\pi(\rightarrow \mid \cdot) = \pi(\uparrow \mid \cdot) = \pi(\downarrow \mid \cdot) = \pi(\leftarrow \mid \cdot) = .25$
Iterative Policy Evaluation
k = 2 → k = 3:
$v_3(1) = .25(-1 + v_2(2)) + .25(-1 + v_2(1)) + .25(-1 + v_2(5)) + .25(-1 + v_2(0)) = .25(-3 - 2.75 - 3 - 1) = -2.43$
$v_3(7) = .25(-1 + v_2(7)) + .25(-1 + v_2(3)) + .25(-1 + v_2(11)) + .25(-1 + v_2(6)) = .25(-3 - 3 - 2.75 - 3) = -2.93$
Grids: at k = 3 the values are -2.43 next to a terminal state, -2.93 two steps away, and -3.00 three steps away; terminal states remain 0.
$R_t = -1$; $\pi(\rightarrow \mid \cdot) = \pi(\uparrow \mid \cdot) = \pi(\downarrow \mid \cdot) = \pi(\leftarrow \mid \cdot) = .25$
Policy Improvement and Control
Diagram: generalized policy iteration. Evaluation: $v \rightarrow v_\pi$. Improvement: $\pi \rightarrow \mathrm{greedy}(v)$. Alternating evaluation and improvement converges to $\pi_*$ and $v_*$ (a sketch of the improvement step follows below).
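A minimal sketch of one greedy policy improvement step on the same gridworld, reusing GRID, ACTIONS, is_terminal, step_to, and policy_evaluation from the earlier sketch (names are illustrative, not from the talk):

import numpy as np

def greedy_policy(V, gamma=1.0):
    # pi'(s) = argmax_a (R + gamma * v(s')), i.e. act greedily with respect to V
    policy = {}
    for r in range(GRID):
        for c in range(GRID):
            if is_terminal(r, c):
                continue
            returns = [-1 + gamma * V[step_to(r, c, dr, dc)] for dr, dc in ACTIONS]
            policy[(r, c)] = int(np.argmax(returns))   # index into ACTIONS
    return policy

V = policy_evaluation(100)    # evaluate the random policy to near convergence
print(greedy_policy(V))       # acting greedily on this V is already optimal here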
GridWorld Demo
https://github.com/rlcode/reinforcement-learning
Limitation of Dynamic Programming
• Assumes full knowledge of the MDP
• DP uses full-width backups.
• The number of states can grow rapidly.
• Suitable only for medium-sized problems of up to a few million states.
Monte Carlo Learning
• Model-free learning
• Learns from complete episodes of experience
• All episodes must have a terminal state
Temporal Difference (TD) Learning
• Learning from episodes of experience.
• Model-Free
• TD learns from incomplete episodes.
• Updates an estimate towards another estimate.
Diagram: TD(1) and TD(2) backups.
Exploration and Exploitation
• Exploitation is maximizing reward using known
information about a system.
• Going to school, applying to college, choosing a degree such as engineering or medicine that has a better comparative yield, graduating as quickly as possible by taking all the recommended degree courses, getting a job, putting money into retirement schemes, retiring comfortably in a middle-class house.
• Always following a system based on known information means missing out on potentially better outcomes.
• Going to school, applying to college, choosing a degree such as engineering or medicine that has a better comparative yield, taking a course in Neural Networks out of curiosity, changing subject, graduating, starting an AI company, growing the company, becoming a billionaire, never retiring :)
Q-Learning
• Q-learning updates the Q value slightly in the direction of the best possible next Q value (a tabular sketch follows below).
Diagram: backup from (s, a) through reward r to s', taking the max over next actions.
$Q(s, a) \leftarrow Q(s, a) + \alpha\,\bigl(r + \gamma \max_{a'} Q(s', a') - Q(s, a)\bigr)$
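A minimal tabular sketch of this update rule; the learning rate, discount and environment interface are illustrative assumptions:

from collections import defaultdict

Q = defaultdict(float)        # Q[(state, action)] defaults to 0.0
alpha, gamma = 0.1, 0.99

def q_update(s, a, r, s_next, actions):
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])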
Q-Learning Properties
• Model-free
• Change of task (reinforcement) requires re-training
• A special kind of Temporal Difference learning
• Convergence assured only for Markov states
• Tabular approach requires every observed state-action
pair to have an entry
Action Selection
• Greedy – always pick the action with the highest value
• Break ties randomly
• ε-greedy – choose a random action with low probability ε
• Softmax – always choose randomly, weighted by
respective Q-values
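A sketch of these three selection rules over a vector of Q-values (NumPy-based; the epsilon and temperature values are illustrative):

import numpy as np

def greedy(q_values, rng=np.random):
    best = np.flatnonzero(q_values == q_values.max())   # break ties randomly
    return int(rng.choice(best))

def epsilon_greedy(q_values, epsilon=0.1, rng=np.random):
    if rng.random() < epsilon:
        return int(rng.randint(len(q_values)))           # explore
    return greedy(q_values, rng)                         # exploit

def softmax(q_values, temperature=1.0, rng=np.random):
    prefs = np.exp((q_values - q_values.max()) / temperature)
    return int(rng.choice(len(q_values), p=prefs / prefs.sum()))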
Reinforcement Function
• Implicitly supplies the goal to the agent
• Designing the function is an art
• Mistakes result in the agent learning the wrong behavior
• When the desired behavior is the one with the shortest duration, penalize every action a little for "wasting time".
Q-Learning Demos
Grid World Demo: https://github.com/rlcode/reinforcement-learning
Rocket Lander Demo: https://github.com/dbatalov/reinforcement-learning
Tabular Approach and its Limitation
DQN
Universal Function Approximation Theorem
• Let $\varphi$ be a nonconstant, bounded, and monotonically increasing continuous function.
• Let $I_m$ denote the m-dimensional unit hypercube $[0,1]^m$.
• The space of continuous functions on $I_m$ is denoted by $C(I_m)$.
Then, given $\varepsilon > 0$ and any function $f \in C(I_m)$, there exist
• an integer $N$
• real constants $v_i, b_i \in \mathbb{R}$
• real vectors $w_i \in \mathbb{R}^m$
where $i = 1, 2, \dots, N$, such that we may define
$F(x) = \sum_{i=1}^{N} v_i\, \varphi(w_i^{T} x + b_i)$
as an approximate realization of the function $f$ (where $f$ is independent of $\varphi$); that is,
$|F(x) - f(x)| < \varepsilon$
for all $x \in I_m$.
Deep Reinforcement Learning
• An Artificial Neural Network is a
Universal Function
Approximator.
• We can use an ANN as an approximation for the agent to choose what action to take to maximize reward.
Check this link for proof of the theorem:
https://en.wikipedia.org/wiki/Universal_approximation_theorem
David Silver
DQN Network
https://web.stanford.edu/class/psych209/Readings/MnihEtAlHassibis15NatureControlDeepRL.pdf
• The DQN agent achieves >75% of the human score in 29 out of 49 games
• The DQN agent beats the human score (>100%) in 22 games
$\mathrm{Score}\% = \dfrac{\text{Agent score} - \text{Random play score}}{\text{Human score} - \text{Random play score}} \times 100$
DQN for Breakout
https://github.com/apache/incubator-mxnet/tree/master/example/reinforcement-learning/dqn
DQN Algorithm
• Techniques that increase stability and improve convergence:
• ε-greedy exploration
• Technique: choose the action given by the current greedy policy with probability (1 - ε) and a random action with probability ε
• Advantage: minimizes overfitting of the network
• Experience ($s_t, a_t, r_t, s_{t+1}$) replay
• Technique: store the agent's experiences and use random samples of them to update the Q-network (see the replay-buffer sketch after this list)
• Advantage: removes correlations in the observation sequence
• Periodic update of Q towards the target
• Technique: every C updates, clone the Q-network and use the clone ($\hat{Q}$) to generate targets for the following C updates to the Q-network
• Advantage: reduces correlations with the target
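A minimal experience-replay buffer sketch in Python; the capacity and batch size are illustrative, not the talk's settings:

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=1_000_000):
        self.memory = deque(maxlen=capacity)    # oldest transitions fall off the end

    def add(self, s, a, r, s_next, done):
        self.memory.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Uniform random minibatch U(D) used to update the Q-network
        return random.sample(self.memory, batch_size)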
DQN-Algorithm
Diagram: the DQN training loop, showing the online DQN $Q(s, a; \theta)$, the cloned target DQN $\hat{Q}(s, a; \theta^-)$, and the replay memory D of transitions $(\phi_{t-3}, a_{t-3}, r_{t-3}, \phi_{t-2}), \dots, (\phi_{t-1}, a_{t-1}, r_{t-1}, \phi_t)$.
Initialize the replay memory D (N = 1M) with random play.
Initialize the DQNs with random weights $\theta_0$; set the clone $\theta^- = \theta_0$.
Episode 1: select $s_1$ and get $\phi_1$.
Time step 1: $a_1 =$ random action with probability ε, else $\mathrm{argmax}_a\, Q(\phi_1, a; \theta_0)$. Observe reward $r_1$, move to $s_2$, add $(\phi_1, a_1, r_1, \phi_2)$ to D.
Generate training data: U(D) = a random sample of D. For each $(\phi_j, a_j, r_j, \phi_{j+1}) \in U(D)$: $y_j = r_j$ if the episode terminates at step j+1, otherwise $y_j = r_j + \gamma \max_{a'} \hat{Q}(\phi_{j+1}, a'; \theta^-)$. Update the online DQN using U(D) and the targets $y_j$, giving $\theta_1$.
Time step 2: $a_2 =$ random action with probability ε, else $\mathrm{argmax}_a\, Q(\phi_2, a; \theta_1)$. Observe reward $r_2$, move to $s_3$, add $(\phi_2, a_2, r_2, \phi_3)$ to D. Repeat the sampling and update steps.
Episode 2, ..., Episode m: select $s_1$ and get $\phi_1$.
Time step t: $a_t =$ random action with probability ε, else $\mathrm{argmax}_a\, Q(\phi_t, a; \theta)$. Observe reward $r_t$, move to $s_{t+1}$, add $(\phi_t, a_t, r_t, \phi_{t+1})$ to D.
Every 10K steps, clone the DQN: $\theta^- = \theta$.
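A hedged MXNet/Gluon sketch of the target computation and update step in the loop above; the loss and trainer choices are illustrative assumptions, not the exact training code behind the results shown:

from mxnet import nd, autograd, gluon

loss_fn = gluon.loss.L2Loss()   # assumption; the Nature agent clips the TD error instead

def dqn_train_step(online_net, target_net, trainer, batch, gamma=0.99):
    # batch: NDArrays of phi_j, a_j, r_j, phi_{j+1}, done flags sampled from D
    phi, actions, rewards, phi_next, done = batch
    # y_j = r_j for terminal transitions, else r_j + gamma * max_a' Q_hat(phi_{j+1}, a'; theta-)
    q_next = target_net(phi_next).max(axis=1)
    targets = rewards + gamma * q_next * (1 - done)
    with autograd.record():
        q_taken = nd.pick(online_net(phi), actions, axis=1)   # Q(phi_j, a_j; theta)
        loss = loss_fn(q_taken, targets)
    loss.backward()
    trainer.step(phi.shape[0])
    # every C (e.g. 10K) steps: copy the online parameters into target_net (theta- = theta)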
Function Approximation
Bellman equation:
$Q^*(s, a) = \mathbb{E}_{s'}\bigl[\, r + \gamma \max_{a'} Q^*(s', a') \mid s, a \,\bigr]$
Iterative update:
$Q_i(s, a) = \mathbb{E}_{s'}\bigl[\, r + \gamma \max_{a'} Q_{i-1}(s', a') \mid s, a \,\bigr]$, with $Q_i \rightarrow Q^*$ as $i \rightarrow \infty$
Function approximation:
$Q^*(s, a) \approx Q(s, a; \theta)$
Modified iterative update:
$Q(s, a; \theta_i) \approx \mathbb{E}_{s'}\bigl[\, r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \mid s, a \,\bigr]$
Loss function to minimize:
$L_i(\theta_i) = \mathbb{E}_{s,a,r}\bigl[\, \bigl( \mathbb{E}_{s'}[\, y \mid s, a \,] - Q(s, a; \theta_i) \bigr)^2 \bigr]$, where $y = r + \gamma \max_{a'} Q(s', a'; \theta_{i-1})$
Network Architecture
https://web.stanford.edu/class/psych209/Readings/MnihEtAlHassibis15NatureControlDeepRL.pdf
Input: a stack of preprocessed frames $\phi(s)$ of size 84 x 84 x 4; output: $Q(s, a; \theta)$, one value per action.
Deep Convolutional Network - Nature
import mxnet as mx
from mxnet import gluon

num_action = 4  # size of the game's action space (4 is an assumption, e.g. Breakout)

DQN = gluon.nn.Sequential()
with DQN.name_scope():
    # first layer: 32 8x8 filters, stride 4, over the 84x84x4 input
    DQN.add(gluon.nn.Conv2D(channels=32, kernel_size=8, strides=4, padding=0))
    DQN.add(gluon.nn.BatchNorm(axis=1, momentum=0.1, center=True))
    DQN.add(gluon.nn.Activation('relu'))
    # second layer: 64 4x4 filters, stride 2
    DQN.add(gluon.nn.Conv2D(channels=64, kernel_size=4, strides=2))
    DQN.add(gluon.nn.BatchNorm(axis=1, momentum=0.1, center=True))
    DQN.add(gluon.nn.Activation('relu'))
    # third layer: 64 3x3 filters, stride 1
    DQN.add(gluon.nn.Conv2D(channels=64, kernel_size=3, strides=1))
    DQN.add(gluon.nn.BatchNorm(axis=1, momentum=0.1, center=True))
    DQN.add(gluon.nn.Activation('relu'))
    DQN.add(gluon.nn.Flatten())
    # fourth layer: fully connected, 512 units
    DQN.add(gluon.nn.Dense(512, activation='relu'))
    # fifth layer: one output per action (note: the Nature paper uses a linear output layer here)
    DQN.add(gluon.nn.Dense(num_action, activation='relu'))
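A short usage sketch, assuming the network above: initialize the parameters and run a forward pass on a batch of stacked frames (NCHW layout, so the 84 x 84 x 4 input becomes a 4 x 84 x 84 tensor per sample):

from mxnet import nd, init

DQN.initialize(init.Xavier())                          # or load trained weights
frames = nd.random.uniform(shape=(32, 4, 84, 84))      # dummy minibatch of states
q_values = DQN(frames)                                 # shape (32, num_action)
action = int(q_values[0].argmax().asscalar())          # greedy action for the first state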
Issues with DQN
• Q-Learning overestimates action values because of the maximization over estimated values.
• Overestimation has been attributed to noise and to insufficiently flexible function approximation.
• DQN provides a flexible function approximator.
• The deterministic nature of Atari games eliminates noise.
• Yet DQN still significantly overestimates action values.
Double Q Learning and DDQN
Double Q-Learning
• The max operator uses the same values both to select and to evaluate an action. This leads to over-optimism.
• Decoupling selection from evaluation can prevent this over-optimism. This is the idea behind Double Q-Learning.
• In Double Q-Learning, two value functions are learned by randomly assigning each experience to update one of the two, resulting in two sets of weights, $\theta$ and $\theta'$.
• For each update, one set of weights is used to determine the greedy policy and the other to determine its value.
$Y_t^{Q} \equiv R_{t+1} + \gamma \max_a Q(S_{t+1}, a;\, \theta_t)$
$Y_t^{DQN} \equiv R_{t+1} + \gamma \max_a Q(S_{t+1}, a;\, \theta_t^-)$
Untangling Evaluation and Action Selection
• For action selection we use $\theta$.
• For evaluation we use $\theta'$.
$Y_t^{Q} \equiv R_{t+1} + \gamma \max_a Q(S_{t+1}, a;\, \theta_t) \;\rightarrow\; Y_t^{Q} \equiv R_{t+1} + \gamma\, Q\bigl(S_{t+1}, \mathrm{argmax}_a\, Q(S_{t+1}, a; \theta_t);\, \theta_t\bigr)$
$Y_t^{DQN} \equiv R_{t+1} + \gamma \max_a Q(S_{t+1}, a;\, \theta_t^-) \;\rightarrow\; Y_t^{DoubleQ} \equiv R_{t+1} + \gamma\, Q\bigl(S_{t+1}, \mathrm{argmax}_a\, Q(S_{t+1}, a; \theta_t);\, \theta'_t\bigr)$
Over-optimism and Error Estimation – upper bound
• Thrun and Schwartz showed that when the action-value estimates contain errors uniformly distributed in an interval $[-\epsilon, \epsilon]$, the upper bound on the resulting over-optimism is $\gamma \epsilon \frac{m-1}{m+1}$, where m is the number of actions.
Over-optimism and Error Estimation – lower bound
• Consider a state s at which all true optimal action values are equal, $Q_*(s, a) = V_*(s)$ for some $V_*(s)$.
• Let $Q_t$ be arbitrary value estimates that are on the whole unbiased, so that $\sum_a (Q_t(s, a) - V_*(s)) = 0$, but that are not all correct, such that $\frac{1}{m}\sum_a (Q_t(s, a) - V_*(s))^2 = C$ for some $C > 0$, where $m \ge 2$ is the number of actions in s.
• Then $\max_a Q_t(s, a) \ge V_*(s) + \sqrt{\frac{C}{m-1}}$.
• This lower bound is tight. The lower bound on the absolute error of the Double Q-Learning estimate is zero.
Number of Actions and Bias
Bias in Q-Learning vs Double Q-Learning
DDQN
• DDQN uses DQN's online network to select the greedy action.
• DDQN uses DQN's target network to evaluate that action's value.
$Y_t^{DoubleDQN} \equiv R_{t+1} + \gamma\, Q\bigl(S_{t+1}, \mathrm{argmax}_a\, Q(S_{t+1}, a; \theta_t);\, \theta_t^-\bigr)$
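A sketch of the DDQN target next to the DQN target, in the same hypothetical Gluon setting as the earlier training-step sketch (online_net holds θ, target_net holds θ⁻):

from mxnet import nd

def dqn_targets(target_net, rewards, phi_next, done, gamma=0.99):
    # y = r + gamma * max_a' Q_hat(s', a'; theta-)
    q_next = target_net(phi_next).max(axis=1)
    return rewards + gamma * q_next * (1 - done)

def ddqn_targets(online_net, target_net, rewards, phi_next, done, gamma=0.99):
    # select the greedy action with the online network (theta) ...
    a_star = online_net(phi_next).argmax(axis=1)
    # ... but evaluate that action with the target network (theta-)
    q_next = nd.pick(target_net(phi_next), a_star, axis=1)
    return rewards + gamma * q_next * (1 - done)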
Results
Bayesian DQN or BDQN
Focusing on Efficient Exploration
• The central claim is that mechanisms such as ε-greedy exploration are inefficient.
• Thompson Sampling allows targeted exploration in high dimensions but is computationally too expensive.
• BDQN aims to implement Thompson Sampling at scale through function approximation.
• BDQN combines DQN with a BLR (Bayesian Linear Regression) model on the last layer.
Thompson Sampling
• Thompson sampling maintains a prior distribution over environment models (reward and/or dynamics).
• The distribution is updated as observations are made.
• To choose an action, a sample is drawn from the posterior belief and the action that maximizes the expected return under the sampled belief is selected (a toy sketch follows below).
• For more information, see "A Tutorial on Thompson Sampling", Daniel Russo et al., https://arxiv.org/abs/1707.02038
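A toy illustration of the idea on a Bernoulli bandit with Beta posteriors; this is far simpler than BDQN's Bayesian linear regression, but it shows the sample-then-act pattern (all numbers are made up):

import numpy as np

true_p = [0.3, 0.5, 0.7]            # unknown reward probabilities per arm
alpha = np.ones(3)                  # Beta posterior: successes + 1
beta = np.ones(3)                   # Beta posterior: failures + 1
rng = np.random.default_rng(0)

for t in range(1000):
    theta = rng.beta(alpha, beta)   # sample one model from the posterior
    arm = int(np.argmax(theta))     # act greedily under the sampled model
    reward = rng.random() < true_p[arm]
    alpha[arm] += reward            # update the posterior for the chosen arm
    beta[arm] += 1 - reward

print(alpha / (alpha + beta))       # posterior means concentrate on the best arm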
TS vs ε-Greedy
• ε-greedy focuses on the greedy action.
• TS explores actions with higher estimated return with higher probability.
• A TS-based strategy improves the exploration/exploitation balance by trading off expected returns against their uncertainties, while the ε-greedy strategy ignores all of this information.
TS vs ε-Greedy
• TS finds the optimal Q-function faster.
• It randomizes over Q-functions with promising returns and high uncertainty.
• When the true Q-function is selected, its posterior probability increases.
• When other functions are selected, wrong values are estimated and their posterior probability drops to zero.
• An ε-greedy agent randomizes its actions with probability ε even after having found the true Q-function; therefore it can take exponentially many trials to get to the target.
BDQN Algorithm
Network Architecture
BDQN Performance
Closing Words
Value Alignment
References
• DQN: https://www.nature.com/articles/nature14236
• DDQN: https://arxiv.org/abs/1509.06461
• BDQN: https://arxiv.org/abs/1802.04412
• DQN MXNet Code: https://github.com/apache/incubator-mxnet/tree/master/example/reinforcement-learning/dqn
• DQN MXNet/Gluon Code: https://github.com/zackchase/mxnet-the-straight-dope/blob/master/chapter17_deep-reinforcement-learning/DQN.ipynb
• DDQN MXNet/Gluon Code: https://github.com/zackchase/mxnet-the-straight-dope/blob/master/chapter17_deep-reinforcement-learning/DDQN.ipynb
• BDQN MXNet/Gluon Code: https://github.com/kazizzad/BDQN-MxNet-Gluon