18. Genealogy of deep reinforcement learning algorithms

(Figure: family tree of algorithms)
Classical roots: Monte Carlo, Q-Learning, SARSA, Actor-Critic, Policy Gradient, Deep Learning
Deep Q-Network (DQN) and its extensions: Double DQN (Double Q-Learning), GORILA (parallelization), Dueling DQN, Prioritized Experience Replay, Advantage Q-Learning
Policy-side descendants: A3C, TRPO, PPO, UNREAL, Generalized Advantage Estimator
19. Q-Table

• Q-learning keeps the value Q(s, a) of every state-action pair in a table, the Q-table.
• To act greedily, the agent reads the row for the current state and picks the action with the largest Q-value (here, ↓ in state s = 1).

    State   Action ↑   Action ↓   Action ←   Action →
    s = 1      7          9          3          0
    s = 2      4          7          0          0
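Since the learner is just this table plus an update rule, a tabular Q-learning step fits in a few lines. The sketch below is illustrative only: the state and action counts match the table above, but the learning rate, discount factor, and helper names are assumptions, not from the slides.

import numpy as np

n_states, n_actions = 2, 4            # e.g. s in {1, 2}, actions up/down/left/right
Q = np.zeros((n_states, n_actions))   # the Q-table itself

alpha, gamma = 0.1, 0.95              # assumed learning rate and discount factor

def update(s, a, r, s_next):
    # One Q-learning step: move Q[s, a] toward r + gamma * max_a' Q[s', a'].
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])

def greedy_action(s):
    # Greedy action selection reads a single row of the table.
    return int(Q[s].argmax())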
52. Trying an OpenAI Gym environment

import gym

env = gym.make('CartPole-v0')
print('observation space:', env.observation_space)
print('action space:', env.action_space)

obs = env.reset()
print('initial observation:', obs)

action = env.action_space.sample()
obs, r, done, info = env.step(action)
print('next observation:', obs)

Output: the printed observation space, action space, and observation arrays.
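The same pattern, reset once and then call step repeatedly, drives any Gym environment. As a quick illustration of the interface, a full random-policy episode might look like the sketch below; the CartPole-v0 choice and the classic step signature (obs, reward, done, info) follow the code above, while the rollout itself is an assumed example, not from the slide.

import gym

env = gym.make('CartPole-v0')
obs = env.reset()
total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()        # uniform random action
    obs, reward, done, info = env.step(action)
    total_reward += reward
print('episode return:', total_reward)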
53. Defining the Q-function with Chainer

import chainer
import chainer.functions as F
import chainer.links as L
import chainerrl

class QFunction(chainer.Chain):

    def __init__(self, obs_size, n_actions, n_hidden_channels=50):
        super(QFunction, self).__init__(
            l0=L.Linear(obs_size, n_hidden_channels),
            l1=L.Linear(n_hidden_channels, n_hidden_channels),
            l2=L.Linear(n_hidden_channels, n_actions))

    def __call__(self, x, test=False):
        h = F.tanh(self.l0(x))
        h = F.tanh(self.l1(h))
        return chainerrl.action_value.DiscreteActionValue(self.l2(h))

obs_size = env.observation_space.shape[0]
n_actions = env.action_space.n
q_func = QFunction(obs_size, n_actions)
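To check that the shapes line up, the freshly built q_func can be called on a dummy batch. This smoke test is an assumption for illustration, not part of the slide; it relies on DiscreteActionValue exposing q_values and greedy_actions, as in ChainerRL's action_value module.

import numpy as np

batch_obs = np.zeros((1, obs_size), dtype=np.float32)  # hypothetical dummy observation
action_value = q_func(batch_obs)                       # a DiscreteActionValue
print('Q-values:', action_value.q_values.data)         # shape (1, n_actions)
print('greedy action:', action_value.greedy_actions.data)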
54. Creating a DQN agent with ChainerRL (CartPole)

# Use Adam to optimize q_func.
optimizer = chainer.optimizers.Adam(eps=1e-2)
optimizer.setup(q_func)

# Discount factor for future rewards.
gamma = 0.95

# Explore with epsilon-greedy action selection.
explorer = chainerrl.explorers.ConstantEpsilonGreedy(
    epsilon=0.3, random_action_func=env.action_space.sample)

# DQN uses experience replay; give the buffer a capacity.
replay_buffer = chainerrl.replay_buffer.ReplayBuffer(capacity=10 ** 6)

# Chainer expects float32 observations, so convert them in phi.
phi = lambda x: x.astype(np.float32, copy=False)

agent = chainerrl.agents.DQN(
    q_func, optimizer, replay_buffer, gamma, explorer,
    replay_start_size=500, update_interval=1,
    target_update_interval=100, phi=phi)
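Even before any training, the constructed agent can already act greedily with respect to the randomly initialized Q-function. A minimal sanity check, assuming the env and agent defined above and ChainerRL's standard act / stop_episode interface:

obs = env.reset()
action = agent.act(obs)   # greedy w.r.t. the current Q-function, no exploration
print('greedy action:', action)
agent.stop_episode()      # clear the agent's episodic state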
55. Training the agent

n_episodes = 200
max_episode_len = 200
for i in range(1, n_episodes + 1):
    obs = env.reset()
    reward = 0
    done = False
    R = 0  # return (sum of rewards) in this episode
    t = 0  # time step
    while not done and t < max_episode_len:
        action = agent.act_and_train(obs, reward)
        obs, reward, done, _ = env.step(action)
        R += reward
        t += 1
    if i % 10 == 0:
        print('episode:', i, 'R:', R,
              'statistics:', agent.get_statistics())
    agent.stop_episode_and_train(obs, reward, done)
print('Finished.')
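Evaluation runs the same loop shape, but with act and stop_episode instead of the *_and_train methods, so the agent neither explores nor updates its network. A minimal sketch, assuming the env and agent defined above; the episode count and length cap are illustrative values:

for i in range(5):
    obs = env.reset()
    done = False
    R = 0
    t = 0
    while not done and t < 200:
        action = agent.act(obs)            # greedy action, no exploration
        obs, r, done, _ = env.step(action)
        R += r
        t += 1
    agent.stop_episode()                   # reset episodic state, no update
    print('test episode:', i, 'R:', R)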