12
Learn actions that maximize the reward that can be obtained.
13
• Discount factor γ: weights future rewards, e.g. γ = 0 (only the immediate reward counts) vs. γ = 0.9 (future rewards still matter)
14
15
Learn actions that maximize the reward (return) obtained over the entire future.
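The effect of the discount factor γ above can be checked with a short calculation. This is a sketch with illustrative reward values; `discounted_return` is a helper written here, not code from the slides:

```python
# Discounted return: G = r0 + γ·r1 + γ²·r2 + ...
def discounted_return(rewards, gamma):
    g = 0.0
    # Accumulate from the last reward backwards: G_t = r_t + γ·G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Hypothetical episode: small reward now, large reward at the end.
rewards = [1.0, 0.0, 0.0, 10.0]

print(discounted_return(rewards, 0.0))   # 1.0 — only the immediate reward counts
print(discounted_return(rewards, 0.9))   # 8.29 — the future reward dominates
```

With γ = 0 the agent is myopic; with γ = 0.9 the large delayed reward dominates the return.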
16
17
Reinforcement learning algorithm map:
• Monte Carlo
• Q-Learning / SARSA
• Actor-Critic
• Policy Gradient
• Deep Learning
• Deep Q-Network (DQN)
• Double DQN / Double Q-Learning
• GORILA (parallelized)
• Dueling DQN
• A3C
• TRPO / PPO
• UNREAL
• Generalized Advantage Estimator
• Advantage Q-Learning
• Prioritized Experience Replay
18
19
20
21
Q-Table:
State | Action ↑ | Action ↓ | Action ← | Action →
s = 1 |        7 |        9 |        3 |        0
s = 2 |        4 |        7 |        0 |        0
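The Q-table above is filled in by repeated updates. A minimal sketch of the tabular Q-learning rule, Q(s,a) ← Q(s,a) + α[r + γ·max Q(s',·) − Q(s,a)], with a hypothetical single transition (the learning rate α and the reward value are illustrative, not from the slides):

```python
alpha, gamma = 0.5, 0.9

# Q-table: Q[state][action], shaped like the slide's table.
Q = {s: {a: 0.0 for a in ('up', 'down', 'left', 'right')} for s in (1, 2)}

def q_update(s, a, r, s_next):
    # Q-learning update: move Q(s,a) toward r + γ·max_a' Q(s',a')
    best_next = max(Q[s_next].values())
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])

# One hypothetical transition: in state 1, action 'right' yields reward 1
# and lands in state 2 (whose Q-values are still all zero).
q_update(1, 'right', 1.0, 2)
print(Q[1]['right'])  # 0.5 * (1 + 0.9*0 - 0) = 0.5
```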
22
24
25
26
Experience replay: transitions are stored in a replay buffer, shuffled, and sampled as minibatches for learning the parameters θ.
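The replay-buffer mechanism can be sketched in a few lines. This is an illustrative stand-alone implementation (the class name, capacity, and dummy transitions are assumptions, not the deck's code):

```python
import collections
import random

# Minimal experience replay buffer for (s, a, r, s', done) transitions.
class ReplayBuffer:
    def __init__(self, capacity):
        # A bounded deque: the oldest transitions drop out automatically.
        self.buffer = collections.deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # A uniform random minibatch breaks the correlation between
        # consecutive transitions, which stabilizes learning.
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer(capacity=1000)
for t in range(100):
    buf.add((t, 0, 0.0, t + 1, False))  # dummy transitions
batch = buf.sample(32)
print(len(batch))  # 32
```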
29
Reinforcement learning algorithm map
31
37
38
44
45
46
• UNREAL: reinforcement learning with unsupervised auxiliary tasks
https://deepmind.com/blog/reinforcement-learning-unsupervised-auxiliary-tasks/
47
48
• ChainerRL
• OpenAI Gym
• MuJoCo
49
50
OpenAI Gym environments:
• Classic control: CartPole
• Atari: Pong, Breakout, Boxing
• MuJoCo: Humanoid
• Robotics: Picking, Hand manipulate
51
52
import gym
env = gym.make('CartPole-v0')
env.reset()   # initialize the environment
env.render()  # display the environment
Output
53
Stepping the environment:
action = env.action_space.sample()                  # sample a random action
observation, reward, done, info = env.step(action)  # advance one step
Output
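The `reset`/`step` calls above combine into a full episode loop. To keep the sketch self-contained (no Gym installation required), it uses a tiny stand-in environment with the same `reset()`/`step()` interface that Gym's `CartPole-v0` exposes; `ToyEnv` and its dynamics are hypothetical:

```python
import random

class ToyEnv:
    """Hypothetical environment: every step gives reward 1, episode ends after 10 steps."""
    def reset(self):
        self.t = 0
        return self.t  # initial observation

    def step(self, action):
        self.t += 1
        done = self.t >= 10
        return self.t, 1.0, done, {}  # observation, reward, done, info

env = ToyEnv()
obs = env.reset()
total_reward, done = 0.0, False
while not done:
    action = random.choice([0, 1])              # random policy
    obs, reward, done, info = env.step(action)  # same 4-tuple as Gym
    total_reward += reward
print(total_reward)  # 10.0
```

Replacing `ToyEnv()` with `gym.make('CartPole-v0')` gives the same loop over the real environment.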
54
• Distributed learning
• Image recognition
• Reinforcement learning
58
59
Setup with ChainerRL
import chainer
import chainer.functions as F
import chainer.links as L
import chainerrl
import gym
import numpy as np

env = gym.make('CartPole-v0')
print('observation space:', env.observation_space)
print('action space:', env.action_space)

obs = env.reset()
print('initial observation:', obs)
Output
60
Defining the Q-function
class QFunction(chainer.Chain):
    def __init__(self, obs_size, n_actions, n_hidden_channels=50):
        super(QFunction, self).__init__(
            l0=L.Linear(obs_size, n_hidden_channels),
            l1=L.Linear(n_hidden_channels, n_hidden_channels),
            l2=L.Linear(n_hidden_channels, n_actions))

    def __call__(self, x, test=False):
        h = F.tanh(self.l0(x))
        h = F.tanh(self.l1(h))
        return chainerrl.action_value.DiscreteActionValue(self.l2(h))

obs_size = env.observation_space.shape[0]
n_actions = env.action_space.n
q_func = QFunction(obs_size, n_actions)
61
Creating the DQN agent
# Optimizer
optimizer = chainer.optimizers.Adam(eps=1e-2)
optimizer.setup(q_func)

# Discount factor
gamma = 0.95

# ε-greedy exploration
explorer = chainerrl.explorers.ConstantEpsilonGreedy(
    epsilon=0.3, random_action_func=env.action_space.sample)

# Experience replay buffer
replay_buffer = chainerrl.replay_buffer.ReplayBuffer(capacity=10 ** 6)

phi = lambda x: x.astype(np.float32, copy=False)

agent = chainerrl.agents.DQN(
    q_func, optimizer, replay_buffer, gamma, explorer,
    replay_start_size=500, update_interval=1,
    target_update_interval=100, phi=phi)
62
Training loop
n_episodes = 200
for i in range(1, n_episodes + 1):
    obs = env.reset()
    reward = 0
    done = False
    R = 0  # return (sum of rewards) in this episode
    while not done:
        action = agent.act_and_train(obs, reward)
        obs, reward, done, _ = env.step(action)
        R += reward
    if i % 10 == 0:
        print('episode:', i, 'R:', R, 'statistics:', agent.get_statistics())
    agent.stop_episode_and_train(obs, reward, done)
print('Finished.')
63
Testing the learned agent
for i in range(10):
    obs = env.reset()
    done = False
    R = 0
    t = 0
    while not done and t < 200:
        env.render()
        action = agent.act(obs)
        obs, r, done, _ = env.step(action)
        R += r
        t += 1
    print('test episode:', i, 'R:', R)
    agent.stop_episode()

Before training vs. after training for 200 episodes
64
4
65
• Deep convolutional network: input image → Convolution + Pooling layers → Fully connected layers
66
Agent–environment interaction:
• Agent (e.g. robot): observes state s through a state-observation module and selects an action
• Environment (e.g. maze): receives the action and transitions to state s'
• The agent observes the new state and the cycle repeats
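The interaction cycle in the diagram can be sketched as code. Names here (`agent_select_action`, `env_transition`, the toy dynamics) are illustrative stand-ins, not from a particular library:

```python
import random

def agent_select_action(state, actions=('up', 'down', 'left', 'right')):
    # Placeholder policy: pick an action at random.
    return random.choice(actions)

def env_transition(state, action):
    # Hypothetical maze dynamics: the state index just advances.
    return state + 1

state = 0
for _ in range(3):
    action = agent_select_action(state)    # agent: action selection from state s
    state = env_transition(state, action)  # environment: s -> s'
print(state)  # 3
```

A real agent would also receive a reward from each transition and use it to improve its policy.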
68
Average number of action steps / average return
71
84
85
In the simulator, learning is possible with any input image.
91
RL_Tutorial
Takayoshi Yamashita