1. A Connection Between GAN (Generative Adversarial Networks), IRL (Inverse Reinforcement Learning), and Energy-Based Models
Mabonki0725
May 16, 2017
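5. IRL
IRL assigns each trajectory $\tau$ a reward $-c_\theta(\tau)$ and models the trajectory distribution as an energy-based (Boltzmann) distribution with partition function $Z(\theta)$:

$$p_\theta(\tau) = \frac{1}{Z(\theta)} \exp(-c_\theta(\tau))$$

$$c_\theta(\tau) = \sum_t c_\theta(x_t, u_t), \qquad \tau = \{x_1, x_2, \cdots, x_T,\; u_1, u_2, \cdots, u_T\}$$

$-c_\theta(\tau)$: reward; $c_\theta(\tau)$: cost
$x_t$: state $x$ at time $t$
$u_t$: action $u$ at time $t$
$\tau$: trajectory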
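As a rough illustration of the definitions on this slide, the sketch below evaluates $c_\theta(\tau)$ and $p_\theta(\tau)$ on a tiny, made-up set of three trajectories. The quadratic per-step cost, the parameter pair `theta`, and the finite trajectory set are assumptions made for the example, not part of the slides; on a finite set the partition function reduces to a sum.

```python
import math

def trajectory_cost(theta, states, actions):
    """c_theta(tau) = sum_t c_theta(x_t, u_t).

    The per-step cost w_x * x^2 + w_u * u^2 is a hypothetical choice.
    """
    w_x, w_u = theta
    return sum(w_x * x * x + w_u * u * u for x, u in zip(states, actions))

def trajectory_distribution(theta, trajectories):
    """Return p_theta(tau) = exp(-c_theta(tau)) / Z over a finite set, and Z."""
    costs = [trajectory_cost(theta, xs, us) for xs, us in trajectories]
    weights = [math.exp(-c) for c in costs]
    Z = sum(weights)  # partition function on the finite set
    return [w / Z for w in weights], Z

# Three made-up trajectories, each a pair (states x_1..x_T, actions u_1..u_T).
trajectories = [
    ([0.0, 0.1, 0.0], [0.1, -0.1, 0.0]),
    ([0.0, 1.0, 2.0], [1.0, 1.0, 0.0]),
    ([0.0, -0.5, -1.0], [-0.5, -0.5, 0.0]),
]
theta = (1.0, 0.5)
probs, Z = trajectory_distribution(theta, trajectories)
```

The lowest-cost trajectory receives the highest probability, which is the sense in which $-c_\theta(\tau)$ acts as a reward.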
7. IRL
Approaches to IRL:
Max Entropy: maximize the entropy $-\int p_\theta(\tau) \log p_\theta(\tau)\, d\tau$
Gaussian Process
Guided Cost Learning
Guided Cost Learning is the approach related to GAN here.
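8. Guided Cost Learning for IRL
Under the Max Entropy model, the network for $p_\theta$ is trained with the loss $L_{cost}$, the negative log-likelihood of trajectories drawn from the demonstration distribution $p$.
(Cost of IRL)
\begin{align}
L_{cost}(\theta) &= E_{\tau\sim p}[-\log p_\theta(\tau)] \tag{1} \\
&= E_{\tau\sim p}[c_\theta(\tau)] + \log Z(\theta) \tag{2} \\
&= E_{\tau\sim p}[c_\theta(\tau)] + \log E_{\tau\sim q}\left[\frac{\exp(-c_\theta(\tau))}{q(\tau)}\right] \tag{3}
\end{align}
In the Max Entropy model $Z(\theta)$ is hard to compute directly, so (3) estimates it by importance sampling from a distribution $q$.
The sampler $q$ is trained with its own loss $L_{sampler}(q)$.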
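The step from $\log Z(\theta)$ to an expectation under $q$ can be checked numerically. The sketch below is only an illustration under assumed values: five trajectories with made-up costs, a made-up sampler `q`, and a made-up demonstration distribution `demo_p`. It compares the exact $Z = \sum_\tau \exp(-c_\theta(\tau))$ with the Monte Carlo estimate of $E_{\tau\sim q}[\exp(-c_\theta(\tau))/q(\tau)]$.

```python
import math
import random

costs = [0.5, 1.0, 1.5, 2.0, 3.0]      # hypothetical c_theta(tau) for 5 trajectories
q = [0.30, 0.25, 0.20, 0.15, 0.10]     # hypothetical sampler distribution q(tau)
demo_p = [0.5, 0.3, 0.1, 0.05, 0.05]   # hypothetical demonstration distribution p(tau)

# Exact partition function on the finite set: Z = sum_tau exp(-c_theta(tau)).
Z_exact = sum(math.exp(-c) for c in costs)

def estimate_Z(costs, q, num_samples, seed=0):
    """Monte Carlo estimate of Z = E_{tau~q}[exp(-c_theta(tau)) / q(tau)]."""
    rng = random.Random(seed)
    idx = rng.choices(range(len(costs)), weights=q, k=num_samples)
    return sum(math.exp(-costs[i]) / q[i] for i in idx) / num_samples

def cost_loss(demo_p, costs, Z):
    """L_cost = E_{tau~p}[c_theta(tau)] + log Z."""
    return sum(p * c for p, c in zip(demo_p, costs)) + math.log(Z)

Z_hat = estimate_Z(costs, q, num_samples=20000)
loss_exact = cost_loss(demo_p, costs, Z_exact)
loss_estimated = cost_loss(demo_p, costs, Z_hat)
```

The estimate is unbiased for any $q$ with full support, but its variance grows when $q$ puts little mass where $\exp(-c_\theta(\tau))$ is large, which is why the sampler is trained as well.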
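9. Guided Cost Learning for IRL
For a given cost $c_\theta$, the partition function is $Z = \int \exp(-c_\theta(\tau))\, d\tau$.
The sampler $q(\tau)$ is fitted to $\frac{1}{Z}\exp(-c_\theta(\tau))$ by minimizing the KL divergence, which gives the loss $L_{sampler}(q)$ for the network $q(\tau)$.
(Sampler of IRL)
\begin{align}
L_{sampler}(q) &= KL\left(q(\tau) \,\Big\|\, \frac{1}{Z}\exp(-c_\theta(\tau))\right) \tag{4} \\
&= \int q(\tau) \log \frac{q(\tau)}{\frac{1}{Z}\exp(-c_\theta(\tau))}\, d\tau \tag{5} \\
&= E_{\tau\sim q}[c_\theta(\tau)] + E_{\tau\sim q}[\log q(\tau)] + \log Z \tag{6}
\end{align}
Since $\log Z$ does not depend on $q$, it can be dropped when optimizing $q$.
Guided Cost Learning alternates between the two updates:
$L_{cost}(\theta)$ trains $p_\theta$
$L_{sampler}(q)$ trains $q$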
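The expansion of the KL divergence into $E_{\tau\sim q}[c_\theta(\tau)] + E_{\tau\sim q}[\log q(\tau)] + \log Z$ can be verified on a finite trajectory set. The sketch below uses made-up costs and a made-up sampler `q`; both are assumptions for illustration only.

```python
import math

costs = [0.2, 0.7, 1.1, 2.5]   # hypothetical c_theta(tau)
q = [0.4, 0.3, 0.2, 0.1]       # hypothetical sampler distribution q(tau)

Z = sum(math.exp(-c) for c in costs)
p_theta = [math.exp(-c) / Z for c in costs]   # (1/Z) exp(-c_theta(tau))

def kl(a, b):
    """KL(a || b) = sum_tau a(tau) log(a(tau) / b(tau)) on a finite set."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

def sampler_loss(q, costs):
    """E_q[c_theta] + E_q[log q], i.e. the KL expansion without log Z."""
    return sum(qi * (c + math.log(qi)) for qi, c in zip(q, costs))

kl_direct = kl(q, p_theta)
kl_expanded = sampler_loss(q, costs) + math.log(Z)
```

Both quantities agree, and the loss reaches its minimum value of $-\log Z$ exactly when $q$ equals $\frac{1}{Z}\exp(-c_\theta(\tau))$.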
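10. Guided Cost Learning for IRL
Importance sampling uses not only samples from $q(\tau)$ but also samples from the demonstrations $p(\tau)$, so the samples follow the mixture
$$\mu = \frac{1}{2} p(\tau) + \frac{1}{2} q(\tau)$$
The density $p(\tau)$ is not known, so it is replaced by an estimate $\tilde{p}(\tau)$.
In a GAN, the generator is what approximates $p(\tau)$.
(Cost)
$$L_{cost}(\theta) = E_{\tau\sim p}[c_\theta(\tau)] + \log E_{\tau\sim\mu}\left[\frac{\exp(-c_\theta(\tau))}{\frac{1}{2}\tilde{p}(\tau) + \frac{1}{2} q(\tau)}\right] \tag{7}$$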
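11. GAN discriminator
Given the data distribution $p(\tau)$ and the generator distribution $q(\tau)$, the optimal GAN discriminator is $D^*$.
(GAN Discriminator)
$$D^*(\tau) = \frac{p(\tau)}{p(\tau) + q(\tau)} \tag{8}$$
Substituting the IRL model for $p(\tau)$,
$$p(\tau) = \frac{1}{Z}\exp(-c_\theta(\tau))$$
gives a discriminator parameterized by $\theta$.
(GAN Discriminator for $\theta$)
$$D_\theta(\tau) = \frac{\frac{1}{Z}\exp(-c_\theta(\tau))}{\frac{1}{Z}\exp(-c_\theta(\tau)) + q(\tau)} \tag{9}$$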
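A minimal sketch of this discriminator, assuming the standard optimal form $D = p/(p+q)$ with $p$ replaced by $\frac{1}{Z}\exp(-c_\theta(\tau))$. The costs and the sampler values are made up, and `discriminator` is a hypothetical helper name.

```python
import math

def discriminator(cost, q_density, Z):
    """D_theta(tau) = a / (a + q(tau)) with a = exp(-c_theta(tau)) / Z."""
    a = math.exp(-cost) / Z
    return a / (a + q_density)

costs = [0.2, 0.7, 1.1, 2.5]   # hypothetical c_theta(tau)
q = [0.4, 0.3, 0.2, 0.1]       # hypothetical generator / sampler distribution

Z = sum(math.exp(-c) for c in costs)
p_theta = [math.exp(-c) / Z for c in costs]

# Discriminator output for a sampler that differs from the model.
d_values = [discriminator(c, qi, Z) for c, qi in zip(costs, q)]

# When the sampler matches the model exactly, the discriminator cannot
# tell the two apart and outputs 1/2 everywhere.
d_matched = [discriminator(c, pi, Z) for c, pi in zip(costs, p_theta)]
```

Written this way, the discriminator needs no separate network output: it is fully determined by the cost $c_\theta$, the sampler density $q$, and the scalar $Z$.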
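12. GAN discriminator Loss
(Loss of Discriminator)
\begin{align}
L_{discriminator}(D_\theta) &= E_{\tau\sim p}[-\log D_\theta(\tau)] + E_{\tau\sim q}[-\log(1 - D_\theta(\tau))] \tag{10} \\
&= E_{\tau\sim p}\left[-\log \frac{\frac{1}{Z}\exp(-c_\theta(\tau))}{\frac{1}{Z}\exp(-c_\theta(\tau)) + q(\tau)}\right] + E_{\tau\sim q}\left[-\log \frac{q(\tau)}{\frac{1}{Z}\exp(-c_\theta(\tau)) + q(\tau)}\right] \tag{11}
\end{align}
This is the loss of the discriminator network.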
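13. Estimate Z
Define the mixture
$$\tilde{\mu}(\tau) = \frac{1}{2Z}\exp(-c_\theta(\tau)) + \frac{1}{2} q(\tau)$$
Writing the discriminator as $D_\theta(\tau) = \frac{\frac{1}{Z}\exp(-c_\theta(\tau))}{\frac{1}{Z}\exp(-c_\theta(\tau)) + q(\tau)}$, we have $D_\theta(\tau) = \frac{1}{2}\,\frac{\frac{1}{Z}\exp(-c_\theta(\tau))}{\tilde{\mu}(\tau)}$ and $1 - D_\theta(\tau) = \frac{1}{2}\,\frac{q(\tau)}{\tilde{\mu}(\tau)}$.
(Discriminator)
\begin{align}
L_{discriminator}(D_\theta) &= E_{\tau\sim p}[-\log D_\theta(\tau)] + E_{\tau\sim q}[-\log(1 - D_\theta(\tau))] \\
&= E_{\tau\sim p}\left[-\log \frac{\frac{1}{Z}\exp(-c_\theta(\tau))}{\tilde{\mu}(\tau)}\right] + E_{\tau\sim q}\left[-\log \frac{q(\tau)}{\tilde{\mu}(\tau)}\right] + 2\log 2 \\
&= \log Z + E_{\tau\sim p}[c_\theta(\tau)] + E_{\tau\sim p}[\log \tilde{\mu}(\tau)] - E_{\tau\sim q}[\log q(\tau)] + E_{\tau\sim q}[\log \tilde{\mu}(\tau)] + 2\log 2
\end{align}
The constant $2\log 2$ does not affect any derivative and is omitted from here on.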
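14. Estimate Z
Up to an additive constant,
$$L_{discriminator}(D_\theta) = \log Z + E_{\tau\sim p}[c_\theta(\tau)] + E_{\tau\sim p}[\log \tilde{\mu}(\tau)] - E_{\tau\sim q}[\log q(\tau)] + E_{\tau\sim q}[\log \tilde{\mu}(\tau)]$$
To find the $Z$ that minimizes $L_{discriminator}(D_\theta)$, differentiate with respect to $Z$. Here $\mu = \frac{1}{2}p + \frac{1}{2}q$, and $\tilde{\mu} = \frac{1}{2Z}\exp(-c_\theta) + \frac{1}{2}q$ depends on $Z$.
(Derivative of the discriminator loss with respect to $Z$)
$$\partial_Z L_{discriminator}(D_\theta) = \frac{1}{Z} - E_{\tau\sim\mu}\left[\frac{\frac{1}{Z^2}\exp(-c_\theta(\tau))}{\tilde{\mu}(\tau)}\right]$$
Setting $\partial_Z L_{discriminator}(D_\theta) = 0$ gives
$$Z = E_{\tau\sim\mu}\left[\frac{\exp(-c_\theta(\tau))}{\tilde{\mu}(\tau)}\right] \tag{17}$$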
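The stationarity condition for $Z$ can be checked on a finite trajectory set. The sketch below is an illustration under assumed values: made-up costs, a made-up demonstration distribution `p`, and a made-up sampler `q`. It solves $Z = E_{\tau\sim\mu}[\exp(-c_\theta(\tau))/\tilde{\mu}(\tau)]$ by bisection (a choice made for this example, not a method from the slides) and confirms that the solution minimizes the discriminator loss.

```python
import math

def mu_tilde(Z, e, q):
    """mu_tilde(tau) = exp(-c)/(2Z) + q/2 for each trajectory."""
    return [ei / (2.0 * Z) + qi / 2.0 for ei, qi in zip(e, q)]

def disc_loss(Z, costs, p, q):
    """log Z + E_p[c] + E_p[log mu_tilde] - E_q[log q] + E_q[log mu_tilde]."""
    e = [math.exp(-c) for c in costs]
    mt = mu_tilde(Z, e, q)
    return (math.log(Z)
            + sum(pi * c for pi, c in zip(p, costs))
            + sum(pi * math.log(m) for pi, m in zip(p, mt))
            - sum(qi * math.log(qi) for qi in q)
            + sum(qi * math.log(m) for qi, m in zip(q, mt)))

def z_update(Z, costs, p, q):
    """Right-hand side E_mu[exp(-c) / mu_tilde], with mu = p/2 + q/2."""
    e = [math.exp(-c) for c in costs]
    mu = [0.5 * pi + 0.5 * qi for pi, qi in zip(p, q)]
    return sum(m * ei / t for m, ei, t in zip(mu, e, mu_tilde(Z, e, q)))

def solve_Z(costs, p, q, lo=1e-6, hi=1e6, iters=200):
    """Bisection on a log scale: z_update(Z) > Z below the root, < Z above it."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if z_update(mid, costs, p, q) > mid:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

costs = [0.2, 0.7, 1.1, 2.5]    # hypothetical c_theta(tau)
p = [0.5, 0.3, 0.15, 0.05]      # hypothetical demonstration distribution
q = [0.4, 0.3, 0.2, 0.1]        # hypothetical sampler distribution
Z_star = solve_Z(costs, p, q)

# Special case: demonstrations drawn from the model itself recover the true Z.
Z_true = sum(math.exp(-c) for c in costs)
p_model = [math.exp(-c) / Z_true for c in costs]
Z_recovered = solve_Z(costs, p_model, q)
```

When the demonstrations follow $\frac{1}{Z}\exp(-c_\theta(\tau))$ exactly, $\mu$ and $\tilde{\mu}$ coincide and the condition reduces to $Z = \sum_\tau \exp(-c_\theta(\tau))$, the true partition function.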
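15. Derivative Discriminator
Up to an additive constant,
$$L_{discriminator}(D_\theta) = \log Z + E_{\tau\sim p}[c_\theta(\tau)] + E_{\tau\sim p}[\log \tilde{\mu}(\tau)] - E_{\tau\sim q}[\log q(\tau)] + E_{\tau\sim q}[\log \tilde{\mu}(\tau)]$$
Differentiate $L_{discriminator}(D_\theta)$ with respect to $\theta$.
(Derivative of the discriminator loss with respect to $\theta$)
$$\partial_\theta L_{discriminator}(D_\theta) = E_{\tau\sim p}[\partial_\theta c_\theta(\tau)] - E_{\tau\sim\mu}\left[\frac{\frac{1}{Z}\exp(-c_\theta(\tau))\,\partial_\theta c_\theta(\tau)}{\tilde{\mu}(\tau)}\right]$$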
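16. Derivative IRL cost
From (7), writing the mixture density in the denominator as $\tilde{\mu}(\tau)$:
$$L_{cost}(\theta) = E_{\tau\sim p}[c_\theta(\tau)] + \log E_{\tau\sim\mu}\left[\frac{\exp(-c_\theta(\tau))}{\tilde{\mu}(\tau)}\right]$$
Differentiate the IRL loss $L_{cost}(\theta)$ with respect to $\theta$, treating $\tilde{\mu}$ as a fixed sampling density, and substitute the estimate of $Z$ from (17), $Z = E_{\tau\sim\mu}\left[\frac{\exp(-c_\theta(\tau))}{\tilde{\mu}(\tau)}\right]$.
(Derivative of the cost with respect to $\theta$)
\begin{align}
\partial_\theta L_{cost}(\theta) &= E_{\tau\sim p}[\partial_\theta c_\theta(\tau)] + \partial_\theta \log E_{\tau\sim\mu}\left[\frac{\exp(-c_\theta(\tau))}{\tilde{\mu}(\tau)}\right] \tag{19} \\
&= E_{\tau\sim p}[\partial_\theta c_\theta(\tau)] - E_{\tau\sim\mu}\left[\frac{\exp(-c_\theta(\tau))\,\partial_\theta c_\theta(\tau)}{\tilde{\mu}(\tau)}\right] \Big/ E_{\tau\sim\mu}\left[\frac{\exp(-c_\theta(\tau))}{\tilde{\mu}(\tau)}\right] \tag{20} \\
&= E_{\tau\sim p}[\partial_\theta c_\theta(\tau)] - E_{\tau\sim\mu}\left[\frac{\exp(-c_\theta(\tau))\,\partial_\theta c_\theta(\tau)}{\tilde{\mu}(\tau)}\right] \Big/ Z \tag{21}
\end{align}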
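17. Conclusion: IRL cost and GAN discriminator
Derivative of the IRL cost = derivative of the GAN discriminator loss
$$\partial_\theta L_{discriminator}(D_\theta) = E_{\tau\sim p}[\partial_\theta c_\theta(\tau)] - E_{\tau\sim\mu}\left[\frac{\frac{1}{Z}\exp(-c_\theta(\tau))\,\partial_\theta c_\theta(\tau)}{\tilde{\mu}(\tau)}\right]$$
\begin{align}
\partial_\theta L_{cost}(\theta) &= E_{\tau\sim p}[\partial_\theta c_\theta(\tau)] - E_{\tau\sim\mu}\left[\frac{\exp(-c_\theta(\tau))\,\partial_\theta c_\theta(\tau)}{\tilde{\mu}(\tau)}\right] \Big/ Z \\
&= E_{\tau\sim p}[\partial_\theta c_\theta(\tau)] - E_{\tau\sim\mu}\left[\frac{\frac{1}{Z}\exp(-c_\theta(\tau))\,\partial_\theta c_\theta(\tau)}{\tilde{\mu}(\tau)}\right] \\
&= \partial_\theta L_{discriminator}(D_\theta)
\end{align}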
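The equality of the two gradients can be checked with finite differences. The sketch below is an illustration under stated assumptions: a scalar parameter `theta`, a linear cost $c_\theta(\tau) = \theta f(\tau)$ with made-up features `features`, made-up distributions `p` and `q`, and $Z$ set to the solution of $Z = E_{\tau\sim\mu}[\exp(-c_\theta(\tau))/\tilde{\mu}(\tau)]$. In the discriminator loss $Z$ is held fixed while $\tilde{\mu}$ varies with $\theta$; in the cost loss $\tilde{\mu}$ is frozen at its current value.

```python
import math

features = [0.3, 1.0, 1.8, 2.4]   # hypothetical f(tau); cost is theta * f(tau)
p = [0.5, 0.3, 0.15, 0.05]        # hypothetical demonstration distribution
q = [0.4, 0.3, 0.2, 0.1]          # hypothetical sampler distribution
mu = [0.5 * pi + 0.5 * qi for pi, qi in zip(p, q)]

def exp_neg_cost(theta):
    return [math.exp(-theta * f) for f in features]

def mu_tilde(theta, Z):
    return [e / (2.0 * Z) + qi / 2.0 for e, qi in zip(exp_neg_cost(theta), q)]

def solve_Z(theta, lo=1e-6, hi=1e6, iters=200):
    """Solve Z = E_mu[exp(-c)/mu_tilde] by bisection on a log scale."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        est = sum(m * e / t for m, e, t in
                  zip(mu, exp_neg_cost(theta), mu_tilde(theta, mid)))
        if est > mid:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

def disc_loss(theta, Z):
    """Discriminator loss up to a constant; E_p + E_q of log mu_tilde = 2 E_mu."""
    mt = mu_tilde(theta, Z)
    return (math.log(Z)
            + sum(pi * theta * f for pi, f in zip(p, features))
            + sum(2.0 * m * math.log(t) for m, t in zip(mu, mt))
            - sum(qi * math.log(qi) for qi in q))

def cost_loss(theta, frozen_mu_tilde):
    """IRL cost loss with the importance-sampling denominator held fixed."""
    ratio = sum(m * e / t for m, e, t in
                zip(mu, exp_neg_cost(theta), frozen_mu_tilde))
    return sum(pi * theta * f for pi, f in zip(p, features)) + math.log(ratio)

def central_diff(fn, x, h=1e-6):
    return (fn(x + h) - fn(x - h)) / (2.0 * h)

theta0 = 0.8
Z0 = solve_Z(theta0)
frozen = mu_tilde(theta0, Z0)
grad_disc = central_diff(lambda t: disc_loss(t, Z0), theta0)
grad_cost = central_diff(lambda t: cost_loss(t, frozen), theta0)

# Closed form: E_p[f] - E_mu[(1/Z) exp(-c) f / mu_tilde].
grad_closed = (sum(pi * f for pi, f in zip(p, features))
               - sum(m * (e / Z0) * f / t for m, e, f, t in
                     zip(mu, exp_neg_cost(theta0), features, frozen)))
```

The agreement depends on $Z$ satisfying the stationarity condition; with an arbitrary $Z$ the two gradients differ by the factor $E_{\tau\sim\mu}[\exp(-c_\theta(\tau))/\tilde{\mu}(\tau)]/Z$ on the second term.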
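18. Conclusion: IRL sampler and GAN generator
IRL sampler (dropping the constant $\log Z$):
$$L_{sampler}(q) = E_{\tau\sim q}[c_\theta(\tau)] + E_{\tau\sim q}[\log q(\tau)]$$
GAN generator = IRL sampler + constant
\begin{align}
L_{generator}(q) &= E_{\tau\sim q}[\log(1 - D(\tau)) - \log D(\tau)] \\
&= E_{\tau\sim q}\left[\log \frac{q(\tau)}{\tilde{\mu}(\tau)} - \log \frac{\frac{1}{Z}\exp(-c_\theta(\tau))}{\tilde{\mu}(\tau)}\right] \\
&= E_{\tau\sim q}[\log q(\tau) + \log Z + c_\theta(\tau)] \\
&= \log Z + E_{\tau\sim q}[c_\theta(\tau)] + E_{\tau\sim q}[\log q(\tau)] \\
&= \log Z + L_{sampler}(q)
\end{align}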
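The identity $L_{generator}(q) = \log Z + L_{sampler}(q)$ can be confirmed numerically. The sketch below assumes the discriminator form $D = a/(a+q)$ with $a = \frac{1}{Z}\exp(-c_\theta(\tau))$, made-up costs, a made-up sampler `q`, and an arbitrary positive `Z`; the identity does not require $Z$ to be the true partition function.

```python
import math

costs = [0.2, 0.7, 1.1, 2.5]   # hypothetical c_theta(tau)
q = [0.4, 0.3, 0.2, 0.1]       # hypothetical sampler / generator distribution
Z = 1.7                        # arbitrary positive estimate of Z

def discriminator(cost, q_density, Z):
    """D_theta(tau) = a / (a + q(tau)) with a = exp(-c_theta(tau)) / Z."""
    a = math.exp(-cost) / Z
    return a / (a + q_density)

def generator_loss(q, costs, Z):
    """L_generator = E_q[log(1 - D) - log D]."""
    total = 0.0
    for qi, c in zip(q, costs):
        d = discriminator(c, qi, Z)
        total += qi * (math.log(1.0 - d) - math.log(d))
    return total

def sampler_loss(q, costs):
    """L_sampler = E_q[c_theta] + E_q[log q]."""
    return sum(qi * (c + math.log(qi)) for qi, c in zip(q, costs))

lhs = generator_loss(q, costs, Z)
rhs = math.log(Z) + sampler_loss(q, costs)
```

Because the two losses differ only by a term that is constant in $q$, they have the same minimizer, so updating the generator and updating the sampler are the same optimization.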
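19. Conclusion
Both IRL and GAN train two networks.
IRL: $L_{cost}$ trains $p_\theta$ and $L_{sampler}$ trains $q$; the learned reward is $-c_\theta(\tau)$.
GAN: $L_{generator}$ and $L_{discriminator}$ are trained toward $q(\tau) = p(\tau)$.
Correspondence between IRL and GAN:
IRL cost and GAN discriminator: $\partial_\theta L_{cost} = \partial_\theta L_{discriminator}$
IRL sampler and GAN generator: $L_{sampler}(q) + \log Z = L_{generator}(q)$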