SlideShare uma empresa Scribd logo
1 de 52
Baixar para ler offline
Qo
(s,a) = r(s,a)+γ max
a'
Qo
(s',a')
Qo
L = (r(s,a)+γ max
a'
Qθ
o
(s',a')−Qθ
o
(s,a))2
∇θ J = ∇θ Eπθ
[ γ τ
Rτ ]
τ =0
∞
∑
= ∇θ P( ′s | st ,a)πθ (a | st ) γ τ
Rτ
τ =0
∞
∑
a
∑
′s
∑
= P( ′s | st ,a)∇θπθ (a | st ) γ τ
Rτ
τ =0
∞
∑
a
∑
′s
∑
= P( ′s | st ,a)πθ (a | st )
∇θπθ (a | st )
πθ (a | st )
γ τ
Rτ
τ =0
∞
∑
a
∑
′s
∑
= P( ′s | st ,a)πθ (a | st )∇θ log(πθ (a | st )) γ τ
Rτ
τ =0
∞
∑
a
∑
′s
∑
= Eπθ
[∇θ log(πθ (a | st )) γ τ
Rτ ]
τ =0
∞
∑
Eπθ
[∇θ log(πθ (a | st )) γ τ
Rτ ]
τ =0
∞
∑
=
1
M
∇θ log(πθ (ai
T
| si
T
))( γ τ
Rτ
T
)
τ =0
∞
∑
i
∑
T
∑
T = s0
T
,a0
T
,r0
T
,!sn
T
,an
T
,rn
T
1
M
∇θ log(πθ (ai
T
| si
T
))( γ τ
Rτ
T
τ =0
∞
∑
i
∑
T
∑ )
1
M
∇θ log(πθ (ai
T
| si
T
))( γ τ
Rτ
T
τ =0
∞
∑
i
∑
T
∑ )
1
M
∇θ log(πθ (ai
T
| si
T
))
i
∑
T
∑ A(si
T
,ai
T
)
Qaux
(a,i, j)
LQ = E[(Rt:t+n +γ n
max
a'
Q(s',a';θ−
)−Q(s,a;θ))2
]
LVR = Eπ [(Rt:t+n +γ n
V(st+n+1,θ−
)−V(st ,θ))2
]
Ep[ f (x)] = p(x) f (x)x∑
Eq[ f (x)] = q(x) f (x)x∑
=
q(x)
p(x)
p(x) f (x)x∑
= p(x)
q(x)
p(x)
f (x)x∑
= Ep[
q(x)
p(x)
f (x)]
LA3C = Lπ + LV − Es∼π [αH(π(⋅| s))]
!Qπ
(s,a) = α(log(π(s,a)+ Hπ
(s))+Vπ
(s)
Q∗
(s,a) = r(s,a)+γτ log exp(Q∗
(s',a') /τ )a'∑
Q∗
V∗
(s) = −τ logπ∗
(a | s)+ r(s,a)+γV∗
(s')
−V∗
(s1)+γ t−1
V∗
(st )+ R(s1:t )−τG(s1:t ,π∗
) = 0
R(sm:n ) = γ i
r(sm+i ,am+i )
i=0
n−m−1
∑ G(sm:n,π) = γ i
logπ(am+i | sm+i )
i=0
n−m−1
∑
Cθ,φ (s1:t ) = −Vφ (s1)+γ t−1
Vφ (st )+ R(s1:t )−τG(s1:t ,πθ )
Δθ ∝Cθ,φ (s1:t )∇θG(s1:t ,πθ )
Δφ ∝Cθ,φ (s1:t )(∇φVφ (s1)− ∇φγ t−1
Vφ (st ))
Recent rl
Recent rl
Recent rl
Recent rl
Recent rl
Recent rl
Recent rl

Mais conteúdo relacionado

Semelhante a Recent rl

強化学習勉強会6の資料
強化学習勉強会6の資料強化学習勉強会6の資料
強化学習勉強会6の資料Yuji Okamoto
 
Control as Inference (強化学習とベイズ統計)
Control as Inference (強化学習とベイズ統計)Control as Inference (強化学習とベイズ統計)
Control as Inference (強化学習とベイズ統計)Shohei Taniguchi
 
ゲーム理論NEXT 戦略形協力ゲーム第11回 -寡占市場ゲームにおける結託耐性ナッシュ均衡-
ゲーム理論NEXT 戦略形協力ゲーム第11回 -寡占市場ゲームにおける結託耐性ナッシュ均衡-ゲーム理論NEXT 戦略形協力ゲーム第11回 -寡占市場ゲームにおける結託耐性ナッシュ均衡-
ゲーム理論NEXT 戦略形協力ゲーム第11回 -寡占市場ゲームにおける結託耐性ナッシュ均衡-ssusere0a682
 
ゲーム理論BASIC 演習51 -完全ベイジアン均衡-
ゲーム理論BASIC 演習51 -完全ベイジアン均衡-ゲーム理論BASIC 演習51 -完全ベイジアン均衡-
ゲーム理論BASIC 演習51 -完全ベイジアン均衡-ssusere0a682
 
Ejercicios prueba de algebra de la UTN- widmar aguilar
Ejercicios prueba de algebra de la UTN-  widmar aguilarEjercicios prueba de algebra de la UTN-  widmar aguilar
Ejercicios prueba de algebra de la UTN- widmar aguilarWidmar Aguilar Gonzalez
 
Functional Gradient Boosting based on Residual Network Perception
Functional Gradient Boosting based on Residual Network PerceptionFunctional Gradient Boosting based on Residual Network Perception
Functional Gradient Boosting based on Residual Network PerceptionAtsushi Nitanda
 
ゲーム理論BASIC 演習37 -3人ゲームの混合戦略ナッシュ均衡を求める-
ゲーム理論BASIC 演習37 -3人ゲームの混合戦略ナッシュ均衡を求める-ゲーム理論BASIC 演習37 -3人ゲームの混合戦略ナッシュ均衡を求める-
ゲーム理論BASIC 演習37 -3人ゲームの混合戦略ナッシュ均衡を求める-ssusere0a682
 
確率的推論と行動選択
確率的推論と行動選択確率的推論と行動選択
確率的推論と行動選択Masahiro Suzuki
 
関西NIPS+読み会発表スライド
関西NIPS+読み会発表スライド関西NIPS+読み会発表スライド
関西NIPS+読み会発表スライドYuchi Matsuoka
 
6 28 18_hack_hunterdon_meetup_deep_rl
6 28 18_hack_hunterdon_meetup_deep_rl6 28 18_hack_hunterdon_meetup_deep_rl
6 28 18_hack_hunterdon_meetup_deep_rlSean Devlin
 
El text.life science6.matsubayashi191120
El text.life science6.matsubayashi191120El text.life science6.matsubayashi191120
El text.life science6.matsubayashi191120RCCSRENKEI
 
Re:ゲーム理論入門 - ナッシュ均衡の存在証明
Re:ゲーム理論入門 - ナッシュ均衡の存在証明Re:ゲーム理論入門 - ナッシュ均衡の存在証明
Re:ゲーム理論入門 - ナッシュ均衡の存在証明ssusere0a682
 
ゲーム理論NEXT 期待効用理論第10/11回 -期待効用定理の証明4/5
ゲーム理論NEXT 期待効用理論第10/11回 -期待効用定理の証明4/5ゲーム理論NEXT 期待効用理論第10/11回 -期待効用定理の証明4/5
ゲーム理論NEXT 期待効用理論第10/11回 -期待効用定理の証明4/5ssusere0a682
 
Ejercicios varios de algebra widmar aguilar
Ejercicios varios de  algebra   widmar aguilarEjercicios varios de  algebra   widmar aguilar
Ejercicios varios de algebra widmar aguilarWidmar Aguilar Gonzalez
 
Existence of positive solutions for fractional q-difference equations involvi...
Existence of positive solutions for fractional q-difference equations involvi...Existence of positive solutions for fractional q-difference equations involvi...
Existence of positive solutions for fractional q-difference equations involvi...IJRTEMJOURNAL
 
3人ゲームの混合戦略ナッシュ均衡を求める ゲーム理論 BASIC 演習1の補足
3人ゲームの混合戦略ナッシュ均衡を求める ゲーム理論 BASIC 演習1の補足3人ゲームの混合戦略ナッシュ均衡を求める ゲーム理論 BASIC 演習1の補足
3人ゲームの混合戦略ナッシュ均衡を求める ゲーム理論 BASIC 演習1の補足ssusere0a682
 
ゲーム理論BASIC 第42回 -仁に関する定理の証明3-
ゲーム理論BASIC 第42回 -仁に関する定理の証明3-ゲーム理論BASIC 第42回 -仁に関する定理の証明3-
ゲーム理論BASIC 第42回 -仁に関する定理の証明3-ssusere0a682
 

Semelhante a Recent rl (20)

強化学習勉強会6の資料
強化学習勉強会6の資料強化学習勉強会6の資料
強化学習勉強会6の資料
 
Control as Inference (強化学習とベイズ統計)
Control as Inference (強化学習とベイズ統計)Control as Inference (強化学習とベイズ統計)
Control as Inference (強化学習とベイズ統計)
 
ゲーム理論NEXT 戦略形協力ゲーム第11回 -寡占市場ゲームにおける結託耐性ナッシュ均衡-
ゲーム理論NEXT 戦略形協力ゲーム第11回 -寡占市場ゲームにおける結託耐性ナッシュ均衡-ゲーム理論NEXT 戦略形協力ゲーム第11回 -寡占市場ゲームにおける結託耐性ナッシュ均衡-
ゲーム理論NEXT 戦略形協力ゲーム第11回 -寡占市場ゲームにおける結託耐性ナッシュ均衡-
 
ゲーム理論BASIC 演習51 -完全ベイジアン均衡-
ゲーム理論BASIC 演習51 -完全ベイジアン均衡-ゲーム理論BASIC 演習51 -完全ベイジアン均衡-
ゲーム理論BASIC 演習51 -完全ベイジアン均衡-
 
Ejercicios prueba de algebra de la UTN- widmar aguilar
Ejercicios prueba de algebra de la UTN-  widmar aguilarEjercicios prueba de algebra de la UTN-  widmar aguilar
Ejercicios prueba de algebra de la UTN- widmar aguilar
 
Functional Gradient Boosting based on Residual Network Perception
Functional Gradient Boosting based on Residual Network PerceptionFunctional Gradient Boosting based on Residual Network Perception
Functional Gradient Boosting based on Residual Network Perception
 
ゲーム理論BASIC 演習37 -3人ゲームの混合戦略ナッシュ均衡を求める-
ゲーム理論BASIC 演習37 -3人ゲームの混合戦略ナッシュ均衡を求める-ゲーム理論BASIC 演習37 -3人ゲームの混合戦略ナッシュ均衡を求める-
ゲーム理論BASIC 演習37 -3人ゲームの混合戦略ナッシュ均衡を求める-
 
確率的推論と行動選択
確率的推論と行動選択確率的推論と行動選択
確率的推論と行動選択
 
関西NIPS+読み会発表スライド
関西NIPS+読み会発表スライド関西NIPS+読み会発表スライド
関西NIPS+読み会発表スライド
 
6 28 18_hack_hunterdon_meetup_deep_rl
6 28 18_hack_hunterdon_meetup_deep_rl6 28 18_hack_hunterdon_meetup_deep_rl
6 28 18_hack_hunterdon_meetup_deep_rl
 
El text.life science6.matsubayashi191120
El text.life science6.matsubayashi191120El text.life science6.matsubayashi191120
El text.life science6.matsubayashi191120
 
Prelude to halide_public
Prelude to halide_publicPrelude to halide_public
Prelude to halide_public
 
Gan
GanGan
Gan
 
Re:ゲーム理論入門 - ナッシュ均衡の存在証明
Re:ゲーム理論入門 - ナッシュ均衡の存在証明Re:ゲーム理論入門 - ナッシュ均衡の存在証明
Re:ゲーム理論入門 - ナッシュ均衡の存在証明
 
ゲーム理論NEXT 期待効用理論第10/11回 -期待効用定理の証明4/5
ゲーム理論NEXT 期待効用理論第10/11回 -期待効用定理の証明4/5ゲーム理論NEXT 期待効用理論第10/11回 -期待効用定理の証明4/5
ゲーム理論NEXT 期待効用理論第10/11回 -期待効用定理の証明4/5
 
uuum_3q
uuum_3quuum_3q
uuum_3q
 
Ejercicios varios de algebra widmar aguilar
Ejercicios varios de  algebra   widmar aguilarEjercicios varios de  algebra   widmar aguilar
Ejercicios varios de algebra widmar aguilar
 
Existence of positive solutions for fractional q-difference equations involvi...
Existence of positive solutions for fractional q-difference equations involvi...Existence of positive solutions for fractional q-difference equations involvi...
Existence of positive solutions for fractional q-difference equations involvi...
 
3人ゲームの混合戦略ナッシュ均衡を求める ゲーム理論 BASIC 演習1の補足
3人ゲームの混合戦略ナッシュ均衡を求める ゲーム理論 BASIC 演習1の補足3人ゲームの混合戦略ナッシュ均衡を求める ゲーム理論 BASIC 演習1の補足
3人ゲームの混合戦略ナッシュ均衡を求める ゲーム理論 BASIC 演習1の補足
 
ゲーム理論BASIC 第42回 -仁に関する定理の証明3-
ゲーム理論BASIC 第42回 -仁に関する定理の証明3-ゲーム理論BASIC 第42回 -仁に関する定理の証明3-
ゲーム理論BASIC 第42回 -仁に関する定理の証明3-
 

Último

Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 

Último (20)

Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 

Recent rl

  • 1.
  • 2.
  • 3.
  • 4.
  • 5.
  • 6.
  • 7. Qo (s,a) = r(s,a)+γ max a' Qo (s',a') Qo L = (r(s,a)+γ max a' Qθ o (s',a')−Qθ o (s,a))2
  • 8.
  • 9.
  • 10. ∇θ J = ∇θ Eπθ [ γ τ Rτ ] τ =0 ∞ ∑ = ∇θ P( ′s | st ,a)πθ (a | st ) γ τ Rτ τ =0 ∞ ∑ a ∑ ′s ∑ = P( ′s | st ,a)∇θπθ (a | st ) γ τ Rτ τ =0 ∞ ∑ a ∑ ′s ∑ = P( ′s | st ,a)πθ (a | st ) ∇θπθ (a | st ) πθ (a | st ) γ τ Rτ τ =0 ∞ ∑ a ∑ ′s ∑ = P( ′s | st ,a)πθ (a | st )∇θ log(πθ (a | st )) γ τ Rτ τ =0 ∞ ∑ a ∑ ′s ∑ = Eπθ [∇θ log(πθ (a | st )) γ τ Rτ ] τ =0 ∞ ∑
  • 11. Eπθ [∇θ log(πθ (a | st )) γ τ Rτ ] τ =0 ∞ ∑ = 1 M ∇θ log(πθ (ai T | si T ))( γ τ Rτ T ) τ =0 ∞ ∑ i ∑ T ∑ T = s0 T ,a0 T ,r0 T ,!sn T ,an T ,rn T
  • 12. 1 M ∇θ log(πθ (ai T | si T ))( γ τ Rτ T τ =0 ∞ ∑ i ∑ T ∑ )
  • 13. 1 M ∇θ log(πθ (ai T | si T ))( γ τ Rτ T τ =0 ∞ ∑ i ∑ T ∑ ) 1 M ∇θ log(πθ (ai T | si T )) i ∑ T ∑ A(si T ,ai T )
  • 14.
  • 15.
  • 16.
  • 17.
  • 18.
  • 19.
  • 20.
  • 21. Qaux (a,i, j) LQ = E[(Rt:t+n +γ n max a' Q(s',a';θ− )−Q(s,a;θ))2 ]
  • 22. LVR = Eπ [(Rt:t+n +γ n V(st+n+1,θ− )−V(st ,θ))2 ]
  • 23.
  • 24.
  • 25.
  • 26.
  • 27.
  • 28.
  • 29. Ep[ f (x)] = p(x) f (x)x∑ Eq[ f (x)] = q(x) f (x)x∑ = q(x) p(x) p(x) f (x)x∑ = p(x) q(x) p(x) f (x)x∑ = Ep[ q(x) p(x) f (x)]
  • 30.
  • 31.
  • 32.
  • 33.
  • 34.
  • 35.
  • 36. LA3C = Lπ + LV − Es∼π [αH(π(⋅| s))]
  • 37. !Qπ (s,a) = α(log(π(s,a)+ Hπ (s))+Vπ (s)
  • 38.
  • 39.
  • 40.
  • 41.
  • 42.
  • 43. Q∗ (s,a) = r(s,a)+γτ log exp(Q∗ (s',a') /τ )a'∑ Q∗
  • 44. V∗ (s) = −τ logπ∗ (a | s)+ r(s,a)+γV∗ (s') −V∗ (s1)+γ t−1 V∗ (st )+ R(s1:t )−τG(s1:t ,π∗ ) = 0 R(sm:n ) = γ i r(sm+i ,am+i ) i=0 n−m−1 ∑ G(sm:n,π) = γ i logπ(am+i | sm+i ) i=0 n−m−1 ∑
  • 45. Cθ,φ (s1:t ) = −Vφ (s1)+γ t−1 Vφ (st )+ R(s1:t )−τG(s1:t ,πθ ) Δθ ∝Cθ,φ (s1:t )∇θG(s1:t ,πθ ) Δφ ∝Cθ,φ (s1:t )(∇φVφ (s1)− ∇φγ t−1 Vφ (st ))