Deep Reinforcement Learning with
Double Q-learning
Presenter: Takato Yamazaki
1
About the Paper
Title
Deep Reinforcement Learning with Double Q-learning
[arXiv:1509.06461]
Author
Hado van Hasselt, Arthur Guez, David Silver
Affiliation
Google DeepMind
Year
2015
2
Outline
How DDQN was Derived
DDQN
Experiment Environment
Results
Summary
Related Papers
3
How DDQN was Derived
Reinforcement Learning
Agent's Goal: Learn good policies for sequential decision problems
With policy π, the true value Q of an action a in state s is
Q_π(s, a) = E[ R_1 + γ R_2 + ... | S_0 = s, A_0 = a, π ]
Optimal value is then
Q_*(s, a) = max_π Q_π(s, a)
4
How DDQN was Derived
Q-learning (Watkins, 1989)
Q(s, a) = Q(s, a) + α ( R_{t+1} + γ max_{a′} Q(s′, a′) − Q(s, a) )
where α is the learning rate.
The current Q value moves closer to (reward + discounted next Q value).
5
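The update above can be sketched as one tabular Q-learning step in Python (a toy illustration; the table sizes, α, and the sample transition are placeholders, not from the paper):

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.99):
    """One tabular Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Toy example: 2 states, 2 actions, all values start at zero.
Q = np.zeros((2, 2))
Q = q_update(Q, s=0, a=1, r=1.0, s_next=1)
print(Q[0, 1])  # 0.5 * (1.0 + 0.99 * 0 - 0) = 0.5
```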
How DDQN was Derived
Deep Q-learning (Mnih et al., 2015)
What if there are infinite states...
Q-learning can be viewed as a minimization problem.
A neural network can be used to minimize the error!
Y_t^DQN = R_{t+1} + γ max_{a′} Q(s′, a′; θ_t^-)

min_{θ_t} L(θ_t) = min_{θ_t} E[ ( R_{t+1} + γ max_{a′} Q(s′, a′; θ_t^-) − Q(s, a; θ_t) )^2 ]
6
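The DQN objective on this slide can be sketched numerically. Below, small lookup tables stand in for the online and target networks (an assumption for illustration; the paper uses a convolutional network):

```python
import numpy as np

def dqn_loss(theta, theta_target, batch, gamma=0.99):
    """Mean squared TD error of the DQN objective.

    theta / theta_target: (n_states, n_actions) tables standing in for the
    online network and the frozen target network (theta^-).
    batch: list of (s, a, r, s_next) transitions.
    """
    errors = []
    for s, a, r, s_next in batch:
        y = r + gamma * np.max(theta_target[s_next])  # target uses theta^-
        errors.append((y - theta[s, a]) ** 2)
    return float(np.mean(errors))

theta = np.zeros((3, 2))
theta_target = np.zeros((3, 2))
batch = [(0, 0, 1.0, 1), (1, 1, 0.0, 2)]
print(dqn_loss(theta, theta_target, batch))  # (1.0**2 + 0.0**2) / 2 = 0.5
```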
How DDQN was Derived
Deep Q-learning (Mnih et al., 2015) (Continued)
Experience replay
Store observed transitions to memory bank
Sample from memory bank randomly and train network
Target network
Copy online network θ_t to target network θ_t^- every τ steps
7
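The two DQN tricks above can be sketched in a few lines (a minimal sketch; the capacity, τ, and dict-based parameters are illustrative assumptions, not the paper's values or data structures):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size memory bank of observed transitions (experience replay)."""
    def __init__(self, capacity=10000):
        self.memory = deque(maxlen=capacity)  # oldest transitions are evicted

    def store(self, transition):
        self.memory.append(transition)

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)  # uniform random minibatch

def maybe_sync_target(step, tau, online_params, target_params):
    """Copy the online parameters into the target network every tau steps."""
    if step % tau == 0:
        target_params.update(online_params)
    return target_params
```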
How DDQN was Derived
Double Q-learning (van Hasselt, 2010)
Q-learning often OVERESTIMATES the Q values because...
it uses the maximum action value every time to update Q values
it uses the same values to select and to evaluate an action
Double Q-learning helps avoid overestimation!
Split the weights θ into a selector and an evaluator
8
Double Q-learning (van Hasselt, 2010) (continued)
9
Double Q-learning (van Hasselt, 2010) (continued)
Q-learning target
Y_t^Q = R_{t+1} + γ max_{a′} Q(s′, a′; θ_t)
Transform to
Y_t^Q = R_{t+1} + γ Q(s′, argmax_a Q(s′, a; θ_t); θ_t)
Use a different parameter θ′_t for evaluating the Q-value
Y_t^DoubleQ = R_{t+1} + γ Q(s′, argmax_a Q(s′, a; θ_t); θ′_t)
10
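The tabular Double Q-learning idea can be sketched with two tables that take turns as selector and evaluator (a toy illustration; α and the table sizes are assumptions, not from van Hasselt's experiments):

```python
import numpy as np

def double_q_update(QA, QB, s, a, r, s_next, alpha=0.5, gamma=0.99, rng=None):
    """One tabular Double Q-learning step: one table selects the greedy next
    action, the other evaluates it, so selection and evaluation are decoupled."""
    rng = rng or np.random.default_rng()
    if rng.random() < 0.5:
        a_star = int(np.argmax(QA[s_next]))  # QA selects...
        QA[s, a] += alpha * (r + gamma * QB[s_next, a_star] - QA[s, a])  # ...QB evaluates
    else:
        a_star = int(np.argmax(QB[s_next]))  # QB selects...
        QB[s, a] += alpha * (r + gamma * QA[s_next, a_star] - QB[s, a])  # ...QA evaluates
    return QA, QB
```

Whichever branch fires, the maximizing action's value is read from the *other* table, which removes the max-operator's upward bias in expectation.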
Double Q-learning (van Hasselt, 2010) (continued)
11
DDQN
Double Deep Q-learning (DDQN)
Combination of DQN and Double Q-learning!!!
Use neural networks as the selector and the evaluator.
Easy implementation because...
DQN uses target network feature
Online network θ_t = Selector
Target network θ_t^- = Evaluator
12
Double Deep Q-learning (DDQN) (continued)
Double Q-learning's target was described as
Y_t^DoubleQ = R_{t+1} + γ Q(s′, argmax_a Q(s′, a; θ_t); θ′_t)
Transform for DDQN
Y_t^DoubleDQN = R_{t+1} + γ Q(s′, argmax_a Q(s′, a; θ_t); θ_t^-)
where θ_t is the online network and θ_t^- is the target network
13
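The DQN and DDQN targets differ in one line, which is worth seeing side by side (a minimal sketch; the Q-value arrays are made-up numbers chosen so the two networks disagree):

```python
import numpy as np

def dqn_target(r, q_next_online, q_next_target, gamma=0.99):
    """DQN: the target network both selects and evaluates the next action."""
    return r + gamma * np.max(q_next_target)

def ddqn_target(r, q_next_online, q_next_target, gamma=0.99):
    """DDQN: the online network selects the action, the target network evaluates it."""
    a_star = int(np.argmax(q_next_online))
    return r + gamma * q_next_target[a_star]

# When the networks disagree, DDQN's target can be lower (less overestimation).
q_online = np.array([0.5, 1.0])   # online net prefers action 1
q_target = np.array([2.0, 0.3])   # target net overestimates action 0
print(dqn_target(1.0, q_online, q_target))   # 1 + 0.99 * 2.0 = 2.98
print(ddqn_target(1.0, q_online, q_target))  # 1 + 0.99 * 0.3 = 1.297
```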
Experiment Environment
Atari 2600 Games, using the Arcade Learning Environment (ALE)
14
Experiment Environment
Network
Optimizer: RMSProp
15
Experiment Environment
Parameters (DQN, DDQN)
Discount value: γ = 0.99
Learning rate: α = 0.00025
Target network update: every 10000 steps
Exploration: epsilon-greedy method
Epsilon: ε = max(1 − t / 1,000,000, 0.1)
Steps: 50,000,000 steps
16
Experiment Environment
Parameters (Tuned for DDQN)
Discount value: γ = 0.99
Learning rate: α = 0.00025
Target network update: every 30000 steps
Exploration: epsilon-greedy method
Epsilon: ε = max(1 − t / 1,000,000, 0.01)
Steps: 50,000,000 steps
17
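The linearly annealed epsilon schedule used in both parameter sets is a one-liner; only the floor changes between DQN/DDQN (0.1) and the tuned DDQN (0.01):

```python
def epsilon(t, floor=0.1, decay_steps=1_000_000):
    """Linearly annealed exploration rate: eps = max(1 - t / decay_steps, floor)."""
    return max(1.0 - t / decay_steps, floor)

print(epsilon(0))          # 1.0  (fully random at the start)
print(epsilon(500_000))    # 0.5  (halfway through the anneal)
print(epsilon(2_000_000))  # 0.1  (clamped at the floor)
```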
Results
DDQN is better than DQN
Value estimates: (1/T) Σ_{t=1}^{T} max_a Q(S_t, a; θ)
18
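The value-estimate statistic (average greedy value over visited states) can be sketched as follows; the sample Q-values are made-up numbers for illustration:

```python
import numpy as np

def value_estimate(q_values):
    """Average over T visited states of the greedy value max_a Q(S_t, a; theta)."""
    return float(np.mean(np.max(q_values, axis=1)))

# q_values[t] holds Q(S_t, a; theta) for each action a over a run of T = 3 states.
q = np.array([[0.2, 1.0],
              [0.4, 0.1],
              [0.6, 0.6]])
print(value_estimate(q))  # (1.0 + 0.4 + 0.6) / 3 ≈ 0.667
```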
Results
More results
19
Results
More results (100 games each)
20
Results
More results
21
Summary
DDQN > DQN in most environments.
Less overestimation of values.
Implementation is easy!
Go DDQN!!
22
Related Papers
Elhadji Amadou Oury Diallo et al.: "Learning Power of Coordination in Adversarial Multi-Agent with Distributed Double DQN".
Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver: "Continuous control with deep reinforcement learning", 2015; arXiv:1509.02971 (http://arxiv.org/abs/1509.02971).
Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot: "Dueling Network Architectures for Deep Reinforcement Learning", 2015; arXiv:1511.06581 (http://arxiv.org/abs/1511.06581).
23
