Demystifying Reinforcement Learning
Slides by JaeyeunYoon
IDS Lab.
What is Reinforcement Learning?
• Learning by trial-and-error, in real-time.
• Improves with experience
• Inspired by psychology
- Agent + Environment
- Agent selects actions to maximize a utility function.
IDS Lab.
When to use RL?
•Data in the form of trajectories.
•Need to make a sequence of (related) decisions.
•Observe (partial, noisy) feedback in response to the chosen actions.
•Tasks that require both learning and planning.
IDS Lab.
Supervised Learning vs. RL
IDS Lab.
Markov Decision Process(MDP)
•Defined by:
S = {s_1, s_2, …, s_n}: the set of states (can be infinite / continuous)
A = {a_1, a_2, …, a_m}: the set of actions (can be infinite / continuous)
T(s, a, s′) = Pr(s′ | s, a): the transition dynamics (can be infinite / continuous)
R(s, a): reward function
μ(s): initial state distribution
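As a purely illustrative sketch (not from the slides), such an MDP tuple could be encoded in Python as plain dictionaries; the two states, two actions, and all numbers below are invented for the example.

```python
# Hypothetical two-state, two-action MDP encoded as plain Python dicts.
states = ["s1", "s2"]
actions = ["stay", "move"]

# T[(s, a)] maps next state s' -> Pr(s' | s, a)
T = {
    ("s1", "stay"): {"s1": 0.9, "s2": 0.1},
    ("s1", "move"): {"s1": 0.2, "s2": 0.8},
    ("s2", "stay"): {"s2": 1.0},
    ("s2", "move"): {"s1": 0.5, "s2": 0.5},
}

# R[(s, a)] is the expected immediate reward for taking action a in state s
R = {
    ("s1", "stay"): 0.0,
    ("s1", "move"): 1.0,
    ("s2", "stay"): 2.0,
    ("s2", "move"): 0.0,
}

# mu[s] is the initial state distribution
mu = {"s1": 1.0, "s2": 0.0}
```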
IDS Lab.
The Markov Property
•The distribution over future states depends only on the
present state and action, not on any other previous event.
Pr(s_{t+1} | s_0, …, s_t, a_0, …, a_t) = Pr(s_{t+1} | s_t, a_t)
IDS Lab.
The goal of RL? Maximize return!
•The return, U_t, of a trajectory is the sum of rewards starting from step t.
•Episodic task: consider the return over a finite horizon (e.g. games, maze).
→ U_t = r_t + r_{t+1} + r_{t+2} + ⋯ + r_T
•Continuing task: consider the return over an infinite horizon (e.g. juggling, balancing).
→ U_t = r_t + γ r_{t+1} + γ² r_{t+2} + ⋯ = Σ_{k=0…∞} γ^k r_{t+k}
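To make the discounted sum concrete, here is a small illustrative helper (not part of the original slides) that computes the return of a finite list of rewards; the reward values in the example are made up.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute U_t = sum_k gamma^k * r_{t+k} for a finite list of rewards."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total

# Example with made-up rewards: immediate rewards count more than later ones.
print(discounted_return([1.0, 0.0, 0.0, 5.0], gamma=0.9))  # 1.0 + 0.9**3 * 5.0 = 4.645
```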
IDS Lab.
The discount factor, γ
•Discount factor, γ ∈ [0, 1] (usually close to 1).
•It values immediate reward above delayed reward.
- γ close to 0 leads to "myopic" evaluation
- γ close to 1 leads to "far-sighted" evaluation
•Intuition:
- Receiving $80 today is worth the same as $100 tomorrow, assuming a discount factor of γ = 0.8.
- At each time step, there is a (1 − γ) chance that the agent dies and receives no rewards afterwards.
IDS Lab.
Major Components of an RL Agent
•An RL agent may include one or more of these components:
- Policy: agent's behavior function
- Value function: how good is each state and/or action
- Model: agent's representation of the environment
IDS Lab.
Defining behavior: The policy
•The policy, π, defines the action-selection strategy at every state:
π(s, a) = P(a_t = a | s_t = s)   (stochastic policy)
or π : S → A   (deterministic policy)
Goal: find the policy that maximizes the expected total reward.
(But there are many policies!)
argmax_π E_π[r_0 + r_1 + ⋯ + r_T | s_0]
???
IDS Lab.
Example: Career Options
IDS Lab.
Example: Career Options
IDS Lab.
Value functions
•The expected return of a policy (for every state) is called the value function:
V^π(s) = E_π[r_t + r_{t+1} + ⋯ + r_T | s_t = s]
* A simple strategy to find the best policy:
1. Enumerate the space of all possible policies.
2. Estimate the expected return of each one.
3. Keep the policy that has the maximum expected return.
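Purely for illustration (not from the slides), here is a brute-force sketch of that strategy on a tiny, invented deterministic MDP: enumerate every deterministic policy, roll each one out for a fixed horizon, and keep the best.

```python
import itertools

# Tiny deterministic toy MDP, invented for illustration.
states = ["A", "B"]
actions = ["left", "right"]
next_state = {("A", "left"): "A", ("A", "right"): "B",
              ("B", "left"): "A", ("B", "right"): "B"}
reward = {("A", "left"): 0.0, ("A", "right"): 1.0,
          ("B", "left"): 0.0, ("B", "right"): 2.0}

def rollout_return(policy, start="A", horizon=10):
    """Sum of rewards over a fixed horizon when following `policy`."""
    s, total = start, 0.0
    for _ in range(horizon):
        a = policy[s]
        total += reward[(s, a)]
        s = next_state[(s, a)]
    return total

best_policy, best_return = None, float("-inf")
# Enumerate all |A|^|S| deterministic policies.
for choice in itertools.product(actions, repeat=len(states)):
    policy = dict(zip(states, choice))
    ret = rollout_return(policy)
    if ret > best_return:
        best_policy, best_return = policy, ret

print(best_policy, best_return)  # always 'right': return 1 + 2 * 9 = 19
```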
IDS Lab.
Getting confused with terminology?
•Reward: one-step numerical feedback.
•Return: Sum of rewards over the agent’s trajectory.
•Value: Expected sum of rewards over the agent’s trajectory.
•Utility: Numerical function representing preferences.
* In RL, we assume Utility = Return.
IDS Lab.
RL algorithm outline
IDS Lab.
RL algorithm outline
IDS Lab.
Q-learning: Model-Free RL
•In Q-learning we define a function Q(s, a) representing the maximum discounted future reward when we perform action a in state s, and continue optimally from that point on.
Q(s_t, a_t) = max R_{t+1}
• The way to think about Q(s, a) is that it is “the best possible score at the end of the game after performing action a in state s”. It is called the Q-function because it represents the “quality” of a certain action in a given state.
• Then we can choose the policy:
π(s) = argmax_a Q(s, a)
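As a small illustrative sketch only, the greedy policy can be read off a tabular Q estimate with a simple argmax; the Q-table values here are invented.

```python
import numpy as np

# Made-up Q-table: rows are states, columns are actions.
Q = np.array([[0.1, 0.5, 0.2],   # state 0
              [0.7, 0.3, 0.0]])  # state 1

def greedy_action(Q, s):
    """pi(s) = argmax_a Q(s, a)."""
    return int(np.argmax(Q[s]))

print(greedy_action(Q, 0))  # 1
print(greedy_action(Q, 1))  # 0
```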
IDS Lab.
Q-learning: Bellman equation
•How do we get that Q-function then? Let’s focus on just one
transition <s, a, r, s’>. Just like with discounted future rewards in
the previous section, we can express the Q-value of state s and
action a in terms of the Q-value of the next state s’.
Q(s, a) = r + γ max_{a′} Q(s′, a′)   (Bellman equation)
• The main idea in Q-learning
- we can iteratively approximate the Q-function using the Bellman equation.
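A minimal sketch of that iterative approximation for the tabular case is shown below. This is illustrative code, not the slides' own implementation; as the editor's notes mention, the learning rate α controls how much of the newly proposed Q-value replaces the old one, and with α = 1 the update reduces exactly to the Bellman equation.

```python
def q_learning_update(Q, s, a, r, s_next, done, actions, alpha=0.1, gamma=0.99):
    """One Q-learning step on a tabular Q, stored as a dict keyed by (state, action)."""
    # Best Q-value achievable from the next state; 0 if the episode ended.
    best_next = 0.0 if done else max(Q.get((s_next, a2), 0.0) for a2 in actions)
    target = r + gamma * best_next                 # Bellman target: r + gamma * max_a' Q(s', a')
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (target - old)       # with alpha = 1 this is exactly the Bellman update
    return Q

# Example with invented values: one observed transition <s, a, r, s'>.
Q = {}
Q = q_learning_update(Q, s="s1", a="move", r=1.0, s_next="s2", done=False,
                      actions=["stay", "move"])
print(Q)  # {('s1', 'move'): 0.1}, since all other Q-values start at 0
```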
IDS Lab.
Q-learning: Atari Breakout
• For example, consider the ‘Breakout’ game screens as in the DeepMind paper:
-> take the four last screen images, resize them to 84×84 and convert to grayscale with 256 gray levels
-> we would have 256^(84×84×4) ≈ 10^67970 possible game states.
This means 10^67970 rows in our imaginary Q-table
-> more than the number of atoms in the known universe!
Atari Breakout game. Image credit: DeepMind.
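A minimal sketch of that preprocessing step, assuming OpenCV and NumPy are available (illustrative code, not from the slides); the 210×160 RGB frame shape is the usual Atari screen size and is assumed here.

```python
import cv2
import numpy as np

def preprocess(frame):
    """Convert one RGB game screen to an 84x84 grayscale image (256 gray levels)."""
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    return cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)

def make_state(last_four_frames):
    """Stack the four most recent preprocessed screens into one 4x84x84 state."""
    return np.stack([preprocess(f) for f in last_four_frames], axis=0)

# Example with random dummy frames standing in for real Atari screens.
frames = [np.random.randint(0, 256, (210, 160, 3), dtype=np.uint8) for _ in range(4)]
print(make_state(frames).shape)  # (4, 84, 84)
```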
IDS Lab.
Deep Q Network: Atari Breakout
•The Q-function can be approximated using a neural network
model.
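For illustration, a minimal PyTorch sketch of such a convolutional Q-network, roughly following the layer sizes reported for the DeepMind DQN; treat this as an assumption-laden sketch rather than the slides' implementation.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Maps a stack of 4 grayscale 84x84 frames to one Q-value per action."""
    def __init__(self, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # note: no pooling layers
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),                               # Q(s, a) for every action a
        )

    def forward(self, x):
        return self.net(x / 255.0)  # scale pixel values to [0, 1]

q_net = DQN(n_actions=4)
dummy_state = torch.zeros(1, 4, 84, 84)   # one stacked, preprocessed state
print(q_net(dummy_state).shape)           # torch.Size([1, 4])
```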
IDS Lab.
Deep Q Network: Atari Breakout
IDS Lab.
Deep Q Network: Atari Breakout
* No pooling layer? Why?
IDS Lab.
Deep Q Network: Atari Breakout
•Experience Replay
- During gameplay all the experiences <s, a, r, s’> are stored in a replay memory. When training the network, random minibatches from the replay memory are used instead of the most recent transition.
•Exploration-Exploitation
- ε-greedy exploration: with probability ε choose a random action, otherwise go with the “greedy” action with the highest Q-value. In their system, DeepMind actually decreases ε over time from 1 to 0.1 (see the sketch below).
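Both ideas can be sketched together in a few lines of illustrative Python (hypothetical helper code, not from the slides); `q_values(s)` stands in for whatever Q-network evaluates a state.

```python
import random
from collections import deque

import numpy as np

replay_memory = deque(maxlen=100_000)   # stores <s, a, r, s'> transitions

def store(s, a, r, s_next, done):
    replay_memory.append((s, a, r, s_next, done))

def sample_minibatch(batch_size=32):
    """Random minibatch from the replay memory instead of the latest transition."""
    return random.sample(replay_memory, batch_size)

def epsilon_greedy(q_values, s, epsilon, n_actions):
    """With probability epsilon act randomly, otherwise take the greedy action."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return int(np.argmax(q_values(s)))
```

DeepMind's annealing of ε from 1 to 0.1 would then just be a schedule on the `epsilon` argument over training steps.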
Editor's Notes
  1. Emerging technologies such as smartphones and GPS enable the effortless collection of trajectories and other tracking data. More generally, a time-series is a recording of a signal that changes over time. 
  2. Markov Assumption (the next state depends only on the immediately preceding state and action).
  3. The α (alpha) in the algorithm is a learning rate that controls how much of the difference between the previous Q-value and the newly proposed Q-value is taken into account. In particular, when α = 1, the two Q(s, a) terms cancel and the update is exactly the same as the Bellman equation.
  4. Draw the Q-table here: rows are states S, columns are actions A, entries are the outputs Q(S, A).
  5. Draw the Q-table here: rows are states S, columns are actions A, entries are the outputs Q(S, A).
  6. Draw the Q-table here: rows are states S, columns are actions A, entries are the outputs Q(S, A).
  7. But if you really think about it, pooling layers buy you translation invariance: the network becomes insensitive to the location of an object in the image. That makes perfect sense for a classification task like ImageNet, but for games the location of the ball is crucial in determining the potential reward, and we wouldn’t want to discard this information!
  8. Experience Replay: By now we have an idea of how to estimate the future reward in each state using Q-learning and how to approximate the Q-function using a convolutional neural network. But approximating Q-values with a non-linear function is not very stable, and there is a whole bag of tricks needed to make it converge; even then it takes a long time, almost a week on a single GPU. The most important trick is experience replay. During gameplay all the experiences <s, a, r, s'> are stored in a replay memory. When training the network, random samples from the replay memory are used instead of the most recent transition. This breaks the similarity of subsequent training examples, which might otherwise drive the network into a local minimum. Experience replay also makes the training task more like ordinary supervised learning, which simplifies debugging and testing the algorithm; one could even collect all the experiences from humans playing the game and train the network on those. Exploration-Exploitation: Q-learning tries to solve the credit assignment problem: it keeps propagating rewards back in time until it reaches the crucial decision point that actually caused the reward. But we have not yet dealt with the exploration-exploitation dilemma. A first observation is that when the Q-table or Q-network is initialized randomly, its predictions are initially random as well; if we pick the action with the highest Q-value, the action will be random and the agent performs crude "exploration". As the Q-function converges, it returns more consistent Q-values and the amount of exploration decreases. So one could say that Q-learning incorporates exploration as part of the algorithm, but this exploration is "greedy": it settles for the first effective strategy it finds. A simple and effective fix is ε-greedy exploration: with probability ε choose a random action, otherwise go with the "greedy" action that has the highest Q-value. In their system DeepMind actually decreases ε over time from 1 to 0.1; at the start the system moves completely at random to explore the state space maximally, and then it settles down to a fixed exploration rate.