Reinforcement Learning

AbdalmuGhith Alzbibi
Ahmad Ataya
Mhd Salem Kabbani
Nadir Pervez
Outline
■ Introduction
■ Elements of reinforcement learning
■ Q-learning
■ Deep Q-Network
■ Demo
■ References
Introduction
• Supervised learning: sample (input, output) pairs of the function to be learned are given or can be observed.
• Unsupervised learning: data-driven, with no labels (e.g. clustering).
• Reinforcement learning:
  • Close to how humans learn.
  • The algorithm learns a policy for how to act in a given environment.
  • Every action has some effect on the environment, and the environment provides rewards that guide the learning algorithm.
Supervised Learning vs Reinforcement Learning
Supervised Learning
Step: 1
Teacher: Does picture 1 show a car or a flower?
Learner: A flower.
Teacher: No, it’s a car.
Step: 2
Teacher: Does picture 2 show a car or a flower?
Learner: A car.
Teacher: Yes, it’s a car.
Step: 3 ....
Reinforcement Learning
Step: 1
World: You are in state 9. Choose action A or C.
Learner: Action A.
World: Your reward is 100.
Step: 2
World: You are in state 32. Choose action B or E.
Learner: Action B.
World: Your reward is 50.
Step: 3 ....
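The dialogue above can be sketched as a loop in which the world reports a state, the learner picks an action, and the world answers with a reward. This is only an illustration: `ToyWorld`, its reward values, and `run_episode` are hypothetical names, not part of the slides.

```python
import random


class ToyWorld:
    """Hypothetical environment with states 0..4 and actions "A" or "B"."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = 0

    def step(self, action):
        # Assumption for illustration: the reward depends only on the action.
        reward = 100 if action == "A" else 50
        # The next state is drawn at random; the learner cannot control it here.
        self.state = self.rng.randrange(5)
        return self.state, reward


def run_episode(world, choose_action, n_steps=3):
    """Run the world/learner dialogue for n_steps and return the summed reward."""
    total = 0
    state = world.state
    for _ in range(n_steps):
        action = choose_action(state)        # Learner: "Action A."
        state, reward = world.step(action)   # World: "Your reward is 100."
        total += reward
    return total


total = run_episode(ToyWorld(), lambda state: "A")
print(total)
```

Unlike the supervised dialogue, the world never says which action would have been correct; the learner only sees the reward for the action it actually took.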
Introduction (Cont..)
• Meaning of reinforcement: the occurrence of an event, in the proper relation to a response, that tends to increase the probability that the response will occur again in the same situation.
• Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment.
• Reinforcement learning is learning how to act in order to maximize a numerical reward.
Introduction …
• Reinforcement learning is not a type of neural network, nor is it an alternative to neural networks. Rather, it is an area of machine learning.
• The learner receives delayed feedback that evaluates its performance, but it is not told which action is the correct one to achieve its goal.
Reward Hypothesis
• All goals can be described by the maximization of expected cumulative reward.
  • Make a robot walk: +R for moving forward, -R for falling over.
  • Play Atari games: +R / -R for increasing / decreasing the score.
  • Control a helicopter: +R / -R for following the trajectory / crashing.
Q-Learning
• There are many different ways to train a reinforcement learning agent; a common one is called Q-learning.
• Before we talk about Q-learning, we need to cover some background material:
  • Markov decision processes
  • Value functions
Q-Learning …
• Model-free (vs. model-based). Model-free methods apply when either:
  • the MDP model is unknown, but experience can be sampled; or
  • the MDP model is known, but is too big to use, except by samples.
• Off-policy (vs. on-policy): the agent can learn about one policy from experience sampled from some other policy.
Markov Decision Process
• A set of possible world states $S$
• A set of possible actions $A$
• A real-valued reward function $R(s, a)$
• A transition function $T(s, a, s') = P(s' \mid s, a)$: the probability of a transition from $s$ to $s'$ given action $a$
• A policy $\pi$ is a mapping from $S$ to $A$
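As a rough sketch, the components of an MDP can be written down as plain Python dictionaries. The two-state example, the state and action names, and the helper `is_valid_mdp` are all hypothetical and chosen only to make the definitions concrete.

```python
import math

# Hypothetical two-state MDP, invented for illustration.
states = ["s0", "s1"]
actions = ["stay", "go"]

# Reward function R(s, a).
R = {
    ("s0", "stay"): 0.0,
    ("s0", "go"): 1.0,
    ("s1", "stay"): 2.0,
    ("s1", "go"): 0.0,
}

# Transition function T(s, a, s') = P(s' | s, a), stored as {(s, a): {s': prob}}.
T = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"): {"s0": 1.0},
}

# A deterministic policy maps each state to an action.
policy = {"s0": "go", "s1": "stay"}


def is_valid_mdp(states, actions, T, R):
    """Check that R and T cover every (s, a) and that each P(. | s, a) sums to 1."""
    for s in states:
        for a in actions:
            if (s, a) not in R or (s, a) not in T:
                return False
            probs = T[(s, a)]
            if any(s2 not in states or p < 0 for s2, p in probs.items()):
                return False
            if not math.isclose(sum(probs.values()), 1.0):
                return False
    return True


print(is_valid_mdp(states, actions, T, R))
```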
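Value functions
• Q function: $Q^{\pi}(s, a) = \mathbb{E}[R_t \mid s_t = s, a_t = a, \pi]$
  • It is a prediction of future reward, where $R_t$ is the discounted cumulative reward from time $t$ onward.
  • It equals the next reward plus the best the agent can do from the next state:
    $Q(s, a) = R(s, a, s') + \gamma \max_{a'} Q(s', a')$
  • $\gamma \in [0, 1]$ is a discount factor that gives later rewards less weight.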
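To make the discounting concrete, here is a small sketch of the two quantities on this slide: the discounted cumulative reward of a reward sequence, and the one-step "next reward plus best next value" backup. The function names and the sample numbers are assumptions made for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the k-th future reward is weighted by gamma**k."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def q_backup(reward, next_q_values, gamma):
    """Next reward plus the discounted best value available from the next state."""
    return reward + gamma * max(next_q_values)


# With gamma = 0.5, three rewards of 1 are worth 1 + 0.5 + 0.25.
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
# Reward 1 now, and the best next action is worth 2.
print(q_backup(1.0, [0.0, 2.0], gamma=0.5))
```

With $\gamma = 0$ only the immediate reward counts; as $\gamma$ approaches 1, later rewards count almost as much as immediate ones.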
Getting the Policy
• We are looking for the optimal policy: one such that no other policy generates more reward.
  $Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$
• Deterministic policy: $a = \arg\max_{a' \in A} Q^{*}(s, a')$
• Bellman equation: $Q^{*}(s, a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q^{*}(s', a') \mid s, a \right]$
• The Bellman equation can be solved recursively with dynamic programming.
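When the MDP is small and fully known, the Bellman equation can be applied repeatedly as an update until the Q values stop changing, and the deterministic policy is then read off with an argmax. The sketch below does this on a hypothetical two-state MDP; the MDP itself, the iteration count, and the function names are assumptions, not something taken from the slides.

```python
# Hypothetical two-state MDP, invented for illustration.
states = ["s0", "s1"]
actions = ["stay", "go"]
R = {("s0", "stay"): 0.0, ("s0", "go"): 1.0, ("s1", "stay"): 2.0, ("s1", "go"): 0.0}
T = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"): {"s0": 1.0},
}


def q_value_iteration(states, actions, T, R, gamma=0.9, n_iters=200):
    """Repeatedly apply Q(s,a) <- R(s,a) + gamma * E_s'[max_a' Q(s',a')]."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(n_iters):
        new_Q = {}
        for s in states:
            for a in actions:
                expected_next = sum(
                    p * max(Q[(s2, a2)] for a2 in actions)
                    for s2, p in T[(s, a)].items()
                )
                new_Q[(s, a)] = R[(s, a)] + gamma * expected_next
        Q = new_Q
    return Q


def greedy_policy(Q, states, actions):
    """Deterministic policy: pick argmax_a Q(s, a) in every state."""
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}


Q_star = q_value_iteration(states, actions, T, R)
print(greedy_policy(Q_star, states, actions))
```

In this toy MDP, staying in `s1` earns 2 per step forever, so its value approaches 2 / (1 - 0.9) = 20, and the best action in `s0` is to try to get there.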
Exploration - Exploitation dilemma
• We want to pick good actions most of the time, but also do some exploration:
  • Exploring means that we can learn better policies.
  • But we want to balance known good actions with exploratory ones.
  • This is called the exploration/exploitation problem.
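One simple way to balance the two is ε-greedy action selection. The sketch below uses the common convention in which ε is the probability of taking a random (exploratory) action; the function name and the sample Q values are hypothetical.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index: random with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        # Explore: any action, chosen uniformly.
        return rng.randrange(len(q_values))
    # Exploit: the action with the highest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


q_values = [0.1, 0.9, 0.3]
print(epsilon_greedy(q_values, epsilon=0.0))  # always the greedy action
print(epsilon_greedy(q_values, epsilon=0.1))  # usually greedy, sometimes random
```

A common variation, used in the DQN papers listed in the references, is to start with a large ε and decrease it over the course of training.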
Deep Q-Network
[Figure slides, images not recoverable: Deep Q-Network; CNN; Deep Q-Network; Input of QN; Stochastic gradient descent]
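Theoretical complications
• Deep learning algorithms require:
  • huge training datasets
  • independence between samples
  • a fixed underlying data distribution
• In reinforcement learning, consecutive samples are strongly correlated and the data distribution changes as the agent's behavior changes.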
Deep Q-learning …
• Experience replay is used to avoid these theoretical complications:
  • Greater data efficiency: each experience is potentially used in many weight updates.
  • Reduced correlations between samples: randomizing the samples breaks the correlations between consecutive samples.
  • Experience replay averages the behavior distribution over many previous states, which smooths out learning and avoids oscillations or divergence in gradient descent.
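A replay memory of this kind can be sketched as a fixed-size buffer that stores transitions and hands back uniformly random minibatches. The class name, the tuple layout, and the capacity below are assumptions for illustration, not the exact implementation used in the DQN papers.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen silently drops the oldest transition when full.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up consecutive transitions.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buffer.add(state=t, action=0, reward=1.0, next_state=t + 1, done=False)
batch = buffer.sample(2)
print(len(buffer), batch)
```

Because old transitions stay in the buffer until they are pushed out, the same experience can appear in many minibatches, which is where the data-efficiency gain comes from.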
Serial Deep Q-learning
Demo Video
Like an expert player!!
References
• Mnih et al. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
• Mnih et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
• Udacity course, Machine Learning: Reinforcement Learning. https://www.youtube.com/playlist?list=PLAwxTw4SYaPnidDwo9e2c7ixIsu_pdSNp
Thanks for listening

Editor's Notes

  1. Q-Learning Algorithm
     1. Initialize $Q(s, a)$ to small random values, for all $s, a$.
     2. Observe the state, $s$.
     3. Pick an action, $a$, and do it.
     4. Observe the next state, $s'$, and the reward, $r$.
     5. $Q(s, a) \leftarrow (1 - \alpha) Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') \right)$
     6. Go to 2.
     $0 \le \alpha \le 1$ is the learning rate. Use ε-greedy when picking actions:
     • Pick the best (greedy) action with probability 1 − ε.
     • Otherwise, pick a random action.
  2. - There is always an optimal policy for any MDP.
     - All optimal policies achieve the optimal value function.
     - All optimal policies achieve the optimal action-value function.
     All you need is to find q*.
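The Q-learning steps in note 1 can be sketched as a short tabular implementation. Everything about the environment here is a hypothetical assumption: a five-state chain where action 1 moves right, action 0 moves left, and reaching the last state gives reward 1 and ends the episode. The sketch also assumes ε is the probability of a random action and that the target at a terminal step is the reward alone, a detail the note does not spell out.

```python
import random


def q_learning(n_states=5, n_actions=2, episodes=2000,
               alpha=0.1, gamma=0.9, epsilon=0.3, seed=0):
    """Tabular Q-learning on a hypothetical chain environment."""
    rng = random.Random(seed)
    # Step 1: initialize Q(s, a) to small random values.
    Q = [[rng.uniform(0.0, 0.01) for _ in range(n_actions)]
         for _ in range(n_states)]
    for _ in range(episodes):
        s = 0  # Step 2: observe the (start) state.
        for _ in range(100):  # cap on episode length, an added safeguard
            # Step 3: pick an action with epsilon-greedy and do it.
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: Q[s][x])
            # Step 4: observe the next state and the reward.
            s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
            done = s_next == n_states - 1
            r = 1.0 if done else 0.0
            # Step 5: blend the old estimate with the new target.
            target = r if done else r + gamma * max(Q[s_next])
            Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target
            s = s_next  # Step 6: go back to step 2.
            if done:
                break
    return Q


Q = q_learning()
greedy = [max(range(2), key=lambda a: Q[s][a]) for s in range(4)]
print(greedy)
```

Because the update uses the max over next actions rather than the action the agent actually takes next, the learned values describe the greedy policy even though the behavior was ε-greedy, which is what makes Q-learning off-policy.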