Reinforcement Learning
Elias Hasnat
Software Engineer, Telecom Data Scientist (Design, Architect, Code) at NS
Challenges in Reinforcement Learning
Definition of Reinforcement Learning
Reinforcement learning sits between supervised and unsupervised learning:
1. In supervised learning one has a target label for each training example,
2. In unsupervised learning one has no labels at all, and
3. In reinforcement learning one has sparse and time-delayed labels (the rewards).
Based only on those rewards the agent has to learn to behave in the environment.
1. Credit assignment problem and
2. Exploration-exploitation dilemma
Steps to consider in RL problem solving
Credit assignment is used for defining the strategy; exploration-exploitation is used for improving the reward outcomes.
Credit Assignment
Mathematical formalization
● An agent, situated in an environment.
● The environment is in a certain state.
● The agent can perform certain actions in the environment.
● These actions sometimes result in a reward.
● Actions transform the environment and lead to a new state, where the
agent can perform another action, and so on.
● The rules for how you choose those actions are called the policy.
● The environment in general is stochastic, which means the next state may
be somewhat random.
● The set of states and actions, together with rules for transitioning from one
state to another, make up a Markov decision process.
● One episode of this process forms a finite sequence of states, actions and
rewards.
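As a minimal sketch of this agent-environment loop (illustrative only; a simplified Gym-style `env` with `reset`/`step` methods and a `policy` callable are assumptions, not part of the slides):

```python
def run_episode(env, policy, max_steps=1000):
    """Run one episode: a finite sequence of states, actions and rewards."""
    state = env.reset()                                # environment starts in some state
    states, actions, rewards = [], [], []
    for _ in range(max_steps):
        action = policy(state)                         # the policy chooses an action
        next_state, reward, done = env.step(action)    # environment transitions, possibly stochastically
        states.append(state)
        actions.append(action)
        rewards.append(reward)
        state = next_state
        if done:                                       # terminal state ends the episode
            break
    return states, actions, rewards
```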
Reward Strategy
To perform well in the long term, we need to take into account not only the
immediate rewards, but also the future rewards we are going to get.
Total reward: R = r_1 + r_2 + r_3 + … + r_n
Total reward from time point t: R_t = r_t + r_{t+1} + r_{t+2} + … + r_n
But because the environment is stochastic, we can never be sure of getting the same
rewards the next run, which is why we consider the
Discounted future reward: R_t = r_t + γ r_{t+1} + γ² r_{t+2} + … + γ^(n-t) r_n
where γ is the discount factor between 0 and 1; the further a reward lies in the future,
the less it contributes, since γ^k goes toward 0 as the number of steps k increases.
Discounted future reward calculation
The discounted future reward at time step t can be expressed in terms of the same
quantity at time step t+1:
R_t = r_t + γ R_{t+1}
If we set the discount factor γ=0, then our strategy will be short-sighted and we
rely only on the immediate rewards.
If we want to balance between immediate and future rewards, we should set
discount factor to something like γ=0.9.
If our environment is deterministic and the same actions always result in the same
rewards, then we can set the discount factor γ=1.
A good strategy for an agent would be to always choose an action that maximizes the (discounted) future reward.
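As an illustration (not from the slides), the discounted return R_t = r_t + γ R_{t+1} can be computed for every step of an episode by iterating backwards over its rewards:

```python
def discounted_returns(rewards, gamma=0.9):
    """Compute R_t = r_t + gamma * R_{t+1} for each time step t, iterating backwards."""
    returns = [0.0] * len(rewards)
    future = 0.0
    for t in reversed(range(len(rewards))):
        future = rewards[t] + gamma * future   # the recursive definition above
        returns[t] = future
    return returns

# gamma = 0 keeps only immediate rewards; gamma near 1 weights future rewards heavily.
print(discounted_returns([1, 0, 0, 10], gamma=0.9))
```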
The Q-function measures the quality of an action
The Q-function Q(s, a) represents the maximum discounted future reward when we perform
action a in state s, and continue optimally from that point on:
Q(s, a) = max R_{t+1}
Suppose you are in state s and deciding whether to take action a or b. You want to
select the action that results in the highest score at the end of the game, i.e. act
according to the policy
π(s) = argmax_a Q(s, a)
where π represents the policy.
Now let's focus on just one transition <s, a, r, s'>. We can express the Q-value of
state s and action a in terms of the Q-value of the next state s':
Q(s, a) = r + γ max_{a'} Q(s', a')
This is called the Bellman equation.
Q-Learning Formulation
Q[s, a] ← Q[s, a] + α (r + γ max_{a'} Q[s', a'] − Q[s, a])
α in the algorithm is a learning rate that controls how much of the difference
between the previous Q-value and the newly proposed Q-value is taken into account.
In particular, when α=1, the two Q[s, a] terms cancel and the update is exactly the
same as the Bellman equation.
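A minimal tabular sketch of this update rule (illustrative; the state and action encodings and the set of available actions are assumptions):

```python
from collections import defaultdict

Q = defaultdict(float)        # Q[(state, action)] defaults to 0
alpha, gamma = 0.1, 0.9       # learning rate and discount factor

def q_update(s, a, r, s_next, actions):
    """Q[s,a] <- Q[s,a] + alpha * (r + gamma * max_a' Q[s',a'] - Q[s,a])."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    td_target = r + gamma * best_next
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
```

With α=1 the update reduces to Q[s, a] = r + γ max_{a'} Q[s', a'], i.e. the Bellman equation itself.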
Deep Q Network
Neural networks are exceptionally good at coming up with good features for highly
structured data, so we can represent our Q-function with a neural network.
Use-case 1: take the state and action as input and output the corresponding Q-value.
Use-case 2: take only the state as input and output the Q-value for each possible action.
* The benefit of the latter use-case over the former is that if we want to perform a Q-value update or pick the action with the highest Q-value, we only need to
do one forward pass through the network to have the Q-values for all actions available immediately.
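A sketch of use-case 2 as a small PyTorch network (the layer sizes are placeholder assumptions, not from the slides):

```python
import torch.nn as nn

def build_q_network(state_dim, n_actions, hidden=64):
    """State in, one Q-value per action out (use-case 2)."""
    return nn.Sequential(
        nn.Linear(state_dim, hidden),
        nn.ReLU(),
        nn.Linear(hidden, hidden),
        nn.ReLU(),
        nn.Linear(hidden, n_actions),   # one output per possible action
    )
```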
Rewriting the Q-function update using the Q-network
Given a transition < s, a, r, s’ >, the Q-table update rule in the previous algorithm
must be replaced with the following:
1. Do a feedforward pass for the current state s to get predicted Q-values for all
actions.
2. Do a feedforward pass for the next state s' and calculate the maximum over the
network outputs, max_{a'} Q(s', a').
3. Set the Q-value target for action a to r + γ max_{a'} Q(s', a') (using the max calculated
in step 2). For all other actions, set the Q-value target to the value originally
returned from step 1, making the error 0 for those outputs.
4. Update the weights using backpropagation.
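A hedged sketch of steps 1-4 for one mini-batch, assuming the PyTorch network above and batched tensors `states`, `next_states`, `rewards`, `dones` (float) and `actions` (long); computing the loss only on the taken action is equivalent to giving all other actions zero error:

```python
import torch

def dqn_update(q_net, optimizer, states, actions, rewards, next_states, dones, gamma=0.9):
    """One DQN update: targets r + gamma * max_a' Q(s', a'), then backpropagation."""
    q_values = q_net(states)                               # step 1: Q-values for current states
    with torch.no_grad():
        next_max = q_net(next_states).max(dim=1).values    # step 2: max_a' Q(s', a')
    targets = rewards + gamma * next_max * (1 - dones)     # step 3: target for the taken action
    q_taken = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = torch.nn.functional.mse_loss(q_taken, targets)  # other actions contribute zero error
    optimizer.zero_grad()
    loss.backward()                                        # step 4: update weights by backprop
    optimizer.step()
    return loss.item()
```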
Experience Replay (Memory)
Estimate the future reward in each state using Q-learning and approximate the Q-
function with a neural network.
Challenge:
Approximating Q-values with a non-linear function such as a neural network is not very stable on its own.
Solution: Memorizing the experience.
During gameplay all the experiences < s, a, r, s’ > are stored in a replay memory.
When training the network, random mini-batches from the replay memory are used
instead of the most recent transition. This breaks the similarity of subsequent
training samples, which otherwise might drive the network into a local minimum.
Also experience replay makes the training task more similar to usual supervised
learning, which simplifies debugging and testing the algorithm.
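A minimal replay memory sketch (the capacity and batch size are illustrative choices, not values from the slides):

```python
import random
from collections import deque

class ReplayMemory:
    """Stores <s, a, r, s', done> transitions and samples random mini-batches."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)          # oldest experiences are dropped when full

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # random sampling breaks the correlation between subsequent transitions
        return random.sample(self.buffer, batch_size)
```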
Exploration and Exploitation
First observe that when a Q-table or Q-network is initialized randomly, its
predictions are initially random as well. If we pick the action with the highest Q-value,
that action will be random and the agent performs crude "exploration". As the Q-
function converges, it returns more consistent Q-values and the amount of
exploration decreases. So one could say that Q-learning incorporates exploration
as part of the algorithm. But this exploration is "greedy": it settles on
the first effective strategy it finds.
A simple and effective fix for this problem is ε-greedy exploration: with
probability ε choose a random action, otherwise go with the "greedy" action with the
highest Q-value. In practice ε is decreased over time from 1 to 0.1: in the beginning
the system makes completely random moves to explore the state space maximally,
and then it settles down to a fixed exploration rate.
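A sketch of ε-greedy action selection with a linearly decaying ε (the 1 to 0.1 schedule comes from the slide; the decay horizon is an assumption):

```python
import random

def epsilon_by_step(step, eps_start=1.0, eps_end=0.1, decay_steps=10_000):
    """Linearly anneal epsilon from eps_start to eps_end over decay_steps."""
    fraction = min(step / decay_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)

def epsilon_greedy(q_values, step):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    eps = epsilon_by_step(step)
    if random.random() < eps:
        return random.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=q_values.__getitem__)       # exploit: argmax Q
```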
Deep Q-Learning
Thank You