Reinforcement Learning
from Explore to Exploit
Task: Learn how to behave successfully to achieve a
goal while interacting with an external environment.
Learn via experiences!
Examples
• Game playing: the player knows whether it wins or loses,
but not how to move at each step.
• Control: a traffic system can measure the delay of cars,
but does not know how to decrease it.
Reinforcement Learning
2
RL is learning from interaction
3
The agent acts on its environment and receives an evaluation of
its action (a reinforcement signal).
The goal of the agent is to learn a policy that maximizes its total
(future) reward.
St → At → Rt → St+1 → At+1 → Rt+1 → St+2 → …
At each state S, choose the action a which
maximizes the function Q(S, a).
Q is the estimated utility function – it tells us how
good an action is in a given state.
Q-Learning Basics
4
All decisions follow the Q-table (the best policy), but where does the Q-table come from?
5
At each step we draw a random number. If this number > epsilon, we do
“exploitation” (this means we use what we already know to select the best
action at each step); otherwise we “explore” and pick an action at random.
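A minimal Python sketch of this epsilon-greedy rule (the helper name `choose_action` and the assumption that `Q` is a 2-D state-by-action NumPy array are illustrative, not from the slides):

```python
import numpy as np

def choose_action(Q, state, epsilon):
    """Epsilon-greedy: exploit the Q-table most of the time, explore otherwise."""
    if np.random.rand() > epsilon:
        # Exploitation: pick the best-known action in this state.
        return int(np.argmax(Q[state]))
    # Exploration: pick any action at random.
    return np.random.randint(Q.shape[1])
```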
Bellman equation (Q-table Update Rule)
6
[Figure: a transition diagram over states s0–s3 with actions a, b, c, d, f; among the actions available in s0, Q(S0, b) is the maximum.]
Get the maximum Q value for the next state, based on all possible actions:
Q(S, a) = R(S, a) + γ · max_a' Q(S', a')
Here R(S, a) is the immediate reward and γ · max_a' Q(S', a') is the discounted future reward; γ is the discount rate, with 0 ≤ γ < 1.
This is a recursive definition.
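Transcribed directly into Python, assuming `Q` and `R` are (state, action) arrays and `s_next` is the state reached by taking `a` in `s` (names are illustrative):

```python
import numpy as np

def bellman_update(Q, R, s, a, s_next, gamma=0.9):
    """Q(S, a) = R(S, a) + gamma * max over a' of Q(S', a')."""
    Q[s, a] = R[s, a] + gamma * np.max(Q[s_next])
```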
Example
7
Q(S, a) = R(S, a) + γ · max_a' Q(S', a'), with γ = 0.9
[Figure: a worked instance on the diagram above, with reward = 1 and one Q value left to compute (?).]
8
Initially we explore the environment and update the Q-table.
When the Q-table is ready, the agent starts to exploit the
environment and takes better actions.
This Q-table becomes a reference table for our agent to
select the best action.
Algorithm to utilize the Q matrix:
1. Set current state = initial state.
2. From current state, find the action with the highest Q value.
3. Set current state = next state.
4. Repeat Steps 2 and 3 until current state = goal state.
9
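A sketch of this walk in Python, assuming (as in the room example later in the deck) that an action's index is also the index of the next state:

```python
import numpy as np

def walk_policy(Q, initial_state, goal_state, max_steps=100):
    """Follow the highest-Q action from state to state until the goal."""
    state = initial_state
    path = [state]
    for _ in range(max_steps):             # guard against loops in an unconverged table
        if state == goal_state:
            break
        action = int(np.argmax(Q[state]))  # step 2: the action with the highest Q value
        state = action                     # room example: the action index is the next room
        path.append(state)
    return path
```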
If you simply follow the Q-table, there may be
better routes that you do not know about; in
other words, you may not find the overall
best path!
10
The Bellman equation only sees the current state, so it may not find
the optimal solution; hence the MDP approach is introduced to add randomness.
Looking only at the Q-table values (the greedy method) means always making
the choice that is best right now, which is not necessarily best overall.
By introducing some degree of random selection, there is a chance to try
a path never taken before, which may turn out to be the optimal one.
MDP (Markov Decision Process)
11
Q-Learning algorithm
12
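The slide's figure is not reproduced here; as a stand-in, this is a sketch of the standard tabular Q-learning loop it presumably depicts, using the reward-matrix convention (-1 = invalid move) from the room example that follows. Parameter values are illustrative:

```python
import numpy as np

def q_learning(R, gamma=0.8, epsilon=0.4, episodes=500, goal=5):
    """Sketch of the standard tabular Q-learning loop.
    R[state, action] is a reward matrix where -1 marks an invalid move,
    and taking action a leads to state a (as in the room example below)."""
    n = R.shape[0]
    Q = np.zeros((n, n))
    for _ in range(episodes):
        state = np.random.randint(n)                  # random start state
        for _ in range(200):                          # cap episode length
            if state == goal:
                break
            valid = np.flatnonzero(R[state] >= 0)     # actions with a link
            if np.random.rand() > epsilon:            # exploit the table...
                action = valid[np.argmax(Q[state, valid])]
            else:                                     # ...or explore at random
                action = np.random.choice(valid)
            Q[state, action] = R[state, action] + gamma * Q[action].max()
            state = action
    return Q
```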
Bellman equation
13
Note: what is updated is the difference (the TD error).
NewQ(s, a) = (1 − α) Q(s, a) + α { R(s, a) + γ · max_a' Q(s', a') }
0 ≤ γ < 1   Discount rate
0 ≤ α ≤ 1   Learning rate
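The same rule in Python; the second form makes the note explicit by applying only the difference between the target and the current estimate (names and defaults are illustrative):

```python
import numpy as np

def q_update(Q, R, s, a, s_next, alpha=0.5, gamma=0.9):
    """NewQ(s,a) = (1 - alpha) * Q(s,a) + alpha * (R(s,a) + gamma * max Q(s',a'))."""
    target = R[s, a] + gamma * np.max(Q[s_next])
    # Equivalent form: only the difference (the TD error) is applied.
    Q[s, a] += alpha * (target - Q[s, a])
```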
Example : Q-Learning By Hand
14
http://mnemstudio.org/path-finding-q-learning-tutorial.htm
The outside of the building can be thought of as one big
room (5). Notice that doors 1 and 4 lead into the building
from room 5 (outside).
15
The -1's in the table represent null
values (i.e., where there isn't a
link between nodes). For example,
State 0 cannot go to State 1.
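The reward table written out as a NumPy array, reconstructed from the links the slides describe (rows are states, columns are actions; this layout is an assumption, but it is consistent with the updates on the next slide):

```python
import numpy as np

# Reward matrix R[state, action] for the room example: -1 = no link,
# 0 = a door, 100 = a door leading into the goal (room 5, outside).
R = np.array([
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
])
```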
Taking Action: Exploit or Explore
16
0. Create a Q-table (all entries initialized to 0).
1. Initial random state = 1; update the Q-table:
Q(1, 5) = R(1, 5) + 0.8 * Max(Q(5, 1), Q(5, 4), Q(5, 5))
= 100 + 0.8 * 0 = 100
2. Again: initial state = 3, randomly select action 1; update the Q-table:
Q(3, 1) = R(3, 1) + 0.8 * Max(Q(1, 3), Q(1, 5))
= 0 + 0.8 * Max(0, 100) = 80
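These two hand updates can be reproduced in a few lines, continuing from the `R` matrix sketched above (γ = 0.8 as in the tutorial; the plain assignment corresponds to a learning rate of 1):

```python
import numpy as np

Q = np.zeros((6, 6))
gamma = 0.8

def update(s, a):
    # Taking action a moves to room a; invalid entries stay 0, so the max
    # over the whole row equals the max over the valid actions on the slide.
    Q[s, a] = R[s, a] + gamma * Q[a].max()

update(1, 5)   # episode 1: Q(1, 5) = 100 + 0.8 * 0   = 100
update(3, 1)   # episode 2: Q(3, 1) = 0   + 0.8 * 100 = 80
```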
17
As the agent learns through further
episodes, the values in matrix Q
will eventually converge, e.g.:
This matrix Q can then be
normalized (i.e., converted to
percentages) by dividing all non-zero
entries by the highest value (500 in
this case):
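That normalization is a one-liner (a sketch; `normalize` is an illustrative helper name):

```python
import numpy as np

def normalize(Q):
    """Convert a converged Q-table to percentages of its largest entry.
    Dividing every entry is equivalent to dividing only the non-zero ones,
    since zeros remain zero."""
    return Q / Q.max() * 100
```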
Hands-On Implementation
18
19
[Figure: a grid world with x/y axes, the four actions North, South, East, and West, and locations labeled 1–4.]