
Reinforcement Learning
By:
Chandra Prakash
IIITM Gwalior
Outline
Introduction
Element of reinforcement learning
Reinforcement Learning Problem
Problem solving methods for RL
Introduction
Machine learning: Definition
 Machine learning is a scientific discipline concerned with the design and development of algorithms that allow computers to learn from data, such as sensor data or databases.
A major focus of machine learning research is to automatically learn to recognize complex patterns and make intelligent decisions based on data.
Machine learning types:
With respect to the type of feedback given to the learner:
 Supervised learning: task driven (classification)
 Unsupervised learning: data driven (clustering)
 Reinforcement learning —
 Close to human learning.
 The algorithm learns a policy of how to act in a given environment.
 Every action has some impact on the environment, and the environment provides rewards that guide the learning algorithm.
Supervised Learning Vs
Reinforcement Learning
Supervised Learning
 Step: 1
Teacher: Does picture 1 show a car or a flower?
Learner: A flower.
Teacher: No, it’s a car.
Step: 2
Teacher: Does picture 2 show a car or a flower?
Learner: A car.
Teacher: Yes, it’s a car.
Step: 3 ....
Supervised Learning Vs
Reinforcement Learning (Cont...)
Reinforcement Learning
Step: 1
World: You are in state 9. Choose action A or C.
Learner: Action A.
World: Your reward is 100.
Step: 2
World: You are in state 32. Choose action B or E.
Learner: Action B.
World: Your reward is 50.
Step: 3 ....
Introduction (Cont..)
 Meaning of Reinforcement: Occurrence of an
event, in the proper relation to a response, that tends to
increase the probability that the response will occur
again in the same situation.
Reinforcement learning is the problem faced by an
agent that learns behavior through trial-and-error
interactions with a dynamic environment.
 Reinforcement Learning is learning how to act in
order to maximize a numerical reward.
Introduction (Cont..)
 Reinforcement learning is not a type of neural
network, nor is it an alternative to neural networks.
Rather, it is an orthogonal approach to machine learning.
 Reinforcement learning emphasizes learning feedback
that evaluates the learner's performance without
providing standards of correctness in the form
of behavioral targets.
Example: Bicycle learning
Elements of reinforcement learning
[Diagram: the agent takes actions in the environment; the environment returns a state and a reward; the policy maps states to actions]
 Agent: Intelligent programs
 Environment: External condition
 Policy:
Defines the agent’s behavior at a given time
A mapping from states to actions
Lookup tables or simple function
Elements of reinforcement learning (Cont..)
 Reward function :
 Defines the goal in an RL problem
 Policy is altered to achieve this goal
 Value function:
 Reward function indicates what is good in an immediate
sense while a value function specifies what is good in
the long run.
 Value of a state is the total amount of reward an agent can expect
to accumulate over the future, starting from that state.
 Model of the environment:
 Predicts (mimics) the behavior of the environment.
 Used for planning: given the current state and action, it predicts
the resultant next state and next reward.
Agent-Environment Interface
Steps for Reinforcement Learning
1. The agent observes an input state
2. An action is determined by a decision making
function (policy)
3. The action is performed
4. The agent receives a scalar reward or reinforcement
from the environment
5. Information about the reward given for that state /
action pair is recorded
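These five steps can be summarized in a short Python sketch; the `env`, `policy`, and `learner` objects and their methods (`reset`, `step`, `select_action`, `update`) are hypothetical placeholders for illustration, not an API from the slides.

```python
# Minimal sketch of one observe -> act -> reward -> learn episode (hypothetical interface).
def run_episode(env, policy, learner):
    state = env.reset()                                    # 1. observe an input state
    done = False
    while not done:
        action = policy.select_action(state)               # 2. policy picks an action
        next_state, reward, done = env.step(action)        # 3.-4. act, receive a scalar reward
        learner.update(state, action, reward, next_state)  # 5. record the reward for this state/action pair
        state = next_state
```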
Salient Features of Reinforcement
Learning:
 A set of problems rather than a set of techniques
 Problems are specified without saying how the task is to be achieved.
 “RL as a tool” point of view:
 RL is training by rewards and punishments.
 A tool for training computers to learn.
 The learning agent’s point of view:
 RL is learning from trial and error with the world.
 E.g., how much reward will I get if I take this action?
Reinforcement Learning (Cont..)
 Reinforcement Learning uses Evaluative Feedback
 Purely Evaluative Feedback
 Indicates how good the action taken is.
 Does not tell whether it is the best or the worst action possible.
 Basis of methods for function optimization.
 Purely Instructive Feedback
 Indicates the correct action to take, independently of
the action actually taken.
 Eg: Supervised Learning
 Associative and Non Associative:
Associative & Non Associative Tasks
 Associative :
 Situation Dependent
 Mapping from situation to the actions that are best in that
situation
 Non Associative:
 Situation independent
 No need to associate different actions with different
situations.
 The learner either tries to find a single best action when the task
is stationary, or tries to track the best action as it changes
over time when the task is nonstationary.
Reinforcement Learning (Cont..)
Exploration and exploitation
Greedy action: Action chosen with greatest estimated
value.
Greedy action: a case of exploitation.
Non-greedy action: a case of exploration, as it
enables us to improve the estimate of the non-greedy
action's value.
 N-armed bandit problem:
We repeatedly choose among n different options or actions,
aiming to maximize the expected total reward.
Bandit Problem
[Figure: a row of one-armed bandit “arms” (slot-machine levers)]
Pull arms sequentially so as to maximize the total
expected reward.
• It is non-associative and evaluative.
Action Selection Policies
Greedy (greediest) action - the action with the highest estimated
reward.
 Ɛ-greedy
 Most of the time the greedy action is chosen.
 Every once in a while, with a small probability Ɛ, an
action is selected at random. The action is selected
uniformly, independent of the action-value estimates.
 Ɛ-soft - The best action is selected with probability (1
− Ɛ) and the rest of the time a random action is chosen
uniformly.
Ɛ-Greedy Action Selection Method:
Let a* be the greedy action at time t and Q_t(a) the estimated value of action a at time t.
Greedy action selection:
 a_t = a_t* = argmax_a Q_t(a)
 Ɛ-greedy:
 a_t = a_t* with probability 1 − Ɛ
 a_t = a random action with probability Ɛ
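A minimal Python sketch of Ɛ-greedy selection, assuming the action-value estimates Q_t(a) are held in a NumPy array `Q` (an illustrative assumption, not from the slides):

```python
import numpy as np

def epsilon_greedy(Q, epsilon, rng=np.random.default_rng()):
    """With probability epsilon pick a uniform random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))   # explore: uniform random action
    return int(np.argmax(Q))               # exploit: a_t* = argmax_a Q_t(a)
```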
Action Selection Policies (Cont…)
 Softmax –
 Drawback of Ɛ-greedy & Ɛ-soft: they select among the
random actions uniformly.
 Softmax remedies this by:
 Assigning a weight to each action according to its
action-value estimate.
 A random action is selected with regard to the
weight associated with each action.
 The worst actions are unlikely to be chosen.
 This is a good approach when the worst
actions are very unfavorable.
Softmax Action Selection (Cont…)
 Problem with Ɛ-greedy: it neglects the action values when exploring.
 Softmax idea: grade action probabilities by estimated values.
 Gibbs, or Boltzmann, action selection (exponential weights):
 P(a) = e^{Q_t(a)/τ} / Σ_b e^{Q_t(b)/τ}
τ is the “computational temperature”
As τ → 0, the softmax action selection method becomes
the same as greedy action selection.
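A sketch of Gibbs/Boltzmann (softmax) selection at temperature τ, under the same assumed NumPy array of estimates `Q`:

```python
import numpy as np

def softmax_action(Q, tau, rng=np.random.default_rng()):
    """Sample action a with probability proportional to exp(Q[a] / tau)."""
    prefs = np.asarray(Q, dtype=float) / tau
    prefs -= prefs.max()                        # shift for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    # Low tau -> nearly greedy; high tau -> nearly uniform random
    return int(rng.choice(len(Q), p=probs))
```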
Incremental Implementation
 Sample average:
 Q_t(a) = (r_1 + r_2 + … + r_{k_a}) / k_a
 Incremental computation:
 Q_{k+1} = Q_k + (1/(k+1)) [r_{k+1} − Q_k]
 Common update rule form:
NewEstimate = OldEstimate + StepSize[Target − OldEstimate]
 The expression [Target − OldEstimate] is the error in the
estimate.
 It is reduced by taking a step toward the target.
 In processing the (t+1)st reward for action a, the step-size
parameter is 1/(t+1).
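The update rule can be sketched directly in Python; the per-action counters used to obtain the 1/(t+1) step size are an illustrative choice:

```python
def update_estimate(old_estimate, target, step_size):
    """NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)."""
    return old_estimate + step_size * (target - old_estimate)

# Incremental sample average: after the (t+1)st reward for action a, the step size is 1/(t+1).
Q, N = {}, {}
def record_reward(a, reward):
    N[a] = N.get(a, 0) + 1
    Q[a] = update_estimate(Q.get(a, 0.0), reward, 1.0 / N[a])
```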
Some terms in Reinforcement Learning
 The Agent Learns a Policy:
 Policy at step t, π_t: a mapping from states to action
probabilities,
 π_t(s, a) = probability that a_t = a when s_t = s
 Agents change their policy with experience.
 Objective: get as much reward as possible over the
long run.
 Goals and Rewards
 A goal should specify what we want to achieve, not
how we want to achieve it.
Some terms in RL (Cont…)
 Returns
 Rewards accumulated over the long term
 Episodes: subsequences of agent-environment interaction,
e.g., plays of a game, trips through a maze.
 Discounted return
 The geometrically discounted model of return:
 R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … , with 0 ≤ γ ≤ 1
 Used to:
 Determine the present value of future rewards
 Give more weight to earlier rewards
The Markov Property
 The environment's response at (t+1) depends only on the state
(s) and action (a) at t.
Markov Decision Processes
 Finite MDP
 Transition probabilities: P^a_{ss'} = Pr{ s_{t+1} = s' | s_t = s, a_t = a }
 Expected reward: R^a_{ss'} = E{ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s' }
Transition graph: summarizes the dynamics of a finite MDP.
Value function
 Functions of states (or state-action pairs) that estimate how good it is
for the agent to be in a given state (or to take a given action in a given state).
 Types of value function:
 State-value function
 Action-value function
Backup Diagrams
 Basis of the update or backup operations
 Backup diagrams provide graphical summaries of
the algorithms (state-value & action-value functions).
Bellman equation for a policy π:
 V^π(s) = Σ_a π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]
Model-free and Model-based Learning
 Model-based learning
 Learn from the model instead of interacting with the world
 Can visit arbitrary parts of the model
 Also called indirect methods
 E.g. Dynamic Programming
 Model-free learning
 Sample the reward and transition functions by interacting with
the world
 Also called direct methods
Problem Solving Methods for RL
1) Dynamic programming
2) Monte Carlo methods
3) Temporal-difference learning.
1. Dynamic Programming
 Classical solution method
 Requires a complete and accurate model of the
environment.
 Popular methods for dynamic programming:
 Policy Evaluation: iterative computation of the value
function for a given policy (the prediction problem)
 Policy Improvement: computation of an improved policy
for a given value function.
 V(s_t) ← E_π{ r_{t+1} + γ V(s_{t+1}) }
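A sketch of iterative policy evaluation implementing this backup; the nested dictionaries `P` (transition probabilities), `R` (expected rewards) and `pi` (action probabilities), plus the convergence threshold `theta`, are assumed for illustration:

```python
def policy_evaluation(states, actions, P, R, pi, gamma, theta=1e-6):
    """Sweep V(s) <- sum_a pi[s][a] * sum_s' P[s][a][s'] * (R[s][a][s'] + gamma * V[s']) until stable."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = sum(pi[s][a] * sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                                       for s2 in states)
                        for a in actions)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:          # stop once the largest change in a sweep is tiny
            return V
```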
Dynamic Programming
[Backup diagram: from state s_t, a full backup over all actions and all possible successor states s_{t+1} with rewards r_{t+1}, down to terminal states T]
 V(s_t) ← E_π{ r_{t+1} + γ V(s_{t+1}) }
 NewEstimate = OldEstimate + StepSize[Target − OldEstimate]
Policy Improvement: computation of an improved policy for a given value function
 Policy Evaluation and Policy Improvement together lead to:
 Policy Iteration
 Value Iteration
Policy Iteration: alternates policy evaluation and policy
improvement steps
Value Iteration:
Combines evaluation and improvement in one update
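A sketch of value iteration using the same assumed `P`/`R` layout as the policy-evaluation sketch above; each sweep takes a max over actions, combining evaluation and improvement:

```python
def value_iteration(states, actions, P, R, gamma, theta=1e-6):
    """V(s) <- max_a sum_s' P[s][a][s'] * (R[s][a][s'] + gamma * V[s'])."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = max(sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2]) for s2 in states)
                        for a in actions)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # Extract the greedy policy with respect to the converged value function
    policy = {s: max(actions, key=lambda a: sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                                                for s2 in states))
              for s in states}
    return V, policy
```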
Generalized Policy Iteration (GPI)
Consists of two interacting processes:
 Policy Evaluation: making the value function
consistent with the current policy
 Policy Improvement: making the policy greedy
with respect to the current value function
2. Monte Carlo Methods
 Features of Monte Carlo Methods
 No need for complete knowledge of the environment
 Based on averaging sample returns observed after visits
to a state.
 Experience is divided into episodes
 Value estimates and policies are changed only after
completing an episode.
 Don't require a model
 Not suited for step-by-step incremental computation
MC and DP Methods
 Compute the same value function
 Same steps as in DP:
 Policy evaluation
 Computation of V^π and Q^π for a fixed arbitrary
policy (π).
 Policy Improvement
 Generalized Policy Iteration
To find the value of a state
 Estimate from experience: average the returns observed
after visits to that state.
 The more returns are observed, the closer the average
converges to the expected value.
First-visit Monte Carlo Policy evaluation
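A sketch of first-visit Monte Carlo prediction; `generate_episode(policy)` is a hypothetical helper assumed to return a list of (state, reward) pairs, where the reward is the one received after leaving that state:

```python
from collections import defaultdict

def first_visit_mc(policy, generate_episode, gamma, num_episodes):
    """Estimate V by averaging the return following the first visit to each state."""
    returns_sum, returns_count = defaultdict(float), defaultdict(int)
    V = defaultdict(float)
    for _ in range(num_episodes):
        episode = generate_episode(policy)   # [(state, reward), ...] for one full episode
        G = 0.0
        first_return = {}
        for state, reward in reversed(episode):
            G = gamma * G + reward
            first_return[state] = G          # earlier-in-time visits overwrite, so the first visit wins
        for state, G in first_return.items():
            returns_sum[state] += G
            returns_count[state] += 1
            V[state] = returns_sum[state] / returns_count[state]
    return V
```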
In Monte Carlo with Exploring Starts (MCES), all returns for each
state-action pair are accumulated and averaged, irrespective of what
policy was in force when they were observed.
Monte Carlo Method (Cont…)
 For optimization, all actions need to be selected
infinitely often.
 MC with exploring starts cannot converge to any
suboptimal policy.
 Agents have two approaches:
 On-policy
 Off-policy
On-policy MC algorithm
Off-policy MC algorithm
Monte Carlo and Dynamic Programming
 MC has several advantages over DP:
 Can learn from interaction with environment
 No need of full models
 No need to learn about ALL states
 No bootstrapping
3. Temporal Difference (TD) methods
 Learn from experience, like MC
 Can learn directly from interaction with environment
 No need for full models
 Estimate values based on estimated values of next states,
like DP
 Bootstrapping (like DP)
 Issue to watch for: maintaining sufficient exploration
TD Prediction
Policy Evaluation (the prediction problem):
 for a given policy π, compute the state-value function V^π
Simple every-visit Monte Carlo method:
 V(s_t) ← V(s_t) + α [R_t − V(s_t)]
 target: the actual return after time t
The simplest TD method, TD(0):
 V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) − V(s_t)]
 target: an estimate of the return
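A sketch of TD(0) prediction over one episode, reusing the hypothetical `env`/`policy` interface from the earlier interaction-loop sketch; `V` is assumed to be a dict-like table such as `defaultdict(float)`:

```python
def td0_episode(env, policy, V, alpha, gamma):
    """Apply V(s_t) <- V(s_t) + alpha * [r_{t+1} + gamma * V(s_{t+1}) - V(s_t)] after every step."""
    state = env.reset()
    done = False
    while not done:
        action = policy.select_action(state)
        next_state, reward, done = env.step(action)
        target = reward + (0.0 if done else gamma * V[next_state])  # bootstrapped estimate of the return
        V[state] += alpha * (target - V[state])                     # move V(s_t) toward the TD target
        state = next_state
    return V
```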
Simple Monte Carlo
[Backup diagram: from state s_t, sample a complete episode to a terminal state T and back up the actual return]
 V(s_t) ← V(s_t) + α [R_t − V(s_t)]
Simplest TD Method
[Backup diagram: from state s_t, take one step, observe r_{t+1} and s_{t+1}, and back up the estimate V(s_{t+1})]
 V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) − V(s_t)]
Dynamic Programming
[Backup diagram: from state s_t, a full backup over all actions and all possible successor states s_{t+1}]
 V(s_t) ← E_π{ r_{t+1} + γ V(s_{t+1}) }
TD methods bootstrap and sample
 Bootstrapping: update involves an estimate
 MC does not bootstrap
 DP bootstraps
 TD bootstraps
 Sampling: update does not involve an expected
value
 MC samples
 DP does not sample
 TD samples
Learning An Action-Value Function
Estimate Q^π for the current behavior policy π.
After every transition from a nonterminal state s_t, do this:
 Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]
If s_{t+1} is terminal, then Q(s_{t+1}, a_{t+1}) = 0.
Q-Learning: Off-Policy TD Control
One-step Q-learning:
 Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]
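A sketch of one-step Q-learning with Ɛ-greedy behavior; the `env` interface, the `actions` list, and the tabular `Q[(state, action)]` layout are illustrative assumptions:

```python
import numpy as np
from collections import defaultdict

def q_learning(env, actions, num_episodes, alpha, gamma, epsilon, rng=np.random.default_rng()):
    Q = defaultdict(float)                      # Q[(state, action)] -> estimated action value
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            # Behavior policy: epsilon-greedy over the current estimates
            if rng.random() < epsilon:
                action = actions[rng.integers(len(actions))]
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Off-policy target: max over next actions, regardless of the action taken next
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```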
Sarsa: On-Policy TD Control
S A R S A: State, Action, Reward, State, Action
Turn this into a control method by always updating the policy
to be greedy with respect to the current estimate:
 Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]
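A sketch of Sarsa under the same assumptions as the Q-learning sketch; the only difference is that the target uses the action actually chosen in the next state (on-policy):

```python
import numpy as np
from collections import defaultdict

def sarsa(env, actions, num_episodes, alpha, gamma, epsilon, rng=np.random.default_rng()):
    Q = defaultdict(float)                      # Q[(state, action)] -> estimated action value

    def eps_greedy(state):
        if rng.random() < epsilon:
            return actions[rng.integers(len(actions))]
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        state, done = env.reset(), False
        action = eps_greedy(state)
        while not done:
            next_state, reward, done = env.step(action)
            next_action = None if done else eps_greedy(next_state)
            target = reward if done else reward + gamma * Q[(next_state, next_action)]
            Q[(state, action)] += alpha * (target - Q[(state, action)])  # on-policy TD target
            state, action = next_state, next_action
    return Q
```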
Advantages of TD Learning
 TD methods do not require a model of the
environment, only experience
 TD, but not MC, methods can be fully incremental
 You can learn before knowing the final outcome
 Less memory
 Less peak computation
 You can learn without the final outcome
 From incomplete sequences
 Both MC and TD converge
References: RL
 Univ. of Alberta
 http://www.cs.ualberta.ca/~sutton/book/ebook/node1.html
 Sutton and Barto, “Reinforcement Learning: An Introduction.”
 Univ. of New South Wales
 http://www.cse.unsw.edu.au/~cs9417ml/RL1/tdlearning.html
Thank You