Reinforcement Learning
V. Saranya, AP/CSE
Sri Vidya College of Engineering and Technology, Virudhunagar
What is learning?
 Learning takes place as a result of the interaction between an agent and the world.
 The idea behind learning is that percepts received by an agent should be used not only for acting, but also for improving the agent's ability to behave optimally in the future to achieve its goal.
Learning types
 Supervised learning:
a situation in which sample (input, output) pairs of the function to be learned can be perceived or are given.
You can think of it as if there were a kind teacher.
 Reinforcement learning:
the agent acts on its environment and receives some evaluation of its action (reinforcement), but is not told which action is the correct one to achieve its goal.
Reinforcement learning
 Task
Learn how to behave successfully to achieve a
goal while interacting with an external environment
 Learn via experiences!
 Examples
 Game playing: the player knows whether it wins or loses, but does not know how to move at each step
 Control: a traffic system can measure the delay of cars, but does not know how to decrease it
RL is learning from interaction
RL model
 Each percept (e) is enough to determine the state (the state is accessible)
 The agent can decompose the reward component from a percept
 The agent's task: find an optimal policy, mapping states to actions, that maximizes a long-run measure of the reinforcement
 Think of the reinforcement as a reward
 Can be modeled as an MDP!
Review of MDP model
 MDP model <S,T,A,R>
[Diagram: the agent–environment interaction loop (state, reward, action), producing the trajectory s0, r0, a0, s1, a1, r1, s2, a2, r2, s3, …]
• S – set of states
• A – set of actions
• T(s,a,s') = P(s'|s,a) – the probability of transitioning from s to s' given action a
• R(s,a) – the expected reward for taking action a in state s
R(s,a) = \sum_{s'} P(s' \mid s, a)\, r(s,a,s') = \sum_{s'} T(s,a,s')\, r(s,a,s')
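As a concrete illustration, the sketch below (hypothetical names, a tiny made-up two-state example) stores T and the per-transition rewards r as Python dictionaries and computes R(s,a) from them:

```python
from collections import defaultdict

# Transition model T[(s, a)][s'] = P(s' | s, a) and per-transition reward r[(s, a, s')]
T = defaultdict(dict)
r = {}

# A tiny hypothetical two-state example
T[("s0", "go")] = {"s1": 0.8, "s0": 0.2}
r[("s0", "go", "s1")] = 1.0
r[("s0", "go", "s0")] = 0.0

def expected_reward(s, a):
    """R(s, a) = sum over s' of T(s, a, s') * r(s, a, s')."""
    return sum(p * r.get((s, a, s_next), 0.0) for s_next, p in T[(s, a)].items())

print(expected_reward("s0", "go"))  # 0.8 * 1.0 + 0.2 * 0.0 = 0.8
```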
Model-based vs. model-free approaches
 But we don't know anything about the environment model—the transition function T(s,a,s')
 There are two approaches:
 Model-based RL:
learn the model, and use it to derive the optimal policy,
e.g. the adaptive dynamic programming (ADP) approach
 Model-free RL:
derive the optimal policy without learning the model,
e.g. the LMS and temporal difference (TD) approaches
 Which one is better?
Passive learning vs. active learning
 Passive learning
 The agent simply watches the world going by and tries to learn the utilities of being in various states
 Active learning
 The agent does not simply watch, but also acts
Example environment
Passive learning scenario
 The agent sees the sequences of state transitions and the associated rewards
 The environment generates state transitions and the agent perceives them,
e.g. (1,1) (1,2) (1,3) (2,3) (3,3) (4,3)[+1]
(1,1) (1,2) (1,3) (1,2) (1,3) (1,2) (1,1) (2,1) (3,1) (4,1) (4,2)[-1]
 Key idea: updating the utility value using the
given training sequences.
Passive learning scenario
LMS (least mean squares) updating
 Reward-to-go of a state:
the sum of the rewards from that state until a terminal state is reached
 Key: use the observed reward-to-go of a state as direct evidence of its actual expected utility
 Learns the utility function directly from example sequences
LMS updating
function LMS-UPDATE(U, e, percepts, M, N) returns an updated U
  if TERMINAL?[e] then
    reward-to-go ← 0
    for each ei in percepts (starting from the end) do
      s ← STATE[ei]
      reward-to-go ← reward-to-go + REWARD[ei]
      U[s] ← RUNNING-AVERAGE(U[s], reward-to-go, N[s])
    end

function RUNNING-AVERAGE(U[s], reward-to-go, N[s])
  U[s] ← [ U[s] × (N[s] – 1) + reward-to-go ] / N[s]
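A minimal runnable sketch of the same idea in Python, assuming each episode is a list of (state, reward) pairs ending at a terminal state; the -0.04 step reward is an assumption borrowed from the usual 4x3 grid example:

```python
from collections import defaultdict

U = defaultdict(float)   # utility estimates
N = defaultdict(int)     # visit counts

def lms_update(episode):
    reward_to_go = 0.0
    for state, reward in reversed(episode):
        reward_to_go += reward
        N[state] += 1
        # running average of the observed reward-to-go for this state
        U[state] += (reward_to_go - U[state]) / N[state]

# Example: the first training sequence from the 4x3 world (step reward assumed -0.04)
lms_update([((1,1), -0.04), ((1,2), -0.04), ((1,3), -0.04),
            ((2,3), -0.04), ((3,3), -0.04), ((4,3), +1.0)])
```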
LMS updating algorithm in
passive learning
 Drawbacks:
 The actual utility of a state is constrained to be the probability-weighted average of its successors' utilities, which LMS ignores
 Converges very slowly to the correct utility values (requires a lot of sequences);
for our example, more than 1000!
Temporal difference method in
passive learning
 TD(0) key idea:
 adjust the estimated utility value of the current state based on its immediate reward and the estimated value of the next state
 The updating rule:

U(s) \leftarrow U(s) + \alpha \left( R(s) + U(s') - U(s) \right)

 \alpha is the learning rate parameter
 Only when \alpha is a function that decreases as the number of times a state has been visited increases can U(s) converge to the correct value
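A minimal sketch of the TD(0) update in Python; the decaying learning-rate schedule 60/(59+n) is just one common choice, not prescribed by the slides:

```python
from collections import defaultdict

U = defaultdict(float)   # utility estimates
N = defaultdict(int)     # visit counts, used to decay the learning rate

def alpha(n):
    # a learning rate that decreases with the number of visits (an assumed schedule)
    return 60.0 / (59.0 + n)

def td_update(s, reward, s_next):
    N[s] += 1
    U[s] += alpha(N[s]) * (reward + U[s_next] - U[s])

# One observed transition in the 4x3 world, assuming a -0.04 step reward
td_update((3, 3), -0.04, (4, 3))
```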
The TD learning curve
[Figure: utility estimates over training for states (4,3), (2,3), (2,2), (1,1), (3,1), (4,1), (4,2)]
Adaptive dynamic programming (ADP) in passive learning
 Unlike the LMS and TD methods (model-free approaches),
ADP is a model-based approach!
 The updating rule for passive learning:

U(s) = \sum_{s'} T(s, s') \left( r(s, s') + \gamma U(s') \right)

 However, in an unknown environment T is not given; the agent must learn T itself through experience with the environment
 How to learn T?
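A sketch of one such evaluation sweep, assuming the learned model is stored as T[s] = {s': probability} under the fixed policy and r[(s, s')] holds transition rewards (hypothetical layout):

```python
GAMMA = 0.9  # discount factor (an assumption)

def adp_sweep(U, T, r):
    """One policy-evaluation sweep: U(s) = sum_s' T(s,s') * (r(s,s') + gamma * U(s'))."""
    new_U = {}
    for s in T:
        new_U[s] = sum(p * (r.get((s, s_next), 0.0) + GAMMA * U.get(s_next, 0.0))
                       for s_next, p in T[s].items())
    return new_U
```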
ADP learning curves
[Figure: utility estimates over training for states (4,3), (3,3), (2,3), (1,1), (3,1), (4,1), (4,2)]
Active learning
 An active agent must consider
 what actions to take
 what their outcomes may be
 Update utility equation:

U(s) = \max_a \left( R(s,a) + \gamma \sum_{s'} T(s,a,s')\, U(s') \right)

 Rule to choose the action:

a = \arg\max_a \left( R(s,a) + \gamma \sum_{s'} T(s,a,s')\, U(s') \right)
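A minimal sketch of both rules, assuming a hypothetical model layout T[(s, a)][s'] = P(s'|s,a) and R[(s, a)] for the expected reward:

```python
GAMMA = 0.9  # discount factor (an assumption)

def q_value(s, a, U, T, R):
    # R(s,a) + gamma * sum_s' T(s,a,s') * U(s')
    return R.get((s, a), 0.0) + GAMMA * sum(
        p * U.get(s_next, 0.0) for s_next, p in T.get((s, a), {}).items())

def greedy_action(s, actions, U, T, R):
    # rule to choose the action: argmax over a
    return max(actions, key=lambda a: q_value(s, a, U, T, R))

def value_update(s, actions, U, T, R):
    # update utility equation: max over a
    return max(q_value(s, a, U, T, R) for a in actions)
```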
Active ADP algorithm
For each s, initialize U(s), T(s,a,s') and R(s,a)
Initialize s to the current state that is perceived
Loop forever
{
  Select an action a and execute it (using the current model R and T), using
    a = \arg\max_a \left( R(s,a) + \gamma \sum_{s'} T(s,a,s')\, U(s') \right)
  Receive the immediate reward r and observe the new state s'
  Use the transition tuple <s,a,s',r> to update the model R and T (see below)
  For all states s, update U(s) using the updating rule
    U(s) = \max_a \left( R(s,a) + \gamma \sum_{s'} T(s,a,s')\, U(s') \right)
  s = s'
}
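A sketch of this loop in Python, reusing q_value/greedy_action/value_update from the previous sketch and a learn_model routine like the one on the next slide; the environment interface (reset, step) is an assumption:

```python
def active_adp(env, states, actions, n_steps=1000):
    U = {s: 0.0 for s in states}
    T, R = {}, {}
    s = env.reset()
    for _ in range(n_steps):
        a = greedy_action(s, actions, U, T, R)      # choose action from the current model
        s_next, reward = env.step(s, a)             # act and observe <s, a, s', r>
        learn_model(T, R, s, a, s_next, reward)     # update the model R and T
        for state in states:                        # value-update sweep over all states
            U[state] = value_update(state, actions, U, T, R)
        s = s_next
    return U
```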
How to learn the model?
 Use the transition tuple <s, a, s', r> to learn T(s,a,s') and R(s,a).
That's supervised learning!
 Since the agent observes every transition (s, a, s', r) directly, it can take (s,a)/s' as an input/output example for the transition probability function T
 Different supervised learning techniques can be used (see further reading for details)
 Use r and T(s,a,s') to learn R(s,a):

R(s,a) = \sum_{s'} T(s,a,s')\, r
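One simple way to do this is frequency counting, sketched below (hypothetical names): T(s,a,s') is estimated from observed transition frequencies and R(s,a) as the average observed reward, which matches the formula above in expectation:

```python
from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s'] = number of times observed
reward_sum = defaultdict(float)                  # total reward observed after taking (s, a)

def learn_model(T, R, s, a, s_next, reward):
    counts[(s, a)][s_next] += 1
    total = sum(counts[(s, a)].values())
    # T(s,a,s') estimated by relative frequency
    T[(s, a)] = {sn: c / total for sn, c in counts[(s, a)].items()}
    # R(s,a) estimated by the average observed reward
    reward_sum[(s, a)] += reward
    R[(s, a)] = reward_sum[(s, a)] / total
```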
ADP approach pros and cons
 Pros:
 The ADP algorithm converges far faster than LMS and temporal difference learning, because it uses the information from the model of the environment
 Cons:
 Intractable for large state spaces
 In each step, U is updated for all states
 This can be improved by prioritized sweeping
Exploration problem in Active learning
 An action has two kinds of outcomes:
 Gaining rewards on the current experience tuple (s,a,s')
 Affecting the percepts received, and hence the agent's ability to learn
Exploration problem in Active
learning
 A trade-off when choosing actions, between
 the immediate good (reflected in the current utility estimates, using what has been learned so far)
 the long-term good (exploring more of the environment helps the agent behave optimally in the long run)
 Two extreme approaches
 "wacky" approach: act randomly, in the hope of eventually exploring the entire environment
 "greedy" approach: act to maximize utility using the current model estimate
 Just like humans in the real world! People need to decide between
 continuing in a comfortable existence
 or striking out into the unknown in the hope of discovering a new and better life
Exploration problem in Active
learning
 One kind of solution: the agent should be more wacky when it has little idea of the environment, and more greedy when it has a model that is close to being correct
 In a given state, the agent should give some weight to actions that it has not tried very often,
 while tending to avoid actions that are believed to be of low utility
 Implemented by an exploration function f(u,n):
 assign a higher utility estimate to relatively unexplored action-state pairs
 Change the updating rule of the value function to

U^{+}(s) = \max_a \left( r(s,a) + \gamma\, f\!\left( \sum_{s'} T(s,a,s')\, U^{+}(s'),\; N(s,a) \right) \right)

 U^{+} denotes the optimistic estimate of the utility
Exploration problem in Active
learning
 One kind of definition of f(u,n):

f(u,n) = \begin{cases} R^{+} & \text{if } n < N_e \\ u & \text{otherwise} \end{cases}

 R^{+} is an optimistic estimate of the best possible reward obtainable in any state
 The agent will try each action-state pair (s,a) at least Ne times
 The agent will behave initially as if there were wonderful rewards scattered all over the place – optimistic
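A minimal sketch of this exploration function and the optimistic update; the values of R+ and Ne are assumptions for illustration:

```python
R_PLUS = 2.0   # optimistic estimate R+ of the best possible reward (assumed value)
N_E = 5        # try each action-state pair at least Ne times (assumed value)
GAMMA = 0.9    # discount factor (an assumption)

def f(u, n):
    # pretend relatively unexplored pairs lead to the best possible reward
    return R_PLUS if n < N_E else u

def optimistic_update(s, actions, U_plus, T, r, N):
    # U+(s) = max_a ( r(s,a) + gamma * f( sum_s' T(s,a,s') U+(s'), N(s,a) ) )
    return max(
        r.get((s, a), 0.0) + GAMMA * f(
            sum(p * U_plus.get(sn, 0.0) for sn, p in T.get((s, a), {}).items()),
            N.get((s, a), 0))
        for a in actions)
```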
Generalization in
Reinforcement Learning
 Use generalization techniques to deal with large state or action spaces
 Function approximation techniques
Genetic algorithm and Evolutionary
programming
 Start with a set of individuals
 Apply selection and reproduction operators to “evolve” an individual that is
successful (measured by a fitness function)
Genetic algorithm and Evolutionary
programming
 Imagine the individuals as agent functions
 Fitness function as the performance measure or reward function
 No attempt is made to learn the relationship between the rewards and the actions taken by an agent
 It simply searches directly in the space of individuals to find one that maximizes the fitness function