Reinforcement Learning: An Introduction
Ch. 3, 4, 6
R. Sutton and A. Barto

KAIST AIPR Lab.
Jung-Yeol Lee
3rd June 2010





Contents

•   Reinforcement learning
•   Markov decision processes
•   Value function
•   Policy iteration
•   Value iteration
•   Sarsa
•   Q-learning







Reinforcement Learning

• An approach to machine learning
• Learning how to take actions in an environment that responds to those
  actions and presents new situations
• To find a policy that maps situations to actions
• To discover which actions yield the most reward over the long run







Agent-Environment Interface

• Agent
  - The learner and decision maker
• Environment
  - Everything outside the agent
  - Responding to actions and presenting new situations
  - Giving a reward (feedback, or reinforcement)







Agent-Environment Interface (cont’d)

[Diagram: the agent receives state s_t and reward r_t from the environment
and emits action a_t; the environment then returns s_{t+1} and r_{t+1}]

• Agent and environment interact at time steps t = 0, 1, 2, 3, ...
  - The environment's state, s_t ∈ S, where S is the set of possible states
  - An action, a_t ∈ A(s_t), where A(s_t) is the set of actions available in
    state s_t
  - A numerical reward, r_{t+1} ∈ ℝ
• Agent's policy, π_t
  - π_t(s, a): the probability that a_t = a if s_t = s
  - π_t(s) ∈ A(s): the deterministic policy




Goals and Rewards

• Goal
  - What we want to achieve, not how we want to achieve it
• Rewards
  - Formalize the idea of a goal
  - A numerical value given by the environment







Returns

• A specific function of the reward sequence
• Types of returns
  - Episodic tasks
    R_t = r_{t+1} + r_{t+2} + r_{t+3} + ... + r_T, where T is a final time step
  - Continuing tasks (T = ∞)
    The additional concept of a discount rate γ:
    R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... = Σ_{k=0}^{∞} γ^k r_{t+k+1},
    where 0 ≤ γ ≤ 1







Exploration vs. Exploitation

• Exploration
  - To discover better action selections
  - To improve the agent's knowledge
• Exploitation
  - To maximize reward based on what the agent already knows
• Exploration-exploitation dilemma
  - Neither can be pursued exclusively without failing at the task







Markov Property

• State signal retaining all relevant information
• "Independence of path" property
• Formally,
  Pr{s_{t+1} = s', r_{t+1} = r | s_t, a_t, r_t, s_{t-1}, a_{t-1}, ..., r_1, s_0, a_0}
    = Pr{s_{t+1} = s', r_{t+1} = r | s_t, a_t}







Markov Decision Processes (MDP)

• 4-tuple (S, A, T, R)
  - S is a set of states
  - A is a set of actions
  - Transition probabilities:
    T(s, a, s') = Pr{s_{t+1} = s' | s_t = s, a_t = a}, for all s, s' ∈ S, a ∈ A(s)
  - The expected reward:
    R(s, a, s') = E{r_{t+1} | s_t = s, a_t = a, s_{t+1} = s'}
• Finite MDP: the state and action spaces are finite







Example: Gridworld

[Figure: 4×4 gridworld with numbered nonterminal states 1-14]

• S = {1, 2, ..., 14}
• A = {up, down, right, left}
• E.g., T(5, right, 6) = 1, T(5, right, 10) = 0, T(7, right, 7) = 1
• R(s, a, s') = −1 for all s, s', a
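This gridworld can be encoded directly as a finite MDP. The sketch below assumes the usual Sutton & Barto Example 4.1 layout: cells 0..15 numbered row-major, with 0 and 15 the shared terminal state and 1..14 the nonterminal states; moves that would leave the grid leave the state unchanged, which gives e.g. T(7, right, 7) = 1.

ACTIONS = {"up": -4, "down": 4, "left": -1, "right": 1}

def step(s, a):
    """Deterministic successor of state s under action a on the 4x4 grid."""
    row, col = divmod(s, 4)
    if (a == "up" and row == 0) or (a == "down" and row == 3) \
            or (a == "left" and col == 0) or (a == "right" and col == 3):
        return s                               # bumping the wall: stay put
    return s + ACTIONS[a]

def T(s, a, s2):
    return 1.0 if step(s, a) == s2 else 0.0

def R(s, a, s2):
    return -1.0                                # every transition costs -1

assert T(5, "right", 6) == 1 and T(5, "right", 10) == 0 and T(7, "right", 7) == 1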






Value Functions

• "How good" it is to be in a given state, or to perform a given action in a
  given state
• The value of a state s under a policy π
  - The state-value function for policy π
    • Expected return when starting in s and following π:
      V^π(s) = E_π{R_t | s_t = s} = E_π{Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s}
  - The action-value function for policy π
    • Expected return starting from s, taking the action a, and thereafter
      following π:
      Q^π(s, a) = E_π{R_t | s_t = s, a_t = a}
                = E_π{Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s, a_t = a}






Bellman Equation

• Particular recursive relationships of value functions
• The Bellman equation for V^π:
  V^π(s) = E_π{Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s}
         = E_π{r_{t+1} + γ Σ_{k=0}^{∞} γ^k r_{t+k+2} | s_t = s}
         = Σ_a π(s, a) Σ_{s'} T(s, a, s') [R(s, a, s')
             + γ E_π{Σ_{k=0}^{∞} γ^k r_{t+k+2} | s_{t+1} = s'}]
         = Σ_a π(s, a) Σ_{s'} T(s, a, s') [R(s, a, s') + γ V^π(s')]
• The value function is the unique solution to its Bellman equation
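Because V^π is the unique solution of this linear Bellman equation, for a finite MDP it can be computed exactly by solving (I − γ P_π) v = r_π, where P_π and r_π are the state-transition matrix and expected one-step reward under π. A sketch with an invented 2-state MDP:

import numpy as np

gamma = 0.9
P_pi = np.array([[0.5, 0.5],      # P_pi[s, s'] = sum_a pi(s, a) T(s, a, s')
                 [0.2, 0.8]])
r_pi = np.array([1.0, -1.0])      # expected one-step reward under pi

v = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print(v)                          # V^pi for the two states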





Optimal Value Functions

• Policies are partially ordered
  - π ≥ π' if and only if V^π(s) ≥ V^π'(s) for all s ∈ S
• Optimal policy π*
  - A policy that is better than or equal to all other policies
• Optimal state-value function V*
  - V*(s) = max_π V^π(s), for all s ∈ S
• Optimal action-value function Q*
  - Q*(s, a) = max_π Q^π(s, a), for all s ∈ S and a ∈ A(s)
             = E{r_{t+1} + γ V*(s_{t+1}) | s_t = s, a_t = a}






Bellman Optimality Equation

• The Bellman equation for V*:
  V*(s) = max_{a ∈ A(s)} Q^{π*}(s, a)
        = max_a E_{π*}{R_t | s_t = s, a_t = a}
        = max_a E_{π*}{Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s, a_t = a}
        = max_a E_{π*}{r_{t+1} + γ Σ_{k=0}^{∞} γ^k r_{t+k+2} | s_t = s, a_t = a}
        = max_a E{r_{t+1} + γ V*(s_{t+1}) | s_t = s, a_t = a}
        = max_a Σ_{s'} T(s, a, s') [R(s, a, s') + γ V*(s')]





Bellman Optimality Equation (cont’d)

• The Bellman optimality equation for Q*:
  Q*(s, a) = E{r_{t+1} + γ max_{a'} Q*(s_{t+1}, a') | s_t = s, a_t = a}
           = Σ_{s'} T(s, a, s') [R(s, a, s') + γ max_{a'} Q*(s', a')]
• Optimal policy from Q*
  - π*(s) = argmax_{a ∈ A(s)} Q*(s, a)
• Optimal policy from V*
  - Any policy that is greedy with respect to V*, i.e., one achieving
    V*(s) = max_{a ∈ A(s)} Q*(s, a)







Dynamic Programming (DP)

• Algorithms for computing optimal policies given a perfect model of the
  environment
• Limited utility in reinforcement learning, but theoretically important
• A foundation for understanding the other methods







Policy Evaluation

• How to compute the state-value function V^π
• Recall the Bellman equation for V^π:
  V^π(s) = Σ_a π(s, a) Σ_{s'} T(s, a, s') [R(s, a, s') + γ V^π(s')]
• A sequence of approximate value functions V_0, V_1, V_2, ...
• Successive approximation:
  V_{k+1}(s) = Σ_a π(s, a) Σ_{s'} T(s, a, s') [R(s, a, s') + γ V_k(s')]
• V_k converges to V^π as k → ∞
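A sketch of this successive approximation on the 4×4 gridworld above, evaluating the equiprobable random policy with γ = 1; the grid encoding is repeated so the block runs on its own, and the sweep order and stopping threshold are implementation choices:

MOVES = {"up": -4, "down": 4, "left": -1, "right": 1}

def step(s, a):
    row, col = divmod(s, 4)
    if (a == "up" and row == 0) or (a == "down" and row == 3) \
            or (a == "left" and col == 0) or (a == "right" and col == 3):
        return s
    return s + MOVES[a]

def policy_evaluation(gamma=1.0, theta=1e-6):
    V = [0.0] * 16                        # V stays 0 at terminal cells 0 and 15
    while True:
        delta = 0.0
        for s in range(1, 15):            # nonterminal states only
            v = V[s]
            # V(s) <- sum_a pi(s, a) sum_s' T(s, a, s') [R + gamma V(s')]
            V[s] = sum(0.25 * (-1.0 + gamma * V[step(s, a)]) for a in MOVES)
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            return V

print([round(v, 1) for v in policy_evaluation()])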







Policy Improvement

• Policy improvement theorem (proof in Appendix 1)
  - If Q^π(s, π'(s)) ≥ V^π(s) for all s ∈ S, then V^π'(s) ≥ V^π(s)
  - Better to switch action iff Q^π(s, π'(s)) > V^π(s)
• The new greedy policy π'
  - Selecting the action that appears best:
    π'(s) = argmax_a Q^π(s, a)
          = argmax_a Σ_{s'} T(s, a, s') [R(s, a, s') + γ V^π(s')]
• What if V^π' = V^π?
  - Both π' and π are optimal policies





Policy Iteration

• π_0 →(E) V^{π_0} →(I) π_1 →(E) V^{π_1} →(I) π_2 →(E) ... →(I) π* →(E) V*,
  where →(E) denotes a policy evaluation and →(I) denotes a policy improvement
• Policy iteration finishes when the policy is stable







Policy Iteration (cont’d)

Initialization
    V(s) ∈ ℝ and π(s) ∈ A(s) arbitrarily for all s ∈ S

Policy Evaluation
    repeat
        Δ ← 0
        for each s ∈ S do
            v ← V(s)
            V(s) ← Σ_{s'} T(s, π(s), s') [R(s, π(s), s') + γ V(s')]
            Δ ← max(Δ, |v − V(s)|)
        end for
    until Δ < θ (a small positive number)

Policy Improvement
    policy-stable ← true
    for each s ∈ S do
        b ← π(s)
        π(s) ← argmax_a Σ_{s'} T(s, a, s') [R(s, a, s') + γ V(s')]
        if b ≠ π(s) then policy-stable ← false
    end for
    if policy-stable then stop; else go to Policy Evaluation
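The pseudocode transcribed into Python for a generic finite MDP given as nested tables T[s][a][s'] and R[s][a][s']; the 2-state MDP at the bottom is invented purely to exercise the code:

def policy_iteration(S, A, T, R, gamma=0.9, theta=1e-8):
    V = {s: 0.0 for s in S}
    pi = {s: A[0] for s in S}                    # arbitrary initial policy
    while True:
        while True:                              # Policy Evaluation
            delta = 0.0
            for s in S:
                v = V[s]
                V[s] = sum(T[s][pi[s]][s2] * (R[s][pi[s]][s2] + gamma * V[s2])
                           for s2 in S)
                delta = max(delta, abs(v - V[s]))
            if delta < theta:
                break
        policy_stable = True                     # Policy Improvement
        for s in S:
            b = pi[s]
            pi[s] = max(A, key=lambda a: sum(
                T[s][a][s2] * (R[s][a][s2] + gamma * V[s2]) for s2 in S))
            if b != pi[s]:
                policy_stable = False
        if policy_stable:
            return pi, V

S, A = [0, 1], ["stay", "go"]
T = {0: {"stay": {0: 1.0, 1: 0.0}, "go": {0: 0.0, 1: 1.0}},
     1: {"stay": {0: 0.0, 1: 1.0}, "go": {0: 1.0, 1: 0.0}}}
R = {0: {"stay": {0: 0.0, 1: 0.0}, "go": {0: 0.0, 1: 1.0}},
     1: {"stay": {0: 0.0, 1: 2.0}, "go": {0: 0.0, 1: 0.0}}}
print(policy_iteration(S, A, T, R))              # optimal: go from 0, stay in 1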





Value Iteration

• Turning the Bellman optimality equation into an update rule:
  V_{k+1}(s) = max_a E{r_{t+1} + γ V_k(s_{t+1}) | s_t = s, a_t = a}
             = max_a Σ_{s'} T(s, a, s') [R(s, a, s') + γ V_k(s')], for all s ∈ S
• Output policy π such that
  π(s) = argmax_a Σ_{s'} T(s, a, s') [R(s, a, s') + γ V(s')]







Value Iteration (cont’d)

Initialize V arbitrarily, e.g., V(s) = 0 for all s ∈ S

repeat
    Δ ← 0
    for each s ∈ S do
        v ← V(s)
        V(s) ← max_a Σ_{s'} T(s, a, s') [R(s, a, s') + γ V(s')]
        Δ ← max(Δ, |v − V(s)|)
    end for
until Δ < θ (a small positive number)

Output a deterministic policy π such that
    π(s) = argmax_a Σ_{s'} T(s, a, s') [R(s, a, s') + γ V(s')]
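The same pseudocode as runnable Python, reusing the table-form MDP convention of the policy-iteration sketch (the 2-state MDP is again invented):

def value_iteration(S, A, T, R, gamma=0.9, theta=1e-8):
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            v = V[s]
            V[s] = max(sum(T[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                           for s2 in S) for a in A)
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            break
    # output the deterministic greedy policy
    pi = {s: max(A, key=lambda a: sum(T[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                                      for s2 in S)) for s in S}
    return pi, V

S, A = [0, 1], ["stay", "go"]
T = {0: {"stay": {0: 1.0, 1: 0.0}, "go": {0: 0.0, 1: 1.0}},
     1: {"stay": {0: 0.0, 1: 1.0}, "go": {0: 1.0, 1: 0.0}}}
R = {0: {"stay": {0: 0.0, 1: 0.0}, "go": {0: 0.0, 1: 1.0}},
     1: {"stay": {0: 0.0, 1: 2.0}, "go": {0: 0.0, 1: 0.0}}}
print(value_iteration(S, A, T, R))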







Temporal-Difference (TD) Prediction

• Model-free method
• Basic update rule:
  NewEstimate ← OldEstimate + StepSize [Target − OldEstimate]
  where the bracketed term is the error in the estimate
• The simplest TD method, TD(0)
  - Using the full return as the target gives
    V(s_t) ← V(s_t) + α [R_t − V(s_t)]
  - TD(0) replaces R_t with the one-step target:
    V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) − V(s_t)]
  - α: step size
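A sketch of tabular TD(0) for evaluating a fixed random policy; the environment interface (reset() returning a state, step(a) returning (s', r, done)) is an assumption made for illustration:

import random

def td0(env, states, actions, episodes=100, alpha=0.1, gamma=0.9):
    V = {s: 0.0 for s in states}
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = random.choice(actions)        # the policy being evaluated
            s2, r, done = env.step(a)
            target = r + (0.0 if done else gamma * V[s2])
            V[s] += alpha * (target - V[s])   # V(s) += alpha [target - V(s)]
            s = s2
    return V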







Advantages of TD Prediction Methods

• Bootstrapping
  - Estimates on the basis of other estimates (a guess from a guess)
• Over DP methods
  - Model-free
• Waits only one time step before updating
  - Applicable to continuing tasks with no episodes
• Guaranteed convergence to the correct answer
  - With a sufficiently small step size α
  - When all actions are selected infinitely often






Sarsa: On-Policy TD Control

• On-policy
  - Improves the policy that is used to make decisions
• Estimates Q^π under the current policy π
• Applying the TD(0) idea to action values:
  Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)]
  - Performed after every quintuple of events (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}),
    hence the name Sarsa
  - If s_{t+1} is terminal, then Q(s_{t+1}, a_{t+1}) = 0
• Changes π toward greediness w.r.t. Q
• Converges if all state-action pairs are visited infinitely often and the
  policy converges to the greedy policy (e.g., ε = 1/t in ε-greedy)







Sarsa: On-Policy TD Control (cont’d)

Initialize Q(s, a) arbitrarily
Repeat (for each episode):
    Initialize s
    Choose a from s using policy derived from Q (e.g., ε-greedy)
    Repeat (for each step of episode):
        Take action a, observe r, s'
        Choose a' from s' using policy derived from Q (e.g., ε-greedy)
        Q(s, a) ← Q(s, a) + α [r + γ Q(s', a') − Q(s, a)]
        s ← s'; a ← a'
    until s is terminal
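The pseudocode in Python; the environment interface (reset()/step() returning (s', r, done)) and the ε-greedy helper are assumptions for illustration:

import random
from collections import defaultdict

def epsilon_greedy(Q, s, actions, eps):
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa(env, actions, episodes=500, alpha=0.1, gamma=1.0, eps=0.1):
    Q = defaultdict(float)                    # Q(s, a), initialized to 0
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, actions, eps)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = epsilon_greedy(Q, s2, actions, eps)
            target = r if done else r + gamma * Q[(s2, a2)]   # terminal Q = 0
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q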







Q-Learning: Off-Policy TD Control

• Off-policy
  - Behavior policy: generates the experience
  - Estimation policy: the policy being evaluated and improved
    (may be deterministic, e.g., greedy)
• Simplest form, one-step Q-learning, is defined by:
  Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)]
• Directly approximates Q*, the optimal action-value function
• Q_t converges to Q* with probability 1
  - Provided all state-action pairs continue to be updated







Q-Learning: Off-Policy TD Control (cont’d)

Initialize Q(s, a) arbitrarily
Repeat (for each episode):
    Initialize s
    Repeat (for each step of episode):
        Choose a from s using policy derived from Q (e.g., ε-greedy)
        Take action a, observe r, s'
        Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') − Q(s, a)]
        s ← s'
    until s is terminal
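The Q-learning pseudocode in Python, under the same environment and ε-greedy assumptions as the Sarsa sketch:

import random
from collections import defaultdict

def epsilon_greedy(Q, s, actions, eps):
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=1.0, eps=0.1):
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = epsilon_greedy(Q, s, actions, eps)    # behavior policy
            s2, r, done = env.step(a)
            best = 0.0 if done else max(Q[(s2, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])   # greedy target
            s = s2
    return Q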







Example: Cliffwalking

• ε-greedy action selection
  - ε = 0.1 (fixed)
• Sarsa
  - Learns the longer but safer path
• Q-learning
  - Learns the optimal policy
• If ε were gradually reduced,
  - both methods would converge to the optimal policy






Summary

• Goal of reinforcement learning
  - To find an optimal policy that maximizes the long-term reward
• Model-based methods
  - Policy iteration: a sequence of improving policies and value functions
  - Value iteration: backup operations toward V*
• Model-free methods
  - Sarsa estimates Q^π for the behavior policy π and changes π toward
    greediness w.r.t. Q
  - Q-learning directly approximates the optimal action-value function




References

[1] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT
    Press, 1998. Pages 51-158.
[2] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach.
    Prentice Hall, 2003. Pages 613-784.







Q&A

• Thank you







Appendix 1. Policy Improvement Theorem

• Proof
  [Shown as an image on the original slide; not recoverable in this transcript]







Appendix 2. Convergence of Value Iteration

• Proof
  [Shown as an image on the original slide; not recoverable in this transcript]







Appendix 3. Target Estimation

• DP target:
  V(s_t) ← E_π{r_{t+1} + γ V(s_{t+1})}
• Simple TD target:
  V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) − V(s_t)]



