LECTURE 21: REINFORCEMENT LEARNING
Resources:
RSAB/TO: The RL Problem
AM: RL Tutorial
RKVM: RL SIM
TM: Intro to RL
GT: Active Speech
Audio:
URL:
Attributions
The original slides have been incorporated into many machine learning courses, including Tim Oates’ Introduction to Machine Learning, which contains links to several good lectures on various topics in machine learning (and is where I first found these slides).
A slightly more advanced version of the same material is available as part of Andrew Moore’s excellent set of statistical data mining tutorials.
The objectives of this lecture are:
describe the RL problem;
present an idealized form of the RL problem for which we have precise theoretical results;
introduce key components of the mathematics: value functions and Bellman equations;
describe trade-offs between applicability and mathematical tractability.
The Agent Learns A Policy
The policy at step t, πt, is a mapping from states to action probabilities.
πt(s,a) = the probability that at = a when st = s (see the sketch after this list).
Reinforcement learning methods specify how the agent changes its policy as a result of experience.
Roughly, the agent’s goal is to get as much reward as it can over the long run.
Learning can occur in several ways:
Adaptation of classifier parameters based on prior and current data (e.g., many help systems now ask you “was this answer helpful to you”).
Selection of the most appropriate next training pattern during classifier training (e.g., active learning).
Common algorithm design issues include rate of convergence, bias vs. variance, adaptation speed, and batch vs. incremental adaptation.
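To make the policy notation above concrete, here is a minimal sketch of a stochastic tabular policy sampled to pick an action; the states, actions, and probabilities are invented purely for illustration.

```python
import random

# Hypothetical tabular stochastic policy: pi[s][a] = probability of taking
# action a in state s. The battery-level states, action names, and numbers
# are illustrative only; probabilities in each row sum to 1.
pi = {
    "high": {"search": 0.8, "wait": 0.2},
    "low":  {"search": 0.1, "wait": 0.3, "recharge": 0.6},
}

def sample_action(policy, state):
    """Draw an action a with probability policy[state][a]."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action(pi, "high"))   # usually 'search'
```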

  • 22. In general, we want to maximize the expected return, E[Rt], for each step t, where Rt = rt+1 + rt+2 + … + rT and T is a final time step at which a terminal state is reached, ending an episode. (You can view this as a variant of the forward-backward calculation in HMMs.)
  • 23. Here episodic tasks denote a complete transaction (e.g., a play of a game, a trip through a maze, a phone call to a support line).
  • 24. Some tasks do not have a natural episode and can be considered continuing tasks. For these tasks, we can define the return as the discounted sum Rt = rt+1 + γ·rt+2 + γ²·rt+3 + … = Σk γ^k rt+k+1, where γ is the discount rate and 0 ≤ γ ≤ 1. γ close to zero favors short-term returns (shortsighted), while γ close to 1 favors long-term returns. γ can also be thought of as a “forgetting factor” in that, since it is less than one, it weights near-term future rewards more heavily than longer-term future rewards.
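As a small illustration of the two return definitions above, the sketch below computes the episodic return Rt = rt+1 + … + rT and the discounted return Rt = Σk γ^k rt+k+1 for an invented reward sequence.

```python
def episodic_return(rewards):
    # R_t = r_{t+1} + r_{t+2} + ... + r_T for an episode that terminates at T
    return sum(rewards)

def discounted_return(rewards, gamma=0.9):
    # R_t = sum_k gamma^k * r_{t+k+1}, truncated to the rewards we observed
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1, 0, 0, 5]                    # hypothetical r_{t+1}, ..., r_{t+4}
print(episodic_return(rewards))           # 6
print(discounted_return(rewards, 0.9))    # 1 + 0 + 0 + 5*0.9**3 = 4.645
```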
  • 27. Reward = -1 for each step taken when you are not at the top of the hill.
  • 28. Return = -(number of steps)
  • 29. Return is maximized by minimizing the number of steps to reach the top of the hill.
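A quick numeric check of this reward scheme (the step counts are invented for illustration):

```python
# Reward of -1 per step until the car reaches the top of the hill, so the
# return of an episode is just the negative of its length.
steps_policy_a, steps_policy_b = 120, 80          # hypothetical episode lengths
return_a, return_b = -steps_policy_a, -steps_policy_b
print(return_a, return_b)   # -120 -80: the faster policy (b) earns the higher return
```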
  • 30. Other distinctions include deterministic versus dynamic: the context for a task can change as a function of time (e.g., an airline reservation system).
  • 34. wait for someone to bring it a can, or
  • 35. go to home base and recharge. Searching is better but runs down the battery; if the robot runs out of power while searching, it has to be rescued (which is bad). Decisions are made on the basis of the current energy level: high or low. Reward = number of cans collected.
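One way to see the structure of this recycling-robot task is to write the states, actions, and transitions down as tables. The probabilities and reward values below are placeholders chosen for illustration (not the lecture’s numbers), and “search” is included as the action implied by the text above.

```python
# Recycling-robot MDP sketch. States are battery levels; actions are what the
# robot can do in each state. All numbers are illustrative placeholders.
STATES = ["high", "low"]
ACTIONS = {"high": ["search", "wait"],
           "low":  ["search", "wait", "recharge"]}

# P[(s, a)] = list of (next_state, probability, reward); the probabilities for
# each (s, a) pair sum to 1. Running out of power while searching is modeled
# as a large negative reward followed by a recharge.
P = {
    ("high", "search"):   [("high", 0.7, 1.0), ("low", 0.3, 1.0)],
    ("high", "wait"):     [("high", 1.0, 0.2)],
    ("low",  "search"):   [("low", 0.6, 1.0), ("high", 0.4, -3.0)],   # rescued
    ("low",  "wait"):     [("low", 1.0, 0.2)],
    ("low",  "recharge"): [("high", 1.0, 0.0)],
}
```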
  • 37. Value Functions. The value of a state is the expected return starting from that state; it depends on the agent’s policy: Vπ(s) = Eπ[Rt | st = s]. The value of taking an action in a state under policy π is the expected return starting from that state, taking that action, and thereafter following π: Qπ(s,a) = Eπ[Rt | st = s, at = a].
  • 39. So: Vπ(s) = Eπ[rt+1 + γ Vπ(st+1) | st = s].
  • 40. Or, without the expectation operator: Vπ(s) = Σa π(s,a) Σs' P(s'|s,a) [R(s,a,s') + γ Vπ(s')].
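Read as an assignment, the expectation-free form above gives iterative policy evaluation. Here is a minimal tabular sketch that assumes an MDP and policy stored in the table formats used in the earlier sketches (P[(s, a)] = [(s', prob, reward), ...], pi[s][a] = probability).

```python
def policy_evaluation(states, actions, P, pi, gamma=0.9, tol=1e-8):
    """Iteratively apply the Bellman equation for V^pi until the values converge.

    P[(s, a)] = [(next_state, prob, reward), ...];  pi[s][a] = action probability.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = sum(
                pi[s][a] * sum(p * (r + gamma * V[s2]) for s2, p, r in P[(s, a)])
                for a in actions[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

# Example (with the recycling-robot tables and the policy sketched above):
# V = policy_evaluation(STATES, ACTIONS, P, pi, gamma=0.9)
```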
  • 43. V* is the unique solution of this system of nonlinear equations (the Bellman optimality equations, V*(s) = maxa Σs' P(s'|s,a) [R(s,a,s') + γ V*(s')], one equation per state).
  • 44. The optimal action is again found through the maximization process: π*(s) = argmaxa Σs' P(s'|s,a) [R(s,a,s') + γ V*(s')].
  • 46. we have enough space and time to do the computation;
  • 48. But the number of states is often huge (e.g., backgammon has about 10^20 states). We usually have to settle for approximations. Many RL methods can be understood as approximately solving the Bellman optimality equation.
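When the model is known and the state space is small, the Bellman optimality equation can be solved to numerical precision by value iteration. A minimal sketch, using the same hypothetical table format as above, is:

```python
def value_iteration(states, actions, P, gamma=0.9, tol=1e-8):
    """Repeatedly apply the optimality backup V(s) <- max_a sum_s' P(s'|s,a)[r + gamma V(s')]."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            q_values = {a: sum(p * (r + gamma * V[s2]) for s2, p, r in P[(s, a)])
                        for a in actions[s]}
            v_new = max(q_values.values())
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            break
    # Greedy policy with respect to V*: pick the maximizing action in each state
    policy = {s: max(actions[s],
                     key=lambda a: sum(p * (r + gamma * V[s2]) for s2, p, r in P[(s, a)]))
              for s in states}
    return V, policy

# Example: V_star, pi_star = value_iteration(STATES, ACTIONS, P, gamma=0.9)
```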
  • 49. Q-Learning*. Q-learning is a reinforcement learning technique that works by learning an action-value function giving the expected utility of taking a given action in a given state and following a fixed policy thereafter. A strength of Q-learning is that it can compare the expected utilities of the available actions without requiring a model of the environment. The value Q(s,a) is defined to be the expected discounted sum of future payoffs obtained by taking action a from state s and following an optimal policy thereafter. Once these values have been learned, the optimal action from any state is the one with the highest Q-value. The core of the algorithm is a simple value-iteration update. For each state s in the state set S and each action a in the action set A, we update the expected discounted reward with the expression Q(st, at) ← Q(st, at) + αt(s,a) [rt + γ maxa Q(st+1, a) − Q(st, at)], where rt is the observed reward at time t, αt(s,a) is the learning rate (0 ≤ αt(s,a) ≤ 1), and γ is the discount factor (0 ≤ γ < 1). This can be thought of as incrementally maximizing one step ahead (a one-step lookahead); it may not produce the globally optimal solution. * From Wikipedia (http://en.wikipedia.org/wiki/Q-learning)
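The update above translates almost directly into code. The sketch below is a minimal tabular Q-learning loop; the ε-greedy exploration and the env.reset()/env.step() interface are assumptions added for illustration, not part of the slide.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.95, eps=0.1):
    """Tabular Q-learning. `env` is assumed to expose reset() -> s and
    step(a) -> (s_next, reward, done); this interface is illustrative."""
    Q = defaultdict(float)                  # Q[(s, a)], defaults to 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s2, r, done = env.step(a)
            # one-step lookahead target: r + gamma * max_a' Q(s', a')
            target = r + (0.0 if done else gamma * max(Q[(s2, x)] for x in actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```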
  • 51. Policy: stochastic rule for selecting actions
  • 52. Return: the function of future rewards that the agent tries to maximize
  • 65. The need for approximation
  • 66. Other forms of learning such as Q-learning.

Editor's Notes

  1. MS Equation 3.0 was used with settings of: 18, 12, 8, 18, 12.