2. Resources:
   RSAB/TO: The RL Problem
   AM: RL Tutorial
   RKVM: RL SIM
   TM: Intro to RL
   GT: Active Speech
   Audio: URL:
4. The original slides have been incorporated into many machine learning courses, including Tim Oates’ Introduction to Machine Learning, which contains links to several good lectures on various topics in machine learning (and is where I first found these slides).
5. A slightly more advanced version of the same material is available as part of Andrew Moore’s excellent set of statistical data mining tutorials.
17. Adaptation of classifier parameters based on prior and current data (e.g., many help systems now ask you “Was this answer helpful to you?”).
18. Selection of the most appropriate next training pattern during classifier training (e.g., active learning).
22. In general, we want to maximize the expected return, E[R_t], for each step t, where R_t = r_{t+1} + r_{t+2} + … + r_T, and T is a final time step at which a terminal state is reached, ending an episode. (You can view this as a variant of the forward-backward calculation in HMMs.)
23. Here an episodic task denotes a complete transaction (e.g., a play of a game, a trip through a maze, a phone call to a support line).
24. Some tasks do not have a natural episode and can be considered continuing tasks. For these tasks, we can define the return as R_t = Σ_{k=0}^{∞} γ^k r_{t+k+1}, where γ is the discount rate and 0 ≤ γ ≤ 1. A γ close to zero favors short-term returns (shortsighted), while a γ close to 1 favors long-term returns. γ can also be thought of as a “forgetting factor” in that, since it is less than one, it weights near-term future rewards more heavily than longer-term ones.
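To make the two notions of return concrete, here is a small Python sketch; the reward sequence and γ values are made up for illustration and are not from the slides:

```python
# Illustrative reward sequence for one episode: r_1, r_2, ..., r_T
rewards = [0.0, 0.0, 1.0, 0.0, 5.0]

# Episodic return: R_t = r_{t+1} + r_{t+2} + ... + r_T (plain sum to episode end)
episodic_return = sum(rewards)

# Discounted return: R_t = sum_{k=0}^{inf} gamma^k * r_{t+k+1}
def discounted_return(rewards, gamma):
    return sum(gamma ** k * r for k, r in enumerate(rewards))

print(episodic_return)                  # 6.0
print(discounted_return(rewards, 0.1))  # ~0.0105: shortsighted, later rewards nearly vanish
print(discounted_return(rewards, 0.9))  # ~4.09: farsighted, later rewards still count
```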
29. Return is maximized by minimizing the number of steps to reach the top of the hill.
30. Other distinctions include static versus dynamic tasks: the context for a task can change as a function of time (e.g., an airline reservation system).
35. At each step, the robot decides whether to actively search for a can, wait for someone to bring it a can, or go to home base and recharge. Searching is better but runs down the battery; if the robot runs out of power while searching, it has to be rescued (which is bad). Decisions are made on the basis of the current energy level: high or low. Reward = number of cans collected. An MDP sketch of this example follows below.
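One way to make the recycling-robot example concrete is to write it down as an MDP transition table. The sketch below is only an illustration: the slide specifies the states (high, low), the actions, and that reward = cans collected, but the probabilities and reward numbers here are invented placeholders:

```python
# Recycling-robot MDP sketch.
# Transition table maps (state, action) -> list of (probability, next_state, reward).
# All numeric values are illustrative placeholders, not from the slides.
ALPHA, BETA = 0.8, 0.6          # P(battery stays high | search), P(stays low | search)
R_SEARCH, R_WAIT, R_RESCUE = 2.0, 0.5, -3.0   # expected cans found / rescue penalty

mdp = {
    ("high", "search"):   [(ALPHA, "high", R_SEARCH), (1 - ALPHA, "low", R_SEARCH)],
    ("high", "wait"):     [(1.0, "high", R_WAIT)],
    ("low",  "search"):   [(BETA, "low", R_SEARCH),
                           (1 - BETA, "high", R_RESCUE)],  # ran out of power, rescued
    ("low",  "wait"):     [(1.0, "low", R_WAIT)],
    ("low",  "recharge"): [(1.0, "high", 0.0)],
}
```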
37. Value Functions. The value of a state is the expected return starting from that state; it depends on the agent’s policy: V^π(s) = E_π[R_t | s_t = s]. The value of taking an action in a state under policy π is the expected return starting from that state, taking that action, and thereafter following π: Q^π(s, a) = E_π[R_t | s_t = s, a_t = a].
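As a sketch of what these definitions mean computationally, here is iterative policy evaluation, which computes V^π by repeatedly applying the Bellman expectation backup; the two-state MDP and policy are hypothetical:

```python
# Iterative policy evaluation:
#   V(s) <- sum_a pi(a|s) * sum_{s'} P(s'|s,a) * [r + gamma * V(s')]
gamma = 0.9
# Hypothetical toy MDP: P[(s, a)] = list of (prob, next_state, reward)
P = {
    ("A", "stay"): [(1.0, "A", 1.0)],
    ("A", "move"): [(1.0, "B", 0.0)],
    ("B", "stay"): [(1.0, "B", 2.0)],
    ("B", "move"): [(1.0, "A", 0.0)],
}
policy = {"A": {"stay": 0.5, "move": 0.5}, "B": {"stay": 1.0}}

V = {"A": 0.0, "B": 0.0}
for _ in range(1000):  # sweep until effectively converged
    V = {
        s: sum(
            p_a * sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)])
            for a, p_a in policy[s].items()
        )
        for s in V
    }
print(V)  # expected return from each state under this policy
```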
48. BUT the number of states is often huge (e.g., backgammon has about 10^20 states), so we usually have to settle for approximations. Many RL methods can be understood as approximately solving the Bellman optimality equation.
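When the state space is small enough to enumerate, the Bellman optimality equation can be solved directly by value iteration; a minimal sketch, on the same kind of hypothetical toy MDP as above:

```python
# Value iteration: V(s) <- max_a sum_{s'} P(s'|s,a) * [r + gamma * V(s')]
gamma = 0.9
P = {  # hypothetical toy MDP: P[(s, a)] = list of (prob, next_state, reward)
    ("A", "stay"): [(1.0, "A", 1.0)],
    ("A", "move"): [(1.0, "B", 0.0)],
    ("B", "stay"): [(1.0, "B", 2.0)],
    ("B", "move"): [(1.0, "A", 0.0)],
}
actions = {"A": ["stay", "move"], "B": ["stay", "move"]}

V = {"A": 0.0, "B": 0.0}
while True:
    newV = {
        s: max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)])
            for a in actions[s]
        )
        for s in V
    }
    if max(abs(newV[s] - V[s]) for s in V) < 1e-8:  # stop at (near) convergence
        break
    V = newV

# The greedy policy with respect to the converged V is optimal.
policy = {
    s: max(actions[s],
           key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)]))
    for s in V
}
print(V, policy)
```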
49. Q-Learning*. Q-learning is a reinforcement learning technique that works by learning an action-value function giving the expected utility of taking a given action in a given state and following a fixed policy thereafter. A strength of Q-learning is that it can compare the expected utility of the available actions without requiring a model of the environment. The value Q(s, a) is defined to be the expected discounted sum of future payoffs obtained by taking action a from state s and following an optimal policy thereafter. Once these values have been learned, the optimal action from any state is the one with the highest Q-value. The core of the algorithm is a simple value-iteration update. For each state s in the state set S and each action a in the action set A, we can calculate an update to its expected discounted reward with the following expression:

Q(s_t, a_t) ← Q(s_t, a_t) + α_t(s_t, a_t) [r_t + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)]

where r_t is an observed real reward at time t, α_t(s, a) are the learning rates such that 0 ≤ α_t(s, a) ≤ 1, and γ is the discount factor such that 0 ≤ γ < 1. This can be thought of as incrementally maximizing over the next step (a one-step lookahead). It may not produce the globally optimal solution.

* From Wikipedia (http://en.wikipedia.org/wiki/Q-learning)
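A minimal tabular implementation of the update above might look as follows; the environment interface (reset, step, actions) is a hypothetical Gym-style stand-in, not part of the original text:

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning:
    Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    Q = defaultdict(float)  # Q[(state, action)], defaults to 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection: mostly exploit, sometimes explore
            if random.random() < epsilon:
                a = random.choice(env.actions(s))
            else:
                a = max(env.actions(s), key=lambda act: Q[(s, act)])
            s2, r, done = env.step(a)
            # one-step lookahead target; future value is 0 at terminal states
            target = r if done else r + gamma * max(Q[(s2, a2)] for a2 in env.actions(s2))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```

Because the target uses max over next actions rather than the action actually taken, the learned Q-values estimate the optimal policy even while the agent explores (an off-policy method), matching the model-free property noted above.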