2. Resources:
   RSAB/TO: The RL Problem
   AM: RL Tutorial
   RKVM: RL SIM
   TM: Intro to RL
   GT: Active Speech
   Audio: URL:
4. The original slides have been incorporated into many machine learning courses, including Tim Oates’ Introduction to Machine Learning, which contains links to several good lectures on various topics in machine learning (and is where I first found these slides).
5. A slightly more advanced version of the same material is available as part of Andrew Moore’s excellent set of statistical data mining tutorials.
17. Adaptation of classifier parameters based on prior and current data (e.g., many help systems now ask you “Was this answer helpful to you?”).
18. Selection of the most appropriate next training pattern during classifier training (e.g., active learning).
22. In general, we want to maximize the expected return, E[R_t], for each step t, where R_t = r_{t+1} + r_{t+2} + … + r_T, and T is a final time step at which a terminal state is reached, ending an episode. (You can view this as a variant of the forward-backward calculation in HMMs.)
23. Here an episodic task denotes a complete transaction (e.g., a play of a game, a trip through a maze, a phone call to a support line).
24. Some tasks do not have a natural episode and can be considered continuing tasks. For these tasks, we can define the return as R_t = Σ_{k=0}^{∞} γ^k r_{t+k+1}, where γ is the discount rate and 0 ≤ γ ≤ 1. A γ close to zero favors short-term returns (shortsighted), while a γ close to 1 favors long-term returns. γ can also be thought of as a “forgetting factor” in that, since it is less than one, it weights near-term future rewards more heavily than longer-term ones.
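To make the two notions of return concrete, here is a small Python sketch; the reward sequence and γ values are made up for illustration and are not from the slides:

```python
# Illustrative reward sequence for one episode: r_1, r_2, ..., r_T
rewards = [0.0, 0.0, 1.0, 0.0, 5.0]

# Episodic return: R_t = r_{t+1} + r_{t+2} + ... + r_T (plain sum to episode end)
episodic_return = sum(rewards)

# Discounted return: R_t = sum_{k=0}^{inf} gamma^k * r_{t+k+1}
def discounted_return(rewards, gamma):
    return sum(gamma ** k * r for k, r in enumerate(rewards))

print(episodic_return)                  # 6.0
print(discounted_return(rewards, 0.1))  # ~0.0105: shortsighted, later rewards nearly vanish
print(discounted_return(rewards, 0.9))  # ~4.09: farsighted, later rewards still count
```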
29. Return is maximized by minimizing the number of steps to reach the top of the hill.
30. Other distinctions include static versus dynamic tasks: the context for a task can change as a function of time (e.g., an airline reservation system).
35. At each step, the robot decides whether to actively search for a can, wait for someone to bring it a can, or go to home base and recharge. Searching is better but runs down the battery; if the robot runs out of power while searching, it has to be rescued (which is bad). Decisions are made on the basis of the current energy level: high or low. Reward = number of cans collected. An MDP sketch of this example follows below.
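One way to make the recycling-robot example concrete is to write it down as an MDP transition table. The sketch below is only an illustration: the slide specifies the states (high, low), the actions, and that reward = cans collected, but the probabilities and reward numbers here are invented placeholders:

```python
# Recycling-robot MDP sketch.
# Transition table maps (state, action) -> list of (probability, next_state, reward).
# All numeric values are illustrative placeholders, not from the slides.
ALPHA, BETA = 0.8, 0.6          # P(battery stays high | search), P(stays low | search)
R_SEARCH, R_WAIT, R_RESCUE = 2.0, 0.5, -3.0   # expected cans found / rescue penalty

mdp = {
    ("high", "search"):   [(ALPHA, "high", R_SEARCH), (1 - ALPHA, "low", R_SEARCH)],
    ("high", "wait"):     [(1.0, "high", R_WAIT)],
    ("low",  "search"):   [(BETA, "low", R_SEARCH),
                           (1 - BETA, "high", R_RESCUE)],  # ran out of power, rescued
    ("low",  "wait"):     [(1.0, "low", R_WAIT)],
    ("low",  "recharge"): [(1.0, "high", 0.0)],
}
```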
37. Value Functions. The value of a state is the expected return starting from that state; it depends on the agent’s policy: V^π(s) = E_π[R_t | s_t = s]. The value of taking an action in a state under policy π is the expected return starting from that state, taking that action, and thereafter following π: Q^π(s, a) = E_π[R_t | s_t = s, a_t = a].
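As a sketch of what these definitions mean computationally, here is iterative policy evaluation, which computes V^π by repeatedly applying the Bellman expectation backup; the two-state MDP and policy are hypothetical:

```python
# Iterative policy evaluation:
#   V(s) <- sum_a pi(a|s) * sum_{s'} P(s'|s,a) * [r + gamma * V(s')]
gamma = 0.9
# Hypothetical toy MDP: P[(s, a)] = list of (prob, next_state, reward)
P = {
    ("A", "stay"): [(1.0, "A", 1.0)],
    ("A", "move"): [(1.0, "B", 0.0)],
    ("B", "stay"): [(1.0, "B", 2.0)],
    ("B", "move"): [(1.0, "A", 0.0)],
}
policy = {"A": {"stay": 0.5, "move": 0.5}, "B": {"stay": 1.0}}

V = {"A": 0.0, "B": 0.0}
for _ in range(1000):  # sweep until effectively converged
    V = {
        s: sum(
            p_a * sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)])
            for a, p_a in policy[s].items()
        )
        for s in V
    }
print(V)  # expected return from each state under this policy
```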
48. BUT the number of states is often huge (e.g., backgammon has about 10^20 states), so we usually have to settle for approximations. Many RL methods can be understood as approximately solving the Bellman optimality equation.
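When the state space is small enough to enumerate, the Bellman optimality equation can be solved directly by value iteration; a minimal sketch, on the same kind of hypothetical toy MDP as above:

```python
# Value iteration: V(s) <- max_a sum_{s'} P(s'|s,a) * [r + gamma * V(s')]
gamma = 0.9
P = {  # hypothetical toy MDP: P[(s, a)] = list of (prob, next_state, reward)
    ("A", "stay"): [(1.0, "A", 1.0)],
    ("A", "move"): [(1.0, "B", 0.0)],
    ("B", "stay"): [(1.0, "B", 2.0)],
    ("B", "move"): [(1.0, "A", 0.0)],
}
actions = {"A": ["stay", "move"], "B": ["stay", "move"]}

V = {"A": 0.0, "B": 0.0}
while True:
    newV = {
        s: max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)])
            for a in actions[s]
        )
        for s in V
    }
    if max(abs(newV[s] - V[s]) for s in V) < 1e-8:  # stop at (near) convergence
        break
    V = newV

# The greedy policy with respect to the converged V is optimal.
policy = {
    s: max(actions[s],
           key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)]))
    for s in V
}
print(V, policy)
```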
49. Q-Learning*. Q-learning is a reinforcement learning technique that works by learning an action-value function giving the expected utility of taking a given action in a given state and following a fixed policy thereafter. A strength of Q-learning is that it can compare the expected utility of the available actions without requiring a model of the environment. The value Q(s, a) is defined to be the expected discounted sum of future payoffs obtained by taking action a from state s and following an optimal policy thereafter. Once these values have been learned, the optimal action from any state is the one with the highest Q-value. The core of the algorithm is a simple value-iteration update. For each state s in the state set S and each action a in the action set A, we can calculate an update to its expected discounted reward with the following expression:

Q(s_t, a_t) ← Q(s_t, a_t) + α_t(s_t, a_t) [r_t + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)]

where r_t is an observed real reward at time t, α_t(s, a) are the learning rates such that 0 ≤ α_t(s, a) ≤ 1, and γ is the discount factor such that 0 ≤ γ < 1. This can be thought of as incrementally maximizing over the next step (a one-step lookahead). It may not produce the globally optimal solution.

* From Wikipedia (http://en.wikipedia.org/wiki/Q-learning)
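A minimal tabular implementation of the update above might look as follows; the environment interface (reset, step, actions) is a hypothetical Gym-style stand-in, not part of the original text:

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning:
    Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    Q = defaultdict(float)  # Q[(state, action)], defaults to 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection: mostly exploit, sometimes explore
            if random.random() < epsilon:
                a = random.choice(env.actions(s))
            else:
                a = max(env.actions(s), key=lambda act: Q[(s, act)])
            s2, r, done = env.step(a)
            # one-step lookahead target; future value is 0 at terminal states
            target = r if done else r + gamma * max(Q[(s2, a2)] for a2 in env.actions(s2))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```

Because the target uses max over next actions rather than the action actually taken, the learned Q-values estimate the optimal policy even while the agent explores (an off-policy method), matching the model-free property noted above.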