Slides from my oral presentation of my paper at AAAI'08 in Chicago. The paper was co-authored with Ruggiero Cavallo and David Parkes; it was based on my undergraduate thesis, which David advised.
2. Introduction
Economic paradigms applied to hierarchical reinforcement learning. Building on the work of:
- Holland's classifier systems (Holland 1986)
- Eric Baum's Hayek system, with competitive, evolutionary agents that buy and sell control of the world to collectively solve the problem (Baum et al. 1998)
Our thesis is that price systems can help resolve the tension between recursive optimality and hierarchical optimality. We introduce the EHQ algorithm.
9. Hierarchical Reinforcement Learning
Credit assignment problem: how do we distribute reward in the system?
[Hierarchy diagram: Root → {Drive to work, Eat Breakfast}; Eat Breakfast → {eat donut, drink coffee, eat cereal}; Drive to work → {stop, drive forward, turn right, turn left}]
10. Hierarchical Reinforcement Learning
Decompose an MDP, M, into a set of subtasks {M0, M1, …, Mn}, where Mi consists of:
- Ti: termination predicate partitioning Mi into active states Si and exit-states Ei
- Ai: set of actions that can be performed in Mi
- Ri: local-reward function
11. Hierarchical Reinforcement Learning
A hierarchical policy π is a set {π1, π2, …, πn}, where πi is a mapping from a state s to either a primitive action a or a subtask policy πj.
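The subtask decomposition on these slides can be sketched in Python. This is a minimal illustration only; the class and field names are assumptions, not from the paper.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet

# Hypothetical sketch of a subtask M_i: a termination predicate T_i
# (True on exit-states E_i, False on active states S_i), an action set
# A_i (primitives or child-subtask names), and a local-reward function R_i.

State = str

@dataclass(frozen=True)
class SubTask:
    name: str
    terminates: Callable[[State], bool]          # T_i
    actions: FrozenSet[str]                      # A_i
    local_reward: Callable[[State, str], float]  # R_i

# A hierarchical policy pi is a set {pi_1, ..., pi_n}; each pi_i maps a
# state to either a primitive action or the name of a child subtask.
HierarchicalPolicy = dict  # subtask name -> (state -> action-or-subtask name)
```

A usage example for the HOFuel "Leave left room" subtask would construct one `SubTask` whose termination predicate is true at the doorway states.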
12. HOFuel Domain
Grid-world navigation task with A = {north, south, east, west, fill-up}. The fill-up action is available only in the left-hand room. The agent begins with 5 units of fuel. Based on concepts described by Dietterich (2000).
13. Hierarchy for HOFuel
[Hierarchy diagram: Root → {Leave left room, Reach goal}; primitives: north, east, south, west, fill-up. fill-up is available only in the "Leave left room" macroaction.]
15. Optimality Concepts
Global Optimality, Hierarchical Optimality, Recursive Optimality.
A hierarchically optimal (HO) policy selects the same primitive actions as the optimal policy in every state, subject to the constraints of the hierarchy. (Dietterich 2000a)
16. Optimality Concepts
A policy is recursively optimal (RO) if, for each subtask Mi in the hierarchy, the policy πi is optimal given the policies of all descendants of Mi in the hierarchy.
17. Optimality in HOFuel
[Grid-world diagram comparing the hierarchically optimal and recursively optimal paths under the Root → {Leave left room, Reach goal} hierarchy.]
18. Intuitive Motivation for EHQ
Transfers between agents can incentivize "Leave left room" to choose the upper door over the lower door. [Hierarchy: Root → {Leave left room, Reach goal}.]
19. Safe State Abstraction
To obtain hierarchical optimality, we must use state abstractions that are safe; that is, the optimal policy in the original space is also optimal in the abstract space. Principles for safe state abstraction are given in (Dietterich 2000b).
20. Value Decomposition
Different HRL algorithms use different additive decompositions of Q(s,a). In the most general form, Q(s,a) can be decomposed into:
- QV(i,s,a): expected discounted reward to i upon completion of a (local reward to subtask i)
- QC(i,s,a): expected discounted reward to i after a completes, until i exits (local reward to subtask i)
- QE(i,s,a): expected total discounted reward after subtask i exits (reward not seen directly by subtask i)
(Dietterich 2000a; Andre and Russell 2002)
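The three-part additive decomposition above might be sketched as lookup tables. This is an illustrative toy, not the paper's implementation; the table names and sample numbers are assumptions.

```python
from collections import defaultdict

# Three value tables, one per component of the decomposition.
# Unvisited entries default to 0.0.
QV = defaultdict(float)  # expected discounted reward to i during a
QC = defaultdict(float)  # expected discounted reward after a completes, until i exits
QE = defaultdict(float)  # expected discounted reward after subtask i exits

def Q(i, s, a):
    # Full Q-value = reward during a + reward after a until i exits
    # (both local to subtask i) + external reward after i exits.
    return QV[(i, s, a)] + QC[(i, s, a)] + QE[(i, s, a)]
```

The point of the split is that the local terms (QV, QC) typically depend on fewer state variables than QE, which is what the subsidy mechanism later exploits.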
21. Decentralization An HRL algorithm is decentralized if every agent in the hierarchy needs only locally stored information to select an action.
26. EHQ Transfer System
[Diagram: parent node with three child nodes.]
Children submit bids (bid = V*(s) = expected reward they will obtain during execution, including the expected exit-state subsidy).
27. EHQ Transfer System
[Diagram: parent node with three child nodes.]
Parent passes control to the "winning" child (chosen according to the exploration policy).
28. EHQ Transfer System
[Diagram: the executing child accrues a reward stream of +5, +2, −6, +3.]
Child executes until it reaches an exit-state; reward accrues to the child.
29. EHQ Transfer System
[Diagram: child pays its bid of 4 to the parent (parent +4, child −4).]
Child returns control and pays its bid to the parent.
30. EHQ Transfer System
[Diagram: parent pays a subsidy of 1 to the child (parent −1, child +1).]
Parent pays the child a subsidy for the exit-state obtained.
31. EHQ Subsidy Policy
Rather than explicitly modeling QE, EHQ pays the child subtask a subsidy based on the quality, from the perspective of the parent, of the exit-state the child achieves.
32. EHQ Transfer System
[Diagram: the accumulated payments from slides 26–30.]
During execution, both parent and child update their local Q-values based on their streams of rewards.
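One full transfer round from these slides can be traced with toy arithmetic, using the slides' numbers (reward stream +5, +2, −6, +3; a bid of 4; a subsidy of 1). The function name is illustrative, not from the paper.

```python
def run_transfer_round(bid, rewards, subsidy):
    """Trace one parent-child exchange in the EHQ transfer system."""
    parent, child = 0.0, 0.0
    # 1. The winning child executes until it reaches an exit-state,
    #    accruing its stream of local rewards.
    child += sum(rewards)
    # 2. Child returns control and pays its bid to the parent.
    child -= bid
    parent += bid
    # 3. Parent pays the child a subsidy for the exit-state obtained.
    parent -= subsidy
    child += subsidy
    return parent, child

# Numbers taken from the slides: rewards +5, +2, -6, +3; bid 4; subsidy 1.
parent, child = run_transfer_round(bid=4, rewards=[5, 2, -6, 3], subsidy=1)
```

Note that the transfers only redistribute value: the parent's and child's net gains always sum to the reward the child actually collected, so no reward is created or destroyed by the pricing mechanism.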
36. Taxi Domain
RO = HO in this domain, which is taken from (Dietterich 2000a).
38. EHQ appears to converge, but does not clearly surpass MAXQQ
40. References
Andre, D., and Russell, S. 2002. State abstraction for programmable reinforcement learning agents. In AAAI-02. Edmonton, Alberta: AAAI Press.
Baum, E. B., and Durdanovich, I. 1998. Evolution of cooperative problem-solving in an artificial economy. Journal of Artificial Intelligence Research.
Dean, T., and Lin, S.-H. 1995. Decomposition techniques for planning in stochastic domains. In IJCAI-95, 1121–1127. San Francisco, CA: Morgan Kaufmann Publishers.
Dietterich, T. G. 2000a. Hierarchical reinforcement learning with MAXQ value function decomposition. Journal of Artificial Intelligence Research 13:227–303.
41. References
Dietterich, T. G. 2000b. State abstraction in MAXQ hierarchical reinforcement learning. Advances in Neural Information Processing Systems 12:994–1000.
Holland, J. 1986. Escaping brittleness: The possibilities of general purpose learning algorithms applied to parallel rule-based systems. In Machine Learning, volume 2. San Mateo, CA: Morgan Kaufmann.
Marthi, B.; Russell, S.; and Andre, D. 2006. A compact, hierarchically optimal Q-function decomposition. In UAI-06.
Parr, R., and Russell, S. 1998. Reinforcement learning with hierarchies of machines. Advances in Neural Information Processing Systems 10.
Editor's Notes
HRL is a variation on RL where the problem is decomposed into a set of sub-problems. These sub-problems can then be solved more-or-less independently and their solutions combined to build a solution to the original problem. There are several potential advantages to this approach. First, state abstraction: in many cases, certain aspects of the original state space can be ignored in the context of a particular sub-problem, allowing that sub-problem to be solved in a much smaller "abstract" state space. Second, the hierarchical structure of the decomposition lends itself to value decomposition: traditional RL Q-values can instead be expressed as a sum of several components, and these components can often be re-used, reducing the number of values that must be learned. Additionally, the solution policy for a given sub-problem may be re-usable in other parts of the hierarchy.
Convert to non-technical slide on HRL. Why HRL – allows state abstraction, decompose into sub-problems
To help illustrate these concepts, we introduce the HOFuel domain, constructed to emphasize the distinction between the RO and HO solution policies. It is a grid-world navigation task with a fuel constraint. Running into walls is a no-op with a penalty; add opti
But HRL can introduce a tension for some domains: solving sub-problems without enough regard for how their individual solutions affect the overall solution quality can lead to solutions that are sub-optimal from the perspective of the original problem. Additionally, the structure of the hierarchy itself may artificially limit solution quality. We thus differentiate between three concepts of optimality. The first, global optimality, is equivalent to the traditional notion of optimality in reinforcement learning.
The second, Hierarchical optimality, is equivalent to global optimality except where constrained by the hierarchy.
The third, recursive optimality, is defined as each subtask being solved optimally with respect to the solutions of the sub-problems below it in the hierarchy. The globally optimal solution policy is always equivalent to or better than the HO solution; similarly, the HO solution policy is always equivalent to or better than the RO solution policy. RO is easier to achieve, because the agent only has to reason about local rewards. Resolving this tension is the focus of our work.
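The ordering among the three optimality concepts can be written compactly (the V* notation here is ours, not from the slides):

```latex
V^{*}_{\text{global}}(s) \;\ge\; V^{*}_{\text{HO}}(s) \;\ge\; V^{*}_{\text{RO}}(s) \qquad \forall s,
```

where each term is the value, at state s, of the best policy satisfying the corresponding optimality concept. In HOFuel the second inequality is strict, which is what makes the domain a useful illustration.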
We conceptualize the hierarchy as though each sub-problem is solved by a different agent. Dietterich (2000a) noted that exit-reward payments could alter incentives in the problem to make the RO and HO solutions equivalent. We took further inspiration from the Hayek system developed by Eric Baum, in which agents buy and sell control of the world to solve the problem. Hayek was itself based on Holland classifiers; both systems were applied to traditional RL, not HRL.
HRL decompositions can improve learning speed by allowing extraneous state variables within a given subtask to be ignored within that subtask.
EHQ follows this decomposition framework, as do several other HRL algorithms in the literature. Notably, not all of them model QE explicitly (or at all).
ALispQ and HOCQ provide impressive HO convergence results; however, EHQ can achieve HO using a simple, decentralized pricing mechanism.
Add rewards in timesteps ….
Modeling QE allows for HO convergence, but QE often depends on many state variables, lessening the potential for state abstraction and slowing learning. In practice, we found it beneficial to limit Ej to the set of reachable exit-states, as discovered empirically during learning. (Briefly mention the other possible normalizations if time permits.)
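The reachable-exit-state bookkeeping mentioned above might look like the following sketch. The names are hypothetical and the restriction strategy is only what the note describes: the parent quotes subsidies only for exit-states observed so far.

```python
from collections import defaultdict

# Hypothetical sketch: restrict each subtask's exit-state set E_j to the
# exit-states actually observed during learning.
reachable_exits = defaultdict(set)

def record_exit(subtask, exit_state):
    """Called whenever a subtask terminates, noting which exit-state occurred."""
    reachable_exits[subtask].add(exit_state)

def subsidy_table(subtask, parent_value):
    """Quote a subsidy for each empirically reachable exit-state only.

    parent_value: the parent's valuation of an exit-state (a callable).
    """
    return {e: parent_value(e) for e in reachable_exits[subtask]}
```

This keeps the subsidy table small early in learning, at the cost of possibly missing exit-states that have not yet been visited.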
Replace this with a high-level overview of the algorithm? (i.e., the agent at each node in the hierarchy does a form of Q-learning to update its local QV and QC values. The parent models the expected reward of invoking a macroaction, implemented by a child agent, by receiving a "bid" from that agent stating its expected reward in the given state. When the parent chooses a macroaction to invoke, control passes to the child agent along with information about what subsidies the child will be paid for its possible exit-states. When the child reaches an exit-state, it receives the subsidy for the state it achieved. Control returns to the parent, which receives reward equal to the child's bid less the subsidy it paid the child.)
Normalizing to min reachable (briefly mention the other possible normalizations if time permits)