Economic Hierarchical Q-Learning
Erik G. Schultink, Ruggiero Cavallo, and David C. Parkes
Harvard University
AAAI-08, July 17, 2008
Introduction
Economic paradigms applied to hierarchical reinforcement learning
Building on the work of:
Holland classifier system (Holland 1986)
Eric Baum’s Hayek system, with competitive, evolutionary agents that buy and sell control of the world to collectively solve the problem (Baum et al. 1998)
Our thesis is that price systems can help resolve the tension between recursive optimality and hierarchical optimality
We introduce the EHQ algorithm
Each sub-problem solved by a different agent
Leaf nodes are primitive actions; non-leaf nodes are macroactions
State abstraction
Addresses curse-of-dimensionality, leaving smaller state space to explore
Rewards accrue only for primitive actions
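Credit assignment problem: How to distribute reward in the system?
[Figure: Hierarchical Reinforcement Learning. Example task hierarchy with Root at the top, the macroactions "Drive to work" and "Eat Breakfast", and the primitive actions eat donut, drink coffee, eat cereal, stop, drive forward, turn right, turn left.]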
Hierarchical Reinforcement Learning
Decompose an MDP, M, into a set of subtasks {M_0, M_1, …, M_n}, where M_i consists of:
T_i: termination predicate partitioning M_i into active states S_i and exit-states E_i
A_i: set of actions that can be performed in M_i
R_i: local-reward function
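The subtask definition above maps naturally onto a small record type. The following is a minimal sketch, not the authors' implementation: the class name, the field names, the three-argument reward signature, and the toy "leave_left_room" instance are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Hashable

State = Hashable
Action = str  # a primitive action name or the name of a child subtask


@dataclass
class Subtask:
    """One subtask M_i of the decomposed MDP (field names are illustrative)."""
    name: str
    actions: FrozenSet[Action]                             # A_i
    is_exit: Callable[[State], bool]                       # T_i: True on exit-states E_i
    local_reward: Callable[[State, Action, State], float]  # R_i (signature assumed)

    def is_active(self, state: State) -> bool:
        """A state is active (in S_i) exactly when T_i does not hold."""
        return not self.is_exit(state)


# Toy subtask: states are integer positions and the subtask exits at position 3.
leave_room = Subtask(
    name="leave_left_room",
    actions=frozenset({"east", "west"}),
    is_exit=lambda s: s == 3,
    local_reward=lambda s, a, s2: -1.0,
)
```

Here the termination predicate alone determines the partition into active states and exit-states, which is why `is_active` is derived rather than stored.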
Hierarchical Reinforcement Learning
A hierarchical policy π is a set {π_1, π_2, …, π_n}, where π_i is a mapping from a state s to either a primitive action a or a policy π_j
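One way to read this definition is as a chain of lookups: start at the root policy and keep following its choice until a primitive action comes out. The sketch below assumes an acyclic hierarchy, integer states, and hypothetical policy names and thresholds; none of these come from the slides.

```python
from typing import Callable, Dict, List

State = int
# Each pi_i maps a state to a primitive action name or to the name of another policy pi_j.
Policy = Callable[[State], str]

PRIMITIVES = {"east", "west"}  # illustrative primitive action set


def resolve(policies: Dict[str, Policy], root: str, state: State) -> List[str]:
    """Follow policy choices from `root` down to a primitive action.

    Returns the call path, ending in the primitive that would be executed.
    Assumes the hierarchy is acyclic, so the loop terminates.
    """
    path = [root]
    choice = policies[root](state)
    while choice not in PRIMITIVES:
        path.append(choice)
        choice = policies[choice](state)
    path.append(choice)
    return path


policies = {
    "root": lambda s: "leave_left_room" if s < 3 else "reach_goal",
    "leave_left_room": lambda s: "east",
    "reach_goal": lambda s: "east" if s < 5 else "west",
}

path_at_0 = resolve(policies, "root", 0)
path_at_4 = resolve(policies, "root", 4)
```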
HOFuel Domain
Grid world navigation task
A = {north, south, east, west, fill-up}
The fill-up action is available only in the left-hand room
Begin with 5 units of fuel
Based on concepts described by Dietterich (2000).
Hierarchy for HOFuel
[Figure: Root with the macroactions “Leave left room” and “Reach goal”; primitive actions north, east, south, west, fill-up.]
fill-up is available only in the “Leave left room” macroaction
Optimality Concepts
Global Optimality
Hierarchical Optimality: A hierarchically optimal (HO) policy selects the same primitive actions as the optimal policy in every state, subject to the constraints of the hierarchy (Dietterich 2000a).
Recursive Optimality: A policy is recursively optimal (RO) if, for each subtask M_i in the hierarchy, the policy π_i is optimal given the policies for all descendants of M_i in the hierarchy.
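The notes accompanying the slides state that the globally optimal policy is always equivalent to or better than the HO policy, which in turn is always equivalent to or better than the RO policy. Writing V^π(s) for the value of policy π from start state s (notation assumed here, not taken from the slides), that ordering can be written as:

```latex
V^{\pi^{*}}(s) \;\ge\; V^{\pi_{\mathrm{HO}}}(s) \;\ge\; V^{\pi_{\mathrm{RO}}}(s)
\quad \text{for every start state } s
```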
Optimality in HOFuel
[Figure: the hierarchically optimal and recursively optimal policies in HOFuel, shown alongside the hierarchy Root, Leave left room, Reach goal.]
Intuitive Motivation for EHQ
Transfer between agents to incentivize “Leave left room” to choose the upper door over the lower door
[Figure: hierarchy Root, Leave left room, Reach goal.]
Safe State Abstraction
To obtain hierarchical optimality, we must use state abstractions that are safe – that is, the optimal policy in the original space is also optimal in the abstract space.
Principles for safe state abstraction shown in [Dietterich 2000].
Value Decomposition
Different HRL algorithms use different additive decompositions for Q(s,a). In the most general form, Q(s,a) can be decomposed into:
QV(i,s,a): expected discounted reward to i upon completion of a
QC(i,s,a): expected discounted reward to i after a completes, until i exits
QE(i,s,a): expected total discounted reward after subtask i exits
(Dietterich 2000a, Andre and Russell 2002)
QV and QC: local reward to subtask i
QE: reward not seen directly by subtask i
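Read as an additive decomposition, the three components listed on this slide sum to the Q-value as seen from subtask i. A sketch in LaTeX, assuming the slide's Q(s,a) carries the same subtask index i as its components:

```latex
Q(i, s, a) = Q_V(i, s, a) + Q_C(i, s, a) + Q_E(i, s, a)
```

The first two terms cover reward earned while subtask i is still running; the third covers everything that happens after i exits, which is the part a purely local learner does not observe.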
Decentralization An HRL algorithm is decentralized if every agent in the hierarchy needs only locally stored information to select an action.
Summary of Related HRL Algorithms
MAXQQ – [Dietterich 2000]
ALispQ – [Andre and Russell 2002]
HOCQ – [Marthi and Russell 2006]
* shown only empirically
EHQ Transfer System
[Figure: a parent node with three child nodes.]
1. Children submit bids (bid = V*(s) = expected reward they will obtain during execution, including the expected exit-state subsidy).
2. The parent passes control to the “winning” child (based on the exploration policy).
3. The child executes until it reaches an exit-state; reward accrues to the child (in the figure, the child receives +5, +2, -6, +3 while the parent receives 0 at each step).
4. The child returns control and pays its bid to the parent (in the figure: child -4, parent +4).
5. The parent pays the child a subsidy for the exit-state obtained (in the figure: parent -1, child +1).
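The bookkeeping in the transfer steps above can be checked with a few lines of arithmetic. This sketch uses the numbers from the slide figure (step rewards +5, +2, -6, +3, a bid of 4, and a subsidy of 1); the function name `settle` is hypothetical, and discounting is ignored for simplicity, which is an assumption rather than something the slides state.

```python
# Per-step rewards the child collects while it has control (from the slide figure).
child_step_rewards = [5, 2, -6, 3]
bid = 4      # amount the child pays the parent on returning control
subsidy = 1  # amount the parent pays the child for the exit-state reached


def settle(step_rewards, bid, subsidy):
    """Return (parent_net, child_net) after one invocation of a child."""
    earned = sum(step_rewards)          # reward accrues to the child during execution
    child_net = earned - bid + subsidy  # child pays its bid, then receives the subsidy
    parent_net = bid - subsidy          # parent receives the bid, then pays the subsidy
    return parent_net, child_net


parent_net, child_net = settle(child_step_rewards, bid, subsidy)
print(parent_net, child_net)  # 3 1
```

The two transfers cancel between parent and child, so the combined total (3 + 1) equals the 4 units of reward actually earned from the environment.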
EHQ Subsidy Policy
Rather than explicitly modeling QE, EHQ provides subsidies to the child subtask for the quality, from the perspective of the parent, of the exit-state the child achieves
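The slide does not give a formula for the subsidy. The notes accompanying the deck mention limiting the exit-states to those found reachable and normalizing to the minimum reachable one, so one plausible reading is that each exit-state is subsidized by how much better the parent values it than the worst reachable exit-state. The sketch below implements only that assumed rule; the function name, the door labels, and the values 7.0 and 6.0 are hypothetical.

```python
from typing import Dict, Hashable


def exit_subsidies(parent_value: Dict[Hashable, float]) -> Dict[Hashable, float]:
    """Map each reachable exit-state to a non-negative subsidy.

    `parent_value[e]` is the parent's estimate of how good exit-state e is.
    Assumed rule: subsidy(e) = parent_value[e] - min over reachable exit-states.
    Raises ValueError if no exit-states are given.
    """
    baseline = min(parent_value.values())
    return {e: v - baseline for e, v in parent_value.items()}


# Hypothetical parent valuations of the two doors out of the left room.
subsidies = exit_subsidies({"upper_door": 7.0, "lower_door": 6.0})
```

Under this rule the worst reachable exit-state earns no subsidy, so only the relative quality of exit-states shifts the child's incentives.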

  • 32. EHQ Transfer System: During execution, both parent and child update their local Q-values based on their stream of rewards. [Figure: parent +4, -1; child +5, +2, -6, +3, -4, +1.]
  • 35. HOFuel Subsidy Convergence [Figure: hierarchy Root, Leave left room, Reach goal.]
  • 36. Taxi Domain: RO = HO in this domain, which is taken from [Dietterich 2000]
  • 38. EHQ appears to converge, but does not clearly surpass MAXQQ
  • 40. References
      Andre, D., and Russell, S. 2002. State abstraction for programmable reinforcement learning agents. In AAAI-02. Edmonton, Alberta: AAAI Press.
      Baum, E. B., and Durdanovich, I. 1998. Evolution of cooperative problem-solving in an artificial economy. Journal of Artificial Intelligence Research.
      Dean, T., and Lin, S.-H. 1995. Decomposition techniques for planning in stochastic domains. In IJCAI-95, 1121–1127. San Francisco, CA: Morgan Kaufmann Publishers.
      Dietterich, T. G. 2000a. Hierarchical reinforcement learning with MAXQ value function decomposition. Journal of Artificial Intelligence Research 13:227–303.
      Dietterich, T. G. 2000b. State abstraction in MAXQ hierarchical reinforcement learning. Advances in Neural Information Processing Systems 12:994–1000.
      Holland, J. 1986. Escaping brittleness: The possibilities of general purpose learning algorithms applied to parallel rule-based systems. In Machine Learning, volume 2. San Mateo, CA: Morgan Kaufmann.
      Marthi, B.; Russell, S.; and Andre, D. 2006. A compact, hierarchically optimal Q-function decomposition. In UAI-06.
      Parr, R., and Russell, S. 1998. Reinforcement learning with hierarchies of machines. Advances in Neural Information Processing Systems 10.

Editor's Notes

  1. HRL is a variation on RL where the problem is decomposed into a set of sub-problems. These sub-problems can then be solved more-or-less independently and their solutions combined to build a solution to the original problem. There are several potential advantages to this approach. First, state abstraction: in many cases, certain aspects of the original state space can be ignored in the context of a particular sub-problem, allowing that sub-problem to be solved in a much smaller “abstract” state space. Second, the hierarchical structure of the decomposition lends itself to value decomposition: traditional RL Q-values can instead be expressed as a sum of several components, and the components of Q-values can often be re-used, reducing the number of values that must be learned. Additionally, the solution policy to a given sub-problem may be re-usable in other parts of the hierarchy.
  2. Convert to non-technical slide on HRL. Why HRL – allows state abstraction, decompose into sub-problems
  3. To help illustrate these concepts, we introduce the HOFuel domain, constructed to emphasize the distinction between the RO and HO solution policies. It is a grid-world navigation task with a fuel constraint. Running into walls is a no-op with a penalty; add opti
  4. But HRL can introduce a tension for some domains; solving sub-problems without enough regard for how the solutions to individual sub-problems impact the overall solution quality can lead to solutions that are sub-optimal from the perspective of the original problem. Additionally, the structure of the hierarchy itself may artificially limit the solution quality. We thus differentiate between three concepts of optimality. The first, global optimality, is equivalent to the traditional notion of optimality in reinforcement learning.
  5. The second, hierarchical optimality, is equivalent to global optimality except where constrained by the hierarchy.
  6. The third, recursive optimality, is defined as each subtask being solved optimally with respect to the solutions to the sub-problems below it in the hierarchy. The globally optimal solution policy is always equivalent to or better than the HO solution policy. Similarly, the HO solution policy is always equivalent to or better than the RO solution policy. RO is easier, because the agent only has to reason about local rewards. Resolving this tension will be the focus of my work.
  7. We conceptualize the hierarchy as though each sub-problem is being solved by a different agent. Dietterich (2000) noted that exit-reward payments could alter incentives in the problem to make the RO and HO solutions equivalent. We took further inspiration from the Hayek system developed by Eric Baum, which involved agents buying and selling control of the world to solve the problem. Hayek was itself based on Holland classifiers; both systems are applied to traditional RL, not HRL. Hayek as a market-like system; Baum: buying control of the world in an evolutionary context; Holland: RL, not HRL, work.
  8. HRL decompositions can improve learning speed by allowing extraneous state variables within a given subtask to be ignored within that subtask.
  9. EHQ follows this decomposition framework, as do several other HRL algorithms in the literature. Notably, not all model QE explicitly (or at all).
  10. ALispQ and HOCQ provide impressive HO convergence results; however, EHQ can achieve HO using a simple and decentralized pricing mechanism.
  11. Add rewards in timesteps ….
  12. Modeling QE allows for HO convergence, but QE often depends on many state variables, lessening the potential for state abstractions and slowing learning speed. In practice, we found it beneficial to limit E_j to the set of reachable exit-states, as discovered empirically during learning. (Briefly mention the other possible normalizations if time permits.)
  13. Replace this with a high-level overview of the algorithm? (I.e., the agent at each node in the hierarchy does a form of Q-learning to update its local QV and QC values. The parent models the expected reward of invoking a macroaction, implemented by a child agent, by receiving a “bid” from that agent of its expected reward for the given state. When the parent chooses a macroaction to invoke, control is passed to the child agent along with information about what subsidies that child will be paid for its possible exit-states. When the child reaches an exit-state, it receives the subsidy for the state it achieved. Control is returned to the parent, which receives reward equal to the child’s bid less the subsidy it paid the child.)
  14. Normalizing to min reachable (briefly mention the other possible normalizations if time permits)
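Note 13 says that each agent performs a form of Q-learning on its local values, but the notes do not spell out how EHQ splits the update across QV and QC. For orientation only, here is the textbook single-table Q-learning step; it is a generic illustration, not EHQ's update rule, and the function name, learning rate, discount factor, and toy transition are all assumptions.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, next_actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step.

    Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
    If `next_actions` is empty (terminal state), the bootstrap term is 0.
    """
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
    return q[(state, action)]


q = defaultdict(float)  # unseen (state, action) pairs start at 0.0
new_value = q_update(q, state=0, action="east", reward=5.0,
                     next_state=1, next_actions=["east", "west"])
```

In an agent-per-node setting, the `reward` passed in would be whatever that agent receives locally, which for a parent includes the bid collected less the subsidy paid.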