Reinforcement Learning for Difficult Settings
1. SUBGOAL LEARNING, MACRO-ACTIONS, PARTIAL OBSERVATION, CLUSTERING OF FEATURES, and other stuff for difficult reinforcement learning settings.
One slide off topic, sorry :-)
Yesterday I was particularly interested in the discussion around deep networks and convolutional networks by Yann LeCun et al. for computer vision, thanks :-)
Seemingly, computational power is a big part of computer vision, right?
2. MASH WP6 – Goal Planning
Controlling a 3D avatar or a robot arm:
- without expert help
- without a model
- without parallel runs
- without knowing the target
- with expensive runs
- using existing human expertise, if any, in a way compliant with crowd-sourcing (the human does not know the platform)
3. Category of problems
● MDP solving: you have access to the model
● Generative models:
– cases in which you can “undo”
– cases in which you cannot
The hardest reinforcement learning setting you can find, with expensive simulations.
4. Goals of the project
– Adapting MCTS for such problems
– Parallel model-free MCTS
– Facilitating and Testing Crowd-Sourcing
– Other methods for such problems
5. Outline
● What we have done
– Extension to partially observable, expensive, “very” model-free problems
– Experiments on other WPs' testbeds
– Experiments on our testbeds
● Parallelization
● Conclusions
● Perspectives
6. MCTS / UCT
● MCTS = UCT (nearly)
● Very good for high-dimensional problems with little expertise
● Requires many simulations
● Principle:
– run plenty of simulations
– adaptive decisions: the first simulations use naive strategies, then the simulation strategy is improved online (see the sketch below)
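To make the principle concrete, here is a minimal UCT sketch in Python. It is a sketch under assumptions, not the project's code: the environment interface (`legal_actions`, `step`, `reward`, `is_terminal`) is hypothetical, and the rollout policy is plain random rather than the adaptively improved one described above.

```python
import math
import random

def uct_search(root_state, env, n_simulations=1000, c=1.4):
    """Minimal UCT at the root: simulate a lot, then pick the most-visited action."""
    stats = {a: [0, 0.0] for a in env.legal_actions(root_state)}  # action -> [visits, total reward]
    for i in range(1, n_simulations + 1):
        def ucb(a):
            n, w = stats[a]
            # Unvisited actions first; otherwise UCB1 trades off mean reward and exploration.
            return float("inf") if n == 0 else w / n + c * math.sqrt(math.log(i) / n)
        action = max(stats, key=ucb)
        reward = rollout(env.step(root_state, action), env)
        stats[action][0] += 1
        stats[action][1] += reward
    return max(stats, key=lambda a: stats[a][0])

def rollout(state, env, horizon=100):
    """Default policy: a random playout (real MCTS improves this online)."""
    total = 0.0
    for _ in range(horizon):
        if env.is_terminal(state):
            break
        state = env.step(state, random.choice(env.legal_actions(state)))
        total += env.reward(state)
    return total
```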
7. Change #1: Macro-Actions (MA)
● With low-level decisions, actions often need to be repeated to be meaningful
● Example with left-right:
RRRRRR makes sense
LLLLLLLL makes sense
RLLLRRL makes no sense
● Automatically categorize actions (eventually stationary, opposite, cyclic) and define MAs accordingly (see the sketch below)
==> state of the art + automation
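A hypothetical sketch of the categorization idea, assuming a numeric state vector and the same `env.step` interface as above; the actual MASH categorizer is not shown in the slides. An action is classified by repeating it and inspecting the resulting trajectory, and macro-actions are then simple repetitions.

```python
import numpy as np

def categorize_action(env, state, action, n_repeats=20, tol=1e-6):
    """Classify an action by repeating it from `state` and inspecting the trajectory."""
    traj = [np.asarray(state)]
    for _ in range(n_repeats):
        traj.append(np.asarray(env.step(traj[-1], action)))
    if np.allclose(traj[-1], traj[-2], atol=tol):
        return "eventually-stationary"  # repetition converges, so RRRR... makes sense
    for period in range(1, n_repeats // 2):
        if np.allclose(traj[-1], traj[-1 - period], atol=tol):
            return "cyclic"             # repetition loops with some period
    return "other"  # "opposite" pairs would need a pairwise test: does B undo A?

def make_macro_actions(actions, lengths=(2, 4, 8)):
    """MA = one action repeated k times; RLLLRRL-style mixes are never generated."""
    return [(a,) * k for a in actions for k in lengths]
```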
8. Change #2: Clustering Features
● Many state variables are very similar
● Clustering:
– perform simulations
– group correlated features (see the sketch below)
● Strongly reduces the state-space dimension
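A minimal sketch of correlation-based grouping, assuming the simulations have been logged into an (n_steps, n_features) array; the threshold and the greedy merge rule are illustrative choices, not the slides' exact procedure.

```python
import numpy as np

def cluster_features(trace, threshold=0.95):
    """Greedily group features whose absolute pairwise correlation exceeds `threshold`.

    `trace`: array of shape (n_steps, n_features) collected from simulations.
    Returns a list of clusters (lists of feature indices); keeping one
    representative per cluster strongly reduces the state-space dimension.
    """
    corr = np.corrcoef(trace, rowvar=False)
    n = corr.shape[0]
    cluster_of = [-1] * n
    clusters = []
    for i in range(n):
        if cluster_of[i] != -1:
            continue
        members, cluster_of[i] = [i], len(clusters)
        for j in range(i + 1, n):
            if cluster_of[j] == -1 and abs(corr[i, j]) >= threshold:
                cluster_of[j] = len(clusters)
                members.append(j)
        clusters.append(members)
    return clusters
```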
9.–13. Change #3: Memory
● Partially observable problems require memory
● Tree of subgoals (this slide is built up incrementally; the diagram's annotations are listed below):
– “I choose this action”
– each node contains a goal, i.e. features to be activated
– decisions are made by “voting”: MAs correlated with expected transitions
– MCTS is an extremely natural tool for building subgoals (see the sketch below)
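A hypothetical data-structure sketch of such a subgoal tree; the field names are illustrative, not taken from the MASH code. The point is that a node stores a goal (features to activate) rather than a state, and the root-to-node path plays the role of memory under partial observability.

```python
from dataclasses import dataclass, field

@dataclass
class SubgoalNode:
    """One node of a GMCTS-style subgoal tree (illustrative sketch)."""
    goal_features: frozenset                 # indices of the features to be activated
    parent: "SubgoalNode | None" = None      # path from the root = the agent's memory
    children: list = field(default_factory=list)
    visits: int = 0                          # MCTS statistics, as in a plain UCT node
    total_reward: float = 0.0

    def expand(self, goal_features):
        """Node creation as in MCTS, except the new node is a subgoal."""
        child = SubgoalNode(frozenset(goal_features), parent=self)
        self.children.append(child)
        return child
```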
14. Summary of CluVo + GMCTS: all in one slide
1) Simulations, categorization of actions
2) Building of macro-actions
3) Clustering of features
4) MCTS by simulations, correlations, voting:
1) node creation as in MCTS, but node = subgoal
2) simulations biased by rewards, as in MCTS
3) goals → votes → MA → decisions
Vote: actions which statistically activate the goal features, in the current state, are preferred (see the sketch below).
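One possible reading of the voting rule, as a sketch: score each macro-action by how often past simulations from similar states saw it activate the goal features, and prefer the best scorer. The statistics table, the state abstraction, and all names are assumptions for illustration.

```python
def vote(goal_key, macro_actions, activation_stats, state):
    """Prefer the MA that statistically activates the goal features in this state.

    activation_stats[(state_key, goal_key, ma)] = (times_tried, times_goal_activated),
    accumulated from earlier simulations (hypothetical bookkeeping).
    """
    def score(ma):
        tried, activated = activation_stats.get((state_key(state), goal_key, ma), (0, 0))
        return activated / tried if tried > 0 else 0.0
    return max(macro_actions, key=score)

def state_key(state):
    """Illustrative state abstraction, e.g. coarsened cluster representatives."""
    return tuple(round(x, 1) for x in state)
```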
15. Other developments
● Q-learning
● Fitted Q-iteration
● Direct Policy Search
Main issue: representation (macro-actions, clusters of features).
==> Direct Policy Search
● also uses MAs
● but needs a memory
==> GMCTS is quite convenient for focusing simulations.
16. Results of CluVo + GMCTS on other WPs' test cases
● Blue flag then red flag: ok
● Looks easy, but in a fully agnostic framework with thousands of variables it is not that easy
● The same algorithm performed correctly on “catch as many flags as possible”
● Combines many elements of the state of the art:
– macro-actions
– subgoal learning
– clustering of features
– MCTS / UCT
17. All you can eat: DPS could do it,
with MA (no memory needed)
18. Blue Flag then Red Flag: clustering ok for 3 out of 12 runs
[Figure: results after 8h of learning, and generalization]
19. Tests on test cases from other WPs
● This has taken most of the manpower: easy problems, but in a very difficult setting
● No crowd-sourcing
● We have other testbeds with external developers (same platform)
20. Results on the game of Go (~8 contributors)
● Automatic modifications of the bandit (moderate success, far less efficient than supervised learning on databases or expert handcrafting)
==> maths for crowd-sourcing:
– automated regression testing by MSHT (good for crowd-sourcing)
– constraints on the way humans enter expertise, for preserving consistency
● Automatic precomputation of moves (opening books)
==> both are quite parallel, but very expensive for moderate progress
21. Results on Urban Rivals (18 million players, ~15 developers on the core)
● Also partially observable, but information is fully revealed frequently
● Easy to simulate
● MCTS was great for this application:
– no human expertise needed
– consistent, independently of human expertise
● Solved the problem, whereas many engineers had failed
23. Results on energy management (~7–8 developers, high turnover)
● Existing solutions are often very poor for short-term volatility and the high dimension of the state space
● Simulation-based approaches: rigorous use of cross-validation, detailed non-simplified simulations
● MCTS + DPS (for choosing the default policy): stable and efficient (see the sketch below)
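A sketch of the MCTS + DPS hybridization, under assumptions: DPS is rendered here as a simple (1+1)-style evolutionary search over the parameters of a rollout policy, and the `policy_family`/`env` interfaces are hypothetical. The optimized policy is then plugged in as the MCTS default (rollout) policy in place of a random one.

```python
import random

def dps_optimize(policy_family, env, n_iters=200, sigma=0.1, horizon=50):
    """Direct Policy Search, rendered as a (1+1)-style evolutionary search."""
    theta = policy_family.init_params()
    best = evaluate(policy_family, theta, env, horizon)
    for _ in range(n_iters):
        candidate = [t + random.gauss(0.0, sigma) for t in theta]  # mutate parameters
        value = evaluate(policy_family, candidate, env, horizon)
        if value >= best:
            theta, best = candidate, value                         # keep improvements
    return theta  # use policy_family.act(theta, state) as the MCTS default policy

def evaluate(policy_family, theta, env, horizon):
    """Cumulative reward of one simulated episode under the parametric policy."""
    state, total = env.reset(), 0.0
    for _ in range(horizon):
        state = env.step(state, policy_family.act(theta, state))
        total += env.reward(state)
    return total
```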
25. Conclusions (1)
● MCTS adapted to partially observable, expensive, agnostic settings
– MCTS + all existing tricks from the state of the art
– integration into MCTS is probably more natural than in many algorithms (in particular for subgoal learning)
– a big implementation and experimentation effort; more publications to come
● An unexpected positive result:
● a merge of two simulation-based tools, DPS (long-term effects) and MCTS (short-term effects)
● quite natural, highly parallel
● virtually no model bias
● really efficient in stochastic settings (not adversarial)
26. Conclusions (2)
● We tested on problems with external developers:
● simulation-based optimization makes precise models possible
● interface with humans: automatic non-regression testing / constraints for consistency / interface for using human knowledge: ok
● WP problems did not motivate alternate developers, but the principles could be tested on other test cases
● No real crowd-sourcing, but moderately sized teams of motivated developers ==> easier
● Principles developed for the platform are reused for an industrial platform
27. Perspectives
● The GMCTS / CluVo program is stable and able to work in very hard settings by combining many state-of-the-art techniques; it can have a long life; no crowd-sourcing
● The application of simulation-based methods is efficient, compliant with non-linear stochastic dynamics, and parallel ==> validated for industrialization in energy management
● MCTS variants for partially observable settings also work well far from the hard MASH setting
28. Publication
● Main MASH publication: J.-B. Hoock's 2012 paper:
● categorizing actions for automatically designing macro-actions
● clustering of features
● GMCTS for building subgoals
● Many ideas in it; it is a big part of his Ph.D. in one article.
29.–34. Publications (the same list is shown six times with incremental callout annotations; the annotations are folded in below)
● Undecidability of adversarial planning with unbounded horizon:
– planning with adversarial uncertainties, a finite state space, no observation, and a deterministic problem: the optimal average reward is undecidable and unapproximable
– planning with adversarial uncertainties, a finite state space, partial observation, and a stochastic problem which, for all strategies, stops almost surely: the optimal average reward is decidable
● DPS: convergence rates of robust optimization / noisy optimization:
– optimal rates in the parallel case for robust optimization w.r.t. monotonous compositions ==> essentially, bounds and patches for evolutionary computation
– optimal rates for noisy quadratic optimization, with black-box regret linear in the variance
● Parallel MCTS / nested MCTS
● MCTS in continuous settings: consistency proof in the continuous case
● MCTS for the PO setting (real-world: Urban Rivals)
● Model-free MCTS
● Hybridization MCTS/DPS
● Simulation-based optimization in power systems