Reinforcement Learning for Difficult Settings
1. SUBGOAL LEARNING, MACRO-ACTIONS, PARTIAL OBSERVATION, CLUSTERING OF FEATURES, and other stuff for difficult reinforcement learning settings.
One slide off topic, sorry :-)
Yesterday I was particularly interested in the discussion around deep networks and convolutional networks by Yann LeCun et al. for computer vision, thanks :-)
Seemingly, computational power is a big part of computer vision, right?
2. MASH WP6 – Goal Planning
Controlling a 3D avatar or a robot arm:
- without expert help
- without a model
- without parallel runs
- without knowing the target
- with expensive runs
- using existing human expertise, if any, in a way compliant with crowd-sourcing (the human does not know the platform)
3. Category of problems
● MDP solving: you have access to the model
● Generative models:
– cases in which you can “undo”
– cases in which you cannot
The hardest reinforcement learning setting you can find, with expensive simulations.
4. Goals of the project
– Adapting MCTS for such problems
– Parallel model-free MCTS
– Facilitating and Testing Crowd-Sourcing
– Other methods for such problems
5. Outline
● What we have done
– Extension to partially observable, expensive, “very” model-free problems
– Experiments on other WPs' testbeds
– Experiments on our testbeds
● Parallelization
● Conclusions
● Perspectives
6. MCTS / UCT
● MCTS = UCT (nearly)
● Very good for high-dimensional problems with little expertise
● Requires many simulations
● Principle:
– run plenty of simulations
– adaptive decisions: the first simulations use naive strategies, then the simulation strategy is improved online (see the sketch below)
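To make the principle concrete, here is a minimal UCT sketch in Python. It is a sketch under assumptions, not the project's code: the environment interface (`legal_actions`, `step`, `reward`, `is_terminal`) is hypothetical, and the rollout policy is plain random rather than the adaptively improved one described above.

```python
import math
import random

def uct_search(root_state, env, n_simulations=1000, c=1.4):
    """Minimal UCT at the root: simulate a lot, then pick the most-visited action."""
    stats = {a: [0, 0.0] for a in env.legal_actions(root_state)}  # action -> [visits, total reward]
    for i in range(1, n_simulations + 1):
        def ucb(a):
            n, w = stats[a]
            # Unvisited actions first; otherwise UCB1 trades off mean reward and exploration.
            return float("inf") if n == 0 else w / n + c * math.sqrt(math.log(i) / n)
        action = max(stats, key=ucb)
        reward = rollout(env.step(root_state, action), env)
        stats[action][0] += 1
        stats[action][1] += reward
    return max(stats, key=lambda a: stats[a][0])

def rollout(state, env, horizon=100):
    """Default policy: a random playout (real MCTS improves this online)."""
    total = 0.0
    for _ in range(horizon):
        if env.is_terminal(state):
            break
        state = env.step(state, random.choice(env.legal_actions(state)))
        total += env.reward(state)
    return total
```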
7. Change #1: Macro-Actions (MA)
● With low-level decisions, actions often need to be repeated to be meaningful
● Example with left-right:
RRRRRR makes sense
LLLLLLLL makes sense
RLLLRRL makes no sense
● Automatically categorize actions (eventually stationary, opposite, cyclic) and define MAs accordingly (see the sketch below)
==> state of the art + automation
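A hypothetical sketch of the categorization idea, assuming a numeric state vector and the same `env.step` interface as above; the actual MASH categorizer is not shown in the slides. An action is classified by repeating it and inspecting the resulting trajectory, and macro-actions are then simple repetitions.

```python
import numpy as np

def categorize_action(env, state, action, n_repeats=20, tol=1e-6):
    """Classify an action by repeating it from `state` and inspecting the trajectory."""
    traj = [np.asarray(state)]
    for _ in range(n_repeats):
        traj.append(np.asarray(env.step(traj[-1], action)))
    if np.allclose(traj[-1], traj[-2], atol=tol):
        return "eventually-stationary"  # repetition converges, so RRRR... makes sense
    for period in range(1, n_repeats // 2):
        if np.allclose(traj[-1], traj[-1 - period], atol=tol):
            return "cyclic"             # repetition loops with some period
    return "other"  # "opposite" pairs would need a pairwise test: does B undo A?

def make_macro_actions(actions, lengths=(2, 4, 8)):
    """MA = one action repeated k times; RLLLRRL-style mixes are never generated."""
    return [(a,) * k for a in actions for k in lengths]
```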
8. Change #2: Clustering Features
● Many state variables are very similar
● Clustering:
– perform simulations
– group correlated features (see the sketch below)
● Strongly reduces the state-space dimension
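A minimal sketch of correlation-based grouping, assuming the simulations have been logged into an (n_steps, n_features) array; the threshold and the greedy merge rule are illustrative choices, not the slides' exact procedure.

```python
import numpy as np

def cluster_features(trace, threshold=0.95):
    """Greedily group features whose absolute pairwise correlation exceeds `threshold`.

    `trace`: array of shape (n_steps, n_features) collected from simulations.
    Returns a list of clusters (lists of feature indices); keeping one
    representative per cluster strongly reduces the state-space dimension.
    """
    corr = np.corrcoef(trace, rowvar=False)
    n = corr.shape[0]
    cluster_of = [-1] * n
    clusters = []
    for i in range(n):
        if cluster_of[i] != -1:
            continue
        members, cluster_of[i] = [i], len(clusters)
        for j in range(i + 1, n):
            if cluster_of[j] == -1 and abs(corr[i, j]) >= threshold:
                cluster_of[j] = len(clusters)
                members.append(j)
        clusters.append(members)
    return clusters
```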
9.–13. Change #3: Memory
● Partially observable problems require memory
● Tree of subgoals (this slide is built up incrementally; the diagram's annotations are listed below):
– “I choose this action”
– each node contains a goal, i.e. features to be activated
– decisions are made by “voting”: MAs correlated with expected transitions
– MCTS is an extremely natural tool for building subgoals (see the sketch below)
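A hypothetical data-structure sketch of such a subgoal tree; the field names are illustrative, not taken from the MASH code. The point is that a node stores a goal (features to activate) rather than a state, and the root-to-node path plays the role of memory under partial observability.

```python
from dataclasses import dataclass, field

@dataclass
class SubgoalNode:
    """One node of a GMCTS-style subgoal tree (illustrative sketch)."""
    goal_features: frozenset                 # indices of the features to be activated
    parent: "SubgoalNode | None" = None      # path from the root = the agent's memory
    children: list = field(default_factory=list)
    visits: int = 0                          # MCTS statistics, as in a plain UCT node
    total_reward: float = 0.0

    def expand(self, goal_features):
        """Node creation as in MCTS, except the new node is a subgoal."""
        child = SubgoalNode(frozenset(goal_features), parent=self)
        self.children.append(child)
        return child
```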
14. Summary of CluVo + GMCTS: all in one slide
1) Simulations, categorization of actions
2) Building of macro-actions
3) Clustering of features
4) MCTS by simulations, correlations, voting:
1) node creation as in MCTS, but node = subgoal
2) simulations biased by rewards, as in MCTS
3) goals → votes → MA → decisions
Vote: actions which statistically activate the goal features, in the current state, are preferred (see the sketch below).
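One possible reading of the voting rule, as a sketch: score each macro-action by how often past simulations from similar states saw it activate the goal features, and prefer the best scorer. The statistics table, the state abstraction, and all names are assumptions for illustration.

```python
def vote(goal_key, macro_actions, activation_stats, state):
    """Prefer the MA that statistically activates the goal features in this state.

    activation_stats[(state_key, goal_key, ma)] = (times_tried, times_goal_activated),
    accumulated from earlier simulations (hypothetical bookkeeping).
    """
    def score(ma):
        tried, activated = activation_stats.get((state_key(state), goal_key, ma), (0, 0))
        return activated / tried if tried > 0 else 0.0
    return max(macro_actions, key=score)

def state_key(state):
    """Illustrative state abstraction, e.g. coarsened cluster representatives."""
    return tuple(round(x, 1) for x in state)
```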
15. Other developments
● Q-learning
● Fitted Q-iteration
● Direct Policy Search
Main issue: representation (macro-actions, clusters of features).
==> Direct Policy Search
● also uses MAs
● but needs a memory
==> GMCTS is quite convenient for focusing simulations.
16. Results of CluVo + GMCTS on other WPs' test cases
● Blue flag then red flag: ok
● Looks easy, but in a fully agnostic framework with thousands of variables it is not that easy
● The same algorithm performed correctly on “catch as many flags as possible”
● Combines many elements of the state of the art:
– macro-actions
– subgoal learning
– clustering of features
– MCTS / UCT
17. All you can eat: DPS could do it,
with MA (no memory needed)
18. Blue Flag then Red Flag: clustering ok for 3 out of 12 runs
[Figure: results after 8h of learning, and generalization]
19. Tests on test cases from other WPs
● This has taken most of the manpower: easy problems, but in a very difficult setting
● No crowd-sourcing
● We have other testbeds with external developers (same platform)
20. Results on the game of Go (~8 contributors)
● Automatic modifications of the bandit (moderate success, far less efficient than supervised learning on databases or expert handcrafting)
==> maths for crowd-sourcing:
– automated regression testing by MSHT (good for crowd-sourcing)
– constraints on the way humans enter expertise, for preserving consistency
● Automatic precomputation of moves (opening books)
==> both are quite parallel, but very expensive for moderate progress
21. Results on Urban Rivals (18 million players, ~15 developers on the core)
● Also partially observable, but information is fully revealed frequently
● Easy to simulate
● MCTS was great for this application:
– no human expertise needed
– consistent, independently of human expertise
● Solved the problem, whereas many engineers had failed
23. Results on energy management (~7–8 developers, high turnover)
● Existing solutions are often very poor for short-term volatility and the high dimension of the state space
● Simulation-based approaches: rigorous use of cross-validation, detailed non-simplified simulations
● MCTS + DPS (for choosing the default policy): stable and efficient (see the sketch below)
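A sketch of the MCTS + DPS hybridization, under assumptions: DPS is rendered here as a simple (1+1)-style evolutionary search over the parameters of a rollout policy, and the `policy_family`/`env` interfaces are hypothetical. The optimized policy is then plugged in as the MCTS default (rollout) policy in place of a random one.

```python
import random

def dps_optimize(policy_family, env, n_iters=200, sigma=0.1, horizon=50):
    """Direct Policy Search, rendered as a (1+1)-style evolutionary search."""
    theta = policy_family.init_params()
    best = evaluate(policy_family, theta, env, horizon)
    for _ in range(n_iters):
        candidate = [t + random.gauss(0.0, sigma) for t in theta]  # mutate parameters
        value = evaluate(policy_family, candidate, env, horizon)
        if value >= best:
            theta, best = candidate, value                         # keep improvements
    return theta  # use policy_family.act(theta, state) as the MCTS default policy

def evaluate(policy_family, theta, env, horizon):
    """Cumulative reward of one simulated episode under the parametric policy."""
    state, total = env.reset(), 0.0
    for _ in range(horizon):
        state = env.step(state, policy_family.act(theta, state))
        total += env.reward(state)
    return total
```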
25. Conclusions (1)
● MCTS adapted to partially observable, expensive, agnostic settings
– MCTS + all existing tricks from the state of the art
– integration into MCTS is probably more natural than in many algorithms (in particular for subgoal learning)
– a big implementation and experimentation effort; more publications to come
● An unexpected positive result:
● a merge of two simulation-based tools, DPS (long-term effects) and MCTS (short-term effects)
● quite natural, highly parallel
● virtually no model bias
● really efficient in stochastic settings (not adversarial)
26. Conclusions (2)
● We tested on problems with external developers:
● simulation-based optimization makes precise models possible
● interface with humans: automatic non-regression testing / constraints for consistency / interface for using human knowledge: ok
● WP problems did not motivate alternate developers, but the principles could be tested on other test cases
● No real crowd-sourcing, but moderately sized teams of motivated developers ==> easier
● Principles developed for the platform are reused for an industrial platform
27. Perspectives
● The GMCTS / CluVo program is stable and able to work in very hard settings by combining many state-of-the-art techniques; it can have a long life; no crowd-sourcing
● The application of simulation-based methods is efficient, compliant with non-linear stochastic dynamics, and parallel ==> validated for industrialization in energy management
● MCTS variants for partially observable settings also work well far from the hard MASH setting
28. Publication
● Main MASH publication: J.-B. Hoock's 2012 paper:
● categorizing actions for automatically designing macro-actions
● clustering of features
● GMCTS for building subgoals
● Many ideas in it; it is a big part of his Ph.D. in one article.
29.–34. Publications (the same list is shown six times with incremental callout annotations; the annotations are folded in below)
● Undecidability of adversarial planning with unbounded horizon:
– planning with adversarial uncertainties, a finite state space, no observation, and a deterministic problem: the optimal average reward is undecidable and unapproximable
– planning with adversarial uncertainties, a finite state space, partial observation, and a stochastic problem which, for all strategies, stops almost surely: the optimal average reward is decidable
● DPS: convergence rates of robust optimization / noisy optimization:
– optimal rates in the parallel case for robust optimization w.r.t. monotonous compositions ==> essentially, bounds and patches for evolutionary computation
– optimal rates for noisy quadratic optimization, with black-box regret linear in the variance
● Parallel MCTS / nested MCTS
● MCTS in continuous settings: consistency proof in the continuous case
● MCTS for the PO setting (real-world: Urban Rivals)
● Model-free MCTS
● Hybridization MCTS/DPS
● Simulation-based optimization in power systems