Direct policy search (DPS) is a reinforcement learning technique that directly optimizes parametric policies for decision making. Key aspects of DPS include:
(1) Defining a parametric policy with parameters that are optimized.
(2) Choosing an optimization algorithm like cross-entropy or CMA-ES to maximize the average reward of the policy over many simulations or real interactions.
(3) Applying DPS involves overloading the default policy function to define the parametric policy, and choosing an optimization algorithm to update the policy parameters.
Direct policy search
1. DIRECT POLICY SEARCH
0. What is Direct Policy Search?
1. Direct Policy Search: Parametric Policies for Financial Applications
2. Parametric Bellman values for Stock Problems
3. Direct Policy Search: Optimization Tools
2. First, you need to know what
direct policy search (DPS) is.
Principle of DPS:
(1) Define a parametric policy pi(t1,...,tk)
with parameters t1,...,tk.
(2) Maximize
(t1,...,tk) → average reward obtained when applying
policy pi(t1,...,tk) to the problem.
==> You must define pi.
==> You must choose a noisy optimization algorithm.
==> There is a default pi (an actor neural network),
but it is only a default solution (overload it).
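For concreteness, a minimal sketch of this principle on an invented one-stock toy problem; the policy shape, the simulator and the simple (1+1)-style random search are illustrative assumptions, not the project's actual code:

#include <algorithm>
#include <cmath>
#include <iostream>
#include <random>
#include <vector>

std::mt19937 rng(42);

// Parametric policy pi(t1,...,tk; state): here a clipped linear rule
// on a toy one-stock problem (everything below is invented for illustration).
double policy(const std::vector<double>& params, double stock, double price) {
  double d = params[0] + params[1] * stock + params[2] * price;
  return std::max(0.0, std::min(1.0, d));  // buy between 0 and 1 unit
}

// Noisy objective: average reward of the policy over nbSims random scenarios.
double averageReward(const std::vector<double>& params, int nbSims = 100) {
  std::normal_distribution<double> noise(0.0, 0.2);
  double total = 0.0;
  for (int sim = 0; sim < nbSims; ++sim) {
    double stock = 1.0, reward = 0.0;
    for (int t = 0; t < 24; ++t) {
      double price = 1.0 + 0.5 * std::sin(0.3 * t) + noise(rng);
      double buy = policy(params, stock, price);
      double demand = 0.3;
      double served = std::min(stock + buy, demand);
      reward += 2.0 * served - price * buy;  // value of served demand minus purchase cost
      stock = std::max(0.0, stock + buy - demand);
    }
    total += reward;
  }
  return total / nbSims;
}

int main() {
  std::vector<double> best = {0.0, 0.0, 0.0};  // t1,...,tk
  double bestScore = averageReward(best);
  std::normal_distribution<double> mutation(0.0, 0.3);
  for (int iter = 0; iter < 200; ++iter) {     // simple noisy (1+1) random search
    std::vector<double> cand = best;
    for (double& p : cand) p += mutation(rng);
    double score = averageReward(cand);
    if (score > bestScore) { bestScore = score; best = cand; }
  }
  std::cout << "best average reward: " << bestScore << "\n";
}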
3. Strengths of DPS:
- Good warm start
If I have a solution for problem A, and
if I switch to problem B close to A, then I quickly
get good results.
- Benefits from expert knowledge on the structure
- No constraint on the structure of the objective function
- Anytime (i.e. not that bad in restricted time)
Drawbacks:
- needs structured direct policy search
- not directly applicable to partial observation
4. virtual MashDecision computeDecision(MashState & state,
                                        const Vector<double> params)
==> "params" = t1,...,tk
==> returns the decision pi(t1,...,tk, state)
Does it make sense?
Overload this function, and DPS is ready to work.
Well, DPS (somewhere between alpha and beta)
might be full of bugs :-)
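For instance, an overload might look like the sketch below; the MashState/MashDecision/Vector stubs and the threshold rule are only stand-ins, since the real types behind the slide's interface are not shown here:

#include <vector>

// Stand-in stubs for the framework types (the real ones are not shown on the slide).
struct MashState    { double stock; double price; };
struct MashDecision { double buyQuantity; };
template <class T> using Vector = std::vector<T>;

struct DefaultPolicy {
  virtual ~DefaultPolicy() {}
  virtual MashDecision computeDecision(MashState & state,
                                       const Vector<double> params) {
    return MashDecision{0.0};  // default: do nothing
  }
};

// A user-defined parametric policy: overload computeDecision and DPS can run.
struct ThresholdPolicy : DefaultPolicy {
  virtual MashDecision computeDecision(MashState & state,
                                       const Vector<double> params) {
    // params = t1,...,tk: refill up to params[0] units when the price is below params[1].
    double buy = (state.price < params[1]) ? params[0] - state.stock : 0.0;
    return MashDecision{buy > 0.0 ? buy : 0.0};
  }
};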
6. Bengio et al.'s papers on DPS for financial applications:
Stocks (various assets) + cash.
Can be applied on data sets (no simulator, no elasticity model),
because the policy has no impact on prices.
decision = tradingUnit(A, prevision(B, data))
Where:
- tradingUnit is designed by human experts (22 params in the first paper)
- prevision is a neural network; its outputs are chosen by human experts (reduced weight sharing)
- A and B are parameters in the other paper ==> ~ 800 parameters
(nb: there exist much bigger DPS: Sigaud et al., 27 000)
Then, (if I understand correctly):
- B is optimized by LMS (prevision criterion)
==> poor results, little correlation between LMS and financial performance
- A and B are optimized on the expected return (by DPS)
==> much better (nb: noisy optimization)
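As a reading aid, a sketch of that architecture with invented internals; the real tradingUnit and prevision of the papers are not shown, only the composition and the roles of the parameter blocks A and B follow the slide:

#include <cmath>
#include <cstddef>
#include <vector>

// prevision(B, data): a tiny one-layer predictor with parameters B (illustrative).
double prevision(const std::vector<double>& B, const std::vector<double>& data) {
  double s = B[0];
  for (std::size_t i = 0; i < data.size(); ++i) s += B[i + 1] * data[i];
  return std::tanh(s);  // predicted signal for the asset
}

// tradingUnit(A, signal): an expert-style rule with parameters A,
// here a dead zone followed by a proportional position (illustrative).
double tradingUnit(const std::vector<double>& A, double signal) {
  return (std::fabs(signal) < A[0]) ? 0.0 : A[1] * signal;
}

// decision = tradingUnit(A, prevision(B, data)).
// DPS optimizes A and B jointly on the (noisy) expected return,
// instead of fitting B alone by LMS on a prediction criterion.
double decision(const std::vector<double>& A, const std::vector<double>& B,
                const std::vector<double>& data) {
  return tradingUnit(A, prevision(B, data));
}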
8. What is a Bellman function?
V(s) = expected future benefit when playing optimally from state s.
V(s) is useful for playing optimally.
9. Rule for an optimal decision:
d(s) = argmax_d [ V(s') + r(s,d) ]
where:
- s' = nextState(s,d)
- d(s): optimal decision in state s
- V(s'): Bellman value in state s'
- r(s,d): reward associated with decision d in state s
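A minimal sketch of this rule over a finite set of candidate decisions; the toy V, nextState and r below are invented, and in practice they come from the approximated Bellman function and the problem's model:

#include <cmath>
#include <limits>
#include <vector>

// Toy stand-ins: a concave value of the stored stock, a buy-d transition, a linear cost.
double V(double stock)                   { return 2.0 * std::sqrt(stock); }
double nextState(double stock, double d) { return stock + d; }
double r(double stock, double d)         { return -1.5 * d; }

// d(s) = argmax_d [ V(nextState(s,d)) + r(s,d) ] over a finite candidate set.
double optimalDecision(double s, const std::vector<double>& candidates) {
  double bestD = candidates.front();
  double bestValue = -std::numeric_limits<double>::infinity();
  for (double d : candidates) {
    double value = V(nextState(s, d)) + r(s, d);
    if (value > bestValue) { bestValue = value; bestD = d; }
  }
  return bestD;
}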
10. Remark 1: knowing V(s) up to an additive constant is enough.
Remark 2: dV(s)/ds_i is the price of stock i.
Example with one stock, soon.
11. Q-rule for an optimal decision:
d(s) = argmax_d Q(s,d)
- d(s): optimal decision in state s
- Q(s,d): optimal future reward if decision d is taken in state s
==> approximate Q instead of V
==> we need neither r(s,d) nor nextState(s,d)
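For contrast, a sketch of the Q-rule with an invented quadratic Q; the point is that only Q is needed at decision time, not r(s,d) or nextState(s,d), and here the argmax is even closed-form because Q is concave in d:

#include <algorithm>

// Toy parametric Q(s,d) = a(s)*d - 0.5*b*d^2, concave in d, so
// argmax_d Q(s,d) = a(s)/b, clipped to the feasible range [0, dMax].
struct QuadraticQ {
  double w0, w1, b;  // parameters, e.g. tuned by DPS
  double a(double s) const { return w0 + w1 * s; }
  double decision(double s, double dMax) const {
    return std::max(0.0, std::min(dMax, a(s) / b));
  }
};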
12. [Figure: V(stock), in euros, as a function of the stock, in kWh; slope = marginal price (euros/kWh). At low stock: "I need a lot of stock! I accept to pay a lot." At high stock: "I have enough stock; I pay only if it's cheap."]
13. Examples:
For one stock:
- very simple: constant price
- piecewise linear (can ensure convexity)
- “tanh” function
- neural network, SVM, sum of Gaussians...
For several stocks:
- each stock separately
- 2-dimensional: V(s1,s2,s3) = V'(s1,S) + V''(s2,S) + V'''(s3,S), where S = a1·s1 + a2·s2 + a3·s3
- neural network, SVM, sum of Gaussians...
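As a concrete instance of the one-stock "piecewise linear" case, a sketch of a parametric V defined by per-segment slopes (marginal prices); the names and layout are illustrative, and convexity or concavity can be enforced by keeping the slopes monotone:

#include <algorithm>
#include <cstddef>
#include <vector>

// Piecewise-linear V for one stock: value at stock 0 plus one slope per segment.
// breakpoints and slopes have the same size; V is constant beyond the last breakpoint.
// Nondecreasing slopes give a convex V, nonincreasing slopes a concave one.
struct PiecewiseLinearV {
  double v0;                        // V at stock = 0 (in euros)
  std::vector<double> breakpoints;  // segment right ends (in kWh), increasing
  std::vector<double> slopes;       // marginal prices (euros/kWh), one per segment

  double operator()(double stock) const {
    double v = v0, left = 0.0;
    for (std::size_t i = 0; i < slopes.size(); ++i) {
      double len = std::min(stock, breakpoints[i]) - left;
      if (len <= 0.0) break;
      v += slopes[i] * len;
      left = breakpoints[i];
    }
    return v;
  }
};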
14. How to choose the coefficients?
- dynamic programming: robust, but slow in high dimension
- direct policy search:
  - initialize the coefficients from expert advice
  - or: supervised machine learning to approximate expert advice
  ==> and then optimize
15. Conclusions:
V: very convenient representation of a policy:
we can view prices.
Q: some advantages (model-free),
yet less readable than direct rules.
And expensive: we need one optimization to make
the decision at each time step of a simulation
==> but this optimization can be
a simple sort (as a first approximation).
Simpler? Adrien has a parametric strategy for stocks
==> we should see how to generalize it
==> transformation “constants → parameters” ==> DPS
16. Questions (strategic decisions for the DPS):
- start with Adrien's policy, improve it, generalize it, parametrize it? interface with ARM?
- or another strategy?
- or a parametric V function, assuming we have r(s,d) and nextState(s,d) (often true)?
- or a parametric Q function?
  (more generic, unusual but appealing, but neglects the existing knowledge r(s,d) and nextState(s,d))
Further work:
- finish the validation of Adrien's policy on stocks
  (better than random as a policy; better than random as a UCT-Monte-Carlo)
- generalize? variants?
- introduce it into DPS, compare to the baseline (neural net)
- introduce DPS's result into MCTS
19. - Classical tools: Evolution Strategies, Cross-Entropy, PSO, ...
==> more or less supposed to be robust to local minima
==> no gradient needed
==> robust to noisy objective functions
==> weak in high dimension (but: see locality, next slide)
- Hopefully:
  - good initialization: nearly convex
  - random seeds: no noise
==> NewUoa is my favorite choice
- no gradient needed
- can “really” work in high dimension
- update rule surprisingly fast
- people who try to show that their algorithm is better than NewUoa suffer a lot in the noise-free case
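On "random seeds: no noise", one standard trick (sketched below with an invented toy simulateReturn) is to evaluate every candidate parameter vector on the same fixed set of scenario seeds; the objective then becomes deterministic, so a noise-free optimizer such as NewUoa can be applied directly:

#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Invented toy simulator: reward of the policy with parameters `params`
// on the scenario fully determined by `seed`.
double simulateReturn(const std::vector<double>& params, unsigned seed) {
  std::mt19937 gen(seed);                      // scenario fixed by the seed
  std::normal_distribution<double> price(1.0, 0.3);
  double reward = 0.0, stock = 0.0;
  for (int t = 0; t < 24; ++t) {
    double p = price(gen);
    double buy = std::max(0.0, params[0] - params[1] * p);  // toy policy
    reward -= p * buy;
    stock += buy;
  }
  return reward + 2.0 * std::min(stock, 10.0);  // value of the stored energy
}

// Deterministic objective: always reuse the same scenario seeds (common random
// numbers), so repeated evaluations of the same params give the same value.
struct FixedSeedObjective {
  std::vector<unsigned> seeds;
  explicit FixedSeedObjective(std::size_t nbScenarios) {
    std::mt19937 gen(123);
    for (std::size_t i = 0; i < nbScenarios; ++i) seeds.push_back(gen());
  }
  double operator()(const std::vector<double>& params) const {
    double total = 0.0;
    for (unsigned s : seeds) total += simulateReturn(params, s);
    return total / seeds.size();
  }
};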
20. Improvements of optimization algorithms:
- active learning: when optimizing on scenarios, choose “good” scenarios
  ==> maybe “quasi-randomization”? Just choosing a representative sample of scenarios ==> simple, robust...
- local improvement: when a gradient step/update is performed, only update the variables concerned by the simulation you used for generating the update
  ==> difficult to use in NewUoa
21. Roadmap:
- default policy for energy management problems: test, generalize, formalize, simplify...
- this default policy ==> a parametric policy
- test in DPS: strategy A
- interface DPS with NewUoa and/or others (openDP opt?)
- Strategy A: test inside MCTS ==> Strategy B
==> IMHO, strategy A = a good tool for fast, readable, non-myopic results
==> IMHO, strategy B = good for combining A with the efficiency of MCTS for short-term combinatorial effects
- Also, validating the partial observation approach (sounds good).