Direct policy search (DPS) is a reinforcement learning technique that directly optimizes parametric policies for decision making. Key aspects of DPS include:
(1) Defining a parametric policy with parameters that are optimized.
(2) Choosing an optimization algorithm like cross-entropy or CMA-ES to maximize the average reward of the policy over many simulations or real interactions.
(3) Applying DPS involves overloading the default policy function to define the parametric policy, and choosing an optimization algorithm to update the policy parameters.
Direct policy search
1. DIRECT POLICY SEARCH
0. What is Direct Policy Search?
1. Direct Policy Search: Parametric Policies for Financial Applications
2. Parametric Bellman values for Stock Problems
3. Direct Policy Search: Optimization Tools
2. First, you need to know what
direct policy search (DPS) is.
Principle of DPS:
(1) Define a parametric policy pi(t1,...,tk)
with parameters t1,...,tk.
(2) Maximize
(t1,...,tk) → average reward obtained when applying
policy pi(t1,...,tk) to the problem.
==> You must define pi.
==> You must choose a noisy optimization algorithm.
==> There is a default pi (an actor neural network),
but it is only a default solution (overload it).
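For concreteness, a minimal sketch of this principle on an invented one-stock toy problem; the policy shape, the simulator and the simple (1+1)-style random search are illustrative assumptions, not the project's actual code:

#include <algorithm>
#include <cmath>
#include <iostream>
#include <random>
#include <vector>

std::mt19937 rng(42);

// Parametric policy pi(t1,...,tk; state): here a clipped linear rule
// on a toy one-stock problem (everything below is invented for illustration).
double policy(const std::vector<double>& params, double stock, double price) {
  double d = params[0] + params[1] * stock + params[2] * price;
  return std::max(0.0, std::min(1.0, d));  // buy between 0 and 1 unit
}

// Noisy objective: average reward of the policy over nbSims random scenarios.
double averageReward(const std::vector<double>& params, int nbSims = 100) {
  std::normal_distribution<double> noise(0.0, 0.2);
  double total = 0.0;
  for (int sim = 0; sim < nbSims; ++sim) {
    double stock = 1.0, reward = 0.0;
    for (int t = 0; t < 24; ++t) {
      double price = 1.0 + 0.5 * std::sin(0.3 * t) + noise(rng);
      double buy = policy(params, stock, price);
      double demand = 0.3;
      double served = std::min(stock + buy, demand);
      reward += 2.0 * served - price * buy;  // value of served demand minus purchase cost
      stock = std::max(0.0, stock + buy - demand);
    }
    total += reward;
  }
  return total / nbSims;
}

int main() {
  std::vector<double> best = {0.0, 0.0, 0.0};  // t1,...,tk
  double bestScore = averageReward(best);
  std::normal_distribution<double> mutation(0.0, 0.3);
  for (int iter = 0; iter < 200; ++iter) {     // simple noisy (1+1) random search
    std::vector<double> cand = best;
    for (double& p : cand) p += mutation(rng);
    double score = averageReward(cand);
    if (score > bestScore) { bestScore = score; best = cand; }
  }
  std::cout << "best average reward: " << bestScore << "\n";
}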
3. Strengths of DPS:
- Good warm start
If I have a solution for problem A, and
if I switch to problem B close to A, then I quickly
get good results.
- Benefits from expert knowledge on the structure
- No constraint on the structure of the objective function
- Anytime (i.e. not that bad in restricted time)
Drawbacks:
- needs structured direct policy search
- not directly applicable to partial observation
4. virtual MashDecision computeDecision(MashState & state,
                                        const Vector<double> params)
==> "params" = t1,...,tk
==> returns the decision pi(t1,...,tk, state)
Does it make sense?
Overload this function, and DPS is ready to work.
Well, DPS (somewhere between alpha and beta)
might be full of bugs :-)
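For instance, an overload might look like the sketch below; the MashState/MashDecision/Vector stubs and the threshold rule are only stand-ins, since the real types behind the slide's interface are not shown here:

#include <vector>

// Stand-in stubs for the framework types (the real ones are not shown on the slide).
struct MashState    { double stock; double price; };
struct MashDecision { double buyQuantity; };
template <class T> using Vector = std::vector<T>;

struct DefaultPolicy {
  virtual ~DefaultPolicy() {}
  virtual MashDecision computeDecision(MashState & state,
                                       const Vector<double> params) {
    return MashDecision{0.0};  // default: do nothing
  }
};

// A user-defined parametric policy: overload computeDecision and DPS can run.
struct ThresholdPolicy : DefaultPolicy {
  virtual MashDecision computeDecision(MashState & state,
                                       const Vector<double> params) {
    // params = t1,...,tk: refill up to params[0] units when the price is below params[1].
    double buy = (state.price < params[1]) ? params[0] - state.stock : 0.0;
    return MashDecision{buy > 0.0 ? buy : 0.0};
  }
};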
6. Bengio et al.'s papers on DPS for financial applications:
Stocks (various assets) + cash.
Can be applied on data sets (no simulator, no elasticity model),
because the policy has no impact on prices.
decision = tradingUnit(A, prevision(B, data))
Where:
- tradingUnit is designed by human experts (22 params in the first paper)
- prevision is a neural network; its outputs are chosen by human experts (reduced weight sharing)
- A and B are parameters in the other paper ==> ~ 800 parameters
(nb: there exist much bigger DPS: Sigaud et al., 27 000)
Then, (if I understand correctly):
- B is optimized by LMS (prevision criterion)
==> poor results, little correlation between LMS and financial performance
- A and B are optimized on the expected return (by DPS)
==> much better (nb: noisy optimization)
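As a reading aid, a sketch of that architecture with invented internals; the real tradingUnit and prevision of the papers are not shown, only the composition and the roles of the parameter blocks A and B follow the slide:

#include <cmath>
#include <cstddef>
#include <vector>

// prevision(B, data): a tiny one-layer predictor with parameters B (illustrative).
double prevision(const std::vector<double>& B, const std::vector<double>& data) {
  double s = B[0];
  for (std::size_t i = 0; i < data.size(); ++i) s += B[i + 1] * data[i];
  return std::tanh(s);  // predicted signal for the asset
}

// tradingUnit(A, signal): an expert-style rule with parameters A,
// here a dead zone followed by a proportional position (illustrative).
double tradingUnit(const std::vector<double>& A, double signal) {
  return (std::fabs(signal) < A[0]) ? 0.0 : A[1] * signal;
}

// decision = tradingUnit(A, prevision(B, data)).
// DPS optimizes A and B jointly on the (noisy) expected return,
// instead of fitting B alone by LMS on a prediction criterion.
double decision(const std::vector<double>& A, const std::vector<double>& B,
                const std::vector<double>& data) {
  return tradingUnit(A, prevision(B, data));
}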
8. What is a Bellman function?
V(s) = expected future benefit when playing optimally from state s.
V(s) is useful for playing optimally.
9. Rule for an optimal decision:
d(s) = argmax_d [ V(s') + r(s,d) ]
where:
- s' = nextState(s,d)
- d(s): optimal decision in state s
- V(s'): Bellman value in state s'
- r(s,d): reward associated with decision d in state s
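A minimal sketch of this rule over a finite set of candidate decisions; the toy V, nextState and r below are invented, and in practice they come from the approximated Bellman function and the problem's model:

#include <cmath>
#include <limits>
#include <vector>

// Toy stand-ins: a concave value of the stored stock, a buy-d transition, a linear cost.
double V(double stock)                   { return 2.0 * std::sqrt(stock); }
double nextState(double stock, double d) { return stock + d; }
double r(double stock, double d)         { return -1.5 * d; }

// d(s) = argmax_d [ V(nextState(s,d)) + r(s,d) ] over a finite candidate set.
double optimalDecision(double s, const std::vector<double>& candidates) {
  double bestD = candidates.front();
  double bestValue = -std::numeric_limits<double>::infinity();
  for (double d : candidates) {
    double value = V(nextState(s, d)) + r(s, d);
    if (value > bestValue) { bestValue = value; bestD = d; }
  }
  return bestD;
}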
10. Remark 1: knowing V(s) up to an additive constant is enough.
Remark 2: dV(s)/ds_i is the price of stock i.
Example with one stock, soon.
11. Q-rule for an optimal decision:
d(s) = argmax_d Q(s,d)
- d(s): optimal decision in state s
- Q(s,d): optimal future reward if decision d is taken in state s
==> approximate Q instead of V
==> we need neither r(s,d) nor nextState(s,d)
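For contrast, a sketch of the Q-rule with an invented quadratic Q; the point is that only Q is needed at decision time, not r(s,d) or nextState(s,d), and here the argmax is even closed-form because Q is concave in d:

#include <algorithm>

// Toy parametric Q(s,d) = a(s)*d - 0.5*b*d^2, concave in d, so
// argmax_d Q(s,d) = a(s)/b, clipped to the feasible range [0, dMax].
struct QuadraticQ {
  double w0, w1, b;  // parameters, e.g. tuned by DPS
  double a(double s) const { return w0 + w1 * s; }
  double decision(double s, double dMax) const {
    return std::max(0.0, std::min(dMax, a(s) / b));
  }
};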
12. [Figure: V(stock), in euros, as a function of the stock, in kWh; slope = marginal price (euros/kWh). At low stock: "I need a lot of stock! I accept to pay a lot." At high stock: "I have enough stock; I pay only if it's cheap."]
13. Examples:
For one stock:
- very simple: constant price
- piecewise linear (can ensure convexity)
- “tanh” function
- neural network, SVM, sum of Gaussians...
For several stocks:
- each stock separately
- 2-dimensional: V(s1,s2,s3) = V'(s1,S) + V''(s2,S) + V'''(s3,S), where S = a1·s1 + a2·s2 + a3·s3
- neural network, SVM, sum of Gaussians...
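As a concrete instance of the one-stock "piecewise linear" case, a sketch of a parametric V defined by per-segment slopes (marginal prices); the names and layout are illustrative, and convexity or concavity can be enforced by keeping the slopes monotone:

#include <algorithm>
#include <cstddef>
#include <vector>

// Piecewise-linear V for one stock: value at stock 0 plus one slope per segment.
// breakpoints and slopes have the same size; V is constant beyond the last breakpoint.
// Nondecreasing slopes give a convex V, nonincreasing slopes a concave one.
struct PiecewiseLinearV {
  double v0;                        // V at stock = 0 (in euros)
  std::vector<double> breakpoints;  // segment right ends (in kWh), increasing
  std::vector<double> slopes;       // marginal prices (euros/kWh), one per segment

  double operator()(double stock) const {
    double v = v0, left = 0.0;
    for (std::size_t i = 0; i < slopes.size(); ++i) {
      double len = std::min(stock, breakpoints[i]) - left;
      if (len <= 0.0) break;
      v += slopes[i] * len;
      left = breakpoints[i];
    }
    return v;
  }
};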
14. How to choose the coefficients?
- dynamic programming: robust, but slow in high dimension
- direct policy search:
  - initialize the coefficients from expert advice
  - or: supervised machine learning to approximate expert advice
  ==> and then optimize
15. Conclusions:
V: very convenient representation of a policy:
we can view prices.
Q: some advantages (model-free),
yet less readable than direct rules.
And expensive: we need one optimization to make
the decision at each time step of a simulation
==> but this optimization can be
a simple sort (as a first approximation).
Simpler? Adrien has a parametric strategy for stocks
==> we should see how to generalize it
==> transformation “constants → parameters” ==> DPS
16. Questions (strategic decisions for the DPS):
- start with Adrien's policy, improve it, generalize it, parametrize it? interface with ARM?
- or another strategy?
- or a parametric V function, assuming we have r(s,d) and nextState(s,d) (often true)?
- or a parametric Q function?
  (more generic, unusual but appealing, but neglects the existing knowledge r(s,d) and nextState(s,d))
Further work:
- finish the validation of Adrien's policy on stocks
  (better than random as a policy; better than random as a UCT-Monte-Carlo)
- generalize? variants?
- introduce it into DPS, compare to the baseline (neural net)
- introduce DPS's result into MCTS
19. - Classical tools: Evolution Strategies, Cross-Entropy, PSO, ...
==> more or less supposed to be robust to local minima
==> no gradient needed
==> robust to noisy objective functions
==> weak in high dimension (but: see locality, next slide)
- Hopefully:
  - good initialization: nearly convex
  - random seeds: no noise
==> NewUoa is my favorite choice
- no gradient needed
- can “really” work in high dimension
- update rule surprisingly fast
- people who try to show that their algorithm is better than NewUoa suffer a lot in the noise-free case
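On "random seeds: no noise", one standard trick (sketched below with an invented toy simulateReturn) is to evaluate every candidate parameter vector on the same fixed set of scenario seeds; the objective then becomes deterministic, so a noise-free optimizer such as NewUoa can be applied directly:

#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Invented toy simulator: reward of the policy with parameters `params`
// on the scenario fully determined by `seed`.
double simulateReturn(const std::vector<double>& params, unsigned seed) {
  std::mt19937 gen(seed);                      // scenario fixed by the seed
  std::normal_distribution<double> price(1.0, 0.3);
  double reward = 0.0, stock = 0.0;
  for (int t = 0; t < 24; ++t) {
    double p = price(gen);
    double buy = std::max(0.0, params[0] - params[1] * p);  // toy policy
    reward -= p * buy;
    stock += buy;
  }
  return reward + 2.0 * std::min(stock, 10.0);  // value of the stored energy
}

// Deterministic objective: always reuse the same scenario seeds (common random
// numbers), so repeated evaluations of the same params give the same value.
struct FixedSeedObjective {
  std::vector<unsigned> seeds;
  explicit FixedSeedObjective(std::size_t nbScenarios) {
    std::mt19937 gen(123);
    for (std::size_t i = 0; i < nbScenarios; ++i) seeds.push_back(gen());
  }
  double operator()(const std::vector<double>& params) const {
    double total = 0.0;
    for (unsigned s : seeds) total += simulateReturn(params, s);
    return total / seeds.size();
  }
};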
20. Improvements of optimization algorithms:
- active learning: when optimizing on scenarios, choose “good” scenarios
  ==> maybe “quasi-randomization”? Just choosing a representative sample of scenarios ==> simple, robust...
- local improvement: when a gradient step/update is performed, only update the variables concerned by the simulation you used for generating the update
  ==> difficult to use in NewUoa
21. Roadmap:
- default policy for energy management problems: test, generalize, formalize, simplify...
- this default policy ==> a parametric policy
- test in DPS: strategy A
- interface DPS with NewUoa and/or others (openDP opt?)
- Strategy A: test inside MCTS ==> Strategy B
==> IMHO, strategy A = a good tool for fast, readable, non-myopic results
==> IMHO, strategy B = good for combining A with the efficiency of MCTS for short-term combinatorial effects
- Also, validating the partial observation approach (sounds good).