DIRECT POLICY SEARCH

0. What is Direct Policy Search?

1. Direct Policy Search:
   Parametric Policies for Financial Applications

2. Parametric Bellman values for Stock Problems

3. Direct Policy Search: Optimization Tools

First, you need to know what direct policy search (DPS) is.

Principle of DPS:

(1) Define a parametric policy pi
    with parameters t1,...,tk.

(2) Maximize
    (t1,...,tk) → average reward obtained when applying
    policy pi(t1,...,tk) to the problem.

==> You must define pi.
==> You must choose a noisy optimization algorithm.
==> There is a pi by default (an actor neural network),
    but it is only a default solution (overload it).
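
As a minimal sketch of this loop, here is a toy DPS in C++: a
two-parameter policy is tuned by a (1+1)-evolution strategy on a noisy
objective. Everything here (the toy objective, the mutation step, the
episode count) is an illustrative assumption, not part of the slides.

// Minimal DPS sketch: maximize the average reward of a
// 2-parameter policy with a (1+1)-evolution strategy.
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

std::mt19937 rng(42);

// Toy noisy simulator: reward of one episode played with params t.
double episodeReward(const std::vector<double>& t) {
    std::normal_distribution<double> noise(0.0, 0.1);
    return -std::pow(t[0] - 1.0, 2) - std::pow(t[1] + 0.5, 2) + noise(rng);
}

// The DPS objective: average reward over n simulated episodes.
double averageReward(const std::vector<double>& t, int n = 100) {
    double sum = 0.0;
    for (int i = 0; i < n; ++i) sum += episodeReward(t);
    return sum / n;
}

int main() {
    std::vector<double> best = {0.0, 0.0};          // initial t1, t2
    double bestScore = averageReward(best);
    std::normal_distribution<double> mutation(0.0, 0.3);
    for (int iter = 0; iter < 200; ++iter) {        // (1+1)-ES loop
        std::vector<double> cand = best;
        for (double& x : cand) x += mutation(rng);  // mutate params
        double s = averageReward(cand);
        if (s > bestScore) { best = cand; bestScore = s; }
    }
    std::printf("best params: %g %g (score %g)\n", best[0], best[1], bestScore);
}
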
Strengths of DPS:

- Good warm start:
  if I have a solution for problem A and I switch
  to a problem B close to A, then I quickly
  get good results.

- Benefits from expert knowledge on the structure.

- No constraint on the structure of the objective function.

- Anytime (i.e. not that bad under restricted time).

Drawbacks:

- needs structured direct policy search
- not directly applicable to partial observation

virtual MashDecision computeDecision(MashState & state,
                                     const Vector<double> & params);

==> “params” = t1,...,tk
==> returns the decision pi(t1,...,tk, state)

Does it make sense?

Overload this function, and DPS is ready to work.

Well, DPS (somewhere between alpha and beta)
might be full of bugs :-)
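
For concreteness, a hypothetical overload might look as follows.
MashState, MashDecision and the fields used here are stand-ins for
the framework's actual types, and the threshold rule is just an
illustrative choice.

#include <vector>

// Stub types standing in for the framework's (fields are assumptions).
struct MashState    { double price; double stock; };
struct MashDecision { double buy; };
template <class T> using Vector = std::vector<T>;

struct MyPolicy {
    // params = (t1, t2): buy t2 units whenever the price is below t1.
    MashDecision computeDecision(MashState & state,
                                 const Vector<double> & params) {
        double threshold = params[0], amount = params[1];
        return MashDecision{ state.price < threshold ? amount : 0.0 };
    }
};
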
Direct Policy Search:
Parametric Policies for Financial Applications

Bengio et al.'s papers on DPS for financial applications:

Stocks (various assets) + cash.

    decision = tradingUnit(A, prevision(B, data))

Where:
- tradingUnit is designed by human experts
- prevision's outputs are chosen by human experts
- prevision is a neural network
- A and B are parameters

Then:
B is optimized by LMS (prevision criterion)
   ==> poor results: little correlation between
       LMS and financial performance.
A and B are optimized on the expected return (by DPS)
   ==> much better.

Notes:
- Can be applied on data sets (no simulator, no elasticity
  model), because the policy has no impact on prices.
- 22 params in the first paper; reduced weight sharing in
  the other paper ==> ~800 parameters (if I understand correctly).
- There exist much bigger DPS (Sigaud et al., 27,000).
- NB: noisy optimization.
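
A minimal sketch of this objective, under toy assumptions (a linear
prevision, a tanh-shaped tradingUnit, and a simple Day record; none of
these details come from the papers):

#include <cmath>
#include <vector>

// One day of historical data (hypothetical layout).
struct Day { std::vector<double> features; double priceMove; };

// prevision(B, data): here, a toy linear predictor with weights B.
double prevision(const std::vector<double>& B, const Day& day) {
    double s = 0.0;
    for (size_t i = 0; i < B.size(); ++i) s += B[i] * day.features[i];
    return s;
}

// tradingUnit(A, forecast): here, position = A * tanh(forecast).
double tradingUnit(double A, double forecast) {
    return A * std::tanh(forecast);
}

// The DPS objective over (A, B): average return on the data set.
// Valid on fixed data because the policy does not move prices.
double expectedReturn(double A, const std::vector<double>& B,
                      const std::vector<Day>& history) {
    double total = 0.0;
    for (const Day& day : history)
        total += tradingUnit(A, prevision(B, day)) * day.priceMove;
    return total / history.size();
}
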
An alternate solution:

Parametric Bellman values
for Stock Problems

What is a Bellman function?

V(s): expected benefit, in the future,
if playing optimally from state s.

V(s) is useful for playing optimally.

Rule for an optimal decision:

  d(s) = argmax_d [ V(s') + r(s,d) ]

- s' = nextState(s,d)
- d(s): optimal decision in state s
- V(s'): Bellman value in state s'
- r(s,d): reward associated with
  decision d in state s
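
As a sketch, the greedy rule above in C++, for a one-stock problem
with a finite set of candidate decisions (the function-pointer
signatures are assumptions for illustration):

#include <limits>
#include <vector>

// d(s) = argmax_d [ V(nextState(s,d)) + r(s,d) ]
double bestDecision(double s,
                    const std::vector<double>& decisions,
                    double (*V)(double),
                    double (*nextState)(double, double),
                    double (*reward)(double, double)) {
    double bestD   = decisions.front();
    double bestVal = -std::numeric_limits<double>::infinity();
    for (double d : decisions) {
        double val = V(nextState(s, d)) + reward(s, d);
        if (val > bestVal) { bestVal = val; bestD = d; }
    }
    return bestD;
}
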
Remark 1: knowing V(s)
up to an additive constant is enough.

Remark 2: dV(s)/ds_i
is the (marginal) price of stock i.

Example with one stock, soon.

Q-rule for an optimal decision:

  d(s) = argmax_d Q(s,d)

- d(s): optimal decision in state s
- Q(s,d): optimal future reward if
  decision d is taken in state s

==> approximate Q instead of V
==> we need neither r(s,d)
    nor nextState(s,d)
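
The Q-based counterpart of the previous sketch; note that it calls
only the learned Q, with no model of rewards or transitions (again,
the signature is an illustrative assumption):

#include <limits>
#include <vector>

// d(s) = argmax_d Q(s,d): no r(s,d), no nextState(s,d) required.
double bestDecisionQ(double s,
                     const std::vector<double>& decisions,
                     double (*Q)(double, double)) {
    double bestD   = decisions.front();
    double bestVal = -std::numeric_limits<double>::infinity();
    for (double d : decisions)
        if (Q(s, d) > bestVal) { bestVal = Q(s, d); bestD = d; }
    return bestD;
}
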
[Figure: V(stock) (in euros) as a function of stock (in kWh).
 Slope = marginal price (euros/kWh). At low stock the curve is
 steep: “I need a lot of stock! I accept to pay a lot.” At high
 stock it flattens: “I have enough stock; I pay only if it's cheap.”]

Examples:

For one stock (two of these shapes are sketched below):
- very simple: constant price
- piecewise linear (can ensure convexity)
- “tanh” function
- neural network, SVM, sum of Gaussians...

For several stocks:
- each stock separately
- 2-dimensional: V(s1,s2,s3) = V'(s1,S) + V''(s2,S) + V'''(s3,S)
  where S = a1.s1 + a2.s2 + a3.s3
- neural network, SVM, sum of Gaussians...
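
A minimal sketch of the “tanh” and piecewise-linear shapes for one
stock (coefficient names a, b, c, k, p1, p2 are assumptions; these
are the coefficients DPS would tune):

#include <cmath>

// “tanh” shape: saturating, steepest marginal price a*b at stock = -c/b.
double V_tanh(double stock, double a, double b, double c) {
    return a * std::tanh(b * stock + c);
}

// Piecewise-linear shape with one breakpoint k; slopes p1 >= p2 >= 0
// make V concave: the marginal price drops once stock exceeds k.
double V_pwl(double stock, double k, double p1, double p2) {
    return (stock <= k) ? p1 * stock
                        : p1 * k + p2 * (stock - k);
}
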
How to choose the coefficients?
- dynamic programming: robust, but slow in high dimension
- direct policy search:
  - initializing the coefficients from expert advice
  - or: supervised machine learning for approximating
    expert advice (a fitting sketch follows)
  ==> and then optimize
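
A sketch of the supervised step, assuming expert advice comes as
(stock, value) pairs and the piecewise-linear V above: minimize the
squared error below over (k, p1, p2) (e.g. by grid search), then hand
the result to DPS as the starting point. The data format is hypothetical.

#include <vector>

// Expert-advised target values (hypothetical format).
struct Advice { double stock, value; };

// Squared fitting error of the piecewise-linear V on the advice.
double fitError(const std::vector<Advice>& data,
                double k, double p1, double p2) {
    double err = 0.0;
    for (const Advice& a : data) {
        double v = (a.stock <= k) ? p1 * a.stock
                                  : p1 * k + p2 * (a.stock - k);
        err += (v - a.value) * (v - a.value);
    }
    return err;
}
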
Conclusions:

V: very convenient representation of a policy:
we can view prices.
Q: some advantages (model-free).

Yet, less readable than direct rules.

And expensive: we need one optimization to make
the decision, at each time step of a simulation.
==> but this optimization can be
    a simple sort (as a first approximation;
    a sketch follows this slide).

Simpler? Adrien has a parametric strategy for stocks
==> we should see how to generalize it
==> transformation “constants → parameters” ==> DPS
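
One natural reading of the “simple sort” in energy management (an
assumption; the slide does not spell it out) is a merit-order dispatch:
sort production units by marginal cost and dispatch the cheapest first,
instead of solving a full argmax at each time step.

#include <algorithm>
#include <vector>

struct Unit { double marginalCost; double capacity; };

// Returns production per unit, in merit order (cheapest first).
std::vector<double> dispatch(std::vector<Unit> units, double demand) {
    std::sort(units.begin(), units.end(),
              [](const Unit& a, const Unit& b) {
                  return a.marginalCost < b.marginalCost;
              });
    std::vector<double> production(units.size(), 0.0);
    for (size_t i = 0; i < units.size() && demand > 0.0; ++i) {
        production[i] = std::min(units[i].capacity, demand);
        demand -= production[i];
    }
    return production;
}
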
Questions (strategic decisions for the DPS):
- start with Adrien's policy, improve it, generalize it,
  parametrize it? interface with ARM?
- or another strategy?
- or a parametric V function, assuming we have
  r(s,d) and nextState(s,d) (often true)?
- or a parametric Q function?
  (more generic, unusual but appealing,
  but neglects the existing knowledge
  r(s,d) and nextState(s,d))

Further work:
- finish the validation of Adrien's policy on stocks
  (better than random as a policy; better than random
  within UCT-Monte-Carlo)
- generalize? variants?
- introduce it into DPS, compare to the baseline (neural net)
- introduce DPS's result into MCTS

Direct Policy Search:

Optimization Tools
& Optimization Tricks

- Classical tools: Evolution Strategies,
  Cross-Entropy, PSO, ...
  ==> more or less supposed to be
      robust to local minima
  ==> no gradient
  ==> robust to noisy objective functions
  ==> weak in high dimension (but: see locality, next slide)

- Hopefully:
  - good initialization ==> nearly convex
  - fixed random seeds ==> no noise
    (the objective becomes deterministic)

==> NEWUOA is my favorite choice:
  - no gradient
  - can “really” work in high dimension
  - surprisingly fast update rule
  - people who try to show that their
    algorithm is better than NEWUOA
    suffer a lot in the noise-free case

Improvements of optimization algorithms:

- active learning: when optimizing on scenarios,
  choose “good” scenarios

  ==> maybe “quasi-randomization”?
      Just choosing a representative sample of
      scenarios ==> simple, robust...
      (a sketch follows this list)

- local improvement: when a gradient step/update
  is performed, only update the variables concerned
  by the simulation you've used for generating
  the update

  ==> difficult to use in NEWUOA
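
A minimal sketch of the representative-sample idea (one simple
reading, not from the slides): rank scenarios by a summary statistic,
e.g. the scenario's total demand, and keep evenly spaced quantiles
rather than an i.i.d. draw.

#include <algorithm>
#include <vector>

// Returns the indices of k representative scenarios, chosen as
// evenly spaced quantiles of the summary statistic.
std::vector<int> representativeSample(const std::vector<double>& summary,
                                      int k) {
    std::vector<int> idx(summary.size());
    for (size_t i = 0; i < idx.size(); ++i) idx[i] = (int)i;
    std::sort(idx.begin(), idx.end(),
              [&](int a, int b) { return summary[a] < summary[b]; });
    std::vector<int> chosen;
    for (int j = 0; j < k; ++j)
        chosen.push_back(idx[(size_t)((j + 0.5) * idx.size() / k)]);
    return chosen;
}
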
Roadmap:

- default policy for energy management problems:
  test, generalize, formalize, simplify...

- this default policy ==> a parametric policy

- test in DPS: strategy A

- interface DPS with NEWUOA and/or others (openDP opt?)

- strategy A: test inside MCTS ==> strategy B

==> IMHO, strategy A = a good tool for fast,
    readable, non-myopic results

==> IMHO, strategy B = good for combining A with
    the efficiency of MCTS for short-term combinatorial effects.

- Also, validating the partial observation (sounds good).
