Ben Lau, Quantitative Researcher, Hobbyist, at MLconf NYC 2017

Ben Lau is a quantitative researcher at a macro hedge fund in Hong Kong, where he applies mathematical models and signal-processing techniques to study financial markets. Prior to joining the financial industry, he used his mathematical modelling skills to explore the mysteries of the universe at the Stanford Linear Accelerator Center, a national accelerator laboratory, where he studied the asymmetry between matter and antimatter by analysing tens of billions of collision events created by particle accelerators. Ben was awarded his Ph.D. in Particle Physics from Princeton University and his undergraduate degree (with First Class Honours) from the Chinese University of Hong Kong.

Abstract Summary:

Deep Reinforcement Learning: Developing a robotic car with the ability to form long-term driving strategies is key to enabling fully autonomous driving in the future. Reinforcement learning is considered a strong AI paradigm that can be used to teach machines through interaction with the environment and learning from their mistakes. In this talk, we will discuss how to apply deep reinforcement learning techniques to train a self-driving car in TORCS, an open-source racing car simulator. I will share how this is implemented and discuss various challenges in this project.


Ben Lau, Quantitative Researcher, Hobbyist, at MLconf NYC 2017

  1. Deep Reinforcement Learning: using deep learning to play self-driving car games. Ben Lau. MLConf 2017, New York City.
  2. What is Reinforcement Learning? Three classes of learning: Supervised learning (labeled data, direct feedback); Unsupervised learning (no labels, no feedback, "find hidden structure"); Reinforcement learning (uses reward as feedback, learns a series of actions, trial and error).
  3. RL: Agent and Environment. At each step t, the agent receives observation O_t, executes action A_t, and receives reward R_t; the environment receives action A_t, then sends observation O_{t+1} and reward R_{t+1}.
  4. RL: State. Experience is a sequence of observations, actions, and rewards: o_1, r_1, a_1, ..., o_{t-1}, r_{t-1}, a_{t-1}, o_t, r_t, a_t. The state is a summary of experience: s_t = f(o_1, r_1, a_1, ..., o_t, r_t, a_t). Note: not all states are fully observable.
  5. Approaches to Reinforcement Learning. Value-based RL: estimate the optimal value function Q*(s, a), the maximum value achievable under any policy. Policy-based RL: search directly for the optimal policy π*, the policy achieving the maximum future reward. Model-based RL: build a model of the environment and plan (e.g. by lookahead) using that model.
  6. Deep Learning + RL → AI. Game input is fed through a deep convolutional network, which outputs the controls (steering, gas pedal, brake); the reward drives the learning.
  7. Policies. A deterministic policy is the agent's behavior: a map from state to action, a_t = π(s_t). In reinforcement learning, the agent's goal is to choose each action so as to maximize the sum of future rewards: choose a_t to maximize R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ..., where γ ∈ [0, 1] is a discount factor, since rewards further in the future are less certain. Example state → action pairs: obstacle → brake; corner → steer left/right; straight line → accelerate.
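To make the discounted sum concrete, here is a minimal sketch (not from the talk) of computing R_t from a plain Python list of future rewards; the function name and the example values are illustrative only:

```python
# Minimal sketch: R_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ...
# "rewards" is assumed to hold the rewards received after time t, in order.
def discounted_return(rewards, gamma=0.99):
    total = 0.0
    for r in reversed(rewards):      # accumulate backwards: R = r + gamma * R_next
        total = r + gamma * total
    return total

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```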
  8. Approaches to Reinforcement Learning (value-based). Value-based RL: estimate the optimal value function Q*(s, a), the maximum value achievable under any policy.
  9. Value Function. A value function is a prediction of future reward: how much reward will I get from action a in state s? A Q-value function gives the expected total reward from state-action pair (s, a), under policy π, with discount factor γ: Q^π(s, a) = E[r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... | s, a]. The optimal value function is the maximum achievable value: Q*(s, a) = max_π Q^π(s, a). Once we have Q*, we can act optimally: π*(s) = argmax_a Q*(s, a).
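A minimal sketch of acting greedily once Q* is available, assuming the Q-values are stored in a dictionary keyed by (state, action); the states, actions, and values are made up for illustration:

```python
# pi*(s) = argmax_a Q*(s, a) over a small discrete action set.
def greedy_action(q_table, state, actions):
    return max(actions, key=lambda a: q_table[(state, a)])

q_table = {("corner", "left"): 1.2, ("corner", "right"): -0.4, ("corner", "brake"): 0.7}
print(greedy_action(q_table, "corner", ["left", "right", "brake"]))  # -> "left"
```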
  10. Understanding the Q-function. The best way to understand the Q-function is to think of it as a "strategy guide". Suppose you are playing a difficult game (DOOM): if you have a strategy guide, it's pretty easy, just follow the guide. Likewise, if you are in state s and need to make a decision, and you have the Q-function (the strategy guide), it is easy: just pick the action with the highest Q-value.
  11. How to find the Q-function. Discounted future reward: R_t = r_t + γ r_{t+1} + γ² r_{t+2} + ... + γ^{n-t} r_n, which can be written as R_t = r_t + γ R_{t+1}. Recall the definition of the Q-function (the maximum reward achievable by choosing action a in state s): Q(s_t, a_t) = max R_{t+1}. Therefore we can rewrite the Q-function as Q(s, a) = r + γ max_{a'} Q(s', a'). In plain English: the maximum future reward for (s, a) is the immediate reward r plus the maximum future reward attainable in the next state s' over actions a'. It can be solved by dynamic programming or by an iterative solution.
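The iterative solution can be sketched as a tabular Q-learning update. This is a generic illustration rather than the talk's code, and it assumes transitions (s, a, r, s') sampled from the environment and a small discrete action set:

```python
from collections import defaultdict

# Iterative Bellman update:
#   Q(s, a) <- Q(s, a) + alpha * [ r + gamma * max_a' Q(s', a') - Q(s, a) ]
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    target = r + gamma * max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

Q = defaultdict(float)  # Q-table; unseen (state, action) pairs default to 0.0
q_learning_update(Q, "s0", "accelerate", 1.0, "s1", ["accelerate", "brake"])
```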
  12. Deep Q-Network (DQN). The action-value function (Q-table) is often very big, e.g. a table mapping (state, action) pairs to values such as (A, 1) → 140.11, (A, 2) → 139.22, (B, 1) → 145.89, (B, 2) → 140.23, (C, 1) → 123.67, (C, 2) → 135.27. The DQN idea: use a neural network to compress this Q-table into the network weights w, so that Q(s, a) ≈ Q(s, a, w). Training then becomes finding an optimal set of weights w. In the literature this is often called non-linear function approximation.
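A minimal sketch of that compression in Keras: a small fully connected network mapping a state vector to one Q-value per discrete action. The layer sizes and the state/action dimensions are illustrative assumptions, not the architecture used in the talk:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

STATE_DIM, NUM_ACTIONS = 29, 3   # assumed sizes, for illustration only

# Q(s, a, w): the network weights w replace the huge (state, action) table.
q_network = Sequential([
    Dense(64, activation="relu", input_shape=(STATE_DIM,)),
    Dense(64, activation="relu"),
    Dense(NUM_ACTIONS, activation="linear"),   # one Q-value per action
])
q_network.compile(optimizer="adam", loss="mse")
```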
  13. DQN Demo: using a Deep Q-Network to play Doom.
  14. Approaches to Reinforcement Learning (policy-based). Policy-based RL: search directly for the optimal policy π*, the policy achieving the maximum future reward.
  15. Deep Policy Network. Review: a policy is the agent's behavior, a map from state to action: a_t = π(s_t). We can search for the policy directly. Parameterize the policy by some model parameters θ: a = π(s, θ). This is called policy-based reinforcement learning because we adjust the model parameters θ directly. The goal is to maximize the total discounted reward from the beginning: maximize R = r_1 + γ r_2 + γ² r_3 + ...
  16. Policy Gradient. How do we make good actions more likely? Define the objective function as the total discounted reward: L(θ) = E[r_1 + γ r_2 + γ² r_3 + ... | π_θ(s, a)], or L(θ) = E[R | π_θ(s, a)], where the expectation of the total reward R is taken under a probability distribution p(a | θ) parameterized by θ. The goal becomes maximizing the total reward by computing the gradient ∂L(θ)/∂θ.
  17. Policy Gradient (II). Recall: the Q-function is the maximum discounted future reward in state s for action a: Q(s_t, a_t) = max R_{t+1}. In the continuous case this can be written as Q(s_t, a_t) = R_{t+1}. Therefore we can compute the gradient as ∂L(θ)/∂θ = E_{p(a|θ)}[∂Q/∂θ]. Using the chain rule, we can rewrite this as ∂L(θ)/∂θ = E_{p(a|θ)}[∂Q_θ(s, a)/∂a · ∂a/∂θ]. No dynamics model is required: it only requires that Q is differentiable with respect to a, and that a can be parameterized as a function of θ.
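A minimal sketch of that chain-rule update in the spirit of the deterministic policy gradient used by DDPG. It is written against TensorFlow 2 / Keras as an assumption (the original project used an earlier Keras API), and the network sizes and dimensions are illustrative:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

STATE_DIM, ACTION_DIM = 29, 3            # assumed sizes, for illustration only

# Actor a = pi(s, theta) and critic Q(s, a), both as small dense networks.
s_in = layers.Input(shape=(STATE_DIM,))
actor = Model(s_in, layers.Dense(ACTION_DIM, activation="tanh")(
    layers.Dense(64, activation="relu")(s_in)))

cs_in = layers.Input(shape=(STATE_DIM,))
ca_in = layers.Input(shape=(ACTION_DIM,))
q_out = layers.Dense(1)(layers.Dense(64, activation="relu")(
    layers.Concatenate()([cs_in, ca_in])))
critic = Model([cs_in, ca_in], q_out)

optimizer = tf.keras.optimizers.Adam(1e-4)

def actor_update(states):
    # dL/dtheta = E[ dQ(s, a)/da * da/dtheta ]; maximizing Q = minimizing -Q.
    with tf.GradientTape() as tape:
        actions = actor(states)
        loss = -tf.reduce_mean(critic([states, actions]))
    grads = tape.gradient(loss, actor.trainable_variables)
    optimizer.apply_gradients(zip(grads, actor.trainable_variables))
```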
  18. The power of policy gradients. Because the policy gradient does not require a dynamics model, no prior domain knowledge is required. AlphaGo does not pre-programme any domain knowledge: it keeps playing many games (via self-play) and adjusts the policy parameters θ to maximize the reward (winning probability).
  19. Intuition: value-based vs. policy-based RL. Value-based RL is like a driving instructor: a score is given for every action the student takes. Policy-based RL is like the driver: it is the actual policy for how to drive the car.
  20. The car racing game TORCS. TORCS is a state-of-the-art open-source simulator written in C++. Main features: sophisticated dynamics, and several tracks and controllers provided out of the box. Sensors: rangefinder (track sensors), speed, position on track, rotation speed of the wheels, RPM, angle with the track. Quite realistic for self-driving cars.
  21. Deep Learning Recipe. The game input forms the state s (rangefinder, speed, position on track, rotation speed of the wheels, RPM, angle with the track), which is fed into a deep neural network that outputs the controls (steering, gas pedal, brake); the reward is used to compute the optimal policy π via the policy gradient.
  22. Design of the reward function. The obvious choice is the highest velocity of the car: R = V_car cos θ. However, experience showed that learning with this reward is not very stable. Instead, use a modified reward function: R = V_x cos θ − V_x sin θ − V_x |trackPos|, which encourages the car to stay in the center of the track.
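A hedged sketch of that modified reward, assuming TORCS-style observations for the car's longitudinal speed, its angle to the track axis, and its normalized lateral position on the track; the parameter names are assumptions, not the exact sensor names of any particular TORCS client:

```python
import math

# R = Vx*cos(theta) - Vx*sin(theta) - Vx*|track_pos|
# Rewards forward progress, penalizes lateral velocity and drifting off-center.
# (Some implementations penalize |Vx*sin(theta)| so sideways motion in either
# direction is punished; the slide states the form shown here.)
def reward(speed_x, angle, track_pos):
    return (speed_x * math.cos(angle)
            - speed_x * math.sin(angle)
            - speed_x * abs(track_pos))
```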
  23. Source code available here: search for "DDPG Keras" on Google.
  24. Training set: Aalborg track.
  25. Validation set: Alpine tracks. Recall basic machine learning: make sure you test the model on the validation set, not the training set.
  26. Learning how to brake. Since we try to maximize the velocity of the car, the AI agent doesn't want to hit the brake at all (braking goes against the reward function). Solution: use the stochastic brake idea.
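A minimal sketch of one way to realize the stochastic-brake idea as described: during exploration, occasionally force a firm brake so the agent still gathers braking experience even though braking is never attractive under the reward. The 10% probability and the brake range are illustrative assumptions:

```python
import random

# Exploration wrapper: with small probability, override the brake output so
# the agent occasionally experiences braking despite the reward pressure.
def explore_action(policy_action, brake_prob=0.1):
    steer, accel, brake = policy_action
    if random.random() < brake_prob:
        brake = max(brake, random.uniform(0.5, 1.0))  # force a firm brake
    return steer, accel, brake
```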
  27. Final demo: the car does not stay in the center of the track.
  28. Future application: self-driving cars.
  29. Future application.
  30. Thank you! Twitter: @yanpanlau
  31. Appendix
  32. How to find the Q-function (II). Q(s, a) = r + γ max_{a'} Q(s', a'). We can use an iterative method to solve for the Q-function. Given a transition (s, a, r, s'), we want r + γ max_{a'} Q(s', a') (the target) to equal Q(s, a) (the prediction). Treating the search for the Q-function as a regression task, we can define a loss function: Loss = ½ [r + γ max_{a'} Q(s', a') − Q(s, a)]². Q is optimal when the loss function is at its minimum.
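A minimal sketch of that target/prediction loss with NumPy, assuming the Q-values for the next state's actions and the current Q(s, a) have already been computed; the function and variable names are illustrative:

```python
import numpy as np

# Loss = 0.5 * (target - prediction)^2
#   target     = r + gamma * max_a' Q(s', a')
#   prediction = Q(s, a)
def q_loss(q_values_next, q_value_sa, r, gamma=0.99):
    target = r + gamma * np.max(q_values_next)
    prediction = q_value_sa
    return 0.5 * (target - prediction) ** 2

print(q_loss(np.array([1.0, 2.0]), 1.5, r=0.5))  # 0.5 * (2.48 - 1.5)^2 ≈ 0.48
```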
