Abstract Summary:

Deep Reinforcement Learning: Developing a robotic car with the ability to form long-term driving strategies is the key to enabling fully autonomous driving in the future. Reinforcement learning is considered a strong AI paradigm that can be used to teach machines through interaction with the environment and by learning from their mistakes. In this talk, we will discuss how to apply deep reinforcement learning techniques to train a self-driving car in an open-source racing car simulator called TORCS. I will share how this is implemented and discuss various challenges in this project.


- 1. Deep Reinforcement Learning: using deep learning to play self-driving car games. Ben Lau, MLConf 2017, New York City.
- 2. What is Reinforcement Learning? Three classes of learning: Supervised Learning (labeled data, direct feedback); Unsupervised Learning (no labels, no feedback, "find hidden structure"); Reinforcement Learning (uses reward as feedback, learns a series of actions by trial and error).
- 3. RL: Agent and Environment. At each step t, the agent receives observation O_t, executes action A_t, and receives reward R_t; the environment receives action A_t, then sends observation O_{t+1} and reward R_{t+1}.
- 4. RL: State. Experience is a sequence of observations, rewards, and actions: o_1, r_1, a_1, ..., o_{t-1}, r_{t-1}, a_{t-1}, o_t, r_t, a_t. The state is a summary of experience: s_t = f(o_1, r_1, a_1, ..., o_t, r_t, a_t). Note: not all states are fully observable.
- 5. Approaches to Reinforcement Learning. Value-based RL: estimate the optimal value function Q*(s, a), the maximum value achievable under any policy. Policy-based RL: search directly for the optimal policy pi*, the policy achieving maximum future reward. Model-based RL: build a model of the environment and plan (e.g. by lookahead) using the model.
- 6. Deep Learning + RL = AI. Game input and reward feed a deep convolutional network, which outputs the controls: steer, gas pedal, brake.
- 7. Policies. A deterministic policy is the agent's behavior: a map from state to action, a_t = pi(s_t). In reinforcement learning, the agent's goal is to choose each action so as to maximize the sum of future rewards: choose a_t to maximize R_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ..., where gamma is a discount factor in [0, 1], since rewards further in the future are less certain. Example state-to-action map: obstacle -> brake; corner -> steer left/right; straight line -> accelerate.
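The discounted return on this slide can be sketched in a few lines of Python (the reward sequence and gamma value below are illustrative, not from the talk):

```python
def discounted_return(rewards, gamma=0.9):
    """R_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ..."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total

# With gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```

Each step further into the future is weighted down by another factor of gamma, which is why distant, uncertain rewards count for less.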
- 8. Approaches to Reinforcement Learning. Value-based RL: estimate the optimal value function Q*(s, a), the maximum value achievable under any policy.
- 9. Value Function. A value function is a prediction of future reward: how much reward will I get from action a in state s? A Q-value function gives the expected total reward from state-action pair (s, a), under policy pi, with discount factor gamma: Q^pi(s, a) = E[r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ... | s, a]. The optimal value function is the maximum achievable value: Q*(s, a) = max_pi Q^pi(s, a). Once we have Q*, we can act optimally: pi*(s) = argmax_a Q*(s, a).
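Acting greedily once Q* is known, pi*(s) = argmax_a Q*(s, a), can be illustrated with a hand-made Q-table (the states, actions, and values below are invented for illustration):

```python
# Hypothetical Q-table for the driving example in the deck
Q = {
    ("obstacle", "brake"): 1.0,  ("obstacle", "accelerate"): -1.0,
    ("straight", "brake"): -0.5, ("straight", "accelerate"): 2.0,
}

def greedy_action(Q, state, actions=("brake", "accelerate")):
    """pi*(s): pick the action with the highest Q-value in this state."""
    return max(actions, key=lambda a: Q[(state, a)])

print(greedy_action(Q, "obstacle"))   # brake
print(greedy_action(Q, "straight"))   # accelerate
```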
- 10. Understanding the Q-Function. The best way to understand the Q-function is to think of it as a "strategy guide". Suppose you are playing a difficult game (DOOM): with a strategy guide, it's easy, just follow the guide. Likewise, if you are in state s and need to make a decision, and you have the Q-function (the strategy guide), just pick the action with the highest Q.
- 11. How to Find the Q-Function. Discounted future reward: R_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... + gamma^{n-t}*r_n, which can be written recursively as R_t = r_t + gamma*R_{t+1}. Recall the definition of the Q-function (the maximum reward if we choose action a in state s): Q(s_t, a_t) = max R_{t+1}. Therefore we can rewrite the Q-function as Q(s, a) = r + gamma * max_{a'} Q(s', a'). In plain English: the maximum future reward for (s, a) is the immediate reward r plus the maximum future reward in the next state s'. This equation can be solved by dynamic programming or by iterative methods.
- 12. Deep Q-Network (DQN). The action-value function (Q-function), stored as a table, is often very big. The DQN idea: use a neural network with weights w to compress the Q-table, Q(s, a) ≈ Q(s, a, w). Training then becomes finding an optimal set of weights w. In the literature this is called "non-linear function approximation". Example Q-table: (A, 1) -> 140.11; (A, 2) -> 139.22; (B, 1) -> 145.89; (B, 2) -> 140.23; (C, 1) -> 123.67; (C, 2) -> 135.27.
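The compression idea, replacing a table lookup by a parameterized function Q(s, a, w), can be sketched with the simplest possible approximator, a linear one (the feature map and weights below are hypothetical; the talk uses a deep network):

```python
def features(state, action):
    # Hypothetical 3-dim feature vector for a (state, action) pair
    return [1.0, float(state), 1.0 if action == "brake" else 0.0]

def q_value(state, action, w):
    """Q(s, a, w): dot product of the features with the weights w."""
    return sum(f * wi for f, wi in zip(features(state, action), w))

w = [0.5, 2.0, -1.0]   # illustrative weights, standing in for trained ones
print(q_value(3, "brake", w))   # 0.5 + 2.0*3 - 1.0 = 5.5
```

The table's rows are no longer stored; any (s, a) pair is scored on demand from a handful of weights, which is what makes large state spaces tractable.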
- 13. DQN Demo. Using a Deep Q-Network to play Doom.
- 14. Approaches to Reinforcement Learning. Policy-based RL: search directly for the optimal policy pi*, the policy achieving maximum future reward.
- 15. Deep Policy Network. Review: a policy is the agent's behavior, a map from state to action, a_t = pi(s_t). We can search for the policy directly. Parameterize the policy by model parameters theta: a = pi(s, theta). This is called policy-based reinforcement learning because we adjust the model parameters theta directly. The goal is to maximize the total discounted reward from the beginning: maximize R = r_1 + gamma*r_2 + gamma^2*r_3 + ...
- 16. Policy Gradient. How do we make good actions more likely? Define the objective function as the total discounted reward: L(theta) = E[r_1 + gamma*r_2 + gamma^2*r_3 + ... | pi_theta(s, a)], or L(theta) = E[R | pi_theta(s, a)], where the expectation of the total reward R is taken under some probability distribution p(a | theta) parameterized by theta. The goal becomes maximizing the total reward by computing the gradient dL(theta)/dtheta.
- 17. Policy Gradient (II). Recall: the Q-function is the maximum discounted future reward in state s for action a: Q(s_t, a_t) = max R_{t+1}. In the continuous case we can write Q(s_t, a_t) = R_{t+1}. Therefore we can compute the gradient as dL(theta)/dtheta = E_{p(a|theta)}[dQ/dtheta]. Using the chain rule, we can rewrite this as dL(theta)/dtheta = E_{p(a|theta)}[dQ_theta(s, a)/da * da/dtheta]. No dynamics model is required: (1) it only requires that Q is differentiable w.r.t. a, and (2) that a can be parameterized as a function of theta.
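The chain rule on this slide can be checked numerically with a toy differentiable critic Q and a linear policy a = theta*s (both are illustrative stand-ins, not the talk's networks):

```python
def Q(s, a):
    """Toy differentiable critic: reward peaks when the action a equals 2."""
    return -(a - 2.0) ** 2

def policy(s, theta):
    """Toy deterministic policy a = pi(s, theta) = theta * s."""
    return theta * s

def analytic_grad(s, theta):
    # dL/dtheta = (dQ/da) * (da/dtheta), the chain rule from the slide
    a = policy(s, theta)
    dQ_da = -2.0 * (a - 2.0)
    da_dtheta = s
    return dQ_da * da_dtheta

def numeric_grad(s, theta, h=1e-6):
    # central finite difference through the composed function Q(s, pi(s, theta))
    return (Q(s, policy(s, theta + h)) - Q(s, policy(s, theta - h))) / (2 * h)

s, theta = 3.0, 0.5
print(analytic_grad(s, theta))   # 3.0
print(numeric_grad(s, theta))    # ~3.0
```

The two gradients agree without ever modeling the environment's dynamics, which is the point of the slide: only dQ/da and da/dtheta are needed.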
- 18. The Power of Policy Gradient. Because the policy gradient does not require a dynamics model, no prior domain knowledge is required. AlphaGo was not pre-programmed with any domain knowledge: it keeps playing many games (via self-play) and adjusts the policy parameters theta to maximize the reward (winning probability).
- 19. Intuition: Value vs. Policy RL. Value-based RL is like a driving instructor: a score is given for every action the student takes. Policy-based RL is like the driver: it is the actual policy for how to drive the car.
- 20. The Car Racing Game TORCS. TORCS is a state-of-the-art open-source simulator written in C++. Main features: sophisticated dynamics; provided with several tracks and controllers. Sensors: rangefinder (track sensors), speed, position on track, rotation speed of wheels, RPM, angle with the track. Quite realistic for self-driving car research.
- 21. Deep Learning Recipe. Game input (state s: rangefinder, speed, position on track, rotation speed of wheels, RPM, angle with the track) and reward feed a deep neural network, which outputs the controls: steer, gas pedal, brake. Compute the optimal policy pi via policy gradient.
- 22. Design of the Reward Function. The obvious choice is the highest velocity of the car: R = V_car * cos(theta). However, experience showed that learning is not very stable with this. Use a modified reward function instead: R = V_x*cos(theta) - V_x*sin(theta) - V_x*|trackPos|, which encourages the car to stay in the center of the track.
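The modified reward is direct to write down (here theta is the angle between the car and the track axis, and track_pos the normalized distance from the track center, following the slide's notation):

```python
import math

def reward(vx, theta, track_pos):
    """R = Vx*cos(theta) - Vx*sin(theta) - Vx*|trackPos| (slide 22)."""
    return vx * math.cos(theta) - vx * math.sin(theta) - vx * abs(track_pos)

# Driving straight down the center line, the reward is just the speed:
print(reward(10.0, 0.0, 0.0))   # 10.0
```

The sin term penalizes velocity across the track and the |trackPos| term penalizes drifting off-center, which is what stabilizes learning compared with rewarding raw speed.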
- 23. Source code available: search Google for "DDPG Keras".
- 24. Training Set: Aalborg Track
- 25. Validation Set: Alpine Tracks. Recall from basic machine learning: make sure you test the model on the validation set, not the training set.
- 26. Learning How to Brake. Since we try to maximize the velocity of the car, the AI agent doesn't want to hit the brake at all (braking goes against the reward function). Solution: use the stochastic brake idea.
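The slide does not spell out the stochastic brake, so the sketch below is one plausible reading (an assumption, not the talk's implementation): during exploration, inject brake noise only on a small fraction of steps, so the agent still experiences braking without the brake swamping the reward signal:

```python
import random

def explore_brake(policy_brake, rng, p=0.1):
    """Brake command during training: add brake noise only p of the time.

    This is a hypothetical reading of the 'stochastic brake' idea."""
    if rng.random() < p:
        return min(1.0, policy_brake + rng.random())  # occasional random brake
    return policy_brake  # otherwise trust the policy's own brake output

rng = random.Random(0)
brakes = [explore_brake(0.0, rng) for _ in range(1000)]
# Roughly 10% of steps get a nonzero brake; the rest are left untouched.
```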
- 27. Final Demo: the car does not stay in the center of the track.
- 28. Future Application. Self-driving cars.
- 29. Future Application
- 30. Thank you! Twitter: @yanpanlau
- 31. Appendix
- 32. How to Find the Q-Function (II). Q(s, a) = r + gamma * max_{a'} Q(s', a'). We can use an iterative method to solve for the Q-function: given a transition (s, a, r, s'), we want r + gamma * max_{a'} Q(s', a') (the target) to be the same as Q(s, a) (the prediction). Treating this as a regression task, we can define a loss function: Loss = 1/2 * (r + gamma * max_{a'} Q(s', a') - Q(s, a))^2. Q is optimal when the loss function is at its minimum.
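The iterative solution can be demonstrated with tabular Q-learning on a toy one-dimensional track (the environment is invented for illustration; it is not TORCS). Each update moves Q(s, a) toward the target r + gamma*max_{a'} Q(s', a') from the slide:

```python
import random

def train(episodes=300, alpha=0.5, gamma=0.9, eps=0.3, seed=0):
    """Tabular Q-learning: states 0..4, goal at state 4 (reward 1)."""
    actions = ("left", "right")
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(5) for a in actions}
    for _ in range(episodes):
        s = 0
        while s != 4:
            # epsilon-greedy action selection
            if rng.random() < eps:
                a = rng.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s2 = min(s + 1, 4) if a == "right" else max(s - 1, 0)
            r = 1.0 if s2 == 4 else 0.0
            # target = r + gamma * max_a' Q(s', a'); no bootstrap at the goal
            bootstrap = 0.0 if s2 == 4 else gamma * max(Q[(s2, b)] for b in actions)
            Q[(s, a)] += alpha * (r + bootstrap - Q[(s, a)])
            s = s2
    return Q

Q = train()
print(Q[(3, "right")], Q[(0, "right")])
```

After training, moving toward the goal scores higher than moving away in every state, and the values decay by roughly a factor of gamma per step of distance from the goal, exactly as the recursive definition predicts.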
