Intro to Reinforcement Learning

Introduction to Reinforcement Learning
- Utkarsh Garg
How do we learn to do Stuff?
• When a living organism is exposed to a specific stimulus (or situation), the behaviour it produced in response is strengthened, so that the next time it encounters the same stimulus it is more likely to execute that learned behaviour.
• The organism’s behaviour is shaped by detectable changes in its environment, i.e. by external signals that influence its activity. For example, our bodies can detect touch, sound, vision, etc.
• The organism’s brain uses reinforcement or punishment to modify the likelihood of a behaviour. This applies to voluntary behaviour, as in the following example from animal training:
• a dog can be trained to jump higher when rewarded with dog treats, meaning its behaviour of performing specific actions was reinforced by the treats
With advancements in robotic arm manipulation, Google DeepMind's AlphaGo beating a professional Go player, and more recently the OpenAI team beating a professional Dota 2 player, the field of reinforcement learning has really exploded in recent years.
Before we look at how these systems were able to accomplish such feats, let's first learn about the building blocks of reinforcement learning.
Let's learn to crawl before we run!
 A maze-like problem
 The agent lives in a grid
 Walls block the agent’s path
 Noisy movement: actions do not always go as planned
 80% of the time, the action North takes the agent North
(if there is no wall there)
 10% of the time, North takes the agent West; 10% East
 If there is a wall in the direction the agent would have
been taken, the agent stays put
 The agent receives rewards each time step
 Small “living” reward each step (can be negative)
 Big rewards come at the end (good or bad)
 Goal: maximize sum of rewards
Grid World
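To make the noisy-movement rule above concrete, here is a minimal Python sketch of the transition model. The 80/10/10 split and the stay-put-on-wall behaviour are from the slide; the particular grid layout and coordinate convention are illustrative assumptions.

```python
import random

# Illustrative 4x3 grid: '#' marks a wall cell, '.' is free space (assumed layout).
GRID = [
    "....",
    ".#..",
    "....",
]

# For each intended move, the agent goes that way 80% of the time and
# slips to a perpendicular direction 10% of the time each (from the slide).
SLIP = {
    "N": [("N", 0.8), ("W", 0.1), ("E", 0.1)],
    "S": [("S", 0.8), ("W", 0.1), ("E", 0.1)],
    "E": [("E", 0.8), ("N", 0.1), ("S", 0.1)],
    "W": [("W", 0.8), ("N", 0.1), ("S", 0.1)],
}
MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}

def step(state, action):
    """Sample the next state; if the move would hit a wall or the edge, stay put."""
    directions, probs = zip(*SLIP[action])
    actual = random.choices(directions, weights=probs)[0]
    dr, dc = MOVES[actual]
    r, c = state[0] + dr, state[1] + dc
    if 0 <= r < len(GRID) and 0 <= c < len(GRID[0]) and GRID[r][c] != "#":
        return (r, c)
    return state
```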
Deterministic Grid World vs. Stochastic Grid World
• We need to travel from point A to point B
• Each segment is labelled with its travel time; A to C takes 4 mins
• The shortest path in this problem is A-C-D-E-G-H-B
• This is a deterministic problem
• Let's say we introduce traffic, with some probability on each segment
• There is a 25% chance it will take 10 mins and a 75% chance it will take 3 mins to get from A to C; similar probabilities apply to the other segments
• Now, if we run the simulation multiple times, the shortest-time path will differ from one iteration to the next because of the randomness the traffic introduces into the system. This is called a stochastic process
• Finding the shortest-time route is no longer straightforward, and in the real world we may not even know these probabilities. Our goal is now to find the most probable shortest path (see the simulation sketch below)
Another Example
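A small Monte Carlo sketch of the idea above: sample each segment's travel time and count which route is faster over many runs. Only the A→C numbers (75% → 3 mins, 25% → 10 mins) come from the slide; the other segment distributions and the two candidate routes are made-up placeholders.

```python
import random

# Travel-time distribution per segment: list of (minutes, probability).
# A->C is from the slide; the rest are hypothetical placeholders.
SEGMENTS = {
    ("A", "C"): [(3, 0.75), (10, 0.25)],
    ("C", "D"): [(2, 0.50), (6, 0.50)],
    ("A", "D"): [(5, 0.90), (12, 0.10)],
    ("D", "B"): [(4, 0.80), (9, 0.20)],
}

def sample_time(segment):
    times, probs = zip(*SEGMENTS[segment])
    return random.choices(times, weights=probs)[0]

def path_time(path):
    # Total sampled time along consecutive segments of the path.
    return sum(sample_time((a, b)) for a, b in zip(path, path[1:]))

routes = [["A", "C", "D", "B"], ["A", "D", "B"]]  # two hypothetical routes
wins = [0, 0]
for _ in range(10_000):
    times = [path_time(r) for r in routes]
    wins[times.index(min(times))] += 1
print(wins)  # neither route wins every time -- that is the stochasticity
```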
Reinforcement Learning
• Reinforcement learning (RL) is an area of machine learning
concerned with how software agents ought to take actions in
an environment so as to maximize some notion of cumulative
reward.
A simple example of the above system:
 Imagine a baby is given a TV remote control at your home (environment)
 The baby (agent) will first observe the TV and its state (whether it's on or off, which channel, etc.)
 Then the curious baby will take certain actions, like hitting the remote control (action), and observe how the TV responds (next state)
 As a non-responding TV is dull, the baby dislikes it (receiving a negative reward) and will take fewer of the actions that lead to such a result (updating the policy), and vice versa
 The baby will repeat the process until he/she finds a policy (what to do under different circumstances) that he/she is happy with (maximizing the total (discounted) rewards)
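The baby-and-TV story is exactly the standard agent–environment loop of reinforcement learning. Below is a generic sketch of that loop; the reset()/step() interface is an assumption modelled on common RL environments, not something defined on the slides.

```python
def run_episode(env, policy, gamma=0.9):
    """One episode of the agent-environment loop: observe, act, receive reward, repeat."""
    state = env.reset()                  # observe the initial state
    total, discount, done = 0.0, 1.0, False
    while not done:
        action = policy(state)                       # agent picks an action
        next_state, reward, done = env.step(action)  # environment responds
        total += discount * reward                   # accumulate (discounted) reward
        discount *= gamma
        state = next_state                           # the new observation
    return total
```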
BREAKOUT
Reward and Policy
• The reward structure of our system depends on how and what we want our system to learn
(Figure: four grid worlds with different living rewards: R(s) = -2.0, R(s) = -0.4, R(s) = -0.03, R(s) = -0.01)
• We don't want the system to just greedily grab whatever highest reward it can get right now; we also want it to consider future rewards.
Why?
It leads to better strategies!
• Therefore, we want to:
• Maximize the sum of rewards
• Prefer rewards now over rewards later, since we are dealing with a stochastic process and can never be sure that the actions we take will actually lead to the target state with the reward
Calculating Rewards
In the picture on the left,
• the two paths are policies
• each circle is a state and each diamond a reward
• The agent needs to decide on the optimal path (or policy) so that it maximizes its total reward
• If this were a deterministic process, both paths would lead to an equal sum of rewards
• But since we are dealing with a stochastic process, we cannot count on reaching the 4th circle, as the policy may not take us all the way to the max reward
One way to model this is to exponentially decay future rewards, where 𝛾 (gamma) is the decay factor. The reward equation then becomes:
Total discounted reward = r_1 + 𝛾 r_2 + 𝛾² r_3 + 𝛾³ r_4 + 𝛾⁴ r_5 + …
This equation gives us a quantitative basis to say that the agent would prefer path 1, since its total discounted reward is higher than that of the second path.
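A tiny sketch that evaluates the formula above for two hypothetical reward sequences. The reward numbers are made up purely to show that, with discounting, a reward that arrives earlier is worth more than the same reward arriving later.

```python
def discounted_return(rewards, gamma=0.9):
    # r_1 + gamma*r_2 + gamma^2*r_3 + ...
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

path_1 = [0, 0, 1]        # hypothetical: reward of 1 after 3 steps
path_2 = [0, 0, 0, 0, 1]  # hypothetical: same reward, but after 5 steps
print(discounted_return(path_1))  # 0.81
print(discounted_return(path_2))  # ~0.66 -- the later reward is worth less
```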
Done with basics.
Let’s go Deeper
Q - Learning
What is Q?
• Q-value: Q(s,a) is the total discounted reward the agent collects when it takes action a in state s and then follows the optimal path afterwards (that is why there is a max over all actions in the equation below).
• And Q*(s,a) is this value under the best choice of action at state s.
By having this value for all combinations of states and actions, the agent can simply look up and take the highest-valued action in whatever state it is in.
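The equation referred to above appears on the slide only as an image; for reference, the standard relations it alludes to are the Bellman optimality equation and the tabular Q-learning update derived from it:
Q*(s, a) = R(s, a) + 𝛾 · max_a' Q*(s', a')
Q(s, a) ← Q(s, a) + α · [ r + 𝛾 · max_a' Q(s', a') − Q(s, a) ]
where s' is the next state and α is the learning rate.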
Q table
Reward | Value
1 Step | -0.04
Power  | +0.5
Mines  | -10
End    | +1 or -1
𝛾 = 0.9
Learned Q Values
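To make the learning step concrete, here is a compact tabular Q-learning sketch in Python. 𝛾 = 0.9 matches the slide; the learning rate, exploration rate, episode count, and the reset()/step() environment interface (as in the earlier loop sketch) are assumptions.

```python
import random
from collections import defaultdict

ACTIONS = ["N", "S", "E", "W"]

def q_learning(env, episodes=5000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning: learn Q[(state, action)] from sampled experience."""
    Q = defaultdict(float)                       # all Q-values start at 0
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy: explore with probability epsilon, otherwise exploit
            if random.random() < epsilon:
                action = random.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # move Q(s, a) toward the target r + gamma * max_a' Q(s', a')
            best_next = max(Q[(next_state, a)] for a in ACTIONS)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```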
Exploration Vs Exploitation
• There is an important trade-off between exploration and exploitation in reinforcement learning.
• Exploration is all about finding more information about the environment, whereas exploitation is using already-known information to maximize the rewards.
• Real-life example: say you go to the same restaurant (which you like) every day. You are basically exploiting. If, on the other hand, you search for a new restaurant every time before picking one, that is exploration. Exploration is important for finding future rewards that might be higher than the immediate ones, i.e. you may discover a restaurant even better than the one you were exploiting.
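A standard way to balance this trade-off (not named on the slide, but common practice with Q-learning) is ε-greedy action selection, usually with ε decayed over episodes so the agent explores a lot early on and mostly exploits later. A minimal sketch:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """Explore with probability epsilon, otherwise pick the best-known action."""
    if random.random() < epsilon:
        return random.choice(actions)                            # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))    # exploit

def decayed_epsilon(episode, start=1.0, end=0.05, decay=0.995):
    """Shrink epsilon each episode: lots of exploration early, exploitation later."""
    return max(end, start * (decay ** episode))
```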
Generalization across States
• Basic Q-Learning keeps a table of all q-values
• In realistic situations, we cannot possibly learn about every
single state!
• Too many states to visit them all in training
• Too many states to hold the q-tables in memory
• Instead, we want to generalize:
• Learn about some small number of training states from
experience
• Generalize that experience to new, similar situations
• This is a fundamental idea in machine learning, and we’ll see it
over and over again
State space
• Discretized vertical distance from lower pipe
• Discretized horizontal distance from next pair of pipes
• Life: Dead or Living
Actions
• Click
• Do nothing
Rewards
• +1 if Flappy Bird still alive
• -1000 if Flappy Bird is dead
• 6-7 hours of Q-learning
Generalization Example 1
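As a sketch of how this state space might be turned into Q-table keys: the three state components and the reward scheme below come from the slide, while the 10-pixel bin size and the function names are assumptions.

```python
def discretize(vert_dist, horiz_dist, alive, bin_size=10):
    """Map raw pixel distances to coarse bins so the Q-table stays small."""
    return (int(vert_dist // bin_size), int(horiz_dist // bin_size), bool(alive))

ACTIONS = ["click", "do_nothing"]

def reward(alive):
    # Reward scheme from the slide: +1 per step alive, -1000 on death.
    return 1 if alive else -1000
```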
Let's say we discover through experience that this state is bad:
In naïve Q-learning, we know nothing about this state:
Or even this one!
Generalization Example 2
• Solution: describe a state using a vector of features
(properties)
• Features are functions from states to real numbers (often
0/1) that capture important properties of the state
• Example features:
• Distance to closest ghost
• Distance to closest dot
• Number of ghosts
• 1 / (dist to dot)²
• Is Pacman in a tunnel? (0/1)
• …… etc.
• Is it the exact state on this slide?
• Can also describe a q-state (s, a) with features (e.g. does the action move us closer to food?)
• Now, instead of a Q table, we have these features, and we can train any supervised learning algorithm on them to learn the Q-values and hence the right actions (see the sketch below)
Feature Based Representation
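One simple way to realise the "features instead of a Q table" idea is approximate Q-learning with a linear model over the features. The sketch below assumes a hypothetical feature extractor in the spirit of the Pacman features listed above; the weights are the quantities being learned.

```python
def q_value(weights, features):
    """Q(s, a) approximated as a weighted sum of features of the (s, a) pair."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def td_update(weights, features, reward, best_next_q, alpha=0.05, gamma=0.9):
    """Nudge the weights so Q(s, a) moves toward the target r + gamma * max_a' Q(s', a')."""
    td_error = (reward + gamma * best_next_q) - q_value(weights, features)
    for name, value in features.items():
        weights[name] = weights.get(name, 0.0) + alpha * td_error * value

# Hypothetical feature vector for one (state, action) pair,
# in the spirit of the Pacman features listed above.
weights = {}
features = {"dist_to_closest_ghost": 0.2, "dist_to_closest_dot": 0.5, "in_tunnel": 0.0}
td_update(weights, features, reward=-1.0, best_next_q=0.0)
print(weights)
```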
Generalization Example 3 (play video)
4 Actions available:
• The avg angle of the
blades
• Difference in angle
between front and back
• Difference in angle
between left and right
• Angle for the tail rotor
Task:
Learn to hover
States:
• Data from various sensors
Note! The most efficient policy it
found was to fly inverted!
Going even
Deeper…
Deep Q Networks (DQN)
AlphaGo
• In 2016, the initial version, AlphaGo Lee, beat 17-time world champion Lee Sedol.
• Just a year later came AlphaGo Zero, which, unlike its predecessor, was trained without any data from real human games.
• It learned only by playing against itself. The 2016 version was defeated 100-0 by AlphaGo Zero.
• Go has shown us that AI has started to move beyond what humans can tell it to do.
• This was shown when AlphaGo played move 37. To humans, even the world champion, it was a seemingly bad move, but it turned out to be a game-changing move that led to AlphaGo's victory.
Arch link: https://applied-data.science/static/main/res/alpha_go_zero_cheat_sheet.png
AlphaGo Training Graph
Self-Driving Cars
A supervised-learning-based self-driving car (with simulator):
https://www.youtube.com/watch?v=EaY5QiZwSP4&t=1111s
The reinforcement learning way to do this:
https://wayve.ai/blog/learning-to-drive-in-a-day-with-reinforcement-learning
Landing SpaceX Rockets
https://www.youtube.com/watch?v=4_igzo4qNmQ
Thank You