2. How do we learn to do Stuff?
• When a living organism is exposed to a specific stimulus (or situation), the consequence of its response strengthens its future behaviour: the next time it is exposed to that stimulus, it is prompted to execute the same action.
• The organism's behaviour is controlled by detectable changes in the environment, that is, external signals that influence its activity. For example, our bodies can detect touch, sound, vision, etc.
• The organism's brain uses reinforcement or punishment to modify the likelihood of a behaviour. This involves voluntary behaviour, as in the following example of animal behaviour:
• A dog can be trained to jump higher when rewarded with dog treats; its behaviour is reinforced by the treats to perform specific actions.
3. With the advancements in robotic arm manipulation, Google DeepMind's AlphaGo beating a professional Go player, and recently OpenAI's bots beating professional Dota 2 players, the field of reinforcement learning has really exploded in recent years.
Before we look at how these systems were able to accomplish such feats, let's first learn about the building blocks of reinforcement learning.
Let's learn to crawl before we run!
4. A maze-like problem
The agent lives in a grid
Walls block the agent’s path
Noisy movement: actions do not always go as planned (see the sketch after this list)
80% of the time, the action North takes the agent North
(if there is no wall there)
10% of the time, North takes the agent West; 10% East
If there is a wall in the direction the agent would have
been taken, the agent stays put
The agent receives rewards each time step
Small “living” reward each step (can be negative)
Big rewards come at the end (good or bad)
Goal: maximize sum of rewards
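A minimal sketch of how this noisy movement could be simulated. The grid size, wall set, and function names are illustrative assumptions; only the 80/10/10 noise model for North comes from the slide:

```python
import random

MOVES = {"North": (0, 1), "South": (0, -1), "East": (1, 0), "West": (-1, 0)}
# Per-action noise model: intended direction -> (possible directions, probabilities)
NOISE = {"North": (["North", "West", "East"], [0.8, 0.1, 0.1])}

def step(pos, action, walls, width, height):
    directions, probs = NOISE[action]
    actual = random.choices(directions, weights=probs)[0]  # sample the real direction
    dx, dy = MOVES[actual]
    nx, ny = pos[0] + dx, pos[1] + dy
    # A wall (or the grid edge) in the way means the agent stays put.
    if (nx, ny) in walls or not (0 <= nx < width and 0 <= ny < height):
        return pos
    return (nx, ny)

print(step((1, 1), "North", walls={(1, 2)}, width=4, height=3))
```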
6. • We need to travel from point A to point B
• Each segment is labelled with its travel time in minutes; A to C takes 4 minutes
• The shortest path in this problem is A-C-D-E-G-H-B
• This is a deterministic problem
• Now let's introduce some traffic, with a probability attached to each path
• There is a 25% chance it will take 10 minutes and a 75% chance it will take 3 minutes to reach point C from point A (simulated in the sketch below); the other segments have similar probabilities
• If we run the simulation multiple times, the shortest-time path will differ from iteration to iteration because of the randomness the traffic introduces into the system. This is called a stochastic process
• Finding the shortest-time route is no longer straightforward. In the real world we may not even know these probabilities. Our goal now is to find the most probable shortest path.
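A quick Monte Carlo sketch of the A-to-C segment described above; the probabilities are the ones from the slide, while the function name is illustrative:

```python
import random

def segment_time_a_to_c():
    # 25% chance of traffic (10 minutes), 75% chance of a clear road (3 minutes)
    return 10 if random.random() < 0.25 else 3

# Averaging many simulated trips approaches 0.25*10 + 0.75*3 = 4.75 minutes
trips = [segment_time_a_to_c() for _ in range(100_000)]
print(sum(trips) / len(trips))  # ~4.75
```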
8. A simple example of the above system:
Imagine a baby is given a TV remote control at your home.
The baby (agent) will first observe the TV and its state (whether it is on or off, which channel it shows, etc.).
Then the curious baby will take certain actions, like hitting the remote control (action), and observe how the TV responds (next state).
As a non-responding TV is dull, the baby dislikes it (receiving a negative reward) and will take fewer of the actions that lead to such a result (updating the policy), and vice versa.
The baby will repeat the process until he/she finds a policy (what to do under different circumstances) that he/she is happy with (maximizing the total (discounted) rewards). This loop is sketched below.
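The baby-and-remote story is exactly the standard agent-environment loop. Here is a toy version of it; the "TV" environment, button numbering, and update rule are all illustrative assumptions, not a specific implementation:

```python
import random

# A toy "TV": pressing button 1 turns it on (reward +1); pressing
# button 0 leaves it dull (reward -1).
def tv_response(action):
    return 1 if action == 1 else -1

prefs = {0: 0.0, 1: 0.0}   # the baby's running estimate of each action's value
for step in range(200):
    # Mostly repeat what worked (exploit), sometimes try something new (curiosity)
    if random.random() < 0.2:
        action = random.choice([0, 1])
    else:
        action = max(prefs, key=prefs.get)
    reward = tv_response(action)                     # observe the TV's response
    prefs[action] += 0.1 * (reward - prefs[action])  # update the "policy"

print(prefs)  # button 1 ends up preferred
```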
10. Reward and Policy
• The reward structure of our system depends on how and what we want our system to learn
[Figure: four gridworlds with different living rewards: R(s) = -2.0, R(s) = -0.4, R(s) = -0.03, R(s) = -0.01]
11. • We do not just want the system to greedily grab the highest reward available right now; we also want it to consider future rewards.
It leads to better strategies!
12. • Therefore, we want to:
• Maximize the sum of rewards
• Prefer rewards now over rewards later, since we are dealing with a stochastic process and we never know whether the actions we take will actually reach the rewarding target state
13. Calculating Rewards
In the picture on the left:
• The two paths are policies
• Each circle is a state and each diamond a reward
• The agent needs to decide the optimal path (or policy) so that it maximizes its total reward
• If this were a deterministic process, both paths would lead to an equal sum of rewards
• But since we are dealing with a stochastic process, we cannot count on reaching the 4th circle: the policy may not take us there
One way to model this is to exponentially decay future rewards.
𝛾 (gamma) is the decay factor, so the total reward becomes:
Total discounted reward = r₁ + 𝛾r₂ + 𝛾²r₃ + 𝛾³r₄ + 𝛾⁴r₅ + …
This equation gives us a quantitative basis to say that the agent would prefer path 1, since its total discounted reward is higher than that of the second path. (A small numeric sketch follows below.)
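A one-liner makes the discounting concrete; the two reward sequences here are made up for illustration:

```python
def discounted_return(rewards, gamma=0.9):
    # G = r_1 + gamma*r_2 + gamma^2*r_3 + ...
    return sum(r * gamma ** i for i, r in enumerate(rewards))

# Two made-up reward sequences with the same undiscounted sum:
print(discounted_return([10, 0, 0, 0]))  # 10.0  (reward arrives early)
print(discounted_return([0, 0, 0, 10]))  # 7.29  (reward arrives late)
```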
15. Q-Learning
What is Q?
• Q-value: Q(s, a) is the total discounted reward obtained when the agent takes action a in state s and then follows the optimal path afterwards (that is why we have a max over all actions in the equation below):
Q*(s, a) = E[ r + 𝛾 · max over a′ of Q*(s′, a′) ]
• Q*(s, a) is this value when the agent behaves optimally after taking action a in state s.
By having this value for all combinations of states and actions, the agent can simply pick, in every state, the action with the highest Q-value.
Example settings in the gridworld figure: each step costs -0.04, the end states give +1 or -1, and 𝛾 = 0.9.
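A minimal sketch of the tabular Q-learning update; the learning rate α and the example call are assumptions, while 𝛾 = 0.9 and the -0.04 step cost match the slide's setting:

```python
from collections import defaultdict

Q = defaultdict(float)      # Q[(state, action)] -> estimated discounted return
alpha, gamma = 0.1, 0.9     # learning rate (assumed) and discount from the slide

def q_update(s, a, r, s_next, actions):
    # Move Q(s, a) toward the one-step target r + gamma * max_a' Q(s', a')
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# e.g. paying the -0.04 living reward for stepping from one grid cell to the next:
q_update(s=(1, 1), a="North", r=-0.04, s_next=(1, 2),
         actions=["North", "South", "East", "West"])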
17. Exploration vs. Exploitation
• There is an important concept of the exploration and
exploitation trade off in reinforcement learning.
• Exploration is all about finding more information about
an environment, whereas exploitation is exploiting
already known information to maximize the rewards.
• Real-life example: say you go to the same restaurant (which you like) every day. You are basically exploiting. If, on the other hand, you search for a new restaurant each time before choosing one, that is exploration. Exploration is very important in the search for future rewards, which might be higher than the near-term ones, i.e. you may find a restaurant even better than the one you kept exploiting. The usual mechanism for balancing the two is sketched below.
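The standard way to balance the two is ε-greedy action selection: explore with probability ε, exploit otherwise. A sketch, reusing the Q table shape from the earlier Q-learning example (an assumption, not prescribed by the slide):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    # With probability epsilon, pick a random action (exploration)...
    if random.random() < epsilon:
        return random.choice(actions)
    # ...otherwise pick the best-known action (exploitation).
    return max(actions, key=lambda a: Q[(state, a)])
```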
18. Generalization across States
• Basic Q-learning keeps a table of all Q-values
• In realistic situations, we cannot possibly learn about every single state:
• Too many states to visit them all in training
• Too many states to hold the Q-table in memory
• Instead, we want to generalize:
• Learn about some small number of training states from experience
• Generalize that experience to new, similar situations
• This is a fundamental idea in machine learning, and we'll see it over and over again
19. Generalization Example 1: Flappy Bird
State space:
• Discretized vertical distance from the lower pipe
• Discretized horizontal distance from the next pair of pipes
• Life: dead or alive
Actions:
• Flap
• Do nothing
Rewards:
• +1 if Flappy Bird is still alive
• -1000 if Flappy Bird is dead
Training: 6-7 hours of Q-learning
(A sketch of the state discretization follows below.)
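A sketch of how the continuous pixel distances might be discretized into a small state space; the bucket size and function name are illustrative assumptions:

```python
def discretize(dx_pixels, dy_pixels, alive, bucket=10):
    # Collapse raw pixel distances into coarse buckets so the Q-table stays small
    return (dx_pixels // bucket, dy_pixels // bucket, alive)

# e.g. 57 px to the next pipe pair, 23 px above the lower pipe, still alive:
print(discretize(57, 23, True))  # (5, 2, True)
```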
20. Generalization Example 2
Let's say we discover that this state is bad. In naïve Q-learning, we know nothing about this nearly identical state, or even this one!
[Figure: three almost identical Pacman states]
21. Feature-Based Representation
• Solution: describe a state using a vector of features
• Features are functions from states to real numbers (often 0/1) that capture important properties of the state
• Example features:
• Distance to closest ghost
• Distance to closest dot
• Number of ghosts
• 1 / (distance to closest dot)²
• Is Pacman in a tunnel? (0/1)
• … etc.
• Is it the exact state on this slide?
• We can also describe a q-state (s, a) with features (e.g. "the action moves closer to food")
• Now, instead of a Q-table, we have these features, on which we can train any supervised learning algorithm to learn the Q-values and hence the right actions (see the linear sketch below)
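The simplest version of this idea is a linear model, Q(s, a) = Σᵢ wᵢ · fᵢ(s, a), trained with the same one-step target as tabular Q-learning. A sketch under that assumption; the feature values, learning rate, and function names are illustrative:

```python
def q_value(w, feats):
    # Linear approximation: Q(s, a) = sum_i w_i * f_i(s, a)
    return sum(wi * fi for wi, fi in zip(w, feats))

def q_learn_step(w, feats, reward, best_next_q, alpha=0.01, gamma=0.9):
    # Same one-step target as tabular Q-learning, but the TD error now
    # updates the weights instead of a single table cell.
    error = reward + gamma * best_next_q - q_value(w, feats)
    return [wi + alpha * error * fi for wi, fi in zip(w, feats)]

# e.g. feats = [dist to closest ghost, 1/(dist to dot), in tunnel?]
w = [0.0, 0.0, 0.0]
w = q_learn_step(w, feats=[3.0, 0.5, 1.0], reward=-1.0, best_next_q=0.0)
print(w)  # every weight is nudged in proportion to its feature value
```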
22. Generalization Example 3 (play video): learning to hover a helicopter
• Data comes from various onboard sensors
• 4 actions available:
• The average angle of the main rotor blades
• Difference in angle between the front and back blades
• Difference in angle between the left and right blades
• Angle of the tail rotor
Note! The most efficient policy it found was to fly inverted!
25. AlphaGo
• In 2016, the initial version, AlphaGo Lee, beat 18-time world champion Lee Sedol
• Just a year later came AlphaGo Zero which, unlike its predecessor, was trained without any data from real human games
• It learned only by playing against itself; the 2016 version was defeated 100-0 by AlphaGo Zero
• Go has shown us that AI has started to move beyond what humans can tell it to do
• This was shown when AlphaGo played move 37: to humans, including the world champion, it was a seemingly bad move, but it turned out to be a game-changing move that led to AlphaGo's victory
Arch Link : https://applied-
27. Self-Driving Cars
Supervised-learning-based self-driving car (with a simulator)
The reinforcement learning way to do this!