Reinforcement Learning
Learning Through Interaction
Sutton:
“When an infant plays, waves its arms, or looks about, it has no explicit
teacher, but it does have a direct sensorimotor connection to its
environment. Exercising this connection produces a wealth of
information about cause and effect, about the consequences of
actions, and about what to do in order to achieve goals”
• Reinforcement learning is a computational approach to this type of
learning. It adopts an AI perspective to model learning through
interaction.
• As a single agent interacts with the system, it takes an
action. Upon this action it receives a reward and moves to
the next state. Online learning becomes plausible
Reinforcement Learning
Reinforcement Objective
• Learning the relation between the current situation (state) and the
action to be taken in order to optimize a “payment”
Predicting the expected future reward given the current state (s) :
1. Which actions should we take in order to maximize our gain
2. Which actions should we take in order to maximize the click rate
• The action that is taken influences the next step (“closed loop”)
• The learner has to discover which action to take (in ML terminology,
some features of the feature vector are functions of other
features)
RL- Elements
• State (s) - The place where the agent is right now.
Examples:
1. A position on a chess board
2. A potential customer on a sales website
• Action (a) - An action that the agent can take while it is in a state.
Examples:
1. A knight captures a bishop
2. The user buys a ticket
• Reward (r) - The reward that is obtained due to the action
Examples:
1. A better or worse position
2. More money or more clicks
Basic Elements (Cont)
• Policy (π) - The “strategy” by which the agent decides which action to take.
Abstractly speaking the policy is simply a probability function that is defined for
each state
• Episode – A sequence of states and their actions
• Vπ(s) - The value function of a state s when using policy π. Mostly it is the
expected reward (e.g. in chess the expected final outcome of the game if we
follow a given strategy)
• V(s) - Similar to Vπ(s) without a fixed policy (the expected reward over all
possible trajectories starting from s)
• Q(s,a) - The analog of V(s) on the state-action plane: the value function of state s and action a
Examples
• Tic Tac Toe
• GridWorld (0,-1,10,5)
• We wish to find the best slot machine (best = max reward).
Strategy
Play! … and find the machine with the biggest reward (on average)
• At the beginning we pick each slot randomly
• After several steps we gain some knowledge
How do we choose which machine to play?
1. Should we always use the best machine ?
2. Should we pick it randomly?
3. Any other mechanism?
Slot Machines – n-armed bandit
• The common trade-off
1. Always play the best machine - Exploitation
We may miss better machines due to statistical “noise”
2. Choose a machine randomly - Exploration
We don’t always take the optimal machine
Epsilon Greedy
We exploit with probability (1 − ε) and explore with probability ε
Typically ε=0.1
Exploration, Exploitation & Epsilon Greedy
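A minimal sketch of the ε-greedy rule in Python for the n-armed bandit above; the machine means, the Gaussian payout and ε = 0.1 are illustrative assumptions:

```python
import random

def epsilon_greedy_bandit(true_means, steps=1000, epsilon=0.1):
    """Estimate each machine's mean reward while mostly exploiting the current best."""
    n = len(true_means)
    estimates = [0.0] * n      # running average reward per machine
    counts = [0] * n           # number of pulls per machine
    for _ in range(steps):
        if random.random() < epsilon:                              # explore
            arm = random.randrange(n)
        else:                                                      # exploit
            arm = max(range(n), key=lambda i: estimates[i])
        reward = random.gauss(true_means[arm], 1.0)                # assumed Gaussian payout
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # incremental mean
    return estimates

# three machines with unknown means; the best one should dominate the pulls
print(epsilon_greedy_bandit([0.2, 0.5, 0.9]))
```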
• Some problems (like the n-armed bandit) are “Next Best Action” problems:
1. A single given state
2. A set of options that are associated with this state
3. A reward for each action
• Sometimes we wish to learn journeys
Examples:
1. Teach a robot to go from point A to point B
2. Find the fastest way to drive home
Episodes
• Episode
1. A “time series” of states {S1, S2, S3.. SK}
2. For each state Si there is a set of options {O1, O2, ..., Oki}
3. The learning formula (the “gradient”) depends not only on the immediate
reward but on the next state as well
Episode (Cont.)
• The observed sequence:
st , at , Rt+1 , st+1 , at+1 , Rt+2 , ... , sT , aT , RT+1     (s - state, a - action, R - reward)
• We optimize our goal function (commonly maximizing the average):
Gt = Rt+1 + γRt+2 + γ²Rt+3 + … + γ^l Rt+l+1 ,    0 < γ ≤ 1 - the discount (“aging”) factor
Classical Example
The Pole Balancing
Find the exact force to apply
in order to keep the pole up.
The reward is 1 for every time step that
the pole didn’t fall.
Reinforcement Learning – Foundation
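As a small illustration of the return Gt defined above, the sketch below sums a reward sequence with discount γ; the rewards and γ are invented for the example:

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ..."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# pole-balancing style: reward 1 for every step the pole stayed up
print(discounted_return([1, 1, 1, 1, 1], gamma=0.9))   # 1 + 0.9 + 0.81 + ...
```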
Markov Property
Pr{ St+1 = s’, Rt+1 = r | S0, A0, R1, . . . , St-1, At-1, Rt , St , At }= Pr{ St+1 = s’, Rt+1 = r | St , At }
i.e. : The current state captures the entire history
• Markov processes are fully determined by the transition matrix P
Markov Process (or Markov Chain)
A tuple <S,P> where
S - set of states (mostly finite),
P a state transition probability matrix. Namely: Pss’= P [St+1 = s’ | St = s]
Markov Decision Process -MDP
A Markov Reward Process -MRP (Markov Chain with Values)
A tuple < S,P, R, γ>
S ,P as in Markov process,
R a reward function Rs = E [Rt+1 | St = s]
γ is a discount factor, γ ∈ [0, 1] (as in Gt )
State Value Function for MRP:
v(s) = E [Gt | St = s]
MDP-cont
Bellman Eq.
• v(s) = E [Gt | St = s] = E [Rt+1 + γRt+2 +γ2 Rt+3 +... | St = s]=
E [Rt+1 + γ (Rt+2 + γRt+3+ ...) | St = s] = E [R t+1 + γG t+1 | S t = s ]
We get a recursion rule:
v(s) = E[Rt+1 + γ v(s t+1) | St = s]
Similarly we can define a value on the state-action space:
Q(s,a)= E [Gt | St = s, At =a]
MDP - MRP with a finite set of actions A
MDP-cont
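Because the Bellman equation above is linear, a small MRP can be solved in closed form, v = (I − γP)⁻¹R. A sketch with an invented two-state transition matrix and reward vector:

```python
import numpy as np

def mrp_value(P, R, gamma=0.9):
    """State values of a Markov Reward Process: solve v = R + γPv, i.e. v = (I - γP)^{-1} R."""
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P, R)

# two assumed states: from state 0 we stay or move with prob 0.5, from state 1 we return to 0
P = np.array([[0.5, 0.5],
              [1.0, 0.0]])
R = np.array([1.0, -1.0])   # expected immediate reward in each state
print(mrp_value(P, R))
```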
• Recall - policy π is the strategy – it maps between states and actions.
π(a|s) = P [At = a | St = s]
We assume that for each time t and state St, π(· | St) is fixed (π is stationary)
Clearly, for an MDP, a given policy π modifies the MDP:
R -> Rπ P->Pπ
We modify V & Q
Vπ(s) = Eπ [G t | S t = s]
Qπ(s,a) = Eπ [G t | S t = s, At =a]
Policy
• For V (or Q), the optimal value function v*, for each state s:
v*(s) = maxπ vπ(s) ,    π - policy
Solving the MDP ≡ Finding the optimal value function!
Optimal Policy
π ≥ π’ if vπ(s) ≥ v π’(s) ∀s
Theorem
For every MDP there exists an optimal policy
Optimal Value Function
• If we know 𝑞∗ (s,a) we can find the optimal policy:
Optimal Value (Cont)
• Dynamic programming
• Monte Carlo
• TD methods
Objectives
Prediction - Find the optimal function
Control – Find the optimal policy
Solution Methods
• A class of algorithms used in many applications such as graph theory
(shortest path) and bioinformatics. It has two essential properties:
1. Can be decomposed to sub solutions
2. Solutions can be cached and reused
RL-MDP satisfies both properties
• We assume a full knowledge of MDP !!!
Prediction
Input: MDP and policy
Output: Optimal Value function vπ
Control
Input: MDP
Output: Optimal Value function v* Optimal policy π *
Dynamic Programming
• Given a policy π and the MDP, we wish to find the optimal Vπ(s)
Vπ(s) = Eπ [Rt+1 + γvπ(St+1) | St = s]
• Since the policy and the MDP are known, this is a linear equation in the vi
but… extremely tedious! Let’s do something iterative (Sutton & Barto)
Prediction – Optimal Value Function
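A minimal sketch of the iterative prediction step (Sutton & Barto), assuming the MDP is supplied as arrays P[s, a, s′] and R[s, a] and the policy as probabilities pi[s, a]; all names are illustrative:

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-6):
    """Sweep V(s) <- Σ_a π(a|s) [ R(s,a) + γ Σ_s' P(s'|s,a) V(s') ] until it stops changing."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v_new = sum(pi[s, a] * (R[s, a] + gamma * P[s, a] @ V) for a in range(n_actions))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V
```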
• Following the previous algorithm, one can use a policy-improvement step (often a greedy algorithm) to improve the
policy, which will lead to an optimal value function
Policy Improvement (policy iteration)
• Policy iteration requires policy updating which can be heavy.
• We can study 𝑉∗ and obtain the policy through
• The idea is that
• Hence we can find 𝑉∗ iteratively (and derive the optimal policy)
Value Iteration
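A sketch of value iteration under the same assumed arrays P[s, a, s′] and R[s, a]; the greedy policy is read off from the converged values:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """V(s) <- max_a [ R(s,a) + γ Σ_s' P(s'|s,a) V(s') ], then act greedily."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * np.einsum('sap,p->sa', P, V)   # one-step backup for every (s, a)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)             # optimal values and the greedy policy
        V = V_new
```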
• The formula supports online update
• Bootstrapping
• Mostly we don’t have MDP
DP -Remarks
• A model-free method (we don’t need the MDP)
1. It learns from generated episodes.
2. It must complete an episode to compute the required average.
3. It is unbiased
• For a policy π
S0, A0, R1, ….., St ~ π
We use the empirical mean return rather than the expected return.
V(St) = V(St) + (1/N(t)) [ Gt − V(St) ] ,    N(t) - number of visits at time t
For non-stationary cases we update differently:
V(St) = V(St) + α [ Gt − V(St) ]
In MC one must terminate the episode to get the value (we calculate the mean
explicitly). Hence in grid problems it may work badly
Monte Carlo Methods
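A sketch of first-visit Monte Carlo prediction using the incremental-mean update above; the episodes argument is assumed to be a list of (state, reward) sequences generated under the policy:

```python
from collections import defaultdict

def mc_prediction(episodes, gamma=0.9):
    """First-visit Monte Carlo: average the observed returns G_t per state."""
    V, N = defaultdict(float), defaultdict(int)
    for episode in episodes:                 # episode = [(s0, r1), (s1, r2), ...]
        G, returns = 0.0, []
        for s, r in reversed(episode):       # accumulate G_t backwards
            G = r + gamma * G
            returns.append((s, G))
        seen = set()
        for s, G in reversed(returns):       # forward order: keep the first visit only
            if s not in seen:
                seen.add(s)
                N[s] += 1
                V[s] += (G - V[s]) / N[s]    # incremental mean, V <- V + (1/N)(G - V)
    return V
```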
• Learn the optimal policy (using Q function):
Monte Carlo Control
Temporal Difference –TD
• Motivation –Combining DP & MC
As in MC - learning from experience, no explicit MDP
As in DP - bootstrapping, no need to complete the episodes
Prediction
Recall that for MC we have
Where Gt is known only at the end of the episode.
TD –Methods (Cont.)
• The TD method needs to wait only until the next step (TD(0))
We can see that it leads to different targets:
MC- Gt
TD - Rt+1 + γ V(S t+1)
• Hence it is a Bootstrapping method
The estimation of V given a policy is straightforward since the policy
chooses St+1.
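A sketch of the TD(0) prediction update; env is an assumed environment with reset() and step(action) returning (next state, reward, done), and policy maps a state to an action:

```python
from collections import defaultdict

def td0_prediction(env, policy, episodes=500, alpha=0.1, gamma=0.9):
    """Bootstrapped update V(S_t) <- V(S_t) + α [ R_{t+1} + γ V(S_{t+1}) - V(S_t) ]."""
    V = defaultdict(float)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(policy(s))
            target = r + (0.0 if done else gamma * V[s_next])   # the TD target
            V[s] += alpha * (target - V[s])
            s = s_next
    return V
```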
Driving Home Problem
TD Vs. MC -Summary
MC
• High variance unbiased
• Good convergence
• Easy to understand
• Low sensitivity to initial conditions (i.c.)
TD
• Efficiency
• Convergence to V π
• More sensitive to i.c.
SARSA
• On-policy method for Q-learning (update after every step):
The next step is to use SARSA to develop a control algorithm as well: we
learn the Q function on-policy and update the policy toward
greediness
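A sketch of the on-policy SARSA update with an ε-greedy behaviour policy; the env interface and the action count are assumptions:

```python
import random
from collections import defaultdict

def sarsa(env, n_actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """On-policy control: Q(s,a) <- Q(s,a) + α [ r + γ Q(s',a') - Q(s,a) ]."""
    Q = defaultdict(float)

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = eps_greedy(s2)                                    # action actually taken next
            target = r + (0.0 if done else gamma * Q[(s2, a2)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q
```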
On Policy Control Algorithm
Example Windy Grid-World
Q-learning – Off Policy
• Rather than learning from the action that is actually taken next, we simply use
the best action for the next state
The control algorithm is straightforward
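A sketch of the off-policy update: it differs from SARSA only in the target, which bootstraps from the best action in the next state rather than from the action actually taken (same assumed env interface):

```python
import random
from collections import defaultdict

def q_learning(env, n_actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Off-policy control: Q(s,a) <- Q(s,a) + α [ r + γ max_a' Q(s',a') - Q(s,a) ]."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < epsilon:                            # behaviour policy explores
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda i: Q[(s, i)])
            s2, r, done = env.step(a)
            best_next = max(Q[(s2, i)] for i in range(n_actions))    # greedy (target) policy
            target = r + (0.0 if done else gamma * best_next)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```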
Value Function Approx.
• Sometimes we have a large-scale RL problem
1. TD backgammon (Appendix)
2. GO – (Deep Mind)
3. Helicopter (continuous)
• Our objectives are still control & prediction, but we have a huge
number of states.
• The tabular solutions that we presented are not scalable.
• Value Function approx. will allow us to use models!!!
Value Function (Cont)
• Consider a large (continuous ) MDP
Vπ(s) ≈ V′π(s, w)
Qπ(s,a) ≈ Q′π(s, a, w) ,    w - set of function parameters
• We can train them with both TD & MC.
• We can expand values to unseen states
Type of Approximations
1. Linear Combinations
2. Neural networks (lead to DQN)
3. Wavelet solutions
Function Approximation – techniques
• Define feature vectors X(S) for the state S, e.g.
Distance from target
Trend in stock
Chess board configuration
• Training methods for W
• SGD
Linear functions take the form: V′π = <X(S), W>
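A sketch of the linear form V′π(s, w) = <X(s), w> trained with the semi-gradient TD(0) rule; the feature map X is an assumed user-supplied function from states to NumPy vectors, and env/policy are as before:

```python
import numpy as np

def linear_td0(env, policy, X, n_features, episodes=500, alpha=0.01, gamma=0.9):
    """Semi-gradient TD(0): w <- w + α [ r + γ<X(s'),w> - <X(s),w> ] X(s)."""
    w = np.zeros(n_features)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            s2, r, done = env.step(policy(s))
            v = X(s) @ w
            v_next = 0.0 if done else X(s2) @ w
            w += alpha * (r + gamma * v_next - v) * X(s)   # SGD step on the squared TD error
            s = s2
    return w
```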
RL -Based problems
• No supervisor, only rewards; the solutions become:
Deep -RL
Why use Deep RL?
• It allows us to find an optimal model (value/policy)
• It allows us to optimize a model
• Commonly we will use SGD
Examples
• Autonomous cars
• Atari
• Deep Mind
• TD- Gammon
Q – network
• We follow the value function approx. approach
Q(s,a,w)≈𝑄∗(s,a)
Q-Learning
• We simply follow the TD target in a supervised manner:
Target:
r + γ max_a′ Q(s′, a′, w)
Loss (MSE):
( r + γ max_a′ Q(s′, a′, w) − Q(s, a, w) )²
• We solve it with SGD
Q Network –Stability Issues
Divergences
• Correlation between successive samples (non-i.i.d.)
• The policy is not necessarily stationary (influences the Q value)
• Scale of rewards and Q value is unknown
Deep –Q network
Experience Replay
Replay past data with the current w
It allows us to remove correlations in the data:
• Pick at according to a greedy algorithm
• Store in memory the tuple(st, at, rt+1, st+1 ) - Replay
• Now calculate the MSE
Experience Replay
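A sketch of the replay memory itself: transitions are stored as tuples and minibatches are sampled uniformly, which removes the correlation between successive samples. Capacity and batch size are illustrative choices:

```python
import random
from collections import deque

class ReplayBuffer:
    """Store past transitions and sample decorrelated minibatches for the Q-network update."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped automatically

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```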
DQN (Cont)
Fixed Target Q- Network
In order to handle oscillations we calculate targets with respect to old parameters w−:
r + γ max_a′ Q(s′, a′, w−)
The loss becomes
( r + γ max_a′ Q(s′, a′, w−) − Q(s, a, w) )²
and periodically w− <- w
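A sketch of how the fixed-target loss can be computed; q_net is an assumed function Q(s, ·, w) returning the vector of action values for a state, and plain NumPy stands in for whatever deep-learning framework is actually used:

```python
import numpy as np

def dqn_targets(batch, q_net, w_minus, gamma=0.99):
    """y = r + γ max_a' Q(s', a', w⁻): targets are computed with the frozen parameters w⁻."""
    return np.array([r if done else r + gamma * np.max(q_net(s2, w_minus))
                     for (s, a, r, s2, done) in batch])

def dqn_loss(batch, q_net, w, w_minus):
    """MSE between the fixed targets and Q(s, a, w); w⁻ <- w only every C steps."""
    y = dqn_targets(batch, q_net, w_minus)
    q = np.array([q_net(s, w)[a] for (s, a, r, s2, done) in batch])
    return np.mean((y - q) ** 2)
```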
DQN –Summary
Many further methods:
• RewardValue
• Double DQN
• Parallel Updates
Requires another lecture
Gradient Policy
• We have discussed:
1. Function approximations
2. Algorithms in which policy is learned through the value functions
We can parametrize the policy using parameters θ:
πθ(s, a) = P[a | s, θ]
Remark: we focus on model free!!
Policy Based Good & Bad
Good
Better in high dimensions
Faster convergence
Bad
High variance (less efficient)
Local minima
Example: Rock-Paper-Scissors
How to optimize a policy?
• We assume it is differentiable and calculate the log-likelihood
• We further assume a Gibbs distribution, i.e. the policy is an exponential in the features:
πθ(s, a) ∝ e^(−θΦ(s,a))
Differentiating with respect to θ implies:
We can also use a Gaussian policy
Optimize policy (Cont.)
Actor-Critic
Critic – updates the action-state value function parameters w
Actor – updates the policy parameters θ in the direction suggested by the critic
• Rather than learning value functions, we learn probabilities. Let At be the action
taken at time t
Pr(At = a) = πt(a) = e^(Ht(a)) / Σ_{b=1..k} e^(Ht(b)) ,    H - numerical preference
We assume a Gibbs-Boltzmann distribution
R¯t - The average until time t
Rt - The reward at time t
Ht+1(At) = Ht(At) + α (Rt − R¯t )(1 − πt(At) )
Ht+1(a) = Ht(a) − α (Rt − R¯t ) πt(a) ∀a ≠ At
Gradient Bandit algorithm
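A sketch of the gradient-bandit update above: the preferences H define a softmax policy, and H is pushed up for the chosen action and down for the others in proportion to Rt − R̄t; the arm reward distributions are assumptions:

```python
import numpy as np

def gradient_bandit(true_means, steps=1000, alpha=0.1):
    """Softmax over numerical preferences H with a running average reward as baseline."""
    k = len(true_means)
    H = np.zeros(k)                                  # numerical preferences
    baseline = 0.0                                   # R̄_t, running average reward
    for t in range(1, steps + 1):
        pi = np.exp(H) / np.exp(H).sum()             # π_t(a) = e^{H(a)} / Σ_b e^{H(b)}
        a = np.random.choice(k, p=pi)
        r = np.random.normal(true_means[a], 1.0)
        baseline += (r - baseline) / t
        H -= alpha * (r - baseline) * pi             # H(a) <- H(a) - α (R - R̄) π(a) for a ≠ A_t
        H[a] += alpha * (r - baseline)               # net effect: + α (R - R̄)(1 - π(A_t)) for A_t
    return pi, H

print(gradient_bandit([0.2, 0.5, 0.9])[0])   # probabilities should concentrate on the best arm
```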
Further Reading
• Sutton & Barto
http://ufal.mff.cuni.cz/~straka/courses/npfl114/2016/sutton-bookdraft2016sep.pdf
Pole balancing - https://www.youtube.com/watch?v=Lt-KLtkDlh8
• DeepMind papers
• David Silver –Youtube and ucl.ac.uk
• TD-Backgammon
Thank you
  • 57. Further Reading • Sutton & Barto http://ufal.mff.cuni.cz/~straka/courses/npfl114/2016/sutton- bookdraft2016sep.pdf Pole balancing - https://www.youtube.com/watch?v=Lt-KLtkDlh8 • DeepMind papers • David Silver –Youtube and ucl.ac.uk • TD-Backgammon