Reinforcement Learning

Learning Through Interaction
Sutton:
“When an infant plays, waves its arms, or looks about, it has no explicit
teacher, but it does have a direct sensorimotor connection to its
environment. Exercising this connection produces a wealth of
information about cause and effect, about the consequences of
actions, and about what to do in order to achieve goals”
• Reinforcement learning is a computational approach to this type of
learning. It adopts an AI perspective to model learning through
interaction.
• As a single agent interacts with the system, it takes an
action. Upon this action it receives a reward and moves to
the next state. Online learning becomes plausible
Reinforcement Objective
• Learning the relation between the current situation (state) and the
action to be taken in order to optimize a “payment”
Predicting the expected future reward given the current state (s) :
1. Which actions should we take in order to maximize our gain
2. Which actions should we take in order to maximize the click rate
• The action that is taken influences the next step (“closed loop”)
• The learner has to discover which action to take (in ML terminology:
some features of the feature vector are functions of other features)
RL- Elements
• State (s) - The place where the agent is right now.
Examples:
1. A position on a chess board
2. A potential customer on a sales website
• Action (a) - An action that the agent can take while in a state.
Examples:
1. A knight captures a bishop
2. The user buys a ticket
• Reward (r) - The reward that is obtained due to the action.
Examples:
1. A better (or worse) position
2. More money or more clicks
Basic Elements (Cont)
• Policy (π) - The “strategy” by which the agent decides which action to take.
Abstractly speaking, the policy is simply a probability function over actions defined for
each state
• Episode – A sequence of states and their actions
• Vπ(s) - The value function of a state s when using policy π. Mostly it is the
expected reward (e.g. in chess, the expected final outcome of the game if we
follow a given strategy)
• V(s) - Similar to Vπ(s) without a fixed policy (the expected reward over all
possible trajectories starting from s)
• Q(s,a) - The analog of V(s): the value function over the state-action plane, for state s and action a
Examples
• Tic-Tac-Toe
• GridWorld (rewards 0, −1, 10, 5)
• We wish to find the best slot machine (best = max reward).
Strategy: Play! … and find the machine with the biggest reward (on average)
• At the beginning we pick each slot randomly
• After several steps we gain some knowledge
How do we choose which machine to play?
1. Should we always use the best machine ?
2. Should we pick it randomly?
3. Any other mechanism?
Slot Machines (n-armed bandit)
• The common trade-off
1. Always play the best machine - Exploitation
We may miss better machines due to statistical “noise”
2. Choose a machine randomly - Exploration
We don’t exploit the (currently) optimal machine
Epsilon Greedy
We exploit with probability (1 − ε) and explore with probability ε
Typically ε = 0.1 (a minimal sketch follows below)
Exploration, Exploitation & Epsilon Greedy
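A minimal sketch of ε-greedy on an n-armed bandit with sample-average value estimates; the machine payouts, step count, and ε here are illustrative, not taken from the slides:

```python
import numpy as np

def epsilon_greedy_bandit(true_means, steps=1000, eps=0.1, seed=0):
    """Run one epsilon-greedy agent on an n-armed bandit."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    q_est = np.zeros(k)      # sample-average value estimate per machine
    counts = np.zeros(k)     # number of pulls per machine
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < eps:
            a = rng.integers(k)          # explore: random machine
        else:
            a = int(np.argmax(q_est))    # exploit: current best machine
        r = rng.normal(true_means[a], 1.0)   # noisy payout
        counts[a] += 1
        q_est[a] += (r - q_est[a]) / counts[a]   # incremental mean
        total_reward += r
    return q_est, total_reward

# Example: three slot machines with hidden mean payouts
print(epsilon_greedy_bandit([0.2, 0.5, 0.9]))
```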
• Some problems (like the n-armed bandit) are “Next Best Action” problems:
1. A single given state
2. A set of options that are associated with this state
3. A reward for each action
• Sometimes we wish to learn journeys
Examples:
1. Teach a robot to go from point A to point B
2. Find the fastest way to drive home
Episodes
• Episode
1. A “time series” of states {S1, S2, S3, …, SK}
2. For each state Si there is a set of options {O1, O2, …, Oki}
3. The learning formula (the “gradient”) depends not only on the immediate
rewards but on the next state as well
Episode (Cont.)
• The observed sequence:
St, At, Rt+1, St+1, At+1, Rt+2, …, ST, AT, RT+1    (S – state, A – action, R – reward)
• We optimize our goal function (commonly maximizing the expected discounted sum of rewards):
Gt = Rt+1 + γRt+2 + γ²Rt+3 + … + γ^l Rt+l+1,   0 < γ ≤ 1 – the discount (“aging”) factor
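A tiny sketch of computing the discounted return Gt from a list of observed rewards; the reward values are made up:

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..."""
    g = 0.0
    for i, r in enumerate(rewards):
        g += (gamma ** i) * r
    return g

print(discounted_return([1, 1, 1, 10], gamma=0.9))  # rewards observed after time t
```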
Classical Example – Pole Balancing
Find the exact force to apply in order to keep the pole up.
The reward is 1 for every time step in which the pole didn’t fall.
Reinforcement Learning – Foundation
Markov Property
Pr{ St+1 = s’, Rt+1 = r | S0, A0, R1, . . . , St-1, At-1, Rt , St , At }= Pr{ St+1 = s’, Rt+1 = r | St , At }
i.e. : The current state captures the entire history
• Markov processes are fully determined by the transition matrix P
Markov Process (or Markov Chain)
A tuple <S,P> where
S - set of states (mostly finite),
P a state transition probability matrix. Namely: Pss’= P [St+1 = s’ | St = s]
Markov Decision Process -MDP
A Markov Reward Process -MRP (Markov Chain with Values)
A tuple < S,P, R, γ>
S ,P as in Markov process,
R a reward function Rs = E [Rt+1 | St = s]
γ is a discount factor, γ ∈ [0, 1] (as in Gt )
State Value Function for MRP:
v(s) = E [Gt | St = s]
MDP-cont
Bellman Eq.
• v(s) = E [Gt | St = s] = E [Rt+1 + γRt+2 +γ2 Rt+3 +... | St = s]=
E [Rt+1 + γ (Rt+2 + γRt+3+ ...) | St = s] = E [R t+1 + γG t+1 | S t = s ]
We get a recursion rule:
v(s) = E[Rt+1 + γ v(s t+1) | St = s]
Similarly we can define a value on the state-action space:
Q(s,a)= E [Gt | St = s, At =a]
MDP - MRP with a finite set of actions A
MDP-cont
• Recall - policy π is the strategy – it maps between states and actions.
π(a|s) = P [At = a | St = s]
We assume that for each time t and state s, π(·|St) is fixed (π is stationary).
Clearly, for an MDP, a given policy π modifies the reward and transition structures:
R -> Rπ, P -> Pπ
We modify V & Q accordingly:
Vπ(s) = Eπ [G t | S t = s]
Qπ(s,a) = Eπ [G t | S t = s, At =a]
Policy
• For V (or Q), the optimal value function v*, for each state s:
v*(s) = maxπ vπ(s),   π – policy
Solving the MDP ≡ finding the optimal value function!
Optimal Policy
π ≥ π’ if vπ(s) ≥ v π’(s) ∀s
Theorem
For every MDP there exists an optimal policy
Optimal Value Function
• If we know q*(s,a) we can find the optimal policy: π*(s) = argmaxₐ q*(s, a)
Optimal Value (Cont)
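A tiny sketch of extracting the greedy policy from a known q* table; the q_star dictionary and state/action names are hypothetical:

```python
def greedy_policy(q_star, states, actions):
    """pi*(s) = argmax_a q*(s, a)."""
    return {s: max(actions, key=lambda a: q_star[(s, a)]) for s in states}

# Hypothetical two-state, two-action example
q_star = {("s0", "left"): 1.0, ("s0", "right"): 2.5,
          ("s1", "left"): 0.3, ("s1", "right"): -1.0}
print(greedy_policy(q_star, ["s0", "s1"], ["left", "right"]))
```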
• Dynamic programming
• Monte Carlo
• TD methods
Objectives
Prediction – Find the value function (of a given policy)
Control – Find the optimal policy
Solution Methods
• A class of algorithms used in many applications such as graph theory
(shortest path) and bioinformatics. It relies on two essential properties:
1. The problem can be decomposed into sub-problems
2. Sub-solutions can be cached and reused
RL with an MDP satisfies both.
• We assume full knowledge of the MDP!
Prediction
Input: MDP and policy
Output: Value function vπ
Control
Input: MDP
Output: Optimal value function v*, optimal policy π*
Dynamic Programming
• Given a policy π and the MDP, we wish to find Vπ(s):
Vπ(s) = Eπ[Rt+1 + γvπ(St+1) | St = s]
• Since the policy and the MDP are known, this is a linear system of equations in the values v(s),
but solving it directly is extremely tedious! Let’s do something iterative (Sutton & Barto), as sketched below.
Prediction – Optimal Value Function
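A minimal sketch of iterative policy evaluation for a fixed policy under a known MDP, in the spirit of Sutton & Barto; the two-state chain at the bottom is an invented example:

```python
import numpy as np

def policy_evaluation(P, R, gamma=0.9, tol=1e-8):
    """Iteratively solve v = R + gamma * P v for a fixed policy.

    P : (n_states, n_states) transition matrix under the policy
    R : (n_states,) expected immediate reward under the policy
    """
    v = np.zeros(len(R))
    while True:
        v_new = R + gamma * P @ v      # Bellman expectation backup
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

# Hypothetical 2-state chain: state 0 -> state 1 -> state 0, rewards 0 and 1
P = np.array([[0.0, 1.0], [1.0, 0.0]])
R = np.array([0.0, 1.0])
print(policy_evaluation(P, R))
```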
• Following the previous algorithm, one can use a (typically greedy) step to improve the
policy, which leads to the optimal policy and value function
Policy Improvement (policy iteration)
• Policy iteration requires explicit policy updates, which can be heavy.
• Instead, we can study V* directly and obtain the policy from it.
• The idea is to apply the Bellman optimality backup repeatedly.
• Hence we can find V* iteratively (and derive the optimal policy), as sketched below.
Value Iteration
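A compact sketch of value iteration with a known model; the tiny two-state, two-action MDP is invented purely for illustration:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """P[a] is the transition matrix for action a, R[a] the reward vector.

    v*(s) = max_a ( R[a][s] + gamma * sum_s' P[a][s, s'] v*(s') )
    """
    n_actions, n_states = len(P), P[0].shape[0]
    v = np.zeros(n_states)
    while True:
        q = np.array([R[a] + gamma * P[a] @ v for a in range(n_actions)])
        v_new = q.max(axis=0)               # greedy backup over actions
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=0)  # optimal values and greedy policy
        v = v_new

# Hypothetical 2-state, 2-action MDP
P = [np.array([[1.0, 0.0], [0.0, 1.0]]),   # action 0: stay
     np.array([[0.0, 1.0], [1.0, 0.0]])]   # action 1: switch
R = [np.array([0.0, 1.0]), np.array([0.0, 0.0])]
print(value_iteration(P, R))
```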
• The formula supports online update
• Bootstrapping
• Mostly we don’t have the MDP model
DP -Remarks
• A model-free method (we don’t need the MDP model):
1. It learns from generated episodes.
2. It must complete an episode to obtain the required average.
3. It is unbiased.
• For a policy π:
S0, A0, R1, …, St ~ π
We use the empirical mean return rather than the expected return:
V(St) = V(St) + (1/N(St)) [Gt − V(St)],   N(St) – the number of visits to the state
For non-stationary cases we update differently:
V(St) = V(St) + α [Gt − V(St)]
In MC one must terminate the episode to get the value (we calculate the mean
explicitly), hence in grid problems it may work poorly.
Monte Carlo Methods
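A minimal sketch of every-visit Monte Carlo prediction using the running-mean update above; the episode format (lists of (state, reward) pairs) is an assumption of this sketch:

```python
from collections import defaultdict

def mc_prediction(episodes, gamma=1.0):
    """Every-visit Monte Carlo: average the observed returns per state.

    episodes: list of episodes, each a list of (state, reward) pairs,
              where reward is the reward received after leaving that state.
    """
    V = defaultdict(float)
    N = defaultdict(int)
    for episode in episodes:
        g = 0.0
        # Walk the episode backwards to accumulate returns G_t
        for state, reward in reversed(episode):
            g = reward + gamma * g
            N[state] += 1
            V[state] += (g - V[state]) / N[state]   # running mean
    return dict(V)

# Hypothetical episodes generated by some policy
episodes = [[("A", 0), ("B", 1)], [("A", 1), ("B", 0)]]
print(mc_prediction(episodes))
```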
• Learn the optimal policy (using Q function):
Monte Carlo Control
Temporal Difference –TD
• Motivation –Combining DP & MC
As in MC – learning from experience, no explicit MDP
As in DP – bootstrapping, no need to complete the episodes
Prediction
Recall that for MC we have V(St) = V(St) + α[Gt − V(St)],
where Gt is known only at the end of the episode.
TD –Methods (Cont.)
• The TD method only needs to wait until the next step (TD(0)):
V(St) = V(St) + α[Rt+1 + γV(St+1) − V(St)]
We can see that this leads to different targets:
MC – Gt
TD – Rt+1 + γV(St+1)
• Hence TD is a bootstrapping method.
The estimation of V for a given policy is straightforward, since the policy
chooses St+1 (a sketch follows below).
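A minimal sketch of TD(0) prediction with the update above; the transition format (s, r, s_next, done) is an assumption of this sketch:

```python
from collections import defaultdict

def td0_prediction(transitions, alpha=0.1, gamma=0.9):
    """TD(0): update after every step using the bootstrapped target.

    transitions: iterable of (s, r, s_next, done) tuples generated by a policy.
    """
    V = defaultdict(float)
    for s, r, s_next, done in transitions:
        target = r if done else r + gamma * V[s_next]
        V[s] += alpha * (target - V[s])     # move toward the TD target
    return dict(V)

# Hypothetical stream of transitions
stream = [("A", 0, "B", False), ("B", 1, "T", True),
          ("A", 0, "B", False), ("B", 1, "T", True)]
print(td0_prediction(stream))
```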
Driving Home Problem
TD Vs. MC -Summary
MC
• High variance unbiased
• Good convergence
• Easy to understand
• Low sensitivity to initial conditions
TD
• Efficiency
• Convergence to V π
• More sensitive to initial conditions
SARSA
• An on-policy method for Q-learning (update after every step):
Q(St, At) = Q(St, At) + α[Rt+1 + γQ(St+1, At+1) − Q(St, At)]
The next step is to use SARSA to develop a control algorithm as well: we learn the Q function
on-policy and update the policy toward greediness (a sketch follows below).
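A sketch of SARSA with an ε-greedy behaviour policy; the environment interface (env.reset, env.step, env.actions) is an assumed Gym-like convention, not something defined in the slides:

```python
import random
from collections import defaultdict

def sarsa(env, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """On-policy SARSA: Q(s,a) += alpha * (r + gamma*Q(s',a') - Q(s,a))."""
    Q = defaultdict(float)

    def eps_greedy(s):
        if random.random() < eps:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)
            # on-policy target: uses the action actually chosen next
            target = r if done else r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```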
On Policy Control Algorithm
Example Windy Grid-World
Q-learning – Off Policy
• Rather than learning from the action the behaviour policy actually takes next, we simply use the best
action for the next state:
Q(St, At) = Q(St, At) + α[Rt+1 + γ maxₐ Q(St+1, a) − Q(St, At)]
The control algorithm is straightforward (see the sketch below).
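The off-policy counterpart, under the same assumed environment interface as the SARSA sketch above; the target uses the max over next actions rather than the action the behaviour policy will actually take:

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """Off-policy Q-learning: Q(s,a) += alpha * (r + gamma*max_a' Q(s',a') - Q(s,a))."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # behave epsilon-greedily ...
            if random.random() < eps:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)
            # ... but learn from the best next action (off-policy)
            best_next = max(Q[(s_next, x)] for x in env.actions)
            target = r if done else r + gamma * best_next
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```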
Value Function Approx.
• Sometimes we face a large-scale RL problem
1. TD backgammon (Appendix)
2. GO – (Deep Mind)
3. Helicopter (continuous)
• Our objectives are still control & prediction, but we have a huge
number of states.
• The tabular solutions that we presented are not scalable.
• Value Function approx. will allow us to use models!!!
Value Function (Cont)
• Consider a large (or continuous) MDP. We approximate:
Vπ(s) ≈ V̂π(s, w)
Qπ(s, a) ≈ Q̂π(s, a, w),   w – the set of function parameters
• We can train them by both TD & MC.
• We can generalize values to unseen states.
Types of Approximations
1. Linear Combinations
2. Neural networks (lead to DQN)
3. Wavelet solutions
Function Approximation – Techniques
• Define a feature vector X(S) for the state S, e.g.:
Distance from target
Trend in a stock
Chess board configuration
• Training methods for w: SGD
A linear function approximator takes the form V̂π(s, w) = ⟨X(S), w⟩ (see the sketch below).
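A minimal sketch of semi-gradient TD(0) with a linear value function V̂(s, w) = ⟨x(s), w⟩; the transition stream and feature function are assumptions of this sketch:

```python
import numpy as np

def linear_td0(transitions, n_features, feature_fn, alpha=0.01, gamma=0.99):
    """Semi-gradient TD(0) with a linear value function V(s, w) = <x(s), w>.

    transitions: iterable of (s, r, s_next, done) tuples from some policy.
    feature_fn : maps a state to a feature vector x(s) of length n_features.
    """
    w = np.zeros(n_features)
    for s, r, s_next, done in transitions:
        x = feature_fn(s)
        v = w @ x
        v_next = 0.0 if done else w @ feature_fn(s_next)
        td_error = r + gamma * v_next - v
        w += alpha * td_error * x     # gradient of <x(s), w> w.r.t. w is x(s)
    return w
```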
RL -Based problems
• No supervisor, only rewards; the solutions become:
Deep -RL
Why use Deep RL?
• It allows us to find an optimal model (value/policy)
• It allows us to optimize a model
• Commonly we will use SGD
Examples
• Autonomous cars
• Atari
• Deep Mind
• TD- Gammon
Q – network
• We follow the value function approx. approach
Q(s,a,w)≈𝑄∗(s,a)
Q-Learning
• We simply follow the TD target, in a supervised manner:
Target
r + γ maxₐ′ Q(s′, a′, w)
Loss – MSE
(r + γ maxₐ′ Q(s′, a′, w) − Q(s, a, w))²
• We solve it with SGD (a sketch follows below)
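A small sketch of one SGD step on this squared TD error, using a linear Q(s, a, w) = ⟨x(s, a), w⟩ stand-in for the network so the example stays self-contained; the feature function and transition format are hypothetical:

```python
import numpy as np

def q_sgd_step(w, transition, feature_fn, actions, alpha=0.01, gamma=0.99):
    """One SGD step on the squared TD error with Q(s,a,w) = <x(s,a), w>."""
    s, a, r, s_next, done = transition
    q_sa = w @ feature_fn(s, a)
    target = r if done else r + gamma * max(w @ feature_fn(s_next, b) for b in actions)
    # d/dw (target - Q(s,a,w))^2 = -2*(target - Q)*x(s,a); fold the 2 into alpha
    w = w + alpha * (target - q_sa) * feature_fn(s, a)
    return w
```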
Q Network –Stability Issues
Divergences
• Correlation between successive samples (non-i.i.d. data)
• The policy is not necessarily stationary (which influences the Q values)
• The scale of the rewards and Q values is unknown
Deep Q-Network
Experience Replay
Replay past data with the current w.
This allows us to remove correlation in the data:
• Pick at with an (ε-)greedy policy
• Store the tuple (st, at, rt+1, st+1) in replay memory
• Sample from memory and calculate the MSE (see the buffer sketch below)
Experience Replay
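A minimal replay-buffer sketch: store transitions and sample random minibatches to break the correlation between successive samples (capacity and batch size are arbitrary choices):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions fall off the end

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # uniform random minibatch -> (approximately) decorrelated samples
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```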
DQN (Cont)
Fixed Target Q-Network
In order to handle oscillations, we calculate targets with respect to old parameters w⁻:
r + γ maxₐ′ Q(s′, a′, w⁻)
The loss becomes
(r + γ maxₐ′ Q(s′, a′, w⁻) − Q(s, a, w))²
and periodically w⁻ <- w.
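A sketch of the fixed-target idea with the same linear stand-in as above: targets are computed with the frozen parameters w⁻, which are synced to w only every few steps (the sync period is arbitrary):

```python
import numpy as np

def dqn_style_updates(transitions, n_features, feature_fn, actions,
                      alpha=0.01, gamma=0.99, sync_every=100):
    w = np.zeros(n_features)        # online parameters
    w_target = w.copy()             # frozen parameters w^-
    for step, (s, a, r, s_next, done) in enumerate(transitions):
        target = r if done else r + gamma * max(
            w_target @ feature_fn(s_next, b) for b in actions)   # target uses w^-
        q_sa = w @ feature_fn(s, a)
        w += alpha * (target - q_sa) * feature_fn(s, a)          # only w is updated
        if (step + 1) % sync_every == 0:
            w_target = w.copy()                                  # w^- <- w
    return w
```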
DQN –Summary
Many further methods:
• RewardValue
• Double DQN
• Parallel Updates
Requires another lecture
Policy Gradient
• We have discussed:
1. Function approximations
2. Algorithms in which the policy is learned through the value functions
We can instead parametrize the policy directly using parameters θ:
πθ(s, a) = P[a | s, θ]
Remark: we focus on model-free methods!
Policy Based Good & Bad
Good
Better in high dimensions
Faster convergence
Bad
Can suffer from high variance (less efficient)
Local minima
Example: Rock-Paper-Scissors
How to optimize a policy?
• We assume the policy is differentiable and work with its log-likelihood
• We further assume a Gibbs (softmax) distribution, i.e. the policy is exponential
in the preferences over state-action features:
πθ(s, a) ∝ e^(θᵀΦ(s,a))
Differentiating with respect to θ gives the score function (see the sketch below).
We can also use a Gaussian policy.
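A small sketch of a softmax (Gibbs) policy over linear preferences θᵀΦ(s, a) and its score function ∇θ log πθ(a|s) = Φ(s, a) − Σ_b πθ(b|s) Φ(s, b); the feature matrix is invented for illustration:

```python
import numpy as np

def softmax_policy(theta, phi_s):
    """pi_theta(a|s) proportional to exp(theta . phi(s,a)); phi_s is (n_actions, d)."""
    prefs = phi_s @ theta
    prefs -= prefs.max()                   # numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def log_policy_gradient(theta, phi_s, a):
    """Score function: grad_theta log pi(a|s) = phi(s,a) - E_pi[phi(s,.)]."""
    pi = softmax_policy(theta, phi_s)
    return phi_s[a] - pi @ phi_s

# Hypothetical state with 3 actions and 2 features each
phi_s = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
theta = np.zeros(2)
print(softmax_policy(theta, phi_s), log_policy_gradient(theta, phi_s, a=2))
```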
Optimize policy (Cont.)
Actor-Critic
Critic – updates the action-value function parameters w
Actor – updates the policy parameters θ in the direction suggested by the critic
• Rather than learning value functions, we learn action probabilities. Let At be the action
taken at time t:
Pr(At = a) = πt(a) = e^(Ht(a)) / Σb=1..k e^(Ht(b))
H – numerical preference
We assume a Gibbs (Boltzmann) distribution.
R̄t – the average reward until time t
Rt – the reward at time t
Ht+1(At) = Ht(At) + α (Rt − R̄t)(1 − πt(At))
Ht+1(a) = Ht(a) − α (Rt − R̄t) πt(a),   ∀a ≠ At
Gradient Bandit algorithm
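A minimal sketch of the gradient bandit updates above, with the running average reward as baseline (the bandit means, α, and step count are illustrative):

```python
import numpy as np

def gradient_bandit(true_means, steps=1000, alpha=0.1, seed=0):
    rng = np.random.default_rng(seed)
    k = len(true_means)
    H = np.zeros(k)            # numerical preferences
    avg_reward = 0.0           # baseline R̄_t
    for t in range(1, steps + 1):
        pi = np.exp(H - H.max())
        pi /= pi.sum()                         # softmax over preferences
        a = rng.choice(k, p=pi)
        r = rng.normal(true_means[a], 1.0)
        avg_reward += (r - avg_reward) / t     # running average of rewards
        # H_{t+1}(a) = H_t(a) + alpha*(R_t - R̄_t)*(1[a = A_t] - pi_t(a))
        one_hot = np.eye(k)[a]
        H += alpha * (r - avg_reward) * (one_hot - pi)
    return H, pi

print(gradient_bandit([0.2, 0.5, 0.9]))
```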
Further Reading
• Sutton & Barto: http://ufal.mff.cuni.cz/~straka/courses/npfl114/2016/sutton-bookdraft2016sep.pdf
• Pole balancing: https://www.youtube.com/watch?v=Lt-KLtkDlh8
• DeepMind papers
• David Silver – YouTube lectures and ucl.ac.uk
• TD-Backgammon
Thank you