Introduction of Reinforcement Learning
Artificial Intelligence
• What is intelligence?
→ The ability to understand more abstract information.
• What is artificial intelligence?
→ Research that tries to reproduce this phenomenon of intelligence artificially.
Artificial Intelligence & Machine Learning
Deep Learning in RL
• In deep RL, deep learning is used only as one module.
• Strengths of deep learning:
1. The only family of algorithms whose feature learning becomes more abstract as more layers are stacked.
2. A universal function approximator.
Universal Function Approximator
What is Reinforcement Learning?
•Supervised Learning :
y = f(x)
•Unsupervised Learning :
x ~ p(x) or x = f(x)
•Reinforcement Learning :
Find a policy p(a|s) that maximizes the expected sum of rewards
Machine Learning
Example of Supervised Learning :
Polynomial Curve Fitting
Trendline in Microsoft Excel 2007
Example of Unsupervised Learning :
Clustering
http://www.frankichamaki.com/data-driven-market-segmentation-more-effective-marketing-to-segments-using-ai/
Example of Reinforcement Learning :
Optimal Control Problem
http://graveleylab.cam.uchc.edu/WebData/mduff/older_papers.html
https://studywolf.wordpress.com/2015/03/29/reinforcement-learning-part-3-egocentric-learning/
Markov Decision Processes(MDP)
• Discrete state space : S = { A , B }
• State transition probability : P(S’|S) = { P(A|A) = 0.7 , P(B|A) = 0.3 , P(A|B) = 0.5 , P(B|B) = 0.5 }
• Purpose : Finding a steady state distribution
Markov Processes
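As an aside (not part of the original slides), the steady-state distribution of this two-state chain can be found by repeatedly applying the transition matrix; a minimal Python sketch using the probabilities above:

```python
import numpy as np

# Transition matrix P[i][j] = P(S'=j | S=i), states ordered [A, B]
P = np.array([[0.7, 0.3],
              [0.5, 0.5]])

# Power iteration: start from any distribution and apply P until it stops changing.
dist = np.array([1.0, 0.0])
for _ in range(100):
    dist = dist @ P

print(dist)  # ~[0.625, 0.375]: the steady-state distribution of the chain
```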
• Discrete state space : S = { A , B }
• State transition probability : P(S’|S) = { P(A|A) = 0.7 , P(B|A) = 0.3 , P(A|B) = 0.5 , P(B|B) = 0.5 }
• Reward function : R(S’=A) = +1 , R(S’=B) = -1
• Purpose : Finding an expected reward distribution
Markov Reward Processes
• Discrete state space : S = { A , B }
• Discrete action space : A = { X, Y }
• (Action conditional) State transition probability : P(S’|S , A) = { … }
• Reward function : R(S’=A) = +1 , R(S’=B) = -1
• Purpose : Finding an optimal policy (maximizes the expected sum of future reward)
Markov Decision Processes
•Markov decision processes :
a mathematical framework for modeling decision
making.
• MDPs are solved via dynamic programming and reinforcement learning.
• Applications : robotics, automated control, economics and manufacturing.
• Examples of MDPs :
1) AlphaGo models the game of Go as an MDP
2) Driving a car can be modeled as an MDP
3) The stock market can be modeled as an MDP
Markov Decision Processes
• Objective : Finding an optimal policy which maximizes the expected
sum of future rewards
• Algorithms
1) Planning : Exhaustive Search / Dynamic Programming
2) Reinforcement Learning : MC method / TD Learning(Q-learning , …)
Agent-Environment Interaction
Discount Factor
• Sum of future rewards in episodic tasks
→ Gt := Rt+1 + Rt+2 + Rt+3 + … + RT
• Sum of future rewards in continuing tasks
→ Gt := Rt+1 + Rt+2 + Rt+3 + … + RT + …
Gt → ∞ (diverges)
• Sum of discounted future rewards in both cases
→ Gt := Rt+1 + γRt+2 + γ²Rt+3 + … = Σk=1..∞ γ^(k−1) Rt+k (converges)
(Rt is bounded / γ : discount factor, 0 ≤ γ < 1)
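As a quick illustration (not in the original slides), the discounted return can be computed directly from a list of rewards; a minimal sketch with a hypothetical reward sequence:

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = r_1 + gamma*r_2 + gamma^2*r_3 + ... for a finite reward sequence."""
    g, discount = 0.0, 1.0
    for r in rewards:
        g += discount * r
        discount *= gamma
    return g

# Hypothetical episode: three rewards of +1 followed by a terminal reward of -1
print(discounted_return([1, 1, 1, -1], gamma=0.9))  # 1 + 0.9 + 0.81 - 0.729 = 1.981
```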
Deterministic Policy, a = f(s)
State S | f(S)
A       | Y
B       | Y
Stochastic Policy, p(a|s)
State S | P(X|S) | P(Y|S)
A       | 0.3    | 0.7
B       | 0.4    | 0.6
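A minimal sketch (not from the slides) of how these two kinds of policies can be represented and sampled, with the tables above hard-coded:

```python
import random

# Deterministic policy: a lookup table a = f(s)
deterministic_policy = {"A": "Y", "B": "Y"}

# Stochastic policy: a distribution p(a|s) per state
stochastic_policy = {"A": {"X": 0.3, "Y": 0.7},
                     "B": {"X": 0.4, "Y": 0.6}}

def act_deterministic(state):
    return deterministic_policy[state]

def act_stochastic(state):
    actions, probs = zip(*stochastic_policy[state].items())
    return random.choices(actions, weights=probs)[0]

print(act_deterministic("A"))  # always 'Y'
print(act_stochastic("A"))     # 'Y' with probability 0.7, 'X' with probability 0.3
```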
• State : 5x5 grid
• Action : move in 4 directions
• Reward : +10 when reaching A, +5 when reaching B, -1 when bumping into a wall, 0 otherwise
• Discount factor : 0.9
Gridworld
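A rough environment sketch under these rules (the slide does not give the positions of A and B, so the coordinates below are hypothetical):

```python
GRID = 5
A_POS, B_POS = (0, 1), (0, 3)          # hypothetical goal cells
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """One gridworld transition: returns (next_state, reward)."""
    r, c = state
    dr, dc = MOVES[action]
    nr, nc = r + dr, c + dc
    if not (0 <= nr < GRID and 0 <= nc < GRID):   # bumped into the wall
        return state, -1.0
    if (nr, nc) == A_POS:
        return (nr, nc), 10.0
    if (nr, nc) == B_POS:
        return (nr, nc), 5.0
    return (nr, nc), 0.0

print(step((0, 2), "left"))   # ((0, 1), 10.0): stepped onto A
print(step((0, 0), "up"))     # ((0, 0), -1.0): hit the wall, stayed in place
```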
Value-based Approach
Value Function
• We will introduce a value function which tells us the expected
sum of future rewards at a given state s, following policy π.
1) State-value function
2) Action-value function
1) Policy from the state-value function
→ One-step-ahead search over all actions,
using the state transition probability (model).
2) Policy from the action-value function
→ a = f(s) = argmax_a qπ(s, a)
Optimal Control(Policy) with Value Function
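A small sketch (assumed, not from the slides) of the greedy-policy rule applied to a tabular action-value function:

```python
def greedy_action(q, state, actions):
    """a = argmax_a q(s, a) over a tabular action-value function q[(s, a)]."""
    return max(actions, key=lambda a: q[(state, a)])

# Hypothetical values for the two-state, two-action MDP above
q = {("A", "X"): 0.2, ("A", "Y"): 1.3,
     ("B", "X"): -0.5, ("B", "Y"): 0.1}
print(greedy_action(q, "A", ["X", "Y"]))  # 'Y'
```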
•Objective : Finding an optimal policy which maximizes
the expected sum of future rewards
•Algorithms
1) Planning : Exhaustive Search / Dynamic Programming
→ mean of a die roll = 1·(1/6) + 2·(1/6) + … + 6·(1/6) = 3.5
2) Reinforcement Learning : MC method / TD Learning
→ mean of a die roll = average of the faces observed over 100 throws ≈ 3.5
Planning vs Learning
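The dice analogy above can be made concrete; a small sketch (not from the slides) contrasting the exact expectation (planning with a known model) with a sample average (learning from experience):

```python
import random

# Planning: the distribution is known, so compute the expectation exactly.
exact_mean = sum(face * (1 / 6) for face in range(1, 7))

# Learning: the distribution is unknown, so estimate the mean from samples.
rolls = [random.randint(1, 6) for _ in range(100)]
sample_mean = sum(rolls) / len(rolls)

print(exact_mean)    # 3.5
print(sample_mean)   # roughly 3.5, with sampling noise
```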
Solution of the MDP : Planning
•So our remaining job is to find the optimal policy, and there are two approaches:
1) Exhaustive Search
2) Dynamic Programming
Find Optimal Policy with Exhaustive Search
If we know the one-step dynamics of the MDP,
P(s’,r|s,a), we can search exhaustively,
step by step, until the final step T.
We can then choose the optimal action path, but
this needs O(N^T) computation!
[Figure: search tree rooted at state A, branching over actions X and Y at every step, each branch landing in A (+1) or B (-1).]
Dynamic Programming
• Solving the whole problem at once repeats the same computations →
divide it into subproblems so the duplicated computations disappear.
• Computing whole paths at a time means the early part is recomputed every time →
finish the subproblem for each two-step segment, then move on to the next step.
Dynamic Programming
• We can apply DP to this problem, and the computational cost
drops to O(N²T). (But we still need to know the environment
dynamics.)
• DP is a computer-science technique that computes the final goal
value by composing cumulative partial values.
Policy Iteration
• Policy iteration consists of two simultaneous,
interacting processes.
• Policy evaluation (iterative formulation):
first, make the value function more accurate with respect to the
current policy.
• Policy improvement (greedy selection):
second, make the policy greedy with respect to
the current value function.
Policy Iteration
•How can we get the state-value function with DP?
(The state value at each state is the subproblem; the action-value function is computed similarly.)
Policy Iteration = Policy Evaluation + Policy Improvement
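A minimal sketch (assumed, not from the slides) of the iterative evaluation step for the two-state Markov Reward Process defined earlier; a policy-evaluation sweep looks the same once the policy has fixed the transition probabilities:

```python
# States and model from the Markov Reward Process slide
P = {"A": {"A": 0.7, "B": 0.3},    # P(s'|s)
     "B": {"A": 0.5, "B": 0.5}}
R = {"A": +1.0, "B": -1.0}          # reward for arriving in s'
gamma = 0.9

# Iterative evaluation: sweep the Bellman expectation backup until convergence
v = {"A": 0.0, "B": 0.0}
for _ in range(1000):
    v = {s: sum(p * (R[s2] + gamma * v[s2]) for s2, p in P[s].items())
         for s in P}

print(v)  # expected discounted return from each state
```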
Solution of the MDP : Learning
•The planning methods must know the perfect dynamics of
environment, P(s’,r|s,a)
•But typically this is very hard to know and often practically
impossible. Therefore we drop this requirement and instead estimate
the mean from samples.
•This is the point where machine learning enters.
1) Monte Carlo Methods
2) Temporal-Difference Learning
(a.k.a reinforcement learning)
Monte Carlo Methods
Starting State | Value
S1             | Average of G(S1)
S2             | Average of G(S2)
S3             | Average of G(S3)
Tabular state-value function
Monte Carlo Methods
•We need a full-length episode of
experience for each starting state.
•It is very time-consuming to wait for
the end of an episode just to update
one state.
•Not applicable to continuing tasks.
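A first-visit Monte Carlo value estimate can be sketched as follows (assumed code, not from the slides); `sample_episode` is a hypothetical function that returns a list of (state, reward) pairs, where each reward is the one received after leaving that state:

```python
from collections import defaultdict

def mc_state_values(sample_episode, num_episodes=1000, gamma=0.9):
    """Tabular MC: v(s) = average of returns G observed after first visits to s."""
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    for _ in range(num_episodes):
        episode = sample_episode()             # [(state, reward), ...] until terminal
        g, returns = 0.0, {}
        for state, reward in reversed(episode):
            g = reward + gamma * g
            returns[state] = g                 # overwriting keeps the first-visit return
        for state, g in returns.items():
            returns_sum[state] += g
            returns_cnt[state] += 1
    return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}
```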
Q-learning (Temporal Difference Learning)
Bellman Equation
(iterative formulation)
Temporal-Difference Learning
•TD learning is a combination of Monte Carlo idea and
dynamic programming (DP) idea.
•Like Monte Carlo methods, TD methods can learn directly
from raw experience without a model of the environment's
dynamics.
•Like DP, TD methods update estimates based in part on other
learned estimates, without waiting for a final outcome (they bootstrap).
Q-learning Algorithm
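The Q-learning update that the algorithm slide refers to, Q(s,a) ← Q(s,a) + α [r + γ max_a' Q(s',a') − Q(s,a)], can be sketched in a few lines of assumed tabular code (not from the slides):

```python
from collections import defaultdict

Q = defaultdict(float)            # tabular action-value function, Q[(state, action)]
alpha, gamma = 0.1, 0.9

def q_learning_update(state, action, reward, next_state, actions):
    """One off-policy TD(0) backup toward the greedy target."""
    best_next = max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    td_error = td_target - Q[(state, action)]
    Q[(state, action)] += alpha * td_error

# Hypothetical transition in the two-state MDP: (s=A, a=X) -> s'=B with reward -1
q_learning_update("A", "X", -1.0, "B", actions=["X", "Y"])
print(Q[("A", "X")])  # -0.1 after one update from zero-initialized Q
```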
On policy / Off policy
•On-policy : target policy = behavior policy
→ there is only one policy.
→ It can learn a stochastic policy. Ex) SARSA
•Off-policy : target policy ≠ behavior policy
→ there can be several policies.
→ Sample efficient. Ex) Q-learning
Sarsa: On-Policy TD Control
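For contrast with the Q-learning sketch above, the SARSA (on-policy) backup uses the action actually taken next instead of the greedy max; a minimal assumed sketch:

```python
def sarsa_update(Q, state, action, reward, next_state, next_action,
                 alpha=0.1, gamma=0.9):
    """On-policy TD(0): the target uses Q(s', a') for the action a' the behavior policy chose."""
    td_target = reward + gamma * Q[(next_state, next_action)]
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
```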
Eligibility Trace
•Smoothly combines the TD and MC approaches.
Comparisons
Planning + Learning
•There is only one difference between planning and learning: the
existence of a model.
•So we call planning a model-based method, and learning a
model-free method.
Deep Reinforcement Learning
• Value function approximation with deep learning
→ Large-scale or infinite-dimensional state spaces become tractable
→ Generalization through deep learning
→ This requires supervised-learning techniques and regression against a moving target, computed online.
Deep Q Networks (DQN)
• In Q-learning, a CNN is used as the action-value function for a given state:
State → CNN → Q(State, Action=0), Q(State, Action=1), …, Q(State, Action=K)
Loss function
•The Q-learning update rule
•The magnitude of the TD error is used as the loss, and the network is trained as a regression problem
TD error
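A framework-free sketch (not from the slides) of the quantity being minimized; an actual DQN would compute this in a deep-learning framework and backpropagate only through Q(s,a):

```python
import numpy as np

def dqn_loss(q_values, target_q_values, actions, rewards, gamma=0.99):
    """Squared TD error: (r + gamma * max_a' Q_target(s', a') - Q(s, a))^2, averaged over a batch.

    q_values        : (batch, num_actions) Q(s, .) from the online network
    target_q_values : (batch, num_actions) Q(s', .) from the target network
    actions, rewards: (batch,) taken actions and observed rewards
    Terminal-state masking is omitted for brevity.
    """
    q_sa = q_values[np.arange(len(actions)), actions]
    td_target = rewards + gamma * target_q_values.max(axis=1)
    td_error = td_target - q_sa
    return np.mean(td_error ** 2)
```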
Target Network
• Moving Target Regression :
the moment Q is updated, the target value changes with it → learning performance degrades
→ Target Network : a separate network with the same architecture as the DQN whose
weights are not updated during training; using it to compute Q(s’,a’) gives a fixed
target value and improves training stability
(The target network’s weights are periodically copied from the DQN)
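In practice this is just a periodic parameter copy; a minimal assumed sketch in PyTorch style, where `online_net` and `target_net` are hypothetical models with identical architecture:

```python
# Every C steps, freeze a snapshot of the online network as the regression target.
SYNC_EVERY = 10_000

def maybe_sync_target(step, online_net, target_net):
    if step % SYNC_EVERY == 0:
        target_net.load_state_dict(online_net.state_dict())  # hard copy of all weights
```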
Replay Memory
•Data inefficiency : past experience is reused through off-policy
learning (mini-batch learning)
•Data correlation : if a mini-batch is built online, the data inside it
are similar to each other (correlated)
→ learning becomes lopsided and unbalanced
→ Replay Memory : store (State, Action, Reward, Next State) tuples in a
buffer and build each mini-batch by random sampling from it
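A minimal sketch of such a buffer (an assumed implementation, not the one used in [2]):

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state) transitions."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # old transitions are evicted automatically

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the temporal correlation of online data.
        return random.sample(self.buffer, batch_size)
```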
Reward Clipping
•Reward Scale problem : the scale of the reward differs by domain, so the
magnitude of the learned Q-values varies as well.
When the Q-values to be learned have a very large variance
(e.g. -1000 to 1000), training the neural network is difficult.
→ Reward Clipping : clip the reward to the range -1 to 1
for stable learning
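A one-line version of the idea (illustrative only):

```python
def clip_reward(reward):
    """Clip raw game scores into [-1, 1] so Q-values stay on a comparable scale across games."""
    return max(-1.0, min(1.0, reward))

print(clip_reward(400))   # 1.0
print(clip_reward(-75))   # -1.0
print(clip_reward(0.3))   # 0.3
```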
Performance of DQN
• Evaluated on classic ATARI 2600 games
• Better than human players on more than half of the games
• A large improvement over the previous (linear) approach
• Fails to learn some games
(games with sparse rewards or complex, multi-stage structure)
DQN Demo
Improved DQN Models - Double DQN
•Positive bias : the max operator tends to overestimate
Q-values relative to their true values
→ Using double Q-learning prevents this effect and
allows the DQN to learn faster
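The change is only in how the target is built: action selection uses the online network, evaluation uses the target network. A small assumed sketch mirroring the DQN loss above:

```python
import numpy as np

def double_dqn_target(rewards, next_q_online, next_q_target, gamma=0.99):
    """Double DQN: pick a' with the online network, but evaluate it with the target network."""
    best_actions = next_q_online.argmax(axis=1)                            # selection: online net
    evaluated = next_q_target[np.arange(len(best_actions)), best_actions]  # evaluation: target net
    return rewards + gamma * evaluated
```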
Improved DQN Models - Prioritized Replay Memory
•Standard DQN samples uniformly from the replay memory
→ Raising the selection probability of transitions with high TD error
enables faster and more efficient learning
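A toy sketch of the sampling rule (assumed; the actual method in [5] uses a sum-tree and importance-sampling corrections, both omitted here):

```python
import numpy as np

def sample_prioritized(td_errors, batch_size=32, alpha=0.6, eps=1e-5):
    """Sample transition indices with probability proportional to |TD error|^alpha."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()
    return np.random.choice(len(td_errors), size=batch_size, p=probs)
```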
References
[1] Sutton, Richard S., and Andrew G. Barto. Reinforcement learning: An
introduction. MIT press, 1998.
[2] Mnih, Volodymyr, et al. "Human-level control through deep reinforcement
learning." Nature 518.7540 (2015): 529-533.
[3] Van Hasselt, Hado, Arthur Guez, and David Silver. "Deep Reinforcement
Learning with Double Q-Learning." AAAI. 2016.
[4] Wang, Ziyu, et al. "Dueling network architectures for deep reinforcement
learning." arXiv preprint arXiv:1511.06581 (2015).
[5] Schaul, Tom, et al. "Prioritized experience replay." arXiv preprint
arXiv:1511.05952 (2015).
Appendix
• Atari 2600 - https://www.youtube.com/watch?v=iqXKQf2BOSE
• Super MARIO - https://www.youtube.com/watch?v=qv6UVOQ0F44
• Robot Learns to Flip Pancakes - https://www.youtube.com/watch?v=W_gxLKSsSIE
• Stanford Autonomous Helicopter - Airshow #2 -
https://www.youtube.com/watch?v=VCdxqn0fcnE
• OpenAI Gym - https://gym.openai.com/envs
• Awesome RL - https://github.com/aikorea/awesome-rl
• Udacity RL course
• TensorFlow DRL - https://github.com/nivwusquorum/tensorflow-deepq
• Karpathy rldemo -
http://cs.stanford.edu/people/karpathy/convnetjs/demo/rldemo.html
Thank you.