49. Intrinsic Rewards from Counts via State Hashing
#Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning [Tang+]
Paper overview
To obtain intrinsic rewards from ordinary state counts, rather than pseudo-counts, even in high-dimensional state spaces, the method hashes states before counting them (a minimal sketch follows below).
The paper also examines which feature extractors work well for preprocessing states before hashing.
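As a rough illustration of the idea, not the authors' released code, the Python sketch below uses SimHash (Charikar, 2002) as the hash function: a fixed random Gaussian matrix projects the state, the sign pattern of the projection gives a k-bit code, and the exploration bonus decays as β/√n with the visit count n of that code. The class name `SimHashCounter`, the code length k = 32, and the bonus coefficient β = 0.01 are assumptions chosen for the example, not values from the paper.

```python
import numpy as np

class SimHashCounter:
    """Minimal sketch of SimHash-based state counting (after Tang et al., 2017).

    States are projected with a fixed random Gaussian matrix A; the signs of
    the projection form a k-bit code phi(s) = sgn(A s). Visit counts are kept
    per code, and the intrinsic reward is beta / sqrt(n(phi(s))).
    """

    def __init__(self, state_dim: int, k: int = 32, beta: float = 0.01, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.A = rng.standard_normal((k, state_dim))  # fixed random projection
        self.beta = beta
        self.counts: dict[bytes, int] = {}

    def hash_code(self, state) -> bytes:
        # k-bit binary code: signs of the random projection, packed into bytes
        bits = (self.A @ np.asarray(state, dtype=np.float64).ravel()) > 0
        return np.packbits(bits).tobytes()

    def intrinsic_reward(self, state) -> float:
        # Increment this state's hash-bucket count, then return beta / sqrt(n).
        code = self.hash_code(state)
        self.counts[code] = self.counts.get(code, 0) + 1
        return self.beta / np.sqrt(self.counts[code])
```

In training, the bonus would simply be added to the environment reward at each step, e.g. `r = r_env + counter.intrinsic_reward(obs)`; hashing learned features (e.g. an autoencoder code) instead of raw states is the feature-extraction question the paper investigates.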
130. References, Sites, and Materials 1
Fundamentals of reinforcement learning and deep reinforcement learning
Richard S. Sutton and Andrew G. Barto. Reinforcement learning: An introduction. MIT Press, 1998.
David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In AAAI, 2016.
Ziyu Wang, Nando de Freitas, and Marc Lanctot. Dueling network architectures for deep reinforcement learning. In ICML, 2016.
Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In ICML, pages 1928–1937, 2016.
Arun Nair, Praveen Srinivasan, Sam Blackwell, Cagdas Alcicek, Rory Fearon, Alessandro De Maria, Vedavyas Panneershelvam, Mustafa Suleyman, Charles Beattie, Stig Petersen, et al. Massively parallel methods for deep reinforcement learning. arXiv preprint arXiv:1507.04296, 2015.
John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. Trust region policy optimization. In ICML, 2015.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In ICML, 2016.
Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
131. References, Sites, and Materials 2
Sparse-reward environments and curiosity-driven exploration
Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
Unity ML-Agents. https://github.com/Unity-Technologies/ml-agents
Satinder P. Singh, Andrew G. Barto, and Nuttapong Chentanez. Intrinsically motivated reinforcement learning. In NIPS, 2005.
Alexander L. Strehl and Michael L. Littman. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
Papers covered
Intrinsic rewards based on information gain from the environment
Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. VIME: Variational information maximizing exploration. In NIPS, 2016.
Bradly C. Stadie, Sergey Levine, and Pieter Abbeel. Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814, 2015.
Exploration combining pseudo-counts of states with intrinsic rewards
Marc G. Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In NIPS, pages 1471–1479, 2016.
Marc G. Bellemare, Joel Veness, and Erik Talvitie. Skip context tree switching. In ICML, pages 1458–1466, 2014.
Intrinsic rewards from counts via state hashing
Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. #Exploration: A study of count-based exploration for deep reinforcement learning. In NIPS, pages 2750–2759, 2017.
Moses S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, pages 380–388, 2002.
132. References, Sites, and Materials 3
Exploration with intrinsic rewards from densities estimated by a discriminator over observations
Justin Fu, John D. Co-Reyes, and Sergey Levine. EX2: Exploration with exemplar models for deep reinforcement learning. In NIPS, 2017.
Exploration in environments that give no extrinsic reward at all
Yuri Burda, Harri Edwards, Deepak Pathak, Amos Storkey, Trevor Darrell, and Alexei A. Efros. Large-scale study of curiosity-driven learning. arXiv preprint arXiv:1808.04355, 2018.
Curiosity-driven exploration that focuses only on what is relevant to the agent itself
Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In ICML, 2017.
Intrinsic rewards from the prediction error of distilling a randomly initialized network
Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018.
Exploration methods that return to good states saved earlier and restart from them
Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O. Stanley, and Jeff Clune. Go-Explore: A new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995, 2019.
Tim Salimans and Richard Chen. Learning Montezuma's Revenge from a single demonstration. arXiv preprint arXiv:1812.03381, 2018.
Reinforcement Learning @ NeurIPS2018. https://www.slideshare.net/yukono1/reinforcement-learning-neurips2018
2018-12-07-NeurIPS-DeepRLWorkshop-Go-Explore. http://www.cs.uwyo.edu/~jeffclune/share/2018_12_07_NeurIPS_DeepRLWorkshop_Go_Explore.pdf
133. References, Sites, and Materials 4
Other curiosity-driven exploration methods
Nikolay Savinov, Anton Raichuk, Raphael Marinier, Damien Vincent, Marc Pollefeys, Timothy Lillicrap, and Sylvain Gelly. Episodic curiosity through reachability. arXiv preprint arXiv:1810.02274, 2018.
Daniel McDuff and Ashish Kapoor. Visceral machines: Reinforcement learning with intrinsic rewards that mimic the human nervous system. arXiv preprint arXiv:1805.09975, 2018.
Sandy H. Huang, Martina Zambelli, Jackie Kay, Murilo F. Martins, Yuval Tassa, Patrick M. Pilarski, and Raia Hadsell. Learning gentle object manipulation with curiosity-driven deep reinforcement learning. arXiv preprint arXiv:1903.08542, 2019.