Top-K Off-Policy Correction for a REINFORCE Recommender System
調和系 M1 織⽥智⽮ 2020/07/22
Minmin Chen, Alex Beutel, Paul Covington, Sagar Jain, Francois Belletti, and Ed Chi. Google, Inc. WSDM 2019
https://arxiv.org/abs/1812.02353
Unofficial implementation: https://github.com/awarebayes/RecNN
Google Research: https://research.google/pubs/pub47647/
4. Seminar Material
RELATED WORK
• The function-approximation component of value-based methods such as Q-learning is unstable [29]
– Convergence of the learned policy has not been studied much
– Careful hyperparameter tuning is required for stable operation
• Policy-based methods keep function approximation fairly stable, provided the learning rate is sufficiently small
[29] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning. 1928–1937.
→ The paper therefore uses REINFORCE, a policy-based method
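The core REINFORCE update behind the bullets above can be sketched as follows. This is a minimal illustration, not the paper's recommender setup: the three-action one-step bandit, the reward scheme, and the learning rate are all assumptions chosen for clarity.

```python
import math
import random

random.seed(0)

# Toy one-step task (assumption for illustration): a softmax policy over
# 3 actions, where only action 2 yields reward 1 and the others yield 0.
theta = [0.0, 0.0, 0.0]   # policy logits (parameters)
alpha = 0.5               # a small learning rate keeps the update stable

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

for _ in range(200):
    pi = softmax(theta)
    a = random.choices(range(3), weights=pi)[0]   # sample from the policy
    r = 1.0 if a == 2 else 0.0                    # observe reward
    # REINFORCE update: theta += alpha * r * grad log pi(a)
    # For a softmax policy, grad log pi(a) w.r.t. logit i is 1{i==a} - pi[i].
    for i in range(3):
        theta[i] += alpha * r * ((1.0 if i == a else 0.0) - pi[i])

# After training, the policy strongly prefers the rewarded action.
```

Because the gradient is taken through log pi rather than through a bootstrapped value estimate, the update stays well-behaved as long as alpha is small, which is the stability property the slide attributes to policy-based methods.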