Paper Introduction: No-Reward Meta Learning (RL Architecture Study Group)

Yusuke Nakata

Presentation slides for the RL Architecture Study Group

Paper Introduction
NoRML: No-Reward Meta Learning
Yusuke Nakata (D1, first-year doctoral student)
2019/05/21, Reinforcement Learning Architecture Study Group

002 /
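Proposes a meta-learning method for learning policies that can be executed in a real environment
・Train in a simulator -> adapt the policy in the real environment
・Key feature: no reward is needed when adapting to the real environment
・Outperforms MAML and Domain Randomization
・Authors: Yuxiang Yang, Ken Caluwaerts, Atil Iscen, Jie Tan, Chelsea Finn
・Implementation: https://github.com/google-research/google-research/tree/master/norml
What kind of paper is this?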

003 /
1. Introduction
2. Preliminaries
3. NO-REWARD META LEARNING (proposed method)
4. Experiments
5. Related Work
6. Summary
Outline

004 /
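・(Model-free) reinforcement learning requires a large amount of trial and error
・Trial and error is difficult to carry out in a real environment
・Reproducing real-world dynamics in a simulator is difficult
・Adapt a policy trained in a simulator to the real environment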
Introduction

005 /
Assumed setting [Tan+, 2018]

006 /
Notation
State set
Action set
State transition probability
Reward function
Trajectory
Policy
Preliminaries

007 /
Model-free Reinforcement Learning
・Loss function
Preliminaries

008 /
Model-free Reinforcement Learning
・Loss function
・Policy Gradient
・Advantage function
Preliminaries
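
The formulas for the three items above did not survive extraction. As a reference point only, the standard textbook forms are sketched below. The symbols ($\theta$, $\pi_\theta$, $\tau$, $\gamma$, $r$, $Q^{\pi}$, $V^{\pi}$) are assumptions chosen for this sketch and may differ from the notation used on the original slides.

```latex
% Loss function: negative expected discounted return of the policy
L(\theta) = -\,\mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t} \gamma^{t}\, r(s_t, a_t)\right]

% Policy gradient: gradient of the loss with respect to the policy parameters
\nabla_\theta L(\theta) = -\,\mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^{\pi}(s_t, a_t)\right]

% Advantage function: how much better an action is than the policy's average
A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t)
```

Note that the advantage is computed from rewards, which is why standard policy-gradient adaptation needs a reward signal.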

009 /
A method that uses training tasks to learn parameters that can be adapted to test tasks
Assumption: tasks share a common structure (reusable knowledge).
What is Meta Learning?

0010 /
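A method that uses training tasks to learn parameters that can be adapted to test tasks
Assumption: tasks share a common structure (reusable knowledge).
・Assumptions in NoRML
Shared across tasks: state set, action set, reward function
Differs across tasks: state transition probability (dynamics)
What is Meta Learning?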

0011 /
Model-Agnostic Meta Learning (MAML)

0012 /
Model-Agnostic Meta Learning (MAML)

0013 /
MAML on Model-free RL
Policy Gradient

0014 /
・Policy Gradient
・Update Rule
MAML on Model-free RL
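
The MAML update rule can be illustrated with a minimal sketch. To keep it self-contained, the policy-gradient loss is swapped for a toy one-dimensional quadratic loss per task, so this is not the RL setting of the slides. The function names, the task centers, and the step sizes `alpha` and `beta` are all hypothetical choices made for illustration.

```python
def inner_update(theta, center, alpha):
    """Task-specific adaptation: one gradient step on the loss (theta - center)^2."""
    grad = 2.0 * (theta - center)
    return theta - alpha * grad


def meta_gradient(theta, center, alpha):
    """Gradient of the post-adaptation loss with respect to the initial theta."""
    adapted = inner_update(theta, center, alpha)
    # Chain rule: d(adapted)/d(theta) = 1 - 2 * alpha for this quadratic loss.
    return 2.0 * (adapted - center) * (1.0 - 2.0 * alpha)


def meta_train(centers, theta=0.0, alpha=0.1, beta=0.05, steps=500):
    """Outer loop: move theta so that one inner step works well on every task."""
    for _ in range(steps):
        avg_grad = sum(meta_gradient(theta, c, alpha) for c in centers) / len(centers)
        theta -= beta * avg_grad
    return theta


# Three toy tasks whose optimal parameters are -1, 0 and 4.
task_centers = [-1.0, 0.0, 4.0]
meta_theta = meta_train(task_centers)
adapted_theta = inner_update(meta_theta, task_centers[2], alpha=0.1)
```

In this toy case the meta-learned initialization settles at the mean of the task centers, and a single inner step then moves it toward the center of whichever task is being adapted to.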

0018 /
・Learned Advantage Function
・Offset
NO-REWARD META LEARNING (proposed method)

0019 /
・Learned Advantage Function
・Offset
NO-REWARD META LEARNING (proposed method)

0020 /
・Learned Advantage Function
・Offset
NO-REWARD META LEARNING (proposed method)
Shared across all tasks
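
The two components named above can be illustrated with a minimal sketch. It assumes, as one reading of the method, that adaptation takes a single policy-gradient-style step in which the reward-based advantage is replaced by a learned advantage function over (state, action, next state) transitions, and that a learned offset is added to the meta-learned parameters. The state-independent softmax policy, the helper names (`adapt`, `toy_advantage`), and the choice to evaluate the gradient at the offset parameters are assumptions made for illustration. They are not taken from the slides or from the reference implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adapt(theta, offset, transitions, advantage_fn, alpha):
    """One adaptation step that uses transitions only, with no reward signal.

    theta and offset are lists of action logits (hypothetical parameterization).
    transitions is a list of (state, action, next_state) tuples.
    """
    shifted = [t + o for t, o in zip(theta, offset)]
    probs = softmax(shifted)
    grad = [0.0] * len(theta)
    for state, action, next_state in transitions:
        adv = advantage_fn(state, action, next_state)
        for k in range(len(theta)):
            indicator = 1.0 if k == action else 0.0
            # Gradient of log softmax probability of `action` w.r.t. logit k.
            grad[k] += adv * (indicator - probs[k])
    grad = [g / len(transitions) for g in grad]
    return [s + alpha * g for s, g in zip(shifted, grad)]


def toy_advantage(state, action, next_state):
    """Hand-written stand-in for a meta-learned advantage function."""
    return 1.0 if next_state > state else -1.0


theta = [0.0, 0.0]
offset = [0.0, 0.0]
transitions = [(0, 0, 1), (0, 1, 0)]  # action 0 moved the state forward, action 1 did not
adapted = adapt(theta, offset, transitions, toy_advantage, alpha=1.0)
```

In the actual method the advantage function and the offset would be meta-trained together with the policy parameters and shared across all tasks. Here they are fixed by hand so the adaptation step can be read in isolation.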

0024 /
Baselines compared
・MAML
・Domain Randomization
Experiments

0026 /
Experimental environments
・Point Agent with Rotation Bias
・Cartpole with Sensor Bias
・Half Cheetah with Swapped Actions
Experiments

0027 /
Point Agent with Rotation Bias
[Figure: square arena with corners (-2, 2), (2, 2), (2, -2), (-2, -2); state (0, 0), goal (1, 0); labels: action, rotation bias, next state]
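
The figure labels suggest that the agent's action is rotated by a task-specific bias before it is applied. A minimal sketch of such a transition is given below. The function name `step`, the additive update, and the clipping to the arena bounds are assumptions for illustration, since the slide text does not specify them.

```python
import math

ARENA_LIMIT = 2.0  # arena spans [-2, 2] on both axes, per the figure corners


def step(state, action, rotation_bias):
    """Rotate the 2-D action by rotation_bias (radians) and add it to the state."""
    x, y = state
    ax, ay = action
    cos_b, sin_b = math.cos(rotation_bias), math.sin(rotation_bias)
    dx = cos_b * ax - sin_b * ay
    dy = sin_b * ax + cos_b * ay
    # Clipping to the arena is an assumption, not stated in the source.
    next_x = min(max(x + dx, -ARENA_LIMIT), ARENA_LIMIT)
    next_y = min(max(y + dy, -ARENA_LIMIT), ARENA_LIMIT)
    return (next_x, next_y)


start = (0.0, 0.0)
toward_goal = (1.0, 0.0)
unbiased = step(start, toward_goal, rotation_bias=0.0)
biased = step(start, toward_goal, rotation_bias=math.pi / 2)
```

With no bias, the action toward the goal at (1, 0) reaches it in one step. With a quarter-turn bias, the same action moves the agent along the other axis, so the policy has to adapt to the unknown bias.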

0033 /
Half Cheetah with Swapped Actions

0034 /
Half Cheetah with Swapped Actions

0035 /
https://sites.google.com/view/noreward-meta-rl/
Half Cheetah with Swapped Actions

0036 /
Categories of meta reinforcement learning
- Recurrent-based: RL^2, Attentive meta learner, etc.
  - Recognizes differences between environments by memorizing episodes
- Gradient-based: NoRML (this work), MAML, etc.
  - Adapts to the environment by updating parameters with gradient methods
- Hybrid: Evolved Policy Gradient, Meta-critic network, etc.
  - A hybrid of the two above
Related Work

0037 /
Other approaches to handling changes in dynamics
- Adaptive inverse control
- Self-modeling
- Bayesian optimization
- Online system identification
Related Work

0038 /
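Proposes NoRML, which learns policies that can be executed in a real environment
Proposed components: Learned Advantage Function, Offset
・Train in a simulator -> adapt the policy in the real environment
・No reward is needed when adapting to the real environment
・Outperforms MAML and Domain Randomization
Summary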

0039 /
Bonus: AAMAS 2019 attendance report
http://www.kamishima.net/archive/MLDMAImap.pdf
