5. Research Overview
● Online planning with deep dynamics models (PDDM)
○ Model Predictive Control
■ Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning (https://arxiv.org/pdf/1708.02596.pdf)
○ Ensembles for model uncertainty estimation
■ Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models (https://papers.nips.cc/paper/7725-deep-reinforcement-learning-in-a-handful-of-trials-using-probabilistic-dynamics-models.pdf)
● In a nutshell: predict the dynamics with uncertainty estimates via a bootstrap ensemble, and select actions via MPC (a minimal ensemble sketch follows below)
● The individual techniques are existing ones, but the combination is new and, the authors argue, the key contribution
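To make the bootstrap-ensemble idea concrete, here is a minimal NumPy sketch. The toy linear model class and the synthetic transition dataset are my assumptions for illustration, not the authors' code; PDDM itself ensembles neural network dynamics models. Each member is fit on its own bootstrap resample, and disagreement among members serves as the uncertainty estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, E, N = 4, 2, 5, 500          # state dim, action dim, ensemble size, #transitions

# Synthetic transition dataset (s, a) -> s' (a stand-in for real rollouts)
X = rng.normal(size=(N, S + A))    # concatenated states and actions
W_true = rng.normal(size=(S + A, S))
Y = X @ W_true + 0.1 * rng.normal(size=(N, S))

# Fit each ensemble member on its own bootstrap resample (with replacement)
members = []
for _ in range(E):
    idx = rng.integers(0, N, size=N)
    W, *_ = np.linalg.lstsq(X[idx], Y[idx], rcond=None)
    members.append(W)

def predict(s, a):
    """Ensemble mean and per-dimension std of the predicted next state."""
    x = np.concatenate([s, a])
    preds = np.stack([x @ W for W in members])  # (E, S)
    return preds.mean(axis=0), preds.std(axis=0)

mean, std = predict(np.zeros(S), np.ones(A))
print(mean, std)  # std is the ensemble's disagreement, i.e. model uncertainty
```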
6. Outline
● Learning the Dynamics
○ Challenges of model-based reinforcement learning
○ Accounting for uncertainty
○ Bootstrap ensembles
● Model Predictive Control
○ Random Shooting
○ Iterative Random-Shooting with Refinement
○ Filtering and Reward-Weighted Refinement
● PDDM
● Experimental Results
13. Model Predictive Control
Random Shooting
● Sample a number of candidate action sequences of a fixed horizon length
● Execute the action sequence that achieves the highest predicted reward (see the sketch below)
○ The expected reward of each candidate is evaluated with the learned dynamics model
○ In Model Predictive Control, only the first action is executed, and random shooting is run again at the next step
Slide from CS285 Lectures 10 and 11
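A minimal random-shooting MPC sketch in NumPy; the `dynamics` and `reward` functions below are toy stand-ins I am assuming for illustration (in PDDM they would be the learned ensemble model and the task reward):

```python
import numpy as np

rng = np.random.default_rng(0)
H, K = 15, 128                        # planning horizon, number of candidates

def dynamics(s, a):                   # toy stand-in for the learned model
    return s + 0.1 * a

def reward(s, a):                     # toy stand-in: stay near the origin
    return -np.sum(s**2, axis=-1)

def random_shooting(s0, action_dim=2):
    """Sample K action sequences, roll all out, return the best first action."""
    actions = rng.uniform(-1, 1, size=(K, H, action_dim))
    s = np.repeat(s0[None], K, axis=0)
    returns = np.zeros(K)
    for t in range(H):
        returns += reward(s, actions[:, t])
        s = dynamics(s, actions[:, t])
    best = np.argmax(returns)
    return actions[best, 0]           # MPC: execute only the first action

# Closed loop: replan from the new state at every step
s = np.array([1.0, -1.0])
for _ in range(5):
    s = dynamics(s, random_shooting(s))
print(s)
```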
14. Model Predictive Control
Iterative Random-Shooting with Refinement
● Draw the candidate action sequences from the region where high rewards were obtained, progressively sharpening the sampling distribution (a CEM-style sketch follows below)
○ Sampling is repeated several times before the final action sequence is chosen
Image from CS285 Lecture 10 slide
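A minimal sketch of the iterative refinement, assuming a CEM-style rule (refit a Gaussian to the elite, highest-return sequences and resample). The exact refinement rule, the constants, and the toy `dynamics`/`reward` stand-ins are my assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
H, K, A, ITERS, N_ELITE = 15, 128, 2, 4, 16

def dynamics(s, a):                   # toy stand-ins, as in the previous sketch
    return s + 0.1 * a

def reward(s, a):
    return -np.sum(s**2, axis=-1)

def rollout_returns(s0, actions):
    """Predicted return of each candidate sequence under the (toy) model."""
    s = np.repeat(s0[None], actions.shape[0], axis=0)
    returns = np.zeros(actions.shape[0])
    for t in range(actions.shape[1]):
        returns += reward(s, actions[:, t])
        s = dynamics(s, actions[:, t])
    return returns

def cem_plan(s0):
    """Iteratively narrow a Gaussian over action sequences onto high-reward regions."""
    mu, sigma = np.zeros((H, A)), np.ones((H, A))
    for _ in range(ITERS):
        actions = rng.normal(mu, sigma, size=(K, H, A)).clip(-1, 1)
        returns = rollout_returns(s0, actions)
        elites = actions[np.argsort(returns)[-N_ELITE:]]   # top-N sequences
        mu = elites.mean(axis=0)
        sigma = elites.std(axis=0) + 1e-3                  # keep some exploration
    return mu[0]                      # MPC: first action of the refined mean

print(cem_plan(np.array([1.0, -1.0])))
```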
15. Model Predictive Control
Filtering and Reward-Weighted Refinement
● Take correlations between time steps into account when sampling action sequences, and update the narrowing sampling distribution more effectively by taking all samples into account (a sketch follows below)
[Figure annotations: update the distribution with reward-based weighting; correlation between time steps (?) handled via filtering]
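A minimal sketch in the spirit of this slide: the sampling noise is low-pass filtered across time steps so that candidate sequences are temporally correlated and smooth, and the mean sequence is updated as a reward-weighted (softmax) average over all samples rather than from an elite subset. The constants `BETA`/`GAMMA` and the toy `dynamics`/`reward` are my assumptions, not the paper's values.

```python
import numpy as np

rng = np.random.default_rng(0)
H, K, A = 15, 128, 2
BETA, GAMMA = 0.6, 10.0               # noise filter coefficient, reward temperature

def dynamics(s, a):                   # toy stand-ins, as in the previous sketches
    return s + 0.1 * a

def reward(s, a):
    return -np.sum(s**2, axis=-1)

def filtered_reward_weighted_plan(s0, mu):
    # Time-correlated noise: n_t = BETA * u_t + (1 - BETA) * n_{t-1}
    u = rng.normal(size=(K, H, A))
    n = np.zeros_like(u)
    n[:, 0] = u[:, 0]
    for t in range(1, H):
        n[:, t] = BETA * u[:, t] + (1 - BETA) * n[:, t - 1]
    actions = (mu[None] + n).clip(-1, 1)

    # Roll out every candidate under the (toy) model
    s = np.repeat(s0[None], K, axis=0)
    returns = np.zeros(K)
    for t in range(H):
        returns += reward(s, actions[:, t])
        s = dynamics(s, actions[:, t])

    # Soft update: every sample contributes, weighted by exp(GAMMA * return)
    w = np.exp(GAMMA * (returns - returns.max()))
    w /= w.sum()
    mu = np.einsum('k,kha->ha', w, actions)
    return mu[0], mu                  # execute first action, reuse mean next step

mu = np.zeros((H, A))
s = np.array([1.0, -1.0])
for _ in range(5):
    a, mu = filtered_reward_weighted_plan(s, mu)
    s = dynamics(s, a)
    mu = np.roll(mu, -1, axis=0); mu[-1] = 0.0   # time-shift the mean for the next step
print(s)
```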
● The model must have enough capacity to represent the complex dynamical system.
● The use of ensembles is helpful, especially earlier in training, when non-ensembled models can overfit badly and thus exhibit overconfident and harmful behavior.
● There is not much difference between resetting model weights randomly at each training iteration versus warmstarting them from their previous values.
● Using a planning horizon that is either too long or too short can be detrimental: short horizons lead to greedy planning, while long horizons suffer from compounding errors in the predictions.
● PDDM, with action smoothing and soft updates, greatly outperforms the others.
● Medium values provide the best balance of dimensionality reduction and smooth integration of action samples versus loss of control authority; too soft a weighting leads to minimal movement of the hand, and too hard a weighting leads to aggressive behaviors that frequently drop the objects.
● We confirm that most of the prior methods do in fact succeed, and we also see that even on this simpler task, policy gradient approaches such as NPG require prohibitively large amounts of data.
● When we increase the number of possible goals to 8 different options (90° and 45° rotations in the left, right, up, and down directions), we see that our method still succeeds, but the model-free approaches get stuck in local optima and are unable to fully achieve even the previously attainable goals. This inability to effectively address a "multi-task" or "multi-goal" setup is indeed a known drawback for model-free approaches, and it is particularly pronounced in such goal-conditioned tasks that require flexibility.
● These additional goals do not make the task harder for PDDM, because even in learning 90° rotations, it is building a model of its interactions rather than specifically learning to get to those angles.
● Prior model-based approaches don't actually solve this task (values below the grey line correspond to holding the pencil still near the middle of the paper).
● This task is particularly challenging due to the inter-object interactions, which can lead to drastically discontinuous dynamics and frequent failures from dropping the objects. We were unable to get the other model-based or model-free methods to succeed at this task (Figure 8), but PDDM solves it using just 100,000 data points, or 2.7 hours' worth of data.
● Moving a single ball to a goal location in the hand, posing the hand, and performing clockwise rotations instead of the learned counter-clockwise ones.