230915 paper summary learning to world model with language with details - public.pdf

Machine Learning Laboratory
Seungjoon Lee. 2023-09-15. sjlee1218@postech.ac.kr
Learning to Model the World
with Language
Paper Summary
1
Contents
• Introduction
• Methods
• Experiments
• ‘Dynamics’ in DynaLang
2
Caution!!!
• This is material I made to summarize a paper for my personal research meeting.
• Some of the contents may be incorrect!
• Some contributions and experiments are excluded intentionally, because they are not directly related to my research interest.
• Methods are simplified for easy explanation.
• Please send me an email if you want to contact me: sjlee1218@postech.ac.kr
(for correction or addition of materials, ideas to develop this paper, or others).
3
Intro
4
Situations
• Most language-conditioned RL methods use language only as instructions (e.g., “Pick the blue box”).
• However, language does not always match the optimal action.
• Therefore, mapping language only to actions is a weak learning signal.
5
“Put the bowls away”
Complication
• On the other hand, humans can predict the future using language.
• Humans can predict environment dynamics (e.g., “wrenches tighten nuts”).
• Humans can predict future observations (e.g., “the paper is outside”).
6
Questions & Hypothesis
• Question:
• If we let a reinforcement learning agent predict the future using language, will its performance improve?
• Hypothesis:
• Predicting future representations provides agents with a rich learning signal about how language relates to the world.
7
Contributions
• What’s done:
• DynaLang builds a language-conditioned world model, which can be trained in a self-supervised manner.
• So what?
• The self-supervised world model enables training in sparse-reward envs, and text-only pretraining without actions or task rewards.
• This shows that learning language dynamics helps build useful representations for RL.
8
Why is This New?
• DynaLang builds a language-conditioned world model which learns the dynamics of language and images.
• Previous works use language to make language-conditioned policies or to shape additional rewards.
• The world model can be trained in a self-supervised manner.
• This enables learning useful feature representations in sparse-reward envs, and text-only pretraining without actions or task rewards.
9
Methods
10
Problem Setting
• Observation: o_t = (x_t, l_t), where x_t is an image and l_t is a language token.
• An agent chooses action a_t, then the environment returns:
• reward r_{t+1},
• a flag c_{t+1} indicating whether the episode continues,
• and the next observation o_{t+1}.
• The agent’s goal is to maximize E[ Σ_{t=1}^{T} γ^{t−1} r_t ].
11
Method Outline
• DynaLang components
• World model: encodes the current image observation and language into a representation.
• RL agent: acts on the encoded representation to maximize the sum of discounted rewards.
12
Method - World Model
Outline
• World model components:
• Encoder - Decoder: learns to represent the current state.
• Sequence model: learns to predict the future state representation.
13
Method - World Model
Base model (previous work)
• DynaLang = Dreamer V3 + language.
• Dreamer V3 learns to compute compact representations of the current state, and learns how these representations change with actions.
14
Architecture of Dreamer V3
Method - World Model
Incorporation of language
• DynaLang incorporates language into the encoder-decoder of Dreamer V3.
• In this way, DynaLang gets representations that unify visual observations and language.
15
Method - World Model
Prediction of the future
• DynaLang predicts future representations using the sequence model, like Dreamer V3.
• Future representation prediction lets the agent extract information from language that relates to the dynamics of multiple modalities.
16
Method - World Model
Model Losses
• World model loss: L = L_x + L_l + L_r + L_c + L_reg + L_pred, where
• Image loss: L_x = ||x̂_t − x_t||²
• Language loss: L_l = categorical_cross_entropy(l̂_t, l_t) (one-hot tokens) or L_l = ||l̂_t − l_t||² (pretrained embeddings)
• Reward loss: L_r = (r̂_t − r_t)²
• Continue loss: L_c = binary_cross_entropy(ĉ_t, c_t)
• Regularizer: L_reg = β_reg max(1, KL[z_t || sg(ẑ_t)]), where sg is stop-gradient
• Future prediction loss: L_pred = β_pred max(1, KL[sg(z_t) || ẑ_t])
• (A code sketch combining these losses follows after this slide.)
17
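A minimal PyTorch-style sketch of how the per-timestep losses above could be combined. This is my own illustration, not the authors' implementation; the tensor shapes, the categorical form of z, and the beta coefficients are assumptions.

import torch
import torch.nn.functional as F

def kl_categorical(p, q, eps=1e-8):
    # KL[p || q] for categorical distributions given as probability tensors of shape (B, K).
    return (p * (torch.log(p + eps) - torch.log(q + eps))).sum(-1).mean()

def world_model_loss(x_hat, x, l_logits, l, r_hat, r, c_logit, c,
                     z_post, z_prior, beta_reg=0.1, beta_pred=0.5):
    loss_x = F.mse_loss(x_hat, x)                            # image reconstruction
    loss_l = F.cross_entropy(l_logits, l)                    # token prediction (l: class indices)
    loss_r = F.mse_loss(r_hat, r)                            # reward prediction
    loss_c = F.binary_cross_entropy_with_logits(c_logit, c)  # continue flag
    # max(1, KL) clipping ("free bits"), with stop-gradient (detach) on one side of each KL.
    loss_reg = beta_reg * torch.clamp(kl_categorical(z_post, z_prior.detach()), min=1.0)
    loss_pred = beta_pred * torch.clamp(kl_categorical(z_post.detach(), z_prior), min=1.0)
    return loss_x + loss_l + loss_r + loss_c + loss_reg + loss_pred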
Language into the World Model
• Questions to address:
• How is language tokenized and fed into the world model?
• What language embedding does the world model use? An embedding from a pretrained language model (LM), or an embedding learned from scratch?
• Answer:
• DynaLang uses an existing tokenizer, and feeds either pretrained embeddings or one-hot encodings into the world model.
18
Language to World Model
Pretrained LM
• DynaLang uses an existing tokenizer and a pretrained LM encoder [T5] (in all environments except HomeGrid; see the code sketch after this slide).
19
Pipeline: Sentence → T5 tokenizer → tokens → T5 encoder (fixed) → embedding in R^n → DynaLang language encoder (MLP, learnable) → embedding in R^k
[T5]: https://arxiv.org/abs/1910.10683
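A minimal sketch of the pipeline above using Hugging Face transformers. The model size ("t5-small"), the MLP dimensions, and the example sentence are assumptions, and DynaLang actually feeds one token per environment step rather than a whole sentence at once.

import torch
import torch.nn as nn
from transformers import T5Tokenizer, T5EncoderModel

tokenizer = T5Tokenizer.from_pretrained("t5-small")
t5_encoder = T5EncoderModel.from_pretrained("t5-small").eval()  # fixed (frozen)
for p in t5_encoder.parameters():
    p.requires_grad_(False)

# Learnable DynaLang language encoder (MLP): R^n -> R^k.
lang_encoder = nn.Sequential(
    nn.Linear(t5_encoder.config.d_model, 512), nn.SiLU(), nn.Linear(512, 128)
)

tokens = tokenizer("put the bowls away", return_tensors="pt").input_ids
with torch.no_grad():
    t5_emb = t5_encoder(input_ids=tokens).last_hidden_state  # (1, T, n), fixed
z_lang = lang_encoder(t5_emb)                                 # (1, T, k), learnable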
Language from World Model
Pretrained LM
20
• DynaLang trains the embedding from the decoder to be close to the embedding from the pretrained LM encoder.
• Loss = ||l_DynaLang − l_pretrained||²
Diagram: the world-model embedding z is passed through the DynaLang language decoder (MLP) to produce an embedding in R^n, which is trained to match the embedding in R^n from the pretrained LM encoder; that same LM embedding is mapped by the DynaLang language encoder (MLP) to R^k and fed to the world-model encoder.
Language to World Model
One-hot encoder
• Alternatively, DynaLang can also use a one-hot encoding with the T5 tokenizer (in the HomeGrid env; see the code sketch after this slide).
21
Pipeline: Sentence → T5 tokenizer → tokens → one-hot encoding (fixed) → DynaLang language encoder (MLP, learnable) → embedding in R^k
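A minimal sketch of the one-hot variant above, again using the Hugging Face T5 tokenizer; the MLP dimensions and the example sentence are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
vocab_size = tokenizer.vocab_size

# Learnable DynaLang language encoder (MLP) on top of fixed one-hot encodings.
lang_encoder = nn.Sequential(nn.Linear(vocab_size, 512), nn.SiLU(), nn.Linear(512, 128))

tokens = tokenizer("pick the blue box", return_tensors="pt").input_ids
one_hot = F.one_hot(tokens, num_classes=vocab_size).float()  # fixed encoding
z_lang = lang_encoder(one_hot)                                # (1, T, 128)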
Method - RL Agent
Outline
• The RL agent is a simple actor-critic agent.
• Actor: π(a_t | z_t, h_t)
• Critic: V(z_t, h_t)
• Note that the RL agent is not conditioned on language directly.
22
Method - RL Agent
Environment interaction
• The RL agent interacts with the environment using the encoded representation z_t and history h_t.
23
Method - RL Agent
Training
• Let R_t = r_t + γ c_t ((1 − λ) V(z_{t+1}, h_{t+1}) + λ R_{t+1}), the estimated discounted sum of future rewards (λ-return).
• Critic loss: L_ϕ = (V_ϕ(z_t, h_t) − R_t)²
• Actor loss: L_θ = −(R_t − V(z_t, h_t)) log π_θ(a_t | h_t, z_t), maximizing the return estimate.
• The agent is trained only on imagined rollouts generated by the world model.
• The agent is trained on its own actions and the world model's predicted states and rewards (see the code sketch after this slide).
24
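A minimal sketch of the λ-return recursion and the actor/critic losses above, computed over an imagined rollout of length T. This is an assumed illustration; details such as Dreamer V3's return normalization and entropy bonus are omitted.

import torch

def lambda_returns(rewards, continues, values, gamma=0.99, lam=0.95):
    # rewards, continues, values: tensors of shape (T,) from one imagined rollout.
    # R_t = r_t + gamma * c_t * ((1 - lam) * V_{t+1} + lam * R_{t+1})
    T = rewards.shape[0]
    returns = torch.zeros_like(rewards)
    returns[-1] = values[-1]          # bootstrap at the imagination horizon
    next_return = values[-1]
    for t in reversed(range(T - 1)):
        next_return = rewards[t] + gamma * continues[t] * (
            (1 - lam) * values[t + 1] + lam * next_return
        )
        returns[t] = next_return
    return returns

def actor_critic_losses(returns, values, log_probs):
    critic_loss = ((values - returns.detach()) ** 2).mean()
    advantage = (returns - values).detach()
    actor_loss = -(advantage * log_probs).mean()   # REINFORCE-style policy gradient
    return actor_loss, critic_loss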
Experiments
25
Diverse Types of Language
Questions
• Questions to address:
• Can DynaLang use diverse types of language along with instructions?
• If so, does it improve task performance?
26
Diverse Types of Language
Setup
• Env: HomeGrid
• a multitask grid world where agents receive task instructions in language, but also language hints.
• Agents get a reward of 1 when a task is completed, and then a new task is sampled.
• Therefore, agents must complete as many tasks as possible before the episode terminates in 100 steps.
27
HomeGrid env. Agents receive 3 types of hints.
Diverse Types of Language
Results
• Baselines: model-free off-policy algorithms, IMPALA and R2D2.
• Image embeddings and language embeddings are simply fed to the policy.
• DynaLang solves more tasks with hints, while simple language-conditioned RL gets worse with hints.
28
HomeGrid training performance after 50M steps (2 seeds)
World Model with Sparse/No Rewards
• DynaLang learns to extract features in a self-supervised manner.
• Through its encoder-decoder structure and its future-prediction objective.
• Because of this learning method, DynaLang can build useful embeddings even in environments with sparse or no rewards.
• Existing language-conditioned policy methods cannot build useful embeddings without rewards, because their language encoders are trained by the reward signal.
29
World Model with Sparse Rewards
Setup
• Env: Messenger
• a grid world where agents must deliver a message while avoiding enemies, using text manuals.
• Agents must understand the manuals and relate them to the environment to achieve a high score.
30
Messenger env. Agents get text manuals.
World Model with Sparse Rewards
Results
• EMMA is added as a baseline:
• a language- and gridworld-specific method, a model-free language-conditioned policy.
• Only DynaLang can learn from S3, the most difficult setting.
• Adding future prediction helps training more than generating actions alone.
• However, the authors do not include an ablation that removes the future-prediction loss from their architecture.
31
Messenger training performance (2 seeds). S1 is the easiest, S3 the hardest.
World Model with Sparse Rewards
Results
• DynaLang learns sparse-reward Messenger S3 (hard), outperforming EMMA.
• EMMA is a model-free architecture specially designed for the Messenger env.
• Messenger S3 is a difficult game because it has many entities, and entities share the same appearance but have different roles and movements.
32
World Model with No Rewards
Text-only pretraining
• The self-supervised objective allows text-only offline pretraining.
• By zeroing out the other, irrelevant losses and ignoring actions (see the sketch after this slide).
• Existing model-free language-conditioned methods cannot be pretrained with action-free and reward-free data.
33
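A minimal sketch of the loss-zeroing idea above. Which terms are kept is my assumption: only the language-related terms (and the latent KL terms) remain, while the image, reward, and continue losses are zeroed and actions are ignored.

def text_pretraining_loss(losses):
    # losses: dict of per-term world-model losses, e.g. keys
    # "image", "language", "reward", "continue", "reg", "pred".
    coeffs = {"image": 0.0, "language": 1.0, "reward": 0.0,
              "continue": 0.0, "reg": 1.0, "pred": 1.0}
    return sum(coeffs[k] * v for k, v in losses.items())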
World Model with No Rewards
Text-only pretraining
• Text-only pretraining of the world model improves training performance in the Messenger S2 env.
• The learned language dynamics help build useful representations for RL.
34
Figure legend (compared configurations):
• T5 tokenizer + T5 pretrained LM encoder
• T5 tokenizer + one-hot encoder (no pretraining)
• T5 tokenizer + one-hot encoder + pretraining with Messenger manuals
• T5 tokenizer + one-hot encoder + pretraining with the domain-general TinyStories dataset (short stories generated by GPT-3.5 and GPT-4)
‘Dynamics’ in DynaLang
35
What is the ‘Dynamics’ DynaLang Learns?
• World model dynamics = language dynamics + visual game dynamics.
• DynaLang learns the dynamics of language, relating it to the dynamics of the visual game.
• Evidence:
• DynaLang can generate texts.
• DynaLang can do embodied question answering.
36
Language Dynamics
Evidence 1 - text generation
• DynaLang is trained to predict the next language token on the TinyStories dataset.
• Below are examples of 10-token generations conditioned on prompts.
• The tokens are predicted by DynaLang and decoded by the T5 tokenizer.
37
Language Dynamics
Evidence 2 - embodied question answering (EQA)
• The authors introduce a new benchmark, LangRoom.
• An agent gets a question about the color of an object.
• The agent should move to the correct object and say the correct color.
• The agent should understand the object name and relate it to the visual observation.
• Action space: movement and 15 color tokens.
• However, there is already an existing EQA benchmark [EQA]…
38
[EQA]: https://arxiv.org/abs/1711.11543