Machine Learning LABoratory
Seungjoon Lee. 2023-09-22. sjlee1218@postech.ac.kr
Semantic Exploration from
Language Abstractions and
Pretrained Representations
NeurIPS 2022. Paper Summary
1
Contents
• Introduction
• Methods
• Experiments
• Conclusion
2
Caution!!!
• This is material from a paper summary I prepared for my personal research meeting.
• Some of the contents may be incorrect!
• Some experiments are excluded intentionally, because they are not directly
related to my research interest.
• Methods are simplified for easy explanation.
• Please send me an email if you want to contact me: sjlee1218@postech.ac.kr
(for correction or addition of materials, ideas to develop this paper, discussion).
3
Situations
• Novelty-based RL exploration methods incentivize exploration by using novelty
as an intrinsic reward.
• The novelty is calculated based on the degree to which an observation is new.
4
Complication
• Existing visual novelty-based methods fail on partially observable, high-dimensional
state spaces, especially in 3D environments.
• This is because semantically similar states can look very different depending on the
point of view.
5
Questions & Hypothesis
• Question:
• Can an agent recognize, for novelty-based exploration, that high-dimensional states
which are semantically similar but visually different are actually similar?
• Hypothesis:
• Language abstraction can provide a semantics-based novelty intrinsic reward,
accelerating exploration.
6
Contributions
• This paper shows that novelty calculation using language abstraction can
accelerate RL exploration because
• 1) language abstracts the state space coarsely, and
• 2) language abstracts the state space semantically.
• Furthermore, this paper shows the idea is applicable in environments
without language, using a vision-language model (VLM).
7
Methods
8
Problem Formulation
• Goal-conditioned MDP (S, A, G, P, R^e, γ).
• G: goal space. A goal is a language instruction in this paper.
• R^e : S × G → ℝ.
• 𝒪 : S → L: a language oracle, used in the proof-of-concept (PoC) experiments.
• The oracle output is never observed by the agent and is distinct from the instruction g.
• The goal-conditioned policy π_g(⋅ | s) that maximizes E[∑_{t=0}^{H} γ^t (r^e_t + β r^i_t)] is considered.
9
Method Outline
• Novelty calculation baseline + RL + (Language encoder or pretrained VLM)
• Novelty calculation baseline: Random Network Distillation
• The RL agent cannot see the oracle language and does not share any parameters
with the pretrained models.
10
Method - Novelty Calculation Baseline
Outline
• Random Network Distillation (RND)
• RND gives higher intrinsic rewards when the agent visits unfamiliar states,
measured with a trainable network.
• Today’s paper calls the original RND visual RND (Vis-RND).
11
Method - Novelty Calculation Baseline
Environment interaction diagram
• RND makes intrinsic rewards using two state encoders:
• a fixed target function f_fixed and a trainable predictor function f_ψ.
12
Method - Novelty Calculation Baseline
Calculation of intrinsic reward
• Intrinsic reward r^i = ||f_fixed(s) − f_ψ(s)||².
• Target function f_fixed : S → ℝ^k.
• Deterministic, randomly initialized, fixed NN.
• Predictor function f_ψ : S → ℝ^k.
• Trainable NN with parameters ψ.
13
Method - Novelty Calculation Baseline
Training of the state encoder
• f_ψ is trained to mimic the random feature f_fixed(s).
• L(ψ) = ||f_fixed(s) − f_ψ(s)||².
• f_ψ implicitly stores visit counts, i.e., the familiarity of states.
14
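To make the RND computation on the last two slides concrete, here is a minimal PyTorch sketch (my own illustration, not the authors' code); the MLP sizes, observation dimensionality, and learning rate are arbitrary assumptions.

```python
# Minimal RND sketch: f_fixed is a frozen, randomly initialized network;
# f_psi is trained to imitate it on states the agent actually visits.
import torch
import torch.nn as nn

def make_encoder(obs_dim: int, k: int) -> nn.Module:
    return nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, k))

obs_dim, k = 64, 32
f_fixed = make_encoder(obs_dim, k)
for p in f_fixed.parameters():          # the target stays random and fixed
    p.requires_grad_(False)
f_psi = make_encoder(obs_dim, k)        # trainable predictor
opt = torch.optim.Adam(f_psi.parameters(), lr=1e-4)

def intrinsic_reward(s: torch.Tensor) -> torch.Tensor:
    # r^i = ||f_fixed(s) - f_psi(s)||^2, large for states the predictor has not fit yet
    with torch.no_grad():
        return (f_fixed(s) - f_psi(s)).pow(2).sum(dim=-1)

def update_predictor(s: torch.Tensor) -> float:
    # L(psi) = ||f_fixed(s) - f_psi(s)||^2 on states from the agent's own experience
    loss = (f_fixed(s) - f_psi(s)).pow(2).sum(dim=-1).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

batch = torch.randn(16, obs_dim)        # stand-in for a batch of observations
print(intrinsic_reward(batch).shape, update_predictor(batch))
```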
Method - Novelty Calculation Baseline
Training of a RL agent
• RL agent: on-policy IMPALA.
• Value loss L(ϕ) = ∑_t [y_t − V_ϕ(s_t, g)]², where y_t = ∑_{k=t}^{t+n−1} γ^{k−t} r_k + γ^n V_ϕ(s_{t+n}, g) and r_k = r^e_k + β r^i_k.
• Policy loss L(θ) = −∑_t [log π_θ(a_t | s_t, g) (r_t + γ y_{t+1} − V_ϕ(s_t, g))].
15
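Below is a rough PyTorch sketch of the slide's losses. Note the slide gives a simplified n-step actor-critic objective; the actual IMPALA agent additionally applies V-trace off-policy corrections, which are omitted here. Tensor shapes and the stand-in data are assumptions.

```python
# Sketch of the simplified n-step value and policy losses from the slide.
import torch

def n_step_targets(r_e, r_i, values, gamma: float, beta: float, n: int):
    # y_t = sum_{k=t}^{t+n-1} gamma^{k-t} (r^e_k + beta r^i_k) + gamma^n V(s_{t+n}, g)
    r = r_e + beta * r_i                              # combined reward r_k
    T = r.shape[0]
    targets = torch.zeros(T)
    for t in range(T):
        horizon = min(n, T - t)
        y = 0.0
        for j in range(horizon):
            y += (gamma ** j) * r[t + j]
        y += (gamma ** horizon) * values[t + horizon]  # bootstrap with V(s_{t+n}, g)
        targets[t] = y
    return targets

T, gamma, beta, n = 8, 0.99, 0.1, 4
r_e, r_i = torch.rand(T), torch.rand(T)
values = torch.rand(T + 1)                            # stand-in V_phi(s_t, g), t = 0..T
log_probs = torch.randn(T, requires_grad=True)        # stand-in log pi_theta(a_t | s_t, g)

y = n_step_targets(r_e, r_i, values, gamma, beta, n)
value_loss = ((y - values[:T]) ** 2).sum()                       # L(phi)
y_next = torch.cat([y[1:], y[-1:]])                              # y_{t+1}; last step reuses y_T
advantage = (r_e + beta * r_i) + gamma * y_next - values[:T]
policy_loss = -(log_probs * advantage.detach()).sum()            # L(theta)
print(value_loss.item(), policy_loss.item())
```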
Method - Language Encoder
• Language-RND (Lang-RND) gives higher intrinsic rewards when the agent receives
unfamiliar language.
• The language comes from the oracle; f_fixed : L → ℝ^k is a fixed random LSTM.
• Lang-RND shows that language’s coarse abstraction is helpful for RL exploration.
16
Method - Oracle Language Distillation
• Language Distillation (LD) gives higher intrinsic rewards when the agent receives a
visual observation with unfamiliar linguistic meaning.
• f_fixed : S → L, the oracle.
• f_ψ : S → L, trained to generate text captions like the oracle, with a CNN encoder
and an LSTM decoder.
• r^i = −∑_{k=1}^{K} log (f_ψ(s))^k_{(f_fixed(s))_k}, where K is the length of the oracle caption,
i.e., the negative log-likelihood of the oracle caption under the predictor.
• LD shows that semantic meaning can accelerate RL exploration.
17
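A small sketch of the LD reward as a caption negative log-likelihood, using toy tensors in place of the real CNN+LSTM captioner; the vocabulary size and caption length are arbitrary assumptions.

```python
# Sketch of the LD intrinsic reward: NLL of the oracle caption under the captioner f_psi.
import torch
import torch.nn.functional as F

vocab_size, K = 100, 6                                 # K = oracle caption length
oracle_tokens = torch.randint(0, vocab_size, (K,))     # (f_fixed(s))_k, oracle caption tokens
caption_logits = torch.randn(K, vocab_size)            # f_psi(s): per-position token logits

log_probs = F.log_softmax(caption_logits, dim=-1)      # log (f_psi(s))
# r^i = - sum_k log (f_psi(s))^k_[(f_fixed(s))_k]
r_i = -log_probs[torch.arange(K), oracle_tokens].sum()
print(r_i.item())   # high when the captioner does not yet predict the oracle caption
```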
Method - VLM Encoder
• Network Distillation (ND) gives higher intrinsic rewards when the agent receives a
visual observation with unfamiliar linguistic meaning.
• f_fixed : S → ℝ^k, pretrained so that the visual embedding is aligned with the
corresponding language embedding.
• ND shows that this paper’s idea is applicable in environments without language.
18
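ND reuses the RND distillation loss, only swapping the random target for a frozen pretrained image encoder. The sketch below illustrates this; `load_pretrained_image_encoder` is a hypothetical placeholder, not a real API, since the actual model is a VLM image tower such as ALIGN/CLIP.

```python
# Sketch of ND: the fixed target is a frozen pretrained VLM image encoder
# instead of a random NN; the predictor is trained exactly as in RND.
import torch
import torch.nn as nn

def load_pretrained_image_encoder() -> nn.Module:
    # Placeholder standing in for a frozen VLM image tower mapping images to R^k.
    return nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 256))

f_fixed = load_pretrained_image_encoder().eval()
for p in f_fixed.parameters():
    p.requires_grad_(False)
f_psi = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 256))   # trainable predictor

obs = torch.rand(4, 3, 64, 64)                                     # batch of image observations
r_i = (f_fixed(obs) - f_psi(obs)).pow(2).sum(dim=-1)               # per-state intrinsic reward
print(r_i)
```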
Experiments
19
PoC: Is Language a Meaningful Abstraction?
• Using the oracle language, the authors do a proof of concept (PoC) showing:
• 1) Language abstraction forces an RL agent to explore many more states,
• because language coarsely abstracts states.
• 2) Language abstraction forces an RL agent to explore semantically diverse
states,
• because language semantically abstracts states.
20
PoC
Environment
• Playroom Environment:
• Rooms with various household objects.
• Tasks: lift, put, find.
• Goal: an instruction like “find <object>”.
• If the goal is achieved, reward is +1 and an episode ends.
• The oracle language is generated by Unity.
21
PoC - Coarse Abstraction by Language
Results
• Claim:
• If language abstracts the state space coarsely, novelty computed from that
abstraction accelerates RL exploration.
22
PoC: Methods Taxonomy
23
| Method | Is the state space abstracted coarsely? | Is semantic meaning considered? | Trainable network | Target function |
| --- | --- | --- | --- | --- |
| Vis-RND | X | X | S → ℝ^k | Fixed random NN |
| Lang-RND | O | △ | L → ℝ^k | Fixed random NN |
PoC - Coarse Abstraction by Language
Results
• Claim:
• If language abstracts the state space coarsely, novelty computed from that abstraction
accelerates RL exploration.
• Exploration with language novelty solves the tasks much faster than
exploration with visual novelty.
24
[Figure: trajectory comparison between Lang-RND and Vis-RND. Lang-RND: state → language → random feature; Vis-RND: state → random feature.]
PoC - Coarse Abstraction by Language
Why coarse? And so what?
• States are coarsely grouped into language by the Unity oracle.
• Because the random language features are computed from the oracle language, the
random feature space also coarsely abstracts the states.
• Therefore, the agent must explore more widely to keep obtaining high intrinsic rewards.
25
PoC - Semantic Diversity from Images
• Claims:
• 1) Coarse abstraction alone is not enough; semantics must be considered.
• 2) We can use language-abstraction-based novelty computed from visual states.
26
PoC: Methods Taxonomy
27
| Method | Is the state space abstracted coarsely? | Is semantic meaning considered? | Trainable network | Target function |
| --- | --- | --- | --- | --- |
| Vis-RND | X | X | S → ℝ^k | Fixed random NN |
| Lang-RND | O | △ | L → ℝ^k | Fixed random NN |
| Shuffled Language Distillation (S-LD) | O | X | S → L | Fixed random NN whose output distribution matches the oracle |
| Language Distillation (LD) | O | O | S → L | Unity oracle |
PoC - Semantic Diversity from Images
Results
• Claims:
• 1) Coarse abstraction alone is not enough; semantics must be considered.
• 2) We can use language-abstraction-based novelty computed from visual states.
• Exploration with a coarse + meaningful embedding helps more than exploration with a
coarse + meaningless embedding.
28
[Figure: S-LD output examples. LD: state → meaningful text; S-LD: state → meaningless text.]
PoC - Semantic Diversity
Why semantically diverse? And so what?
• LD makes higher intrinsic rewards when visiting states with newer semantics.
• The RL agent should explore semantically diverse states to get a higher r^i.
29
PoC - Semantic Diversity
Why semantically diverse? And so what?
• LD makes higher intrinsic rewards when visiting states with newer semantics.
• The RL agent should explore semantically diverse states to get a higher r^i.
• The dramatic gap between LD and S-LD is due to the environment choice,
• because the oracle captions the agent’s interactions.
• The LD agent therefore interacts more, getting higher intrinsic rewards from new captions.
30
Experiments - Intrinsic Rewards with VLM
• The authors use a VLM encoder to eliminate the need for the language oracle.
• The agent gets higher intrinsic rewards when visiting states with unfamiliar
linguistic embeddings.
31
Experiments - Intrinsic Rewards with VLM
Results
• With the coarse and semantic embedding of a VLM, ALM-ND learns faster than
Vis-RND, without the oracle language.
• ALM-ND uses an ALIGN-model encoder, pretrained to align image embeddings
with the corresponding text embeddings.
32
Conclusion
• Conclusion:
• Novelty calculation using language abstraction can accelerate RL exploration
because it abstracts the state space 1) coarsely and 2) semantically.
• Novelty calculation using language abstraction works in various settings: on-policy
and off-policy agents, different novelty calculations, and different 3D domains, even
without an oracle language.
• Limitations:
• There is no 2D-environment performance comparison with existing visual novelty methods.
• The quality of the pretrained VLM strongly affects the resulting RL sample efficiency.
33
Appendix
36
Contents
• Related works and rationale
• Methods - novelty calculation baseline: Random Network Distillation PoC
• Methods - novelty calculation baseline: Never Give Up
• Methods - construction of S-LD
• More experiments
37
Why is This New?
• The existing family of intrinsic-reward exploration methods can fail in 3D state spaces,
because they all use visual state representations.
• This method abstracts states using semantics, avoiding useless exploration.
• Existing RL methods with language require environment-specific annotations or
semantic parsers.
• This method can be applied to any visually natural environment using a pretrained VLM.
• Existing RL methods with pretrained embeddings mainly feed the embedding directly into
the agent.
• This method shows that large pretrained models can instead be used to guide exploration.
38
Rationale
• Why would VLM representations be helpful for semantic novelty-based
exploration? Intuitions?
• 1. Language is inherently abstract.
• Language links situations that are superficially distinct but causally related,
and vice versa.
• 2. Language carries important information efficiently, ignoring miscellaneous
noise.
39
Random Network Distillation
Proof of Concept
• Question: can ||f_ψ(s) − f_fixed(s)||² be a novelty measure?
• Dataset: many images of the digit 0 and N images of another digit (e.g., 5,000 images of 0 and 10 images of 1).
• f_ψ is trained to min_ψ ||f_ψ(s) − f_fixed(s)||².
40
[Figure: MSE on unseen data vs. N, the number of target-class images in the training data.]
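A toy reproduction of this PoC, using two synthetic Gaussian "digit" clusters instead of MNIST (an assumption made to keep the sketch self-contained): the class with only N training samples should show a larger distillation error on unseen samples.

```python
# Toy RND proof of concept: rarely-seen data gets a higher predictor error.
import torch
import torch.nn as nn

torch.manual_seed(0)
dim, k, N = 20, 16, 10
center0, center1 = torch.zeros(dim), torch.ones(dim) * 3.0
train = torch.cat([center0 + torch.randn(5000, dim),        # 5000 samples of class 0
                   center1 + torch.randn(N, dim)])          # only N samples of class 1

f_fixed = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, k))
for p in f_fixed.parameters():                               # fixed random target
    p.requires_grad_(False)
f_psi = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, k))
opt = torch.optim.Adam(f_psi.parameters(), lr=1e-3)

for _ in range(200):                                         # distill f_fixed on the training set
    idx = torch.randint(0, train.shape[0], (256,))
    loss = (f_fixed(train[idx]) - f_psi(train[idx])).pow(2).sum(dim=-1).mean()
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():                                        # MSE on unseen samples per class
    for name, center in [("class 0", center0), ("class 1", center1)]:
        test = center + torch.randn(500, dim)
        mse = (f_fixed(test) - f_psi(test)).pow(2).sum(dim=-1).mean()
        print(name, mse.item())   # the rarely-seen class should have the higher error
```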
Novelty Calculation Baseline - Never Give Up
Outline
• Never Give Up (NGU) makes r^i based on how new the state is within the current episode.
• NGU components:
• f_ψ : S → ℝ^k, state encoder.
• M, memory of f_ψ(s) for all states s in one episode.
• M is distinct from the experience replay buffer of the whole game.
41
Novelty Calculation Baseline - Never Give Up
Environment interaction diagram
• r^i is computed from the encoder f_ψ and a non-parametric buffer M of encoded states.
• f_ψ does not share any parameters with the RL agent.
42
Novelty Calculation Baseline - Never Give Up
Calculation of intrinsic reward
• r^i = R(f(s′), M) ∝ ∑_{f(x) ∈ knn(f(s′), M)} ||f(s′) − f(x)||²
• knn(f(s′), M) is the set of k-nearest neighbors of f(s′) in the episode memory M.
• r^i is bigger when the encoded state is far from the already-stored encoded states.
• M is filled with f(s) for all states s visited so far in this episode.
43
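A minimal sketch of the episodic novelty as written on the slide (sum of squared distances to the k nearest stored encodings); the real NGU reward additionally normalizes with a kernel and running statistics, which are omitted here.

```python
# Sketch of the slide's NGU-style episodic novelty from a k-nearest-neighbour lookup.
import torch

def episodic_novelty(f_s_new: torch.Tensor, memory: torch.Tensor, k: int = 10) -> torch.Tensor:
    # memory: (m, d) encodings f(s) of states visited so far in this episode
    if memory.shape[0] == 0:
        return torch.tensor(1.0)                       # arbitrary reward for the first state
    d2 = ((memory - f_s_new) ** 2).sum(dim=-1)         # squared distances to all stored states
    k = min(k, memory.shape[0])
    nearest = torch.topk(d2, k, largest=False).values  # k nearest neighbours
    return nearest.sum()                               # r^i ∝ sum of squared knn distances

memory = torch.randn(50, 8)                            # M, filled during the episode
f_s_new = torch.randn(8)                               # f(s') for the new state
print(episodic_novelty(f_s_new, memory).item())
```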
Novelty Calculation Baseline - Never Give Up
Training of the state encoder
• f_ψ is trained to extract visual features related only to the agent’s actions.
• Inverse dynamics: a_t = h(f_ψ(s_t), f_ψ(s_{t+1})), where h is an MLP.
• To predict the action, f_ψ should extract the features relevant to the action.
• In today’s paper, only Vis-NGU trains f_ψ in this way.
• Lang-NGU and LSE-NGU use a fixed pretrained f_ψ (CLIP, ALM, etc.).
44
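A sketch of the inverse-dynamics objective used to train Vis-NGU's encoder, assuming a discrete action space and stand-in random transition data.

```python
# Sketch: train f_psi so that an MLP h can recover a_t from (f_psi(s_t), f_psi(s_{t+1})).
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, k, n_actions = 64, 32, 6
f_psi = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, k))
h = nn.Sequential(nn.Linear(2 * k, 64), nn.ReLU(), nn.Linear(64, n_actions))
opt = torch.optim.Adam(list(f_psi.parameters()) + list(h.parameters()), lr=1e-4)

s_t = torch.randn(32, obs_dim)                       # stand-in batch of transitions
s_tp1 = torch.randn(32, obs_dim)
a_t = torch.randint(0, n_actions, (32,))

logits = h(torch.cat([f_psi(s_t), f_psi(s_tp1)], dim=-1))   # predict a_t
loss = F.cross_entropy(logits, a_t)                  # only action-relevant features help here
opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())
```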
Novelty Calculation Baseline - Never Give Up
Training of a RL agent
• RL agent: DRQN + ε-greedy.
• Q-function loss: L(ϕ) = ||(r^e_t + β r^i_t) + γ Q_ϕ(s_{t+1}, a_{t+1}) − Q_ϕ(s_t, a_t)||², where
(s_t, a_t, r^e_t, r^i_t, s_{t+1}) ~ the experience replay buffer of the whole game.
45
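A sketch of the TD loss exactly as written on the slide. As written it bootstraps from Q(s_{t+1}, a_{t+1}); a practical DRQN agent would also use a recurrent Q-network and a separate target network, both omitted here.

```python
# Sketch of the slide's TD loss on a batch sampled from the replay buffer.
import torch
import torch.nn as nn

obs_dim, n_actions, gamma, beta = 64, 6, 0.99, 0.1
q_net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, n_actions))

s_t, s_tp1 = torch.randn(8, obs_dim), torch.randn(8, obs_dim)   # stand-in batch
a_t, a_tp1 = torch.randint(0, n_actions, (8,)), torch.randint(0, n_actions, (8,))
r_e, r_i = torch.rand(8), torch.rand(8)

q_sa = q_net(s_t).gather(1, a_t.unsqueeze(1)).squeeze(1)        # Q(s_t, a_t)
with torch.no_grad():
    q_next = q_net(s_tp1).gather(1, a_tp1.unsqueeze(1)).squeeze(1)   # Q(s_{t+1}, a_{t+1})
target = (r_e + beta * r_i) + gamma * q_next
loss = ((target - q_sa) ** 2).mean()
print(loss.item())
```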
S-LD Construction
• S-LD uses a fixed target network f_fixed : S → L whose output distribution is the same as
the oracle’s, but whose state-to-caption mapping is random.
• f_fixed construction procedure:
• Get the empirical oracle language distribution under π_LD, the policy trained with LD.
• Encode the image to a real number using a random fixed NN.
• Map that random real number to a caption according to the oracle distribution.
46
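A sketch of one way to realize this construction (my reading of the slide, not the authors' code): the empirical caption distribution is taken as given, and a fixed random network hashes each observation to a caption so that the assignment carries no semantics.

```python
# Sketch of an S-LD target: caption frequencies match the oracle, mapping is arbitrary.
import torch
import torch.nn as nn

captions = ["lift the ball", "put the duck on the bed", "find the teddy"]   # example strings
oracle_probs = torch.tensor([0.5, 0.3, 0.2])     # assumed empirical caption distribution
cum_probs = torch.cumsum(oracle_probs, dim=0)

random_net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 1))         # fixed random NN
for p in random_net.parameters():
    p.requires_grad_(False)

def f_fixed_sld(obs: torch.Tensor) -> str:
    # Map the image to a scalar, squash to (0, 1), and pick the caption whose
    # cumulative-probability bucket contains it: the caption frequencies follow the
    # oracle distribution, but which caption goes with which state is arbitrary.
    u = torch.sigmoid(random_net(obs.unsqueeze(0))).item()
    idx = int(torch.searchsorted(cum_probs, torch.tensor([u]))[0].item())
    return captions[min(idx, len(captions) - 1)]

print(f_fixed_sld(torch.rand(3, 64, 64)))
```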
Pretrained Image Model instead of VLM
• A CNN encoder pretrained on ImageNet is compared.
• The language embedding gives much better performance on the harder tasks (Put,
Find).
47
Oracle Language v.s. Image Embedding from VLM
• Methods using image embeddings are not significantly worse than methods using the
oracle language.
48
Visited State Heatmap in City Environment
• NGU variants explore the City environment using only intrinsic rewards.
• Language abstraction makes the set of visited states wider.
49
Observation examples in City Environment