ACHIEVING HUMAN PARITY ON VISUAL QUESTION
ANSWERING (Alicemind)
Deep Learning Paper Reading Group, NLP Team. 2022.02.13
주정헌 (presenter), 최상우, 김지연, 백지윤, 민지원
Background
1. Visual Question Answering
• The VQA Challenge is one of the workshops at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) and is run every year through the VQA homepage.
• -> Held annually since CVPR 2016; each year's progress is evaluated and awarded.
2. Status in Korea
• 2016: Naver Labs, 2nd place
• Seoul National University, Prof. 장병탁 (Byoung-Tak Zhang)'s team: 2nd place
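3. Prior work
1) Typically a CNN encodes the image, an RNN or similar model encodes the question, and the answer is derived from the two.
2) Much research has been done on English data, whereas little has been done for Korean.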
VQA: Visual Question Answering (ICCV 2015)
Yin and Yang: Balancing and Answering Binary Visual Questions (CVPR 2016)
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering (CVPR 2017)
VQA Modeling for Alicemind
1. Reaching Human Parity (the goal is human-level modeling)
- Answer open-ended questions.
2. Comprehensive Feature Representations
- Engineer features for cross-modal learning.
3. Pretraining
- Pretrain vision-language representations through cross-modal interaction.
- Propose and test a better pretraining scheme.
4. Knowledge-guided Mixture of Experts (MOE)
- Even after pretraining, there may (rarely) be things the model does not know, so learn them separately.
Q & A
1. Human parity
1. Beyond existing approaches
2. To capture diversity of visual signals
3. Combine visual representations
1. Region feature: detects objects region by region
2. Grid feature: detects and captures the background along a grid
3. Patch feature: captures other image characteristics patch by patch
4. Efficiently bridge the semantic gap between vision and language
- Single-stream architecture (adopted)
- Dual-stream architecture (not adopted)
5. MOE Paradigm
- Text Reading Expert
- Clock Reading Expert
2. Cross-modal learning
1. Visual Feature
Region Feature: better localization of individual objects; captures detailed semantics
Bottom-up attention: identifies salient regions
(Small regions are grouped into relevant parts and attended over to build the region features.)
Grid Feature: captures global information of the image (background information); low-resolution images can be used freely.
When the features are flattened into an HW x C feature map, a linear projection layer reduces the channel dimension.
Patch Feature: fixed-size patches are passed through a transformer (ViT).
This 1) simplifies the convolution computation when used together with grid-based features, and 2) lets self-attention learn structure that describes the full image.
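To make the patch feature concrete, the following is a minimal sketch of the patch-splitting step that precedes a ViT-style transformer. It is not taken from the paper: the function name `split_into_patches`, the single-channel toy image, and the requirement that the image size be a multiple of the patch size are all simplifying assumptions.

```python
def split_into_patches(image, patch_size):
    """Split a 2-D image (list of rows) into flattened, non-overlapping patches.

    Assumption: height and width are exact multiples of patch_size.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch in row-major order.
            patch = [image[top + dy][left + dx]
                     for dy in range(patch_size)
                     for dx in range(patch_size)]
            patches.append(patch)
    return patches


# A 4x4 single-channel toy image with pixel values 0..15.
toy_image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = split_into_patches(toy_image, patch_size=2)
print(len(patches))  # 4
print(patches[0])    # [0, 1, 4, 5]
```

In a real model each flattened patch would then be linearly projected to an embedding and fed to the transformer as one token.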
2. Textual Feature
- BERT Embeddings
Questions are encoded as whole sentences; answer candidates are marked with [CLS] so that they are learned separately.
3. Visual & Language Pretraining
Single-stream Architecture
- Performs aligned attention to learn a joint representation (V: Vision, L: Language)
Score matrix
Sub-matrices
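One way to read the "score matrix" and "sub-matrices" labels is sketched below: in a single-stream model, visual and language tokens form one sequence, so the attention score matrix splits into four blocks (V-V, V-L, L-V, L-L). This reading is an assumption about the slide's figure. The helper names, the toy vectors, and the use of raw scaled dot-product scores without softmax or learned projections are illustrative only.

```python
import math


def attention_scores(tokens):
    """Scaled dot-product scores between every pair of token vectors."""
    dim = len(tokens[0])
    return [[sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
             for key in tokens]
            for query in tokens]


def split_score_matrix(scores, num_visual):
    """Slice the joint score matrix into V-V, V-L, L-V and L-L blocks."""
    def block(rows, cols):
        return [[scores[r][c] for c in cols] for r in rows]

    total = len(scores)
    vis, lang = range(num_visual), range(num_visual, total)
    return {"VV": block(vis, vis), "VL": block(vis, lang),
            "LV": block(lang, vis), "LL": block(lang, lang)}


# Toy 4-dimensional token vectors (hypothetical values).
visual_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
language_tokens = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

# Single-stream: both modalities are concatenated into one sequence.
joint_sequence = visual_tokens + language_tokens
scores = attention_scores(joint_sequence)
blocks = split_score_matrix(scores, num_visual=len(visual_tokens))
print(blocks["VL"][0][0])  # 0.5
```

A dual-stream model would instead keep two separate sequences and only exchange information through dedicated cross-attention layers.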
Tasks
1) Masked LM Prediction: same as in BERT
2) Masked Object Prediction: randomly mask objects
3) Image-Text Matching: randomly match or mismatch image-text pairs
4) Image Question Answering: a classification problem over image QA data
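As an illustration of task 3, the sketch below builds image-text matching examples by randomly swapping in a caption from a different pair. The function name `build_itm_examples`, the 50% mismatch rate, and the toy data are assumptions for illustration and are not stated in the slides.

```python
import random


def build_itm_examples(pairs, mismatch_prob=0.5, seed=0):
    """Return (image_id, caption, label) triples; label 1 = matched, 0 = mismatched."""
    rng = random.Random(seed)
    examples = []
    for index, (image_id, caption) in enumerate(pairs):
        if len(pairs) > 1 and rng.random() < mismatch_prob:
            # Pick a caption that belongs to a different image.
            other = rng.choice([i for i in range(len(pairs)) if i != index])
            examples.append((image_id, pairs[other][1], 0))
        else:
            examples.append((image_id, caption, 1))
    return examples


# Hypothetical toy pairs.
toy_pairs = [("img_0", "a dog on a sofa"),
             ("img_1", "a clock on a tower"),
             ("img_2", "a stop sign at night")]
for example in build_itm_examples(toy_pairs):
    print(example)
```

The model is then trained as a binary classifier on the label, which pushes the joint representation to encode whether image and text describe the same content.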
4. Knowledge-guided Mixture of Experts
1) Text Reading Expert
StructuralLM: text extracted by an OCR model is used for VQA.
2) Clock Reading Expert
- Clock detector: binary classification with bounding boxes
- Clock reader: ResNet backbone with channel-wise and spatial attention (SE-Layer)
(A clock shows 12 hours, so reading it is recast as 12-category classification when computing the loss.)
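The 12-category formulation can be sketched as a plain softmax cross-entropy over hour classes. The mapping of 12 o'clock to class 0 and the helper names are assumptions made here for illustration. The slides do not say how classes are indexed, and the backbone that would produce the logits is omitted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def hour_to_class(hour):
    """Map an hour in 1..12 to a class index in 0..11 (assumption: 12 -> 0)."""
    return hour % 12


def cross_entropy(logits, target_class):
    """Negative log-probability of the target class."""
    return -math.log(softmax(logits)[target_class])


# With uniform logits every hour has probability 1/12, so the loss is log(12).
uniform_logits = [0.0] * 12
loss = cross_entropy(uniform_logits, hour_to_class(3))
print(round(loss, 4))  # 2.4849
```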
Putting Together
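1) The MOE separates the main task from the sub-tasks.
2) Inputs pass through a kind of gating network.
(Here a Switching Transformer is used to decide whether to route to multiple experts or to a single expert.)
3) Alicemind then classifies each input into one of three tasks in total to prepare its answer.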
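The routing step can be sketched as a softmax gate over three experts. Everything specific below is hypothetical: the expert names, the 0.5 confidence threshold, and the rule "one expert if the gate is confident, otherwise the top two" are stand-ins chosen to illustrate single-versus-multiple routing, not the mechanism described in the paper.

```python
import math

# Hypothetical labels for the main task and the two sub-task experts.
EXPERTS = ["general_vqa", "text_reading", "clock_reading"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def route(gate_logits, threshold=0.5):
    """Return the expert(s) chosen by the gate.

    Assumption: use a single expert when its gate probability reaches
    `threshold`; otherwise fall back to the two most probable experts.
    """
    probs = softmax(gate_logits)
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    if probs[ranked[0]] >= threshold:
        return [EXPERTS[ranked[0]]]
    return [EXPERTS[index] for index in ranked[:2]]


print(route([5.0, 0.0, 0.0]))  # ['general_vqa']
print(route([1.0, 1.0, 0.0]))  # ['general_vqa', 'text_reading']
```

In practice the gate logits would come from a learned classifier over the joint image-question representation.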
(Alicemind-MMU Architecture)
5. Pretraining datasets
• MS COCO: Image Captioning
• Visual Genome: Visual Question Answering
• VQA2.0: Visual Question Answering
6. Fine-tuning datasets
• VQA: 10 human-annotated answers per question, several questions per image
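The 10 human answers per question are what make a soft accuracy score possible. The sketch below uses the commonly cited simplified form of the VQA accuracy, min(matches / 3, 1). The slides do not state the metric, so treating it as the one used here is an assumption, and the official evaluation additionally normalizes answers and averages over annotator subsets.

```python
def vqa_accuracy(predicted, human_answers):
    """Simplified VQA accuracy: full credit once 3 or more annotators agree."""
    matches = sum(1 for answer in human_answers if answer == predicted)
    return min(matches / 3.0, 1.0)


# Hypothetical annotations for one question.
annotations = ["yes"] * 2 + ["no"] * 8
print(vqa_accuracy("no", annotations))     # 1.0
print(vqa_accuracy("maybe", annotations))  # 0.0
```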
7. Results
Q & A
Thank you