Slides used on 11/11/2017 for the keynote at the International Conference on Document Analysis and Recognition Workshop on Machine Learning.
(ICDAR WML 2017, https://icdarwml.wixsite.com/icdarwml2017)
This is a translated and updated version of https://www.slideshare.net/YoshitakaUshiku/deep-learning-73499744, which was originally written in Japanese.
1. Frontiers of Vision and Language:
Bridging Images and Texts by Deep Learning
The University of Tokyo
Yoshitaka Ushiku
losnuevetoros
2. Documents = Vision + Language
Vision & Language:
an emerging topic
• Integration of CV, NLP
and ML techs
• Several background factors
– Impact of Deep Learning
• Image recognition (CV)
• Machine translation (NLP)
– Growth of user generated
contents
– Exploratory research on
Vision and Language
4. 2012: Impact of Deep Learning
Academic / AI startup / A famous company
Large gap of error rates
on ImageNet
1st team: 15.3%
2nd team: 26.2%
Many slides refer to the first use of CNN (AlexNet) on ImageNet
5. 2012: Impact of Deep Learning
According to the official site…
1st team w/ DL
Error rate: 15%
2nd team w/o DL
Error rate: 26%
[http://image-net.org/challenges/LSVRC/2012/results.html]
It’s me!!
6. 2014: Another impact of Deep Learning
• Deep learning appears in machine translation
[Sutskever+, NIPS 2014]
– LSTM [Hochreiter+Schmidhuber, 1997] solves the vanishing-gradient problem in RNNs
→ Can handle relations between distant words in a sentence
– A four-layer LSTM is trained in an end-to-end manner
→ Comparable to the state of the art (English to French)
• Emergence of common techniques such as CNN/RNN
→ Reduced barriers to entry into CV+NLP
[Figure: encoder-decoder LSTM mapping an input sentence to an output sentence]
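As a concrete illustration of this encoder-decoder idea, here is a minimal PyTorch sketch (illustrative only, not the four-layer system of [Sutskever+, NIPS 2014]; vocabulary sizes and dimensions are placeholders): one LSTM reads the source sentence into a fixed state, and a second LSTM generates the target sentence from that state.

```python
# Minimal sketch of the encoder-decoder (seq2seq) idea; all sizes are
# illustrative placeholders, not the settings of the cited paper.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab=10000, tgt_vocab=10000, dim=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encode the whole source sentence; keep only the final LSTM state.
        _, state = self.encoder(self.src_emb(src_ids))
        # Decode the target sentence conditioned on that state
        # (teacher forcing during training).
        hidden, _ = self.decoder(self.tgt_emb(tgt_ids), state)
        return self.out(hidden)  # per-step scores over the target vocabulary
```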
7. Growth of user generated contents
Especially in content posting/sharing services
• Facebook: 300 million photos per day
• YouTube: 400 hours of video per minute
Pōhutukawa blooms this
time of the year in New
Zealand. As the flowers
fall, the ground
underneath the trees look
spectacular.
Pairs of a sentence + a video / photo
→ Collectable in large quantities
9. Exploratory research on Vision and Language
Captioning an image associated with its article
[Feng+Lapata, ACL 2010]
• Input: article + image; Output: caption for the image
• Dataset: Sets of article + image + caption
× 3361
King Tupou IV died at the age of 88 last week.
As a result of these backgrounds:
Various research topics such as …
10. Image Captioning
Group of people sitting
at a table with a dinner.
Tourists are standing on
the middle of a flat desert.
[Ushiku+, ICCV 2015]
11. Video Captioning
A man is holding a box of doughnuts.
Then he and a woman are standing next each other.
Then she is holding a plate of food.
[Shin+, ICIP 2016]
12. Multilingual + Image Caption Translation
Ein Masten mit zwei Ampeln
für Autofahrer. (German)
A pole with two lights
for drivers. (English)
[Hitschler+, ACL 2016]
14. Image Generation from Captions
This bird is blue with white
and has a very short beak.
This flower is white and
yellow in color, with petals
that are wavy and smooth.
[Zhang+, 2016]
15. Goal of this keynote
Looking over research on vision & language
• Historical flow of each area
• Changes brought by Deep Learning
× Deep Learning enabled this research
✓ Deep Learning boosted this research
1. Image Captioning
2. Video Captioning
3. Multilingual + Image Caption Translation
4. Visual Question Answering
5. Image Generation from Captions
17. Every picture tells a story
Dataset:
Images + <object, action, scene> + Captions
1. Predict <object, action, scene> for an input
image using MRF
2. Search for the existing caption associated with
similar <object, action, scene>
<Horse, Ride, Field>
[Farhadi+, ECCV 2010]
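The retrieval step can be pictured with a toy sketch (hypothetical helper names; the slot-overlap score below is a crude stand-in for the paper's similarity over the <object, action, scene> meaning space):

```python
# Toy sketch of caption retrieval by <object, action, scene> matching.
# training_set: list of (triple, caption) pairs, e.g.
#   (("horse", "ride", "field"), "A person rides a horse in a field.")
def retrieve_caption(query_triple, training_set):
    def overlap(t1, t2):
        # Count how many of the three slots agree.
        return sum(a == b for a, b in zip(t1, t2))
    _, best_caption = max(training_set,
                          key=lambda pair: overlap(query_triple, pair[0]))
    return best_caption
```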
18. Every picture tells a story
<pet, sleep, ground>
See something unexpected.
<transportation, move, track>
A man stands next to a train
on a cloudy day.
[Farhadi+, ECCV 2010]
22. Retrieve? Generate?
• Retrieve
– A small gray dog on a leash.
• Generate
– Template-based
dog+stand ⇒ A dog stands.
– Template-free
A small white dog standing on a leash.
[Figure: input image and a dataset of captioned dog images — "A small gray dog on a leash.", "A black dog standing in grassy area.", "A small white dog wearing a flannel warmer."]
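The template-based route above can be made concrete with a toy helper (hypothetical, with naive "+s" inflection; real systems use proper grammar knowledge):

```python
# Toy Subject+Verb template filler: "dog" + "stand" -> "A dog stands."
def subject_verb_caption(subject, verb):
    return f"A {subject} {verb}s."  # naive inflection, illustration only

print(subject_verb_caption("dog", "stand"))  # A dog stands.
```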
25. Benefits of Deep Learning
• Refinement of image recognition [Krizhevsky+, NIPS 2012]
• Deep learning appears in machine translation
[Sutskever+, NIPS 2014]
– LSTM [Hochreiter+Schmidhuber, 1997] solves the vanishing-gradient problem in RNNs
→ Can handle relations between distant words in a sentence
– A four-layer LSTM is trained in an end-to-end manner
→ Comparable to the state of the art (English to French)
Emergence of common techniques such as CNN/RNN
→ Reduced barriers to entry into CV+NLP
[Figure: encoder-decoder LSTM mapping an input sentence to an output sentence]
26. Google NIC
Concatenation of Google’s methods
• GoogLeNet [Szegedy+, CVPR 2015]
• MT with LSTM
[Sutskever+, NIPS 2014]
Caption (word sequence) $S_0 \ldots S_N$ for image $I$:
$S_0$: beginning-of-sentence token
$S_1 = \mathrm{LSTM}(\mathrm{CNN}(I))$
$S_t = \mathrm{LSTM}(S_{t-1}), \quad t = 2, \ldots, N-1$
$S_N$: end-of-sentence token
[Vinyals+, CVPR 2015]
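A hedged sketch of this decoding loop in PyTorch (an approximation, not Google's implementation: the CNN feature size, vocabulary, and token ids are placeholders, and greedy search stands in for the beam search used in the paper):

```python
# Sketch of NIC-style caption generation: the image feature is fed to the
# LSTM first, then words are emitted until the end-of-sentence token.
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    def __init__(self, vocab=10000, dim=512, cnn_dim=2048):
        super().__init__()
        self.img_proj = nn.Linear(cnn_dim, dim)  # CNN feature -> LSTM input
        self.emb = nn.Embedding(vocab, dim)
        self.lstm = nn.LSTMCell(dim, dim)
        self.out = nn.Linear(dim, vocab)

    @torch.no_grad()
    def generate(self, cnn_feature, bos_id=1, eos_id=2, max_len=20):
        h = torch.zeros(1, self.lstm.hidden_size)
        c = torch.zeros(1, self.lstm.hidden_size)
        # Step 0: the image feature plays the role of the first input.
        h, c = self.lstm(self.img_proj(cnn_feature), (h, c))
        word, words = torch.tensor([bos_id]), []
        for _ in range(max_len):
            h, c = self.lstm(self.emb(word), (h, c))
            word = self.out(h).argmax(dim=-1)  # greedy; the paper uses beam search
            if word.item() == eos_id:
                break
            words.append(word.item())
        return words  # word ids S_1 ... S_{N-1}
```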
27. Examples of generated captions
[https://github.com/tensorflow/models/tree/master/im2txt]
[Vinyals+, CVPR 2015]
28. Comparison to [Ushiku+, ACM MM 2012]
Input image
→ Estimation of important words → Connecting the words with a grammar model
Conventional object recognition:
• [Ushiku+, ACM MM 2012]: Fisher Vector + linear classifier
• Neural image captioning: Convolutional Neural Network
Conventional machine translation:
• [Ushiku+, ACM MM 2012]: log-linear model + beam search
• Neural image captioning: Recurrent Neural Network + beam search
• Trained using only images and captions
• The approaches are similar to each other
29. Current development: Accuracy
• Attention-based captioning [Xu+, ICML 2015]
– Focuses on relevant image regions when predicting each word
– Both the attention and caption models are trained
using pairs of an image & caption
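A minimal sketch of the soft-attention step (a simplified approximation of [Xu+, ICML 2015]; the one-layer score network and the dimensions are placeholders):

```python
# Soft attention: weight spatial CNN features by scores computed from the
# decoder state, and use their weighted sum as the context for the next word.
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=512):
        super().__init__()
        self.score = nn.Linear(feat_dim + hidden_dim, 1)

    def forward(self, features, hidden):
        # features: (num_regions, feat_dim); hidden: (hidden_dim,)
        expanded = hidden.unsqueeze(0).expand(features.size(0), -1)
        scores = self.score(torch.cat([features, expanded], dim=1))
        alpha = torch.softmax(scores, dim=0)     # attention weight per region
        context = (alpha * features).sum(dim=0)  # weighted sum of regions
        return context, alpha                    # alpha shows "where it looks"
```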
31. Current development: Problem setting
Generating captions for a photo sequence
[Park+Kim, NIPS 2015][Huang+, NAACL 2016]
The family
got
together for
a cookout.
They had a
lot of
delicious
food.
The dog
was happy
to be there.
They had a
great time
on the
beach.
They even
had a swim
in the water.
32. Current development: Problem setting
Captioning using sentiment terms
[Mathews+, AAAI 2016][Shin+, BMVC 2016]
Neutral caption
Positive caption
34. Before Deep Learning
• Grounding of languages and objects in videos
[Yu+Siskind, ACL 2013]
– Learning from only videos and their captions
– Experiments in a restricted setting with few objects
– Controlled and small dataset
• Deep Learning should suit this problem
– Image Captioning: single image → word sequence
– Video Captioning: image sequence → word sequence
35. End-to-end learning by Deep Learning
• LRCN
[Donahue+, CVPR 2015]
– CNN+RNN for
• Action recognition
• Image / Video
Captioning
• Video to Text
[Venugopalan+, ICCV 2015]
– CNNs to recognize
• Objects from RGB frames
• Actions from flow images
– RNN for captioning
36. Video Captioning
A man is holding a box of doughnuts.
Then he and a woman are standing next each other.
Then she is holding a plate of food.
[Shin+, ICIP 2016]
37. Video Captioning
A boat is floating on the water near a mountain.
And a man riding a wave on top of a surfboard.
Then he on the surfboard in the water.
[Shin+, ICIP 2016]
38. Video Retrieval from Caption
• Input: Captions
• Output: A video related to the caption
A 10-sec video clip is retrieved from a 40-min database!
• Video captioning is also addressed
A woman in blue is
playing ping pong in a
room.
A guy is skiing with no
shirt on and yellow
snow pants.
A man is water skiing
while attached to a
long rope.
[Yamaguchi+, ICCV 2017]
40. Towards multiple languages
Datasets with multilingual captions
• IAPR TC12 [Grubinger+, 2006] English + German
• Multi30K [Elliott+, 2016] English + German
• STAIR Captions [Yoshikawa+, 2017]
English + Japanese
Development of cross-lingual tasks
• Non-English-caption generation
• Image Caption Translation
Input: Pair of a caption in Language A + an image
or A caption in Language A
Output: Caption in Language B
42. Non-English-caption generation
Most research: generating English captions
• Japanese [Miyazaki+Shimizu, ACL 2016]
• Chinese [Li+, ICMR 2016]
• Turkish [Unal+, SIU 2016]
Çimlerde koşan bir köpek (Turkish: "A dog running on the grass")
金色头发的小女孩 (Chinese: "A little girl with golden hair")
柵の中にキリンが一頭立っています (Japanese: "A giraffe is standing inside a fence")
43. Just collecting non-English captions?
Transfer learning among languages
[Miyazaki+Shimizu, ACL 2016]
• The vision-language grounding weights $W_{im}$ are transferred
• Efficient learning using a small amount of captions
[Figure: captions decoded in parallel — English: "an elephant is …" / "an elephant …"; Japanese: 「一匹の象が土の…」 / 「一匹の象が…」 ("an elephant …")]
45. Machine translation via visual data
Images can boost MT [Calixto+, 2012]
• Example below (English to Portuguese):
Does the word “seal” in English
– mean “seal” similar to “stamp”?
– mean “seal” which is a sea animal?
• [Calixto+, 2012] argue that the mistranslation can be
avoided using a related image (w/o experiments)
Mistranslation!
46. Input: Caption in Language A + image
• Caption translation via an associated image
[Elliott+, 2015] [Hitschler+, ACL 2016]
– Generate translation candidates
– Re-rank the candidates using similar images’
captions in Language B
Eine Person in
einem Anzug
und Krawatte
und einem Rock.
(In German)
Translation w/o the related image
A person in a suit and tie
and a rock.
Translation with the related image
A person in a suit and tie
and a skirt.
47. Input: Caption in Language A
• Cross-lingual document retrieval via images
[Funaki+Nakayama, EMNLP 2015]
• Zero-shot machine translation
[Nakayama+Nishida, 2017]
49. Visual Question Answering (VQA)
Originally proposed in the human-computer interaction community
• VizWiz [Bigham+, UIST 2010]
Manually solved on AMT (Amazon Mechanical Turk)
• Automation for the first time (w/o Deep Learning)
[Malinowski+Fritz, NIPS 2014]
• Similar term: Visual Turing Test [Malinowski+Fritz, 2014]
50. VQA: Visual Question Answering
• Established VQA as an AI problem
– Provided a benchmark dataset
– Experimental results with reasonable baselines
• A portal website is also maintained
– http://www.visualqa.org/
– Annual competition for VQA accuracy
[Antol+, ICCV 2015]
What color are her eyes?
What is the mustache made of?
51. VQA Dataset
Collected questions and answers on AMT
• Over 100K real images and 30K abstract images
• About 700K questions, with 10 answers for each
52. VQA=Multiclass Classification
The integrated feature $z_{I+Q}$ is fed to a standard classifier
Question $Q$: What objects are found on the bed?
Answer $A$: bed sheets, pillow
Image $I$
[Figure: image $I$ → image feature $x_I$; question $Q$ → question feature $x_Q$; fused into the integrated feature $z_{I+Q}$]
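As a sketch (placeholder feature and answer-vocabulary sizes), the whole pipeline reduces to an ordinary classifier over the fused feature:

```python
# VQA as multiclass classification: fuse the image feature x_I and the
# question feature x_Q into z, then classify z over a fixed answer set.
import torch
import torch.nn as nn

class VQAClassifier(nn.Module):
    def __init__(self, img_dim=2048, q_dim=1024, num_answers=3000):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(img_dim + q_dim, 1024), nn.ReLU(),
            nn.Linear(1024, num_answers))

    def forward(self, x_img, x_q):
        z = torch.cat([x_img, x_q], dim=-1)  # simplest fusion: concatenation
        return self.classifier(z)            # scores over candidate answers
```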
53. Development of VQA
How to compute the integrated feature $z_{I+Q}$?
• VQA [Antol+, ICCV 2015]: just concatenate them
$z_{I+Q} = [x_I;\; x_Q]$
• Summation
e.g. summation of an attention-weighted image feature and a question feature [Xu+Saenko, ECCV 2016]
$z_{I+Q} = x_I + x_Q$
• Multiplication
e.g. bilinear multiplication using DFT [Fukui+, EMNLP 2016]
$z_{I+Q} = x_I \odot x_Q$
• Hybrid of summation and multiplication
e.g. concatenation of sum and multiplication [Saito+, ICME 2017]
$z_{I+Q} = [\,x_I + x_Q;\; x_I \odot x_Q\,]$
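These four strategies can be written in a few lines (a sketch; the cited papers add attention, compact bilinear sketching, etc. on top, and the sum/product variants assume $x_I$ and $x_Q$ share dimensionality):

```python
# Four ways to compute the integrated feature z_{I+Q} from x_I and x_Q.
import torch

def fuse(x_i, x_q, mode="concat"):
    if mode == "concat":   # [Antol+, ICCV 2015]
        return torch.cat([x_i, x_q], dim=-1)
    if mode == "sum":      # cf. [Xu+Saenko, ECCV 2016]
        return x_i + x_q
    if mode == "product":  # elementwise stand-in for bilinear pooling
        return x_i * x_q   # cf. [Fukui+, EMNLP 2016]
    if mode == "hybrid":   # cf. [Saito+, ICME 2017]
        return torch.cat([x_i + x_q, x_i * x_q], dim=-1)
    raise ValueError(mode)
```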
54. VQA Challenge
Examples from competition results
Q: What is the woman holding?
GT A: laptop
Machine A: laptop
Q: Is it going to rain soon?
GT A: yes
Machine A: yes
55. VQA Challenge
Examples from competition results
Q: Why is there snow on one
side of the stream and clear
grass on the other?
GT A: shade
Machine A: yes
Q: Is the hydrant painted a new
color?
GT A: yes
Machine A: no
57. Image generation from input caption
Photo-realistic image generation itself is difficult
• [Mansimov+, ICLR 2016]: Incrementally draw using LSTM
• N.B. Photo synthesis is well studied [Hays+Efros, 2007]
58. Generative Adversarial Networks (GAN)
[Goodfellow+, NIPS 2014]
• Unconditional generative model
• Adversarial learning of Generator and Discriminator
• GAN using convolution … DCGAN [Radford+, ICLR 2016]
Before Conditional Generative Models
Generator: random vector → image
Discriminator: discriminates real or fake
Discriminator: "This is a fake image from the Generator!"
62. Generative Adversarial Networks (GAN)
[Goodfellow+, NIPS 2014]
• Unconditional generative model
• Adversarial learning of Generator and Discriminator
• GAN using convolution … DCGAN [Radford+, ICLR 2016]
Before Conditional Generative Models
Generator: random vector → image
Discriminator: discriminates real or fake
Discriminator: "This is a … hmm"
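A bare-bones sketch of the adversarial training step (illustrative; it assumes G and D are given, and that D ends with a sigmoid so it outputs one probability per image):

```python
# One GAN training step: D learns to separate real from fake,
# G learns to fool D.
import torch
import torch.nn.functional as F

def gan_step(G, D, real_images, opt_g, opt_d, noise_dim=100):
    batch = real_images.size(0)
    z = torch.randn(batch, noise_dim)

    # Discriminator step: real -> 1, generated -> 0.
    opt_d.zero_grad()
    loss_d = (F.binary_cross_entropy(D(real_images), torch.ones(batch, 1)) +
              F.binary_cross_entropy(D(G(z).detach()), torch.zeros(batch, 1)))
    loss_d.backward()
    opt_d.step()

    # Generator step: make D label the fakes as real.
    opt_g.zero_grad()
    loss_g = F.binary_cross_entropy(D(G(z)), torch.ones(batch, 1))
    loss_g.backward()
    opt_g.step()
```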
63. Add a Caption to Generator and Discriminator
Conditional Generative Models
Generator tries to generate an image that is
• photo-realistic
• related to the caption
Discriminator tries to detect an image that is
• fake
• unrelated to the caption
[Reed+, ICML 2016]
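A minimal sketch of how the caption enters the generator (an approximation of [Reed+, ICML 2016]: the text embedding is concatenated to the noise vector; a single linear layer stands in for the real transposed-convolution stack, and the discriminator applies the same concatenation trick to its image feature):

```python
# Text-conditioned generator: noise z and caption embedding are fused.
import torch
import torch.nn as nn

class TextConditionedG(nn.Module):
    def __init__(self, noise_dim=100, text_dim=128, img_pixels=64 * 64 * 3):
        super().__init__()
        # Placeholder for a DCGAN-style upsampling stack.
        self.net = nn.Sequential(
            nn.Linear(noise_dim + text_dim, img_pixels), nn.Tanh())

    def forward(self, z, text_emb):
        return self.net(torch.cat([z, text_emb], dim=-1))
```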
64. Examples of generated images
• Birds (CUB) / Flowers (Oxford-102)
– About 10K images & 5 captions for each image
– 200 kinds of birds / 102 kinds of flowers
A tiny bird, with a tiny beak,
tarsus and feet, a blue crown,
blue coverts, and black
cheek patch
Bright droopy yellow petals
with burgundy streaks, and a
yellow stigma
[Reed+, ICML 2016]
65. Towards more realistic image generation
StackGAN [Zhang+, 2016]
Two-stage GANs
• The first GAN generates a small, fuzzy image
• The second GAN enlarges and refines it
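The two-stage idea in pseudocode-style Python (names are illustrative; stage1_g and stage2_g stand for the two text-conditioned generators):

```python
# StackGAN-style two-stage generation (sketch).
def stackgan_generate(text_emb, z, stage1_g, stage2_g):
    low_res = stage1_g(z, text_emb)         # Stage-I: small, fuzzy image
    high_res = stage2_g(low_res, text_emb)  # Stage-II: enlarged, refined image
    return high_res
```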
67. Examples of generated images
This bird is blue with white
and has a very short beak.
This flower is white and
yellow in color, with petals
that are wavy and smooth.
[Zhang+, 2016]
N.B. These results use datasets specialized in birds / flowers
→ Further breakthroughs are necessary to generate general images
68. Take-home Messages
• Looked over research on vision and language
1. Image Captioning
2. Video Captioning
3. Multilingual + Image Caption Translation
4. Visual Question Answering
5. Image Generation from Captions
• Contributions of Deep Learning
– Most research themes existed before Deep Learning
– Common techniques for processing images, videos and natural language
– Evolution of recognition and generation
Towards a new stage of vision and language!
Editor's Notes
In ILSVRC 2012, the only team that used a CNN, for the first time in the history of ILSVRC, won first place with overwhelming accuracy. This event spread deep learning far and wide, and the result has been reported on countless slides. As you can see, slides from academics, from AI startups participating in this GTC, and from the famous company holding this GTC all report the same thing.
The slides say that there was a large gap of error rates on ImageNet: whereas the 2nd team achieved 26.2%, the 1st team achieved 15.3%. Again, there was a large gap of error rates.
The 1st team is very famous, but some of you may be curious about the 2nd team: who are they?
You can easily find the answer, because the official site still has the information about ILSVRC 2012.
Yes, the 1st team, with deep learning, achieved a 15% error rate, and the 2nd team, without deep learning, achieved 26% … and if you scroll down the web page, the members of the second team are shown in a table. There seem to be several people on the second team; now please remember this name, which is hard to pronounce: Yoshitaka Ushiku.
Therefore, we propose a new approach by formulating a novel problem, the "multi-keyphrase problem".
We assume that the contents of images can be …
For example, if the image of the locomotive is the input, two keyphrases "" and "" are important. With only these keyphrases, we can generate a sentence by connecting them using grammar knowledge.
Even a rare image like the last one can be explained by estimating "man bites", which describes the relation between "man" and "bite".