5. Training Dataset
A woman posing on a red scooter.
White and gray kitten lying on its side.
A white van parked in an empty lot.
A white cat rests head on a stone.
Silver car parked on side of road.
A small gray dog on a leash.
A black dog standing in a grassy area.
A small white dog wearing a flannel warmer.
Input Image
Nearest Captions
A small white dog wearing a flannel warmer.
A small gray dog on a leash.
A black dog standing in a grassy area.
A small white dog standing on a leash.
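19. How to Evaluate Generated Captions
In machine translation…
• Each test sentence comes with multiple reference translations (typically five)
• A translation that is close to these references is considered "good"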
One jet lands at an airport while another takes off next to i
Two airplanes parked in an airport.
Two jets taxi past each other.
Two parked jet airplanes facing opposite directions.
two passenger planes on a grassy plain
Caption generation is evaluated in the same way
Example image and reference captions from PASCAL Sentence
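
The idea that a caption is "good" when it is close to the references can be sketched as a clipped n-gram precision, which is one ingredient of BLEU-style metrics. This is only an illustration, not the metric used in the talk: the function names and the tokenizer are assumptions, and the truncated first reference from the slide is left out.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumed tokenizer: lowercase, keep only alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, references, n=1):
    """Fraction of candidate n-grams that also occur in the references.

    Each n-gram is credited at most as many times as it appears in
    the single reference where it is most frequent (clipping).
    """
    cand_counts = Counter(ngrams(tokenize(candidate), n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(tokenize(ref), n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
    return matched / sum(cand_counts.values())

# Reference captions taken from the slide (complete ones only).
references = [
    "Two airplanes parked in an airport.",
    "Two jets taxi past each other.",
    "Two parked jet airplanes facing opposite directions.",
    "two passenger planes on a grassy plain",
]

close = clipped_precision("Two airplanes parked in an airport", references)
far = clipped_precision("A dog on a leash", references)
```

Here `close` is higher than `far` because every word of the first candidate appears in some reference, while the second candidate shares only "a" and "on" with them. A full metric would also combine several n-gram orders and penalize overly short captions.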
26. Existing Approaches to Exposure Bias
Scheduled sampling [Bengio+, NIPS 2015]
Data As Demonstrator [Venkatraman+, AAAI 2015]
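• At every step, flip a coin to choose between
– predicting the next word from the ground-truth sequence
– predicting the next word from the sequence being generated
• Schedule the coin so that the generated sequence is chosen more and more often, and eventually always
• This mitigates exposure bias, but…
– the optimization is still word-level, the same as XENT
– if the generated sequence has already diverged from the reference, errors accumulate even further
e.g. the reference is "I had a long walk." and the model has generated "I had a walk" so far
→ with this method the target for the next word is still "walk", which would yield "I had a walk walk"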
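
The coin toss and its schedule can be sketched as follows. This is a minimal illustration, not the implementation from either cited paper: the linear decay, the function names, and `decay_steps` are all assumptions made for the example.

```python
import random

def teacher_prob(step, decay_steps=1000):
    # Assumed schedule: linear decay from 1.0 (always use the
    # ground truth) to 0.0 (always use the model's own output).
    return max(0.0, 1.0 - step / decay_steps)

def choose_inputs(reference, generated, p_teacher, rng):
    """Per position, pick the conditioning token by coin toss."""
    inputs = []
    for ref_tok, gen_tok in zip(reference, generated):
        use_teacher = rng.random() < p_teacher
        inputs.append(ref_tok if use_teacher else gen_tok)
    return inputs

def next_target(reference, generated_prefix):
    # The word-level target stays aligned with the reference by
    # position, whatever the model has generated so far.
    return reference[len(generated_prefix)]

reference = ["I", "had", "a", "long", "walk"]
generated = ["I", "had", "a", "walk"]

rng = random.Random(0)
early = choose_inputs(reference, generated, teacher_prob(0), rng)
late = choose_inputs(reference, generated, teacher_prob(1000), rng)
target = next_target(reference, generated)  # "walk"
```

Early in training `early` follows the reference, and late in training `late` follows the generated words. In both cases `target` is "walk", which is the weakness pointed out on the slide: the target is fixed by position even though the model has already produced that word.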
The training dataset consists of image-caption pairs.
First, image similarity and caption similarity are combined to build a concept space.
When an image is input, its coordinates in that space are estimated and the neighboring pairs are retrieved.
The captions of the retrieved pairs are then scored by their distance to the input image.
Each phrase of each caption is also scored by how discriminative it is.
Finally, the highly scored phrases are combined to generate a caption for the input image.
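
The retrieval and caption-scoring steps can be sketched like this. It is a toy illustration under strong assumptions: the 2-D coordinates are made-up stand-ins for positions in the concept space, the score `1 / (1 + distance)` is just one plausible choice, and phrase scoring and phrase combination are not shown.

```python
import math

def retrieve_neighbors(query, dataset, k=3):
    """Return the k captions nearest to the query, best first.

    Each result is (score, caption), where the score grows as the
    distance to the query shrinks (assumed: 1 / (1 + distance)).
    """
    scored = []
    for coords, caption in dataset:
        distance = math.dist(query, coords)
        scored.append((1.0 / (1.0 + distance), caption))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Hypothetical concept-space coordinates paired with slide captions.
dataset = [
    ((0.9, 0.1), "A woman posing on a red scooter."),
    ((0.1, 0.9), "A small white dog wearing a flannel warmer."),
    ((0.2, 0.8), "A small gray dog on a leash."),
    ((0.3, 0.9), "A black dog standing in a grassy area."),
    ((0.8, 0.8), "A white van parked in an empty lot."),
]

query = (0.12, 0.88)  # assumed estimate for the input image
neighbors = retrieve_neighbors(query, dataset)
for score, caption in neighbors:
    print(f"{score:.3f}  {caption}")
```

With these made-up coordinates the three dog captions come back as the nearest neighbors, in the same order as on the slide.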