Translating Related Words to Videos and Back through Latent Topics


Pradipto Das, Rohini K. Srihari and Jason J. Corso
SUNY Buffalo
WSDM 2013, Rome, Italy

WiSDoM is beyond words
Master Yoda, how do I find wisdom from so many things happening around us?
Go to the center of the data and find your wisdom you will

What do the centers look like?

parkour perform traceur area flip footage jump park urban run lobster burger dress celery Christmas wrap roll mix tarragon
outdoor outdoors kid group pedestrian playground steam season scratch stick live water lemon garlic

floor parkour wall jump handrail locker contestant school run make dog sandwich man outdoors guy bench black sit park
interview block slide indoor perform build tab duck white disgustingly toe cough feed rub contest parody
Be careful what people do with their sandwiches!
Interviews indoors can be tough!

The actual ground-truth synopses overlaid
• Kid does parkour around the park
• Man performs parkour in various locations
• Footage of group of performing parkour outdoors
• Montage of guys free running up a tree and through the woods
• Interview with parkour contestants
• A family holds a strange burger assembly and wrapping contest at Christmas
• Tutorial: man explains how to make lobster rolls from scratch
• One guy is making sandwich outdoors

Back to conventional wisdom: Translation
S. Nishimoto, A. T. Vu, T. Naselaris, Y. Benjamini, B. Yu and J. Gallant, “Reconstructing Visual Experiences from Brain Activity Evoked by Natural Movies,” Current Biology, Vol. 21(19), 2011

• There is some model that captures the correspondence of the blood flow patterns in the brain to the world being observed
• Given a slightly different pattern, we are able to translate it to concepts present in our vocabulary and on to a lingual description
• Three basic assumptions of Machine Learning are satisfied:
1) There is a pattern 2) We do not know the target function 3) There is data to learn from

[Figure: training and testing pipeline using a topic model (LDA) and regression]

F. Pereira, G. Detre and M. Botvinick, "Generating text from functional brain images," Frontiers in Human Neuroscience, Vol. 5(72), 2011

Giving back to the community:
• Driverless cars are already helping the visually impaired to drive around
• It will be good to enable visually impaired drivers to hear the scenery in front

Do we speak all that we see?

Multiple Human Summaries: (Max 10 words i.e. imposing a length constraint)
1. There is a guy climbing on a rock-climbing wall.
2. A man is bouldering at an indoor rock climbing gym.
3. Someone doing indoor rock climbing.
4. A person is practicing indoor rock climbing.
5. A man is doing artificial rock climbing.

[Figure: video frame annotated with "Centers of attention (topics)" and "Not so important!"; callouts: Hand holding climbing surface, How many rocks?, The sketch in the board, Wrist-watch, What’s there in the back?, Dress of the climber, Empty slots, Color of the floor]

Summaries point toward information needs!

From patterns to topics to sentences
A young man climbs an artificial rock wall indoors
[Figure: grammatical parse of the sentence; labels: Direct Subject, Direct Object, Adjective modifier (What kind of wall?), Adverb modifier (climbing where?)]
• Spoken Language is complex – structured according to various grammars and dependent on active topics
• Different paraphrases describe the same visual input
Major Topic: Rock climbing
Sub-topics: artificial rock wall, indoor rock climbing gym

Object detection models

Annotations for training object/concept models
• Expensive frame-wise manual annotation efforts by drawing bounding boxes
• Difficulties: camera shakes, camera motion, zooming
• Careful consideration to which objects/concepts to annotate?
• Focus on object/concept detection – noisy for videos in-the-wild
• Does not answer which objects/concepts are important for summary generation?
[Figure: trained models detecting "Man with microphone" and "Climbing person"]

Translating across modalities
Learning latent translation spaces a.k.a. topics

• Mixed membership of latent topics
• Some topics capture observations that co-occur commonly
• Other topics allow for discrimination
• Different topics can be responsible for different modalities

No annotations needed – only need clip level summary
Human Synopsis: A young man is climbing an artificial rock wall indoors

Using learnt translation spaces for prediction

• Topics are marginalized out to permute vocabulary for predictions
• The lower the correlation among topics, the better the permutation
• Sensitive to priors for real valued data

Text Translation
$$p(w_v \mid w_O, w_H) \approx \sum_{o=1}^{O} \sum_{i=1}^{K} \phi^{(O)}_{d,o,i}\, p(w_v \mid \beta_i) + \sum_{h=1}^{H} \sum_{i=1}^{K} \phi^{(H)}_{d,h,i}\, p(w_v \mid \beta_i)$$
• $\phi^{(O)}_{d,o,i}$: responsibility of topic i over real valued observations
• $\phi^{(H)}_{d,h,i}$: responsibility of topic i over discrete video features
• $p(w_v \mid \beta_i)$: probability of learnt topic i explaining words in the text vocabulary

Wisdom of the young padawans
OB (Object Bank)
• High level semantic representation of images from low level features
[L.-J. Li, H. Su, E. P. Xing, and L. Fei-Fei. Object bank: A high-level image representation for scene classification and semantic feature sparsification. In NIPS, 2010]

HOG3D (Histogram of oriented gradients in 3D)
• Effective action recognition features for videos
[A. Klaser, M. Marszalek, and C. Schmid. A spatio-temporal descriptor based on 3d-gradients. In BMVC, 2008]

Color Histogram:
• 512 RGB color bins
• Histograms are computed on densely sampled frames
• Large deviations in the extremities of the color spectrum are discarded
Scenes from images belonging to different topics and sub-topics
Topics: Town hall meeting
• The video is about a man answering to a question from the podium by using a microphone
• Two camera men film a cop taking a camera from a woman sitting in a group
Sub-Topics: Rock climbing
• A young man climbs an artificial rock wall indoors
• A man climbs a boulder outdoors with a friend spotting

Global GIST energy [A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. Int. J. Comput. Vision, 42(3):145–175, 2001.]
• Eight perceptual dimensions capture most of the 3D structures of real-world scenes
• Naturalness, openness, perspective or expansion, size or roughness, ruggedness, mean depth, symmetry and complexity

GIST in general terms:
• An energy space that pervades the arrangements of objects
• Does not really care about the specificity of the objects
• Helps us summarize an image even after it has disappeared from our sight

Yoda’s wisdom
It will be nice to have the Force as a “feature”!

For my ally is the Force,
Its energy surrounds us and binds us.
Luminous beings are we,
not this crude matter.

Datasets
• NIST's 2011 TRECVID Multimedia Event Detection (MED) Events and Dev-T datasets

• Training set is organized into 15 event categories: 1) Attempting a board trick 2) Feeding an animal 3) Landing a fish 4) Wedding ceremony 5) Woodworking project 6) Birthday party 7) Changing a vehicle tire 8) Flash mob gathering 9) Getting a vehicle unstuck 10) Grooming an animal 11) Making a sandwich 12) Parade 13) Parkour 14) Repairing an appliance 15) Working on a sewing project

• Each video has its own high level summary – varies from 2 to 40 words but on average 10 words

• 2062 clips in the training set and 530 clips for the first 5 events in the Dev-T set

• Dev-T summaries are only used as reference summaries for evaluation with up to 10 predicted keywords

The summarization perspective
[Figure: a multimedia topic model relates multiple sets of documents (sets of frames in videos) to multiple sentences (groups of segments in frames) and permutes event specific vocabularies. Events shown: Skateboarding (sub-events e.g. skateboarding, snowboarding, surfing), Wedding ceremony, Feeding animals, Woodworking project, Landing fishes. Outputs: bag of keywords multi-document summaries and natural language multi-document summaries]

Why event specific vocabularies?

Model                     | Actual Synopsis      | Predicted Words (top 10)
One school of thought     | man feeds fish bread | fish jump bread fishing skateboard pole machine car dog cat
Another school of thought | man feeds fish bread | shampoo sit condiment place bread fill plate jump pole fishing

• Intuitively multiple objects and actions are shared across events and many different words get semantically associated
• Prediction quality degenerates rapidly!

Previously
[P. Das, R. K. Srihari and Y. Fu. “Simultaneous Joint and Conditional Modeling of
Documents Tagged from Two Perspectives,” CIKM, Glasgow Scotland, 2011]

[Figure: annotated Wikipedia article; labels: words forming other Wiki articles, article specific content words, words corresponding to the embedded multimedia]

Afterwards
[P. Das, R. K. Srihari and J. J. Corso. “Translating Related Words to Videos and
Back through Latent Topics,” WSDM, Rome, Italy, 2013]

[Figure: annotated Wikipedia article; labels: words forming other Wiki articles, article specific content words, words corresponding to the embedded multimedia]

The family of multimedia topic models
• Corr-MMGLDA: If a single topic generates a scene, the same topic generates all text in the document – a considerable strong point, but a drawback for summary generation if this is not the case
• MMGLDA: More diffuse translation of both visual and textual patterns
through the latent translation spaces
– Intuitively it aids frequency based summarization

[Figure: graphical models of MMGLDA and Corr-MMGLDA; labels: document specific topic proportions, indicator variables, synopses words, GIST features, visual “words”, topic parameters for explaining latent structure within observation ensembles]
Key is to use an asymmetric Dirichlet prior

Topic modeling performance

• Test ELBOs on events 1-5 in the Dev-T set
– Measuring held-out log likelihoods on both videos and associated human summaries
• Prediction ELBOs on events 1-5 in the Dev-T set
– Measuring held-out log likelihoods on just videos in absence of the text
• In a purely multinomial MMLDA model, failures of independent events contribute highly negative terms to the log likelihoods
• Clearly NOT a measure of keyword summary generation power
• For the MMGLDA family of models, Gaussian components can partially remove the independence through covariance modeling
• This allows only the responsible topic-Gaussians to contribute to the likelihood

Translating Related Words to Videos
• Corr-MMGLDA is able to capture more variance relative to MMGLDA
• α for Corr-MMGLDA is also slightly higher than that for MMGLDA
• This can allow related but topically unique concepts to appear upfront

[Figure: side-by-side panels for Corr-MMGLDA and MMGLDA]

                         1         2         3         4         5         6         7         8         9         10
Corr-MMGLDA-α            0.445936  0.451391  0.462443  0.397392  0.374922  0.573839  0.425912  0.375423  0.38186   0.189047
MMGLDA-α                 0.414354  0.422954  0.427442  0.359592  0.353317  0.552872  0.39681   0.349695  0.345466  0.163971
Corr-MMGLDA: log(α/|Λ|)  12.6479   61.7312   50.0512   58.7659   60.1194   104.628   28.2949   31.3856   18.9223   8.164
MMGLDA: log(α/|Λ|)       12.498    61.4666   49.8858   58.643    59.9248   104.623   28.2264   31.2219   18.6953   8.1025

Related Words to Videos – Difficult Examples
measure project lady
tape indoor sew
marker pleat
highwaist zigzag
scissor card mark
teach cut fold stitch
pin woman skirt
machine fabric inside
scissors make leather
kilt man beltloop
sew woman fabric
make machine show
baby traditional loom
blouse outdoors
blanket quick
rectangle hood knit
indoor stitch scissors
pin cut iron studio
montage measure kid
penguin dad stuff
thread

Related Words to Videos – Difficult Examples
clock mechanism
repair computer tube
wash machine lapse
click desk mouse time
front wd40 pliers
reattach knob make
level video water
control person clip
part wire inside
indoor whirlpool man
gear machine guy
repair sew fan test
make replace grease
vintage motor box
indoor man tutorial
fuse bypass brush
wrench repairman
lubricate workshop
bottom remove screw
unscrew screwdriver
video wire

A few words is worth a thousand frames!

From MMGLDA

Event classification and summarization
[Figure: multimedia topic model over sets of frames in videos, producing bag of words and natural language multi-document summaries for events such as Feeding animals, Woodworking project and Landing fishes]
• A C-SVM classifier from the libSVM package is used with default settings for multiclass (15 classes) classification
• 55% test accuracy easily achievable

Evaluate using ROUGE-1
HEXTAC 2009: 100-word human references vs. 100-word manually extracted summary
Average Recall: 0.37916 (95%-confidence interval 0.37187 - 0.38661)
Average Precision: 0.39142 (95%-confidence interval 0.38342 - 0.39923)
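
For readers unfamiliar with ROUGE-1, the recall and precision figures are unigram-overlap ratios. The sketch below is a simplified single-reference version written for illustration; the official ROUGE toolkit adds options such as stemming and multiple references that are not reproduced here, and the function name `rouge_1` is hypothetical.

```python
from collections import Counter


def rouge_1(candidate, reference):
    """Return (recall, precision) of unigram overlap.

    Overlap counts are clipped, so a word repeated in the candidate is
    credited at most as often as it occurs in the reference.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    return recall, precision


if __name__ == "__main__":
    reference = "man feeds fish bread"
    predicted = "fish jump bread fishing skateboard pole machine car dog cat"
    print(rouge_1(predicted, reference))
```

With a 10-keyword prediction scored against a short synopsis, precision is capped by how many synopsis words exist at all, which is one reason the 10-word setting is judged against a much lower bar than 100-word summaries.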

- Usually changes from dataset to dataset but max around 40-45% for 100 word system summaries
- If we can achieve 10% of this for 10 word summaries, we are doing pretty good!
- Caveat – The text multi-document summarization task is much more complex than this simpler task

Future Directions:
- Typically lots of features help in classification but do we need all of them for better summary generation?
- Does better event classification performance always mean better summarization performance?
ROUGE-1 performance
• MMLDA can show poor ELBO – a bit misleading
• Performs quite well on predicting summary worthy keywords

• MMGLDA produces better topics and higher ELBO
• Summary worthiness of keywords almost same as MMLDA for lower n

• Sum-normalizing the real valued data to lie in [0,1]^P distorts reality for Corr-MMGLDA w.r.t. quantitative evaluation

• Summary worthiness of keywords is not good but topics are good
• Different but related topics can model GIST features almost equally (strong overlap in the tail of the Gaussians)
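
One way sum-normalization can distort real valued features is by discarding overall magnitude. The sketch below only illustrates that general effect, assuming non-negative features; it is not a reproduction of the preprocessing used in the experiments, and `sum_normalize` is a hypothetical helper.

```python
def sum_normalize(x):
    """Scale a non-negative feature vector so its entries sum to 1.

    Every entry of the result lies in [0, 1], so a P-dimensional input
    is mapped into [0, 1]^P.
    """
    total = sum(x)
    if total <= 0:
        raise ValueError("expected a non-negative vector with positive sum")
    return [v / total for v in x]


if __name__ == "__main__":
    weak = [1.0, 2.0, 1.0]      # low-energy feature vector
    strong = [10.0, 20.0, 10.0]  # same shape, ten times the energy
    # Both vectors collapse onto the same point after normalization.
    print(sum_normalize(weak) == sum_normalize(strong))
```

Two observations that differ only in scale become indistinguishable, which is the kind of information loss a Gaussian topic component cannot recover afterwards.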

Future Directions
• Need better initialization of priors governing parameters for real valued data
[N. Nasios and A.G. Bors. Variational learning for Gaussian mixture models. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 36(4):849–862, 2006]

Model usefulness and applications
• Applications
– Label topics through document level multimedia
– Movie recommendations through semantically
related frames
– Video analysis: word prediction given video features
– Adword creation through semantics of multimedia
(Using transcripts only can be noisy)
– Semantic compression of videos
– Allowing the visually impaired to hear the world
through text

Long list of acknowledgements
• Scott McCloskey (Honeywell ACS Labs)
• Sangmin Oh, Amitha Perera (Kitware Inc.)
• Kevin Cannons, Arash Vahdat, Greg Mori (SFU)
For helping us with feature extractions, event classification evaluations and
many fruitful discussions throughout this project

• Jack Gallant (UC Berkeley)
• Francisco Pereira (Siemens Corporate Research)
For allowing us to reuse some of their illustrations in this presentation

• Lucy Vanderwende (Microsoft Research)
• Enrique Alfonseca (Google Research)
For helpful discussions during TAC 2011 on the importance of the
summarization problem outside of the competitions on newswire collections

Long list of acknowledgements
This work was supported by the Intelligence Advanced Research
Projects Activity (IARPA) via Department of Interior National Business
Center contract number D11PC20069. The U.S. Government is
authorized to reproduce and distribute reprints for Governmental
purposes notwithstanding any copyright annotation thereon.
Disclaimer: The views and conclusions contained herein are those of
the authors and should not be interpreted as necessarily representing
the official policies or endorsements, either expressed or implied, of
IARPA, DOI/NBC, or the U.S. Government.

We also thank the anonymous reviewers for their comments
