Translating Related Words to Videos and Back through Latent Topics
1. Translating Related Words to Videos and Back through Latent Topics
Pradipto Das, Rohini K. Srihari and Jason J. Corso
SUNY Buffalo
WSDM 2013, Rome, Italy
2. WiSDoM is beyond words
Master Yoda, how do I find wisdom from so many things happening around us?
Go to the center of the data and find your wisdom you will
4. What do the centers look like?
parkour perform traceur area flip footage jump park urban run outdoor outdoors kid group pedestrian playground
lobster burger dress celery Christmas wrap roll mix tarragon steam season scratch stick live water lemon garlic
floor parkour wall jump handrail locker contestant school run interview block slide indoor perform build tab
make dog sandwich man outdoors guy bench black sit park duck white disgustingly toe cough feed rub contest parody
Interviews indoors can be tough!
Be careful what people do with their sandwiches!
5. The actual ground-truth synopses overlaid
Ground-truth synopses:
Man performs parkour in various locations
Kid does parkour around the park
Footage of group of performing parkour outdoors
Montage of guys free running up a tree and through the woods
A family holds a strange burger assembly and wrapping contest at Christmas
Tutorial: man explains how to make lobster rolls from scratch
Interview with parkour contestants
One guy is making sandwich outdoors
Topic words:
parkour perform traceur area flip footage jump park urban run outdoor outdoors kid group pedestrian playground
lobster burger dress celery Christmas wrap roll mix tarragon steam season scratch stick live water lemon garlic
floor parkour wall jump handrail locker contestant school run interview block slide indoor perform build tab
make dog sandwich man outdoors guy bench black sit park duck white disgustingly toe cough feed rub contest parody
Interviews indoors can be tough!
Be careful what people do with their sandwiches!
6. Back to conventional wisdom: Translation
S. Nishimoto, A. T. Vu, T. Naselaris, Y. Benjamini, B. Yu and J. Gallant, “Reconstructing Visual Experiences from Brain
Activity Evoked by Natural Movies,” Current biology Vol. 21(19), 2011
There is some model that captures the correspondence of the blood flow patterns in the
brain to the world being observed
Given a slightly different pattern we are able to translate them to concepts present in our
vocabulary to a lingual description
Three basic assumptions of Machine Learning are satisfied:
1) There is a pattern 2) We do not know the target function 3) There is data to learn from
[Figure: training and testing stages, with a topic model (LDA) and regression]
F. Pereira, G. Detre and M. Botvinick, "Generating text from functional brain images," In Frontiers in Human
Neuroscience, Vol. 5(72), 2011
7. Back to conventional wisdom: Translation
Giving back to the community:
Driverless cars are already helping the visually impaired to drive around
It will be good to enable visually impaired drivers to hear the scenery in front
8. Do we speak all that we see?
Multiple Human Summaries (max 10 words, i.e. imposing a length constraint):
1. There is a guy climbing on a rock-climbing wall.
2. A man is bouldering at an indoor rock climbing gym.
3. Someone doing indoor rock climbing.
4. A person is practicing indoor rock climbing.
5. A man is doing artificial rock climbing.
9. Centers of attention (topics)
Not so important!
Hand holding climbing surface
How many rocks?
The sketch in the board
Wrist-watch
What’s there in the back?
Dress of the climber
Empty slots
Color of the floor
Multiple Human Summaries (max 10 words, i.e. imposing a length constraint):
1. There is a guy climbing on a rock-climbing wall.
2. A man is bouldering at an indoor rock climbing gym.
3. Someone doing indoor rock climbing.
4. A person is practicing indoor rock climbing.
5. A man is doing artificial rock climbing.
Summaries point toward information needs!
10. From patterns to topics to sentences
A young man climbs an artificial rock wall indoors
Direct Subject
Direct Object
Adverb modifier (climbing where?)
Adjective modifier (What kind of wall?)
Spoken Language is complex: structured according to various grammars and dependent on active topics
Different paraphrases describe the same visual input
Major Topic: Rock climbing
Sub-topics: artificial rock wall, indoor rock climbing gym
11. Object detection models
Annotations for training object/concept models
Expensive frame-wise manual annotation efforts by drawing bounding boxes
Difficulties: camera shakes, camera motion, zooming
Careful consideration to which objects/concepts to annotate?
Trained Models (example labels: "Man with microphone", "Climbing person")
Focus on object/concept detection: noisy for videos in-the-wild
Does not answer which objects/concepts are important for summary generation?
12. Translating across modalities
Learning latent translation spaces a.k.a. topics
Mixed membership of latent topics
Some topics capture observations that co-occur commonly
Other topics allow for discrimination
Different topics can be responsible for different modalities
No annotations needed: only need clip level summary
Human Synopsis: A young man is climbing an artificial rock wall indoors
13. Translating across modalities
Using learnt translation spaces for prediction
Topics are marginalized out to permute vocabulary for predictions
The lower the correlation among topics, the better the permutation
Sensitive to priors for real valued data
Text Translation
$$p(w_v \mid w_O, w_H) \approx \sum_{o=1}^{O} \sum_{i=1}^{K} \phi^{(O)}_{d,o,i}\, p(w_v \mid \beta_i) + \sum_{h=1}^{H} \sum_{i=1}^{K} \phi^{(H)}_{d,h,i}\, p(w_v \mid \beta_i)$$
14. Translating across modalities
Use learnt translation spaces for prediction
Topics are marginalized out to permute vocabulary for predictions
The lower the correlation among topics, the better the permutation
Sensitive to priors for real valued data
Text Translation
$$p(w_v \mid w_O, w_H) \approx \sum_{o=1}^{O} \sum_{i=1}^{K} \phi^{(O)}_{d,o,i}\, p(w_v \mid \beta_i) + \sum_{h=1}^{H} \sum_{i=1}^{K} \phi^{(H)}_{d,h,i}\, p(w_v \mid \beta_i)$$
Annotations on the formula:
Responsibility of topic i over real valued observations
Responsibility of topic i over discrete video features
Probability of learnt topic i explaining words in the text vocabulary
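The prediction rule on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `phi_real`, `phi_discrete` and `beta` are hypothetical names for the inferred topic responsibilities and the topic-word probabilities, the two sums are assumed to be added, and all numbers are made up.

```python
def predict_word_scores(phi_real, phi_discrete, beta):
    """Score each vocabulary word by marginalizing out the topics.

    phi_real:     rows of K responsibilities, one row per real valued observation
    phi_discrete: rows of K responsibilities, one row per discrete video feature
    beta:         K rows, each a distribution over the V vocabulary words
    """
    num_topics = len(beta)
    vocab_size = len(beta[0])
    scores = [0.0] * vocab_size
    for responsibilities in list(phi_real) + list(phi_discrete):
        for i in range(num_topics):
            for v in range(vocab_size):
                scores[v] += responsibilities[i] * beta[i][v]
    return scores


def top_words(scores, vocabulary, n=10):
    """Return the n highest scoring words (ties keep vocabulary order)."""
    ranked = sorted(range(len(scores)), key=lambda v: -scores[v])
    return [vocabulary[v] for v in ranked[:n]]


# Toy example with K=2 topics and a 3-word vocabulary (illustrative values only).
vocabulary = ["climb", "wall", "fish"]
beta = [[0.6, 0.3, 0.1],   # topic 0 favours climbing words
        [0.1, 0.1, 0.8]]   # topic 1 favours fishing words
phi_real = [[0.9, 0.1]]
phi_discrete = [[0.8, 0.2], [0.7, 0.3]]
scores = predict_word_scores(phi_real, phi_discrete, beta)
print(top_words(scores, vocabulary, n=2))
```

Because the video features here put most of their responsibility on topic 0, the words that topic 0 explains well rise to the top of the ranking.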
15. Wisdom of the young padawans
OB (Object Bank)
High level semantic
representation of images from
low level features
[L-J. Li, H. Su, E. P. Xing, and L. Fei-fei. Object bank:
A high-level image representation for scene
classification and semantic feature sparsification.
In NIPS, 2010]
HOG3D (Histogram of oriented
gradients in 3D)
Effective action recognition
features for videos
[A. Klaser, M. Marszalek, and C. Schmid. A spatio-
temporal descriptor based on 3d-gradients. In
BMVC, 2008]
Color Histogram:
512 RGB color bins
histograms are computed on
densely sampled frames
large deviations in the
extremities of the color spectrum
are discarded
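A 512-bin RGB histogram can be sketched as follows. The slide does not say how the 512 bins are laid out; this sketch assumes 8 levels per channel (8 x 8 x 8 = 512) and leaves out the step that discards the extremities of the color spectrum. The function name `rgb_histogram_512` is hypothetical.

```python
def rgb_histogram_512(pixels):
    """Return a normalized 512-bin histogram for 8-bit (r, g, b) pixels."""
    bins_per_channel = 8                 # assumption: 8 * 8 * 8 = 512 bins
    width = 256 // bins_per_channel      # 32 intensity levels per bin
    counts = [0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        # Combine the three per-channel bin indices into one flat index.
        index = ((r // width) * bins_per_channel + (g // width)) * bins_per_channel + (b // width)
        counts[index] += 1
    total = len(pixels)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


# Four pixels: one black, two white, one dark red.
frame_pixels = [(0, 0, 0), (255, 255, 255), (255, 255, 255), (40, 0, 0)]
histogram = rgb_histogram_512(frame_pixels)
print(len(histogram), histogram[0], histogram[511])
```

In practice the pixels would come from the densely sampled frames mentioned above, and the per-frame histograms would then be aggregated per clip.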
16. Wisdom of the young padawans
The video is about a man answering a question from the podium by using a microphone
Two camera men film a cop taking a camera from a woman sitting in a group
Town hall meeting
Topics
Scenes from images belonging to different topics and sub-topics
Rock climbing
A young man climbs an artificial rock wall indoors
A man climbs a boulder outdoors with a friend spotting
Sub-Topics
17. Wisdom of the young padawans
Global GIST energy [A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. Int. J. Comput. Vision, 42(3):145-175, 2001.]
Eight perceptual dimensions capture most of the 3D structures of real-world scenes:
naturalness, openness, perspective or expansion, size or roughness, ruggedness, mean depth, symmetry and complexity
GIST in general terms:
An energy space that pervades the arrangements of objects
Does not really care about the specificity of the objects
Helps us summarize an image even after it has disappeared from our sight
18. Yoda’s wisdom
It will be nice to have the Force as a “feature”!
For my ally is the Force,
Its energy surrounds us and binds us.
Luminous beings are we,
not this crude matter.
The video is about a man answering a question from the podium by using a microphone
Two camera men film a cop taking a camera from a woman sitting in a group
A young man climbs an artificial rock wall indoors
A man climbs a boulder outdoors with a friend spotting
19. Datasets
NIST's 2011 TRECVID Multimedia Event Detection (MED) Events and Dev-T
datasets
Training set is organized into 15 event categories, some of which are: 1)
Attempting a board trick 2) Feeding an animal 3) Landing a fish 4) Wedding
ceremony 5) Woodworking project 6) Birthday party 7) Changing a vehicle
tire 8) Flash mob gathering 9) Getting a vehicle unstuck 10) Grooming an
animal 11) Making a sandwich 12) Parade 13) Parkour 14) Repairing an
appliance 15) Working on a sewing project
Each video has its own high level summary – varies from 2 to 40 words but
on average 10 words
2062 clips in the training set and 530 clips for the first 5 events in the Dev-T
set
Dev-T summaries are only used as reference summaries for evaluation with
up to 10 predicted keywords
20. The summarization perspective
Sub-events e.g. skateboarding, snowboarding, surfing
Multiple sets of documents (sets of frames in videos)
Multiple sentences (group of segments in frames)
Multimedia Topic Model – permute event specific vocabularies
Skateboarding, Wedding ceremony, Feeding animals, Woodworking project, Landing fishes
Bag of keywords multi-document summaries
Natural language multi-document summaries
21. The summarization perspective
Why event specific vocabularies?
Model | Actual Synopsis | Predicted Words (top 10)
One school of thought | man feeds fish bread | fish jump bread fishing skateboard pole machine car dog cat
Another school of thought | man feeds fish bread | shampoo sit condiment place bread fill plate jump pole fishing
Intuitively multiple objects and actions are shared across events and many different words get semantically associated
Prediction quality degenerates rapidly!
22. Previously
[P. Das, R. K. Srihari and Y. Fu. “Simultaneous Joint and Conditional Modeling of
Documents Tagged from Two Perspectives,” CIKM, Glasgow Scotland, 2011]
Words forming other Wiki articles
Article specific content words
Words corresponding to the embedded multimedia
23. Afterwards
[P. Das, R. K. Srihari and J. J. Corso. “Translating Related Words to Videos and
Back through Latent Topics,” WSDM, Rome, Italy, 2013]
Words forming other Wiki articles
Article specific content words
Words corresponding to the embedded multimedia
24. The family of multimedia topic models
• Corr-MMGLDA: If a single topic generates a scene, the same topic generates all text in the document. This is a considerable strong point, but a drawback for summary generation if it is not the case
• MMGLDA: More diffuse translation of both visual and textual patterns through the latent translation spaces
– Intuitively it aids frequency based summarization
Key is to use an asymmetric Dirichlet prior
[Figure: MMGLDA and Corr-MMGLDA. Legend:]
Document specific topic proportions
Indicator variables
Synopses words
GIST features
Visual “words”
Topic Parameters for explaining latent structure within observation ensembles
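The effect of an asymmetric Dirichlet prior on the document specific topic proportions can be illustrated with a small sampling sketch. The alpha values below are made up, and the sampler (normalized Gamma draws from the standard library) is only one convenient way to draw from a Dirichlet; nothing here is taken from the authors' code.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) via normalized Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def mean_proportions(alpha, num_samples=2000, seed=0):
    """Average topic proportions over many sampled documents."""
    rng = random.Random(seed)
    sums = [0.0] * len(alpha)
    for _ in range(num_samples):
        for i, p in enumerate(sample_dirichlet(alpha, rng)):
            sums[i] += p
    return [s / num_samples for s in sums]


symmetric_alpha = [0.5, 0.5, 0.5, 0.5]    # every topic equally favoured a priori
asymmetric_alpha = [2.0, 0.5, 0.3, 0.2]   # the first topic absorbs more observations
print(mean_proportions(symmetric_alpha))
print(mean_proportions(asymmetric_alpha))
```

With the asymmetric vector, the first topic receives a much larger share on average, which is the kind of behaviour that lets one topic soak up commonly occurring observations while others stay specific.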
25. Topic modeling performance
Test ELBOs on events 1-5 in the Dev-T set: measuring held-out log likelihoods on both videos and associated human summaries
Prediction ELBOs on events 1-5 in the Dev-T set: measuring held-out log likelihoods on just videos in absence of the text
In a purely multinomial MMLDA model, failures of independent events
contribute highly negative terms to the log likelihoods
Clearly NOT a measure of keyword summary generation power
For the MMGLDA family of models, Gaussian components can partially
remove the independence through covariance modeling
This allows only the responsible topic-Gaussians to contribute to the likelihood
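A small numeric sketch shows why a purely multinomial likelihood can look poor without saying much about keyword prediction. The word probabilities and documents below are invented for illustration; the point is only that a single low probability word contributes a large negative term to the log likelihood.

```python
import math


def multinomial_log_likelihood(word_counts, word_probs):
    """Sum of count * log(probability) over the words in one document."""
    return sum(count * math.log(word_probs[word])
               for word, count in word_counts.items())


# Hypothetical topic-word probabilities and two four-word documents.
word_probs = {"climb": 0.5, "wall": 0.4999, "fish": 0.0001}
expected_doc = {"climb": 2, "wall": 2}
surprising_doc = {"climb": 2, "wall": 1, "fish": 1}

ll_expected = multinomial_log_likelihood(expected_doc, word_probs)
ll_surprising = multinomial_log_likelihood(surprising_doc, word_probs)
print(ll_expected, ll_surprising)
```

Swapping one word for an unlikely one lowers the log likelihood sharply, yet the highest probability words, which are what a keyword summary keeps, are unchanged.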
27. Translating Related Words to Videos
Corr-MMGLDA is able to capture more variance relative to MMGLDA
α for Corr-MMGLDA is also slightly higher than that for MMGLDA
This can allow related but topically unique concepts to appear upfront
[Figure panels: Corr-MMGLDA, MMGLDA]
# | Corr-MMGLDA-α | MMGLDA-α | Corr-MMGLDA: log(α/|Λ|) | MMGLDA: log(α/|Λ|)
1 | 0.445936 | 0.414354 | 12.6479 | 12.498
2 | 0.451391 | 0.422954 | 61.7312 | 61.4666
3 | 0.462443 | 0.427442 | 50.0512 | 49.8858
4 | 0.397392 | 0.359592 | 58.7659 | 58.643
5 | 0.374922 | 0.353317 | 60.1194 | 59.9248
6 | 0.573839 | 0.552872 | 104.628 | 104.623
7 | 0.425912 | 0.39681 | 28.2949 | 28.2264
8 | 0.375423 | 0.349695 | 31.3856 | 31.2219
9 | 0.38186 | 0.345466 | 18.9223 | 18.6953
10 | 0.189047 | 0.163971 | 8.164 | 8.1025
28. Related Words to Videos – Difficult Examples
measure project lady
tape indoor sew
marker pleat
highwaist zigzag
scissor card mark
teach cut fold stitch
pin woman skirt
machine fabric inside
scissors make leather
kilt man beltloop
sew woman fabric
make machine show
baby traditional loom
blouse outdoors
blanket quick
rectangle hood knit
indoor stitch scissors
pin cut iron studio
montage measure kid
penguin dad stuff
thread
29. Related Words to Videos – Difficult Examples
clock mechanism
repair computer tube
wash machine lapse
click desk mouse time
front wd40 pliers
reattach knob make
level video water
control person clip
part wire inside
indoor whirlpool man
gear machine guy
repair sew fan test
make replace grease
vintage motor box
indoor man tutorial
fuse bypass brush
wrench repairman
lubricate workshop
bottom remove screw
unscrew screwdriver
video wire
30. A few words is worth a thousand frames!
From MMGLDA
31. A few words is worth a thousand frames!
From MMGLDA
32. Event classification and summarization
Sub-events e.g. skateboarding, snowboarding, surfing
Multiple sets of documents (sets of frames in videos)
Multiple sentences (group of segments in frames)
Multimedia Topic Model – permute event specific vocabularies
Skateboarding, Wedding ceremony, Feeding animals, Woodworking project, Landing fishes
Bag of words multi-document summaries
Natural language multi-document summaries
A c-SVM classifier from the libSVM package is used with default settings for multiclass (15 classes) classification
55% test accuracy easily achievable
Evaluate using ROUGE-1
HEXTAC 2009: 100-word human references vs. 100-word manually extracted summary
Average Recall: 0.37916 (95%-confidence interval 0.37187 - 0.38661)
Average Precision: 0.39142 (95%-confidence interval 0.38342 - 0.39923)
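For keyword summaries, ROUGE-1 reduces to unigram overlap between the predicted words and the reference synopsis. The sketch below is a simplified stand-in for the official ROUGE toolkit (no stemming, no stopword handling, no confidence intervals), and the function name `rouge_1` is hypothetical. The example reuses the "man feeds fish bread" synopsis and one of the top-10 predictions shown earlier in the deck.

```python
from collections import Counter


def rouge_1(predicted_words, reference_words):
    """Return (recall, precision) of unigram overlap with clipped counts."""
    predicted = Counter(predicted_words)
    reference = Counter(reference_words)
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((predicted & reference).values())
    recall = overlap / sum(reference.values()) if reference else 0.0
    precision = overlap / sum(predicted.values()) if predicted else 0.0
    return recall, precision


reference = "man feeds fish bread".split()
predicted = "fish jump bread fishing skateboard pole machine car dog cat".split()
recall, precision = rouge_1(predicted, reference)
print(recall, precision)
```

Two of the four reference words are recovered by the ten predicted keywords, so recall and precision differ simply because the two word lists have different lengths.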
33. Event classification and summarization
- Usually changes from dataset to dataset but max around 40-45% for 100 word system summaries
- If we can achieve 10% of this for 10 word summaries, we are doing pretty well!
- Caveat: the text multi-document summarization task is much more complex than this simpler task
34. Future Directions:
- Typically lots of features help in classification but do we need all of them for better summary generation?
- Does better event classification performance always mean better summarization performance?
35. ROUGE-1 performance
MMLDA can show poor ELBO – a bit
misleading
Performs quite well on predicting
summary worthy keywords
MMGLDA produces better topics and
higher ELBO
Summary worthiness of keywords
almost same as MMLDA for lower n
Sum-normalizing the real valued data to lie in [0,1]^P distorts reality for Corr-MMGLDA w.r.t. quantitative evaluation
Summary worthiness of keywords is
not good but topics are good
Different but related topics can model
GIST features almost equally (strong
overlap in the tail of the Gaussians)
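Sum-normalization itself is a one-line operation. The sketch below assumes non-negative feature vectors and shows one way such normalization can discard information (overall magnitude); whether this is the distortion the slide has in mind is an assumption, and the feature values are invented.

```python
def sum_normalize(vector):
    """Scale a non-negative feature vector so its entries sum to 1."""
    total = sum(vector)
    if total == 0:
        return [0.0] * len(vector)
    return [x / total for x in vector]


weak_response = [1.0, 2.0, 3.0, 4.0]
strong_response = [2.0, 4.0, 6.0, 8.0]   # same shape, twice the energy

normalized_weak = sum_normalize(weak_response)
normalized_strong = sum_normalize(strong_response)
print(normalized_weak)
print(normalized_strong)
```

Both vectors map to the same point in [0,1]^P, so any difference in overall energy between the two observations is no longer visible to a Gaussian fitted on the normalized data.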
36. ROUGE-1 performance
Future Directions
Need better initialization of priors governing parameters for real valued data
[N. Nasios and A.G. Bors. Variational learning for Gaussian mixture models. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 36(4):849-862, 2006]
37. Model usefulness and applications
• Applications
– Label topics through document level multimedia
– Movie recommendations through semantically
related frames
– Video analysis: word prediction given video features
– Adword creation through semantics of multimedia
(Using transcripts only can be noisy)
– Semantic compression of videos
– Allowing the visually impaired to hear the world
through text
38. Long list of acknowledgements
• Scott McCloskey (Honeywell ACS Labs)
• Sangmin Oh, Amitha Perera (Kitware Inc.)
• Kevin Cannons, Arash Vahdat, Greg Mori (SFU)
For helping us with feature extractions, event classification evaluations and
many fruitful discussions throughout this project
• Jack Gallant (UC Berkeley)
• Francisco Pereira (Siemens Corporate Research)
For allowing us to reuse some of their illustrations in this presentation
• Lucy Vanderwende (Microsoft Research)
• Enrique Alfonseca (Google Research)
For helpful discussions during TAC 2011 on the importance of the
summarization problem outside of the competitions on newswire collections
39. Long list of acknowledgements
This work was supported by the Intelligence Advanced Research
Projects Activity (IARPA) via Department of Interior National Business
Center contract number D11PC20069. The U.S. Government is
authorized to reproduce and distribute reprints for Governmental
purposes notwithstanding any copyright annotation thereon.
Disclaimer: The views and conclusions contained herein are those of
the authors and should not be interpreted as necessarily representing
the official policies or endorsements, either expressed or implied, of
IARPA, DOI/NBC, or the U.S. Government.
We also thank the anonymous reviewers for their comments
Big data problem: lots of data around us, but which ones are meaningful? Need statistics from the data that meaningfully encode multiple views, i.e. modalities. Sufficient statistics (i.e. the function of a sample that encodes all information about the sample) usually represent the centers of the data.
Centers are the topics, which correspond to some best description of data that are similar in some way. True centers are never known: each one of us has an algorithm for finding centers, our own topic model.
The actual ground-truth synopses overlaid over the training topics
BOLD (Blood Oxygen Level Dependent) and fMRI patterns. Images used with permission from Jack Gallant and Francisco Pereira (by the way, both of them are now applying topic models to map brain patterns to movies or text).
A genuine philanthropic use case
The importance of relating multi-document summaries to that for summarizing videos – every frame is a document
Psycholinguistics are needed to confirm, but that’s not a concern at this point. In our dataset we have only one ground truth summary: the base case for ROUGE evaluation.
Ground truth annotation. Complex high level descriptions. Spoken language is complicated; we map it to a minimal set of features (next).
Upper row: training (camera motion and shakes are a real problem for maintaining the bounding boxes). Lower row: trained models.
Role of alpha: alpha provides a topic for every observation. Alpha is a K-vector. Here each component of alpha is different, which helps assign different proportions of observations to different topics (e.g. one topic can focus solely on “stop-words”, another on “commonly occurring words”, and others on the different topics).
Translation formula (marginalization over topics). - If there are two topics, i.e. K=2, then (e.g. for the 2nd term) 0.5*0.5 + 0.5*0.5 = 0.5 < 0*0.0001 + 0.9*0.9 = 0.81. - Values of the inferred \phi’s are very important for the real valued data: separated Gaussians are better, but this does not always happen. - This raises an issue where the real valued data may need to be preprocessed to increase the chances of separation.
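The two-topic arithmetic in this note can be checked directly. The numbers are the ones quoted in the note; the helper name `marginal` is hypothetical, and each call computes a single inner sum over topics for one observation.

```python
def marginal(responsibilities, word_prob_per_topic):
    """Sum over topics of responsibility * p(word | topic)."""
    return sum(r * p for r, p in zip(responsibilities, word_prob_per_topic))


# Diffuse case: both topics share the observation and the word equally.
diffuse = marginal([0.5, 0.5], [0.5, 0.5])
# Peaked case: one topic is responsible and also explains the word well.
peaked = marginal([0.0, 0.9], [0.0001, 0.9])
print(diffuse, peaked, diffuse < peaked)
```

The peaked configuration scores the word higher, which is why well separated responsibilities give a sharper permutation of the vocabulary.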
Object Bank (Computed on keyframes), HOG3D and Color histograms – features through the lens of computer vision
Important references: http://cvcl.mit.edu/papers/oliva04.pdf | http://vision.stanford.edu/VSS2007-NaturalScene/Oliva_VSS07.pdf - The principal components of the spectrogram of real-world scenes. The spectrogram is sampled at 4 × 4 spatial locations for a better visualization. Each subimage corresponds to the local energy spectrum at the corresponding spatial location. Global GIST patterns should be different for topics and sub-topics. Another relevant piece of information for image representation concerns the spatial relationships between the main structures in the image. Spatial distribution of spectral information can be described by means of the windowed Fourier transform (WFT).
Red arrow means “lack of the corresponding GIST property” and green means ok. Global GIST patterns are different for topics and sub-topics.
- The dataset that we use for the video summarization task is released as part of NIST's 2011 TRECVID Multimedia Event Detection (MED) evaluation set. The dataset consists of a collection of Internet multimedia content posted to various Internet video hosting sites. The training set is organized into 15 event categories, some of which are: 1) Attempting a board trick 2) Feeding an animal 3) Landing a fish 4) Wedding ceremony 5) Working on a woodworking project etc. We use the videos and their textual metadata in all the 15 events as training data. There are 2062 clips with summaries in the training set, with almost equal distribution amongst the events. The test set which we use is called the Transparent Development (Dev-T) collection. The Dev-T collection includes positive instances of the first 5 training events and near positive instances for the last 10 events: a total of 630 videos labeled with event category information (and associated human synopses which are to be compared against for summarization performance). Each summary is a short and very high level description of the entire video and ranges from 2 to 40 words but on average 10 words (with stopwords). We remove standard English stopwords and retain only the word morphologies (not required) from the synopses as our training vocabularies. The proportion of videos belonging to events 6 through 15 in the Dev-T set is much lower than the proportion for the other events, since those clips are considered to be “related” instances which cover only part of the event category specifications. The performances of our topic models are evaluated on those kinds of clips as well. The numbers of videos in events 6 through 15 in the Dev-T set are {4,9,5,7,8,3,3,3,10,8}, while there are around 120 videos per event for the first 5 events.
All other videos in the Dev-T set neither have any event category label nor are identified as positive, negative or related videos, and we do not consider these videos in our experiments.
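The stopword removal and vocabulary building described in this note can be sketched as below. The stopword list is a tiny illustrative subset rather than the standard English list the note refers to, no stemming is applied, and the function names are hypothetical. The example synopses are two of the human summaries shown earlier in the deck.

```python
# Tiny illustrative stopword list; a real run would use a standard English list.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "on", "in", "at", "to",
             "and", "with", "there"}


def synopsis_to_keywords(synopsis):
    """Lowercase, strip surrounding punctuation and drop stopwords."""
    tokens = (token.strip(".,;:!?\"'()").lower() for token in synopsis.split())
    return [t for t in tokens if t and t not in STOPWORDS]


def build_vocabulary(synopses):
    """Sorted list of the distinct keywords across all synopses."""
    return sorted({word for s in synopses for word in synopsis_to_keywords(s)})


synopses = [
    "There is a guy climbing on a rock-climbing wall.",
    "A man is doing artificial rock climbing.",
]
print(synopsis_to_keywords(synopses[0]))
print(build_vocabulary(synopses))
```

The resulting keyword lists play the role of the text side of each training document, and the vocabulary fixes the set of words that can be predicted for a test video.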
There are no individual summaries for shots within the clip, only one high level summary. Problems with shot-wise nearest neighbor matching precisely for this reason?
Why event specific vocabularies
Modeling correspondence of caption words to the main text content which can be annotated in various ways
“Dear Wikipedia readers: We are the small non-profit that runs the #5 website in the world. We have only 150 staff but serve 450 million users”. Finding the reason why it might be so? (Both the main and embedded content reflect coherent topics, e.g. if an irrelevant advertisement appears, the topic will drift and Wikipedia will lose its appeal.)
Corr-MMGLDA seems to be capturing more variance relative to MMGLDA. \alpha for Corr-MMGLDA is thus slightly higher than that for MMGLDA. Topic parameters over words are seeded through documents during initialization and hence are the same for both models here.
This is a tough event to match words with frames. The event is “Working on a sewing project”. Top row: frames coming from only one video. We do not put a constraint that we can select only 5 frames per video, although this can be easily done. The shown video’s actual synopsis is “One lady is doing sewing project indoors.” Bottom row: better variance. Note how it captures dad sewing kid’s penguin with a needle and thread. First row: “Woman demonstrating different stitches using a serger/sewing machine”. Second row: “dad sewing up stuffed penguin for kids”. Third row: “Woman makes a bordered hem skirt.” Last one: “A pair of hands do a sewing project using a sewing machine.” Other features might help: action, objects, GIST and color may not be enough.
This is again another tough event to match words with frames. The event is “Repairing an appliance”. Top row: frames coming from only one video. Bad example. The shown video’s actual synopsis is “How to repair the water level control mechanism on a Whirlpool washing machine.” Bottom row: better variance. Row1,Cols1-3: “a man is repairing a whirlpool washer”; Row1,Col4: “how to remove blockage from a washing machine pump”; Row2,Cols1-3: “Woman demonstrates replacing a door hinge on a dishwasher”; Row2,Col4: “A guy shows how to make repairs on a microwave”; Row3,Cols1-3: “How to fix a broken agitator on a Whirlpool washing machine”; Row3,Col4: “A guy working on a vintage box fan”. Other features might help: action, objects, GIST and color are not enough.
Usually changes from dataset to dataset but max around 40-45% for 100 word summaries. If we can achieve 10% of this for 10 word summaries, we are doing pretty well!
Caveat – The text multi-document summarization task is much more complex than this simpler task (w.r.t. summarization)
Caveat – The multi-document summarization task is much more complex than this simpler task (w.r.t. summarization)
Purely multinomial topic models showing lower ELBOs can perform quite well in BoW summarization. MMLDA assigns likelihoods based on success and failure of independent events, and failures contribute highly negative terms to the log likelihoods, but this does not indicate the model's summarization performance, where low probability terms are pruned out. Gaussian components can partially remove the independence through covariance modeling, but this can also allow different but related topics to model GIST features almost equally (strong overlap in the tails of the bell shaped Gaussian curves) and show poor permutation of predicted words due to the violation of the soft probabilistic constraint of correspondence.
There has been some work done for initialization of priors for a Gaussian Mixture Model (GMM) setting but no work has been done on the effects of such initializations for topic models involving Gaussians and Multinomials
Never had the chance to acknowledge them all in the paper