1. ARF @ MediaEval 2012: Multimodal Video Classification
ARF (Austria-Romania-France) team
Bogdan IONESCU(1,3)  Ionuț MIRONICĂ(1)  Klaus SEYERLEHNER(2)
bionescu@imag.pub.ro  imironica@imag.pub.ro  music@cp.jku.at
Peter KNEES(2)  Jan SCHLÜTER(4)  Markus SCHEDL(2)
peter.knees@jku.at  jan.schlueter@ofai.at  markus.schedl@jku.at
Horia CUCU(1)  Andi BUZO(1)  Patrick LAMBERT(3)
horia.cucu@upb.ro  andi.buzo@upb.ro  patrick.lambert@univ-savoie.fr
(1) University POLITEHNICA of Bucharest, (2) Johannes Kepler University Linz, (3) Université de Savoie, (4) Austrian Research Institute for Artificial Intelligence
*This work was partially supported under European Structural Funds EXCEL POSDRU/89/1.5/S/62557.
2. Presentation outline
• The approach
• Video content description
• Experimental results
• Conclusions and future work
MediaEval - Pisa, Italy, 4-5 October 2012
3. The approach
> challenge: find a way to assign (genre) tags to unknown videos;
> approach: machine-learning paradigm;
[diagram: a classifier is trained on a tagged video database (labeled data, with genre labels such as "autos" or "food") and then applied to tag the unlabeled videos of the test database]
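The train-then-tag paradigm in the diagram can be sketched with any off-the-shelf classifier; below is a minimal scikit-learn illustration on toy data (the random feature vectors, genre labels, and LinearSVC choice are stand-ins, not the actual MediaEval setup):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Labeled development data: one feature vector per video, one genre tag each.
X_train = rng.normal(size=(100, 16))
y_train = rng.integers(0, 3, size=100)   # 3 toy genre classes

clf = LinearSVC().fit(X_train, y_train)  # train on the tagged database

X_unlabeled = rng.normal(size=(10, 16))  # unlabeled test videos
tags = clf.predict(X_unlabeled)          # assign a genre tag to each
print(tags.shape)
```

Everything downstream (descriptors, classifier choice, fusion) is about making this generic pipeline work well on real video content.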
4. The approach: classification
> the entire process relies on the concept of "similarity" computed between content annotations (numeric features);
> this year's focus is on:
objective 1: go (truly) multimodal: visual + audio + text;
objective 2: test a broad range of classifiers and descriptor combinations.
5. Video content description - audio
block-level audio features (also capture local temporal information; blocks of consecutive frames, e.g. with 50% overlap, summarized over the track by average, median, variance, ...):
• Spectral Pattern ~ soundtrack's timbre;
• Delta Spectral Pattern ~ strength of onsets;
• Variance Delta Spectral Pattern ~ variation of the onset strength;
• Logarithmic Fluctuation Pattern ~ rhythmic aspects;
• Correlation Pattern ~ loudness changes;
• Spectral Contrast Pattern ~ "tone-ness";
• Local Single Gaussian model ~ timbral;
• George Tzanetakis model ~ timbral.
[Klaus Seyerlehner et al., MIREX'11, USA]
6. Video content description - audio
standard audio features (audio frame-based):
• Zero-Crossing Rate,
• Linear Predictive Coefficients,
• Line Spectral Pairs,
• Mel-Frequency Cepstral Coefficients,
• spectral centroid, flux, rolloff, and kurtosis.
global feature = mean & variance of each per-frame feature f1, f2, …, fn over a certain window.
[B. Mathieu et al., Yaafe toolbox, ISMIR'10, Netherlands]
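As an illustration of this frame-based scheme, here is a minimal NumPy sketch that computes two of the listed features (zero-crossing rate and spectral centroid) per frame and pools them with mean and variance; the actual extraction uses the Yaafe toolbox, and the frame/hop sizes below are assumptions:

```python
import numpy as np

def frame_features(signal, frame_len=1024, hop=512, sr=22050):
    """Per-frame zero-crossing rate and spectral centroid."""
    window = np.hanning(frame_len)  # reduce spectral leakage
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # zero-crossing rate: sign changes per sample
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0
        # spectral centroid: magnitude-weighted mean frequency
        spec = np.abs(np.fft.rfft(frame * window))
        freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
        centroid = np.sum(freqs * spec) / (np.sum(spec) + 1e-12)
        feats.append((zcr, centroid))
    return np.array(feats)  # shape: (n_frames, 2)

# global feature = mean & variance of each per-frame feature
signal = np.sin(2 * np.pi * 440 * np.arange(22050) / 22050)  # 1 s, 440 Hz tone
f = frame_features(signal)
global_feature = np.concatenate([f.mean(axis=0), f.var(axis=0)])
print(global_feature.shape)  # (4,)
```

For the pure tone, the per-frame centroids cluster around 440 Hz, and the pooled vector is what the classifier actually sees.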
7. Video content description - visual
MPEG-7 & color/texture descriptors (visual frame-based):
• Local Binary Pattern,
• Autocorrelogram,
• Color Coherence Vector,
• Color Layout Pattern,
• Edge Histogram,
• Classic color histogram,
• Scalable Color Descriptor,
• Color moments.
global feature = mean & dispersion & skewness & kurtosis & median & root mean square of each per-frame feature f1, f2, …, fn over time.
[OpenCV toolbox, http://opencv.willowgarage.com]
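The temporal pooling step can be sketched as follows; the six statistics match the list above, while the descriptor data and dimensions are toy values:

```python
import numpy as np

def pool_frames(frame_descs):
    """Collapse (n_frames, dim) per-frame descriptors into one global vector."""
    mu = frame_descs.mean(axis=0)
    sd = frame_descs.std(axis=0) + 1e-12
    z = (frame_descs - mu) / sd  # standardized values for higher moments
    stats = [
        mu,                                        # mean
        sd,                                        # dispersion
        (z ** 3).mean(axis=0),                     # skewness
        (z ** 4).mean(axis=0) - 3.0,               # (excess) kurtosis
        np.median(frame_descs, axis=0),            # median
        np.sqrt((frame_descs ** 2).mean(axis=0)),  # root mean square
    ]
    return np.concatenate(stats)                   # shape: (6 * dim,)

rng = np.random.default_rng(1)
descs = rng.random((120, 16))  # e.g. 120 frames, 16-bin histograms
g = pool_frames(descs)
print(g.shape)  # (96,)
```

Whatever the per-frame descriptor dimension, the video is thus reduced to a single fixed-length vector, six times that dimension.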
8. Video content description - visual
feature descriptors (visual frame-based):
• Histogram of oriented Gradients (HoG) ~ counts occurrences of gradient orientations in localized portions of an image (20° per bin),
• Harris corner detector ~ feature (interest) points,
• Speeded Up Robust Features (SURF).
[image source: http://www.ifp.illinois.edu/~yuhuang]
[OpenCV toolbox, http://opencv.willowgarage.com]
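A toy HoG-style sketch: unsigned gradient orientations are quantized into 20° bins with magnitude-weighted votes. Real HoG (e.g. OpenCV's HOGDescriptor, which the slides refer to) aggregates per cell with block normalization; this whole-image variant is a simplification:

```python
import numpy as np

def hog_histogram(img, bin_width_deg=20):
    gy, gx = np.gradient(img.astype(float))       # image gradients
    mag = np.hypot(gx, gy)                        # gradient magnitude
    ang = np.degrees(np.arctan2(gy, gx)) % 180.0  # unsigned orientation
    n_bins = 180 // bin_width_deg                 # 9 bins of 20 degrees
    idx = np.minimum((ang // bin_width_deg).astype(int), n_bins - 1)
    hist = np.zeros(n_bins)
    np.add.at(hist, idx.ravel(), mag.ravel())     # magnitude-weighted votes
    return hist / (hist.sum() + 1e-12)            # normalized histogram

img = np.tile(np.arange(32), (32, 1))             # horizontal intensity ramp
h = hog_histogram(img)
print(np.argmax(h))  # 0: the gradient points along the x axis (0 degrees)
```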
9. Video content description - text
TF-IDF descriptors (Term Frequency-Inverse Document Frequency)
> text sources: ASR transcripts and metadata;
1. remove XML markup,
2. remove terms below the 5% percentile of the frequency distribution,
3. select the term corpus: retain for each genre class the m terms (e.g. m = 150 for ASR, m = 20 for metadata) with the highest χ² values that occur more frequently than in the complement classes,
4. represent each document by its TF-IDF values.
10. Experimental results: devset (5,127 seq.)
> classifiers from Weka (Bayes, lazy, functional, tree-based, etc.);
> cross-validation (train 50% – test 50%);
avg. F-score (over all genres):
- visual descriptors achieve around 30% ± 10%,
- using more visual descriptors is not more accurate than using a few,
- best: LBP + CCV + color histogram (F-score = 41.2%).
[Weka toolbox, http://www.cs.waikato.ac.nz/ml/weka/]
11. Experimental results: devset (5,127 seq.)
> cross-validation (train 50% – test 50%);
avg. F-score (over all genres):
- audio is still better than visual (improvement of ~6%),
- the proposed block-based features beat the standard ones (by ~10%).
[Weka toolbox, http://www.cs.waikato.ac.nz/ml/weka/]
12. Experimental results: devset (5,127 seq.)
> cross-validation (train 50% – test 50%);
avg. F-score (over all genres):
- ASR from LIMSI is more representative than ASR from LIUM (by ~3%),
- best performance: ASR LIMSI + metadata (F-score = 68%).
[Weka toolbox, http://www.cs.waikato.ac.nz/ml/weka/]
13. Experimental results: devset (5,127 seq.)
> cross-validation (train 50% – test 50%);
avg. F-score (over all genres):
- audio-visual comes close to text (ASR) among the automatic descriptors,
- increasing the number of modalities increases the performance.
[Weka toolbox, http://www.cs.waikato.ac.nz/ml/weka/]
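The multimodality observation can be illustrated by the simplest fusion scheme, early fusion via feature concatenation; the per-modality features below are synthetic and weakly informative by construction, not the actual descriptors:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

rng = np.random.default_rng(2)
n = 200
y = rng.integers(0, 2, size=n)  # toy binary genre labels
# Each toy "modality" carries a weak signal about the label.
visual = rng.normal(size=(n, 8)) + y[:, None] * 0.5
audio = rng.normal(size=(n, 8)) + y[:, None] * 0.5
text = rng.normal(size=(n, 8)) + y[:, None] * 0.5

fused = np.hstack([visual, audio, text])  # one concatenated vector per video
Xtr, Xte, ytr, yte = train_test_split(fused, y, test_size=0.5, random_state=0)
f1 = f1_score(yte, LinearSVC().fit(Xtr, ytr).predict(Xte), average="macro")
print(round(f1, 2))
```

Three weak modalities concatenated classify markedly better than any one alone, mirroring the trend in the devset results.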
14. Experimental results: official runs (9,550 seq.)
> train on devset, test on testset (linear SVM);
Run 1: LBP + CCV + hist + audio block-based,
Run 2: TF-IDF on ASR LIMSI,
Run 3: audio block-based + LBP + CCV + hist + TF-IDF on ASR LIMSI,
Run 4: audio block-based,
Run 5: TF-IDF on metadata + ASR LIMSI.
[chart: MAP per run, with MediaEval 2011 reference levels of 12% and 10.3%]
15. Experimental results: official runs (9,550 seq.)
> per-genre MAP for Run 5 (TF-IDF on ASR + metadata) vs. Run 1 (visual + audio);
[chart: selected per-genre values — autos 52%, gaming 71%, religion 71%, environment 50%]
16. Conclusions and future work
> classification adapts to the corpus – changing the corpus will
change the performance;
> audio-visual descriptors are inherently limited;
> how far can we go with ad-hoc classification without human
intervention?
> future work:
- a more elaborate late-fusion scheme?
- pursue tests on the entire data set;
- perhaps a more elaborate Bag-of-Visual-Words approach.
Acknowledgement: we would like to thank Prof. Fausto Giunchiglia and Prof. Nicu Sebe from the University of Trento for their support.
17. Thank you!
Any questions?