MediaEval 2017 Predicting Media Interestingness Task
Presenter: Claire-Hélène Demarty, Technicolor, France
Paper: http://ceur-ws.org/Vol-1984/Mediaeval_2017_paper_4.pdf
Video: https://youtu.be/dWhSJuR5DuM
Authors: Claire-Hélène Demarty, Mats Sjöberg, Bogdan Ionescu, Thanh-Toan Do, Michael Gygli, Ngoc Q.K. Duong
Abstract: In this paper, the Predicting Media Interestingness task, which is running for its second year as part of the MediaEval 2017 Benchmarking Initiative for Multimedia Evaluation, is presented. For the task, participants are expected to create systems that automatically select the images and video segments that a common viewer would consider the most interesting. All task characteristics are described, namely the task use case and challenges, the released data set and ground truth, the required participant runs, and the evaluation metrics.
1. Predicting Media Interestingness Task Overview
Claire-Hélène Demarty – Technicolor
Mats Sjöberg – University of Helsinki
Bogdan Ionescu – University Polytehnica of Bucharest
Thanh-Toan Do – University of Adelaide
Michael Gygli – ETH & Gifs.com
Ngoc Q.K. Duong – Technicolor
MediaEval 2017 Workshop, Dublin, 13-16th September 2017
In its second year
2. Task definition – Derives from a use case at Technicolor
Helping professionals to illustrate a Video on Demand (VOD) web site by selecting some interesting frames and/or video excerpts for the posted movies.
Definition (emphasized in 2017): the frames and excerpts should be suitable in terms of helping a user decide whether he/she is interested in watching the underlying movie.
5. Task definition – Two subtasks: Image and Video
Image subtask: given a set of key-frames extracted from a movie, …
Video subtask: given a set of video segments extracted from a movie, …
… automatically identify those images/segments that viewers report to be interesting.
A binary classification task on a per-movie basis, but confidence values are also required.
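Since each run must pair a binary decision with a confidence value for every item, here is a minimal sketch of how such an output could be written, ranked per movie. The tab-separated layout, field order, and function name are hypothetical illustrations, not the task's official run format:

```python
def write_run_file(path, predictions):
    """predictions: dict mapping a movie id to a list of
    (segment_id, is_interesting, confidence) triples.

    NOTE: the field layout below is a hypothetical example,
    not the official MediaEval run format.
    """
    with open(path, "w") as f:
        for movie, segments in predictions.items():
            # Rank within each movie by decreasing confidence:
            # the task is evaluated on a per-movie basis.
            for seg_id, label, conf in sorted(segments, key=lambda s: -s[2]):
                f.write(f"{movie}\t{seg_id}\t{int(label)}\t{conf:.4f}\n")
```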
7. Dataset & additional features
From Hollywood-like movie trailers or full-length movie extracts (modified in 2017):
Manual segmentation into shots/longer segments with a semantic meaning
Extraction of the middle key-frame of each shot/segment

              Dev set (78 trailers)    Test set (26 trailers)   Test set (4 movie extracts, ca. 15 min)
              Total   % interesting    Total   % interesting    Total   % interesting
Shot #        7,396   9.0              2,192   11.3             243     11.5
Key-frame #   7,396   11.6             2,192   11.9             243     22.6

Precomputed content descriptors:
Low-level: dense SIFT, HoG, LBP, GIST, HSV color histograms, MFCC, and the fc7 and prob layers from AlexNet
Mid-level: face detection and tracking-by-detection
Segment-based (added in 2017): C3D features from the fc6 layer, averaged over each segment
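As a rough sketch of how the segment-based descriptor is obtained (assuming the per-clip fc6 activations are already available; the array name and shapes are illustrative):

```python
import numpy as np

def segment_c3d_descriptor(fc6_clips):
    """Average C3D fc6 activations over one segment.

    fc6_clips: array of shape (n_clips, 4096), one row per
    16-frame C3D clip covering the segment. The released
    segment-based descriptor is the mean over these rows.
    """
    fc6_clips = np.asarray(fc6_clips, dtype=np.float32)
    return fc6_clips.mean(axis=0)  # shape: (4096,)
```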
9. Required runs
Up to 5 runs per subtask!
Image subtask: visual information; external data allowed (modified in 2017)
Video subtask: BOTH audio and visual information; external data allowed (modified in 2017)
10. Evaluation metrics
2017 official measure (modified in 2017):
➢ Mean Average Precision at 10 (MAP@10), computed over all movies
Additional metrics are also computed:
the 2016 official measure, Mean Average Precision (MAP)
false alarm rate, miss detection rate, precision, recall, F-measure, etc.
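A minimal sketch of one common definition of MAP@10, assuming each movie's ground-truth labels have been reordered by decreasing system confidence; the official scoring tool may differ in implementation details:

```python
import numpy as np

def average_precision_at_k(ranked_labels, k=10):
    """AP@k for one movie.

    ranked_labels: binary ground-truth labels of the movie's
    images/segments, sorted by decreasing system confidence.
    """
    top = np.asarray(ranked_labels[:k], dtype=float)
    n_relevant = int(np.sum(ranked_labels))
    if n_relevant == 0:
        return 0.0
    precisions = np.cumsum(top) / (np.arange(len(top)) + 1)
    # Average the precision values at the ranks of the hits,
    # normalised by the number of retrievable relevant items.
    return float(np.sum(precisions * top) / min(n_relevant, k))

def map_at_10(labels_per_movie):
    """MAP@10: mean of AP@10 over all movies."""
    return float(np.mean([average_precision_at_k(l, 10)
                          for l in labels_per_movie]))
```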
11. Task participation
Registrations: 32 teams from 18 countries
Submissions: 10 teams, of which 7 ‘experienced’ teams
[Bar chart: task participation, 2016 vs 2017 – registrations, returned agreements, submitting teams, experienced teams, workshop attendance]
15. Official results – Video subtask – best runs

Run                                      MAP@10   MAP      Official ranking
me17in_Eurecom_video_run4.txt            0.0827   0.2094   Eurecom
me17in_LAPI_video_run4.txt               0.0732   0.2028   LAPI*
me17in_technicolor_video_run4.txt        0.0641   0.1878   Technicolor*
me17in_DAIICT_video_run4.txt             0.0640   0.1885   DAIICT
me17in_RUC_video_run2.txt                0.0637   0.1897   RUC
me17in_gibis_video_run5.txt              0.0628   0.1830   GIBIS
2016 BEST RUN                            –        0.1815
Baseline                                 0.0564   0.1716
me17in_HKBU_video_1.txt                  0.0556   0.1813   HKBU
me17in_IITB_video_run1-required.txt      0.0525   0.1795   IITB
me17in_TCNJ-CS_video_run1-required.txt   0.0524   0.1774   TCNJ-CS
me17in_DUT-MMSR_video_run5histface.txt   0.0516   0.1791   DUT-MMSR

* organizers
16. What we have learned about the TASK itself
Reconfirmed that image interestingness is NOT video interestingness
Some significant improvements, especially for the image subtask
Dataset quality improved:
Increased number of iterations/annotations per sample
Increased dataset size
Longer movie extracts
➢ Image subtask: all teams did better. Best MAP@10 = 0.2105, best MAP = 0.4343
➢ Video subtask: 1 team clearly improved; 5 teams improved, depending on their runs. Best MAP@10 = 0.1678, best MAP = 0.2637
18. What we have learned about the participants’ systems
This year’s trends:
DNN as the (last) classification step is not the majority choice
Dataset size….
Multimodal equals audio + video ONLY (text was used only once)
(Mostly) no temporal approaches
(Mostly) no use of external data
Late fusion, dimension reduction (see the fusion sketch below)
Adding semantics/affect to the approaches
Genre recognition as a pre-step
Aesthetics-related features
Movie context (contextual features, textual descriptions)
Insights:
What works for the images does not work for the videos
Monomodal systems (no audio) did as well as multimodal systems
Adding semantics/affect/context to the approaches is promising!
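As a generic illustration of the late-fusion step many systems used (not any particular team's method; the modality names and weights are hypothetical):

```python
import numpy as np

def late_fusion(scores_by_modality, weights=None):
    """Weighted average of per-segment scores from several
    modality-specific classifiers (a common late-fusion scheme).

    scores_by_modality: dict, e.g. {"visual": [...], "audio": [...]},
    each value holding one score per segment of a movie.
    """
    names = sorted(scores_by_modality)
    scores = np.stack([np.asarray(scores_by_modality[m], dtype=float)
                       for m in names])
    if weights is None:
        w = np.full(len(names), 1.0 / len(names))  # uniform weights
    else:
        w = np.asarray([weights[m] for m in names], dtype=float)
        w = w / w.sum()  # normalise to sum to 1
    return scores.T @ w  # one fused score per segment

# Example: fuse visual and audio scores, weighting visual higher.
fused = late_fusion({"visual": [0.9, 0.2, 0.4],
                     "audio":  [0.6, 0.3, 0.1]},
                    weights={"visual": 0.7, "audio": 0.3})
```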