The media industry is constantly making use of affective signals whether in text, sound, image or moving images with the aim of attracting our attention and conveying a message or a story. In this presentation we will look at the analysis of audio-visual content for understanding its affective properties. We will start with proposing a semi supervised approach for identifying the genre of a media (action, drama, horror, etc..). We will then show how the genre of video segments can be used to determine its interestingness. There are many usage scenarios where such information about the content has value for the media editors and archivists. Beyond genre and interestingness, emotion recognition in videos in another important cue when understanding the content of audio-visual documents. For this a deep model combining three key component for recognizing human expression of emotions has been devised. It includes static as well as dynamic facial features and audio information. The approach was shown to perform well on the Emotion Recognition in the Wild 2017 challenge. Applications to past and ongoing research and industry projects will be used throughout to illustrate the presentation.
2. Motivation and Context
• Media Industry and Multimedia Retrieval
– Indexing and retrieval
– Annotation
– Summarization and Trailer creation
– Enrichments and hyperlinking
• NexGenTV (ANRT National):
– Multimodal analysis for enriching broadcast content
via second screen applications
• ANTRACT (ANR National):
– Mine video archive to provide information for historians
• MeMAD (H2020 EU):
– Provide new methods to translate Image and Sounds into
Words (for indexing, retrieving, repurposing and accessibility)
6/27/2018 ICMR18- ACMMM TPC Workshop - B. HUET - p 2
3. Predicting Media Interestingness (PMI)
– Automatically analyze media data
– Identify the most attractive content
• Data Driven /
Content based approaches
– gap between low-level features
and high-level human perception
• Our proposal
– Address PMI in association with
Media Genre
6/27/2018 ICMR18- ACMMM TPC Workshop - B. HUET - p 3
http://www.dailyherald.com/article/20110627/entlife/706279989/
4. Why Genre Inference for PMI
• Motivation
– Interestingness is highly correlated with data emotional content
– Affective representation of data content
• Hypothesis
– Emotional impact of movie genre can be a factor for interestingness of a video fragment
• Constraints
– Subjectivity of the task – Data collection issues
– Limited Dataset
• Method
– Mid-level representation based on media genre prediction for PMI
• Represent each video fragment/image as a distribution of genres
– Transfer Learning
• Using genre probability distribution to infer Media Fragment Interestingness
6/27/2018 ICMR18- ACMMM TPC Workshop - B. HUET - p 4
6. Media Genre Prediction
• Visual Branch
– Deep CNN for features extraction
– LSTM for modelling temporal cue
• Audio Branch
– Deep features extraction : Soundnet
– SVM classifier
ResNet
Architecture
6/27/2018 ICMR18- ACMMM TPC Workshop - B. HUET - p 6
7. Media Genre Dataset
• Movie trailers dataset
– K S Sivaraman and G. Somappa. MovieScope
• 525 YouTube Videos over 4 Genres (available on IMDB)
– Extended with a 5th Genre
• Sci-Fi
Trailer
Genre
Training Test Total
Action 69 44 114
Drama 95 39 134
Horror 99 59 158
Romance 80 39 119
Sci-fi 72 35 107
Total 415 216 632
6/27/2018 ICMR18- ACMMM TPC Workshop - B. HUET - p 7
8. Visual Audio Audio-Visual
Action
Drama
Horror
Romance
Sci-fi
Visual Audio Audio-Visual
Action 2.5% 33,34% 17.92%
Drama 0% 17,78% 8.89%
Horror 0.37% 2.61% 1.49%
Romance 0% 3.87% 1.93%
Sci-fi 97.12% 42,40% 69,76%
Media Genre Prediction Example
6/27/2018 ICMR18- ACMMM TPC Workshop - B. HUET - p 8
Features/Metrics Precision
Sivaraman and Somappa. (4 genres) 0,85
One Frame-VGG + Soundnet 0,85
ResNet-LSTM + Soundnet 0,87
9. Interestingness Classification
• Features vectors
– Probability vector for the genre distribution for video fragments
• Classifier
– Binary SVM,
– Including assessors confidence scores
• Modality Analysis
– Audio genre vector
– Visual genre vector
– Audio-Visual genre vector
• Concatenated visual- and audio-based genre vectors probabilities
6/27/2018 ICMR18- ACMMM TPC Workshop - B. HUET - p 9
10. Media Interestingness Dataset
• Creative Commons licensed “Hollywood” Video
– 103 movie trailers and 4 continuous extracts
– Shot segmentation
– 7396 video segments for training and 2435 video segments
for testing
– Low level and mid level features
– Annotation (GT) performed by human assessors
• Interesting (1) or Not Interesting (2)
6/27/2018 ICMR18- ACMMM TPC Workshop - B. HUET 10
11. Example 1: The longest ride
GENRE Action Drama Horror Romance Sci-fi
COVERAGE 9.02% 29.86% 6.25% 43.75% 11.11%
6/27/2018 ICMR18- ACMMM TPC Workshop - B. HUET - p 11
12. Example 1: The longest ride
GENRE Action Drama Horror Romance Sci-fi
COVERAGE 9.02% 29.86% 6.25% 43.75% 11.11%
6/27/2018 ICMR18- ACMMM TPC Workshop - B. HUET - p 12
13. Example 1: The longest ride
6/27/2018 ICMR18- ACMMM TPC Workshop - B. HUET - p 13
Predicted Interestingness= 1
Romance = 97,45%
14. Example 1: The longest ride
6/27/2018 ICMR18- ACMMM TPC Workshop - B. HUET - p 14
Predicted Interestingness= 0
Romance = 48,49%
15. Example 2: Tears of Steel
6/27/2018 ICMR18- ACMMM TPC Workshop - B. HUET - p 15
GENRE Action Drama Horror Romance Sci-fi
COVERAGE 18.37% 16.32% 6.12% 6.12% 53.06%
16. Example 2: Tears of Steel
6/27/2018 ICMR18- ACMMM TPC Workshop - B. HUET - p 16
GENRE Action Drama Horror Romance Sci-fi
COVERAGE 18.37% 16.32% 6.12% 6.12% 53.06%
17. Example 2: Tears of Steel
6/27/2018 ICMR18- ACMMM TPC Workshop - B. HUET - p 17
Predicted Interestingness= 1
Sci-fi =95,38%
18. Example 2: Tears of Steel
6/27/2018 ICMR18- ACMMM TPC Workshop - B. HUET - p 18
Predicted Interestingness= 0
Sci-fi= 53,67%
20. Conclusions and Future Work
• We proposed to train a Genre Recognition System as a mid-level representation
for Predicting Media Interestingness
• Deep Audio and Visual Features for Genre Recognition
• LSTM used to predict Genre over Video Shot duration
• Transfer Learning for Predicting Media Interestingness
• Best Results:
– 0.21 MAP (on test set) [Previous State of the Art 0.20 MAP]
• Audio brings limited additional information for PMI
• End-to-End joint learning (fine-tuning) of audio-visual features
• Extent mid-level representation with other features (Emotion, Valence/Arousal, etc.)
6/27/2018 ICMR18- ACMMM TPC Workshop - B. HUET - p 20
21. Recent Related Publications
• Ben-Ahmed, O., J. Wacker, A. Gaballo, B. Huet, EURECOM @MediaEval 2017: Media genre inference for
predicting media interestingness, MEDIAEVAL 2017, MediaEval Benchmarking Initiative for Multimedia
Evaluation, 13-15 September 2017, Dublin, Ireland
• Pini S., O. Ben-Ahmed, M. Cornia, L. Baraldi, R. Cucchiara, Rita; B. Huet, Modeling multimodal cues in a
deep learning-based framework for emotion recognition in the wild, ICMI 2017, 19th ACM International
Conference on Multimodal Interaction, November 13-17th, 2017, Glasgow, United Kingdom
• Smith, J. R., D. Joshi, B. Huet, W. Hsu, J. Cota, Harnessing A.I. for augmenting creativity: Application to
movie trailer creation, ACMMM 2017, 25th ACM Multimedia Conference, October 23-27, 2017, Mountain
View, CA, USA
• Tiwari S. N., N. Q. K. Duong, F. Lefebvre, C.-H. Demarty, B. Huet, L. Chevallier, Deep features for
multimodal emotion classification, on HAL
• Paleari, M., R. Chellali, B. Huet, Bimodal emotion recognition, ICSR 2010, International Conference on
Social Robotics, November 23-24, 2010, Singapore / Also published as LNCS Volume 6414/2010
• Mérialdo B. and B. Huet, Automatic video summarization, Book chapter in "Interactive Video, Algorithms
and Technologies" by Hammoud, Riad (Ed.), 2006, XVI, 250 p, ISBN: 3-540-33214-6
6/27/2018 ICMR18- ACMMM TPC Workshop - B. HUET 21
The media industry is constantly making use of affective signals whether in text, sound, image or moving images with the aim of attracting our attention and conveying a message or a story. In this presentation we will look at the analysis of audio-visual content for understanding its affective properties. We will start with proposing a semi supervised approach for identifying the genre of a media (action, drama, horror, etc..). We will then show how the genre of video segments can be used to determine its interestingness. There are many usage scenarios where such information about the content has value for the media editors and archivists. Beyond genre and interestingness, emotion recognition in videos in another important cue when understanding the content of audio-visual documents. For this a deep model combining three key component for recognizing human expression of emotions has been devised. It includes static as well as dynamic facial features and audio information. The approach was shown to perform well on the Emotion Recognition in the Wild 2017 challenge. Applications to past and ongoing research and industry projects will be used throughout to illustrate the presentation.
Soleymani 2015 Visual Interestingness
Van Gool 2013 Image Interestingness
Predicting Media Interestingness (PMI) aims to automatically analyze media data and identify the most attractive content
Most of the previous proposed Approaches for PMI are directly based on the multimedia content However those approaches suffer from the gap between low-level features and high-level human perception.
Supported by recent research, we address PMI in association with media emotional content through Genre Inference
Recent research proved that perceived interestingness is highly correlated with data emotional content
Soleymani 2015 Visual Interestingness
Van Gool 2013 Image Interestingness
Hence, an affective representation of video content will be useful for identifying the most important parts in a movie.
In this work, we hypothesize that the emotional impact of the movie genre can be a factor for the perceived interestingness of a video for a given viewer.
Therefore, we adopt a mid-level representation based on media genre recognition.
We propose to represent each sample as a distribution over genres (action, drama, horror, romance, sci-fi). For instance, a high confidence for the horror label inside the shot genre distribution could be perceived as more emotional (scary in this case). Therefore, this shot might be more characteristic and therefore more interesting than a neutral genre that could appear in any shot.
About Soundnet :
The Soundnet model is build by learning acoustic representation from large collections of unlabeled videos. The training is based on a student-teacher procedure which transfers discriminative visual knowledge from well established visual models (ImageNet, Places) into sound domain using unlabeled video as a bridge.
Audio features extracted from the soundnet model achieved state-of-the-art performance when used as input to an SVM on two audio scene classification benchmarks.
A shot from the devset of MediaEval set and its genre prediction for visual, audio and audio-visual features
The video is here : https://drive.google.com/open?id=0B-Ebwt-Z3nNMNEFsbnNwRzdrazg
The name of the film is After Earth Film
Video 59 dans devset
Déja la plupart des pics sont dans les genres romance et Drama car c un fil de drama et romance de base
Video 59 dans devset
Déja la plupart des pics sont dans les genres romance et Drama car c un fil de drama et romance de base
Il ya 8 interssting shots dans le GT (['3-41', '42-69', '874-920', '1736-1754', '1772-1795', '2289-2329', '2746-2769', '3027-3047']
Je vais mettre deux shots intéressants ( je vous laisse le choix pour choisir un pour que je le mentionne sur le graphe avec l’etoile rouge)
Index des vidéos rouges : 13 et 35
Et un shot non interss (index 85 (2477-2495))
Il ya 8 interesting shots dans le GT (['3-41', '42-69', '874-920', '1736-1754', '1772-1795', '2289-2329', '2746-2769', '3027-3047']
les shots intérrassant sont : 9 ->(0.9744806289672852),
15 ->(0.9995414018630981) ,
48 ->(0.9999010562896729),
66->(0.9726647734642029),
104->(0.9965728521347046),
110->(0.6256349086761475)
111->( 0.927717924118042),
141-> ((0.7384583353996277)
111, 104, 66, 48, 110, 15, 141, 9]
Video 1 avec etoile rouge (3027-3047)---> index sur le graphe = 9 avec romance = 0.9744806289672852
Video 2 avec etoile rouge (2746-2769)---> index sur la graphe = 141 (0.7384583353996277)
Ici je propose qu’on laisse que la video 1 avec etoile rouge ?
Video avec etoile verte (2477-2495)--> index sur graphe =85 avec romance = 0.48481258749961853
Je vais mettre deux shots intéressants ( je vous laisse le choix pour choisir un pour que je le mentionne sur le graphe avec l’etoile rouge)
Et un shot non interss (index 85 (2477-2495))
Video 59 dans devset
Déja la plupart des pics sont dans les genres romance et Drama car c un fil de drama et romance de base
Video 59 dans devset
Déja la plupart des pics sont dans les genres romance et Drama car c un fil de drama et romance de base
Ok ! les shots intérrassant sont : 7(86.85%),10(99.98%),18 (95.22%) ,27(99.99%),35(95.38%), et 48(91.12%)
Ok ! les shots intérrassant sont : 7(86.85%),10(99.98%),18 (95.22%) ,27(99.99%),35(95.38%), et 48(91.12%)