In this paper, we present the work done at UMons regarding the MediaEval 2015 Affective Impact of Movies Task (including Violent Scenes Detection). This task can be divided into two subtasks: on the one hand, violent scene detection, i.e. automatically finding violent scenes in a set of videos; on the other hand, evaluating the affective impact of the videos through an estimation of valence and arousal. In order to offer a solution for both the detection and classification subtasks, we investigate different visual and auditory feature extraction methods. An i-vector approach is applied to the audio, and optical flow maps processed through a deep convolutional neural network are tested for extracting features from the video. Classifiers based on probabilistic linear discriminant analysis and fully connected feed-forward neural networks are then used.
MediaEval 2015 - UMons at Affective Impact of Movies Task - Poster
Université de Mons
UMons at Affective Impact of Movies Task
Omar Seddati, Emre Kulah*, Gueorgui Pironkov, Stéphane Dupont, Saïd Mahmoudi, Thierry Dutoit
University of Mons, Mons, Belgium / *Middle East Technical University, Ankara, Turkey
In this work, we propose a solution for both violence detection and affective impact evaluation. We investigate different visual and auditory feature extraction methods. An i-vector approach is applied to the audio, and optical flow maps processed through a deep convolutional neural network are tested for the video. Classifiers based on probabilistic linear discriminant analysis and fully connected feed-forward neural networks are then used.
• Our system is based on:
  – i-vector: a low-dimensional feature vector extracted from high-dimensional data without losing too much of the relevant acoustic information
  – ConvNets: a state-of-the-art technique in the field of object recognition within images
• Instead of applying ConvNets to 2D images (frames), we extract dense optical flow maps that represent the displacement of each pixel between two successive frames, and use a sequence of those maps as input for our ConvNets. In this way, local temporal information is projected onto a space similar to the pixel space, and ConvNets can be effectively used for dynamic information (see the sketch below).
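As an illustration, the following Python sketch shows one way such a stack of optical flow maps could be built, assuming OpenCV's Farnebäck estimator and an illustrative window of 10 frame pairs; the exact flow algorithm, resolution, and stack length used in our system may differ.

import cv2
import numpy as np

def optical_flow_stack(video_path, num_maps=10, size=(224, 224)):
    """Return an array of shape (2 * num_maps, H, W): horizontal and vertical
    displacement maps for num_maps consecutive frame pairs."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    if not ok:
        raise IOError("Cannot read video: %s" % video_path)
    prev = cv2.cvtColor(cv2.resize(prev, size), cv2.COLOR_BGR2GRAY)

    maps = []
    while len(maps) < 2 * num_maps:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(cv2.resize(frame, size), cv2.COLOR_BGR2GRAY)
        # Farneback dense optical flow: one (dx, dy) displacement per pixel.
        flow = cv2.calcOpticalFlowFarneback(prev, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        maps.append(flow[..., 0])  # horizontal displacement map
        maps.append(flow[..., 1])  # vertical displacement map
        prev = gray
    cap.release()
    # The stacked maps play the role of input channels for the ConvNet.
    return np.stack(maps)

The resulting stack can be fed to a ConvNet exactly like a multi-channel image, which is how the local temporal information ends up in a pixel-like space.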
Violence detection results:
Run                        Violence (%)
i-vector – pLDA            9.56
OFM – ConvNets             9.67
OFM – ConvNets – HMDB-51   6.56
Valence and arousal classification results:
Run                        Valence (%)   Arousal (%)
i-vector – pLDA            37.03         31.71
OFM – ConvNets             35.28         44.39
OFM – ConvNets – HMDB-51   37.28         52.44
We have performed three runs for both subtasks:
1. Run 1: we use ConvNets trained from scratch on the dense optical flow maps extracted from the development set videos.
2. Run 2: we train a ConvNet on the HMDB-51 benchmark. Then, we use the convolutional layers of this network as a feature extractor, and we train a multilayer perceptron on those features.
3. Run 3: for each video, we extract 20 Mel-frequency cepstral coefficients and the associated first and second derivatives (see the audio front-end sketch after this list). For each shot, a 100-dimensional i-vector is extracted. All the i-vectors are then processed through three independent classifiers (one for violence, one for valence, and one for arousal).
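A minimal sketch of the audio front-end of Run 3, assuming the librosa library and a 16 kHz audio track already extracted from the video; the i-vector extraction itself (UBM and total-variability training) is not reproduced here.

import librosa
import numpy as np

def mfcc_with_deltas(audio_path, n_mfcc=20):
    """Extract 20 MFCCs plus first and second derivatives (60-dim frames)."""
    # Assumes the video's audio track was exported to a wav file beforehand.
    y, sr = librosa.load(audio_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    delta1 = librosa.feature.delta(mfcc, order=1)  # first derivative
    delta2 = librosa.feature.delta(mfcc, order=2)  # second derivative
    return np.vstack([mfcc, delta1, delta2]).T     # shape: (frames, 60)

These frame-level features are the input from which the 100-dimensional per-shot i-vectors are then estimated.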
In this work, the visual and audio features are processed separately. Both feature types give similar results for violence detection and valence. For arousal, the video features perform considerably better, especially when the ConvNets feature extractor is trained on external data. We investigated merging the audio and visual features using a neural network; however, the results were poorer than using the features separately. Our future work will focus on merging the audio and video features (a minimal fusion sketch is given below).
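To illustrate the kind of fusion network mentioned above, the following sketch assumes PyTorch and purely illustrative dimensions (100-dimensional audio i-vectors, 512-dimensional ConvNet features); it is not the exact architecture we evaluated.

import torch
import torch.nn as nn

class FusionMLP(nn.Module):
    """Feature-level fusion: concatenate audio and video features, then classify."""
    def __init__(self, audio_dim=100, video_dim=512, hidden=256, num_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim + video_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, audio_feat, video_feat):
        fused = torch.cat([audio_feat, video_feat], dim=1)
        return self.net(fused)

# Example: a batch of 4 shots classified as violent vs. non-violent.
model = FusionMLP()
logits = model(torch.randn(4, 100), torch.randn(4, 512))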