MediaEval 2015 - UMons at Affective Impact of Movies Task - Poster

In this paper, we present the work done at UMons on the MediaEval 2015 Affective Impact of Movies Task (including Violent Scenes Detection). The task is divided into two subtasks: on the one hand, Violent Scene Detection, i.e., automatically finding violent scenes in a set of videos; on the other hand, evaluating the affective impact of the videos through an estimation of valence and arousal. To offer a solution for both the detection and classification subtasks, we investigate different visual and auditory feature extraction methods. An i-vector approach is applied to the audio, and optical flow maps processed through a deep convolutional neural network are tested for extracting features from the video. Classifiers based on probabilistic linear discriminant analysis and fully connected feed-forward neural networks are then used.
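The key video-side idea is to feed the ConvNet stacked dense optical flow maps instead of raw frames. Below is a minimal sketch of how such an input stack could be built, assuming OpenCV's Farnebäck dense optical flow and a stack of 10 frame pairs; the poster does not specify the optical flow algorithm or the stack size, so both are illustrative choices.

```python
# Minimal sketch (not from the poster): build a stacked dense optical flow
# input for a ConvNet. Farneback flow, a stack of 10 frame pairs and
# grayscale input are illustrative assumptions.
import cv2
import numpy as np

def optical_flow_stack(frames, stack_size=10):
    """frames: list of grayscale frames (H x W, uint8), len >= stack_size + 1.
    Returns an array of shape (2 * stack_size, H, W) holding the horizontal
    and vertical displacement of each pixel between successive frames."""
    maps = []
    for prev, nxt in zip(frames[:stack_size], frames[1:stack_size + 1]):
        # args: prev, next, flow, pyr_scale, levels, winsize,
        #       iterations, poly_n, poly_sigma, flags
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        maps.append(flow[..., 0])  # horizontal displacement map
        maps.append(flow[..., 1])  # vertical displacement map
    return np.stack(maps, axis=0).astype(np.float32)  # ConvNet-ready input
```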

http://ceur-ws.org/Vol-1436/
http://www.multimediaeval.org

Université de Mons
UMons at Affective Impact of Movies Task
Omar Seddati, Emre Kulah*, Gueorgui Pironkov, Stéphane Dupont, Saïd Mahmoudi, Thierry Dutoit
University of Mons, Mons, Belgium / *Middle East Technical University, Ankara, Turkey
Contact: Gueorgui Pironkov | TCTS Lab, UMons | gueorgui.pironkov@umons.ac.be

Introduction
In this work, we propose a solution for both violence detection and affective impact evaluation. We investigate different visual and auditory feature extraction methods. An i-vector approach is applied to the audio, and optical flow maps processed through a deep convolutional neural network are tested for the video. Classifiers based on probabilistic linear discriminant analysis and fully connected feed-forward neural networks are then used.

Approach
• Our system is based on:
  - i-vector: a low-dimensional feature vector extracted from high-dimensional data without losing too much of the relevant acoustic information
  - ConvNets: a state-of-the-art technique in the field of object recognition within images
• Instead of applying ConvNets to 2D images (frames), we extract dense optical flow maps that represent the displacement of each pixel between two successive frames, and use a sequence of those maps as input for our ConvNets. In this way, local temporal information is projected onto a space similar to the pixel space, and ConvNets can be effectively used for dynamic information.

Results
Violence detection:
Run                        MAP (%)
i-vector – pLDA            9.56
OFM – ConvNets             9.67
OFM – ConvNets – HMDB-51   6.56

Affective impact:
Run                        Valence (%)   Arousal (%)
i-vector – pLDA            37.03         31.71
OFM – ConvNets             35.28         44.39
OFM – ConvNets – HMDB-51   37.28         52.44

We have performed three runs for both subtasks:
1. Run 1: we use ConvNets trained from scratch on the dense optical flow maps extracted from the MediaEval dataset.
2. Run 2: we train a ConvNet on the HMDB-51 benchmark. Then, we use the convolutional layers of this network as a feature extractor, and we train a multi-layer perceptron on those features.
3. Run 3: for each video, we extract 20 Mel-frequency cepstral coefficients and the associated first and second derivatives. For each shot, a 100-dimensional i-vector is extracted. All the i-vectors are then processed through three independent classifiers (one per subtask).

Conclusion
In this work, the visual and audio features are processed separately. Both feature types give similar results for violence detection and valence. For arousal, the video features perform far better, especially when the ConvNets feature extractor is trained on external data. We investigated merging the audio and visual features using a neural network; however, the results were poorer than using the features separately. Our future work will focus on merging the audio and video features.

[System diagram: per shot, the audio stream is processed as MFCC → i-vector → pLDA, and the video stream as optical flow → ConvNets → MLP classifier.]
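Run 3's audio front end (20 MFCCs plus their first and second derivatives, computed per shot) could look like the sketch below, here using librosa. The sampling rate is an assumption, and the subsequent 100-dimensional i-vector extraction and pLDA scoring, which typically rely on a dedicated speaker-recognition toolkit, are not shown.

```python
# Minimal sketch of the Run 3 audio front end: 20 MFCCs with first- and
# second-order deltas for one shot. The 100-dimensional i-vector extraction
# and the pLDA classifiers are assumed to come from a separate toolkit.
import librosa
import numpy as np

def shot_mfcc_features(wav_path, sr=16000, n_mfcc=20):
    """Return a (3 * n_mfcc, n_frames) matrix: MFCCs, deltas, delta-deltas."""
    y, sr = librosa.load(wav_path, sr=sr)        # audio of a single shot
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    d1 = librosa.feature.delta(mfcc)             # first derivatives
    d2 = librosa.feature.delta(mfcc, order=2)    # second derivatives
    return np.vstack([mfcc, d1, d2])
```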

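Run 2 keeps the convolutional layers of the HMDB-51-trained network as a frozen feature extractor and trains a multi-layer perceptron on the resulting features. A minimal PyTorch-style sketch is given below; the feature dimension, the single hidden layer and its size are assumptions, as the poster does not describe the classifier architecture.

```python
# Minimal sketch of the Run 2 classifier: a fully connected feed-forward
# network (MLP) trained on features from frozen convolutional layers.
# Feature dimension (4096) and hidden size (512) are illustrative assumptions.
import torch
import torch.nn as nn

class ShotMLP(nn.Module):
    def __init__(self, feat_dim=4096, hidden=512, n_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_classes),  # e.g. violent / non-violent
        )

    def forward(self, x):                  # x: (batch, feat_dim) ConvNet features
        return self.net(x)

# Usage: features extracted by the frozen ConvNet are fed to the MLP.
model = ShotMLP()
logits = model(torch.randn(8, 4096))       # dummy batch of 8 shots
```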