Unsupervised Video Summarization via Attention-Driven Adversarial Learning

"Unsupervised Video Summarization via Attention-Driven Adversarial Learning", by E. Apostolidis, E. Adamantidou, A. Metsai, V. Mezaris, I. Patras. Proceedings of the 26th Int. Conf. on Multimedia Modeling (MMM2020), Daejeon, Korea, Jan. 2020.


  1. Unsupervised Video Summarization via Attention-Driven Adversarial Learning. E. Apostolidis (1,2), E. Adamantidou (1), A. I. Metsai (1), V. Mezaris (1), I. Patras (2). (1) CERTH-ITI, Thermi, Thessaloniki, Greece; (2) School of EECS, Queen Mary University of London, London, UK. 26th Int. Conf. on Multimedia Modeling, Daejeon, Korea, January 2020.
  2. Outline
     • Introduction
     • Motivation
     • Developed approach
     • Experiments
     • Summarization example
     • Conclusions
  3. Problem statement. Video summary: a short visual summary that encapsulates the flow of the story and the essential parts of the full-length video. (Figure: an original video and its video summary in storyboard form.)
  4. Problem statement: applications of video summarization
     • Professional CMS: effective indexing, browsing, retrieval and promotion of media assets
     • Video sharing platforms: improved viewer experience, enhanced viewer engagement and increased content consumption
     • Other summarization scenarios: movie trailer production, sports highlights video generation, video synopsis of 24-hour surveillance recordings
  5. Related work: deep-learning approaches
     • Supervised methods that use feedforward neural networks (e.g. CNNs) and extract and use video semantics to identify important video parts, based on sequence labeling [21], self-attention networks [7], or video-level metadata [17]
     • Supervised approaches that capture the story flow using recurrent neural networks (e.g. LSTMs):
       • in combination with statistical models, to select a representative and diverse set of keyframes [27]
       • in hierarchies, to identify the video structure and select key-fragments [30, 31]
       • in combination with DTR units and GANs, to capture long-range frame dependencies [28]
       • to form attention-based encoder-decoders [9, 13] or memory-augmented networks [8]
     • Unsupervised algorithms that do not rely on human annotations and build summaries:
       • using adversarial learning, to minimize the distance between videos and their summary-based reconstructions [1, 16], maximize the mutual information between summary and video [25], or learn a mapping from raw videos to human-like summaries based on summaries available online [20]
       • through a decision-making process learned via reinforcement learning and reward functions [32]
       • by learning to extract the key motions of appearing objects [29]
  6. Motivation
     • Disadvantages of supervised learning:
       • only a restricted amount of annotated data is available for supervised training of a video summarization method
       • video summarization is highly subjective (it depends on the viewer's demands and aesthetics); there is no "ideal" or commonly accepted summary that could be used for training an algorithm
     • Advantages of unsupervised learning:
       • no need for labeled training data, avoiding the laborious and time-demanding annotation of video material
       • adaptability to different types of video, since summarization is learned from the video content itself
  7. Contributions
     • Introduce an attention mechanism into an unsupervised learning framework, whereas all previous attention-based summarization methods ([7-9, 13]) were supervised
     • Investigate the integration of an attention mechanism into a variational auto-encoder for video summarization purposes
     • Use attention to guide the generative adversarial training of the model, rather than using it to rank the video fragments as in [9]
  8. Developed approach: building on adversarial learning
     • Starting point: the SUM-GAN architecture [16]
     • Main idea: build a keyframe selection mechanism by minimizing the distance between the deep representations of the original video and a reconstructed version of it that is based on the selected keyframes
     • Problem: how to define a good distance? Solution: use a trainable discriminator network!
     • Goal: train the Summarizer to maximally confuse the Discriminator when it tries to distinguish the original from the reconstructed video (see the sketch below)
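A minimal PyTorch sketch of this adversarial objective, assuming a discriminator that maps a frame sequence to a probability of being "original"; the function and variable names are illustrative, not the authors' code:

```python
import torch
import torch.nn.functional as F

def adversarial_losses(original, reconstructed, discriminator):
    """The Discriminator learns to separate the original video from its
    summary-based reconstruction; the Summarizer learns to confuse it."""
    real_prob = discriminator(original)                # probability in [0, 1]
    fake_prob = discriminator(reconstructed.detach())  # detached: D update only
    d_loss = (F.binary_cross_entropy(real_prob, torch.ones_like(real_prob))
              + F.binary_cross_entropy(fake_prob, torch.zeros_like(fake_prob)))
    # Summarizer/generator update: push D to label the reconstruction as real
    gen_prob = discriminator(reconstructed)
    g_loss = F.binary_cross_entropy(gen_prob, torch.ones_like(gen_prob))
    return d_loss, g_loss
```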
  9. Developed approach: building on adversarial learning. SUM-GAN-sl [1]:
     • contains a linear compression layer that reduces the size of the CNN feature vectors
  10. SUM-GAN-sl (cont.):
      • contains a linear compression layer that reduces the size of the CNN feature vectors
      • follows an incremental and fine-grained approach to train the model's components
  11. SUM-GAN-sl (cont.): the incremental, fine-grained training of the model's components involves a regularization factor. (Figure: the model's loss formulation, with the regularization factor highlighted; see the note below.)
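For reference, this regularization factor σ appears, to our reading, in the sparsity regularizer introduced with the original SUM-GAN model [16], where s_t denotes the frame-level importance scores produced by the frame selector over the M video frames:

```latex
\mathcal{L}_{\text{sparsity}} = \left\lVert \frac{1}{M}\sum_{t=1}^{M} s_t \;-\; \sigma \right\rVert_2
```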
  12. SUM-GAN-sl (cont.). (Figure: next step of the incremental training procedure.)
  13. SUM-GAN-sl (cont.). (Figure: final step of the incremental training procedure.)
  14. Developed approach: introducing an attention mechanism. Examined approaches:
      1) integrate an attention layer within the variational auto-encoder (VAE) of SUM-GAN-sl;
      2) replace the VAE of SUM-GAN-sl with a deterministic attention auto-encoder.
  15. Developed approach: 1) adversarial learning driven by a variational attention auto-encoder. Variational attention was introduced in [4] for natural language modeling; it models the attention vector as Gaussian-distributed random variables. (Figure: the variational auto-encoder.)
  16. Extended SUM-GAN-sl with variational attention, forming the SUM-GAN-VAAE architecture:
      • the attention weights for each frame are handled as random variables, and a latent space is computed for these values too
      • at every time-step t, the attention component combines the encoder's output at t and the decoder's hidden state at t-1 to compute an attention weight vector
      • the decoder is modified to update its hidden states based on both latent spaces while reconstructing the video (see the sketch below)
      (Figure: the variational attention auto-encoder.)
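A minimal sketch of the variational-attention idea of [4] as we read it: the deterministic attention context is replaced by a Gaussian latent variable sampled via the reparameterization trick, with a KL term added to the loss. The module and head names are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class VariationalAttention(nn.Module):
    """Treats the attention context as a Gaussian latent variable (sketch)."""
    def __init__(self, hidden_size: int = 500):
        super().__init__()
        self.to_mu = nn.Linear(hidden_size, hidden_size)      # mean head
        self.to_logvar = nn.Linear(hidden_size, hidden_size)  # log-variance head

    def forward(self, context):
        # context: deterministic attention context vector for time-step t
        mu, logvar = self.to_mu(context), self.to_logvar(context)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterize
        # KL divergence against a standard normal prior, added to the training loss
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return z, kl
```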
  17. Developed approach: 2) adversarial learning driven by a deterministic attention auto-encoder
      • Inspired by the effectiveness of the attention-based encoder-decoder network in [13]
      • Built on the findings of [4] w.r.t. the impact of deterministic attention on a VAE
      • The VAE is entirely replaced by an attention auto-encoder (AAE) network, forming the SUM-GAN-AAE architecture
  18. SUM-GAN-AAE architecture. (Figure: the attention auto-encoder and an overview of its processing pipeline.)
  19. Processing pipeline of the attention auto-encoder: the weighted feature vectors are fed to the Encoder.
  20. The Encoder's output (V) and the Decoder's previous hidden state are fed to the Attention component; for t > 1 this is the hidden state of the previous Decoder step (h_{t-1}), while for t = 1 the hidden state of the last Encoder step (h_E) is used.
  21. At every time-step t, the Attention component computes an attention weight vector (α_t)...
  22. ...using an energy score function followed by a soft-max function.
  23. α_t is multiplied with V to form the context vector v_t'.
  24. v_t' is combined with the Decoder's previous output y_{t-1}.
  25. The Decoder gradually reconstructs the video (see the sketch below).
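A minimal sketch of one decoding step of such an attention auto-encoder, assuming LSTM encoder/decoder cells and a Bahdanau-style additive energy function (the exact score function used in the paper may differ); all names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionDecoderStep(nn.Module):
    """One decoding step of a deterministic attention auto-encoder (sketch)."""
    def __init__(self, hidden_size: int = 500):
        super().__init__()
        self.W = nn.Linear(hidden_size, hidden_size, bias=False)  # projects encoder outputs
        self.U = nn.Linear(hidden_size, hidden_size, bias=False)  # projects decoder state
        self.v = nn.Linear(hidden_size, 1, bias=False)            # energy score
        self.cell = nn.LSTMCell(2 * hidden_size, hidden_size)     # consumes [v_t'; y_{t-1}]

    def forward(self, V, h_prev, c_prev, y_prev):
        # V: (T, H) encoder outputs; h_prev, c_prev, y_prev: (H,) decoder state/output
        e = self.v(torch.tanh(self.W(V) + self.U(h_prev))).squeeze(-1)  # energy scores (T,)
        alpha = F.softmax(e, dim=0)                                      # attention weights α_t
        context = (alpha.unsqueeze(-1) * V).sum(dim=0)                   # context vector v_t'
        # Combine v_t' with the decoder's previous output y_{t-1} and step the cell
        h, c = self.cell(torch.cat([context, y_prev]).unsqueeze(0),
                         (h_prev.unsqueeze(0), c_prev.unsqueeze(0)))
        return h.squeeze(0), c.squeeze(0), alpha
```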
  26. Model's I/O and summarization process
      • Input: the CNN feature vectors of the (sub-sampled) video frames
      • Output: frame-level importance scores
      • Summarization process:
        • the CNN features pass through the linear compression layer and the frame selector, which computes importance scores at the frame level
        • given a video segmentation (obtained with KTS [18]), fragment-level importance scores are calculated by averaging the scores of each fragment's frames
        • the summary is created by selecting the fragments that maximize the total importance score, subject to the summary length not exceeding 15% of the video duration; this is solved as a 0/1 Knapsack problem (see the sketch below)
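A minimal sketch of this fragment-selection step: a plain dynamic-programming 0/1 knapsack, assuming fragment durations are measured in frames and the budget is 15% of the frame count. This is not the authors' implementation, just the standard algorithm the slide names:

```python
def select_fragments(scores, lengths, budget):
    """0/1 knapsack: pick fragments maximizing total importance while the
    total fragment length stays within the summary budget (in frames)."""
    dp = [(0.0, [])] * (budget + 1)  # dp[c] = (best score, chosen indices) for capacity <= c
    for i, (score, w) in enumerate(zip(scores, lengths)):
        for c in range(budget, w - 1, -1):  # iterate capacity downwards (each item used once)
            cand = dp[c - w][0] + score
            if cand > dp[c][0]:
                dp[c] = (cand, dp[c - w][1] + [i])
    return dp[budget][1]

# Example: fragment importances and durations (frames); a 3000-frame video
# gives a 15% budget of 450 frames.
chosen = select_fragments([0.7, 0.2, 0.9], [200, 150, 300], 450)  # -> [0, 2] hypothetical toy data
```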
  27. Experiments: datasets
      • SumMe (https://gyglim.github.io/me/vsum/index.html#benchmark)
        • 25 videos capturing multiple events (e.g. cooking and sports)
        • video length: 1.6 to 6.5 min
        • annotation: fragment-based video summaries
      • TVSum (https://github.com/yalesong/tvsum)
        • 50 videos from 10 categories of the TRECVid MED task
        • video length: 1 to 5 min
        • annotation: frame-level importance scores
  28. Experiments: evaluation protocol
      • The generated summary should not exceed 15% of the video length
      • The similarity between an automatically generated summary (A) and a ground-truth summary (G) is expressed by the F-Score (%), with Precision and Recall measuring their temporal overlap (∩) at the frame level, where || · || denotes duration (see the formulas below)
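The frame-level Precision, Recall and F-Score implied by the above are the standard definitions used throughout this literature:

```latex
P = \frac{\lVert A \cap G \rVert}{\lVert A \rVert}, \qquad
R = \frac{\lVert A \cap G \rVert}{\lVert G \rVert}, \qquad
F = 2 \cdot \frac{P \times R}{P + R} \times 100\%
```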
  29. Evaluation protocol (cont.): there is a slight but important distinction w.r.t. what is eventually used as the ground-truth summary.
  30. Most used approach (by [1, 6, 7, 8, 14, 20, 21, 26, 27, 29, 30, 31, 32, 33]): the generated summary is compared against all of the available user summaries for a video.
  31. An F-Score is computed against the first user summary (F-Score_1)...
  32. ...another against the second user summary (F-Score_2)...
  33. ...and so on, up to the N-th user summary (F-Score_N).
  34. The N F-Scores are then aggregated into a single per-video score: the maximum for SumMe and the average for TVSum (see the sketch below).
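A minimal sketch of this per-video aggregation, assuming binary frame-selection vectors; the max-for-SumMe / mean-for-TVSum convention follows the protocol commonly used in this literature:

```python
import numpy as np

def video_fscore(machine_summary, user_summaries, dataset):
    """Aggregate the F-Scores against N user summaries into one per-video score.

    machine_summary: boolean array over frames; user_summaries: list of such arrays.
    """
    def fscore(a, g):
        overlap = np.logical_and(a, g).sum()  # temporal overlap in frames
        if overlap == 0:
            return 0.0
        p, r = overlap / a.sum(), overlap / g.sum()
        return 2 * p * r / (p + r)

    scores = [fscore(machine_summary, g) for g in user_summaries]
    return max(scores) if dataset == "SumMe" else float(np.mean(scores))
```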
  35. Evaluation protocol (cont.): alternative approach (used in [9, 13, 16, 24, 25, 28]): the user annotations are first fused into a single ground-truth summary per video.
  36. A single F-Score is then computed between the generated summary and this ground-truth summary.
  37. Experiments: implementation details
      • Videos were down-sampled to 2 fps
      • Feature extraction was based on the pool5 layer of GoogLeNet trained on ImageNet
      • The linear compression layer reduces the size of these feature vectors from 1024 to 500
      • All components are 2-layer LSTMs with 500 hidden units; the frame selector is a bi-directional LSTM
      • Training is based on the Adam optimizer; the Summarizer's learning rate is 10^-4 and the Discriminator's is 10^-5
      • Each dataset was split into two non-overlapping sets: a training set with 80% of the data and a testing set with the remaining 20%
      • Experiments were run on 5 differently created random splits, reporting the average performance at the training-epoch level (i.e. for the same training epoch) over these runs (see the sketch below)
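A sketch of these hyper-parameters in PyTorch; the module breakdown mirrors the description above, but the variable names and the exact parameter grouping are illustrative assumptions (the attention auto-encoder is omitted for brevity):

```python
import torch
import torch.nn as nn

HIDDEN = 500  # all recurrent components: 2-layer LSTMs with 500 hidden units

compress = nn.Linear(1024, 500)  # GoogLeNet pool5 features: 1024-d -> 500-d
frame_selector = nn.LSTM(500, HIDDEN, num_layers=2, bidirectional=True)
discriminator = nn.LSTM(500, HIDDEN, num_layers=2)

# Adam with the learning rates stated on the slide
opt_summarizer = torch.optim.Adam(
    list(compress.parameters()) + list(frame_selector.parameters()), lr=1e-4)
opt_discriminator = torch.optim.Adam(discriminator.parameters(), lr=1e-5)
```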
  38. Step 1: assessing the impact of the regularization factor σ. (Figure: performance vs. σ for SUM-GAN-sl, SUM-GAN-VAAE and SUM-GAN-AAE.)
  39. Step 1 (cont.). Outcomes:
      • the value of σ affects the models' performance and needs fine-tuning
      • this fine-tuning is dataset-dependent
      • the best overall performance of each model is observed for a different σ value
  40. Step 2: comparison with state-of-the-art unsupervised approaches, based on multiple user summaries. Outcomes:
      • a few state-of-the-art methods are comparable to (or even worse than) a random summary generator
      • the best method on TVSum shows random-level performance on SumMe
      • the best method on SumMe performs worse than SUM-GAN-AAE and is less competitive on TVSum
      • variational attention reduces the efficiency of SUM-GAN-sl, due to the difficulty of learning two latent spaces in parallel with the continuous update of the model's components during training
      • replacing the VAE with the AAE leads to a noticeable performance improvement over SUM-GAN-sl
      (Table notes: +/- indicate better/worse performance compared to SUM-GAN-AAE; SUM-GAN is not listed, as it follows the single ground-truth-summary evaluation protocol.)
  41. Step 3: evaluating the effect of the introduced AAE component. Key-fragment selection: the attention mechanism leads to a much smoother series of importance scores. (Figure: frame-level importance score curves.)
  42. Step 3 (cont.). Training efficiency: much faster and more stable training of the model. (Figure: loss curves for SUM-GAN-sl and SUM-GAN-AAE.)
  43. Step 4: comparison with state-of-the-art supervised approaches, based on multiple user summaries. Outcomes:
      • the best methods on TVSum (MAVS and Tessellation-sup, respectively) seem adapted to this dataset, as they exhibit random-level performance on SumMe
      • only a few supervised methods surpass the performance of a random summary generator on both datasets, with VASNet being the best among them
      • the performance of these methods ranges between 44.1 and 49.7 (F-Score) on SumMe, and between 56.1 and 61.4 on TVSum
      • the unsupervised SUM-GAN-AAE model is comparable with state-of-the-art supervised methods
  44. Step 5: comparison with state-of-the-art approaches, based on single ground-truth summaries. Impact of the regularization factor σ (best scores in bold in the slide's table):
      • the model's performance is affected by the value of σ
      • the effect of σ depends (also) on the evaluation approach; the best performance when using multiple human summaries was observed for σ = 0.15
      • SUM-GAN-AAE outperforms the original SUM-GAN model on both datasets, even for the same value of σ
  45. Step 5 (cont.). Outcomes:
      • the SUM-GAN-AAE model performs consistently well on both datasets
      • SUM-GAN-AAE performs better than state-of-the-art supervised and unsupervised summarization methods (unsupervised approaches are marked with an asterisk in the slide's table)
  46. Summarization example. (Video: the full video and the generated summary.)
  47. Conclusions
      • Presented a video summarization method that combines:
        • the effectiveness of attention mechanisms in spotting the most important parts of a video
        • the learning efficiency of generative adversarial networks for unsupervised training
      • Experimental evaluations on two benchmark datasets:
        • documented the positive contribution of the introduced attention auto-encoder component to the model's training and summarization performance
        • highlighted the competitiveness of the unsupervised SUM-GAN-AAE method against state-of-the-art video summarization techniques
  48. Key references
      1. E. Apostolidis, et al.: A stepwise, label-based approach for improving the adversarial training in unsupervised video summarization. In: AI4TV, ACM MM 2019
      2. E. Apostolidis, et al.: Fast shot segmentation combining global and local visual descriptors. In: IEEE ICASSP 2014, pp. 6583-6587
      3. K. Apostolidis, et al.: A motion-driven approach for fine-grained temporal segmentation of user-generated videos. In: MMM 2018, pp. 29-41
      4. H. Bahuleyan, et al.: Variational attention for sequence-to-sequence models. In: 27th COLING, pp. 1672-1682 (2018)
      5. J. Cho: PyTorch implementation of SUM-GAN (2017), https://github.com/j-min/Adversarial_Video_Summary (last accessed Oct. 18, 2019)
      6. M. Elfeki, et al.: Video summarization via actionness ranking. In: IEEE WACV 2019, pp. 754-763
      7. J. Fajtl, et al.: Summarizing videos with attention. In: ACCV 2018, pp. 39-54
      8. L. Feng, et al.: Extractive video summarizer with memory augmented neural networks. In: ACM MM 2018, pp. 976-983
      9. T. Fu, et al.: Attentive and adversarial learning for video summarization. In: IEEE WACV 2019, pp. 1579-1587
      10. M. Gygli, et al.: Creating summaries from user videos. In: ECCV 2014, pp. 505-520
      11. M. Gygli, et al.: Video summarization by learning submodular mixtures of objectives. In: IEEE CVPR 2015, pp. 3090-3098
      12. S. Hochreiter, et al.: Long Short-Term Memory. Neural Computation 9(8), 1735-1780 (1997)
      13. Z. Ji, et al.: Video summarization with attention-based encoder-decoder networks. IEEE Trans. on Circuits and Systems for Video Technology (2019)
      14. D. Kaufman, et al.: Temporal Tessellation: A unified approach for video analysis. In: IEEE ICCV 2017, pp. 94-104
      15. S. Lee, et al.: A memory network approach for story-based temporal summarization of 360 videos. In: IEEE CVPR 2018, pp. 1410-1419
      16. B. Mahasseni, et al.: Unsupervised video summarization with adversarial LSTM networks. In: IEEE CVPR 2017, pp. 2982-2991
      17. M. Otani, et al.: Video summarization using deep semantic features. In: ACCV 2016, pp. 361-377
  49. Key references (cont.)
      18. D. Potapov, et al.: Category-specific video summarization. In: ECCV 2014, pp. 540-555
      19. A. Radford, et al.: Unsupervised representation learning with deep convolutional generative adversarial networks. In: ICLR 2016
      20. M. Rochan, et al.: Video summarization by learning from unpaired data. In: IEEE CVPR 2019
      21. M. Rochan, et al.: Video summarization using fully convolutional sequence networks. In: ECCV 2018, pp. 358-374
      22. Y. Song, et al.: TVSum: Summarizing web videos using titles. In: IEEE CVPR 2015, pp. 5179-5187
      23. C. Szegedy, et al.: Going deeper with convolutions. In: IEEE CVPR 2015, pp. 1-9
      24. H. Wei, et al.: Video summarization via semantic attended networks. In: AAAI 2018, pp. 216-223
      25. L. Yuan, et al.: Cycle-SUM: Cycle-consistent adversarial LSTM networks for unsupervised video summarization. In: AAAI 2019, pp. 9143-9150
      26. Y. Yuan, et al.: Video summarization by learning deep side semantic embedding. IEEE Trans. on Circuits and Systems for Video Technology 29(1), 226-237 (2019)
      27. K. Zhang, et al.: Video summarization with Long Short-Term Memory. In: ECCV 2016, pp. 766-782
      28. Y. Zhang, et al.: DTR-GAN: Dilated temporal relational adversarial network for video summarization. In: ACM TURC 2019, pp. 89:1-89:6
      29. Y. Zhang, et al.: Unsupervised object-level video summarization with online motion auto-encoder. Pattern Recognition Letters (2018)
      30. B. Zhao, et al.: Hierarchical recurrent neural network for video summarization. In: ACM MM 2017, pp. 863-871
      31. B. Zhao, et al.: HSA-RNN: Hierarchical structure-adaptive RNN for video summarization. In: IEEE/CVF CVPR 2018, pp. 7405-7414
      32. K. Zhou, et al.: Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward. In: AAAI 2018, pp. 7582-7589
      33. K. Zhou, et al.: Video summarisation by classification with deep reinforcement learning. In: BMVC 2018
  50. Thank you for your attention! Questions? Contact: Vasileios Mezaris, bmezaris@iti.gr. Code and documentation publicly available at: https://github.com/e-apostolidis/SUM-GAN-AAE. This work was supported by the EU's Horizon 2020 research and innovation programme under grant agreement H2020-780656 ReTV. The work of Ioannis Patras has been supported by EPSRC under grant No. EP/R026424/1.
