"Unsupervised Video Summarization via Attention-Driven Adversarial Learning", by E. Apostolidis, E. Adamantidou, A. Metsai, V. Mezaris, I. Patras. Proceedings of the 26th Int. Conf. on Multimedia Modeling (MMM2020), Daejeon, Korea, Jan. 2020.
Unsupervised Video Summarization via Attention-Driven Adversarial Learning
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Unsupervised Video Summarization via Attention-Driven Adversarial Learning
E. Apostolidis1,2, E. Adamantidou1, A. I. Metsai1, V. Mezaris1, I. Patras2
1 CERTH-ITI, Thermi - Thessaloniki, Greece
2 School of EECS, Queen Mary University of London, London, UK
26th Int. Conf. on Multimedia Modeling
Daejeon, Korea, January 2020
Problem statement
Video summary: a short visual summary that encapsulates the flow of the story and the essential parts of the full-length video
[Figure: original video and its video summary (storyboard)]
Problem statement
Applications of video summarization
Professional CMS: effective indexing, browsing, retrieval & promotion of media assets
Video sharing platforms: improved viewer experience, enhanced viewer engagement & increased content consumption
Other summarization scenarios: movie trailer production, sports highlights video generation, video synopsis of 24h surveillance recordings
Related work
Deep-learning approaches
Supervised methods that use feedforward neural nets (e.g. CNNs) to extract video semantics and identify important video parts via sequence labeling [21], self-attention networks [7], or video-level metadata [17]
Supervised approaches that capture the story flow using recurrent neural nets (e.g. LSTMs):
In combination with statistical models, to select a representative and diverse set of keyframes [27]
In hierarchies, to identify the video structure and select key-fragments [30, 31]
In combination with DTR units and GANs, to capture long-range frame dependencies [28]
To form attention-based encoder-decoders [9, 13] or memory-augmented networks [8]
Unsupervised algorithms that do not rely on human annotations, and build summaries:
Using adversarial learning to: minimize the distance between videos and their summary-based reconstructions [1, 16]; maximize the mutual information between summary and video [25]; learn a mapping from raw videos to human-like summaries based on online-available summaries [20]
Through a decision-making process that is learned via RL and reward functions [32]
By learning to extract the key motions of appearing objects [29]
Motivation
Disadvantages of supervised learning
Only a restricted amount of annotated data is available for the supervised training of a video summarization method
Video summarization is highly subjective (it depends on the viewer's demands and aesthetics); there is no "ideal" or commonly accepted summary that could be used for training an algorithm
Advantages of unsupervised learning
No need for ground-truth training data; avoids the laborious and time-consuming labeling of video data
Adaptability to different types of video; summarization is learned based on the video content
Contributions
Introduce an attention mechanism in an unsupervised learning framework, whereas all previous attention-based summarization methods ([7-9, 13]) were supervised
Investigate the integration of an attention mechanism into a variational auto-encoder for video summarization purposes
Use attention to guide the generative adversarial training of the model, rather than using it to rank the video fragments as in [9]
Developed approach
Building on adversarial learning
Starting point: the SUM-GAN architecture
Main idea: build a keyframe selection mechanism by minimizing the distance between the deep representations of the original video and a reconstructed version of it, based on the selected keyframes
Problem: how to define a good distance?
Solution: use a trainable discriminator network!
Goal: train the Summarizer to maximally confuse the Discriminator when it distinguishes the original from the reconstructed video
Developed approach
Building on adversarial learning
SUM-GAN-sl:
Contains a linear compression layer that reduces the size of the CNN feature vectors
Follows an incremental and fine-grained approach to train the model's components; the training objective includes a regularization factor (σ)
Developed approach
Introducing an attention mechanism
Examined approaches:
1) Integrate an attention layer within the variational auto-encoder (VAE) of SUM-GAN-sl
2) Replace the VAE of SUM-GAN-sl with a deterministic attention auto-encoder
Developed approach
1) Adversarial learning driven by a variational attention auto-encoder
Variational attention was described in [4] and used for natural language modeling
It models the attention vector as Gaussian-distributed random variables
[Figure: variational auto-encoder]
Developed approach
1) Adversarial learning driven by a variational attention auto-encoder
Extended SUM-GAN-sl with variational attention, forming the SUM-GAN-VAAE architecture
The attention weights for each frame are handled as random variables, and a latent space is computed for these values too
At every time-step t, the attention component combines the encoder's output at t and the decoder's hidden state at t-1 to compute an attention weight vector
The decoder was modified to update its hidden states based on both latent spaces during the reconstruction of the video
[Figure: variational attention auto-encoder]
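The variational treatment of the attention weights can be sketched as follows (an illustrative NumPy sketch, not the paper's exact parameterization: the dot-product energy function, the fixed log-variance, and the single-step interface are assumptions for clarity):

```python
import numpy as np

rng = np.random.default_rng(0)

def variational_attention_step(V, h_prev):
    """One time-step of variational attention (sketch): the deterministic
    attention weights become the mean of a Gaussian, a latent attention
    vector is sampled via the reparameterization trick, and a context
    vector is formed from the encoder outputs V."""
    e = V @ h_prev                                      # energy scores (T,)
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                                # soft-max weights
    # Hypothetical fixed log-variance stands in for a learned one
    log_var = np.full_like(alpha, -4.0)
    eps = rng.standard_normal(alpha.shape)
    alpha_sampled = alpha + np.exp(0.5 * log_var) * eps  # reparameterization
    context = alpha_sampled @ V                          # weighted sum over V
    return alpha_sampled, context
```

In the full model, a second latent space of this kind exists alongside the VAE's frame-level latent space, which is what makes joint training harder (as noted in the experiments).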
Developed approach
2) Adversarial learning driven by a deterministic attention auto-encoder
Inspired by the efficiency of the attention-based encoder-decoder network in [13]
Built on the findings of [4] w.r.t. the impact of deterministic attention on VAEs
The VAE is entirely replaced by an attention auto-encoder (AAE) network, forming the SUM-GAN-AAE architecture
Developed approach
2) Adversarial learning driven by a deterministic attention auto-encoder
Attention auto-encoder: processing pipeline
The weighted feature vectors are fed to the Encoder
The Encoder's output (V) and the Decoder's previous hidden state are fed to the Attention component
For t > 1: use the hidden state of the previous Decoder step (h_{t-1})
For t = 1: use the hidden state of the last Encoder step (H_e)
The attention weights (α_t) are computed using an energy score function followed by a soft-max function
α_t is multiplied with V to form the context vector v_t'
v_t' is combined with the Decoder's previous output y_{t-1}
The Decoder gradually reconstructs the video
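The processing pipeline above can be sketched in a few lines (an illustrative NumPy sketch under simplifying assumptions: a dot-product energy function, and a single tanh layer standing in for the LSTM decoder; `W` is a hypothetical weight matrix, not the paper's trained parameters):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def attention_decoder(V, H_e, steps, W=None):
    """Sketch of the AAE decoding loop: V is the stack of encoder outputs
    (T, d); H_e is the last encoder hidden state (d,), used at t = 1."""
    T, d = V.shape
    rng = np.random.default_rng(0)
    W = W if W is not None else rng.standard_normal((2 * d, d)) * 0.1
    h_prev, y_prev = H_e, np.zeros(d)
    outputs = []
    for t in range(steps):
        e = V @ h_prev                 # energy scores (dot-product form)
        alpha = softmax(e)             # attention weights α_t
        v_ctx = alpha @ V              # context vector v_t'
        inp = np.concatenate([v_ctx, y_prev])   # combine with y_{t-1}
        h_prev = np.tanh(inp @ W)      # stand-in for the LSTM update
        y_prev = h_prev                # reconstructed frame feature
        outputs.append(y_prev)
    return np.stack(outputs)           # (steps, d) reconstruction
```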
Developed approach
Model's I/O and summarization process
Input: the CNN feature vectors of the (sampled) video frames
Output: frame-level importance scores
Summarization process:
The CNN features pass through the linear compression layer, and the frame selector computes importance scores at the frame level
Given a video segmentation (using KTS [18]), fragment-level importance scores are calculated by averaging the scores of each fragment's frames
The summary is created by selecting the fragments that maximize the total importance score, provided that the summary length does not exceed 15% of the video duration, i.e. by solving the 0/1 Knapsack problem
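The fragment-selection step can be sketched as follows (a minimal sketch: representing `fragments` as (start, end) frame-index pairs is an assumed interface, e.g. as produced by a KTS-like segmentation; the DP measures fragment length in frames):

```python
import numpy as np

def select_fragments(frame_scores, fragments, max_ratio=0.15):
    """Average frame scores per fragment, then solve a 0/1 Knapsack so the
    chosen fragments maximize total importance within 15% of the duration."""
    values = [float(np.mean(frame_scores[s:e])) for s, e in fragments]
    weights = [e - s for s, e in fragments]
    budget = int(max_ratio * len(frame_scores))
    n = len(fragments)
    # dp[w] = best total value at capacity w; keep[i][w] for backtracking
    dp = [0.0] * (budget + 1)
    keep = [[False] * (budget + 1) for _ in range(n)]
    for i in range(n):
        for w in range(budget, weights[i] - 1, -1):
            cand = dp[w - weights[i]] + values[i]
            if cand > dp[w]:
                dp[w] = cand
                keep[i][w] = True
    # Backtrack to recover the selected fragment indices
    selected, w = [], budget
    for i in range(n - 1, -1, -1):
        if keep[i][w]:
            selected.append(i)
            w -= weights[i]
    return sorted(selected)
```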
Experiments
Datasets
SumMe (https://gyglim.github.io/me/vsum/index.html#benchmark)
25 videos capturing multiple events (e.g. cooking and sports)
Video length: 1.6 to 6.5 min
Annotation: fragment-based video summaries
TVSum (https://github.com/yalesong/tvsum)
50 videos from 10 categories of the TRECVid MED task
Video length: 1 to 5 min
Annotation: frame-level importance scores
Experiments
Evaluation protocol
The generated summary should not exceed 15% of the video length
Similarity between an automatically generated summary (A) and a ground-truth summary (G) is expressed by the F-Score (%), with Precision (P) and Recall (R) measuring their temporal overlap (∩) at the frame level (|·| denotes duration):
P = |A ∩ G| / |A|, R = |A ∩ G| / |G|, F = 2 × P × R / (P + R)
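With summaries represented as frame-level boolean masks, the metric above reduces to a few lines (a sketch of the standard frame-level computation):

```python
def summary_fscore(A, G):
    """F-Score between an automatic summary A and a ground-truth summary G,
    both given as per-frame boolean membership masks of equal length."""
    inter = sum(a and g for a, g in zip(A, G))   # |A ∩ G| in frames
    P = inter / max(sum(A), 1)                   # precision
    R = inter / max(sum(G), 1)                   # recall
    return 2 * P * R / (P + R) if P + R else 0.0
```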
Experiments
Evaluation protocol
Slight but important distinction w.r.t. what is eventually used as the ground-truth summary
Most used approach (by [1, 6, 7, 8, 14, 20, 21, 26, 27, 29, 30, 31, 32, 33]): the generated summary is compared against each of the N available user summaries of a video, yielding F-Score_1, F-Score_2, ..., F-Score_N
The N per-user scores are then reduced to a single value per video; SumMe: the maximum of the N F-Scores, TVSum: the average of the N F-Scores
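The reduction of the N per-user F-Scores can be sketched as follows (a sketch of the commonly reported convention: maximum over users for SumMe, average over users for TVSum):

```python
def combine_user_fscores(fscores, dataset):
    """Reduce per-user F-Scores to one value, per the multi-user protocol."""
    if dataset == "SumMe":
        return max(fscores)                   # best-matching user summary
    if dataset == "TVSum":
        return sum(fscores) / len(fscores)    # average over user summaries
    raise ValueError(f"unknown dataset: {dataset}")
```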
Experiments
Evaluation protocol
Slight but important distinction w.r.t. what is eventually used as the ground-truth summary
Alternative approach (used in [9, 13, 16, 24, 25, 28]): the N user summaries of a video are first combined into a single ground-truth summary, and one F-Score is computed against it
Experiments
Implementation details
Videos were down-sampled to 2 fps
Feature extraction was based on the pool5 layer of GoogleNet, trained on ImageNet
The linear compression layer reduces the size of these vectors from 1024 to 500
All components are 2-layer LSTMs with 500 hidden units; the frame selector is a bi-directional LSTM
Training is based on the Adam optimizer; the Summarizer's learning rate is 1e-4 and the Discriminator's learning rate is 1e-5
Each dataset was split into two non-overlapping sets: a training set with 80% of the data and a testing set with the remaining 20%
Experiments were run on 5 differently created random splits, reporting the average performance at the training-epoch level (i.e. for the same training epoch) over these runs
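The split protocol can be sketched as follows (a sketch; the fixed seed and the use of NumPy's `Generator` are assumptions for reproducibility of the illustration):

```python
import numpy as np

def make_splits(n_videos, n_splits=5, train_ratio=0.8, seed=0):
    """Create n_splits random, non-overlapping 80/20 train/test splits
    over the dataset's video indices."""
    rng = np.random.default_rng(seed)
    n_train = int(train_ratio * n_videos)
    splits = []
    for _ in range(n_splits):
        order = rng.permutation(n_videos)
        splits.append((sorted(order[:n_train]), sorted(order[n_train:])))
    return splits
```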
Experiments
Step 1: Assessing the impact of the regularization factor σ
Outcomes:
The value of σ affects the models' performance and needs fine-tuning
Fine-tuning is dataset-dependent
The best overall performance of each model is observed for a different σ value
[Plots: performance vs. σ for SUM-GAN-sl, SUM-GAN-VAAE and SUM-GAN-AAE]
Experiments
Step 2: Comparison with SoA unsupervised approaches based on multiple user summaries
Outcomes:
A few SoA methods are comparable to (or even worse than) a random summary generator
The best method on TVSum shows random-level performance on SumMe
The best method on SumMe performs worse than SUM-GAN-AAE and is less competitive on TVSum
Variational attention reduces the efficiency of SUM-GAN-sl, due to the difficulty of defining two latent spaces in parallel with the continuous update of the model's components during training
Replacing the VAE with the AAE leads to a noticeable performance improvement over SUM-GAN-sl
+/- indicate better/worse performance compared to SUM-GAN-AAE
Note: SUM-GAN is not listed in this table, as it follows the single gt-summary evaluation protocol
Experiments
Step 3: Evaluating the effect of the introduced AAE component
Key-fragment selection: the attention mechanism leads to a much smoother series of importance scores
Training efficiency: much faster and more stable training of the model (loss curves for SUM-GAN-sl and SUM-GAN-AAE)
Experiments
Step 4: Comparison with SoA supervised approaches based on multiple user summaries
Outcomes:
The best methods on TVSum (MAVS and Tessellation-sup, respectively) seem adapted to this dataset, as they exhibit random-level performance on SumMe
Only a few supervised methods surpass the performance of a random summary generator on both datasets, with VASNet being the best among them
The performance of these methods ranges between 44.1 and 49.7 (F-Score) on SumMe, and between 56.1 and 61.4 on TVSum
The unsupervised SUM-GAN-AAE model is comparable with SoA supervised methods
Experiments
Step 5: Comparison with SoA approaches based on single ground-truth summaries
Impact of the regularization factor σ (best scores in bold):
The model's performance is affected by the value of σ
The effect of σ depends (also) on the evaluation approach; the best performance when using multiple human summaries was observed for σ = 0.15
SUM-GAN-AAE outperforms the original SUM-GAN model on both datasets, even for the same value of σ
Experiments
Step 5: Comparison with SoA approaches based on single ground-truth summaries
Outcomes:
The SUM-GAN-AAE model performs consistently well on both datasets
SUM-GAN-AAE shows improved performance compared to SoA supervised and unsupervised summarization methods (unsupervised approaches are marked with an asterisk)
Summarization example
Full video Generated summary
Conclusions
Presented a video summarization method that combines:
The effectiveness of attention mechanisms in spotting the most important parts of the video
The learning efficiency of generative adversarial networks for unsupervised training
Experimental evaluations on two benchmark datasets:
Documented the positive contribution of the introduced attention auto-encoder component to the model's training and summarization performance
Highlighted the competitiveness of the unsupervised SUM-GAN-AAE method against SoA video summarization techniques
1. E. Apostolidis, et al.: A stepwise, label-based approach for improving the adversarial training in unsupervised video summarization. In: AI4TV, ACM MM 2019
2. E. Apostolidis, et al.: Fast shot segmentation combining global and local visual descriptors. In: IEEE ICASSP 2014. pp. 6583-6587
3. K. Apostolidis, et al.: A motion-driven approach for fine-grained temporal segmentation of user-generated videos. In: MMM 2018. pp. 29-41
4. H. Bahuleyan, et al.: Variational attention for sequence-to-sequence models. In: 27th COLING. pp. 1672-1682 (2018)
5. J. Cho: PyTorch implementation of SUM-GAN (2017), https://github.com/j-min/Adversarial_Video_Summary (last accessed on Oct. 18, 2019)
6. M. Elfeki, et al.: Video summarization via actionness ranking. In: IEEE WACV 2019. pp. 754-763
7. J. Fajtl, et al.: Summarizing videos with attention. In: ACCV 2018. pp. 39-54
8. L. Feng, et al.: Extractive video summarizer with memory augmented neural networks. In: ACM MM 2018. pp. 976-983
9. T. Fu, et al.: Attentive and adversarial learning for video summarization. In: IEEE WACV 2019. pp. 1579-1587
10. M. Gygli, et al.: Creating summaries from user videos. In: ECCV 2014. pp. 505-520
11. M. Gygli, et al.: Video summarization by learning submodular mixtures of objectives. In: IEEE CVPR 2015. pp. 3090-3098
12. S. Hochreiter, et al.: Long Short-Term Memory. Neural Computation 9(8), 1735-1780 (1997)
13. Z. Ji, et al.: Video summarization with attention-based encoder-decoder networks. IEEE Trans. on Circuits and Systems for Video Technology (2019)
14. D. Kaufman, et al.: Temporal Tessellation: A unified approach for video analysis. In: IEEE ICCV 2017. pp. 94-104
15. S. Lee, et al.: A memory network approach for story-based temporal summarization of 360 videos. In: IEEE CVPR 2018. pp. 1410-1419
16. B. Mahasseni, et al.: Unsupervised video summarization with adversarial LSTM networks. In: IEEE CVPR 2017. pp. 2982-2991
17. M. Otani, et al.: Video summarization using deep semantic features. In: ACCV 2016. pp. 361-377
Key references
18. D. Potapov, et al.: Category-specific video summarization. In: ECCV 2014. pp. 540-555
19. A. Radford, et al.: Unsupervised representation learning with deep convolutional generative adversarial networks. In: ICLR 2016
20. M. Rochan, et al.: Video summarization by learning from unpaired data. In: IEEE CVPR 2019
21. M. Rochan, et al.: Video summarization using fully convolutional sequence networks. In: ECCV 2018. pp. 358-374
22. Y. Song, et al.: TVSum: Summarizing web videos using titles. In: IEEE CVPR 2015. pp. 5179-5187
23. C. Szegedy, et al.: Going deeper with convolutions. In: IEEE CVPR 2015. pp. 1-9
24. H. Wei, et al.: Video summarization via semantic attended networks. In: AAAI 2018. pp. 216-223
25. L. Yuan, et al.: Cycle-SUM: Cycle-consistent adversarial LSTM networks for unsupervised video summarization. In: AAAI 2019. pp. 9143-9150
26. Y. Yuan, et al.: Video summarization by learning deep side semantic embedding. IEEE Trans. on Circuits and Systems for Video Technology 29(1), 226-237 (2019)
27. K. Zhang, et al.: Video summarization with Long Short-Term Memory. In: ECCV 2016. pp. 766-782
28. Y. Zhang, et al.: DTR-GAN: Dilated temporal relational adversarial network for video summarization. In: ACM TURC 2019. pp. 89:1-89:6
29. Y. Zhang, et al.: Unsupervised object-level video summarization with online motion auto-encoder. Pattern Recognition Letters (2018)
30. B. Zhao, et al.: Hierarchical recurrent neural network for video summarization. In: ACM MM 2017. pp. 863-871
31. B. Zhao, et al.: HSA-RNN: Hierarchical structure-adaptive RNN for video summarization. In: IEEE/CVF CVPR 2018. pp. 7405-7414
32. K. Zhou, et al.: Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward. In: AAAI 2018. pp. 7582-7589
33. K. Zhou, et al.: Video summarisation by classification with deep reinforcement learning. In: BMVC 2018
Thank you for your attention!
Questions?
Vasileios Mezaris, bmezaris@iti.gr
Code and documentation publicly available at:
https://github.com/e-apostolidis/SUM-GAN-AAE
This work was supported by the EU's Horizon 2020 research and innovation programme under grant agreement H2020-780656 ReTV. The work of Ioannis Patras has been supported by EPSRC under grant No. EP/R026424/1.