Explaining video summarization based on
the focus of attention
E. Apostolidis1,2, G. Balaouras1, V. Mezaris1, I. Patras2
1 Information Technologies Institute, CERTH, Thermi - Thessaloniki, Greece
2 School of EECS, Queen Mary University of London, London, UK
24th IEEE International Symposium
on Multimedia (ISM 2022)
2
• Explainable video summarization: why is it important?
• Related work
• Proposed method
• Experimental evaluations
• Conclusions
Overview
3
Current practice for producing a video summary
• An editor has to watch the entire video
content and decide which parts
should be included in the summary
• Different summaries of the same video
could be needed for distribution via
different communication channels
• Laborious task that can be significantly
accelerated by video summarization
technologies
Image source: https://www.premiumbeat.com/
blog/3-tips-for-difficult-video-edit/
4
Goal of video summarization technologies
This synopsis can be made of:
• A set of representative video key-
fragments (a.k.a. video skim)
• A set of representative video key-frames
(a.k.a. video storyboard)
“Generate a short visual synopsis
that summarizes the video content
by selecting the most informative
and important parts of it”
[Figure: the video content is analyzed into key-fragments and key-frames, which form the video skim and the video storyboard, respectively]
Video title: “Susan Boyle's First Audition -
I Dreamed a Dream - Britain's Got Talent 2009”
Video source: https://www.youtube.com/watch?v=deRF9oEbRso
5
Why is explainable video summarization important?
• Video summarization technologies can
drastically reduce the needed resources
for video summary production in terms
of both time and human effort
• However, their outcome needs to be
curated by the editor to ensure that all
needed parts have been selected
• Content curation could be facilitated if
the editor gets explanations about the
suggestions of the used technology
Image source: https://www.appier.com/en/blog/
what-is-supervised-learning
Such explanations would increase the editor’s trust in the used technology,
thus facilitating and accelerating content curation
6
Works on explainable networks for video analysis tasks
• (Aakur, 2018) extraction of explainable representations for video activity interpretation
• (Bargal, 2018) spatio-temporal cues contributing to network’s classification/captioning
output, to spot fragments linked to specific action/phrase from caption
• (Zhuo, 2019) spatio-temporal graph of semantic-level video states and state transition
analysis for video action reasoning
• (Stergiou, 2019) heatmaps visualizing focus of attention and explaining networks for
action classification and recognition
• (Manttari, 2020) perturbation-based method to spot the video fragment with the
greatest impact on the video classification results
• (Li, 2021) generic perturbation-based method for spatio-temporally-smooth
explanations of video classification networks
• (Gkalelis, 2022) in-degrees of graph attention networks’ adjacency matrices to explain
video event recognition, in terms of salient objects and frames
7
Typical video summarization pipeline
1. Video frames are represented using pre-trained CNNs (e.g., GoogleNet)
2. Video summarization networks estimate the frames’ importance
3. Given a video fragmentation and a time budget, the video summary is formed
by selecting fragments that maximize the summary's total importance (Knapsack problem; see the sketch below)
Proposed method: Problem formulation
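To make step 3 concrete, here is a minimal sketch of the knapsack-style selection, assuming per-fragment importance scores (e.g., the mean of the frame-level scores) and per-fragment lengths are already available; function and variable names are illustrative, not taken from the authors' implementation.

```python
def knapsack_summary(importances, lengths, budget):
    """Pick the subset of fragments that maximizes total importance while the
    summed fragment length stays within `budget` (e.g., 15% of the video)."""
    # dp[c] = (best total importance, chosen fragment indices) for capacity c
    dp = [(0.0, [])] * (budget + 1)
    for i, (imp, length) in enumerate(zip(importances, lengths)):
        new_dp = list(dp)
        for c in range(length, budget + 1):
            candidate = dp[c - length][0] + imp
            if candidate > new_dp[c][0]:
                new_dp[c] = (candidate, dp[c - length][1] + [i])
        dp = new_dp
    return sorted(dp[budget][1])

# hypothetical usage: summary = knapsack_summary(frag_scores, frag_lengths, int(0.15 * n_frames))
```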
8
Explanation’s goal/output
• A video-fragment-level explanation mask indicating the most influential video
fragments for the network’s estimates about the frames’ importance
Assumptions
• Video is split into fixed-size fragments; the summary is made of the M top-scoring ones (see the sketch below)
Proposed method: Problem formulation
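Under these assumptions, both the summary and the explanation mask reduce to picking the M top-scoring fixed-size fragments. The sketch below illustrates this; the 20-frame fragment size matches the evaluation settings later in the deck, and all names are illustrative.

```python
import numpy as np

def fragment_level_scores(frame_scores, frag_size=20):
    """Average frame-level scores over consecutive fixed-size fragments."""
    n_frags = len(frame_scores) // frag_size
    frames = np.asarray(frame_scores[:n_frags * frag_size], dtype=float)
    return frames.reshape(n_frags, frag_size).mean(axis=1)

def top_m_fragments(frame_scores, m=5, frag_size=20):
    """Indices of the M top-scoring fragments: the summary when the scores are
    importance estimates, or the explanation mask when they are explanation scores."""
    scores = fragment_level_scores(frame_scores, frag_size)
    return np.argsort(scores)[::-1][:m]
```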
9
• Studied in the NLP domain [Jain, 2019; Serrano, 2019; Wiegreffe, 2019;
Kobayashi, 2020; Chrysostomou, 2021; Liu, 2022] and elsewhere
• Can it be used for attention-based video summarization networks?
• Various possible explanation signals can be formed using the Attention matrix
Proposed method: Attention as explanation
10
Attention-based explanation signals
• Inherent Attention (IA): $\{a_{i,i}\}_{i=1}^{T}$
• Grad of Attention (GoA): $\{\nabla a_{i,i}\}_{i=1}^{T}$
• Grad Attention (GA): $\{a_{i,i} \cdot \nabla a_{i,i}\}_{i=1}^{T}$
• Input Norm Attention (NA): $\{a_{i,i} \cdot \|\mathbf{v}_i\|\}_{i=1}^{T}$
• Input Norm Grad Attention (NGA): $\{a_{i,i} \cdot \nabla a_{i,i} \cdot \|\mathbf{v}_i\|\}_{i=1}^{T}$
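A minimal sketch of how these five frame-level signals could be computed, assuming the diagonal attention weights, their gradients with respect to the network's output, and the input feature vectors have already been extracted from the summarization network (how those tensors are obtained from CA-SUM is not shown, and the mean-based aggregation to fragment level is an assumption):

```python
import numpy as np

def explanation_signals(attn_diag, attn_grad_diag, inputs):
    """Frame-level explanation signals from the diagonal attention weights (T,),
    their gradients (T,), and the input feature vectors (T, D)."""
    v_norm = np.linalg.norm(inputs, axis=1)          # ||v_i||
    return {
        "IA":  attn_diag,                            # inherent attention
        "GoA": attn_grad_diag,                       # gradient of attention
        "GA":  attn_diag * attn_grad_diag,           # grad-weighted attention
        "NA":  attn_diag * v_norm,                   # input-norm-weighted attention
        "NGA": attn_diag * attn_grad_diag * v_norm,  # norm- and grad-weighted attention
    }

def to_fragment_level(signal, frag_size=20):
    """Aggregate a frame-level signal into fragment-level explanation scores."""
    n_frags = len(signal) // frag_size
    return np.asarray(signal[:n_frags * frag_size]).reshape(n_frags, frag_size).mean(axis=1)
```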
11
Replacement functions
• Slice out: completely removes the specified part
• Input Mask: replaces the specified part with a mask composed of black/white
frames’ feature representations
• Randomization: replaces 50% of the elements of each feature representation
within the specified part
• Attention Mask: sets the attention weights associated with the specified part
equal to zero
Modeling network’s input-output relationship
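A sketch of how the first three replacement functions listed above could be applied to the sequence of frame features; the Attention Mask variant acts inside the network on the attention weights, so it is not shown, and the exact values used for the input mask and for the randomized elements are assumptions rather than the paper's choices.

```python
import numpy as np

def replace_fragment(features, frag_idx, mode, mask_feature=None, rng=None):
    """Apply a replacement function to one fragment.
    features: (T, D) frame representations; frag_idx: frame indices of the fragment."""
    X = features.copy()
    if mode == "slice_out":          # remove the fragment's frames entirely
        return np.delete(X, frag_idx, axis=0)
    if mode == "input_mask":         # e.g., the feature vector of a black/white frame
        X[frag_idx] = mask_feature
    elif mode == "randomization":    # perturb 50% of each frame's feature elements
        rng = rng or np.random.default_rng(0)
        noise = rng.standard_normal(X[frag_idx].shape)
        keep = rng.random(X[frag_idx].shape) >= 0.5
        X[frag_idx] = np.where(keep, X[frag_idx], noise)
    return X
```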
12
• Quantify the k-th video fragment's influence on the network's output, based on
the Difference of Estimates:
$\Delta E(\mathbf{X}, \mathbf{X}^{k}) = \tau(\mathbf{y}, \mathbf{y}^{k})$
• $\mathbf{X}$: original feature vectors
• $\mathbf{X}^{k}$: updated feature vectors after replacing the k-th fragment
• $\mathbf{y}$: network's output for $\mathbf{X}$
• $\mathbf{y}^{k}$: network's output for $\mathbf{X}^{k}$
• $\tau$: Kendall's τ correlation coefficient (see the sketch below)
Quantifying video fragment’s influence
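As a sketch, ΔE can be computed directly with SciPy's Kendall's τ, assuming `model` maps a feature sequence to per-frame importance scores; for the slice-out replacement the scores of the removed frames would also have to be dropped from y before comparing, which is omitted here.

```python
from scipy.stats import kendalltau

def difference_of_estimates(model, X, X_k):
    """Difference of Estimates, following the definition above: ΔE(X, X_k) = τ(y, y_k)."""
    y = model(X)       # frame importance scores for the original features
    y_k = model(X_k)   # scores after replacing the k-th fragment
    tau, _ = kendalltau(y, y_k)
    return tau
```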
13
• Discoverability+ (D+): evaluates whether fragments with high explanation scores have a
significant influence on the network's estimates; D+ = Mean(ΔΕ) after replacing the
top-1%, 5%, 10%, 15%, 20% (batch) and the 5 top-scoring fragments (1-by-1)
• Discoverability- (D-): evaluates whether fragments with low explanation scores have a
small influence on the network's estimates; D- = Mean(ΔΕ) after replacing the bottom-
1%, 5%, 10%, 15%, 20% (batch) and the 5 lowest-scoring fragments (1-by-1)
• Sanity Violation (SV): quantifies the ability of explanations to discriminate
important from unimportant video fragments; SV = % of cases where the sanity
test (D+ > D-) is violated
• Rank Correlation (RC): measures the (Spearman) correlation between fragment-
level explanation scores and the ΔE values obtained after replacing each fragment (see the sketch below)
Experiments: Evaluation measures
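A sketch of how the four measures could be computed for a single video and explanation method, following the definitions above; the inputs (ΔE values collected after replacing the top- and bottom-scoring fragments, and per-fragment explanation scores paired with the ΔE obtained after replacing each fragment) are assumed to have been gathered beforehand, and SV is then the percentage of such cases where the sanity test fails.

```python
import numpy as np
from scipy.stats import spearmanr

def evaluation_measures(de_top, de_bottom, expl_scores, de_per_fragment):
    """D+, D-, sanity-test outcome, and Rank Correlation for one case.
    de_top / de_bottom: ΔE after replacing the top-/bottom-scoring fragments;
    expl_scores / de_per_fragment: per-fragment explanation scores and the ΔE
    obtained after replacing each fragment in turn."""
    d_plus = float(np.mean(de_top))
    d_minus = float(np.mean(de_bottom))
    sanity_ok = d_plus > d_minus            # SV = % of cases where this fails
    rc, _ = spearmanr(expl_scores, de_per_fragment)
    return d_plus, d_minus, sanity_ok, rc
```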
14
SumMe (https://gyglim.github.io/me/vsum/index.html#benchmark)
• 25 videos capturing multiple events (e.g., cooking and sports) from
first-person and third-person view
• Video length: 1 to 6 min
TVSum (https://github.com/yalesong/tvsum)
• 50 videos of various genres (e.g., news, “how-to”, documentary, vlog,
egocentric) from 10 categories of the TRECVid MED task
• Video length: 1 to 11 min
Experiments: Datasets
15
• Frame sampling: 2 fps
• Feature extraction: GoogleNet (pool5 layer) trained on ImageNet
• Highlighted fragments in explanation mask: 5
• Size of video fragments: 20 frames (10 sec)
• Data splits: 5
• Video summarization network: CA-SUM (Apostolidis, 2022); trained models
on SumMe and TVSum, available at: https://zenodo.org/record/6562992
Experiments: Evaluation settings
16
Experimental results: Quantitative analysis
Fragments’ replacement
in batch mode
Experimental results: Quantitative analysis
Fragments’ replacement
in batch mode
Explanations formed using
the attention weights are the
most competitive on both
datasets
On average, they achieve
higher/lower D-/D+ scores
and pass the sanity test in ~66%
and 80% of the cases on SumMe
and TVSum, respectively
20
20
Experimental results: Quantitative analysis
Fragments’ replacement
in batch mode
Explanations formed using
the norm-weighted attention
signals are also good, but less
effective, especially in terms of SV
20
Experimental results: Quantitative analysis
Fragments’ replacement
in batch mode
The use of gradients to form
explanations results in clearly
worse performance
The sanity test is violated in 56%
and 82% of the cases on SumMe
and TVSum, respectively
21
Experimental results: Quantitative analysis
Fragments’ replacement
in 1-by-1 mode
21
Experimental results: Quantitative analysis
Fragments’ replacement
in 1-by-1 mode
Explanations formed using the
attention weights are the
best-performing ones
Pass the sanity test in 65%
and 80% of cases on SumMe
and TVSum, respectively
Assign scores that are more
representative of each
fragments’ influence
21
Experimental results: Quantitative analysis
Fragments’ replacement
in 1-by-1 mode
Explanations formed using
the norm-weighted attention
signals also perform well
21
Experimental results: Quantitative analysis
Fragments’ replacement
in 1-by-1 mode
Explanations formed using
gradients typically perform
worse
Violate the sanity test in 57%
and 80% of cases on SumMe
and TVSum, respectively
Assign explanation scores
that are uncorrelated or negatively
correlated with the fragments'
influence on the network's output
22
Experimental results: Quantitative analysis
Replacement in batch mode Replacement in 1-by-1 mode
The use of inherent attention weights to form explanations
for the CA-SUM model is the best option
23
Experimental results: Qualitative analysis
Video summary (blue bounding boxes)
• Mainly associated with the dog (4 / 5 selected fragments)
• Contains visually diverse fragments showing the dog (3 / 4 are clearly different)
• Contains a fragment showing the dog’s owner (to further increase diversity)
Explanation mask showing the 5 most influential fragments (yellow boxes) for a TVSum video; blue
boxes indicate the 5 (most important) fragments of the video summary
23
Experimental results: Qualitative analysis
Explanation mask showing the 5 most influential fragments (yellow boxes) for a TVSum video; blue
boxes indicate the 5 (most important) fragments of the video summary
Attention mechanism (yellow bounding boxes)
• Pays more attention to parts showing the dog
• Pays less attention to speaking persons, dog products, and the pet store
• Models the video’s context based on the dog
24
Experimental results: Qualitative analysis
Explanation mask showing the 5 most influential fragments (yellow boxes) for a TVSum video; blue
boxes indicate the 5 (most important) fragments of the video summary
Video summary (blue bounding boxes)
• Associated with the motorcycle riders doing tricks (5 / 5 selected fragments)
• Contains visually diverse fragments (all fragments are clearly different)
24
Experimental results: Qualitative analysis
Explanation mask showing the 5 most influential fragments (yellow boxes) for a TVSum video; blue
boxes indicate the 5 (most important) fragments of the video summary
Attention mechanism (yellow bounding boxes)
• Pays more attention to parts showing the tricks made by the riders
• Pays less attention to the logo of the TV-show and the interview
25
Experimental results: Qualitative analysis
Explanation mask showing the 5 most influential fragments (yellow boxes) for a SumMe video; blue
boxes indicate the 5 (most important) fragments of the video summary
Video summary (blue bounding boxes)
• Contains parts showing the bird and the courtyard (e.g., paving, chair)
• Misses parts showing the dog and the bird playing together
25
Experimental results: Qualitative analysis
Explanation mask showing the 5 most influential fragments (yellow boxes) for a SumMe video; blue
boxes indicate the 5 (most important) fragments of the video summary
Attention mechanism (yellow bounding boxes)
• Pays more attention to parts showing the courtyard (3 / 5 fragments)
• Pays less attention to parts showing the dog and the bird playing (1 fragment)
25
Experimental results: Qualitative analysis
Explanation mask showing the 5 most influential fragments (yellow boxes) for a SumMe video; blue
boxes indicate the 5 (most important) fragments of the video summary
Forming explanations as proposed can provide useful clues about the focus
of attention and assist in explaining the video summarization results
26
Concluding remarks
• First attempt at explaining the outcomes of video summarization networks
• Focused on attention-based network architectures and considered several related
explanation signals studied in the NLP domain and elsewhere
• Introduced evaluation measures to assess explanations’ ability to spot the most
and least influential parts of the video, for the network’s predictions
• Modeled network’s input-output relationship using various replacement functions
• Conducted experiments using the CA-SUM network, and SumMe and TVSum
datasets for video summarization
• Using the attention weights to form explanations, as proposed, allows spotting the
focus of the attention mechanism and assists in explaining the summarization results
27
References
• S. N. Aakur et al., “An inherently explainable model for video activity interpretation,” in AAAI 2018
• E. Apostolidis et al., “Summarizing videos using concentrated attention and considering the
uniqueness and diversity of the video frames,” in 2022 ACM ICMR
• S. A. Bargal et al., “Excitation backprop for RNNs,” in CVPR 2018
• G. Chrysostomou et al., “Improving the faithfulness of attention-based explanations with task-
specific information for text classification,” in 2021 ACL Meeting
• N. Gkalelis et al., “ViGAT: Bottom-up event recognition and explanation in video using factorized
graph attention network,” IEEE Access, vol. 10, pp. 108 797–108 816, 2022
• S. Jain et al., “Attention is not Explanation,” in NAACL-HLT 2019
• G. Kobayashi et al., “Attention is not only a weight: Analyzing transformers with vector norms,” in
EMNLP 2020
27
References
• Z. Li et al., “Towards visually explaining video understanding networks with perturbation,” in
IEEE WACV 2021
• Y. Liu et al., “Rethinking attention-model explainability through faithfulness violation test,” in
ICML 2022, vol. 162
• J. Manttari et al., “Interpreting video features: A comparison of 3D conv. networks and conv.
LSTM networks,” in ACCV 2020
• S. Serrano et al., “Is attention interpretable?” in 2019 ACL Meeting
• A. Stergiou et al., “Saliency tubes: Visual explanations for spatiotemporal convolutions,” in IEEE
ICIP 2019
• S. Wiegreffe et al., “Attention is not not explanation,” in EMNLP 2019
• T. Zhuo et al., “Explainable video action reasoning via prior knowledge and state transitions,” in
2019 ACM MM
Thank you for your attention!
Questions?
Vasileios Mezaris, bmezaris@iti.gr
Code and documentation publicly available at:
https://github.com/e-apostolidis/XAI-SUM
This work was supported by the EU's Horizon 2020 research and innovation programme
under grant agreement 951911 AI4Media