Explaining video summarization based on
the focus of attention
E. Apostolidis1,2, G. Balaouras1, V. Mezaris1, I. Patras2
1 Information Technologies Institute, CERTH, Thermi - Thessaloniki, Greece
2 School of EECS, Queen Mary University of London, London, UK
24th IEEE International Symposium
on Multimedia (ISM 2022)
2
• Explainable video summarization: why is it important?
• Related work
• Proposed method
• Experimental evaluations
• Conclusions
Overview
3
Current practice for producing a video summary
• An editor has to watch the entire video
content and decide which parts
should be included in the summary
• Different summaries of the same video
could be needed for distribution via
different communication channels
• Laborious task that can be significantly
accelerated by video summarization
technologies
Image source: https://www.premiumbeat.com/
blog/3-tips-for-difficult-video-edit/
4
Goal of video summarization technologies
This synopsis can be made of:
• A set of representative video key-
fragments (a.k.a. video skim)
• A set of representative video key-frames
(a.k.a. video storyboard)
“Generate a short visual synopsis
that summarizes the video content
by selecting the most informative
and important parts of it”
[Figure: the video content is analyzed into key-fragments and key-frames, which form the video skim and the video storyboard, respectively]
Video title: “Susan Boyle's First Audition -
I Dreamed a Dream - Britain's Got Talent 2009”
Video source: https://www.youtube.com/watch?v=deRF9oEbRso
5
Why is explainable video summarization important?
• Video summarization technologies can
drastically reduce the needed resources
for video summary production in terms
of both time and human effort
• However, their outcome needs to be
curated by the editor to ensure that all
needed parts have been selected
• Content curation could be facilitated if
the editor gets explanations about the
suggestions of the used technology
Image source: https://www.appier.com/en/blog/
what-is-supervised-learning
Such explanations would increase the editor’s trust in the used technology,
thus facilitating and accelerating content curation
6
Works on explainable networks for video analysis tasks
• (Aakur, 2018) extraction of explainable representations for video activity interpretation
• (Bargal, 2018) spatio-temporal cues contributing to network’s classification/captioning
output, to spot fragments linked to specific action/phrase from caption
• (Zhuo, 2019) spatio-temporal graph of semantic-level video states and state transition
analysis for video action reasoning
• (Stergiou, 2019) heatmaps visualizing focus of attention and explaining networks for
action classification and recognition
• (Manttari, 2020) perturbation-based method to spot the video fragment with the
greatest impact on the video classification results
• (Li, 2021) generic perturbation-based method for spatio-temporally-smooth
explanations of video classification networks
• (Gkalelis, 2022) in-degrees of graph attention networks’ adjacency matrices to explain
video event recognition, in terms of salient objects and frames
7
Typical video summarization pipeline
1. Video frames are represented using pre-trained CNNs (e.g., GoogleNet)
2. Video summarization networks estimate the frames’ importance
3. Given a video fragmentation and a time budget, the video summary is formed
by selecting fragments that maximize the summary's total importance (Knapsack problem; see the sketch below)
Proposed method: Problem formulation
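To make step 3 concrete, here is a minimal sketch of the knapsack-style selection, assuming per-fragment importance scores (e.g., the mean of the frame-level scores) and per-fragment lengths are already available; function and variable names are illustrative, not taken from the authors' implementation.

```python
def knapsack_summary(importances, lengths, budget):
    """Pick the subset of fragments that maximizes total importance while the
    summed fragment length stays within `budget` (e.g., 15% of the video)."""
    # dp[c] = (best total importance, chosen fragment indices) for capacity c
    dp = [(0.0, [])] * (budget + 1)
    for i, (imp, length) in enumerate(zip(importances, lengths)):
        new_dp = list(dp)
        for c in range(length, budget + 1):
            candidate = dp[c - length][0] + imp
            if candidate > new_dp[c][0]:
                new_dp[c] = (candidate, dp[c - length][1] + [i])
        dp = new_dp
    return sorted(dp[budget][1])

# hypothetical usage: summary = knapsack_summary(frag_scores, frag_lengths, int(0.15 * n_frames))
```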
8
Explanation’s goal/output
• A video-fragment-level explanation mask indicating the most influential video
fragments for the network’s estimates about the frames’ importance
Assumptions
• Video is split into fixed-size fragments; the summary is made of the M top-scoring ones (see the sketch below)
Proposed method: Problem formulation
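Under these assumptions, both the summary and the explanation mask reduce to picking the M top-scoring fixed-size fragments. The sketch below illustrates this; the 20-frame fragment size matches the evaluation settings later in the deck, and all names are illustrative.

```python
import numpy as np

def fragment_level_scores(frame_scores, frag_size=20):
    """Average frame-level scores over consecutive fixed-size fragments."""
    n_frags = len(frame_scores) // frag_size
    frames = np.asarray(frame_scores[:n_frags * frag_size], dtype=float)
    return frames.reshape(n_frags, frag_size).mean(axis=1)

def top_m_fragments(frame_scores, m=5, frag_size=20):
    """Indices of the M top-scoring fragments: the summary when the scores are
    importance estimates, or the explanation mask when they are explanation scores."""
    scores = fragment_level_scores(frame_scores, frag_size)
    return np.argsort(scores)[::-1][:m]
```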
9
• Studied in the NLP domain [Jain, 2019; Serrano, 2019; Wiegreffe, 2019;
Kobayashi, 2020; Chrysostomou, 2021; Liu, 2022] and elsewhere
• Can it be used for attention-based video summarization networks?
• Various possible explanation signals can be formed using the Attention matrix
Proposed method: Attention as explanation
10
Attention-based explanation signals
• Inherent Attention (IA): $\{a_{i,i}\}_{i=1}^{T}$
• Grad of Attention (GoA): $\{\nabla a_{i,i}\}_{i=1}^{T}$
• Grad Attention (GA): $\{a_{i,i} \cdot \nabla a_{i,i}\}_{i=1}^{T}$
• Input Norm Attention (NA): $\{a_{i,i} \cdot \|\mathbf{v}_i\|\}_{i=1}^{T}$
• Input Norm Grad Attention (NGA): $\{a_{i,i} \cdot \nabla a_{i,i} \cdot \|\mathbf{v}_i\|\}_{i=1}^{T}$
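A minimal sketch of how these five frame-level signals could be computed, assuming the diagonal attention weights, their gradients with respect to the network's output, and the input feature vectors have already been extracted from the summarization network (how those tensors are obtained from CA-SUM is not shown, and the mean-based aggregation to fragment level is an assumption):

```python
import numpy as np

def explanation_signals(attn_diag, attn_grad_diag, inputs):
    """Frame-level explanation signals from the diagonal attention weights (T,),
    their gradients (T,), and the input feature vectors (T, D)."""
    v_norm = np.linalg.norm(inputs, axis=1)          # ||v_i||
    return {
        "IA":  attn_diag,                            # inherent attention
        "GoA": attn_grad_diag,                       # gradient of attention
        "GA":  attn_diag * attn_grad_diag,           # grad-weighted attention
        "NA":  attn_diag * v_norm,                   # input-norm-weighted attention
        "NGA": attn_diag * attn_grad_diag * v_norm,  # norm- and grad-weighted attention
    }

def to_fragment_level(signal, frag_size=20):
    """Aggregate a frame-level signal into fragment-level explanation scores."""
    n_frags = len(signal) // frag_size
    return np.asarray(signal[:n_frags * frag_size]).reshape(n_frags, frag_size).mean(axis=1)
```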
11
Replacement functions
• Slice out: completely removes the specified part
• Input Mask: replaces the specified part with a mask composed of black/white
frames’ feature representations
• Randomization: replaces 50% of the elements of each feature representation
within the specified part
• Attention Mask: sets the attention weights associated with the specified part
equal to zero
Modeling network’s input-output relationship
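A sketch of how the first three replacement functions listed above could be applied to the sequence of frame features; the Attention Mask variant acts inside the network on the attention weights, so it is not shown, and the exact values used for the input mask and for the randomized elements are assumptions rather than the paper's choices.

```python
import numpy as np

def replace_fragment(features, frag_idx, mode, mask_feature=None, rng=None):
    """Apply a replacement function to one fragment.
    features: (T, D) frame representations; frag_idx: frame indices of the fragment."""
    X = features.copy()
    if mode == "slice_out":          # remove the fragment's frames entirely
        return np.delete(X, frag_idx, axis=0)
    if mode == "input_mask":         # e.g., the feature vector of a black/white frame
        X[frag_idx] = mask_feature
    elif mode == "randomization":    # perturb 50% of each frame's feature elements
        rng = rng or np.random.default_rng(0)
        noise = rng.standard_normal(X[frag_idx].shape)
        keep = rng.random(X[frag_idx].shape) >= 0.5
        X[frag_idx] = np.where(keep, X[frag_idx], noise)
    return X
```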
12
• Quantify the k-th video fragment's influence on the network's output, based on
the Difference of Estimates:
$\Delta E(\mathbf{X}, \mathbf{X}^{k}) = \tau(\mathbf{y}, \mathbf{y}^{k})$
• $\mathbf{X}$: original feature vectors
• $\mathbf{X}^{k}$: updated feature vectors after replacing the k-th fragment
• $\mathbf{y}$: network's output for $\mathbf{X}$
• $\mathbf{y}^{k}$: network's output for $\mathbf{X}^{k}$
• $\tau$: Kendall's τ correlation coefficient (see the sketch below)
Quantifying video fragment’s influence
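As a sketch, ΔE can be computed directly with SciPy's Kendall's τ, assuming `model` maps a feature sequence to per-frame importance scores; for the slice-out replacement the scores of the removed frames would also have to be dropped from y before comparing, which is omitted here.

```python
from scipy.stats import kendalltau

def difference_of_estimates(model, X, X_k):
    """Difference of Estimates, following the definition above: ΔE(X, X_k) = τ(y, y_k)."""
    y = model(X)       # frame importance scores for the original features
    y_k = model(X_k)   # scores after replacing the k-th fragment
    tau, _ = kendalltau(y, y_k)
    return tau
```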
13
• Discoverability+ (D+): evaluates whether fragments with high explanation scores have a
significant influence on the network's estimates; D+ = Mean(ΔΕ) after replacing the
top-1%, 5%, 10%, 15%, 20% (batch) and the 5 top-scoring fragments (1-by-1)
• Discoverability- (D-): evaluates whether fragments with low explanation scores have a
small influence on the network's estimates; D- = Mean(ΔΕ) after replacing the bottom-
1%, 5%, 10%, 15%, 20% (batch) and the 5 lowest-scoring fragments (1-by-1)
• Sanity Violation (SV): quantifies the ability of explanations to discriminate
important from unimportant video fragments; SV = % of cases where the sanity
test (D+ > D-) is violated
• Rank Correlation (RC): measures the (Spearman) correlation between fragment-
level explanation scores and the ΔE values obtained after replacing each fragment (see the sketch below)
Experiments: Evaluation measures
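A sketch of how the four measures could be computed for a single video and explanation method, following the definitions above; the inputs (ΔE values collected after replacing the top- and bottom-scoring fragments, and per-fragment explanation scores paired with the ΔE obtained after replacing each fragment) are assumed to have been gathered beforehand, and SV is then the percentage of such cases where the sanity test fails.

```python
import numpy as np
from scipy.stats import spearmanr

def evaluation_measures(de_top, de_bottom, expl_scores, de_per_fragment):
    """D+, D-, sanity-test outcome, and Rank Correlation for one case.
    de_top / de_bottom: ΔE after replacing the top-/bottom-scoring fragments;
    expl_scores / de_per_fragment: per-fragment explanation scores and the ΔE
    obtained after replacing each fragment in turn."""
    d_plus = float(np.mean(de_top))
    d_minus = float(np.mean(de_bottom))
    sanity_ok = d_plus > d_minus            # SV = % of cases where this fails
    rc, _ = spearmanr(expl_scores, de_per_fragment)
    return d_plus, d_minus, sanity_ok, rc
```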
14
SumMe (https://gyglim.github.io/me/vsum/index.html#benchmark)
• 25 videos capturing multiple events (e.g., cooking and sports) from
first-person and third-person view
• Video length: 1 to 6 min
TVSum (https://github.com/yalesong/tvsum)
• 50 videos of various genres (e.g., news, “how-to”, documentary, vlog,
egocentric) from 10 categories of the TRECVid MED task
• Video length: 1 to 11 min
Experiments: Datasets
15
• Frame sampling: 2 fps
• Feature extraction: GoogleNet (pool5 layer) trained on ImageNet
• Highlighted fragments in explanation mask: 5
• Size of video fragments: 20 frames (10 sec)
• Data splits: 5
• Video summarization network: CA-SUM (Apostolidis, 2022); trained models
on SumMe and TVSum, available at: https://zenodo.org/record/6562992
Experiments: Evaluation settings
16
Experimental results: Quantitative analysis
Fragments’ replacement
in batch mode
Experimental results: Quantitative analysis
Fragments’ replacement
in batch mode
Explanations formed using
the attention weights are the
most competitive on both
datasets
On average, they achieve
higher/lower D-/D+ scores
and pass the sanity test in ~66%
and 80% of the cases on SumMe
and TVSum, respectively
20
20
Experimental results: Quantitative analysis
Fragments’ replacement
in batch mode
Explanations formed using
the norm-weighted attention
signals are also good, but less
effective, especially in terms of SV
20
Experimental results: Quantitative analysis
Fragments’ replacement
in batch mode
The use of gradients to form
explanations results in clearly
worse performance
The sanity test is violated in 56%
and 82% of the cases on SumMe
and TVSum, respectively
21
Experimental results: Quantitative analysis
Fragments’ replacement
in 1-by-1 mode
21
Experimental results: Quantitative analysis
Fragments’ replacement
in 1-by-1 mode
Explanations formed using the
attention weights are the
best-performing ones
Pass the sanity test in 65%
and 80% of cases on SumMe
and TVSum, respectively
Assign scores that are more
representative of each
fragments’ influence
21
Experimental results: Quantitative analysis
Fragments’ replacement
in 1-by-1 mode
Explanations formed using
the norm-weighted attention
signals also perform well
21
Experimental results: Quantitative analysis
Fragments’ replacement
in 1-by-1 mode
Explanations formed using
gradients typically perform
worse
Violate the sanity test in 57%
and 80% of cases on SumMe
and TVSum, respectively
Assign explanation scores
that are uncorrelated or negatively
correlated with the fragments'
influence on the network's output
22
Experimental results: Quantitative analysis
Replacement in batch mode Replacement in 1-by-1 mode
The use of inherent attention weights to form explanations
for the CA-SUM model is the best option
23
Experimental results: Qualitative analysis
Video summary (blue bounding boxes)
• Mainly associated with the dog (4 / 5 selected fragments)
• Contains visually diverse fragments showing the dog (3 / 4 are clearly different)
• Contains a fragment showing the dog’s owner (to further increase diversity)
Explanation mask showing the 5 most influential fragments (yellow boxes) for a TVSum video; blue
boxes indicate the 5 (most important) fragments of the video summary
23
Experimental results: Qualitative analysis
Explanation mask showing the 5 most influential fragments (yellow boxes) for a TVSum video; blue
boxes indicate the 5 (most important) fragments of the video summary
Attention mechanism (yellow bounding boxes)
• Pays more attention to parts showing the dog
• Pays less attention to speaking persons, dog products, and the pet store
• Models the video’s context based on the dog
24
Experimental results: Qualitative analysis
Explanation mask showing the 5 most influential fragments (yellow boxes) for a TVSum video; blue
boxes indicate the 5 (most important) fragments of the video summary
Video summary (blue bounding boxes)
• Associated with the motorcycle riders doing tricks (5 / 5 selected fragments)
• Contains visually diverse fragments (all fragments are clearly different)
24
Experimental results: Qualitative analysis
Explanation mask showing the 5 most influential fragments (yellow boxes) for a TVSum video; blue
boxes indicate the 5 (most important) fragments of the video summary
Attention mechanism (yellow bounding boxes)
• Pays more attention to parts showing the tricks made by the riders
• Pays less attention to the logo of the TV-show and the interview
25
Experimental results: Qualitative analysis
Explanation mask showing the 5 most influential fragments (yellow boxes) for a SumMe video; blue
boxes indicate the 5 (most important) fragments of the video summary
Video summary (blue bounding boxes)
• Contains parts showing the bird and the courtyard (e.g., paving, chair)
• Misses parts showing the dog and the bird playing together
25
Experimental results: Qualitative analysis
Explanation mask showing the 5 most influential fragments (yellow boxes) for a SumMe video; blue
boxes indicate the 5 (most important) fragments of the video summary
Attention mechanism (yellow bounding boxes)
• Pays more attention to parts showing the courtyard (3 / 5 fragments)
• Pays less attention to parts showing the dog and the bird playing (1 fragment)
25
Experimental results: Qualitative analysis
Explanation mask showing the 5 most influential fragments (yellow boxes) for a SumMe video; blue
boxes indicate the 5 (most important) fragments of the video summary
Forming explanations as proposed can provide useful clues about the focus
of attention and assist in explaining the video summarization results
26
Concluding remarks
• First attempt at explaining the outcomes of video summarization networks
• Focused on attention-based network architectures and considered several related
explanation signals studied in the NLP domain and elsewhere
• Introduced evaluation measures to assess explanations’ ability to spot the most
and least influential parts of the video, for the network’s predictions
• Modeled network’s input-output relationship using various replacement functions
• Conducted experiments using the CA-SUM network, and SumMe and TVSum
datasets for video summarization
• Using the attention weights to form explanations, as proposed, allows spotting the
focus of the attention mechanism and assists in explaining the summarization results
27
References
• S. N. Aakur et al., “An inherently explainable model for video activity interpretation,” in AAAI 2018
• E. Apostolidis et al., “Summarizing videos using concentrated attention and considering the
uniqueness and diversity of the video frames,” in 2022 ACM ICMR
• S. A. Bargal et al., “Excitation backprop for RNNs,” in CVPR 2018
• G. Chrysostomou et al., “Improving the faithfulness of attention-based explanations with task-
specific information for text classification,” in 2021 ACL Meeting
• N. Gkalelis et al., “ViGAT: Bottom-up event recognition and explanation in video using factorized
graph attention network,” IEEE Access, vol. 10, pp. 108 797–108 816, 2022
• S. Jain et al., “Attention is not Explanation,” in NAACL-HLT 2019
• G. Kobayashi et al., “Attention is not only a weight: Analyzing transformers with vector norms,” in
EMNLP 2020
27
References
• Z. Li et al., “Towards visually explaining video understanding networks with perturbation,” in
IEEE WACV 2021
• Y. Liu et al., “Rethinking attention-model explainability through faithfulness violation test,” in
ICML 2022, vol. 162
• J. Manttari et al., “Interpreting video features: A comparison of 3D conv. networks and conv.
LSTM networks,” in ACCV 2020
• S. Serrano et al., “Is attention interpretable?” in 2019 ACL Meeting
• A. Stergiou et al., “Saliency tubes: Visual explanations for spatiotemporal convolutions,” in IEEE
ICIP 2019
• S. Wiegreffe et al., “Attention is not not explanation,” in EMNLP 2019
• T. Zhuo et al., “Explainable video action reasoning via prior knowledge and state transitions,” in
2019 ACM MM
Thank you for your attention!
Questions?
Vasileios Mezaris, bmezaris@iti.gr
Code and documentation publicly available at:
https://github.com/e-apostolidis/XAI-SUM
This work was supported by the EU's Horizon 2020 research and innovation programme
under grant agreement 951911 AI4Media