Combining Global and Local Attention with Positional Encoding
for Video Summarization
E. Apostolidis1,2, G. Balaouras1, V. Mezaris1, I. Patras2
1 Information Technologies Institute, CERTH, Thermi - Thessaloniki, Greece
2 School of EECS, Queen Mary University of London, London, UK
23rd IEEE International Symposium
on Multimedia (ISM 2021)
Outline
• Problem statement
• Related work
• Developed approach
• Experiments
• Conclusions
1
Problem statement
2
Video is everywhere!
• Captured by smart devices and
instantly shared online
• Constantly and rapidly increasing
volumes of video content on the Web
Hours of video content uploaded on
YouTube every minute
Image sources: https://www.financialexpress.com/india-news/govt-agencies-adopt-new-age-video-
sharing-apps-like-tiktok/1767354/ (left) & https://www.statista.com/ (right)
Problem statement
3
But how to find what we are looking for in endless collections of video content?
Quickly inspect a video’s
content by checking its
synopsis!
Image source: https://www.voicendata.com/sprint-removes-video-streaming-limits/
Goal of video summarization technologies
4
“Generate a short visual synopsis
that summarizes the video content
by selecting the most informative
and important parts of it”
Video title: “Susan Boyle's First Audition - I Dreamed a
Dream - Britain's Got Talent 2009”
Video source: OVP dataset (video also available at:
https://www.youtube.com/watch?v=deRF9oEbRso)
[Figure: analysis of the video content yields a static video summary, made of selected key-frames, and a dynamic video summary, made of selected key-fragments]
This synopsis can be:
• Static, composed of a set of representative
video (key-)frames (a.k.a. a video storyboard)
• Dynamic, formed of a set of representative
video (key-)fragments (a.k.a. a video skim)
Related work (supervised approaches)
• Methods modeling the variable-range temporal dependence among frames, using:
• Structures of Recurrent Neural Networks (RNNs)
• Combinations of LSTMs with external storage or memory layers
• Attention mechanisms combined with classic or seq2seq RNN-based architectures
• Methods modeling the spatiotemporal structure of the video, using:
• 3D-Convolutional Neural Networks
• Combinations of CNNs with convolutional LSTMs, GRUs and optical flow maps
• Methods learning summarization using Generative Adversarial Networks
• The Summarizer aims to fool the Discriminator, which tries to distinguish machine- from human-generated summaries
• Methods modeling frames’ dependence by combining self-attention mechanisms, with:
• Trainable Regressor Networks that estimate frames’ importance
• Fragmentations of the video content in hierarchical key-frame selection strategies
• Multiple representations of the visual content (CNN-based & Inflated 3D ConvNet)
5
Developed approach
Starting point
• VASNet model (Fajtl et al., 2018)
• Seq2seq transformation using a soft
self-attention mechanism
Main weaknesses
• Lack of knowledge about the temporal
position of video frames
• Growing difficulty to estimate frames’
importance as video duration increases
6
J. Fajtl, H. S. Sokeh, V. Argyriou, D. Monekosso, P. Remagnino, “Summarizing Videos with Attention,” in
Asian Conference on Computer Vision 2018 Workshops. Cham: Springer Int. Publishing, 2018, pp. 39-54.
Image source: Fajtl et al., “Summarizing Videos with Attention”,
ACCV 2018 Workshops
Developed approach
7
New network architecture (PGL-SUM)
• Uses a multi-head attention mechanism
to model frames’ dependence
according to the entire frame sequence
• Uses multiple multi-head attention
mechanisms to model short-term
dependencies over smaller video parts
• Enhances these mechanisms by adding
a component that encodes the
temporal position of video frames
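The global/local attention scheme described above can be sketched in a few lines of numpy. This is a toy illustration, not the authors' implementation: each mechanism is reduced to a single attention head, fusion is by addition (one of the options evaluated later), and the array sizes are made up.

```python
import numpy as np

def positional_encoding(n, d):
    """Absolute sinusoidal positional encoding over n positions, d dims."""
    pos = np.arange(n)[:, None]
    i = np.arange(d)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def self_attention(x):
    """Plain soft self-attention over a sequence of frame features."""
    scores = x @ x.T / np.sqrt(x.shape[1])          # frame-to-frame similarity
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)               # softmax over keys
    return w @ x

def global_local_attention(x, n_segments=4):
    """Global attention over the whole video, plus local attention within
    each of n_segments non-overlapping video parts, fused by addition."""
    x = x + positional_encoding(*x.shape)           # inject temporal position
    global_out = self_attention(x)
    local_out = np.zeros_like(x)
    for seg in np.array_split(np.arange(len(x)), n_segments):
        local_out[seg] = self_attention(x[seg])     # short-term dependencies
    return global_out + local_out                   # addition fusion

feats = np.random.rand(32, 16)    # 32 frames, 16-d features (toy sizes)
out = global_local_attention(feats)
print(out.shape)                  # (32, 16)
```

In the actual model each mechanism is multi-head (8 global and 4 local heads in the reported configuration) with learned projections; the single-head form above only conveys the global-vs-local scoping idea.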
Developed approach
11
Feature fusion & importance estimation
• New representation that carries
information about each frame’s global
and local dependencies
• Residual skip connection aims to
facilitate backpropagation
• Regressor produces a set of frame-level
scores that indicate frames’ importance
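The fusion-and-regression step can be sketched as follows; the single linear layer standing in for the Regressor Network and all shapes are hypothetical stand-ins.

```python
import numpy as np

def regressor_head(fused, x, w, b):
    """Residual skip connection over the fused representation, followed by a
    toy linear regressor mapping each frame to an importance score in [0, 1]."""
    h = fused + x                              # residual skip eases backprop
    return 1.0 / (1.0 + np.exp(-(h @ w + b)))  # sigmoid -> frame importance

rng = np.random.default_rng(0)
n, d = 32, 16
x = rng.random((n, d))         # original frame features
fused = rng.random((n, d))     # global+local attention output (stand-in)
w, b = rng.standard_normal(d), 0.0
scores = regressor_head(fused, x, w, b)
print(scores.shape)            # (32,) — one importance score per frame
```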
Developed approach
12
Training time
• Compute MSE between machine and
human frame-level importance scores
Inference time
• Compute fragment-level importance
based on a video segmentation
• Create the summary by selecting the
key-fragments based on a time budget
and by solving the Knapsack problem
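The inference-time selection can be illustrated with a standard 0/1 knapsack over fragments: maximize total fragment-level importance subject to a duration budget. The fragment values and integer lengths below are toy inputs; in the real pipeline they come from the frame scores and the video segmentation.

```python
def knapsack_select(values, lengths, budget):
    """0/1 knapsack: pick the set of fragments with maximum total importance
    whose total length does not exceed the budget. Returns fragment indices."""
    n = len(values)
    best = [0.0] * (budget + 1)                    # best value per capacity
    keep = [[False] * n for _ in range(budget + 1)]
    for i in range(n):
        for c in range(budget, lengths[i] - 1, -1):  # descending: 0/1 variant
            cand = best[c - lengths[i]] + values[i]
            if cand > best[c]:
                best[c] = cand
                keep[c] = keep[c - lengths[i]][:]
                keep[c][i] = True
    return [i for i in range(n) if keep[budget][i]]

importance = [0.9, 0.2, 0.7, 0.5]   # mean frame score per fragment (toy)
length = [4, 3, 5, 2]               # fragment durations in seconds (toy)
selected = knapsack_select(importance, length, budget=7)
print(selected)  # → [0, 3]
```

With a 7-second budget the solver skips the second-most-important fragment (index 2, too long) in favor of the combination with the highest total importance that fits.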
Experiments
13
Datasets
• SumMe (https://gyglim.github.io/me/vsum/index.html#benchmark)
• 25 videos capturing multiple events (e.g. cooking and sports)
• Video length: 1 to 6 min
• Annotation: fragment-based video summaries (15-18 per video)
• TVSum (https://github.com/yalesong/tvsum)
• 50 videos from 10 categories of TRECVid MED task
• Video length: 1 to 11 min
• Annotation: frame-level importance scores (20 per video)
Experiments
14
Evaluation approach
• The generated summary should not exceed 15% of the video length
• Agreement between an automatically-generated summary (A) and a user-defined summary (U) is
expressed by the F-Score (%), with Precision P = |A ∩ U| / |A| and Recall R = |A ∩ U| / |U|
measuring their temporal overlap (∩), where | · | denotes duration, and F = 2 · P · R / (P + R)
• 80% of video samples are used for training and the remaining 20% for testing
• Model selection is based on a criterion that looks for rapid changes in the training-loss curve
or, if no such changes occur, on the minimization of the loss value
• Summarization performance is computed by averaging the experimental results over five
randomly-created splits of the data
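The overlap-based F-Score above can be computed directly from fragment boundaries. Here each summary is a list of (start, end) pairs in seconds, assumed non-overlapping within each summary; the toy fragments are illustrative.

```python
def temporal_fscore(auto_segs, user_segs):
    """F-Score between an automatic summary A and a user summary U:
    P = |A ∩ U| / |A|, R = |A ∩ U| / |U|, F = 2PR / (P + R),
    with | · | the total duration of a set of (start, end) fragments."""
    def duration(segs):
        return sum(e - s for s, e in segs)
    overlap = 0.0
    for a_s, a_e in auto_segs:
        for u_s, u_e in user_segs:
            overlap += max(0.0, min(a_e, u_e) - max(a_s, u_s))
    p = overlap / duration(auto_segs)
    r = overlap / duration(user_segs)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

auto = [(0, 10), (40, 50)]   # 20 s selected by the model (toy)
user = [(5, 15), (45, 55)]   # 20 s annotated by a user (toy)
print(temporal_fscore(auto, user))  # overlap = 10 s → 0.5
```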
Experiments
15
Implementation details
• Videos were down-sampled to 2 fps
• Feature extraction: pool5 layer of GoogleNet trained on ImageNet (D = 1024)
• # local attention mechanisms = # video segments M: 4
• # global and local attention heads: 8 and 4 respectively
• Learning rate and L2 regularization factor: 5×10⁻⁵ and 10⁻⁵ respectively
• Dropout rate: 0.5
• Network initialization approach: Xavier uniform (gain = √2; biases = 0.1)
• Training: full batch mode; Adam optimizer; 200 epochs
• Hardware: NVIDIA TITAN XP
Experiments
Sensitivity analysis
• # video segments (= # local attention mechanisms) & global-local data fusion approach

                      SumMe (# segments)     TVSum (# segments)
  Fusion               2     4     8          2     4     8
  Addition            49.8  55.6  51.9       61.1  61.7  59.5
  Average pooling     51.5  54.1  52.5       61.1  59.7  58.7
  Max pooling         51.1  53.9  52.2       61.3  59.6  58.5
  Multiplication      46.3  52.1  52.5       47.3  47.6  46.9

• # local and global attention heads

                      SumMe (# local heads)  TVSum (# local heads)
  # Global heads       2     4     8          2     4     8
   2                  52.4  52.4  52.8       61.4  61.6  61.4
   4                  54.9  58.5  49.2       60.9  61.3  60.4
   8                  55.8  58.8  57.1       60.3  61.1  60.9
  16                  56.7  57.7  54.9       61.3  60.9  60.1

Reported values represent F-Score (%)
Experiments
Performance comparisons with two attention-based methods
• Evaluation was made under the exact same experimental conditions

  Method                        SumMe F1  Rank   TVSum F1  Rank   Avg Rank   Data splits
  VASNet (Fajtl et al., 2018)     50.0     3       62.5     2       2.5       5 Rand
  MSVA (Ghauri et al., 2021)      54.0     2       62.4     3       2.5       5 Rand
  PGL-SUM (Ours)                  57.1     1       62.7     1       1         5 Rand

F1 denotes F-Score (%)

J. Fajtl et al., “Summarizing Videos with Attention,” in Asian Conf. on Computer Vision 2018 Workshops.
Cham: Springer Int. Publishing, 2018, pp. 39–54.
J. Ghauri et al., “Supervised Video Summarization via Multiple Feature Sets with Parallel Attention,”
in 2021 IEEE Int. Conf. on Multimedia and Expo. CA, USA: IEEE, 2021, pp. 1–6.
Experiments
Performance comparisons with SoA supervised methods

  Method           SumMe F1  Rank   TVSum F1  Rank   Avg Rank   Data splits
  Random             40.2     19      54.4     16      17.5      -
  vsLSTM             37.6     22      54.2     17      19.5      1 Rand
  dppLSTM            38.6     21      54.7     15      18        1 Rand
  ActionRanking      40.1     20      56.3     14      17        1 Rand
  *vsLSTM+Att        43.2     16      -        -       16        1 Rand
  *dppLSTM+Att       43.8     15      -        -       15        1 Rand
  H-RNN              42.1     17      57.9     12      14.5      -
  *A-AVS             43.9     14      59.4     8       11        5 Rand
  *SF-CVS            46.0     9       58.0     11      10        -
  SUM-FCN            47.5     7       56.8     13      10        M Rand
  HSA-RNN            44.1     13      59.8     7       10        -
  CRSum              47.3     8       58.0     11      9.5       5 FCV
  MAVS               40.3     18      66.8     1       9.5       5 FCV
  TTH-RNN            44.3     12      60.2     6       9         -
  *M-AVS             44.4     11      61.0     4       7.5       5 Rand
  SUM-DeepLab        48.8     5       58.4     10      7.5       M Rand
  *DASP              45.5     10      63.6     3       6.5       5 Rand
  *SUM-GDA           52.8     3       58.9     9       6         5 FCV
  SMLD               47.6     6       61.0     4       5         5 FCV
  *H-MAN             51.8     4       60.4     5       4.5       5 FCV
  SMN                58.3     1       64.5     2       1.5       1 Rand
  PGL-SUM            55.6     2       61.0     4       3         5 Rand

F1 denotes F-Score (%)
Experiments
Ablation study
Core components

  Variant              Global att.  Local att.  Pos. encoding   SumMe   TVSum
  Variant #1               X            √            √           46.9    52.4
  Variant #2               √            X            √           46.7    59.9
  Variant #3               √            √            X           53.1    61.0
  PGL-SUM (Proposed)       √            √            √           55.6    61.0

Reported values represent F-Score (%)
• Combining global and local attention enables learning a better modeling of frames’ dependencies
• Integrating knowledge about the frames’ position positively affects the summarization performance
Conclusions
20
• PGL-SUM model for supervised video summarization aims to improve
• modeling of long-range frames’ dependencies
• parallelization ability of the training process
• granularity level at which the temporal dependencies between frames are modeled
• PGL-SUM contains multiple attention mechanisms combined with a component that encodes the frames’ temporal position
• A global multi-head attention mechanism models frames’ dependence based on the entire video
• Multiple local multi-head attention mechanisms model frames’ dependencies by focusing on smaller parts of the video
• Experiments on two benchmark datasets (SumMe and TVSum)
• Showed PGL-SUM’s superiority over other methods that rely on self-attention mechanisms
• Demonstrated PGL-SUM’s competitiveness against other SoA supervised summarization approaches
• Documented the positive contribution of combining global and local multi-head attention with
absolute positional encoding
Thank you for your attention!
Questions?
Evlampios Apostolidis, apostolid@iti.gr
Code and documentation publicly available at:
https://github.com/e-apostolidis/PGL-SUM
This work was supported by the EU’s Horizon 2020 research and innovation programme under
grant agreement H2020-832921 MIRROR, and by EPSRC under grant No. EP/R026424/1

Porella : features, morphology, anatomy, reproduction etc.Silpa
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryAlex Henderson
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxseri bangash
 
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Silpa
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bSérgio Sacani
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .Poonam Aher Patil
 
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.Silpa
 
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....muralinath2
 
Exploring Criminology and Criminal Behaviour.pdf
Exploring Criminology and Criminal Behaviour.pdfExploring Criminology and Criminal Behaviour.pdf
Exploring Criminology and Criminal Behaviour.pdfrohankumarsinghrore1
 

Último (20)

Introduction of DNA analysis in Forensic's .pptx
Introduction of DNA analysis in Forensic's .pptxIntroduction of DNA analysis in Forensic's .pptx
Introduction of DNA analysis in Forensic's .pptx
 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdf
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
 
Site Acceptance Test .
Site Acceptance Test                    .Site Acceptance Test                    .
Site Acceptance Test .
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
 
GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)
 
Stages in the normal growth curve
Stages in the normal growth curveStages in the normal growth curve
Stages in the normal growth curve
 
Chemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdfChemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdf
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
 
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort ServiceCall Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
 
Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptx
 
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .
 
Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
 
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
 
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
 
Exploring Criminology and Criminal Behaviour.pdf
Exploring Criminology and Criminal Behaviour.pdfExploring Criminology and Criminal Behaviour.pdf
Exploring Criminology and Criminal Behaviour.pdf
 

• Combinations of LSTMs with external storage or memory layers
• Attention mechanisms combined with classic or seq2seq RNN-based architectures
• Methods modeling the spatiotemporal structure of the video, using:
• 3D-Convolutional Neural Networks
• Combinations of CNNs with convolutional LSTMs, GRUs and optical flow maps
• Methods learning summarization using Generative Adversarial Networks
• The Summarizer aims to fool the Discriminator when distinguishing a machine-generated from a human-generated summary
• Methods modeling frames’ dependence by combining self-attention mechanisms with:
• Trainable Regressor Networks that estimate frames’ importance
• Fragmentations of the video content in hierarchical key-frame selection strategies
• Multiple representations of the visual content (CNN-based & Inflated 3D ConvNet)
5
Developed approach
6
Starting point
• VASNet model (Fajtl et al., 2018)
• Seq2seq transformation using a soft self-attention mechanism
Main weaknesses
• Lack of knowledge about the temporal position of video frames
• Growing difficulty to estimate frames’ importance as video duration increases
J. Fajtl, H. S. Sokeh, V. Argyriou, D. Monekosso, P. Remagnino, “Summarizing Videos with Attention,” in Asian Conference on Computer Vision 2018 Workshops. Cham: Springer Int. Publishing, 2018, pp. 39-54.
Image source: Fajtl et al., “Summarizing Videos with Attention”, ACCV 2018 Workshops
Developed approach
7
New network architecture (PGL-SUM)
• Uses a multi-head attention mechanism to model frames’ dependence according to the entire frame sequence
• Uses multiple multi-head attention mechanisms to model short-term dependencies over smaller video parts
• Enhances these mechanisms by adding a component that encodes the temporal position of video frames
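The positional-encoding component can be sketched with the fixed sinusoidal absolute encoding of the Transformer; this is an illustrative stand-in, not the model's actual code (the function name and shapes here are assumptions):

```python
import numpy as np

def positional_encoding(num_frames, dim):
    """Fixed sinusoidal absolute positional encoding, one row per frame."""
    positions = np.arange(num_frames)[:, None]                       # (T, 1)
    freqs = np.exp(np.arange(0, dim, 2) * (-np.log(10000.0) / dim))  # (dim/2,)
    pe = np.zeros((num_frames, dim))
    pe[:, 0::2] = np.sin(positions * freqs)                          # even dims
    pe[:, 1::2] = np.cos(positions * freqs)                          # odd dims
    return pe

# Added to the frame features before the attention mechanisms:
# features = features + positional_encoding(len(features), features.shape[1])
```

Each frame thus receives a unique, deterministic position signature, so the (otherwise order-agnostic) attention mechanisms can distinguish temporal positions.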
Developed approach
8
Global attention
• Uses a multi-head attention mechanism to model frames’ dependence according to the entire frame sequence
Developed approach
9
Local attention
• Uses multiple multi-head attention mechanisms to model short-term dependencies over smaller video parts
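A minimal single-head sketch of the global/local scheme, assuming identity query/key/value projections (the actual model uses multi-head attention with learned projections):

```python
import numpy as np

def attention(x):
    """Scaled dot-product self-attention with identity Q/K/V projections."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

def global_local_attention(frames, num_segments=4):
    """Global attention over the whole sequence; local attention per segment."""
    global_out = attention(frames)                           # all-to-all
    local_out = np.zeros_like(frames)
    for seg in np.array_split(np.arange(len(frames)), num_segments):
        local_out[seg] = attention(frames[seg])              # within-segment only
    return global_out, local_out
```

The local branch restricts each frame's attention to its own segment, which is what lets the model capture short-term dependencies at a finer granularity than the global branch.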
Developed approach
10
Feature fusion & importance estimation
• New representation that carries information about each frame’s global and local dependencies
• A residual skip connection aims to facilitate backpropagation
• A Regressor produces a set of frame-level scores that indicate the frames’ importance
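A toy illustration of this step, assuming addition-based fusion (the best-performing variant in the sensitivity analysis below) and a one-layer linear map standing in for the trainable Regressor network:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse_and_score(frames, global_out, local_out, w, b):
    """Addition-based fusion plus residual skip, then a linear 'regressor'."""
    fused = global_out + local_out + frames  # residual skip connection
    return sigmoid(fused @ w + b)            # frame-level importance in (0, 1)
```

Here `w` (shape D) and `b` are illustrative parameters; the actual Regressor is a deeper trainable component.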
Developed approach
11
Training time
• Compute the MSE between the machine- and human-provided frame-level importance scores
Inference time
• Compute fragment-level importance based on a video segmentation
• Create the summary by selecting the key-fragments based on a time budget and by solving the Knapsack problem
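The key-fragment selection is a standard 0/1 Knapsack; a dynamic-programming sketch, assuming integer fragment durations:

```python
def select_fragments(scores, durations, budget):
    """Pick key-fragments maximizing total importance within a duration budget
    (0/1 Knapsack, dynamic programming over integer capacities)."""
    # dp[c] = (best total importance, chosen fragment indices) at capacity c
    dp = [(0.0, []) for _ in range(budget + 1)]
    for i, (score, dur) in enumerate(zip(scores, durations)):
        for c in range(budget, dur - 1, -1):  # reverse: each fragment used once
            value = dp[c - dur][0] + score
            if value > dp[c][0]:
                dp[c] = (value, dp[c - dur][1] + [i])
    return sorted(dp[budget][1])
```

Here `budget` would be set to the time budget of the summary, e.g. 15% of the video duration as in the evaluation protocol.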
Experiments
12
Datasets
• SumMe (https://gyglim.github.io/me/vsum/index.html#benchmark)
• 25 videos capturing multiple events (e.g. cooking and sports)
• Video length: 1 to 6 min
• Annotation: fragment-based video summaries (15-18 per video)
• TVSum (https://github.com/yalesong/tvsum)
• 50 videos from 10 categories of the TRECVid MED task
• Video length: 1 to 11 min
• Annotation: frame-level importance scores (20 per video)
Experiments
13
Evaluation approach
• The generated summary should not exceed 15% of the video length
• Agreement between the automatically-generated (A) and a user-defined (U) summary is expressed by the F-Score (%), with (P)recision and (R)ecall measuring their temporal overlap (∩) (|| || denotes duration)
• 80% of the video samples are used for training and the remaining 20% for testing
• Model selection is based on a criterion that focuses on rapid changes in the curve of the training loss values or, if such changes do not occur, the minimization of the loss value
• Summarization performance is formed by averaging the experimental results on five randomly-created data splits
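With summaries represented as binary frame-selection vectors, the F-Score computation can be sketched as follows (a simplified illustration; the benchmark protocols additionally aggregate over the multiple user summaries per video):

```python
def f_score(auto_summary, user_summary):
    """F-Score (%) between two binary frame-selection vectors;
    Precision and Recall measure their temporal overlap."""
    overlap = sum(a and u for a, u in zip(auto_summary, user_summary))
    if overlap == 0:
        return 0.0
    precision = overlap / sum(auto_summary)  # |A ∩ U| / |A|
    recall = overlap / sum(user_summary)     # |A ∩ U| / |U|
    return 100 * 2 * precision * recall / (precision + recall)
```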
Experiments
14
Implementation details
• Videos were down-sampled to 2 fps
• Feature extraction: pool5 layer of GoogleNet trained on ImageNet (D = 1024)
• # local attention mechanisms = # video segments M: 4
• # global and local attention heads: 8 and 4, respectively
• Learning rate and L2 regularization factor: 5 x 10^-5 and 10^-5, respectively
• Dropout rate: 0.5
• Network initialization approach: Xavier uniform (gain = √2; biases = 0.1)
• Training: full batch mode; Adam optimizer; 200 epochs
• Hardware: NVIDIA TITAN XP
Experiments
15
Sensitivity analysis
• # video segments (= # local attention mechanisms) & global-local data fusion approach

                         SumMe              TVSum
Fusion \ # Segm.      2     4     8      2     4     8
Addition            49.8  55.6  51.9   61.1  61.7  59.5
Average pooling     51.5  54.1  52.5   61.1  59.7  58.7
Max pooling         51.1  53.9  52.2   61.3  59.6  58.5
Multiplication      46.3  52.1  52.5   47.3  47.6  46.9

• # local and global attention heads

                         SumMe              TVSum
Local \ Global        2     4     8      2     4     8
2                   52.4  52.4  52.8   61.4  61.6  61.4
4                   54.9  58.5  49.2   60.9  61.3  60.4
8                   55.8  58.8  57.1   60.3  61.1  60.9
16                  56.7  57.7  54.9   61.3  60.9  60.1

Reported values represent F-Score (%)
Experiments
16
Performance comparisons with two attention-based methods
• Evaluation was made under the exact same experimental conditions

Method                        SumMe F1  Rank   TVSum F1  Rank   Avg. Rank  Data splits
VASNet (Fajtl et al., 2018)     50.0      3      62.5      2       2.5       5 Rand
MSVA (Ghauri et al., 2021)      54.0      2      62.4      3       2.5       5 Rand
PGL-SUM (Ours)                  57.1      1      62.7      1       1         5 Rand

F1 denotes F-Score (%)
J. Fajtl et al., “Summarizing Videos with Attention,” in Asian Conf. on Comp. Vision 2018 Workshops. Cham: Springer Int. Publishing, 2018, pp. 39–54.
J. Ghauri et al., “Supervised Video Summarization Via Multiple Feature Sets with Parallel Attention,” in 2021 IEEE Int. Conf. on Multimedia and Expo. CA, USA: IEEE, 2021, pp. 1–6.
Experiments
17
Performance comparisons with SoA supervised methods

Method          SumMe F1  Rank   TVSum F1  Rank   Avg. Rank  Data splits
Random            40.2     19      54.4     16      17.5        -
vsLSTM            37.6     22      54.2     17      19.5      1 Rand
dppLSTM           38.6     21      54.7     15      18        1 Rand
ActionRanking     40.1     20      56.3     14      17        1 Rand
*vsLSTM+Att       43.2     16       -        -      16        1 Rand
*dppLSTM+Att      43.8     15       -        -      15        1 Rand
H-RNN             42.1     17      57.9     12      14.5        -
*A-AVS            43.9     14      59.4      8      11        5 Rand
*SF-CVS           46.0      9      58.0     11      10          -
SUM-FCN           47.5      7      56.8     13      10        M Rand
HSA-RNN           44.1     13      59.8      7      10          -
CRSum             47.3      8      58.0     11       9.5      5 FCV
MAVS              40.3     18      66.8      1       9.5      5 FCV
TTH-RNN           44.3     12      60.2      6       9          -
*M-AVS            44.4     11      61.0      4       7.5      5 Rand
SUM-DeepLab       48.8      5      58.4     10       7.5      M Rand
*DASP             45.5     10      63.6      3       6.5      5 Rand
*SUM-GDA          52.8      3      58.9      9       6        5 FCV
SMLD              47.6      6      61.0      4       5        5 FCV
*H-MAN            51.8      4      60.4      5       4.5      5 FCV
SMN               58.3      1      64.5      2       1.5      1 Rand
PGL-SUM           55.6      2      61.0      4       3        5 Rand
Experiments
18
Ablation study

Variant              Global attention  Local attention  Positional encoding  SumMe  TVSum
Variant #1                  X                 √                  √            46.9   52.4
Variant #2                  √                 X                  √            46.7   59.9
Variant #3                  √                 √                  X            53.1   61.0
PGL-SUM (Proposed)          √                 √                  √            55.6   61.0

Reported values represent F-Score (%)
• Combining global and local attention allows learning a better modeling of the frames’ dependencies
• Integrating knowledge about the frames’ position positively affects the summarization performance
Conclusions
19
• The PGL-SUM model for supervised video summarization aims to improve:
• the modeling of long-range dependencies between frames
• the parallelization ability of the training process
• the granularity level at which the temporal dependencies between frames are modeled
• PGL-SUM contains multiple attention mechanisms and encodes the frames’ position
• A global multi-head attention mechanism models frames’ dependence based on the entire video
• Multiple local multi-head attention mechanisms model frames’ dependencies by focusing on smaller parts of the video
• Experiments on two benchmark datasets (SumMe and TVSum):
• Showed PGL-SUM’s superiority over other methods that rely on self-attention mechanisms
• Demonstrated PGL-SUM’s competitiveness against other SoA supervised summarization approaches
• Documented the positive contribution of combining global and local multi-head attention with absolute positional encoding
Thank you for your attention! Questions?
Evlampios Apostolidis, apostolid@iti.gr
Code and documentation publicly available at: https://github.com/e-apostolidis/PGL-SUM
This work was supported by the EU’s Horizon 2020 research and innovation programme under grant agreement H2020-832921 MIRROR, and by EPSRC under grant No. EP/R026424/1