The document proposes TxV, a method for retrieving videos from natural-language text queries. TxV combines textual and visual features extracted by pretrained models such as BERT, CLIP, and ResNet through an identity layer, and applies a Dual Softmax Inference technique that revises the initial text-video similarities using a set of background queries. Experiments on the MSR-VTT, IACC.3, and V3C1 datasets show that TxV outperforms state-of-the-art methods on metrics such as mean extended inferred average precision (xinfAP) and recall. The approach provides an efficient way to combine multiple features for the text-video retrieval task.
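The summary does not spell out TxV's exact Dual Softmax Inference formulation (in particular how the background queries enter it), but the general dual-softmax idea is to re-rank a query-by-video similarity matrix by normalizing it along both axes, so a video is rewarded only when it is a mutual best match for a query. A minimal sketch of one common variant, with a hypothetical `temp` temperature parameter, could look like this:

```python
import numpy as np


def dual_softmax_rerank(sim: np.ndarray, temp: float = 10.0) -> np.ndarray:
    """Re-rank a (num_queries, num_videos) similarity matrix with dual softmax.

    The raw scores are softmax-normalized along the video axis (per query)
    and along the query axis (per video); the elementwise product sharpens
    scores for pairs that rank each other highly in both directions.
    """
    scaled = temp * sim
    # Softmax over videos for each query (rows), with max-shift for stability.
    q2v = np.exp(scaled - scaled.max(axis=1, keepdims=True))
    q2v /= q2v.sum(axis=1, keepdims=True)
    # Softmax over queries for each video (columns).
    v2q = np.exp(scaled - scaled.max(axis=0, keepdims=True))
    v2q /= v2q.sum(axis=0, keepdims=True)
    return q2v * v2q
```

In a setting like the paper's, the extra background queries would be appended as additional rows of `sim` before re-ranking, so the column-wise softmax normalizes each video's score against them as well; that detail is an assumption here, not taken from the summary.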