Multi modal fusion for video retrieval based on CLIP guide feature alignment

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


Abstract

With the rise of short video platforms, a large amount of video data is generated daily. These videos vary in quality and are often poorly tagged. Fully exploiting the multimodal information in videos, bridging the gaps between modalities, and achieving precise retrieval is a major challenge in the field of video retrieval. This paper presents a novel approach to multimodal video retrieval that aims to boost search precision by incorporating visual, textual, and audio information through the CLIP model and T5. Tackling the issue of retrieving pertinent content from large, untagged video repositories, we propose a method that fuses multimodal data through innovative feature extraction and alignment techniques. Our method achieves performance close to the current state of the art, demonstrating its effectiveness in improving search accuracy on the MSR-VTT benchmark.
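The abstract does not detail the fusion mechanism, so the sketch below is only one plausible reading of a CLIP-plus-T5 retrieval pipeline, not the authors' implementation: CLIP encodes sampled frames and the search query, T5 encodes auxiliary text (e.g. a transcript), a small projection layer aligns the T5 output with CLIP's joint embedding space, and the two video-side features are fused by a weighted average before cosine-similarity ranking. The model checkpoints, the proj layer, the alpha weight, and the encode_video/encode_query helpers are all illustrative assumptions.

import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor, T5EncoderModel, T5Tokenizer

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
t5 = T5EncoderModel.from_pretrained("t5-small").eval()
t5_tok = T5Tokenizer.from_pretrained("t5-small")

# Hypothetical linear layer aligning T5's hidden size with CLIP's joint
# embedding space; in the paper this alignment would be learned, here it
# is untrained and purely illustrative.
proj = torch.nn.Linear(t5.config.d_model, clip.config.projection_dim)

@torch.no_grad()
def encode_video(frames, transcript, alpha=0.7):
    # Mean-pool CLIP image features over sampled frames (frames: list of PIL images).
    pixels = processor(images=frames, return_tensors="pt").pixel_values
    visual = clip.get_image_features(pixel_values=pixels).mean(dim=0)
    # Encode side text (e.g. an ASR transcript) with T5, project into CLIP space.
    ids = t5_tok(transcript, return_tensors="pt", truncation=True).input_ids
    textual = proj(t5(input_ids=ids).last_hidden_state.mean(dim=1)).squeeze(0)
    # Fuse the two modalities with a fixed weight (alpha is an assumption).
    fused = alpha * F.normalize(visual, dim=-1) + (1 - alpha) * F.normalize(textual, dim=-1)
    return F.normalize(fused, dim=-1)

@torch.no_grad()
def encode_query(text):
    # Encode the search query with CLIP's text tower.
    tokens = processor(text=[text], return_tensors="pt", padding=True)
    return F.normalize(clip.get_text_features(**tokens).squeeze(0), dim=-1)

# Retrieval: rank candidate videos by cosine similarity to the query, e.g.
#   scores = torch.stack([encode_video(f, t) for f, t in corpus]) @ encode_query("a dog surfing")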
Original language: English
Title of host publication: MVRMLM '24: Proceedings of 2024 ACM ICMR Workshop on Multimodal Video Retrieval
Publisher: Association for Computing Machinery
Pages: 45-50
Number of pages: 6
ISBN (Electronic): 9798400706844
Publication status: Published - 10 Jun 2024
Event
ICMR '24: International Conference on Multimedia Retrieval - Phuket, Thailand
Duration: 10 Jun 2024 - 14 Jun 2024

Publication series

Name: Proceedings of ACM ICMR Workshop on Multimodal Video Retrieval
Publisher: Association for Computing Machinery

Conference

Conference: ICMR '24: International Conference on Multimedia Retrieval
Country/Territory: Thailand
City: Phuket
Period: 10/06/2024 - 14/06/2024
