Abstract
With the rise of short-video platforms, a large amount of video data is generated daily. These videos vary in quality and are poorly tagged. Fully utilizing the multimodal information in videos, bridging the differences between modalities, and achieving precise retrieval is a major challenge in the field of video retrieval. This paper presents a novel approach to multimodal video retrieval that aims to boost search precision by incorporating visual, textual, and audio information through the CLIP model and T5. To tackle the problem of retrieving relevant content from large, untagged video repositories, we propose a method that fuses multimodal data through novel feature extraction and alignment techniques. Our method achieves performance close to the current state of the art on the MSR-VTT benchmark, demonstrating its effectiveness in improving search accuracy.
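The abstract names CLIP and T5 as the backbone models but does not spell out the fusion and alignment steps. As a rough orientation only, the sketch below shows the common CLIP-based text-to-video retrieval baseline on which such methods typically build; the checkpoint name, mean-pooled frame embeddings, and cosine-similarity ranking are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of CLIP-based text-to-video retrieval (not the authors' code).
# Assumptions: frames are pre-extracted PIL images, a video embedding is the
# mean of per-frame CLIP image embeddings, and retrieval ranks videos by cosine
# similarity to the CLIP text embedding of the query.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

@torch.no_grad()
def embed_video(frames: list[Image.Image]) -> torch.Tensor:
    """Mean-pool per-frame CLIP image features into one L2-normalised video embedding."""
    inputs = processor(images=frames, return_tensors="pt")
    frame_feats = model.get_image_features(**inputs)        # (num_frames, dim)
    video_feat = frame_feats.mean(dim=0, keepdim=True)      # (1, dim)
    return video_feat / video_feat.norm(dim=-1, keepdim=True)

@torch.no_grad()
def embed_text(query: str) -> torch.Tensor:
    """Encode a text query into an L2-normalised CLIP text embedding."""
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    text_feat = model.get_text_features(**inputs)           # (1, dim)
    return text_feat / text_feat.norm(dim=-1, keepdim=True)

def rank_videos(query: str, video_feats: torch.Tensor) -> torch.Tensor:
    """Return video indices sorted by cosine similarity to the query."""
    sims = embed_text(query) @ video_feats.T                # (1, num_videos)
    return sims.argsort(dim=-1, descending=True).squeeze(0)
```

A pipeline along the abstract's lines would additionally fuse audio features and T5-processed text before alignment, but those steps are not described in this record.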
Original language | English |
---|---|
Title of host publication | MVRMLM '24: Proceedings of 2024 ACM ICMR Workshop on Multimodal Video Retrieval |
Publisher | Association for Computing Machinery |
Pages | 45-50 |
Number of pages | 6 |
ISBN (Electronic) | 9798400706844 |
DOIs | |
Publication status | Published - 10 Jun 2024 |
Event | ICMR '24: International Conference on Multimedia Retrieval, Phuket, Thailand. Duration: 10 Jun 2024 → 14 Jun 2024 |
Publication series
Name | Proceedings of ACM ICMR Workshop on Multimodal Video Retrieval |
---|---|
Publisher | Association for Computing Machinery |
Conference
Conference | ICMR '24: International Conference on Multimedia Retrieval |
---|---|
Country/Territory | Thailand |
City | Phuket |
Period | 10/06/2024 → 14/06/2024 |