Abstract
This paper studies referring video object segmentation (RVOS) by boosting video-level visual-linguistic alignment. Recent approaches model the RVOS task as a sequence prediction problem and perform multi-modal interaction as well as segmentation for each frame separately. However, the lack of a global view of video content makes it difficult to exploit inter-frame relationships and to understand textual descriptions of objects' temporal variations. To address this issue, we propose Semantic-assisted Object Cluster (SOC), which aggregates video content and textual guidance for unified temporal modeling and cross-modal alignment. By associating a group of frame-level object embeddings with language tokens, SOC facilitates joint space learning across modalities and time steps. Moreover, we present multi-modal contrastive supervision to help construct a well-aligned joint space at the video level. We conduct extensive experiments on popular RVOS benchmarks, and our method outperforms state-of-the-art competitors on all of them by a remarkable margin. In addition, the emphasis on temporal coherence improves segmentation stability and the method's adaptability to text expressions that describe temporal variations. Code is available at https://github.com/RobertLuo1/NeurIPS2023_SOC.
Original language | English |
---|---|
Title of host publication | Advances in Neural Information Processing Systems 36 (NeurIPS 2023): Proceedings |
Editors | A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, S. Levine |
Publisher | Neural Information Processing Systems Foundation |
Number of pages | 13 |
ISBN (Electronic) | 9781713899921 |
Publication status | Published - Jul 2024 |
Externally published | Yes |
Event | 37th Conference on Neural Information Processing Systems, NeurIPS 2023 - New Orleans, United States (Duration: 10 Dec 2023 → 16 Dec 2023) |
Conference
Conference | 37th Conference on Neural Information Processing Systems, NeurIPS 2023 |
---|---|
Country/Territory | United States |
City | New Orleans |
Period | 10/12/2023 → 16/12/2023 |
Keywords
- SOC
- Semantic-assisted Object Cluster
- video object segmentation