VTT: Long-term Visual Tracking with Transformers

Tianling Bian, Yang Hua, Tao Song, Zhengui Xue, Ruhui Ma, Neil Robertson, Haibing Guan

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution



Long-term visual tracking is a challenging problem. State-of-the-art long-term trackers, e.g., GlobalTrack, utilize region proposal networks (RPNs) to generate target proposals. However, their performance degrades under occlusions and large scale or ratio variations. To address these issues, in this paper, we are the first to propose a novel architecture with transformers for long-term visual tracking. Specifically, the proposed Visual Tracking Transformer (VTT) utilizes a transformer encoder-decoder architecture to aggregate global information, handling occlusion and large scale or ratio variation. Furthermore, it shows better discriminative power against instance-level distractors without the need for extra labeling and hard-sample mining. We conduct extensive experiments on three large-scale long-term tracking datasets and achieve state-of-the-art performance.
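The global aggregation the abstract attributes to the transformer encoder-decoder rests on scaled dot-product attention, where each query token gathers information from every key/value token in the search region. As a minimal sketch of that mechanism only (the shapes, token counts, and role of template vs. search features here are illustrative assumptions, not the paper's actual VTT configuration):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each query row attends over all key/value rows (global aggregation)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                                  # (n_q, n_kv) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))  # numerically stable softmax
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                             # weighted sum of values

# Hypothetical example: template tokens querying search-region tokens,
# so every template feature sees the whole search region at once.
rng = np.random.default_rng(0)
template = rng.standard_normal((4, 16))   # 4 template tokens, dim 16 (assumed sizes)
search = rng.standard_normal((64, 16))    # 64 search-region tokens
out = scaled_dot_product_attention(template, search, search)
print(out.shape)  # one aggregated feature per template token
```

Because every query attends to all tokens rather than to a local window, such attention can relate the target to distant context, which is the intuition behind its robustness to occlusion and scale change.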
Original language: English
Title of host publication: International Conference on Pattern Recognition (ICPR)
Publisher: IEEE
Number of pages: 8
ISBN (Electronic): 978-1-7281-8808-9
ISBN (Print): 978-1-7281-8809-6
Publication status: Published - 05 May 2021

Publication series

Name: International Conference on Pattern Recognition (ICPR): Proceedings
ISSN (Print): 1051-4651


