TDViT: Temporal Dilated Video Transformer for dense video tasks

Guanxiong Sun*, Yang Hua, Guosheng Hu, Neil Robertson

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution


Deep video models, such as 3D CNNs and video transformers, have achieved promising performance on sparse video tasks, i.e., predicting one result per video. However, challenges arise when adapting existing deep video models to dense video tasks, i.e., predicting one result per frame. Specifically, these models are expensive to deploy, less effective when handling redundant frames, and have difficulty capturing long-range temporal correlations. To overcome these issues, we propose a Temporal Dilated Video Transformer (TDViT) that consists of carefully designed temporal dilated transformer blocks (TDTBs). A TDTB can efficiently extract spatiotemporal representations and effectively alleviate the negative effect of temporal redundancy. Furthermore, by using hierarchical TDTBs, our approach obtains an exponentially expanded temporal receptive field and can therefore model long-range dynamics. Extensive experiments are conducted on two dense video benchmarks, i.e., ImageNet VID for video object detection and YouTube VIS for video instance segmentation. The experimental results demonstrate the superior efficiency, effectiveness, and compatibility of our method. The code is available at .
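
The claim about an exponentially expanded temporal receptive field can be illustrated with a small calculation. The sketch below is not the authors' implementation: the abstract does not state the dilation rates or how many frames each block samples, so the doubling schedule and the two-frame `taps` value are assumptions chosen only to show why stacking dilated stages grows the temporal span much faster than stacking undilated ones.

```python
def temporal_receptive_field(dilations, taps=2):
    """Number of frames spanned by a stack of temporally dilated layers.

    Assumption (not from the paper): each layer looks at `taps` frames
    spaced `d` frames apart, so it widens the span by (taps - 1) * d.
    """
    field = 1  # a single frame initially sees only itself
    for d in dilations:
        field += (taps - 1) * d
    return field


# Hypothetical schedule: the dilation doubles at every stage.
doubling = [2 ** i for i in range(4)]  # [1, 2, 4, 8]
# Same depth, but with no dilation growth.
constant = [1] * 4

print(temporal_receptive_field(doubling))  # 16
print(temporal_receptive_field(constant))  # 5
```

Under these assumptions, four stages with doubling dilation span 16 frames, while four undilated stages span only 5, which is the intuition behind using hierarchical dilated blocks to model long-range dynamics.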

Original language: English
Title of host publication: Proceedings of the 17th European Conference on Computer Vision
Editors: Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, Tal Hassner
Publisher: Springer Nature Switzerland
ISBN (Electronic): 9783031198335
ISBN (Print): 9783031198328
Publication status: Published - 04 Nov 2022
Event: European Conference on Computer Vision - Israel, Tel-Aviv, Israel
Duration: 23 Oct 2022 to 27 Oct 2022

Publication series

Name: Lecture Notes in Computer Science
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349


Conference: European Conference on Computer Vision
Abbreviated title: ECCV 2022

