Towards effective and efficient video object detection

  • Guanxiong Sun

Student thesis: Doctoral Thesis, Doctor of Philosophy


The past decade has witnessed great progress in object detection on still images. As one of the fundamental computer vision tasks, object detection aims to locate and classify given objects simultaneously. However, in many real-world computer vision applications, e.g., video surveillance and autonomous driving, data are captured as video rather than as still images. Video object detection remains very challenging.

This thesis focuses on addressing core challenges in video object detection. Specifically, we pose three research questions. Firstly, how can accurate video object detection be achieved on low-quality frames? Secondly, how can video object detection be made accurate while running at real-time speed? Finally, how can spatio-temporal features be extracted more efficiently and effectively?

In this thesis, we present three novel methods to address the aforementioned challenges and achieve effective and efficient video object detection. Firstly, we propose a memory bank structure that enhances low-quality frames using features from previous frames. With the memory bank, we can readily improve the quality of features extracted from poor video frames. As a result, our method outperforms many state-of-the-art methods on the large-scale ImageNet VID benchmark. Secondly, we propose a general framework that can be applied to various one-stage detectors for video object detection. We design a location prior network and a size prior network to skip unnecessary computations on background regions. Thirdly, we carefully design a temporal dilated transformer block (TDTB) to build a simple yet effective backbone for dense video tasks, named temporal dilated video transformer (TDViT). TDViT achieves excellent performance on two widely used dense video benchmarks, ImageNet VID for video object detection and YouTube VIS for video instance segmentation. The excellent performance and generalisation ability demonstrate that TDViT can serve as a general backbone for dense video tasks.
Date of Award: Jul 2023
Original language: English
Awarding Institution
  • Queen's University Belfast
Sponsors: Anyvision ltd.
Supervisors: Yang Hua (Supervisor), Hui Wang (Supervisor) & Neil Robertson (Supervisor)


  • Object detection
  • Video recognition
  • Video object detection
