Multi-modal object detection in non-corresponding imagery using unsupervised techniques

  • Rachael Abbott

Student thesis: Doctoral Thesis, Doctor of Philosophy


Object detection has advanced substantially in recent years. This thesis focuses on multi-modal (RGB-LWIR) object detection for surveillance in a Security and Defence context. The current state of the art in object detection uses convolutional neural networks (CNNs). CNNs require large datasets for training; however, unlike the RGB modality, there is a severe lack of labelled long-wave infrared (LWIR) datasets. Collecting, preprocessing and annotating a large dataset is exceptionally time-consuming and expensive. Therefore, we aim to develop novel techniques that achieve high detection rates on small LWIR datasets using unsupervised techniques. Our ultimate aim is to achieve multi-modal object detection in non-corresponding RGB and LWIR imagery.

First, a supervised scenario is considered in which only a small, low-resolution LWIR dataset is available for training. We propose to exploit another, larger high-resolution LWIR dataset using transfer learning and fine-tuning techniques that are usually applied only within the RGB domain. The originality of this work lies in its focus on transfer learning across LWIR datasets that differ in location, viewing angle, resolution and target classes, with targets at varying distances from the camera. We also show that pre-training detection algorithms on the ImageNet RGB dataset significantly improves detection rates in LWIR imagery, by up to ∼18%. These results lead us to go a stage further and demonstrate that transfer learning between the RGB and LWIR modalities is possible. We show that an RGB-trained detection algorithm produces some LWIR-invariant features. Subsequently, an unsupervised adaptation of this detection network is proposed to increase detection rates in LWIR imagery. For this adaptation, the distance between the source (RGB) and target (LWIR) distributions is minimised during training to create modality-invariant features. This technique can also be applied to non-corresponding imagery, i.e. imagery that is not aligned in space or time. Not only are high detection rates achieved in LWIR imagery, but high RGB F1 scores are also maintained, thereby creating a multi-modal detection algorithm.
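
The abstract does not name the distance that is minimised between the source and target distributions. As a minimal sketch only, the example below assumes maximum mean discrepancy (MMD) with an RBF kernel, a common choice in unsupervised domain adaptation, and should not be read as the thesis's actual loss. The function names, the toy feature vectors and the `bandwidth` parameter are all hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def rbf_kernel(a: Vector, b: Vector, bandwidth: float = 1.0) -> float:
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel(xs: Sequence[Vector], ys: Sequence[Vector],
                bandwidth: float) -> float:
    """Average kernel value over all pairs drawn from two batches."""
    total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(source: Sequence[Vector], target: Sequence[Vector],
                bandwidth: float = 1.0) -> float:
    """Biased estimate of squared MMD between two feature batches.

    Assumption: `source` holds features of RGB images and `target`
    holds features of LWIR images. The batches are unpaired, so the
    images do not need to correspond in space or time.
    """
    return (mean_kernel(source, source, bandwidth)
            + mean_kernel(target, target, bandwidth)
            - 2.0 * mean_kernel(source, target, bandwidth))


# Toy feature batches (made-up values, for illustration only).
rgb_features = [[0.0, 0.0], [1.0, 0.0]]
lwir_near = [[0.1, 0.0], [1.1, 0.0]]
lwir_far = [[5.0, 5.0], [6.0, 5.0]]

near_gap = mmd_squared(rgb_features, lwir_near)
far_gap = mmd_squared(rgb_features, lwir_far)
```

In a training loop, a term of this kind would typically be added to the supervised detection loss on the labelled RGB data, so that the network is pushed towards features whose distribution looks the same for both modalities. Because the estimate compares whole batches rather than image pairs, it needs no LWIR labels and no image correspondence.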

Next, the potential of using synthetic imagery to enhance detection in the LWIR modality in an unsupervised manner is investigated. CycleGANs are trained to translate real LWIR imagery into synthetic RGB imagery. This removes the requirement to adapt an RGB-trained object detection network for LWIR imagery. In addition, CycleGANs train on non-corresponding imagery, which is easy to obtain and applicable to real-life scenarios. This translation approach results in a multi-modal detection system that uses a generator as a first step to translate the LWIR imagery to the RGB modality. Additionally, we translate RGB imagery to the LWIR modality to create synthetic LWIR imagery for training purposes. We use the labels from the RGB imagery together with the synthetic LWIR imagery to fine-tune the RGB-trained detection network. This method produces the best results of all the unsupervised approaches on real LWIR imagery, with F1 scores of up to 85.6%.
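
CycleGANs can learn from non-corresponding imagery because of the cycle-consistency term in the standard CycleGAN objective: each image is compared only with its own reconstruction after a round trip through both generators, never with an image from the other modality. The sketch below illustrates that term alone, under heavy simplification. Images are flat lists of floats, the two "generators" are hypothetical toy functions rather than trained networks, and the adversarial losses are omitted.

```python
from typing import Callable, List

Image = List[float]  # toy stand-in for a flattened image
Generator = Callable[[Image], Image]


def l1_distance(a: Image, b: Image) -> float:
    """Mean absolute difference between two equally sized images."""
    if len(a) != len(b):
        raise ValueError("images must have the same size")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_consistency_loss(lwir_batch: List[Image],
                           rgb_batch: List[Image],
                           lwir_to_rgb: Generator,
                           rgb_to_lwir: Generator) -> float:
    """L1 cycle loss over two unpaired batches.

    The batches may differ in size and content: no LWIR image is
    ever compared directly with an RGB image.
    """
    forward = sum(
        l1_distance(rgb_to_lwir(lwir_to_rgb(x)), x) for x in lwir_batch
    ) / len(lwir_batch)
    backward = sum(
        l1_distance(lwir_to_rgb(rgb_to_lwir(y)), y) for y in rgb_batch
    ) / len(rgb_batch)
    return forward + backward


# Hypothetical toy generators (not trained models).
def toy_lwir_to_rgb(img: Image) -> Image:
    return [2.0 * v for v in img]


def toy_rgb_to_lwir(img: Image) -> Image:
    return [0.5 * v for v in img]


def lossy_rgb_to_lwir(img: Image) -> Image:
    return [0.25 * v for v in img]


lwir_images = [[1.0, 3.0]]
rgb_images = [[2.0]]

perfect = cycle_consistency_loss(
    lwir_images, rgb_images, toy_lwir_to_rgb, toy_rgb_to_lwir)
imperfect = cycle_consistency_loss(
    lwir_images, rgb_images, toy_lwir_to_rgb, lossy_rgb_to_lwir)
```

When the two generators are exact inverses, the loss is zero. When one of them loses information, the round trip no longer reproduces the input and the loss grows, which is the signal that keeps the translated image tied to the content of the original.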

Date of Award: Jul 2021
Original language: English
Awarding Institution
  • Queen's University Belfast
Sponsors: Thales UK Limited
Supervisors: Jesus Martinez-del-Rincon (Supervisor) & Neil Robertson (Supervisor)


Keywords
  • Multi-modal imagery
  • LWIR-RGB imagery
  • object detection
  • unsupervised adaptation
  • modality adaptation
