Vehicle re-identification from multi-modal vision sensors with deep metric learning

  • Eleni Kamenou

Student thesis: Doctoral Thesis (Doctor of Philosophy)

Abstract

With the growing use of camera sensor networks over the past decade, computer vision applications have become increasingly relevant. In this context, multi-camera multi-target tracking is a prominent computer vision task that entails multiple cameras covering a wide area of interest and multiple targets transiting across the camera views. In wide-area tracking scenarios the cameras are often placed far apart and their fields of view do not overlap. A key component of such systems is the ability to preserve the identities of the targets as they move from one camera view to another, or after long periods of occlusion within a single camera view. This functionality requires a re-identification mechanism to be integrated across the camera network.

This dissertation focuses on addressing the re-identification problem in multi-camera settings using vehicle data, and on the various challenges that this task encompasses. Real-world tracking data involve low resolution, occlusions, and varying viewpoints and illumination conditions. Moreover, vehicle appearances are often characterised by inter-class similarity and, at the same time, intra-class variability. These challenges require the development of robust re-identification algorithms employing highly discriminative features in order to associate vehicles accurately.

In Chapter 3 of this dissertation, we explore various training and testing configurations of a vehicle re-identification model, aiming to address the difficulties associated with vehicle tracking data. In particular, we conduct a thorough analysis of several existing loss functions. Moreover, we experiment with multi-level feature extraction approaches and time-pooling techniques, which prove highly beneficial to model performance; a rough sketch of these two ideas follows.
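The abstract does not detail the exact architecture, so the snippet below is only a minimal PyTorch sketch of what multi-level feature extraction combined with time-pooling can look like; the function names, tensor shapes, and the choice of average pooling are illustrative assumptions, not the thesis implementation.

```python
import torch
import torch.nn.functional as F

def multilevel_descriptor(mid_feats: torch.Tensor, last_feats: torch.Tensor) -> torch.Tensor:
    """Concatenate globally pooled activations from an intermediate and the
    final backbone stage into one per-frame embedding (hypothetical sketch).

    mid_feats:  (T, C1, H1, W1) activations from a mid-level layer
    last_feats: (T, C2, H2, W2) activations from the last layer
    returns:    (T, C1 + C2) per-frame embeddings
    """
    mid = F.adaptive_avg_pool2d(mid_feats, 1).flatten(1)
    last = F.adaptive_avg_pool2d(last_feats, 1).flatten(1)
    return torch.cat([mid, last], dim=1)

def time_pool(frame_embeddings: torch.Tensor) -> torch.Tensor:
    """Average-pool the T per-frame embeddings of a tracklet into one descriptor."""
    return frame_embeddings.mean(dim=0)
```

The intuition is that mid-level layers retain fine-grained cues (e.g. stickers, grilles) that the final layer may wash out, while pooling over time suppresses per-frame noise such as motion blur.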

Moving forward with the experimental analysis, we observed that in re-identification settings, where the vehicle classes the model must identify at testing time differ from the vehicle classes used for training, there is a considerable domain shift between the training and testing data distributions. A typical issue arising in such a setting is over-fitting, whereby the model fits the training data too closely and fails to generalise to the testing data. To mitigate this negative effect, in Chapter 4 we propose a novel regularisation method, in the form of a loss term integrated into the total loss function, in order to obtain more generalisable trained models. We extensively evaluate the effectiveness of the newly proposed regulariser and demonstrate consistent improvements over recent state-of-the-art regularisers, in combination with several baseline loss functions and across multiple datasets; the general pattern of such a loss-level regulariser is sketched below.
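The thesis's actual regularisation term is not given in this abstract, so the following only illustrates the general pattern of adding a weighted regulariser to a baseline metric-learning loss; the embedding-norm penalty is a hypothetical stand-in, not the proposed regulariser.

```python
import torch

def regularised_loss(base_loss: torch.Tensor,
                     embeddings: torch.Tensor,
                     reg_weight: float = 0.1) -> torch.Tensor:
    """Combine a baseline metric-learning loss with a weighted regulariser.

    The L2 embedding-norm penalty below is only a stand-in example;
    the thesis proposes its own regularisation term.
    """
    regulariser = embeddings.norm(p=2, dim=1).mean()
    return base_loss + reg_weight * regulariser
```

Because the regulariser enters as an additive term, it can be combined with any baseline loss (triplet, softmax-based, etc.), which matches the abstract's claim of evaluation across several baseline loss functions.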

In the context of computer vision applications, including re-identification, research has mainly focused on visible-light imagery. However, owing to decreasing costs, thermal sensors have also been increasingly adopted in wide-area surveillance systems. Capturing both the visible and infrared modalities can provide complementary, high-quality imagery irrespective of low illumination or adverse weather conditions, such as fog. In fact, the combination of the two modalities offers a more complete representation of the scene, superior to what the human eye can perceive. Therefore, in the last two technical chapters, we employ multi-modal imaging data captured by diverse sensor types (RGB and infrared), which proves beneficial for re-identification performance.

The radical dissimilarity between the physical phenomena that the two modalities capture causes a high domain discrepancy between the RGB and infrared data domains. Building a model that can simultaneously deal with both modalities is therefore an extremely challenging task that remains highly understudied. In Chapter 5, we first define two different re-identification scenarios: multi-modal and cross-modal. Multi-modal settings utilise data of both modalities to perform the matching, whereas cross-modal matching is applied between samples of different modalities. We thoroughly study methods to minimise the domain gap between the two data modalities in the embedding space, in order to train deep neural architectures capable of both multi-modal re-identification and cross-modal matching, the latter involving retrieving infrared images given an RGB query image, and vice versa; a sketch of such a retrieval step follows. Our model design consists of a set of loss components and network architectural configurations aiming to bridge the domain gap. Our results show that although cross-modal matching remains a very challenging scenario, with generally lower accuracy than the multi-modal case, our proposed model significantly improves cross-modal performance while maintaining high multi-modal accuracy.
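As context for the cross-modal scenario, here is a minimal sketch of how retrieval can work once both modalities are embedded in a shared space; the cosine-similarity ranking and all names are assumptions for illustration, not the specific matching procedure of the thesis.

```python
import torch
import torch.nn.functional as F

def cross_modal_ranking(rgb_queries: torch.Tensor,
                        ir_gallery: torch.Tensor) -> torch.Tensor:
    """Rank an infrared gallery for each RGB query by cosine similarity
    in a shared embedding space (hypothetical sketch).

    rgb_queries: (Q, D) embeddings of RGB query images
    ir_gallery:  (G, D) embeddings of infrared gallery images
    returns:     (Q, G) gallery indices, best match first
    """
    q = F.normalize(rgb_queries, dim=1)
    g = F.normalize(ir_gallery, dim=1)
    similarity = q @ g.t()  # (Q, G) cosine similarity matrix
    return similarity.argsort(dim=1, descending=True)
```

Note that this ranking is only meaningful if training has already pulled the RGB and infrared embeddings of the same vehicle close together, which is precisely the domain-gap problem the chapter's loss components address.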

Finally, in Chapter 6, we evaluate the ability of a re-identification model to generalise to previously unseen visual modalities at testing time. In particular, RGB, near-infrared and thermal-infrared modalities are the considered data domains. Approaching this as a domain generalisation problem, we propose a deep neural framework based on the meta-learning training paradigm. Meta-learning leverages a two-step gradient calculation that exposes the model to the train-test domain shift during backpropagation, as sketched below. This technique improves the out-of-distribution generalisation of the model compared to conventional training. This work aims to alleviate the limited availability of training data from diverse imaging modalities by creating models that can easily adapt to imagery from new modalities.
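The abstract does not specify the exact algorithm, so the following is a first-order sketch of the general two-step pattern used in meta-learning-based domain generalisation (in the spirit of MLDG): the meta-test gradient is taken at weights already shifted by the meta-train gradient. Every name and hyper-parameter here is an assumption.

```python
import copy
import torch

def meta_step(model, optimizer, loss_fn,
              meta_train_batch, meta_test_batch,
              inner_lr=0.01, beta=1.0):
    """One first-order two-step meta-learning update (hypothetical sketch):
    the meta-test gradient is computed at virtually updated weights, so the
    final update reflects the simulated train-test domain shift."""
    x_tr, y_tr = meta_train_batch  # batch drawn from the meta-train domains
    x_te, y_te = meta_test_batch   # batch drawn from a held-out meta-test domain

    # Step 1: gradient of the loss on the meta-train domains.
    train_loss = loss_fn(model(x_tr), y_tr)
    train_grads = torch.autograd.grad(train_loss, model.parameters())

    # Step 2: virtually update a copy of the model and evaluate it on the
    # held-out domain, simulating the shift to an unseen modality.
    updated = copy.deepcopy(model)
    with torch.no_grad():
        for p, g in zip(updated.parameters(), train_grads):
            p -= inner_lr * g
    test_loss = loss_fn(updated(x_te), y_te)
    test_grads = torch.autograd.grad(test_loss, updated.parameters())

    # Combine both gradients and apply them to the original model.
    optimizer.zero_grad()
    for p, g_tr, g_te in zip(model.parameters(), train_grads, test_grads):
        p.grad = g_tr + beta * g_te
    optimizer.step()
```

In this pattern each modality plays the role of a domain: at every iteration one modality is held out as the meta-test domain, so the model is repeatedly optimised to remain accurate after a simulated domain shift.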
Date of Award: Jul 2024
Original language: English
Awarding Institution:
  • Queen's University Belfast
Sponsors: Thales UK Limited
Supervisors: Jesus Martinez-del-Rincon (Supervisor) & Paul Miller (Supervisor)
