-
Knowledge Distillation for Multi-Target Domain Adaptation in Real-Time Person Re-Identification
Authors:
Félix Remigereau,
Djebril Mekhazni,
Sajjad Abdoli,
Le Thanh Nguyen-Meidine,
Rafael M. O. Cruz,
Eric Granger
Abstract:
Despite the recent success of deep learning architectures, person re-identification (ReID) remains a challenging problem in real-world applications. Several unsupervised single-target domain adaptation (STDA) methods have recently been proposed to limit the decline in ReID accuracy caused by the domain shift that typically occurs between source and target video data. Given the multimodal nature of person ReID data (due to variations across camera viewpoints and capture conditions), training a common CNN backbone to address domain shifts across multiple target domains can provide an efficient solution for real-time ReID applications. Although multi-target domain adaptation (MTDA) has not been widely addressed in the ReID literature, a straightforward approach consists in blending different target datasets, and performing STDA on the mixture to train a common CNN. However, this approach may lead to poor generalization, especially when blending a growing number of distinct target domains to train a smaller CNN.
To alleviate this problem, we introduce a new MTDA method based on knowledge distillation (KD-ReID) that is suitable for real-time person ReID applications. Our method adapts a common lightweight student backbone CNN over the target domains by alternately distilling from multiple specialized teacher CNNs, each one adapted on data from a specific target domain. Extensive experiments conducted on several challenging person ReID datasets indicate that our approach outperforms state-of-the-art methods for MTDA, including blending methods, particularly when training a compact CNN backbone like OSNet. Results suggest that our flexible MTDA approach can be employed to design cost-effective ReID systems for real-time video surveillance applications.
Submitted 10 July, 2022; v1 submitted 12 May, 2022;
originally announced May 2022.
-
Dynamic Template Selection Through Change Detection for Adaptive Siamese Tracking
Authors:
Madhu Kiran,
Le Thanh Nguyen-Meidine,
Rajat Sahay,
Rafael Menelau Oliveira E Cruz,
Louis-Antoine Blais-Morin,
Eric Granger
Abstract:
Deep Siamese trackers have gained much attention in recent years since they can track visual objects at high speeds. Additionally, adaptive tracking methods, where target samples collected by the tracker are employed for online learning, have achieved state-of-the-art accuracy. However, single object tracking (SOT) remains a challenging task in real-world applications due to changes and deformations in a target object's appearance. Learning on all the collected samples may lead to catastrophic forgetting, and thereby corrupt the tracking model.
In this paper, SOT is formulated as an online incremental learning problem. A new method is proposed for dynamic sample selection and memory replay, preventing template corruption. In particular, we propose a change detection mechanism to detect gradual changes in object appearance and select the corresponding samples for online adaptation. In addition, an entropy-based sample selection strategy is introduced to maintain a diversified auxiliary buffer for memory replay. Our proposed method can be integrated into any object tracking algorithm that leverages online learning for model adaptation.
Extensive experiments conducted on the OTB-100, LaSOT, UAV123, and TrackingNet datasets highlight the cost-effectiveness of our method, along with the contribution of its key components. Results indicate that integrating our proposed method into state-of-the-art adaptive Siamese trackers can increase the potential benefits of a template update strategy, and significantly improve performance.
Submitted 7 March, 2022;
originally announced March 2022.
-
Generative Target Update for Adaptive Siamese Tracking
Authors:
Madhu Kiran,
Le Thanh Nguyen-Meidine,
Rajat Sahay,
Rafael Menelau Oliveira E Cruz,
Louis-Antoine Blais-Morin,
Eric Granger
Abstract:
Siamese trackers perform similarity matching with templates (i.e., target models) to recursively localize objects within a search region. Several strategies have been proposed in the literature to update a template based on the tracker output, typically extracted from the target search region in the current frame, and thereby mitigate the effects of target drift. However, this may lead to corrupted templates, limiting the potential benefits of a template update strategy.
This paper proposes a model adaptation method for Siamese trackers that uses a generative model to produce a synthetic template from the object search regions of several previous frames, rather than directly using the tracker output. Since the search region encompasses the target, attention from the search region is used for robust model adaptation. In particular, our approach relies on an auto-encoder trained through adversarial learning to detect changes in a target object's appearance and predict a future target template, using a set of target templates localized from tracker outputs at previous frames. To prevent template corruption during the update, the proposed tracker also performs change detection using the generative model to suspend updates until the tracker stabilizes, and robust matching can resume through dynamic template fusion.
Extensive experiments conducted on the VOT-16, VOT-17, OTB-50, and OTB-100 datasets highlight the effectiveness of our method, along with the impact of its key components. Results indicate that our proposed approach can outperform state-of-the-art trackers, and its overall robustness allows tracking for a longer time before failure.
Submitted 20 February, 2022;
originally announced February 2022.
-
Holistic Guidance for Occluded Person Re-Identification
Authors:
Madhu Kiran,
R Gnana Praveen,
Le Thanh Nguyen-Meidine,
Soufiane Belharbi,
Louis-Antoine Blais-Morin,
Eric Granger
Abstract:
In real-world video surveillance applications, person re-identification (ReID) suffers from the effects of occlusions and detection errors. Despite recent advances, occlusions continue to corrupt the features extracted by state-of-the-art CNN backbones, and thereby deteriorate the accuracy of ReID systems. To address this issue, methods in the literature use an additional costly process such as pose estimation, where pose maps provide supervision to exclude occluded regions. In contrast, we introduce a novel Holistic Guidance (HG) method that relies only on person identity labels, and on the distribution of pairwise matching distances of datasets to alleviate the problem of occlusion, without requiring additional supervision. Hence, our proposed student-teacher framework is trained to address the occlusion problem by matching the distributions of between- and within-class distances (DCDs) of occluded samples with those of holistic (non-occluded) samples, thereby using the latter as a soft-labeled reference to learn well-separated DCDs. This approach is supported by our empirical study, which shows that the distributions of between- and within-class distances between images overlap more in occluded than in holistic datasets. In particular, features extracted from both datasets are jointly learned using the student model to produce an attention map that allows separating visible regions from occluded ones. In addition, a joint generative-discriminative backbone is trained with a denoising autoencoder, allowing the system to self-recover from occlusions. Extensive experiments on several challenging public datasets indicate that the proposed approach can outperform state-of-the-art methods on both occluded and holistic datasets.
Submitted 22 July, 2023; v1 submitted 13 April, 2021;
originally announced April 2021.
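The between- and within-class distance distributions (DCDs) mentioned in the abstract above can be illustrated with a small sketch. This is not the authors' implementation: the function names `class_distances` and `overlap_fraction`, the use of Euclidean distance, and the overlap measure are all assumptions chosen for illustration.

```python
import math
from itertools import combinations


def class_distances(features, labels):
    """Split all pairwise Euclidean distances into within-class and between-class lists."""
    within, between = [], []
    for i, j in combinations(range(len(features)), 2):
        d = math.dist(features[i], features[j])
        if labels[i] == labels[j]:
            within.append(d)
        else:
            between.append(d)
    return within, between


def overlap_fraction(within, between):
    """Hypothetical overlap measure: share of between-class distances that are
    no larger than the largest within-class distance."""
    if not within or not between:
        return 0.0
    limit = max(within)
    return sum(d <= limit for d in between) / len(between)


# Toy embeddings: two identities, two images each (made-up values).
features = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
labels = [0, 0, 1, 1]
within, between = class_distances(features, labels)
print(overlap_fraction(within, between))
```

A higher overlap value on occluded data than on holistic data would correspond to the less separated DCDs that the abstract describes.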
-
Incremental Multi-Target Domain Adaptation for Object Detection with Efficient Domain Transfer
Authors:
Le Thanh Nguyen-Meidine,
Madhu Kiran,
Marco Pedersoli,
Jose Dolz,
Louis-Antoine Blais-Morin,
Eric Granger
Abstract:
Recent advances in unsupervised domain adaptation have significantly improved the recognition accuracy of CNNs by alleviating the domain shift between (labeled) source and (unlabeled) target data distributions. While the problem of single-target domain adaptation (STDA) for object detection has recently received much attention, multi-target domain adaptation (MTDA) remains largely unexplored, despite its practical relevance in several real-world applications, such as multi-camera video surveillance. Compared to the STDA problem that may involve large domain shifts between complex source and target distributions, MTDA faces additional challenges, most notably the computational requirements and catastrophic forgetting of previously learned targets, which can depend on the order of target adaptations. STDA for detection can be applied to MTDA by adapting one model per target, or one common model with a mixture of data from target domains. However, these approaches are either costly or inaccurate. The only state-of-the-art MTDA method specialized for detection learns targets incrementally, one target at a time, and mitigates the loss of knowledge by using a duplicated detection model for knowledge distillation, which is computationally expensive and does not scale well to many domains. In this paper, we introduce an efficient approach for incremental learning that generalizes well to multiple target domains. Our MTDA approach is more suitable for real-world applications since it allows updating the detection model incrementally, without storing data from previously learned target domains or retraining when a new target domain becomes available. Our proposed method, MTDA-DTM, achieved the highest level of detection accuracy compared against state-of-the-art approaches on several MTDA detection benchmarks and Wildtrack, a benchmark for multi-camera pedestrian detection.
Submitted 11 May, 2022; v1 submitted 13 April, 2021;
originally announced April 2021.
-
Knowledge Distillation Methods for Efficient Unsupervised Adaptation Across Multiple Domains
Authors:
Le Thanh Nguyen-Meidine,
Atif Belal,
Madhu Kiran,
Jose Dolz,
Louis-Antoine Blais-Morin,
Eric Granger
Abstract:
Beyond the complexity of CNNs that require training on large annotated datasets, the domain shift between design and operational data has limited the adoption of CNNs in many real-world applications. For instance, in person re-identification, videos are captured over a distributed set of cameras with non-overlapping viewpoints. The shift between the source (e.g. lab setting) and target (e.g. cameras) domains may lead to a significant decline in recognition accuracy. Additionally, state-of-the-art CNNs may not be suitable for such real-time applications given their computational requirements. Although several techniques have recently been proposed to address domain shift problems through unsupervised domain adaptation (UDA), or to accelerate/compress CNNs through knowledge distillation (KD), we seek to simultaneously adapt and compress CNNs to generalize well across multiple target domains. In this paper, we propose a progressive KD approach for unsupervised single-target DA (STDA) and multi-target DA (MTDA) of CNNs. Our method for KD-STDA adapts a CNN to a single target domain by distilling from a larger teacher CNN, trained on both target and source domain data in order to maintain its consistency with a common representation. Our proposed approach is compared against state-of-the-art methods for compression and STDA of CNNs on the Office31 and ImageClef-DA image classification datasets. It is also compared against state-of-the-art methods for MTDA on Digits, Office31, and OfficeHome. In both settings -- KD-STDA and KD-MTDA -- results indicate that our approach can achieve the highest level of accuracy across target domains, while requiring a comparable or lower CNN complexity.
Submitted 18 January, 2021;
originally announced January 2021.
-
Unsupervised Multi-Target Domain Adaptation Through Knowledge Distillation
Authors:
Le Thanh Nguyen-Meidine,
Atif Belal,
Madhu Kiran,
Jose Dolz,
Louis-Antoine Blais-Morin,
Eric Granger
Abstract:
Unsupervised domain adaptation (UDA) seeks to alleviate the problem of domain shift between the distribution of unlabeled data from the target domain w.r.t. labeled data from the source domain. While the single-target UDA scenario is well studied in the literature, Multi-Target Domain Adaptation (MTDA) remains largely unexplored despite its practical importance, e.g., in multi-camera video-surveillance applications. The MTDA problem can be addressed by adapting one specialized model per target domain, although this solution is too costly in many real-world applications. Blending multiple targets for MTDA has been proposed, yet this solution may lead to a reduction in model specificity and accuracy. In this paper, we propose a novel unsupervised MTDA approach to train a CNN that can generalize well across multiple target domains. Our Multi-Teacher MTDA (MT-MTDA) method relies on multi-teacher knowledge distillation (KD) to iteratively distill target domain knowledge from multiple teachers to a common student. The KD process is performed in a progressive manner, where the student is trained by each teacher on how to perform UDA for a specific target, instead of directly learning domain adapted features. Finally, instead of combining the knowledge from each teacher, MT-MTDA alternates between teachers that distill knowledge, thereby preserving the specificity of each target (teacher) when learning to adapt to the student. MT-MTDA is compared against state-of-the-art methods on several challenging UDA benchmarks, and empirical results show that our proposed model can provide a considerably higher level of accuracy across multiple target domains. Our code is available at: https://github.com/LIVIAETS/MT-MTDA
Submitted 19 November, 2020; v1 submitted 14 July, 2020;
originally announced July 2020.
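The idea of alternating between teachers, rather than combining their outputs, can be sketched as a simple round-robin schedule. This is a minimal illustration only: `alternating_schedule`, `train_student`, and the `distill_step` callback are hypothetical names, and the actual alternation granularity used by MT-MTDA (per batch, per epoch, or otherwise) is not stated in the abstract.

```python
from itertools import cycle, islice


def alternating_schedule(teacher_names, num_steps):
    """Return which teacher distills at each step, cycling through them in order."""
    return list(islice(cycle(teacher_names), num_steps))


def train_student(teachers, num_steps, distill_step):
    """Run `distill_step(name, teacher)` once per step with a single teacher at a time.

    `teachers` maps a target-domain name to its (already adapted) teacher model;
    `distill_step` is a user-supplied callback that updates the shared student.
    """
    history = []
    for name in alternating_schedule(list(teachers), num_steps):
        distill_step(name, teachers[name])
        history.append(name)
    return history


# Toy usage with placeholder teacher objects and a no-op distillation step.
teachers = {"target_A": object(), "target_B": object(), "target_C": object()}
order = train_student(teachers, 5, lambda name, teacher: None)
print(order)
```

Because only one teacher is active per step, each update reflects a single target domain, which is one way to read the abstract's point about preserving the specificity of each target.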
-
Joint Progressive Knowledge Distillation and Unsupervised Domain Adaptation
Authors:
Le Thanh Nguyen-Meidine,
Eric Granger,
Madhu Kiran,
Jose Dolz,
Louis-Antoine Blais-Morin
Abstract:
Currently, the divergence in distributions of design and operational data, and large computational complexity are limiting factors in the adoption of CNNs in real-world applications. For instance, person re-identification systems typically rely on a distributed set of cameras, where each camera has different capture conditions. This can translate to a considerable shift between source (e.g. lab setting) and target (e.g. operational camera) domains. Given the cost of annotating image data captured for fine-tuning in each target domain, unsupervised domain adaptation (UDA) has become a popular approach to adapt CNNs. Moreover, state-of-the-art deep learning models that provide a high level of accuracy often rely on architectures that are too complex for real-time applications. Although several compression and UDA approaches have recently been proposed to overcome these limitations, they do not allow optimizing a CNN to simultaneously address both. In this paper, we propose an unexplored direction -- the joint optimization of CNNs to provide a compressed model that is adapted to perform well for a given target domain. In particular, the proposed approach performs unsupervised knowledge distillation (KD) from a complex teacher model to a compact student model, by leveraging both source and target data. It also improves upon existing UDA techniques by progressively teaching the student about domain-invariant features, instead of directly adapting a compact model on target domain data. Our method is compared against state-of-the-art compression and UDA techniques, using two popular classification datasets for UDA -- Office31 and ImageClef-DA. In both datasets, results indicate that our method can achieve the highest level of accuracy while requiring a comparable or lower time complexity.
Submitted 15 May, 2020;
originally announced May 2020.
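For readers unfamiliar with knowledge distillation from a teacher to a compact student, the sketch below shows the widely used temperature-softened KL-divergence loss on output logits. It is a generic formulation and an assumption on our part: the abstract does not specify the exact distillation loss used, and the names `softmax`, `kd_loss`, and the default temperature are illustrative.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# The loss is zero when the student reproduces the teacher, and positive otherwise.
print(kd_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0]))
print(kd_loss([0.0, 0.0, 0.0], [2.0, 0.5, -1.0]))
```

In a joint setup such as the one described, a term like this would be computed on both source and target images, since no target labels are needed to compare student and teacher outputs.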
-
On the Interaction Between Deep Detectors and Siamese Trackers in Video Surveillance
Authors:
Madhu Kiran,
Vivek Tiwari,
Le Thanh Nguyen-Meidine,
Eric Granger
Abstract:
Visual object tracking is an important function in many real-time video surveillance applications, such as localization and spatio-temporal recognition of persons. In real-world applications, an object detector and tracker must interact on a periodic basis to discover new objects, and thereby to initiate tracks. Periodic interactions with the detector can also allow the tracker to validate and/or update its object template with new bounding boxes. However, bounding boxes provided by a state-of-the-art detector are noisy, due to changes in appearance, background and occlusion, which can cause the tracker to drift. Moreover, CNN-based detectors can provide a high level of accuracy at the expense of computational complexity, so interactions should be minimized for real-time applications.
In this paper, a new approach is proposed to manage detector-tracker interactions for trackers from the Siamese-FC family. By integrating a change detection mechanism into a deep Siamese-FC tracker, its template can be adapted in response to changes in a target's appearance that lead to drifts during tracking. An abrupt change detection triggers an update of the tracker template using the bounding box produced by the detector, while in the case of a gradual change, the detector is used to update an evolving set of templates for robust matching.
Experiments were performed using state-of-the-art Siamese-FC trackers and the YOLOv3 detector on a subset of videos from the OTB-100 dataset that mimic video surveillance scenarios. Results highlight the importance of accurate detectors for reliable visual object tracking (VOT). They also indicate that our adaptive Siamese trackers are robust to noisy object detections, and can significantly improve the performance of Siamese-FC tracking.
Submitted 31 October, 2019;
originally announced October 2019.
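The two update paths described above (replace the template on an abrupt change, extend a template set on a gradual change) can be sketched as follows. The change detector here is a stand-in: the thresholds, the window length, and the use of per-frame template similarity scores are assumptions for illustration, not the mechanism from the paper.

```python
def classify_change(similarities, abrupt_drop=0.3, gradual_drop=0.15, window=5):
    """Label the latest appearance change from per-frame similarity scores.

    `similarities` holds template-matching scores in [0, 1], most recent last.
    All thresholds are hypothetical values chosen for this sketch.
    """
    if len(similarities) < 2:
        return "none"
    # A large drop between consecutive frames counts as an abrupt change.
    if similarities[-2] - similarities[-1] >= abrupt_drop:
        return "abrupt"
    # A smaller but sustained drop over the recent window counts as gradual.
    recent = similarities[-window:]
    if recent[0] - recent[-1] >= gradual_drop:
        return "gradual"
    return "none"


def update_templates(templates, detector_box, change):
    """Apply the update policy: reset on abrupt change, append on gradual change."""
    if change == "abrupt":
        return [detector_box]
    if change == "gradual":
        return templates + [detector_box]
    return templates


scores = [0.90, 0.86, 0.82, 0.78, 0.74]
change = classify_change(scores)
print(change, update_templates(["t0"], "box_from_detector", change))
```

Calling the detector only when a change is flagged is in line with the abstract's remark that detector interactions should be kept to a minimum in real-time settings.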
-
Exploiting Prunability for Person Re-Identification
Authors:
Hugo Masson,
Amran Bhuiyan,
Le Thanh Nguyen-Meidine,
Mehrsan Javan,
Parthipan Siva,
Ismail Ben Ayed,
Eric Granger
Abstract:
Recent years have witnessed a substantial increase in the deep learning (DL) architectures proposed for visual recognition tasks like person re-identification, where individuals must be recognized over multiple distributed cameras. Although these architectures have greatly improved the state-of-the-art accuracy, the computational complexity of the CNNs commonly used for feature extraction remains an issue, hindering their deployment on platforms with limited resources, or in applications with real-time constraints. There is an obvious advantage to accelerating and compressing DL models without significantly decreasing their accuracy. However, the source (pruning) domain differs from operational (target) domains, and the domain shift between image data captured with different non-overlapping camera viewpoints leads to lower recognition accuracy. In this paper, we investigate the prunability of these architectures under different design scenarios. This paper first revisits pruning techniques that are suitable for reducing the computational complexity of deep CNN networks applied to person re-identification. Then, these techniques are analysed according to their pruning criteria and strategy, and according to different scenarios for exploiting pruning methods to fine-tune networks to target domains. Experimental results obtained using DL models with ResNet feature extractors, and multiple benchmark re-identification datasets, indicate that pruning can considerably reduce network complexity while maintaining a high level of accuracy. In scenarios where pruning is performed with large pre-training or fine-tuning datasets, the number of FLOPS required by ResNet architectures is reduced by half, while maintaining a comparable rank-1 accuracy (within 1% of the original model). Pruning while training larger CNNs can also provide significantly better performance than fine-tuning smaller ones.
Submitted 14 April, 2021; v1 submitted 4 July, 2019;
originally announced July 2019.
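To give a feel for what halving the FLOPS means in terms of filters, the arithmetic below counts multiply-accumulate operations for a single convolution before and after channel pruning. The layer shape and keep ratio are made-up values, and the helper names are hypothetical; this does not reproduce the paper's pruning procedure or its measurements.

```python
def conv_macs(height, width, c_in, c_out, kernel=3):
    """Multiply-accumulate count for one stride-1 convolution with a
    height x width output map (bias terms ignored)."""
    return height * width * c_in * c_out * kernel * kernel


def pruned_channels(channels, keep_ratio):
    """Number of channels left after keeping a given fraction (at least one)."""
    return max(1, round(channels * keep_ratio))


# Hypothetical layer: 56x56 output, 256 -> 256 channels, 3x3 kernel.
full = conv_macs(56, 56, 256, 256)

# Keeping about 71% of the filters in this layer and in the one feeding it
# shrinks both c_in and c_out, so the cost scales with the square of the ratio.
kept = pruned_channels(256, 0.7071)
pruned = conv_macs(56, 56, kept, kept)

print(kept, pruned / full)
```

Because the cost is proportional to the product of input and output channels, removing roughly 29% of the filters in consecutive layers is already enough to cut a layer's cost to about half.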
-
Progressive Gradient Pruning for Classification, Detection and Domain Adaptation
Authors:
Le Thanh Nguyen-Meidine,
Eric Granger,
Madhu Kiran,
Louis-Antoine Blais-Morin,
Marco Pedersoli
Abstract:
Although deep neural networks (NNs) have achieved state-of-the-art accuracy in many visual recognition tasks, the growing computational complexity and energy consumption of networks remains an issue, especially for applications on platforms with limited resources and requiring real-time processing. Filter pruning techniques have recently shown promising results for the compression and acceleration of convolutional NNs (CNNs). However, these techniques involve numerous steps and complex optimisations because some only prune after training CNNs, while others prune from scratch during training by integrating sparsity constraints or modifying the loss function. In this paper we propose a new Progressive Gradient Pruning (PGP) technique for iterative filter pruning during training. In contrast to previous progressive pruning techniques, it relies on a novel filter selection criterion that measures the change in filter weights, uses a new hard and soft pruning strategy and effectively adapts momentum tensors during the backward propagation pass. Experimental results obtained after training various CNNs on image data for classification, object detection and domain adaptation benchmarks indicate that the PGP technique can achieve a better trade-off between classification accuracy and network (time and memory) complexity than PSFP and other state-of-the-art filter pruning techniques.
Submitted 25 February, 2020; v1 submitted 20 June, 2019;
originally announced June 2019.
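A filter selection criterion based on the change in filter weights could look roughly like the sketch below. Several choices here are assumptions rather than details from the abstract: the change is measured as the L2 distance between two weight snapshots, and the filters that changed least are the ones selected for pruning. The function names are hypothetical.

```python
import math


def weight_change_scores(prev_filters, curr_filters):
    """L2 distance between the previous and current weights of each filter.

    Each filter is given as a flat list of weights.
    """
    return [math.dist(p, c) for p, c in zip(prev_filters, curr_filters)]


def select_filters_to_prune(prev_filters, curr_filters, prune_ratio):
    """Return indices of the filters whose weights changed the least (assumed rule)."""
    scores = weight_change_scores(prev_filters, curr_filters)
    n_prune = int(len(scores) * prune_ratio)
    order = sorted(range(len(scores)), key=scores.__getitem__)
    return sorted(order[:n_prune])


# Toy layer with four 2-weight filters, before and after some training steps.
prev = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
curr = [[0.0, 0.1], [1.0, 2.0], [2.0, 2.0], [3.0, 5.0]]
print(select_filters_to_prune(prev, curr, prune_ratio=0.5))
```

In an iterative scheme, a selection step like this would be repeated during training with a gradually increasing pruning ratio, which is one plausible reading of "progressive" in the method's name.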
-
A Comparison of CNN-based Face and Head Detectors for Real-Time Video Surveillance Applications
Authors:
Le Thanh Nguyen-Meidine,
Eric Granger,
Madhu Kiran,
Louis-Antoine Blais-Morin
Abstract:
Detecting faces and heads appearing in video feeds is a challenging task in real-world video surveillance applications due to variations in appearance, occlusions and complex backgrounds. Recently, several CNN architectures have been proposed to increase the accuracy of detectors, although their computational complexity can be an issue, especially for real-time applications, where faces and heads must be detected live using high-resolution cameras. This paper compares the accuracy and complexity of state-of-the-art CNN architectures that are suitable for face and head detection. Single pass and region-based architectures are reviewed and compared empirically to baseline techniques according to accuracy and to time and memory complexity on images from several challenging datasets. The viability of these architectures is analyzed with real-time video surveillance applications in mind. Results suggest that, although CNN architectures can achieve a very high level of accuracy compared to traditional detectors, their computational cost can represent a limitation for many practical real-time applications.
Submitted 10 September, 2018;
originally announced September 2018.