-
Integrating Information Theory and Adversarial Learning for Cross-modal Retrieval
Authors:
Wei Chen,
Yu Liu,
Erwin M. Bakker,
Michael S. Lew
Abstract:
Accurately matching visual and textual data in cross-modal retrieval has been widely studied in the multimedia community. To address the challenges posed by the heterogeneity gap and the semantic gap, we propose integrating Shannon information theory and adversarial learning. To reduce the heterogeneity gap, we integrate modality classification and information entropy maximization adversarially. For this purpose, a modality classifier (acting as a discriminator) is built to distinguish the text and image modalities according to their different statistical properties. This discriminator uses its output probabilities to compute Shannon information entropy, which measures the uncertainty of the modality classification it performs. Moreover, feature encoders (acting as a generator) project uni-modal features into a commonly shared space and attempt to fool the discriminator by maximizing its output information entropy. Maximizing information entropy thus gradually reduces the distribution discrepancy of cross-modal features, thereby achieving a domain confusion state in which the discriminator cannot classify the two modalities confidently. To reduce the semantic gap, Kullback-Leibler (KL) divergence and a bi-directional triplet loss are used to associate the intra- and inter-modality similarity between features in the shared space. Furthermore, a regularization term based on KL-divergence with temperature scaling is used to calibrate the label classifier, which is biased by data imbalance. Extensive experiments with four deep models on four benchmarks are conducted to demonstrate the effectiveness of the proposed approach.
Submitted 11 April, 2021;
originally announced April 2021.
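The entropy measure described in the abstract above can be illustrated with a minimal sketch. It only shows how Shannon entropy behaves on a two-class modality prediction; the probability values and the function name `shannon_entropy` are assumptions for illustration, not the authors' implementation.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    # Terms with p == 0 contribute nothing, so they are skipped.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical discriminator outputs [P(image), P(text)] for one feature.
confident = [0.99, 0.01]  # discriminator is sure of the modality
confused = [0.5, 0.5]     # domain confusion: modalities look alike

print(shannon_entropy(confident))
print(shannon_entropy(confused))
```

Entropy is largest for the uniform prediction, which is why encoders that maximize it push the discriminator towards the domain confusion state the abstract mentions.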
-
Lifelong Person Re-Identification via Adaptive Knowledge Accumulation
Authors:
Nan Pu,
Wei Chen,
Yu Liu,
Erwin M. Bakker,
Michael S. Lew
Abstract:
Person ReID methods typically learn from a stationary domain that is fixed by the choice of a given dataset. In many contexts (e.g., lifelong learning), those methods are ineffective because the domain is continually changing, in which case incremental learning over multiple domains may be required. In this work we explore a new and challenging ReID task, namely lifelong person re-identification (LReID), in which a model learns continuously across multiple domains and even generalises to new and unseen domains. Following the cognitive processes in the human brain, we design an Adaptive Knowledge Accumulation (AKA) framework that is endowed with two crucial abilities: knowledge representation and knowledge operation. Our method alleviates catastrophic forgetting on seen domains and demonstrates the ability to generalize to unseen domains. Correspondingly, we also provide a new and large-scale benchmark for LReID. Extensive experiments demonstrate that our method outperforms other competitors by a margin of 5.8% mAP in the generalising evaluation.
Submitted 23 March, 2021;
originally announced March 2021.
-
Deep Learning for Instance Retrieval: A Survey
Authors:
Wei Chen,
Yu Liu,
Weiping Wang,
Erwin Bakker,
Theodoros Georgiou,
Paul Fieguth,
Li Liu,
Michael S. Lew
Abstract:
In recent years a vast amount of visual content has been generated and shared from many fields, such as social media platforms, medical imaging, and robotics. This abundance of content creation and sharing has introduced new challenges, particularly that of searching databases for similar content (Content-Based Image Retrieval, CBIR), a long-established research area in which improved efficiency and accuracy are needed for real-time retrieval. Artificial intelligence has made progress in CBIR and has significantly facilitated the process of instance search. In this survey we review recent instance retrieval works that are developed based on deep learning algorithms and techniques, with the survey organized by deep network architecture types, deep features, feature embedding and aggregation methods, and network fine-tuning strategies. Our survey considers a wide variety of recent methods: we identify milestone work, reveal connections among various methods, present the commonly used benchmarks, evaluation results, and common challenges, and propose promising future directions.
Submitted 30 October, 2022; v1 submitted 27 January, 2021;
originally announced January 2021.
-
New Ideas and Trends in Deep Multimodal Content Understanding: A Review
Authors:
Wei Chen,
Weiping Wang,
Li Liu,
Michael S. Lew
Abstract:
The focus of this survey is on the analysis of two modalities of multimodal deep learning: image and text. Unlike classic reviews of deep learning, where monomodal image classifiers such as VGG, ResNet and the Inception module are central topics, this paper examines recent multimodal deep models and structures, including auto-encoders, generative adversarial nets and their variants. These models go beyond simple image classifiers in that they can perform uni-directional (e.g. image captioning, image generation) and bi-directional (e.g. cross-modal retrieval, visual question answering) multimodal tasks. In addition, we analyze two aspects of the challenge of better content understanding in deep multimodal applications. We then introduce current ideas and trends in deep multimodal feature learning, such as feature embedding approaches and objective function design, which are crucial in overcoming the aforementioned challenges. Finally, we include several promising directions for future research.
Submitted 16 October, 2020;
originally announced October 2020.
-
Dual Gaussian-based Variational Subspace Disentanglement for Visible-Infrared Person Re-Identification
Authors:
Nan Pu,
Wei Chen,
Yu Liu,
Erwin M. Bakker,
Michael S. Lew
Abstract:
Visible-infrared person re-identification (VI-ReID) is a challenging and essential task in night-time intelligent surveillance systems. In addition to the intra-modality variance that RGB-RGB person re-identification mainly has to overcome, VI-ReID suffers from additional inter-modality variance caused by the inherent heterogeneous gap. To solve the problem, we present a carefully designed dual Gaussian-based variational auto-encoder (DG-VAE), which disentangles an identity-discriminable and an identity-ambiguous cross-modality feature subspace, following a mixture-of-Gaussians (MoG) prior and a standard Gaussian distribution prior, respectively. Disentangling cross-modality identity-discriminable features leads to more robust retrieval for VI-ReID. To achieve efficient optimization as in a conventional VAE, we theoretically derive two variational inference terms for the MoG prior under the supervised setting, which not only restrict the identity-discriminable subspace so that the model explicitly handles the cross-modality intra-identity variance, but also enable the MoG distribution to avoid posterior collapse. Furthermore, we propose a triplet swap reconstruction (TSR) strategy to promote the above disentangling process. Extensive experiments demonstrate that our method outperforms state-of-the-art methods on two VI-ReID datasets.
Submitted 6 August, 2020;
originally announced August 2020.
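For the standard Gaussian prior mentioned in the abstract above, the KL term of a conventional VAE has a well-known closed form. The sketch below shows only that textbook term for a diagonal Gaussian posterior; it does not reproduce the MoG inference terms derived in the paper, and the function name and inputs are hypothetical.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv
        for m, lv in zip(mu, log_var)
    )


# Hypothetical encoder outputs for a 3-dimensional latent code.
mu = [0.5, -0.2, 0.0]
log_var = [0.0, -1.0, 0.3]
print(kl_to_standard_normal(mu, log_var))
```

The term is zero exactly when the posterior equals the prior (zero mean, unit variance) and grows as the posterior drifts away from it.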
-
On the Exploration of Convolutional Fusion Networks for Visual Recognition
Authors:
Yu Liu,
Yanming Guo,
Michael S. Lew
Abstract:
Despite recent advances in multi-scale deep representations, they remain limited by expensive parameters and weak fusion modules. Hence, we propose an efficient approach to fuse multi-scale deep representations, called convolutional fusion networks (CFN). By using 1$\times$1 convolution and global average pooling, CFN can efficiently generate the side branches while adding few parameters. In addition, we present a locally-connected fusion module, which can learn adaptive weights for the side branches and form a discriminatively fused feature. CFN models trained on the CIFAR and ImageNet datasets demonstrate remarkable improvements over plain CNNs. Furthermore, we generalize CFN to three new tasks: scene recognition, fine-grained recognition and image retrieval. Our experiments show that it obtains consistent improvements on these transfer tasks.
Submitted 16 November, 2016;
originally announced November 2016.
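The two building blocks named in the abstract above, 1x1 convolution and global average pooling, can be sketched in plain Python on nested lists. This is a generic illustration of the two operations, not the CFN architecture; the function names, the tiny feature map, and the weights are assumptions.

```python
def conv1x1(x, weights):
    """1x1 convolution: a per-pixel linear map across channels.

    x has shape [C_in][H][W]; weights has shape [C_out][C_in].
    """
    height, width = len(x[0]), len(x[0][0])
    return [
        [
            [sum(w * x[c][i][j] for c, w in enumerate(w_row))
             for j in range(width)]
            for i in range(height)
        ]
        for w_row in weights
    ]


def global_average_pool(x):
    """Reduce each [H][W] channel to its mean, giving one value per channel."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in x
    ]


# Hypothetical 2-channel, 2x2 feature map and 1x1 kernel weights.
feature_map = [[[1, 2], [3, 4]], [[10, 20], [30, 40]]]
weights = [[1.0, 0.0], [0.5, 0.5]]
side_branch = global_average_pool(conv1x1(feature_map, weights))
print(side_branch)
```

A 1x1 convolution needs only C_out × C_in weights and global average pooling needs none, which is consistent with the abstract's remark that the side branches add few parameters.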
-
Across Browsers SVG Implementation
Authors:
Liang Wang,
Nies Huijsmans,
Michael S. Lew,
Dan Tsymbala
Abstract:
In this work SVG is translated into VML or HTML using JavaScript based on the Backbase Client Framework. The goal of this project is to make SVG viewable in Internet Explorer without any plug-in and to have it work together with other Backbase Client Framework languages. The result of this project will be added as an extension to the current Backbase Client Framework.
Submitted 31 December, 2010;
originally announced January 2011.
-
Binary and nonbinary description of hypointensity in human brain MR images
Authors:
Xiaojing Chen,
Michael S. Lew
Abstract:
Accumulating evidence has shown that iron is involved in the mechanism underlying many neurodegenerative diseases, such as Alzheimer's disease, Parkinson's disease and Huntington's disease. Abnormal (higher) iron accumulation has been detected in the brains of most neurodegenerative patients, especially in the basal ganglia region. The presence of iron leads to changes in the MR signal in both magnitude and phase. Accordingly, tissues with high iron concentration appear hypo-intense (darker than usual) in MR contrasts. In this report, we propose an improved binary hypointensity description and a novel nonbinary hypointensity description based on principal component analysis. Moreover, Kendall's rank correlation coefficient is used to compare the complementary and redundant information provided by the two methods in order to better understand the individual descriptions of iron accumulation in the brain.
Submitted 31 December, 2010;
originally announced January 2011.
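Kendall's rank correlation coefficient, which the abstract above uses to compare the two descriptions, can be sketched as follows. This is the simple tau-a variant without a tie correction; the abstract does not say which variant the authors used, and the sample scores are made up for illustration.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # Tied pairs (sign == 0) count as neither in tau-a.
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical scores from two descriptions for the same four subjects.
binary_scores = [1, 2, 3, 4]
nonbinary_scores = [1, 3, 2, 4]
print(kendall_tau(binary_scores, nonbinary_scores))
```

A value near 1 would suggest the two descriptions rank subjects similarly (redundant information), while a value near 0 would suggest they capture different aspects.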
-
A Framework for Real-Time Face and Facial Feature Tracking using Optical Flow Pre-estimation and Template Tracking
Authors:
E. R. Gast,
Michael S. Lew
Abstract:
This work presents a framework for tracking head movements and capturing the movements of the mouth and both eyebrows in real time. We present a head tracker that combines an optical flow tracker and a template-based tracker. The estimate of the optical flow head tracker is used as the starting point for the template tracker, which fine-tunes the head estimate. This approach, together with re-updating the optical flow points, prevents the head tracker from drifting. This combination, together with our switching scheme, makes our tracker very robust against fast movement and motion blur. We also propose a way to reduce the influence of partial occlusion of the head. In both the optical flow and the template-based tracker we identify and exclude occluded points.
Submitted 31 December, 2010;
originally announced January 2011.
-
Analysis of Using Browser-native Technology to Build Rich Internet Applications for Image Manipulation
Authors:
Thomas Steenbergen,
Michael S. Lew
Abstract:
In this work we investigate whether browser-native technologies can be used to perform photo manipulation tasks, e.g., cropping, resizing or rotating an image, within current mainstream browsers. Using a case study, we analyze problems that occurred during the implementation of a prototype web application that utilizes browser-native web technology to create an online version of a real-world photo scrapbook. Implementing a prototype allows us to analyze the strengths and weaknesses of current web technology when it comes to browser-based image manipulation. Furthermore, we explore the possibilities of Ajax in combination with Canvas, SVG and VML to provide a more interactive graphical user interface for image manipulation tasks on the web.
Submitted 31 December, 2010;
originally announced January 2011.
-
Dynamic Feature Description in Human Action Recognition
Authors:
Ruoyun Gao,
Michael S. Lew,
Ling Shao
Abstract:
This work presents novel description methods for human action recognition. Generally, a video sequence can be represented as a collection of spatial-temporal words by detecting space-time interest points and describing the unique features around the detected points (Bag of Words representation). Interest points, as well as the cuboids around them, are considered informative for feature description in terms of both the structural distribution of interest points and the information content inside the cuboids. Our proposed description approaches build on this idea to make the feature descriptors more discriminative.
Submitted 31 December, 2010;
originally announced January 2011.
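The Bag of Words representation mentioned in the abstract above can be sketched as a nearest-codeword histogram. This is a generic illustration, not the descriptors proposed in the paper; the codebook, which would usually be learned by clustering, and the descriptor values are hypothetical.

```python
def bag_of_words(descriptors, codebook):
    """Normalized histogram of nearest-codeword assignments."""
    histogram = [0] * len(codebook)
    for descriptor in descriptors:
        # Squared Euclidean distance from this descriptor to each codeword.
        distances = [
            sum((a - b) ** 2 for a, b in zip(descriptor, word))
            for word in codebook
        ]
        histogram[distances.index(min(distances))] += 1
    total = sum(histogram)
    return [count / total for count in histogram] if total else histogram


# Hypothetical 2-D descriptors extracted around space-time interest points.
codebook = [[0, 0], [10, 10]]
descriptors = [[1, 0], [9, 9], [0, 1], [11, 10]]
print(bag_of_words(descriptors, codebook))
```

The resulting fixed-length histogram represents a whole video sequence regardless of how many interest points were detected.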