-
GenMix: Effective Data Augmentation with Generative Diffusion Model Image Editing
Authors:
Khawar Islam,
Muhammad Zaigham Zaheer,
Arif Mahmood,
Karthik Nandakumar,
Naveed Akhtar
Abstract:
Data augmentation is widely used to enhance generalization in visual classification tasks. However, traditional methods struggle when source and target domains differ, as in domain adaptation, due to their inability to address domain gaps. This paper introduces GenMix, a generalizable prompt-guided generative data augmentation approach that enhances both in-domain and cross-domain image classifica…
▽ More
Data augmentation is widely used to enhance generalization in visual classification tasks. However, traditional methods struggle when source and target domains differ, as in domain adaptation, due to their inability to address domain gaps. This paper introduces GenMix, a generalizable prompt-guided generative data augmentation approach that enhances both in-domain and cross-domain image classification. Our technique leverages image editing to generate augmented images based on custom conditional prompts, designed specifically for each problem type. By blending portions of the input image with its edited generative counterpart and incorporating fractal patterns, our approach mitigates unrealistic images and label ambiguity, improving the performance and adversarial robustness of the resulting models. Efficacy of our method is established with extensive experiments on eight public datasets for general and fine-grained classification, in both in-domain and cross-domain settings. Additionally, we demonstrate performance improvements for self-supervised learning, learning with data scarcity, and adversarial robustness. As compared to the existing state-of-the-art methods, our technique achieves stronger performance across the board.
△ Less
Submitted 5 December, 2024; v1 submitted 3 December, 2024;
originally announced December 2024.
-
A Survey of the Self Supervised Learning Mechanisms for Vision Transformers
Authors:
Asifullah Khan,
Anabia Sohail,
Mustansar Fiaz,
Mehdi Hassan,
Tariq Habib Afridi,
Sibghat Ullah Marwat,
Farzeen Munir,
Safdar Ali,
Hannan Naseem,
Muhammad Zaigham Zaheer,
Kamran Ali,
Tangina Sultana,
Ziaurrehman Tanoli,
Naeem Akhter
Abstract:
Vision Transformers (ViTs) have recently demonstrated remarkable performance in computer vision tasks. However, their parameter-intensive nature and reliance on large amounts of data for effective performance have shifted the focus from traditional human-annotated labels to unsupervised learning and pretraining strategies that uncover hidden structures within the data. In response to this challeng…
▽ More
Vision Transformers (ViTs) have recently demonstrated remarkable performance in computer vision tasks. However, their parameter-intensive nature and reliance on large amounts of data for effective performance have shifted the focus from traditional human-annotated labels to unsupervised learning and pretraining strategies that uncover hidden structures within the data. In response to this challenge, self-supervised learning (SSL) has emerged as a promising paradigm. SSL leverages inherent relationships within the data itself as a form of supervision, eliminating the need for manual labeling and offering a more scalable and resource-efficient alternative for model training. Given these advantages, it is imperative to explore the integration of SSL techniques with ViTs, particularly in scenarios with limited labeled data. Inspired by this evolving trend, this survey aims to systematically review SSL mechanisms tailored for ViTs. We propose a comprehensive taxonomy to classify SSL techniques based on their representations and pre-training tasks. Additionally, we discuss the motivations behind SSL, review prominent pre-training tasks, and highlight advancements and challenges in this field. Furthermore, we conduct a comparative analysis of various SSL methods designed for ViTs, evaluating their strengths, limitations, and applicability to different scenarios.
△ Less
Submitted 10 June, 2025; v1 submitted 30 August, 2024;
originally announced August 2024.
-
Modality Invariant Multimodal Learning to Handle Missing Modalities: A Single-Branch Approach
Authors:
Muhammad Saad Saeed,
Shah Nawaz,
Muhammad Zaigham Zaheer,
Muhammad Haris Khan,
Karthik Nandakumar,
Muhammad Haroon Yousaf,
Hassan Sajjad,
Tom De Schepper,
Markus Schedl
Abstract:
Multimodal networks have demonstrated remarkable performance improvements over their unimodal counterparts. Existing multimodal networks are designed in a multi-branch fashion that, due to the reliance on fusion strategies, exhibit deteriorated performance if one or more modalities are missing. In this work, we propose a modality invariant multimodal learning method, which is less susceptible to t…
▽ More
Multimodal networks have demonstrated remarkable performance improvements over their unimodal counterparts. Existing multimodal networks are designed in a multi-branch fashion that, due to the reliance on fusion strategies, exhibit deteriorated performance if one or more modalities are missing. In this work, we propose a modality invariant multimodal learning method, which is less susceptible to the impact of missing modalities. It consists of a single-branch network sharing weights across multiple modalities to learn inter-modality representations to maximize performance as well as robustness to missing modalities. Extensive experiments are performed on four challenging datasets including textual-visual (UPMC Food-101, Hateful Memes, Ferramenta) and audio-visual modalities (VoxCeleb1). Our proposed method achieves superior performance when all modalities are present as well as in the case of missing modalities during training or testing compared to the existing state-of-the-art methods.
△ Less
Submitted 14 August, 2024;
originally announced August 2024.
-
Chameleon: Images Are What You Need For Multimodal Learning Robust To Missing Modalities
Authors:
Muhammad Irzam Liaqat,
Shah Nawaz,
Muhammad Zaigham Zaheer,
Muhammad Saad Saeed,
Hassan Sajjad,
Tom De Schepper,
Karthik Nandakumar,
Muhammad Haris Khan Markus Schedl
Abstract:
Multimodal learning has demonstrated remarkable performance improvements over unimodal architectures. However, multimodal learning methods often exhibit deteriorated performances if one or more modalities are missing. This may be attributed to the commonly used multi-branch design containing modality-specific streams making the models reliant on the availability of a complete set of modalities. In…
▽ More
Multimodal learning has demonstrated remarkable performance improvements over unimodal architectures. However, multimodal learning methods often exhibit deteriorated performances if one or more modalities are missing. This may be attributed to the commonly used multi-branch design containing modality-specific streams making the models reliant on the availability of a complete set of modalities. In this work, we propose a robust textual-visual multimodal learning method, Chameleon, that completely deviates from the conventional multi-branch design. To enable this, we present the unification of input modalities into one format by encoding textual modality into visual representations. As a result, our approach does not require modality-specific branches to learn modality-independent multimodal representations making it robust to missing modalities. Extensive experiments are performed on four popular challenging datasets including Hateful Memes, UPMC Food-101, MM-IMDb, and Ferramenta. Chameleon not only achieves superior performance when all modalities are present at train/test time but also demonstrates notable resilience in the case of missing modalities.
△ Less
Submitted 23 July, 2024;
originally announced July 2024.
-
DiffuseMix: Label-Preserving Data Augmentation with Diffusion Models
Authors:
Khawar Islam,
Muhammad Zaigham Zaheer,
Arif Mahmood,
Karthik Nandakumar
Abstract:
Recently, a number of image-mixing-based augmentation techniques have been introduced to improve the generalization of deep neural networks. In these techniques, two or more randomly selected natural images are mixed together to generate an augmented image. Such methods may not only omit important portions of the input images but also introduce label ambiguities by mixing images across labels resu…
▽ More
Recently, a number of image-mixing-based augmentation techniques have been introduced to improve the generalization of deep neural networks. In these techniques, two or more randomly selected natural images are mixed together to generate an augmented image. Such methods may not only omit important portions of the input images but also introduce label ambiguities by mixing images across labels resulting in misleading supervisory signals. To address these limitations, we propose DiffuseMix, a novel data augmentation technique that leverages a diffusion model to reshape training images, supervised by our bespoke conditional prompts. First, concatenation of a partial natural image and its generated counterpart is obtained which helps in avoiding the generation of unrealistic images or label ambiguities. Then, to enhance resilience against adversarial attacks and improves safety measures, a randomly selected structural pattern from a set of fractal images is blended into the concatenated image to form the final augmented image for training. Our empirical results on seven different datasets reveal that DiffuseMix achieves superior performance compared to existing state-of the-art methods on tasks including general classification,fine-grained classification, fine-tuning, data scarcity, and adversarial robustness. Augmented datasets and codes are available here: https://diffusemix.github.io/
△ Less
Submitted 5 April, 2024;
originally announced May 2024.
-
Exploiting Autoencoder's Weakness to Generate Pseudo Anomalies
Authors:
Marcella Astrid,
Muhammad Zaigham Zaheer,
Djamila Aouada,
Seung-Ik Lee
Abstract:
Due to the rare occurrence of anomalous events, a typical approach to anomaly detection is to train an autoencoder (AE) with normal data only so that it learns the patterns or representations of the normal training data. At test time, the trained AE is expected to well reconstruct normal but to poorly reconstruct anomalous data. However, contrary to the expectation, anomalous data is often well re…
▽ More
Due to the rare occurrence of anomalous events, a typical approach to anomaly detection is to train an autoencoder (AE) with normal data only so that it learns the patterns or representations of the normal training data. At test time, the trained AE is expected to well reconstruct normal but to poorly reconstruct anomalous data. However, contrary to the expectation, anomalous data is often well reconstructed as well. In order to further separate the reconstruction quality between normal and anomalous data, we propose creating pseudo anomalies from learned adaptive noise by exploiting the aforementioned weakness of AE, i.e., reconstructing anomalies too well. The generated noise is added to the normal data to create pseudo anomalies. Extensive experiments on Ped2, Avenue, ShanghaiTech, CIFAR-10, and KDDCUP datasets demonstrate the effectiveness and generic applicability of our approach in improving the discriminative capability of AEs for anomaly detection.
△ Less
Submitted 17 May, 2024; v1 submitted 9 May, 2024;
originally announced May 2024.
-
Face-voice Association in Multilingual Environments (FAME) Challenge 2024 Evaluation Plan
Authors:
Muhammad Saad Saeed,
Shah Nawaz,
Muhammad Salman Tahir,
Rohan Kumar Das,
Muhammad Zaigham Zaheer,
Marta Moscati,
Markus Schedl,
Muhammad Haris Khan,
Karthik Nandakumar,
Muhammad Haroon Yousaf
Abstract:
The advancements of technology have led to the use of multimodal systems in various real-world applications. Among them, the audio-visual systems are one of the widely used multimodal systems. In the recent years, associating face and voice of a person has gained attention due to presence of unique correlation between them. The Face-voice Association in Multilingual Environments (FAME) Challenge 2…
▽ More
The advancements of technology have led to the use of multimodal systems in various real-world applications. Among them, the audio-visual systems are one of the widely used multimodal systems. In the recent years, associating face and voice of a person has gained attention due to presence of unique correlation between them. The Face-voice Association in Multilingual Environments (FAME) Challenge 2024 focuses on exploring face-voice association under a unique condition of multilingual scenario. This condition is inspired from the fact that half of the world's population is bilingual and most often people communicate under multilingual scenario. The challenge uses a dataset namely, Multilingual Audio-Visual (MAV-Celeb) for exploring face-voice association in multilingual environments. This report provides the details of the challenge, dataset, baselines and task details for the FAME Challenge.
△ Less
Submitted 22 July, 2024; v1 submitted 14 April, 2024;
originally announced April 2024.
-
Collaborative Learning of Anomalies with Privacy (CLAP) for Unsupervised Video Anomaly Detection: A New Baseline
Authors:
Anas Al-lahham,
Muhammad Zaigham Zaheer,
Nurbek Tastan,
Karthik Nandakumar
Abstract:
Unsupervised (US) video anomaly detection (VAD) in surveillance applications is gaining more popularity recently due to its practical real-world applications. As surveillance videos are privacy sensitive and the availability of large-scale video data may enable better US-VAD systems, collaborative learning can be highly rewarding in this setting. However, due to the extremely challenging nature of…
▽ More
Unsupervised (US) video anomaly detection (VAD) in surveillance applications is gaining more popularity recently due to its practical real-world applications. As surveillance videos are privacy sensitive and the availability of large-scale video data may enable better US-VAD systems, collaborative learning can be highly rewarding in this setting. However, due to the extremely challenging nature of the US-VAD task, where learning is carried out without any annotations, privacy-preserving collaborative learning of US-VAD systems has not been studied yet. In this paper, we propose a new baseline for anomaly detection capable of localizing anomalous events in complex surveillance videos in a fully unsupervised fashion without any labels on a privacy-preserving participant-based distributed training configuration. Additionally, we propose three new evaluation protocols to benchmark anomaly detection approaches on various scenarios of collaborations and data availability. Based on these protocols, we modify existing VAD datasets to extensively evaluate our approach as well as existing US SOTA methods on two large-scale datasets including UCF-Crime and XD-Violence. All proposed evaluation protocols, dataset splits, and codes are available here: https://github.com/AnasEmad11/CLAP
△ Less
Submitted 31 March, 2024;
originally announced April 2024.
-
Constricting Normal Latent Space for Anomaly Detection with Normal-only Training Data
Authors:
Marcella Astrid,
Muhammad Zaigham Zaheer,
Seung-Ik Lee
Abstract:
In order to devise an anomaly detection model using only normal training data, an autoencoder (AE) is typically trained to reconstruct the data. As a result, the AE can extract normal representations in its latent space. During test time, since AE is not trained using real anomalies, it is expected to poorly reconstruct the anomalous data. However, several researchers have observed that it is not…
▽ More
In order to devise an anomaly detection model using only normal training data, an autoencoder (AE) is typically trained to reconstruct the data. As a result, the AE can extract normal representations in its latent space. During test time, since AE is not trained using real anomalies, it is expected to poorly reconstruct the anomalous data. However, several researchers have observed that it is not the case. In this work, we propose to limit the reconstruction capability of AE by introducing a novel latent constriction loss, which is added to the existing reconstruction loss. By using our method, no extra computational cost is added to the AE during test time. Evaluations using three video anomaly detection benchmark datasets, i.e., Ped2, Avenue, and ShanghaiTech, demonstrate the effectiveness of our method in limiting the reconstruction capability of AE, which leads to a better anomaly detection model.
△ Less
Submitted 24 March, 2024;
originally announced March 2024.
-
DCTM: Dilated Convolutional Transformer Model for Multimodal Engagement Estimation in Conversation
Authors:
Vu Ngoc Tu,
Van Thong Huynh,
Hyung-Jeong Yang,
M. Zaigham Zaheer,
Shah Nawaz,
Karthik Nandakumar,
Soo-Hyung Kim
Abstract:
Conversational engagement estimation is posed as a regression problem, entailing the identification of the favorable attention and involvement of the participants in the conversation. This task arises as a crucial pursuit to gain insights into human's interaction dynamics and behavior patterns within a conversation. In this research, we introduce a dilated convolutional Transformer for modeling an…
▽ More
Conversational engagement estimation is posed as a regression problem, entailing the identification of the favorable attention and involvement of the participants in the conversation. This task arises as a crucial pursuit to gain insights into human's interaction dynamics and behavior patterns within a conversation. In this research, we introduce a dilated convolutional Transformer for modeling and estimating human engagement in the MULTIMEDIATE 2023 competition. Our proposed system surpasses the baseline models, exhibiting a noteworthy $7$\% improvement on test set and $4$\% on validation set. Moreover, we employ different modality fusion mechanism and show that for this type of data, a simple concatenated method with self-attention fusion gains the best performance.
△ Less
Submitted 31 July, 2023;
originally announced August 2023.
-
PseudoBound: Limiting the anomaly reconstruction capability of one-class classifiers using pseudo anomalies
Authors:
Marcella Astrid,
Muhammad Zaigham Zaheer,
Seung-Ik Lee
Abstract:
Due to the rarity of anomalous events, video anomaly detection is typically approached as one-class classification (OCC) problem. Typically in OCC, an autoencoder (AE) is trained to reconstruct the normal only training data with the expectation that, in test time, it can poorly reconstruct the anomalous data. However, previous studies have shown that, even trained with only normal data, AEs can of…
▽ More
Due to the rarity of anomalous events, video anomaly detection is typically approached as one-class classification (OCC) problem. Typically in OCC, an autoencoder (AE) is trained to reconstruct the normal only training data with the expectation that, in test time, it can poorly reconstruct the anomalous data. However, previous studies have shown that, even trained with only normal data, AEs can often reconstruct anomalous data as well, resulting in a decreased performance. To mitigate this problem, we propose to limit the anomaly reconstruction capability of AEs by incorporating pseudo anomalies during the training of an AE. Extensive experiments using five types of pseudo anomalies show the robustness of our training mechanism towards any kind of pseudo anomaly. Moreover, we demonstrate the effectiveness of our proposed pseudo anomaly based training approach against several existing state-ofthe-art (SOTA) methods on three benchmark video anomaly datasets, outperforming all the other reconstruction-based approaches in two datasets and showing the second best performance in the other dataset.
△ Less
Submitted 19 March, 2023;
originally announced March 2023.
-
Single-branch Network for Multimodal Training
Authors:
Muhammad Saad Saeed,
Shah Nawaz,
Muhammad Haris Khan,
Muhammad Zaigham Zaheer,
Karthik Nandakumar,
Muhammad Haroon Yousaf,
Arif Mahmood
Abstract:
With the rapid growth of social media platforms, users are sharing billions of multimedia posts containing audio, images, and text. Researchers have focused on building autonomous systems capable of processing such multimedia data to solve challenging multimodal tasks including cross-modal retrieval, matching, and verification. Existing works use separate networks to extract embeddings of each mod…
▽ More
With the rapid growth of social media platforms, users are sharing billions of multimedia posts containing audio, images, and text. Researchers have focused on building autonomous systems capable of processing such multimedia data to solve challenging multimodal tasks including cross-modal retrieval, matching, and verification. Existing works use separate networks to extract embeddings of each modality to bridge the gap between them. The modular structure of their branched networks is fundamental in creating numerous multimodal applications and has become a defacto standard to handle multiple modalities. In contrast, we propose a novel single-branch network capable of learning discriminative representation of unimodal as well as multimodal tasks without changing the network. An important feature of our single-branch network is that it can be trained either using single or multiple modalities without sacrificing performance. We evaluated our proposed single-branch network on the challenging multimodal problem (face-voice association) for cross-modal verification and matching tasks with various loss formulations. Experimental results demonstrate the superiority of our proposed single-branch network over the existing methods in a wide range of experiments. Code: https://github.com/msaadsaeed/SBNet
△ Less
Submitted 10 March, 2023;
originally announced March 2023.
-
Face Pyramid Vision Transformer
Authors:
Khawar Islam,
Muhammad Zaigham Zaheer,
Arif Mahmood
Abstract:
A novel Face Pyramid Vision Transformer (FPVT) is proposed to learn a discriminative multi-scale facial representations for face recognition and verification. In FPVT, Face Spatial Reduction Attention (FSRA) and Dimensionality Reduction (FDR) layers are employed to make the feature maps compact, thus reducing the computations. An Improved Patch Embedding (IPE) algorithm is proposed to exploit the…
▽ More
A novel Face Pyramid Vision Transformer (FPVT) is proposed to learn a discriminative multi-scale facial representations for face recognition and verification. In FPVT, Face Spatial Reduction Attention (FSRA) and Dimensionality Reduction (FDR) layers are employed to make the feature maps compact, thus reducing the computations. An Improved Patch Embedding (IPE) algorithm is proposed to exploit the benefits of CNNs in ViTs (e.g., shared weights, local context, and receptive fields) to model lower-level edges to higher-level semantic primitives. Within FPVT framework, a Convolutional Feed-Forward Network (CFFN) is proposed that extracts locality information to learn low level facial information. The proposed FPVT is evaluated on seven benchmark datasets and compared with ten existing state-of-the-art methods, including CNNs, pure ViTs, and Convolutional ViTs. Despite fewer parameters, FPVT has demonstrated excellent performance over the compared methods. Project page is available at https://khawar-islam.github.io/fpvt/
△ Less
Submitted 21 October, 2022;
originally announced October 2022.
-
Stabilizing Adversarially Learned One-Class Novelty Detection Using Pseudo Anomalies
Authors:
Muhammad Zaigham Zaheer,
Jin Ha Lee,
Arif Mahmood,
Marcella Astrid,
Seung-Ik Lee
Abstract:
Recently, anomaly scores have been formulated using reconstruction loss of the adversarially learned generators and/or classification loss of discriminators. Unavailability of anomaly examples in the training data makes optimization of such networks challenging. Attributed to the adversarial training, performance of such models fluctuates drastically with each training step, making it difficult to…
▽ More
Recently, anomaly scores have been formulated using reconstruction loss of the adversarially learned generators and/or classification loss of discriminators. Unavailability of anomaly examples in the training data makes optimization of such networks challenging. Attributed to the adversarial training, performance of such models fluctuates drastically with each training step, making it difficult to halt the training at an optimal point. In the current study, we propose a robust anomaly detection framework that overcomes such instability by transforming the fundamental role of the discriminator from identifying real vs. fake data to distinguishing good vs. bad quality reconstructions. For this purpose, we propose a method that utilizes the current state as well as an old state of the same generator to create good and bad quality reconstruction examples. The discriminator is trained on these examples to detect the subtle distortions that are often present in the reconstructions of anomalous data. In addition, we propose an efficient generic criterion to stop the training of our model, ensuring elevated performance. Extensive experiments performed on six datasets across multiple domains including image and video based anomaly detection, medical diagnosis, and network security, have demonstrated excellent performance of our approach.
△ Less
Submitted 25 March, 2022;
originally announced March 2022.
-
Clustering Aided Weakly Supervised Training to Detect Anomalous Events in Surveillance Videos
Authors:
Muhammad Zaigham Zaheer,
Arif Mahmood,
Marcella Astrid,
Seung-Ik Lee
Abstract:
Formulating learning systems for the detection of real-world anomalous events using only video-level labels is a challenging task mainly due to the presence of noisy labels as well as the rare occurrence of anomalous events in the training data. We propose a weakly supervised anomaly detection system which has multiple contributions including a random batch selection mechanism to reduce inter-batc…
▽ More
Formulating learning systems for the detection of real-world anomalous events using only video-level labels is a challenging task mainly due to the presence of noisy labels as well as the rare occurrence of anomalous events in the training data. We propose a weakly supervised anomaly detection system which has multiple contributions including a random batch selection mechanism to reduce inter-batch correlation and a normalcy suppression block which learns to minimize anomaly scores over normal regions of a video by utilizing the overall information available in a training batch. In addition, a clustering loss block is proposed to mitigate the label noise and to improve the representation learning for the anomalous and normal regions. This block encourages the backbone network to produce two distinct feature clusters representing normal and anomalous events. Extensive analysis of the proposed approach is provided using three popular anomaly detection datasets including UCF-Crime, ShanghaiTech, and UCSD Ped2. The experiments demonstrate a superior anomaly detection capability of our approach.
△ Less
Submitted 25 March, 2022;
originally announced March 2022.
-
Generative Cooperative Learning for Unsupervised Video Anomaly Detection
Authors:
Muhammad Zaigham Zaheer,
Arif Mahmood,
Muhammad Haris Khan,
Mattia Segu,
Fisher Yu,
Seung-Ik Lee
Abstract:
Video anomaly detection is well investigated in weakly-supervised and one-class classification (OCC) settings. However, unsupervised video anomaly detection methods are quite sparse, likely because anomalies are less frequent in occurrence and usually not well-defined, which when coupled with the absence of ground truth supervision, could adversely affect the performance of the learning algorithms…
▽ More
Video anomaly detection is well investigated in weakly-supervised and one-class classification (OCC) settings. However, unsupervised video anomaly detection methods are quite sparse, likely because anomalies are less frequent in occurrence and usually not well-defined, which when coupled with the absence of ground truth supervision, could adversely affect the performance of the learning algorithms. This problem is challenging yet rewarding as it can completely eradicate the costs of obtaining laborious annotations and enable such systems to be deployed without human intervention. To this end, we propose a novel unsupervised Generative Cooperative Learning (GCL) approach for video anomaly detection that exploits the low frequency of anomalies towards building a cross-supervision between a generator and a discriminator. In essence, both networks get trained in a cooperative fashion, thereby allowing unsupervised learning. We conduct extensive experiments on two large-scale video anomaly detection datasets, UCF crime, and ShanghaiTech. Consistent improvement over the existing state-of-the-art unsupervised and OCC methods corroborate the effectiveness of our approach.
△ Less
Submitted 8 March, 2022;
originally announced March 2022.
-
Synthetic Temporal Anomaly Guided End-to-End Video Anomaly Detection
Authors:
Marcella Astrid,
Muhammad Zaigham Zaheer,
Seung-Ik Lee
Abstract:
Due to the limited availability of anomaly examples, video anomaly detection is often seen as one-class classification (OCC) problem. A popular way to tackle this problem is by utilizing an autoencoder (AE) trained only on normal data. At test time, the AE is then expected to reconstruct the normal input well while reconstructing the anomalies poorly. However, several studies show that, even with…
▽ More
Due to the limited availability of anomaly examples, video anomaly detection is often seen as one-class classification (OCC) problem. A popular way to tackle this problem is by utilizing an autoencoder (AE) trained only on normal data. At test time, the AE is then expected to reconstruct the normal input well while reconstructing the anomalies poorly. However, several studies show that, even with normal data only training, AEs can often start reconstructing anomalies as well which depletes their anomaly detection performance. To mitigate this, we propose a temporal pseudo anomaly synthesizer that generates fake-anomalies using only normal data. An AE is then trained to maximize the reconstruction loss on pseudo anomalies while minimizing this loss on normal data. This way, the AE is encouraged to produce distinguishable reconstructions for normal and anomalous frames. Extensive experiments and analysis on three challenging video anomaly datasets demonstrate the effectiveness of our approach to improve the basic AEs in achieving superiority against several existing state-of-the-art models.
△ Less
Submitted 19 October, 2021;
originally announced October 2021.
-
Learning Not to Reconstruct Anomalies
Authors:
Marcella Astrid,
Muhammad Zaigham Zaheer,
Jae-Yeong Lee,
Seung-Ik Lee
Abstract:
Video anomaly detection is often seen as one-class classification (OCC) problem due to the limited availability of anomaly examples. Typically, to tackle this problem, an autoencoder (AE) is trained to reconstruct the input with training set consisting only of normal data. At test time, the AE is then expected to well reconstruct the normal data while poorly reconstructing the anomalous data. Howe…
▽ More
Video anomaly detection is often seen as one-class classification (OCC) problem due to the limited availability of anomaly examples. Typically, to tackle this problem, an autoencoder (AE) is trained to reconstruct the input with training set consisting only of normal data. At test time, the AE is then expected to well reconstruct the normal data while poorly reconstructing the anomalous data. However, several studies have shown that, even with only normal data training, AEs can often start reconstructing anomalies as well which depletes the anomaly detection performance. To mitigate this problem, we propose a novel methodology to train AEs with the objective of reconstructing only normal data, regardless of the input (i.e., normal or abnormal). Since no real anomalies are available in the OCC settings, the training is assisted by pseudo anomalies that are generated by manipulating normal data to simulate the out-of-normal-data distribution. We additionally propose two ways to generate pseudo anomalies: patch and skip frame based. Extensive experiments on three challenging video anomaly datasets demonstrate the effectiveness of our method in improving conventional AEs, achieving state-of-the-art performance.
△ Less
Submitted 24 October, 2021; v1 submitted 19 October, 2021;
originally announced October 2021.
-
Deep Visual Anomaly detection with Negative Learning
Authors:
Jin-Ha Lee,
Marcella Astrid,
Muhammad Zaigham Zaheer,
Seung-Ik Lee
Abstract:
With the increase in the learning capability of deep convolution-based architectures, various applications of such models have been proposed over time. In the field of anomaly detection, improvements in deep learning opened new prospects of exploration for the researchers whom tried to automate the labor-intensive features of data collection. First, in terms of data collection, it is impossible to…
▽ More
With the increase in the learning capability of deep convolution-based architectures, various applications of such models have been proposed over time. In the field of anomaly detection, improvements in deep learning opened new prospects of exploration for the researchers whom tried to automate the labor-intensive features of data collection. First, in terms of data collection, it is impossible to anticipate all the anomalies that might exist in a given environment. Second, assuming we limit the possibilities of anomalies, it will still be hard to record all these scenarios for the sake of training a model. Third, even if we manage to record a significant amount of abnormal data, it's laborious to annotate this data on pixel or even frame level. Various approaches address the problem by proposing one-class classification using generative models trained on only normal data. In such methods, only the normal data is used, which is abundantly available and doesn't require significant human input. However, these are trained with only normal data and at the test time, given abnormal data as input, may often generate normal-looking output. This happens due to the hallucination characteristic of generative models. Next, these systems are designed to not use abnormal examples during the training. In this paper, we propose anomaly detection with negative learning (ADNL), which employs the negative learning concept for the enhancement of anomaly detection by utilizing a very small number of labeled anomaly data as compared with the normal data during training. The idea is to limit the reconstruction capability of a generative model using the given a small amount of anomaly examples. This way, the network not only learns to reconstruct normal data but also encloses the normal distribution far from the possible distribution of anomalies.
△ Less
Submitted 23 May, 2021;
originally announced May 2021.
-
Cleaning Label Noise with Clusters for Minimally Supervised Anomaly Detection
Authors:
Muhammad Zaigham Zaheer,
Jin-ha Lee,
Marcella Astrid,
Arif Mahmood,
Seung-Ik Lee
Abstract:
Learning to detect real-world anomalous events using video-level annotations is a difficult task mainly because of the noise present in labels. An anomalous labelled video may actually contain anomaly only in a short duration while the rest of the video can be normal. In the current work, we formulate a weakly supervised anomaly detection method that is trained using only video-level labels. To th…
▽ More
Learning to detect real-world anomalous events using video-level annotations is a difficult task mainly because of the noise present in labels. An anomalous labelled video may actually contain anomaly only in a short duration while the rest of the video can be normal. In the current work, we formulate a weakly supervised anomaly detection method that is trained using only video-level labels. To this end, we propose to utilize binary clustering which helps in mitigating the noise present in the labels of anomalous videos. Our formulation encourages both the main network and the clustering to complement each other in achieving the goal of weakly supervised training. The proposed method yields 78.27% and 84.16% frame-level AUC on UCF-crime and ShanghaiTech datasets respectively, demonstrating its superiority over existing state-of-the-art algorithms.
△ Less
Submitted 30 April, 2021;
originally announced April 2021.
-
CLAWS: Clustering Assisted Weakly Supervised Learning with Normalcy Suppression for Anomalous Event Detection
Authors:
Muhammad Zaigham Zaheer,
Arif Mahmood,
Marcella Astrid,
Seung-Ik Lee
Abstract:
Learning to detect real-world anomalous events through video-level labels is a challenging task due to the rare occurrence of anomalies as well as noise in the labels. In this work, we propose a weakly supervised anomaly detection method which has manifold contributions including1) a random batch based training procedure to reduce inter-batch correlation, 2) a normalcy suppression mechanism to min…
▽ More
Learning to detect real-world anomalous events through video-level labels is a challenging task due to the rare occurrence of anomalies as well as noise in the labels. In this work, we propose a weakly supervised anomaly detection method which has manifold contributions including1) a random batch based training procedure to reduce inter-batch correlation, 2) a normalcy suppression mechanism to minimize anomaly scores of the normal regions of a video by taking into account the overall information available in one training batch, and 3) a clustering distance based loss to contribute towards mitigating the label noise and to produce better anomaly representations by encouraging our model to generate distinct normal and anomalous clusters. The proposed method obtains83.03% and 89.67% frame-level AUC performance on the UCF Crime and ShanghaiTech datasets respectively, demonstrating its superiority over the existing state-of-the-art algorithms.
△ Less
Submitted 4 August, 2021; v1 submitted 24 November, 2020;
originally announced November 2020.
-
A Self-Reasoning Framework for Anomaly Detection Using Video-Level Labels
Authors:
Muhammad Zaigham Zaheer,
Arif Mahmood,
Hochul Shin,
Seung-Ik Lee
Abstract:
Anomalous event detection in surveillance videos is a challenging and practical research problem among image and video processing community. Compared to the frame-level annotations of anomalous events, obtaining video-level annotations is quite fast and cheap though such high-level labels may contain significant noise. More specifically, an anomalous labeled video may actually contain anomaly only…
▽ More
Anomalous event detection in surveillance videos is a challenging and practical research problem among image and video processing community. Compared to the frame-level annotations of anomalous events, obtaining video-level annotations is quite fast and cheap though such high-level labels may contain significant noise. More specifically, an anomalous labeled video may actually contain anomaly only in a short duration while the rest of the video frames may be normal. In the current work, we propose a weakly supervised anomaly detection framework based on deep neural networks which is trained in a self-reasoning fashion using only video-level labels. To carry out the self-reasoning based training, we generate pseudo labels by using binary clustering of spatio-temporal video features which helps in mitigating the noise present in the labels of anomalous videos. Our proposed formulation encourages both the main network and the clustering to complement each other in achieving the goal of more accurate anomaly detection. The proposed framework has been evaluated on publicly available real-world anomaly detection datasets including UCF-crime, ShanghaiTech and UCSD Ped2. The experiments demonstrate superiority of our proposed framework over the current state-of-the-art methods.
△ Less
Submitted 26 August, 2020;
originally announced August 2020.
-
Old is Gold: Redefining the Adversarially Learned One-Class Classifier Training Paradigm
Authors:
Muhammad Zaigham Zaheer,
Jin-ha Lee,
Marcella Astrid,
Seung-Ik Lee
Abstract:
A popular method for anomaly detection is to use the generator of an adversarial network to formulate anomaly scores over reconstruction loss of input. Due to the rare occurrence of anomalies, optimizing such networks can be a cumbersome task. Another possible approach is to use both generator and discriminator for anomaly detection. However, attributed to the involvement of adversarial training,…
▽ More
A popular method for anomaly detection is to use the generator of an adversarial network to formulate anomaly scores over reconstruction loss of input. Due to the rare occurrence of anomalies, optimizing such networks can be a cumbersome task. Another possible approach is to use both generator and discriminator for anomaly detection. However, attributed to the involvement of adversarial training, this model is often unstable in a way that the performance fluctuates drastically with each training step. In this study, we propose a framework that effectively generates stable results across a wide range of training steps and allows us to use both the generator and the discriminator of an adversarial model for efficient and robust anomaly detection. Our approach transforms the fundamental role of a discriminator from identifying real and fake data to distinguishing between good and bad quality reconstructions. To this end, we prepare training examples for the good quality reconstruction by employing the current generator, whereas poor quality examples are obtained by utilizing an old state of the same generator. This way, the discriminator learns to detect subtle distortions that often appear in reconstructions of the anomaly inputs. Extensive experiments performed on Caltech-256 and MNIST image datasets for novelty detection show superior results. Furthermore, on UCSD Ped2 video dataset for anomaly detection, our model achieves a frame-level AUC of 98.1%, surpassing recent state-of-the-art methods.
△ Less
Submitted 19 June, 2020; v1 submitted 16 April, 2020;
originally announced April 2020.