Search | arXiv e-print repository

Visual Modality Prompt for Adapting Vision-Language Object Detectors

Authors: Heitor R. Medeiros, Atif Belal, Srikanth Muralidharan, Eric Granger, Marco Pedersoli

Abstract: The zero-shot performance of object detectors degrades when tested on different modalities, such as infrared and depth. While recent work has explored image translation techniques to adapt detectors to new modalities, these methods are limited to a single modality and apply only to traditional detectors. Recently, vision-language detectors, such as YOLO-World and Grounding DINO, have shown promising zero-shot capabilities; however, they have not yet been adapted for other visual modalities. Traditional fine-tuning approaches compromise the zero-shot capabilities of the detectors. The visual prompt strategies commonly used for classification with vision-language models apply the same linear prompt translation to each image, making them less effective. To address these limitations, we propose ModPrompt, a visual prompt strategy to adapt vision-language detectors to new modalities without degrading zero-shot performance. In particular, an encoder-decoder visual prompt strategy is proposed, further enhanced by the integration of an inference-friendly modality prompt decoupled residual, facilitating a more robust adaptation. Empirical benchmarking of our method for modality adaptation on two vision-language detectors, YOLO-World and Grounding DINO, and on challenging infrared (LLVIP, FLIR) and depth (NYUv2) datasets shows that it achieves performance comparable to full fine-tuning while preserving the model's zero-shot capability. Code available at: https://github.com/heitorrapela/ModPrompt.

Submitted 14 March, 2025; v1 submitted 30 November, 2024; originally announced December 2024.

arXiv:2403.09918 [pdf, other]

Attention-based Class-Conditioned Alignment for Multi-Source Domain Adaptation of Object Detectors

Authors: Atif Belal, Akhil Meethal, Francisco Perdigon Romero, Marco Pedersoli, Eric Granger

Abstract: Domain adaptation methods for object detection (OD) strive to mitigate the impact of distribution shifts by promoting feature alignment across source and target domains. Multi-source domain adaptation (MSDA) allows leveraging multiple annotated source datasets and unlabeled target data to improve the accuracy and robustness of the detection model. Most state-of-the-art MSDA methods for OD perform feature alignment in a class-agnostic manner. This is challenging since the objects have unique modality information due to variations in object appearance across domains. A recent prototype-based approach proposed a class-wise alignment, yet it suffers from error accumulation caused by noisy pseudo-labels that can negatively affect adaptation with imbalanced data. To overcome these limitations, we propose an attention-based class-conditioned alignment method for MSDA, designed to align instances of each object category across domains. In particular, an attention module combined with an adversarial domain classifier allows learning domain-invariant and class-specific instance representations. Experimental results on multiple benchmarking MSDA datasets indicate that our method outperforms state-of-the-art methods and exhibits robustness to class imbalance, achieved through a conceptually simple class-conditioning strategy. Our code is available at: https://github.com/imatif17/ACIA.

Submitted 11 December, 2024; v1 submitted 14 March, 2024; originally announced March 2024.

Comments: arXiv admin note: substantial text overlap with arXiv:2309.14950

arXiv:2309.14950 [pdf, other]

Multi-Source Domain Adaptation for Object Detection with Prototype-based Mean-teacher

Authors: Atif Belal, Akhil Meethal, Francisco Perdigon Romero, Marco Pedersoli, Eric Granger

Abstract: Adapting visual object detectors to operational target domains is a challenging task, commonly achieved using unsupervised domain adaptation (UDA) methods. Recent studies have shown that when the labeled dataset comes from multiple source domains, treating them as separate domains and performing a multi-source domain adaptation (MSDA) improves the accuracy and robustness over blending these source domains and performing a UDA. For adaptation, existing MSDA methods learn domain-invariant and domain-specific parameters (for each source domain). However, unlike single-source UDA methods, learning domain-specific parameters makes them grow significantly in proportion to the number of source domains. This paper proposes a novel MSDA method called Prototype-based Mean Teacher (PMT), which uses class prototypes instead of domain-specific subnets to encode domain-specific information. These prototypes are learned using a contrastive loss, aligning the same categories across domains and separating different categories far apart. Given the use of prototypes, the number of parameters required for our PMT method does not increase significantly with the number of source domains, thus reducing memory issues and possible overfitting. Empirical studies indicate that PMT outperforms state-of-the-art MSDA methods on several challenging object detection datasets. Our code is available at https://github.com/imatif17/Prototype-Mean-Teacher.

Submitted 31 July, 2024; v1 submitted 26 September, 2023; originally announced September 2023.

arXiv:2307.06979 [pdf, other]

Tackling Fake News in Bengali: Unraveling the Impact of Summarization vs. Augmentation on Pre-trained Language Models

Authors: Arman Sakif Chowdhury, G. M. Shahariar, Ahammed Tarik Aziz, Syed Mohibul Alam, Md. Azad Sheikh, Tanveer Ahmed Belal

Abstract: With the rise of social media and online news sources, fake news has become a significant issue globally. However, the detection of fake news in low-resource languages like Bengali has received limited attention in research. In this paper, we propose a methodology consisting of four distinct approaches to classify fake news articles in Bengali using summarization and augmentation techniques with five pre-trained language models. Our approach includes translating English news articles and using augmentation techniques to curb the deficit of fake news articles. Our research also focused on summarizing the news to tackle the token length limitation of BERT-based models. Through extensive experimentation and rigorous evaluation, we show the effectiveness of summarization and augmentation in the case of Bengali fake news detection. We evaluated our models using three separate test datasets. The BanglaBERT Base model, when combined with augmentation techniques, achieved an accuracy of 96% on the first test dataset. On the second test dataset, the BanglaBERT model, trained with summarized augmented news articles, achieved 97% accuracy. Lastly, the mBERT Base model achieved an accuracy of 86% on the third test dataset, which was reserved for generalization performance evaluation. The datasets and implementations are available at https://github.com/arman-sakif/Bengali-Fake-News-Detection

Submitted 14 May, 2024; v1 submitted 13 July, 2023; originally announced July 2023.

Comments: Under Review

arXiv:2304.04087 [pdf, other]

doi 10.1109/ECCE57851.2023.10101588

Interpretable Multi Labeled Bengali Toxic Comments Classification using Deep Learning

Authors: Tanveer Ahmed Belal, G. M. Shahariar, Md. Hasanul Kabir

Abstract: This paper presents a deep learning-based pipeline for categorizing Bengali toxic comments, in which a binary classification model is first used to determine whether a comment is toxic or not, and then a multi-label classifier is employed to determine which toxicity type the comment belongs to. For this purpose, we have prepared a manually labeled dataset consisting of 16,073 instances, among which 8,488 are toxic; any toxic comment may correspond to one or more of six toxic categories simultaneously: vulgar, hate, religious, threat, troll, and insult. Long Short Term Memory (LSTM) with BERT Embedding achieved 89.42% accuracy for the binary classification task, while as a multi-label classifier, a combination of Convolutional Neural Network and Bi-directional Long Short Term Memory (CNN-BiLSTM) with attention mechanism achieved 78.92% accuracy and a weighted F1-score of 0.86. To explain the predictions and interpret the word feature importance during classification by the proposed models, we utilized the Local Interpretable Model-Agnostic Explanations (LIME) framework. We have made our dataset public; it can be accessed at https://github.com/deepu099cse/Multi-Labeled-Bengali-Toxic-Comments-Classification

Submitted 8 April, 2023; originally announced April 2023.

Journal ref: 2023 International Conference on Electrical, Computer and Communication Engineering (ECCE)

arXiv:2101.07308 [pdf, other]

doi 10.1016/j.imavis.2021.104096

Knowledge Distillation Methods for Efficient Unsupervised Adaptation Across Multiple Domains

Authors: Le Thanh Nguyen-Meidine, Atif Belal, Madhu Kiran, Jose Dolz, Louis-Antoine Blais-Morin, Eric Granger

Abstract: Beyond the complexity of CNNs that require training on large annotated datasets, the domain shift between design and operational data has limited the adoption of CNNs in many real-world applications. For instance, in person re-identification, videos are captured over a distributed set of cameras with non-overlapping viewpoints. The shift between the source (e.g. lab setting) and target (e.g. cameras) domains may lead to a significant decline in recognition accuracy. Additionally, state-of-the-art CNNs may not be suitable for such real-time applications given their computational requirements. Although several techniques have recently been proposed to address domain shift problems through unsupervised domain adaptation (UDA), or to accelerate/compress CNNs through knowledge distillation (KD), we seek to simultaneously adapt and compress CNNs to generalize well across multiple target domains. In this paper, we propose a progressive KD approach for unsupervised single-target DA (STDA) and multi-target DA (MTDA) of CNNs. Our method for KD-STDA adapts a CNN to a single target domain by distilling from a larger teacher CNN, trained on both target and source domain data in order to maintain its consistency with a common representation. Our proposed approach is compared against state-of-the-art methods for compression and STDA of CNNs on the Office31 and ImageClef-DA image classification datasets. It is also compared against state-of-the-art methods for MTDA on Digits, Office31, and OfficeHome. In both settings -- KD-STDA and KD-MTDA -- results indicate that our approach can achieve the highest level of accuracy across target domains, while requiring a comparable or lower CNN complexity.

Submitted 18 January, 2021; originally announced January 2021.

Comments: This is the extended journal version of arXiv:2005.07839

arXiv:2007.07077 [pdf, other]

Unsupervised Multi-Target Domain Adaptation Through Knowledge Distillation

Authors: Le Thanh Nguyen-Meidine, Atif Belal, Madhu Kiran, Jose Dolz, Louis-Antoine Blais-Morin, Eric Granger

Abstract: Unsupervised domain adaptation (UDA) seeks to alleviate the problem of domain shift between the distribution of unlabeled data from the target domain w.r.t. labeled data from the source domain. While the single-target UDA scenario is well studied in the literature, Multi-Target Domain Adaptation (MTDA) remains largely unexplored despite its practical importance, e.g., in multi-camera video-surveillance applications. The MTDA problem can be addressed by adapting one specialized model per target domain, although this solution is too costly in many real-world applications. Blending multiple targets for MTDA has been proposed, yet this solution may lead to a reduction in model specificity and accuracy. In this paper, we propose a novel unsupervised MTDA approach to train a CNN that can generalize well across multiple target domains. Our Multi-Teacher MTDA (MT-MTDA) method relies on multi-teacher knowledge distillation (KD) to iteratively distill target domain knowledge from multiple teachers to a common student. The KD process is performed in a progressive manner, where the student is trained by each teacher on how to perform UDA for a specific target, instead of directly learning domain adapted features. Finally, instead of combining the knowledge from each teacher, MT-MTDA alternates between teachers that distill knowledge, thereby preserving the specificity of each target (teacher) when learning to adapt to the student. MT-MTDA is compared against state-of-the-art methods on several challenging UDA benchmarks, and empirical results show that our proposed model can provide a considerably higher level of accuracy across multiple target domains. Our code is available at: https://github.com/LIVIAETS/MT-MTDA

Submitted 19 November, 2020; v1 submitted 14 July, 2020; originally announced July 2020.

Comments: Accepted for WACV2021

arXiv:1611.00027 [pdf]

doi 10.5121/ijnlc.2015.4301

CBAS: context based arabic stemmer

Authors: Mahmoud El-Defrawy, Yasser El-Sonbaty, Nahla A. Belal

Abstract: Arabic morphology encapsulates many valuable features, such as the word root. Arabic roots are utilized for many tasks; the process of extracting a word root is referred to as stemming. Stemming is an essential part of most Natural Language Processing tasks, especially for derivative languages such as Arabic. However, stemming is faced with the problem of ambiguity, where two or more roots could be extracted from the same word. On the other hand, distributional semantics is a powerful co-occurrence model. It captures the meaning of a word based on its context. In this paper, a distributional semantics model utilizing Smoothed Pointwise Mutual Information (SPMI) is constructed to investigate its effectiveness on the stemming analysis task. It showed an accuracy of 81.5%, with at least a 9.4% improvement over other stemmers.

Submitted 28 October, 2015; originally announced November 2016.

Journal ref: International Journal on Natural Language Computing (IJNLC) Vol. 4, No.3, June 2015

arXiv:cs/0607138 [pdf]

A Foundation to Perception Computing, Logic and Automata

Authors: Mohamed A. Belal

Abstract: In this report, a novel approach to intelligence and learning is introduced; this approach is based on what we call 'perception logic'. Based on this logic, a computing mechanism and automata are introduced. Multi-resolution analysis of perceptual information is given, in which learning is accomplished in at most O(log(N)) epochs, where N is the number of samples, and the convergence is guaranteed. This approach combines the merits of computational models, in the sense that they are structured and mathematically well-defined, with the adaptivity of soft computing approaches, in addition to the continuity and real-time response of dynamical systems.

Submitted 30 July, 2006; originally announced July 2006.

Comments: 39 pages, pdf format, to be published

ACM Class: I.2.0; I.2.6

arXiv:cs/0509015 [pdf, other]

Optimal Prefix Codes with Fewer Distinct Codeword Lengths are Faster to Construct

Authors: Ahmed Belal, Amr Elmasry

Abstract: A new method for constructing minimum-redundancy binary prefix codes is described. Our method does not explicitly build a Huffman tree; instead it uses a property of optimal prefix codes to compute the codeword lengths corresponding to the input weights. Let $n$ be the number of weights and $k$ be the number of distinct codeword lengths as produced by the algorithm for the optimum codes. The running time of our algorithm is $O(k \cdot n)$. Following our previous work in \cite{be}, no algorithm can possibly construct optimal prefix codes in $o(k \cdot n)$ time. When the given weights are presorted our algorithm performs $O(9^k \cdot \log^{2k}{n})$ comparisons.

Submitted 29 September, 2016; v1 submitted 6 September, 2005; originally announced September 2005.

Comments: 23 pages, a preliminary version appeared in STACS 2006

arXiv:cs/0403023 [pdf]

Secure Transmission of Sensitive data using multiple channels

Authors: Ahmed A. Belal, Abdelhamid S. Abdelhamid

Abstract: A new scheme for transmitting sensitive data is proposed. The scheme partitions the output of a block encryption module among a set of channels using the Chinese Remainder Theorem. The purpose of using the Chinese Remainder Theorem is to hide the ciphertext in order to increase the difficulty of attacking the cipher. The theory, implementation, and security of this scheme are described in this paper.

Submitted 14 March, 2004; originally announced March 2004.

Comments: 5 pages

ACM Class: C.2.0
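
As a rough illustration of the idea summarized in the abstract above (not the authors' scheme, whose details are not given in this listing), the following Python sketch treats one cipher block as an integer, splits it into residues modulo pairwise-coprime moduli with one residue per channel, and recombines the residues with the Chinese Remainder Theorem. The function names, the choice of moduli, and the sample block value are all assumptions made for illustration.

```python
from math import gcd, prod
from typing import List, Sequence


def split_block(block: int, moduli: Sequence[int]) -> List[int]:
    """Split an integer block into one residue per channel (hypothetical helper)."""
    # CRT reconstruction is unique only if the moduli are pairwise coprime.
    for i, a in enumerate(moduli):
        for b in moduli[i + 1:]:
            if gcd(a, b) != 1:
                raise ValueError("moduli must be pairwise coprime")
    # The block must be smaller than the product of the moduli.
    if not 0 <= block < prod(moduli):
        raise ValueError("block out of range for these moduli")
    return [block % m for m in moduli]


def recombine(residues: Sequence[int], moduli: Sequence[int]) -> int:
    """Rebuild the block from all channel residues via the CRT (hypothetical helper)."""
    total = prod(moduli)
    result = 0
    for r, m in zip(residues, moduli):
        partial = total // m
        # pow(partial, -1, m) is the modular inverse of partial mod m (Python 3.8+).
        result += r * partial * pow(partial, -1, m)
    return result % total


# Assumed example: four channels and a 32-bit sample block.
MODULI = [251, 253, 255, 256]   # pairwise coprime; product exceeds 2**31
BLOCK = 0xDEADBEEF              # stands in for one ciphertext block
SHARES = split_block(BLOCK, MODULI)
RECOVERED = recombine(SHARES, MODULI)
```

In this sketch a receiver needs the residues from every channel to recover the block exactly; a single residue only reveals the block modulo one small modulus.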

Showing 1–11 of 11 results for author: Belal, A