Showing 1–50 of 75 results for author: Cholakkal, H

  1. arXiv:2506.07032  [pdf, ps, other]

    cs.CL cs.CV

    A Culturally-diverse Multilingual Multimodal Video Benchmark & Model

    Authors: Bhuiyan Sanjid Shafique, Ashmal Vayani, Muhammad Maaz, Hanoona Abdul Rasheed, Dinura Dissanayake, Mohammed Irfan Kurpath, Yahya Hmaiti, Go Inoue, Jean Lahoud, Md. Safirur Rashid, Shadid Intisar Quasem, Maheen Fatima, Franco Vidal, Mykola Maslych, Ketan Pravin More, Sanoojan Baliah, Hasindri Watawana, Yuhao Li, Fabian Farestam, Leon Schaller, Roman Tymtsiv, Simon Weber, Hisham Cholakkal, Ivan Laptev, Shin'ichi Satoh , et al. (4 additional authors not shown)

    Abstract: Large multimodal models (LMMs) have recently gained attention due to their effectiveness in understanding and generating descriptions of visual content. Most existing LMMs are in the English language. While a few recent works explore multilingual image LMMs, to the best of our knowledge, moving beyond the English language for cultural and linguistic inclusivity is yet to be investigated in the context of vid…

    Submitted 8 June, 2025; originally announced June 2025.

  2. arXiv:2505.24876  [pdf, ps, other]

    cs.CV cs.CL

    Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks

    Authors: Tajamul Ashraf, Amal Saqib, Hanan Ghani, Muhra AlMahri, Yuhao Li, Noor Ahsan, Umair Nawaz, Jean Lahoud, Hisham Cholakkal, Mubarak Shah, Philip Torr, Fahad Shahbaz Khan, Rao Muhammad Anwer, Salman Khan

    Abstract: Deep reasoning is fundamental for solving complex tasks, especially in vision-centric scenarios that demand sequential, multimodal understanding. However, existing benchmarks typically evaluate agents with fully synthetic, single-turn queries, limited visual modalities, and lack a framework to assess reasoning quality over multiple steps as required in real-world settings. To address this, we intr…

    Submitted 30 May, 2025; originally announced May 2025.

  3. arXiv:2505.18152  [pdf, ps, other]

    cs.CL

    Fann or Flop: A Multigenre, Multiera Benchmark for Arabic Poetry Understanding in LLMs

    Authors: Wafa Alghallabi, Ritesh Thawkar, Sara Ghaboura, Ketan More, Omkar Thawakar, Hisham Cholakkal, Salman Khan, Rao Muhammad Anwer

    Abstract: Arabic poetry is one of the richest and most culturally rooted forms of expression in the Arabic language, known for its layered meanings, stylistic diversity, and deep historical continuity. Although large language models (LLMs) have demonstrated strong performance across languages and tasks, their ability to understand Arabic poetry remains largely unexplored. In this work, we introduce \emph{Fa…

    Submitted 26 May, 2025; v1 submitted 23 May, 2025; originally announced May 2025.

    Comments: GitHub: https://github.com/mbzuai-oryx/FannOrFlop, Dataset: https://huggingface.co/datasets/omkarthawakar/FannOrFlop

  4. arXiv:2505.17021  [pdf, ps, other]

    cs.CV

    ARB: A Comprehensive Arabic Multimodal Reasoning Benchmark

    Authors: Sara Ghaboura, Ketan More, Wafa Alghallabi, Omkar Thawakar, Jorma Laaksonen, Hisham Cholakkal, Salman Khan, Rao Muhammad Anwer

    Abstract: As Large Multimodal Models (LMMs) become more capable, there is growing interest in evaluating their reasoning processes alongside their final outputs. However, most benchmarks remain focused on English, overlooking languages with rich linguistic and cultural contexts, such as Arabic. To address this gap, we introduce the Comprehensive Arabic Multimodal Reasoning Benchmark (ARB), the first benchma…

    Submitted 22 May, 2025; originally announced May 2025.

    Comments: GitHub: https://github.com/mbzuai-oryx/ARB, Hugging Face: https://huggingface.co/datasets/MBZUAI/ARB

  5. arXiv:2505.14846  [pdf, ps, other]

    cs.CV

    Open-Set Semi-Supervised Learning for Long-Tailed Medical Datasets

    Authors: Daniya Najiha A. Kareem, Jean Lahoud, Mustansar Fiaz, Amandeep Kumar, Hisham Cholakkal

    Abstract: Many practical medical imaging scenarios include categories that are under-represented but still crucial. The relevance of image recognition models to real-world applications lies in their ability to generalize to these rare classes as well as unseen classes. Real-world generalization requires taking into account the various complexities that can be encountered in the real world. First, training d…

    Submitted 20 May, 2025; originally announced May 2025.

  6. arXiv:2504.21414  [pdf, other]

    cs.CV

    Adapting In-Domain Few-Shot Segmentation to New Domains without Retraining

    Authors: Qi Fan, Kaiqi Liu, Nian Liu, Hisham Cholakkal, Rao Muhammad Anwer, Wenbin Li, Yang Gao

    Abstract: Cross-domain few-shot segmentation (CD-FSS) aims to segment objects of novel classes in new domains, which is often challenging due to the diverse characteristics of target domains and the limited availability of support data. Most CD-FSS methods redesign and retrain in-domain FSS models using various domain-generalization techniques, which are effective but costly to train. To address these issue…

    Submitted 12 May, 2025; v1 submitted 30 April, 2025; originally announced April 2025.

  7. arXiv:2503.22678  [pdf, other]

    cs.CL

    Self-Evolving Multi-Agent Simulations for Realistic Clinical Interactions

    Authors: Mohammad Almansoori, Komal Kumar, Hisham Cholakkal

    Abstract: In this work, we introduce MedAgentSim, an open-source simulated clinical environment with doctor, patient, and measurement agents designed to evaluate and enhance LLM performance in dynamic diagnostic settings. Unlike prior approaches, our framework requires doctor agents to actively engage with patients through multi-turn conversations, requesting relevant medical examinations (e.g., temperature…

    Submitted 28 March, 2025; originally announced March 2025.

    Comments: 14 pages, 4 figures, 61 references

  8. arXiv:2503.14498  [pdf, other]

    cs.CV cs.RO

    Tracking Meets Large Multimodal Models for Driving Scenario Understanding

    Authors: Ayesha Ishaq, Jean Lahoud, Fahad Shahbaz Khan, Salman Khan, Hisham Cholakkal, Rao Muhammad Anwer

    Abstract: Large Multimodal Models (LMMs) have recently gained prominence in autonomous driving research, showcasing promising capabilities across various emerging benchmarks. LMMs specifically designed for this domain have demonstrated effective perception, planning, and prediction skills. However, many of these methods underutilize 3D spatial and temporal elements, relying mainly on image data. As a result…

    Submitted 18 March, 2025; originally announced March 2025.

    Comments: 13 pages, 8 figures, Github: https://github.com/mbzuai-oryx/TrackingMeetsLMM

  9. arXiv:2503.10621  [pdf, other]

    cs.CV cs.RO

    DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding

    Authors: Ayesha Ishaq, Jean Lahoud, Ketan More, Omkar Thawakar, Ritesh Thawkar, Dinura Dissanayake, Noor Ahsan, Yuhao Li, Fahad Shahbaz Khan, Hisham Cholakkal, Ivan Laptev, Rao Muhammad Anwer, Salman Khan

    Abstract: While large multimodal models (LMMs) have demonstrated strong performance across various Visual Question Answering (VQA) tasks, certain challenges require complex multi-step reasoning to reach accurate answers. One particularly challenging task is autonomous driving, which demands thorough cognitive processing before decisions can be made. In this domain, a sequential and interpretive understandin…

    Submitted 13 March, 2025; originally announced March 2025.

    Comments: 8 pages, 4 figures, 3 tables, github: https://github.com/ayesha-ishaq/DriveLMM-o1

  10. arXiv:2503.04724  [pdf, other]

    cs.CL

    LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM

    Authors: Sambal Shikhar, Mohammed Irfan Kurpath, Sahal Shaji Mullappilly, Jean Lahoud, Fahad Khan, Rao Muhammad Anwer, Salman Khan, Hisham Cholakkal

    Abstract: Recent advancements in speech-to-speech dialogue systems leverage LLMs for multimodal interactions, yet they remain hindered by fine-tuning requirements, high computational overhead, and text-speech misalignment. Existing speech-enabled LLMs often degrade conversational quality by modifying the LLM, thereby compromising its linguistic capabilities. In contrast, we propose LLMVoX, a lightweight 30M…

    Submitted 6 March, 2025; originally announced March 2025.

  11. arXiv:2502.21321  [pdf, other]

    cs.CL cs.CV

    LLM Post-Training: A Deep Dive into Reasoning Large Language Models

    Authors: Komal Kumar, Tajamul Ashraf, Omkar Thawakar, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, Phillip H. S. Torr, Fahad Shahbaz Khan, Salman Khan

    Abstract: Large Language Models (LLMs) have transformed the natural language processing landscape and brought to life diverse applications. Pretraining on vast web-scale data has laid the foundation for these models, yet the research community is now increasingly shifting focus toward post-training techniques to achieve further breakthroughs. While pretraining provides a broad linguistic foundation, post-tr…

    Submitted 24 March, 2025; v1 submitted 28 February, 2025; originally announced February 2025.

    Comments: 32 pages, 7 figures, 3 tables, 377 references. Github Repo: https://github.com/mbzuai-oryx/Awesome-LLM-Post-training

  12. arXiv:2502.17429  [pdf, other]

    cs.CV

    CLIMB-3D: Continual Learning for Imbalanced 3D Instance Segmentation

    Authors: Vishal Thengane, Jean Lahoud, Hisham Cholakkal, Rao Muhammad Anwer, Lu Yin, Xiatian Zhu, Salman Khan

    Abstract: While 3D instance segmentation (3DIS) has advanced significantly, existing methods typically assume that all object classes are known in advance and are uniformly distributed. However, this assumption is unrealistic in dynamic, real-world environments where new classes emerge gradually and exhibit natural imbalance. Although some approaches have addressed class emergence, they often overlook class…

    Submitted 21 May, 2025; v1 submitted 24 February, 2025; originally announced February 2025.

    Comments: Code: https://github.com/vgthengane/CLIMB3D

  13. arXiv:2502.14865  [pdf, other]

    cs.CV cs.LG

    Time Travel: A Comprehensive Benchmark to Evaluate LMMs on Historical and Cultural Artifacts

    Authors: Sara Ghaboura, Ketan More, Ritesh Thawkar, Wafa Alghallabi, Omkar Thawakar, Fahad Shahbaz Khan, Hisham Cholakkal, Salman Khan, Rao Muhammad Anwer

    Abstract: Understanding historical and cultural artifacts demands human expertise and advanced computational techniques, yet the process remains complex and time-intensive. While large multimodal models offer promising support, their evaluation and improvement require a standardized benchmark. To address this, we introduce TimeTravel, a benchmark of 10,250 expert-verified samples spanning 266 distinct cultu…

    Submitted 20 February, 2025; originally announced February 2025.

    Comments: 4 pages, 6 figures

  14. arXiv:2502.00094  [pdf, other]

    cs.CV cs.AI cs.CL cs.HC cs.LG

    AIN: The Arabic INclusive Large Multimodal Model

    Authors: Ahmed Heakl, Sara Ghaboura, Omkar Thawkar, Fahad Shahbaz Khan, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan

    Abstract: Amid the swift progress of large language models (LLMs) and their evolution into large multimodal models (LMMs), significant strides have been made in high-resource languages such as English and Chinese. While Arabic LLMs have seen notable progress, Arabic LMMs remain largely unexplored, often narrowly focusing on a few specific aspects of the language and visual understanding. To bridge this gap,…

    Submitted 4 February, 2025; v1 submitted 31 January, 2025; originally announced February 2025.

    Comments: 20 pages, 16 figures, ACL

  15. arXiv:2501.06186  [pdf, other]

    cs.CV

    LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs

    Authors: Omkar Thawakar, Dinura Dissanayake, Ketan More, Ritesh Thawkar, Ahmed Heakl, Noor Ahsan, Yuhao Li, Mohammed Zumri, Jean Lahoud, Rao Muhammad Anwer, Hisham Cholakkal, Ivan Laptev, Mubarak Shah, Fahad Shahbaz Khan, Salman Khan

    Abstract: Reasoning is a fundamental capability for solving complex multi-step problems, particularly in visual contexts where sequential step-wise understanding is essential. Existing approaches lack a comprehensive framework for evaluating visual reasoning and do not emphasize step-wise problem-solving. To this end, we propose a comprehensive framework for advancing step-by-step visual reasoning in large…

    Submitted 10 January, 2025; originally announced January 2025.

    Comments: 15 pages, 5 Figures

  16. arXiv:2412.07769  [pdf, other]

    cs.CV

    BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities

    Authors: Sahal Shaji Mullappilly, Mohammed Irfan Kurpath, Sara Pieri, Saeed Yahya Alseiari, Shanavas Cholakkal, Khaled Aldahmani, Fahad Khan, Rao Anwer, Salman Khan, Timothy Baldwin, Hisham Cholakkal

    Abstract: This paper introduces BiMediX2, a bilingual (Arabic-English) Bio-Medical EXpert Large Multimodal Model (LMM) with a unified architecture that integrates text and visual modalities, enabling advanced image understanding and medical applications. BiMediX2 leverages the Llama3.1 architecture and integrates text and visual capabilities to facilitate seamless interactions in both English and Arabic, su…

    Submitted 10 December, 2024; originally announced December 2024.

  17. arXiv:2411.19346  [pdf, other]

    cs.CV cs.CL cs.LG

    CLIP meets DINO for Tuning Zero-Shot Classifier using Unlabeled Image Collections

    Authors: Mohamed Fazli Imam, Rufael Fedaku Marew, Jameel Hassan, Mustansar Fiaz, Alham Fikri Aji, Hisham Cholakkal

    Abstract: In the era of foundation models, CLIP has emerged as a powerful tool for aligning text & visual modalities into a common embedding space. However, the alignment objective used to train CLIP often results in subpar visual features for fine-grained tasks. In contrast, SSL-pretrained models like DINO excel at extracting rich visual features due to their specialized training paradigm. Yet, these SSL m…

    Submitted 10 April, 2025; v1 submitted 28 November, 2024; originally announced November 2024.

  18. arXiv:2411.16508  [pdf, other]

    cs.CV cs.CL

    All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages

    Authors: Ashmal Vayani, Dinura Dissanayake, Hasindri Watawana, Noor Ahsan, Nevasini Sasikumar, Omkar Thawakar, Henok Biadglign Ademtew, Yahya Hmaiti, Amandeep Kumar, Kartik Kuckreja, Mykola Maslych, Wafa Al Ghallabi, Mihail Mihaylov, Chao Qin, Abdelrahman M Shaker, Mike Zhang, Mahardika Krisna Ihsani, Amiel Esplana, Monil Gokani, Shachar Mirkin, Harsh Singh, Ashay Srivastava, Endre Hamerlik, Fathinah Asma Izzati, Fadillah Adamsyah Maani , et al. (44 additional authors not shown)

    Abstract: Existing Large Multimodal Models (LMMs) generally focus on only a few regions and languages. As LMMs continue to improve, it is increasingly important to ensure they understand cultural contexts, respect local sensitivities, and support low-resource languages, all while effectively integrating corresponding visual cues. In pursuit of culturally diverse global multimodal models, our proposed All La…

    Submitted 30 April, 2025; v1 submitted 25 November, 2024; originally announced November 2024.

    Comments: A Multilingual Multimodal cultural benchmark for 100 languages

  19. arXiv:2410.15360  [pdf, other]

    eess.IV cs.CV

    Improving 3D Medical Image Segmentation at Boundary Regions using Local Self-attention and Global Volume Mixing

    Authors: Daniya Najiha Abdul Kareem, Mustansar Fiaz, Noa Novershtern, Jacob Hanna, Hisham Cholakkal

    Abstract: Volumetric medical image segmentation is a fundamental problem in medical image analysis where the objective is to accurately classify a given 3D volumetric medical image with voxel-level precision. In this work, we propose a novel hierarchical encoder-decoder-based framework that strives to explicitly capture the local and global dependencies for volumetric 3D medical image segmentation. The prop…

    Submitted 20 October, 2024; originally announced October 2024.

  20. arXiv:2410.08405  [pdf, other]

    cs.CV cs.AI

    AgroGPT: Efficient Agricultural Vision-Language Model with Expert Tuning

    Authors: Muhammad Awais, Ali Husain Salem Abdulla Alharthi, Amandeep Kumar, Hisham Cholakkal, Rao Muhammad Anwer

    Abstract: Significant progress has been made in advancing large multimodal conversational models (LMMs), capitalizing on vast repositories of image-text data available online. Despite this progress, these models often encounter substantial domain gaps, hindering their ability to engage in complex conversations across new domains. Recent efforts have aimed to mitigate this issue, albeit relying on domain-spe…

    Submitted 9 January, 2025; v1 submitted 10 October, 2024; originally announced October 2024.

    Comments: Accepted at WACV, 2025

  21. arXiv:2410.01678  [pdf, other]

    cs.CV cs.RO

    Open3DTrack: Towards Open-Vocabulary 3D Multi-Object Tracking

    Authors: Ayesha Ishaq, Mohamed El Amine Boudjoghra, Jean Lahoud, Fahad Shahbaz Khan, Salman Khan, Hisham Cholakkal, Rao Muhammad Anwer

    Abstract: 3D multi-object tracking plays a critical role in autonomous driving by enabling the real-time monitoring and prediction of multiple objects' movements. Traditional 3D tracking systems are typically constrained by predefined object categories, limiting their adaptability to novel, unseen objects in dynamic environments. To address this limitation, we introduce open-vocabulary 3D tracking, which ex…

    Submitted 27 February, 2025; v1 submitted 2 October, 2024; originally announced October 2024.

    Comments: 7 pages, 4 figures, 3 tables

  22. arXiv:2409.16261  [pdf, other]

    cs.CV

    CDChat: A Large Multimodal Model for Remote Sensing Change Description

    Authors: Mubashir Noman, Noor Ahsan, Muzammal Naseer, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Shahbaz Khan

    Abstract: Large multimodal models (LMMs) have shown encouraging performance in the natural image domain using visual instruction tuning. However, these LMMs struggle to describe the content of remote sensing (RS) images for tasks such as image or region grounding, classification, etc. Recently, GeoChat made an effort to describe the contents of RS images. Although GeoChat achieves promising performance for…

    Submitted 24 September, 2024; originally announced September 2024.

  23. arXiv:2409.01021  [pdf, other]

    cs.CV

    CONDA: Condensed Deep Association Learning for Co-Salient Object Detection

    Authors: Long Li, Nian Liu, Dingwen Zhang, Zhongyu Li, Salman Khan, Rao Anwer, Hisham Cholakkal, Junwei Han, Fahad Shahbaz Khan

    Abstract: Inter-image association modeling is crucial for co-salient object detection. Despite satisfactory performance, previous methods still fall short of sufficient inter-image association modeling, because most of them focus on image feature optimization under the guidance of heuristically calculated raw inter-image associations. They directly rely on raw associations, which are not reliable in co…

    Submitted 10 October, 2024; v1 submitted 2 September, 2024; originally announced September 2024.

    Comments: There is an error. In Sec 4.1, the number of images in some dataset is incorrect and needs to be revised

    Journal ref: ECCV2024

  24. arXiv:2406.17471  [pdf, other]

    eess.IV cs.CV

    Medical Image Segmentation Using Directional Window Attention

    Authors: Daniya Najiha Abdul Kareem, Mustansar Fiaz, Noa Novershtern, Hisham Cholakkal

    Abstract: Accurate segmentation of medical images is crucial for diagnostic purposes, including cell segmentation, tumor identification, and organ localization. Traditional convolutional neural network (CNN)-based approaches struggled to achieve precise segmentation results due to their limited receptive fields, particularly in cases involving multi-organ segmentation with varying shapes and sizes. The tran…

    Submitted 25 June, 2024; originally announced June 2024.

    Comments: 5 pages

  25. arXiv:2406.04413  [pdf, other]

    cs.CV cs.AI

    Efficient 3D-Aware Facial Image Editing via Attribute-Specific Prompt Learning

    Authors: Amandeep Kumar, Muhammad Awais, Sanath Narayan, Hisham Cholakkal, Salman Khan, Rao Muhammad Anwer

    Abstract: Drawing upon StyleGAN's expressivity and disentangled latent space, existing 2D approaches employ textual prompting to edit facial images with different attributes. In contrast, 3D-aware approaches that generate faces at different target poses require attribute-specific classifiers, learning separate model weights for each attribute, and are not scalable for novel attributes. In this work, we prop…

    Submitted 24 July, 2024; v1 submitted 6 June, 2024; originally announced June 2024.

    Comments: Accepted at ECCV, 2024. Amandeep Kumar and Muhammad Awais are joint first authors. More details are available at https://awaisrauf.github.io/3d_face_editing

  26. arXiv:2406.02548  [pdf, other]

    cs.CV

    Open-YOLO 3D: Towards Fast and Accurate Open-Vocabulary 3D Instance Segmentation

    Authors: Mohamed El Amine Boudjoghra, Angela Dai, Jean Lahoud, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Shahbaz Khan

    Abstract: Recent works on open-vocabulary 3D instance segmentation show strong promise, but at the cost of slow inference speed and high computation requirements. This high computation cost is typically due to their heavy reliance on 3D clip features, which require computationally expensive 2D foundation models like Segment Anything (SAM) and CLIP for multi-view aggregation into 3D. As a consequence, this h…

    Submitted 13 February, 2025; v1 submitted 4 June, 2024; originally announced June 2024.

    Comments: ICLR 2025 (Oral)

  27. arXiv:2405.18304  [pdf, other]

    cs.CV

    Multi-modal Generation via Cross-Modal In-Context Learning

    Authors: Amandeep Kumar, Muzammal Naseer, Sanath Narayan, Rao Muhammad Anwer, Salman Khan, Hisham Cholakkal

    Abstract: In this work, we study the problem of generating novel images from complex multimodal prompt sequences. While existing methods achieve promising results for text-to-image generation, they often struggle to capture fine-grained details from lengthy prompts and maintain contextual coherence within prompt sequences. Moreover, they often result in misaligned image generation for prompt sequences featu…

    Submitted 28 May, 2024; originally announced May 2024.

    Comments: Technical Report

  28. arXiv:2404.17565  [pdf, other]

    cs.CV

    ChangeBind: A Hybrid Change Encoder for Remote Sensing Change Detection

    Authors: Mubashir Noman, Mustansar Fiaz, Hisham Cholakkal

    Abstract: Change detection (CD) is a fundamental task in remote sensing (RS) which aims to detect the semantic changes between the same geographical regions at different time stamps. Existing convolutional neural network (CNN)-based approaches often struggle to capture long-range dependencies, whereas recent transformer-based methods are prone to the dominant global representation and may limit their capa…

    Submitted 26 April, 2024; originally announced April 2024.

    Comments: accepted at IGARSS 2024

  29. arXiv:2404.03836  [pdf, other]

    cs.CV cs.AI

    PARIS3D: Reasoning-based 3D Part Segmentation Using Large Multimodal Model

    Authors: Amrin Kareem, Jean Lahoud, Hisham Cholakkal

    Abstract: Recent advancements in 3D perception systems have significantly improved their ability to perform visual recognition tasks such as segmentation. However, these systems still heavily rely on explicit human instruction to identify target objects or categories, lacking the capability to actively reason and comprehend implicit user intentions. We introduce a novel segmentation task known as reasoning…

    Submitted 4 April, 2024; originally announced April 2024.

    Comments: 14 pages

  30. arXiv:2403.17909  [pdf, other]

    cs.CV

    ELGC-Net: Efficient Local-Global Context Aggregation for Remote Sensing Change Detection

    Authors: Mubashir Noman, Mustansar Fiaz, Hisham Cholakkal, Salman Khan, Fahad Shahbaz Khan

    Abstract: Deep learning has shown remarkable success in remote sensing change detection (CD), aiming to identify semantic change regions between co-registered satellite image pairs acquired at distinct time stamps. However, existing convolutional neural network and transformer-based frameworks often struggle to accurately segment semantic change regions. Moreover, transformer-based methods with standard se…

    Submitted 26 March, 2024; originally announced March 2024.

    Comments: accepted at IEEE TGRS

  31. arXiv:2403.05419  [pdf, other]

    cs.CV

    Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery

    Authors: Mubashir Noman, Muzammal Naseer, Hisham Cholakkal, Rao Muhammad Anwar, Salman Khan, Fahad Shahbaz Khan

    Abstract: Recent advances in unsupervised learning have demonstrated the ability of large vision models to achieve promising results on downstream tasks by pre-training on large amounts of unlabelled data. Such pre-training techniques have also been explored recently in the remote sensing domain due to the availability of large amounts of unlabelled data. Different from standard natural image datasets, remote…

    Submitted 8 March, 2024; originally announced March 2024.

    Comments: Accepted at CVPR 2024

  32. Semi-supervised Open-World Object Detection

    Authors: Sahal Shaji Mullappilly, Abhishek Singh Gehlot, Rao Muhammad Anwer, Fahad Shahbaz Khan, Hisham Cholakkal

    Abstract: The conventional open-world object detection (OWOD) problem setting first distinguishes known and unknown classes and then later incrementally learns the unknown objects when introduced with labels in the subsequent tasks. However, the current OWOD formulation heavily relies on the external human oracle for knowledge input during the incremental learning stages. Such reliance on run-time makes this fo…

    Submitted 25 February, 2024; originally announced February 2024.

    Comments: Accepted to AAAI 2024 (Main Track)

    Journal ref: Proceedings of the AAAI Conference on Artificial Intelligence 2024

  33. BiMediX: Bilingual Medical Mixture of Experts LLM

    Authors: Sara Pieri, Sahal Shaji Mullappilly, Fahad Shahbaz Khan, Rao Muhammad Anwer, Salman Khan, Timothy Baldwin, Hisham Cholakkal

    Abstract: In this paper, we introduce BiMediX, the first bilingual medical mixture of experts LLM designed for seamless interaction in both English and Arabic. Our model facilitates a wide range of medical interactions in English and Arabic, including multi-turn chats to inquire about additional details such as patient symptoms and medical history, multiple-choice question answering, and open-ended question…

    Submitted 10 December, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

    Comments: Accepted to EMNLP 2024 (Findings)

    Journal ref: Findings of the Association for Computational Linguistics: EMNLP 2024, pages 16984-17002

  34. Arabic Mini-ClimateGPT : A Climate Change and Sustainability Tailored Arabic LLM

    Authors: Sahal Shaji Mullappilly, Abdelrahman Shaker, Omkar Thawakar, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Shahbaz Khan

    Abstract: Climate change is one of the most significant challenges we face together as a society. Creating awareness and educating policy makers about the wide-ranging impact of climate change is an essential step towards a sustainable future. Recently, Large Language Models (LLMs) like ChatGPT and Bard have shown impressive conversational abilities and excel in a wide variety of NLP tasks. While these models are…

    Submitted 14 December, 2023; originally announced December 2023.

    Comments: Accepted to EMNLP 2023 (Findings)

    Journal ref: Findings of the Association for Computational Linguistics: EMNLP 2023, pages 14126-14136

  35. arXiv:2311.03356  [pdf, other]

    cs.CV cs.AI

    GLaMM: Pixel Grounding Large Multimodal Model

    Authors: Hanoona Rasheed, Muhammad Maaz, Sahal Shaji Mullappilly, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Erix Xing, Ming-Hsuan Yang, Fahad S. Khan

    Abstract: Large Multimodal Models (LMMs) extend Large Language Models to the vision domain. Initial LMMs used holistic images and text prompts to generate ungrounded textual responses. Recently, region-level LMMs have been used to generate visually grounded responses. However, they are limited to only referring to a single object category at a time, require users to specify the regions, or cannot offer dens…

    Submitted 1 June, 2024; v1 submitted 6 November, 2023; originally announced November 2023.

    Comments: CVPR 2024

  36. arXiv:2310.20706  [pdf, other]

    cs.CV

    DDAM-PS: Diligent Domain Adaptive Mixer for Person Search

    Authors: Mohammed Khaleed Almansoori, Mustansar Fiaz, Hisham Cholakkal

    Abstract: Person search (PS) is a challenging computer vision problem where the objective is to achieve joint optimization for pedestrian detection and re-identification (ReID). Although previous advancements have shown promising performance in the field under fully and weakly supervised learning fashion, there exists a major gap in investigating the domain adaptation ability of PS models. In this paper, we…

    Submitted 31 October, 2023; originally announced October 2023.

    Comments: Accepted at WACV 2024. Code is available at https://github.com/mustansarfiaz/DDAM-PS

  37. arXiv:2310.15165  [pdf, other]

    cs.CV cs.AI cs.LG

    Handling Data Heterogeneity via Architectural Design for Federated Visual Recognition

    Authors: Sara Pieri, Jose Renato Restom, Samuel Horvath, Hisham Cholakkal

    Abstract: Federated Learning (FL) is a promising research paradigm that enables the collaborative training of machine learning models among various parties without the need for sensitive information exchange. Nonetheless, retaining data in individual clients introduces fundamental challenges to achieving performance on par with centrally trained models. Our study provides an extensive review of federated le…

    Submitted 23 October, 2023; originally announced October 2023.

    Comments: to be published in NeurIPS 2023

  38. arXiv:2310.02260  [pdf, other]

    cs.CV cs.AI

    TransRadar: Adaptive-Directional Transformer for Real-Time Multi-View Radar Semantic Segmentation

    Authors: Yahia Dalbah, Jean Lahoud, Hisham Cholakkal

    Abstract: Scene understanding plays an essential role in enabling autonomous driving and maintaining high standards of performance and safety. To address this task, cameras and laser scanners (LiDARs) have been the most commonly used sensors, with radars being less popular. Despite that, radars remain low-cost, information-dense, and fast-sensing techniques that are resistant to adverse weather conditions.…

    Submitted 3 October, 2023; originally announced October 2023.

  39. arXiv:2309.16661  [pdf, other]

    cs.CV cs.AI

    SA2-Net: Scale-aware Attention Network for Microscopic Image Segmentation

    Authors: Mustansar Fiaz, Moein Heidari, Rao Muhammad Anwer, Hisham Cholakkal

    Abstract: Microscopic image segmentation is a challenging task, wherein the objective is to assign semantic labels to each pixel in a given microscopic image. While convolutional neural networks (CNNs) form the foundation of many existing frameworks, they often struggle to explicitly capture long-range dependencies. Although transformers were initially devised to address this issue using self-attention, it…

    Submitted 19 November, 2023; v1 submitted 28 September, 2023; originally announced September 2023.

    Comments: BMVC 2023 accepted as oral

  40. arXiv:2309.14338  [pdf, other]

    cs.CV

    3D Indoor Instance Segmentation in an Open-World

    Authors: Mohamed El Amine Boudjoghra, Salwa K. Al Khatib, Jean Lahoud, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Khan

    Abstract: Existing 3D instance segmentation methods typically assume that all semantic classes to be segmented would be available during training and only seen categories are segmented at inference. We argue that such a closed-world assumption is restrictive and explore for the first time 3D indoor instance segmentation in an open-world setting, where the model is allowed to distinguish a set of known class…

    Submitted 25 September, 2023; originally announced September 2023.

    Comments: Accepted at NeurIPS 2023

  41. arXiv:2309.11160  [pdf, other]

    cs.CV

    Multi-grained Temporal Prototype Learning for Few-shot Video Object Segmentation

    Authors: Nian Liu, Kepan Nan, Wangbo Zhao, Yuanwei Liu, Xiwen Yao, Salman Khan, Hisham Cholakkal, Rao Muhammad Anwer, Junwei Han, Fahad Shahbaz Khan

    Abstract: Few-Shot Video Object Segmentation (FSVOS) aims to segment objects in a query video with the same category defined by a few annotated support images. However, this task was seldom explored. In this work, based on IPMT, a state-of-the-art few-shot image segmentation method that combines external support guidance information with adaptive query guidance cues, we propose to leverage multi-grained tem…

    Submitted 20 September, 2023; originally announced September 2023.

    Comments: ICCV 2023

  42. arXiv:2307.13721  [pdf, other]

    cs.CV cs.AI

    Foundational Models Defining a New Era in Vision: A Survey and Outlook

    Authors: Muhammad Awais, Muzammal Naseer, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, Fahad Shahbaz Khan

    Abstract: Vision systems to see and reason about the compositional nature of visual scenes are fundamental to understanding our world. The complex relations between objects and their locations, ambiguities, and variations in the real-world environment can be better described in human language, naturally governed by grammatical rules and other modalities such as audio and depth. The models learned to bridge…

    Submitted 25 July, 2023; originally announced July 2023.

    Comments: Project page: https://github.com/awaisrauf/Awesome-CV-Foundational-Models

  43. arXiv:2306.07971  [pdf, other]

    cs.CV

    XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models

    Authors: Omkar Thawakar, Abdelrahman Shaker, Sahal Shaji Mullappilly, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Jorma Laaksonen, Fahad Shahbaz Khan

    Abstract: The latest breakthroughs in large vision-language models, such as Bard and GPT-4, have showcased extraordinary abilities in performing a wide range of tasks. Such models are trained on massive datasets comprising billions of public image-text pairs with diverse tasks. However, their performance on task-specific domains, such as radiology, is still under-investigated and potentially limited due to…

    Submitted 7 May, 2025; v1 submitted 13 June, 2023; originally announced June 2023.

    Comments: Accepted at ACL 2024-BIONLP Workshop. Code: https://github.com/mbzuai-oryx/XrayGPT

  44. Salient Mask-Guided Vision Transformer for Fine-Grained Classification

    Authors: Dmitry Demidov, Muhammad Hamza Sharif, Aliakbar Abdurahimov, Hisham Cholakkal, Fahad Shahbaz Khan

    Abstract: Fine-grained visual classification (FGVC) is a challenging computer vision problem, where the task is to automatically recognise objects from subordinate categories. One of its main difficulties is capturing the most discriminative inter-class variances among visually similar classes. Recently, methods with Vision Transformer (ViT) have demonstrated noticeable achievements in FGVC, generally by em…

    Submitted 11 May, 2023; originally announced May 2023.

    Comments: Accepted by VISAPP 2023 (Best Student Paper Award)

    Journal ref: VISAPP 2023

  45. arXiv:2305.00514  [pdf, other]

    cs.CV

    Discriminative Co-Saliency and Background Mining Transformer for Co-Salient Object Detection

    Authors: Long Li, Junwei Han, Ni Zhang, Nian Liu, Salman Khan, Hisham Cholakkal, Rao Muhammad Anwer, Fahad Shahbaz Khan

    Abstract: Most previous co-salient object detection works mainly focus on extracting co-salient cues via mining the consistency relations across images while ignoring explicit exploration of background regions. In this paper, we propose a Discriminative co-saliency and background Mining Transformer framework (DMT) based on several economical multi-grained correlation modules to explicitly mine both co-salie…

    Submitted 5 May, 2023; v1 submitted 30 April, 2023; originally announced May 2023.

    Comments: Accepted by CVPR 2023

  46. arXiv:2304.08447  [pdf, other]

    cs.CV cs.AI

    RadarFormer: Lightweight and Accurate Real-Time Radar Object Detection Model

    Authors: Yahia Dalbah, Jean Lahoud, Hisham Cholakkal

    Abstract: The performance of perception systems developed for autonomous driving vehicles has seen significant improvements over the last few years. This improvement was associated with the increasing use of LiDAR sensors and point cloud data to facilitate the task of object detection and recognition in autonomous driving. However, LiDAR and camera systems show deteriorating performances when used in unfavo…

    Submitted 17 April, 2023; originally announced April 2023.

    Comments: 18 pages (with references), 8 figures; accepted to SCIA 2023

  47. arXiv:2304.06710  [pdf, other]

    cs.CV

    Remote Sensing Change Detection With Transformers Trained from Scratch

    Authors: Mubashir Noman, Mustansar Fiaz, Hisham Cholakkal, Sanath Narayan, Rao Muhammad Anwer, Salman Khan, Fahad Shahbaz Khan

    Abstract: Current transformer-based change detection (CD) approaches either employ a pre-trained model trained on large-scale image classification ImageNet dataset or rely on first pre-training on another CD dataset and then fine-tuning on the target benchmark. This current strategy is driven by the fact that transformers typically require a large amount of training data to learn inductive biases, which is…

    Submitted 13 April, 2023; originally announced April 2023.

    Comments: 5 figures and 4 tables

  48. arXiv:2304.01992  [pdf, other]

    eess.IV cs.CV

    Cross-modulated Few-shot Image Generation for Colorectal Tissue Classification

    Authors: Amandeep Kumar, Ankan kumar Bhunia, Sanath Narayan, Hisham Cholakkal, Rao Muhammad Anwer, Jorma Laaksonen, Fahad Shahbaz Khan

    Abstract: In this work, we propose a few-shot colorectal tissue image generation method for addressing the scarcity of histopathological training data for rare cancer tissues. Our few-shot generation method, named XM-GAN, takes one base and a pair of reference tissue images as input and generates high-quality yet diverse images. Within our XM-GAN, a novel controllable fusion block densely aggregates local r…

    Submitted 4 July, 2023; v1 submitted 4 April, 2023; originally announced April 2023.

    Comments: Early Accept in MICCAI 2023

  49. arXiv:2304.01200  [pdf, other]

    cs.CV

    Video Instance Segmentation in an Open-World

    Authors: Omkar Thawakar, Sanath Narayan, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Jorma Laaksonen, Mubarak Shah, Fahad Shahbaz Khan

    Abstract: Existing video instance segmentation (VIS) approaches generally follow a closed-world assumption, where only seen category instances are identified and spatio-temporally segmented at inference. Open-world formulation relaxes the closed-world static-learning assumption as follows: (a) first, it distinguishes a set of known categories as well as labels an unknown object as 'unknown' and then (b) it i…

    Submitted 3 April, 2023; originally announced April 2023.

    Comments: 9 pages, 5 figures

  50. arXiv:2304.01172  [pdf, other]

    cs.CV

    Generative Multiplane Neural Radiance for 3D-Aware Image Generation

    Authors: Amandeep Kumar, Ankan Kumar Bhunia, Sanath Narayan, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan

    Abstract: We present a method to efficiently generate 3D-aware high-resolution images that are view-consistent across multiple target views. The proposed multiplane neural radiance model, named GMNR, consists of a novel α-guided view-dependent representation (α-VdR) module for learning view-dependent information. The α-VdR module, facilitated by an α-guided pixel sampling technique, computes the view-depende…

    Submitted 3 April, 2023; originally announced April 2023.

    Comments: Technical report