Showing 1–10 of 10 results for author: Khattak, M U

  1. arXiv:2412.10372  [pdf, other]

    cs.CV

    UniMed-CLIP: Towards a Unified Image-Text Pretraining Paradigm for Diverse Medical Imaging Modalities

    Authors: Muhammad Uzair Khattak, Shahina Kunhimon, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan

    Abstract: Vision-Language Models (VLMs) trained via contrastive learning have achieved notable success in natural image tasks. However, their application in the medical domain remains limited due to the scarcity of openly accessible, large-scale medical image-text datasets. Existing medical VLMs either train on closed-source proprietary or relatively small open-source datasets that do not generalize well. S…

    Submitted 13 December, 2024; originally announced December 2024.

    Comments: Code, models and demo available at https://github.com/mbzuai-oryx/UniMed-CLIP

  2. arXiv:2408.11493  [pdf, other]

    cs.CV

    XDT-CXR: Investigating Cross-Disease Transferability in Zero-Shot Binary Classification of Chest X-Rays

    Authors: Umaima Rahman, Abhishek Basu, Muhammad Uzair Khattak, Aniq Ur Rahman

    Abstract: This study explores the concept of cross-disease transferability (XDT) in medical imaging, focusing on the potential of binary classifiers trained on one disease to perform zero-shot classification on another disease affecting the same organ. Utilizing chest X-rays (CXR) as the primary modality, we investigate whether a model trained on one pulmonary disease can make predictions about another nove…

    Submitted 21 August, 2024; originally announced August 2024.

    Comments: Accepted in Machine Learning for Healthcare Conference MLHC 2024

  3. arXiv:2405.03690  [pdf, other]

    cs.CV

    How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs

    Authors: Muhammad Uzair Khattak, Muhammad Ferjad Naeem, Jameel Hassan, Muzammal Naseer, Federico Tombari, Fahad Shahbaz Khan, Salman Khan

    Abstract: Recent advancements in Large Language Models (LLMs) have led to the development of Video Large Multi-modal Models (Video-LMMs) that can handle a wide range of video understanding tasks. These models have the potential to be deployed in real-world applications such as robotics, AI assistants, medical surgery, and autonomous vehicles. The widespread adoption of Video-LMMs in our daily lives undersco…

    Submitted 8 May, 2024; v1 submitted 6 May, 2024; originally announced May 2024.

    Comments: Technical report

  4. arXiv:2401.02418  [pdf, other]

    cs.CV

    Learning to Prompt with Text Only Supervision for Vision-Language Models

    Authors: Muhammad Uzair Khattak, Muhammad Ferjad Naeem, Muzammal Naseer, Luc Van Gool, Federico Tombari

    Abstract: Foundational vision-language models such as CLIP are becoming a new paradigm in vision, due to their excellent generalization abilities. However, adapting these models for downstream tasks while maintaining their generalization remains a challenge. In literature, one branch of methods adapts CLIP by learning prompts using visual information. While effective, most of these works require labeled dat…

    Submitted 4 January, 2024; originally announced January 2024.

    Comments: Project Page: https://muzairkhattak.github.io/ProText/

  5. arXiv:2311.01459  [pdf, other]

    cs.CV

    Align Your Prompts: Test-Time Prompting with Distribution Alignment for Zero-Shot Generalization

    Authors: Jameel Hassan, Hanan Gani, Noor Hussein, Muhammad Uzair Khattak, Muzammal Naseer, Fahad Shahbaz Khan, Salman Khan

    Abstract: The promising zero-shot generalization of vision-language models such as CLIP has led to their adoption using prompt learning for numerous downstream tasks. Previous works have shown test-time prompt tuning using entropy minimization to adapt text prompts for unseen domains. While effective, this overlooks the key cause for performance degradation to unseen domains -- distribution shift. In this w…

    Submitted 10 January, 2024; v1 submitted 2 November, 2023; originally announced November 2023.

    Comments: Accepted to NeurIPS 2023

  6. arXiv:2307.06948  [pdf, other]

    cs.CV

    Self-regulating Prompts: Foundational Model Adaptation without Forgetting

    Authors: Muhammad Uzair Khattak, Syed Talal Wasim, Muzammal Naseer, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan

    Abstract: Prompt learning has emerged as an efficient alternative for fine-tuning foundational models, such as CLIP, for various downstream tasks. Conventionally trained using the task-specific objective, i.e., cross-entropy loss, prompts tend to overfit downstream data distributions and find it challenging to capture task-agnostic general features from the frozen CLIP. This leads to the loss of the model's…

    Submitted 24 August, 2023; v1 submitted 13 July, 2023; originally announced July 2023.

    Comments: Accepted to ICCV-2023. Camera-Ready version. Project page: https://muzairkhattak.github.io/PromptSRC/

  7. arXiv:2307.06947  [pdf, other]

    cs.CV cs.AI

    Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition

    Authors: Syed Talal Wasim, Muhammad Uzair Khattak, Muzammal Naseer, Salman Khan, Mubarak Shah, Fahad Shahbaz Khan

    Abstract: Recent video recognition models utilize Transformer models for long-range spatio-temporal context modeling. Video transformer designs are based on self-attention that can model global context at a high computational cost. In comparison, convolutional designs for videos offer an efficient alternative but lack long-range dependency modeling. Towards achieving the best of both designs, this work prop…

    Submitted 27 October, 2023; v1 submitted 13 July, 2023; originally announced July 2023.

    Comments: Accepted to ICCV-2023. Camera-Ready version. Project page: https://TalalWasim.github.io/Video-FocalNets/

  8. arXiv:2212.03640  [pdf, other]

    cs.CV cs.AI

    Fine-tuned CLIP Models are Efficient Video Learners

    Authors: Hanoona Rasheed, Muhammad Uzair Khattak, Muhammad Maaz, Salman Khan, Fahad Shahbaz Khan

    Abstract: Large-scale multi-modal training with image-text pairs imparts strong generalization to CLIP model. Since training on a similar scale for videos is infeasible, recent approaches focus on the effective transfer of image-based CLIP to the video domain. In this pursuit, new parametric modules are added to learn temporal information and inter-frame relationships which require meticulous design efforts…

    Submitted 26 March, 2023; v1 submitted 6 December, 2022; originally announced December 2022.

    Comments: Accepted at CVPR 2023

  9. arXiv:2210.03117  [pdf, other]

    cs.CV

    MaPLe: Multi-modal Prompt Learning

    Authors: Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, Fahad Shahbaz Khan

    Abstract: Pre-trained vision-language (V-L) models such as CLIP have shown excellent generalization ability to downstream tasks. However, they are sensitive to the choice of input text prompts and require careful selection of prompt templates to perform well. Inspired by the Natural Language Processing (NLP) literature, recent CLIP adaptation approaches learn prompts as the textual inputs to fine-tune CLIP…

    Submitted 1 April, 2023; v1 submitted 6 October, 2022; originally announced October 2022.

    Comments: Accepted at CVPR 2023

  10. arXiv:2207.03482  [pdf, other]

    cs.CV cs.AI

    Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection

    Authors: Hanoona Rasheed, Muhammad Maaz, Muhammad Uzair Khattak, Salman Khan, Fahad Shahbaz Khan

    Abstract: Existing open-vocabulary object detectors typically enlarge their vocabulary sizes by leveraging different forms of weak supervision. This helps generalize to novel objects at inference. Two popular forms of weak-supervision used in open-vocabulary detection (OVD) include pretrained CLIP model and image-level supervision. We note that both these modes of supervision are not optimally aligned for t…

    Submitted 29 November, 2022; v1 submitted 7 July, 2022; originally announced July 2022.

    Comments: Accepted at NeurIPS 2022