Visual Modality Prompt for Adapting Vision-Language Object Detectors

Medeiros, Heitor R.; Belal, Atif; Muralidharan, Srikanth; Granger, Eric; Pedersoli, Marco

arXiv:2412.00622v2 (cs)

[Submitted on 1 Dec 2024 (v1), last revised 14 Mar 2025 (this version, v2)]

Abstract: The zero-shot performance of object detectors degrades when tested on different modalities, such as infrared and depth. While recent work has explored image translation techniques to adapt detectors to new modalities, these methods are limited to a single modality and apply only to traditional detectors. Recently, vision-language detectors, such as YOLO-World and Grounding DINO, have shown promising zero-shot capabilities; however, they have not yet been adapted for other visual modalities. Traditional fine-tuning approaches compromise the zero-shot capabilities of the detectors. The visual prompt strategies commonly used for classification with vision-language models apply the same linear prompt translation to each image, making them less effective. To address these limitations, we propose ModPrompt, a visual prompt strategy to adapt vision-language detectors to new modalities without degrading zero-shot performance. In particular, an encoder-decoder visual prompt strategy is proposed, further enhanced by the integration of an inference-friendly modality prompt decoupled residual, facilitating a more robust adaptation. We benchmark our method for modality adaptation on two vision-language detectors, YOLO-World and Grounding DINO, and on challenging infrared (LLVIP, FLIR) and depth (NYUv2) datasets, achieving performance comparable to full fine-tuning while preserving the model's zero-shot capability. Code available at: this https URL.
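
The contrast the abstract draws between a prompt that is the same for every image and an encoder-decoder prompt that depends on the input can be illustrated with a toy sketch. This is not the ModPrompt implementation: the function names, the 2x2 average-pooling "encoder", the upsampling "decoder", and the `scale` and `delta` values are all hypothetical stand-ins for components that would be learned in practice. The detector itself does not appear in the sketch; only the image passed to it would change.

```python
# Toy illustration only: NOT the ModPrompt implementation.
# Images are 2D lists of floats; every name below is hypothetical.

def fixed_prompt(image, delta):
    """Input-independent prompt: add the same offset map to every image."""
    return [[px + d for px, d in zip(row, drow)]
            for row, drow in zip(image, delta)]

def encode(image):
    """Stand-in encoder: 2x2 average pooling (assumes even height and width)."""
    h, w = len(image), len(image[0])
    return [
        [(image[i][j] + image[i][j + 1]
          + image[i + 1][j] + image[i + 1][j + 1]) / 4.0
         for j in range(0, w, 2)]
        for i in range(0, h, 2)
    ]

def decode(code, scale):
    """Stand-in decoder: nearest-neighbour 2x upsampling times a scale factor."""
    out = []
    for row in code:
        up = [scale * v for v in row for _ in range(2)]
        out.append(up)
        out.append(list(up))
    return out

def conditioned_prompt(image, scale):
    """Input-conditioned prompt: the offset is computed from the image itself."""
    offset = decode(encode(image), scale)
    return [[px + o for px, o in zip(row, orow)]
            for row, orow in zip(image, offset)]

dark = [[0.0, 0.0], [0.0, 0.0]]
bright = [[1.0, 1.0], [1.0, 1.0]]
delta = [[0.1, 0.1], [0.1, 0.1]]

# The fixed prompt shifts both images by the same amount ...
fixed_dark = fixed_prompt(dark, delta)
fixed_bright = fixed_prompt(bright, delta)
# ... while the conditioned prompt produces a different offset per image.
cond_dark = conditioned_prompt(dark, 0.5)
cond_bright = conditioned_prompt(bright, 0.5)
```

In this sketch the fixed prompt adds 0.1 to every pixel regardless of content, whereas the conditioned prompt leaves the all-zero image unchanged and adds 0.5 to the all-one image, which is the kind of per-image behaviour a fixed additive prompt cannot express.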

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2412.00622 [cs.CV]
	(or arXiv:2412.00622v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2412.00622

Submission history

From: Heitor Medeiros Mr.
[v1] Sun, 1 Dec 2024 00:19:59 UTC (40,413 KB)
[v2] Fri, 14 Mar 2025 20:32:12 UTC (40,776 KB)
