LaViDa: A Large Diffusion Language Model for Multimodal Understanding

Li, Shufan; Kallidromitis, Konstantinos; Bansal, Hritik; Gokul, Akash; Kato, Yusuke; Kozuka, Kazuki; Kuen, Jason; Lin, Zhe; Chang, Kai-Wei; Grover, Aditya

arXiv:2505.16839 (cs)

[Submitted on 22 May 2025 (v1), last revised 23 May 2025 (this version, v2)]

Abstract: Modern Vision-Language Models (VLMs) can solve a wide range of tasks requiring visual reasoning. In real-world scenarios, desirable properties for VLMs include fast inference and controllable generation (e.g., constraining outputs to adhere to a desired format). However, existing autoregressive (AR) VLMs like LLaVA struggle in these aspects. Discrete diffusion models (DMs) offer a promising alternative, enabling parallel decoding for faster inference and bidirectional context for controllable generation through text-infilling. While effective in language-only settings, DMs' potential for multimodal tasks is underexplored. We introduce LaViDa, a family of VLMs built on DMs. We build LaViDa by equipping DMs with a vision encoder and jointly fine-tuning the combined parts for multimodal instruction following. To address the challenges encountered, LaViDa incorporates novel techniques such as complementary masking for effective training, prefix KV cache for efficient inference, and timestep shifting for high-quality sampling. Experiments show that LaViDa achieves competitive or superior performance to AR VLMs on multimodal benchmarks such as MMMU, while offering unique advantages of DMs, including a flexible speed-quality tradeoff, controllability, and bidirectional reasoning. On COCO captioning, LaViDa surpasses Open-LLaVa-Next-8B by +4.1 CIDEr with a 1.92x speedup. On bidirectional tasks, it achieves a +59% improvement on Constrained Poem Completion. These results demonstrate LaViDa as a strong alternative to AR VLMs. Code and models will be released in the camera-ready version.
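
The abstract names complementary masking but does not define it. As a rough illustration only, the sketch below assumes one plausible reading: each training sequence is masked twice, with two disjoint masks that together cover every token, so no token is left out of the training signal. The `MASK` token, the function name `complementary_masks`, and the `mask_ratio` parameter are hypothetical and are not taken from the paper or its code.

```python
import random

MASK = "[MASK]"  # hypothetical mask token, chosen for illustration


def complementary_masks(tokens, mask_ratio, rng):
    """Return two masked copies of `tokens`.

    Assumption (not from the abstract): the two copies mask disjoint
    positions, and every position is masked in exactly one copy.
    """
    n = len(tokens)
    k = round(n * mask_ratio)        # number of positions masked in the first copy
    positions = list(range(n))
    rng.shuffle(positions)           # random split of positions into two groups
    masked_a = set(positions[:k])    # masked in the first copy
    masked_b = set(positions[k:])    # the complement, masked in the second copy
    view_a = [MASK if i in masked_a else t for i, t in enumerate(tokens)]
    view_b = [MASK if i in masked_b else t for i, t in enumerate(tokens)]
    return view_a, view_b


tokens = "a cat sits on the mat".split()
view_a, view_b = complementary_masks(tokens, 0.5, random.Random(0))
print(view_a)
print(view_b)
```

Under this assumption, a token hidden in one copy is always visible in the other, so a loss computed over masked positions in both copies touches each token exactly once.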

Comments:	25 pages, 8 figures
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2505.16839 [cs.CV]
	(or arXiv:2505.16839v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2505.16839

Submission history

From: Shufan Li
[v1] Thu, 22 May 2025 16:07:12 UTC (1,863 KB)
[v2] Fri, 23 May 2025 07:07:29 UTC (1,865 KB)
