SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment

Renz, Katrin; Chen, Long; Arani, Elahe; Sinavski, Oleg

arXiv:2503.09594 (cs)

[Submitted on 12 Mar 2025]

Abstract: Integrating large language models (LLMs) into autonomous driving has attracted significant attention, with the hope of improving generalization and explainability. However, existing methods often focus on either driving or vision-language understanding, and achieving both high driving performance and extensive language understanding remains challenging. In addition, the dominant approach to vision-language understanding is visual question answering. For autonomous driving, however, this is only useful if it is aligned with the action space; otherwise, the model's answers could be inconsistent with its behavior. Therefore, we propose a model that can handle three different tasks: (1) closed-loop driving, (2) vision-language understanding, and (3) language-action alignment. Our model, SimLingo, is based on a vision-language model (VLM) and uses only camera input, excluding expensive sensors such as LiDAR. SimLingo obtains state-of-the-art performance in the widely used CARLA simulator on the Bench2Drive benchmark and is the winning entry at the CARLA Challenge 2024. Additionally, we achieve strong results in a wide variety of language-related tasks while maintaining high driving performance.

Comments:	CVPR 2025. 1st Place @ CARLA Challenge 2024. Challenge tech report (preliminary version of SimLingo): arXiv:2406.10165
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Cite as:	arXiv:2503.09594 [cs.CV]
	(or arXiv:2503.09594v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2503.09594

Submission history

From: Katrin Renz
[v1] Wed, 12 Mar 2025 17:58:06 UTC (32,638 KB)
