MonoDETR: Depth-aware Transformer for Monocular 3D Object Detection

Zhang, Renrui; Qiu, Han; Wang, Tai; Xu, Xuanzhuo; Guo, Ziyu; Qiao, Yu; Gao, Peng; Li, Hongsheng

arXiv:2203.13310v1 (cs)

[Submitted on 24 Mar 2022 (this version), latest version 13 Feb 2025 (v5)]

Abstract: Monocular 3D object detection has long been a challenging task in autonomous driving, since it requires decoding 3D predictions solely from a single 2D image. Most existing methods follow conventional 2D object detectors: they first localize objects by their centers and then predict 3D attributes from local features neighboring those centers. However, such a center-based pipeline treats 3D prediction as a subordinate task and lacks inter-object depth interactions and global spatial clues. In this paper, we introduce a simple framework for Monocular DEtection with a depth-aware TRansformer, named MonoDETR. We make the vanilla transformer depth-aware and enforce that the whole detection process is guided by depth. Specifically, we represent 3D object candidates as a set of queries and produce non-local depth embeddings of the input image with a lightweight depth predictor and an attention-based depth encoder. We then propose a depth-aware decoder that conducts both inter-query and query-scene depth feature communication. In this way, each object estimates its 3D attributes adaptively from the depth-informative regions of the image rather than being limited to features around its center. With minimal handcrafted designs, MonoDETR is an end-to-end framework that requires no additional data, anchors or NMS, and it achieves competitive performance on the KITTI benchmark among state-of-the-art center-based networks. Extensive ablation studies demonstrate the effectiveness of our approach and its potential to serve as a transformer baseline for future monocular research. Code is available at this https URL.
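
The two kinds of communication named in the abstract, query-scene and inter-query, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes plain single-head scaled dot-product attention with residual connections, leaves out learned projections, normalization and feed-forward layers, and picks an ordering (query-scene first, then inter-query) that the abstract does not specify. The names `attend`, `add` and `depth_aware_decoder_layer` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Single-head scaled dot-product attention on lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output vector per query.
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


def add(xs, ys):
    """Residual connection: element-wise sum of two lists of vectors."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def depth_aware_decoder_layer(object_queries, depth_embeddings):
    """Hypothetical layer (assumed ordering): query-scene, then inter-query."""
    # Query-scene: each object query gathers depth cues from every image position.
    queries = add(
        object_queries,
        attend(object_queries, depth_embeddings, depth_embeddings),
    )
    # Inter-query: object queries exchange information with one another.
    return add(queries, attend(queries, queries, queries))


# Toy input: 2 object queries and 3 flattened depth embeddings, all 2-dimensional.
object_queries = [[1.0, 0.0], [0.0, 1.0]]
depth_embeddings = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
updated = depth_aware_decoder_layer(object_queries, depth_embeddings)
```

Because every query attends over all depth embeddings, its update is not restricted to features near a predicted object center, which is the contrast with center-based pipelines that the abstract draws.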

Comments:	10 pages, 5 figures, submitted to CVPR 2022
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Cite as:	arXiv:2203.13310 [cs.CV]
	(or arXiv:2203.13310v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2203.13310

Submission history

From: Renrui Zhang
[v1] Thu, 24 Mar 2022 19:28:54 UTC (3,645 KB)
[v2] Mon, 28 Mar 2022 07:00:29 UTC (3,644 KB)
[v3] Sat, 28 May 2022 10:21:04 UTC (2,115 KB)
[v4] Thu, 24 Aug 2023 04:18:17 UTC (2,734 KB)
[v5] Thu, 13 Feb 2025 08:33:30 UTC (2,741 KB)
