Denoising Vision Transformers

Yang, Jiawei; Luo, Katie Z; Li, Jiefeng; Deng, Congyue; Guibas, Leonidas; Krishnan, Dilip; Weinberger, Kilian Q; Tian, Yonglong; Wang, Yue

arXiv:2401.02957 (cs)

[Submitted on 5 Jan 2024 (v1), last revised 22 Jul 2024 (this version, v2)]

Abstract: We study a crucial yet often overlooked issue inherent to Vision Transformers (ViTs): feature maps of these models exhibit grid-like artifacts, which hurt the performance of ViTs in downstream dense prediction tasks such as semantic segmentation, depth prediction, and object discovery. We trace this issue down to the positional embeddings at the input stage. To mitigate this, we propose a two-stage denoising approach, termed Denoising Vision Transformers (DVT). In the first stage, we separate the clean features from those contaminated by positional artifacts by enforcing cross-view feature consistency with neural fields on a per-image basis. This per-image optimization process extracts artifact-free features from raw ViT outputs, providing clean feature estimates for offline applications. In the second stage, we train a lightweight transformer block to predict clean features from raw ViT outputs, leveraging the derived estimates of the clean features as supervision. Our method, DVT, does not require re-training the existing pre-trained ViTs, and is immediately applicable to any Vision Transformer architecture. We evaluate our method on a variety of representative ViTs (DINO, DeiT-III, EVA02, CLIP, DINOv2, DINOv2-reg) and demonstrate that DVT consistently improves existing state-of-the-art general-purpose models in semantic and geometric tasks across multiple datasets. We hope our study will encourage a re-evaluation of ViT design, especially regarding the naive use of positional embeddings. Our code and checkpoints are publicly available.
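
The first stage described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes the raw feature is a simple sum of a content term tied to image coordinates and an artifact term tied to grid position, it uses 1-D scalar features, and it replaces the neural fields with plain lookup tables. All names (GRID, OFFSETS, raw_features, decompose) are hypothetical. The point is only to show why several shifted views of one image are enough to separate the two terms.

```python
# Toy illustration only: 1-D scalar "features" and lookup tables stand in for
# ViT feature maps and neural fields. The additive artifact model is an assumption.

GRID = 4                                            # patches per view (hypothetical)
OFFSETS = range(9)                                  # one shifted crop per offset (hypothetical)
TRUE_CLEAN = [0.1 * x * x - x for x in range(12)]   # content, tied to image coordinates
TRUE_ARTIFACT = [0.5, -0.5, 0.3, -0.3]              # tied to grid position, zero-mean


def raw_features(offset):
    """Simulated raw output for one crop: content plus a position-locked artifact."""
    return [TRUE_CLEAN[offset + p] + TRUE_ARTIFACT[p] for p in range(GRID)]


def decompose(views, n_coords, n_iters=100):
    """Alternating least squares for raw[v][p] ~ clean[offset_v + p] + artifact[p]."""
    clean = [0.0] * n_coords
    artifact = [0.0] * GRID
    for _ in range(n_iters):
        # Cross-view consistency: every view covering x must agree on clean[x].
        sums, counts = [0.0] * n_coords, [0] * n_coords
        for offset, raw in views:
            for p in range(GRID):
                sums[offset + p] += raw[p] - artifact[p]
                counts[offset + p] += 1
        clean = [s / c if c else 0.0 for s, c in zip(sums, counts)]
        # The artifact is whatever stays fixed at a grid position across views.
        artifact = [
            sum(raw[p] - clean[offset + p] for offset, raw in views) / len(views)
            for p in range(GRID)
        ]
        # Remove the constant-offset ambiguity between the two terms.
        mean = sum(artifact) / GRID
        artifact = [a - mean for a in artifact]
    return clean, artifact


views = [(o, raw_features(o)) for o in OFFSETS]
clean_est, artifact_est = decompose(views, n_coords=len(TRUE_CLEAN))
artifact_error = max(abs(a - b) for a, b in zip(artifact_est, TRUE_ARTIFACT))
clean_error = max(abs(a - b) for a, b in zip(clean_est, TRUE_CLEAN))
```

In this toy setting both errors shrink toward zero because the content moves with the crop while the artifact does not. In the second stage, estimates like clean_est would serve as regression targets for a small network that maps raw features to clean ones in a single forward pass.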

Comments:	Accepted to ECCV2024. Project website: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2401.02957 [cs.CV]
	(or arXiv:2401.02957v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2401.02957

Submission history

From: Jiawei Yang
[v1] Fri, 5 Jan 2024 18:59:52 UTC (27,868 KB)
[v2] Mon, 22 Jul 2024 09:07:27 UTC (19,730 KB)
