Fully Attentional Networks with Self-emerging Token Labeling

Zhao, Bingyin; Yu, Zhiding; Lan, Shiyi; Cheng, Yutao; Anandkumar, Anima; Lao, Yingjie; Alvarez, Jose M.

arXiv:2401.03844 (cs)

[Submitted on 8 Jan 2024]

Abstract: Recent studies indicate that Vision Transformers (ViTs) are robust against out-of-distribution scenarios. In particular, the Fully Attentional Network (FAN), a family of ViT backbones, has achieved state-of-the-art robustness. In this paper, we revisit the FAN models and improve their pre-training with a self-emerging token labeling (STL) framework. Our method uses a two-stage training framework. Specifically, we first train a FAN token labeler (FAN-TL) to generate semantically meaningful patch token labels, and then train a FAN student model using both the token labels and the original class label. With the proposed STL framework, our best model, based on FAN-L-Hybrid (77.3M parameters), achieves 84.8% Top-1 accuracy on ImageNet-1K and 42.1% mCE on ImageNet-C, and sets a new state of the art on ImageNet-A (46.1%) and ImageNet-R (56.6%) without using extra data, outperforming the original FAN counterpart by significant margins. The proposed framework also demonstrates significantly enhanced performance on downstream tasks such as semantic segmentation, with up to 1.7% improvement in robustness over the counterpart model. Code is available at this https URL.
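
The two-stage idea in the abstract can be illustrated with a small, dependency-free Python sketch. It is not the authors' implementation: the abstract does not say whether token labels are hard or soft, nor how the token-level and class-level losses are combined. The sketch assumes hard labels taken as the argmax of the labeler's per-patch logits and a simple weighted sum of two cross-entropy terms. The function names and the weight `alpha` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Cross-entropy of softmax(logits) against a hard integer label."""
    return -math.log(softmax(logits)[label])


def label_tokens(labeler_token_logits):
    """Stage 1 (assumed form): the trained token labeler assigns each
    patch token the class with the highest logit."""
    return [max(range(len(row)), key=row.__getitem__)
            for row in labeler_token_logits]


def stl_student_loss(class_logits, class_label,
                     token_logits, token_labels, alpha=1.0):
    """Stage 2 (assumed form): the student is supervised by the image-level
    class label and by the per-patch token labels.
    `alpha` is a hypothetical weight on the token-level term."""
    cls_loss = cross_entropy(class_logits, class_label)
    tok_loss = sum(cross_entropy(row, y)
                   for row, y in zip(token_logits, token_labels))
    tok_loss /= len(token_logits)
    return cls_loss + alpha * tok_loss


# Toy example: 3 classes, 2 patch tokens.
token_labels = label_tokens([[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]])
loss = stl_student_loss([0.0, 0.0, 0.0], 0,
                        [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
                        token_labels)
```

In a real training loop the logits would come from the FAN-TL and student networks. Here they are plain lists so the combination of the two supervision signals is easy to follow.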

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2401.03844 [cs.CV]
	(or arXiv:2401.03844v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2401.03844
Journal reference:	Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 5585-5595

Submission history

From: Bingyin Zhao
[v1] Mon, 8 Jan 2024 12:14:15 UTC (617 KB)
