SAG-ViT: A Scale-Aware, High-Fidelity Patching Approach with Graph Attention for Vision Transformers

Venkatraman, Shravan; Walia, Jaskaran Singh; R, Joe Dhanith P


arXiv:2411.09420 (cs)

[Submitted on 14 Nov 2024 (v1), last revised 8 Jan 2025 (this version, v3)]


Abstract: Vision Transformers (ViTs) have redefined image classification by leveraging self-attention to capture complex patterns and long-range dependencies between image patches. However, a key challenge for ViTs is efficiently incorporating multi-scale feature representations, which are inherent in convolutional neural networks (CNNs) through their hierarchical structure. Graph transformers have made strides in addressing this by leveraging graph-based modeling, but they often lose or insufficiently represent spatial hierarchies, especially since redundant or less relevant areas dilute the image's contextual representation. To bridge this gap, we propose SAG-ViT, a Scale-Aware Graph Attention ViT that integrates the multi-scale feature capabilities of CNNs, the representational power of ViTs, and graph-attended patching to enable richer contextual representation. Using EfficientNetV2 as a backbone, the model extracts multi-scale feature maps and divides them into patches, which preserves richer semantic information than patching the input images directly. The patches are structured into a graph using spatial and feature similarities, and a Graph Attention Network (GAT) refines the node embeddings. This refined graph representation is then processed by a Transformer encoder, capturing long-range dependencies and complex interactions. We evaluate SAG-ViT on benchmark datasets across various domains, validating its effectiveness in advancing image classification tasks. Our code and weights are available at this https URL.
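
The patch-to-graph step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract says only that patches are connected "using spatial and feature similarities", so the 8-neighbour connectivity, the cosine edge weights, and the helper names `patchify`, `cosine`, and `build_graph` are all assumptions made for illustration. The toy feature map stands in for a CNN backbone output, and the GAT and Transformer stages are omitted.

```python
import math


def patchify(fmap, k):
    """Split an H x W x C feature map (nested lists) into non-overlapping k x k patches.

    Returns (rows, cols, patches); each patch is a flat list of k*k*C values,
    stored in row-major order over the patch grid.
    """
    h, w = len(fmap), len(fmap[0])
    rows, cols = h // k, w // k  # any remainder at the border is dropped
    patches = []
    for r in range(rows):
        for c in range(cols):
            vec = []
            for i in range(r * k, (r + 1) * k):
                for j in range(c * k, (c + 1) * k):
                    vec.extend(fmap[i][j])
            patches.append(vec)
    return rows, cols, patches


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def build_graph(rows, cols, patches):
    """Assumed rule: link each patch to its 8 spatial neighbours,
    weighting every undirected edge by feature cosine similarity."""
    edges = {}
    for r in range(rows):
        for c in range(cols):
            u = r * cols + c
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if (dr, dc) != (0, 0) and 0 <= rr < rows and 0 <= cc < cols:
                        v = rr * cols + cc
                        if u < v:  # store each undirected edge once
                            edges[(u, v)] = cosine(patches[u], patches[v])
    return edges


# Toy 4 x 4 feature map with 2 channels, standing in for a backbone output.
fmap = [[[float(i + j), 1.0] for j in range(4)] for i in range(4)]
rows, cols, patches = patchify(fmap, 2)  # 2 x 2 grid of patches
edges = build_graph(rows, cols, patches)
```

In this sketch each node is a patch of the feature map rather than of the raw image, which is the distinction the abstract draws. A graph attention layer would then update each node embedding from its weighted neighbours before the sequence is passed to a Transformer encoder.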

Comments:	14 pages, 8 figures, 9 tables
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
MSC classes:	68T07
ACM classes:	I.2.10
Cite as:	arXiv:2411.09420 [cs.CV]
	(or arXiv:2411.09420v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2411.09420

Submission history

From: Joe Dhanith P R
[v1] Thu, 14 Nov 2024 13:15:27 UTC (2,915 KB)
[v2] Tue, 10 Dec 2024 03:42:23 UTC (2,915 KB)
[v3] Wed, 8 Jan 2025 04:31:16 UTC (43,627 KB)
