Cascaded Multi-Scale Attention for Enhanced Multi-Scale Feature Extraction and Interaction with Low-Resolution Images

Lu, Xiangyong; Suganuma, Masanori; Okatani, Takayuki

Computer Science > Computer Vision and Pattern Recognition

arXiv:2412.02197 (cs)

[Submitted on 3 Dec 2024 (v1), last revised 22 Aug 2025 (this version, v3)]

Title:Cascaded Multi-Scale Attention for Enhanced Multi-Scale Feature Extraction and Interaction with Low-Resolution Images

Authors:Xiangyong Lu, Masanori Suganuma, Takayuki Okatani

View PDF HTML (experimental)

Abstract:In real-world applications of image recognition tasks, such as human pose estimation, cameras often capture objects, like human bodies, at low resolutions. This scenario poses a challenge in extracting and leveraging multi-scale features, which is often essential for precise inference. To address this challenge, we propose a new attention mechanism, named cascaded multi-scale attention (CMSA), tailored for use in CNN-ViT hybrid architectures, to handle low-resolution inputs effectively. The design of CMSA enables the extraction and seamless integration of features across various scales without necessitating the downsampling of the input image or feature maps. This is achieved through a novel combination of grouped multi-head self-attention mechanisms with window-based local attention and cascaded fusion of multi-scale features over different scales. This architecture allows for the effective handling of features across different scales, enhancing the model's ability to perform tasks such as human pose estimation, head pose estimation, and more with low-resolution images. Our experimental results show that the proposed method outperforms existing state-of-the-art methods in these areas with fewer parameters, showcasing its potential for broad application in real-world scenarios where capturing high-resolution images is not feasible. Code is available at this https URL.

Comments:	10 pages, 6 figures, 5 tables
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2412.02197 [cs.CV]
	(or arXiv:2412.02197v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2412.02197

Submission history

From: Xiangyong Lu [view email]
[v1] Tue, 3 Dec 2024 06:23:19 UTC (1,415 KB)
[v2] Thu, 17 Jul 2025 01:59:39 UTC (1,769 KB)
[v3] Fri, 22 Aug 2025 06:25:34 UTC (1,768 KB)

Computer Science > Computer Vision and Pattern Recognition

Title:Cascaded Multi-Scale Attention for Enhanced Multi-Scale Feature Extraction and Interaction with Low-Resolution Images

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Computer Vision and Pattern Recognition

Title:Cascaded Multi-Scale Attention for Enhanced Multi-Scale Feature Extraction and Interaction with Low-Resolution Images

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators