LocalMamba: Visual State Space Model with Windowed Selective Scan

Huang, Tao; Pei, Xiaohuan; You, Shan; Wang, Fei; Qian, Chen; Xu, Chang

arXiv:2403.09338 (cs)

[Submitted on 14 Mar 2024]

Abstract: Recent advancements in state space models, notably Mamba, have demonstrated significant progress in modeling long sequences for tasks like language understanding. Yet, their application in vision tasks has not markedly surpassed the performance of traditional Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). This paper posits that the key to enhancing Vision Mamba (ViM) lies in optimizing scan directions for sequence modeling. Traditional ViM approaches, which flatten spatial tokens, overlook the preservation of local 2D dependencies, thereby elongating the distance between adjacent tokens. We introduce a novel local scanning strategy that divides images into distinct windows, effectively capturing local dependencies while maintaining a global perspective. Additionally, acknowledging the varying preferences for scan patterns across different network layers, we propose a dynamic method to independently search for the optimal scan choices for each layer, substantially improving performance. Extensive experiments across both plain and hierarchical models underscore our approach's superiority in effectively capturing image representations. For example, our model significantly outperforms Vim-Ti by 3.1% on ImageNet with the same 1.5G FLOPs. Code is available at: this https URL.
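
The effect of window-based scanning on token distance can be illustrated with a small sketch. The code below is not the authors' implementation; the function names (`row_major_order`, `windowed_scan_order`, `sequence_distance`) and the 4x4 grid with 2x2 windows are assumptions chosen only for illustration. It compares the order in which tokens are visited under a plain row-major flatten with a scan that visits each window in full before moving to the next one.

```python
def row_major_order(height, width):
    """Flat token indices visited by a plain row-by-row flatten."""
    return [r * width + c for r in range(height) for c in range(width)]


def windowed_scan_order(height, width, window):
    """Flat token indices visited window by window (hypothetical helper).

    Windows are traversed in row-major order, and the tokens inside each
    window are also traversed in row-major order.
    """
    if height % window or width % window:
        raise ValueError("grid size must be divisible by the window size")
    order = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            for r in range(top, top + window):
                for c in range(left, left + window):
                    order.append(r * width + c)
    return order


def sequence_distance(order, token_a, token_b):
    """Number of scan steps separating two tokens in a given order."""
    return abs(order.index(token_a) - order.index(token_b))


if __name__ == "__main__":
    flat = row_major_order(4, 4)
    local = windowed_scan_order(4, 4, 2)
    # Tokens 0 and 4 are vertical neighbours in the 4x4 grid.
    print(sequence_distance(flat, 0, 4))   # 4 steps apart when flattened
    print(sequence_distance(local, 0, 4))  # 2 steps apart inside a window
```

In this toy setting, two vertically adjacent tokens sit a full row apart in the flattened sequence but only a window width apart in the windowed one, which is the kind of locality the abstract describes. The scan still covers every token exactly once, so the sequence as a whole continues to span the entire image.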

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2403.09338 [cs.CV]
	(or arXiv:2403.09338v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2403.09338

Submission history

From: Tao Huang
[v1] Thu, 14 Mar 2024 12:32:40 UTC (1,872 KB)
