Vision-LSTM: xLSTM as Generic Vision Backbone

Alkin, Benedikt; Beck, Maximilian; Pöppel, Korbinian; Hochreiter, Sepp; Brandstetter, Johannes

arXiv:2406.04303 (cs)

[Submitted on 6 Jun 2024 (v1), last revised 20 Feb 2025 (this version, v3)]

Abstract: Transformers are widely used as generic backbones in computer vision, despite being initially introduced for natural language processing. Recently, the Long Short-Term Memory (LSTM) has been extended to a scalable and performant architecture - the xLSTM - which overcomes long-standing LSTM limitations via exponential gating and a parallelizable matrix memory structure. In this report, we introduce Vision-LSTM (ViL), an adaptation of the xLSTM building blocks to computer vision. ViL comprises a stack of xLSTM blocks where odd blocks process the sequence of patch tokens from top to bottom while even blocks go from bottom to top. Experiments show that ViL holds promise to be further deployed as a new generic backbone for computer vision architectures.
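
The alternating traversal described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `alternating_stack` and `causal_running_mean` are hypothetical, tokens are plain floats standing in for patch embeddings, and the running mean is only a placeholder for an xLSTM block (it shares just the property that position i sees tokens 0..i). The sketch assumes that reversing direction amounts to flipping the token sequence before a block and flipping the result back afterwards.

```python
from typing import Callable, List, Sequence

Token = float  # stand-in for a patch embedding vector (assumption)
Block = Callable[[List[Token]], List[Token]]


def causal_running_mean(tokens: List[Token]) -> List[Token]:
    """Placeholder for an xLSTM block: output i depends only on tokens 0..i."""
    out: List[Token] = []
    total = 0.0
    for count, token in enumerate(tokens, start=1):
        total += token
        out.append(total / count)
    return out


def alternating_stack(tokens: Sequence[Token], blocks: Sequence[Block]) -> List[Token]:
    """Apply blocks in order; odd-numbered blocks (1st, 3rd, ...) read the
    sequence as given, even-numbered blocks read it reversed."""
    x = list(tokens)
    for index, block in enumerate(blocks, start=1):
        if index % 2 == 1:
            x = block(x)              # first patch -> last patch
        else:
            x = block(x[::-1])[::-1]  # last patch -> first patch, then restore order
    return x


if __name__ == "__main__":
    patches = [1.0, 2.0, 3.0, 4.0]  # flattened patch sequence
    print(alternating_stack(patches, [causal_running_mean, causal_running_mean]))
```

With two blocks, every output position has received information from both earlier and later patches, which a single unidirectional pass cannot provide.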

Comments:	Published as a conference paper at ICLR 2025, Github: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2406.04303 [cs.CV]
	(or arXiv:2406.04303v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2406.04303

Submission history

From: Benedikt Alkin
[v1] Thu, 6 Jun 2024 17:49:21 UTC (249 KB)
[v2] Tue, 2 Jul 2024 12:39:46 UTC (284 KB)
[v3] Thu, 20 Feb 2025 23:20:40 UTC (306 KB)
