Efficient-vDiT: Efficient Video Diffusion Transformers With Attention Tile

Ding, Hangliang; Li, Dacheng; Su, Runlong; Zhang, Peiyuan; Deng, Zhijie; Stoica, Ion; Zhang, Hao

arXiv:2502.06155 (cs)

[Submitted on 10 Feb 2025 (v1), last revised 17 Feb 2025 (this version, v2)]

Abstract: Despite the promise of synthesizing high-fidelity videos, Diffusion Transformers (DiTs) with 3D full attention suffer from expensive inference due to the complexity of attention computation and numerous sampling steps. For example, the popular Open-Sora-Plan model takes more than 9 minutes to generate a single video of 29 frames. This paper addresses the inefficiency from two aspects: 1) Prune the 3D full attention based on the redundancy within video data. We identify a prevalent tile-style repetitive pattern in the 3D attention maps for video data, and advocate a new family of sparse 3D attention that has linear complexity with respect to the number of video frames. 2) Shorten the sampling process by adopting existing multi-step consistency distillation. We split the entire sampling trajectory into several segments and perform consistency distillation within each one to activate few-step generation capacities. We further devise a three-stage training pipeline to combine the low-complexity attention and few-step generation capacities. Notably, with 0.1% of the pretraining data, we turn the Open-Sora-Plan-1.2 model into an efficient one that is 7.4x to 7.8x faster for 29- and 93-frame 720p video generation with a marginal performance trade-off in VBench. In addition, we demonstrate that our approach is amenable to distributed inference, achieving an additional 3.91x speedup when running on 4 GPUs with sequence parallelism.
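
To make the linear-complexity claim concrete, the following is a minimal sketch, not the paper's tile-based attention. It assumes a simpler stand-in pattern in which each query token attends only to tokens in frames within a fixed distance of its own frame. The names `frame_window_mask`, `attended_pairs`, and the `window` parameter are hypothetical and chosen for illustration only. Under this assumption the number of attended token pairs grows linearly with the frame count, whereas full 3D attention grows quadratically.

```python
def frame_window_mask(num_frames, tokens_per_frame, window):
    """Build an (N, N) boolean mask with N = num_frames * tokens_per_frame.

    Assumed stand-in pattern (not the paper's exact mask): a query token may
    attend to every token whose frame index is within `window` of its own.
    """
    # Frame index of each token when frames are laid out one after another.
    frame_ids = [f for f in range(num_frames) for _ in range(tokens_per_frame)]
    return [[abs(q - k) <= window for k in frame_ids] for q in frame_ids]


def attended_pairs(mask):
    """Count the (query, key) pairs the mask keeps."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    tokens_per_frame, window = 4, 1
    for num_frames in (8, 16, 32):
        n = num_frames * tokens_per_frame
        full = n * n  # pairs computed by full 3D attention
        sparse = attended_pairs(
            frame_window_mask(num_frames, tokens_per_frame, window)
        )
        print(f"frames={num_frames:3d}  full={full:6d}  sparse={sparse:5d}")
```

With a fixed `window` of w and T tokens per frame, the sparse count is T^2 * (F * (2w + 1) - w * (w + 1)) for F frames, so doubling F roughly doubles the cost, while the full count (F * T)^2 quadruples.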

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2502.06155 [cs.CV]
	(or arXiv:2502.06155v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2502.06155

Submission history

From: Ding Hangliang
[v1] Mon, 10 Feb 2025 05:00:56 UTC (25,823 KB)
[v2] Mon, 17 Feb 2025 07:08:23 UTC (25,823 KB)
