VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step

Wang, Hanyang; Liu, Fangfu; Chi, Jiawei; Duan, Yueqi

arXiv:2504.01956 (cs)

[Submitted on 2 Apr 2025 (v1), last revised 3 Apr 2025 (this version, v2)]

Abstract: Recovering 3D scenes from sparse views is challenging because the problem is inherently ill-posed. Conventional methods have developed specialized solutions (e.g., geometry regularization or feed-forward deterministic models) to mitigate the issue. However, their performance still degrades when the input views overlap minimally and provide insufficient visual information. Fortunately, recent video generative models show promise in addressing this challenge, as they can generate video clips with plausible 3D structures. Powered by large pretrained video diffusion models, some pioneering studies have begun to explore video generative priors to create 3D scenes from sparse views. Despite impressive improvements, they are limited by slow inference and the lack of 3D constraints, leading to inefficiency and reconstruction artifacts that do not align with real-world geometric structure. In this paper, we propose VideoScene, which distills the video diffusion model to generate 3D scenes in one step, aiming to build an efficient and effective tool to bridge the gap from video to 3D. Specifically, we design a 3D-aware leap flow distillation strategy to leap over time-consuming redundant information, and we train a dynamic denoising policy network to adaptively determine the optimal leap timestep during inference. Extensive experiments demonstrate that VideoScene achieves faster and superior 3D scene generation than previous video diffusion models, highlighting its potential as an efficient tool for future video-to-3D applications. Project Page: this https URL

Comments:	Accepted by CVPR 2025; Project Page: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2504.01956 [cs.CV]
	(or arXiv:2504.01956v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2504.01956

Submission history

From: Hanyang Wang
[v1] Wed, 2 Apr 2025 17:59:21 UTC (26,179 KB)
[v2] Thu, 3 Apr 2025 14:07:13 UTC (26,179 KB)
