Paper2Video: Automatic Video Generation from Scientific Papers

Zhu, Zeyu; Lin, Kevin Qinghong; Shou, Mike Zheng

arXiv:2510.05096 (cs)

[Submitted on 6 Oct 2025 (v1), last revised 9 Oct 2025 (this version, v2)]

Abstract: Academic presentation videos have become an essential medium for research communication, yet producing them remains highly labor-intensive, often requiring hours of slide design, recording, and editing for a short 2- to 10-minute video. Unlike natural video, presentation video generation involves distinctive challenges: inputs drawn from research papers, dense multi-modal information (text, figures, tables), and the need to coordinate multiple aligned channels such as slides, subtitles, speech, and a human talker. To address these challenges, we introduce Paper2Video, the first benchmark of 101 research papers paired with author-created presentation videos, slides, and speaker metadata. We further design four tailored evaluation metrics (Meta Similarity, PresentArena, PresentQuiz, and IP Memory) to measure how well videos convey the paper's information to the audience. Building on this foundation, we propose PaperTalker, the first multi-agent framework for academic presentation video generation. It integrates slide generation with effective layout refinement through a novel tree search visual choice, cursor grounding, subtitling, speech synthesis, and talking-head rendering, while parallelizing slide-wise generation for efficiency. Experiments on Paper2Video demonstrate that the presentation videos produced by our approach are more faithful and informative than existing baselines, establishing a practical step toward automated and ready-to-use academic video generation. Our dataset, agent, and code are available at this https URL.
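
The abstract mentions parallelizing slide-wise generation for efficiency. The following is a minimal sketch of that general idea only, not the authors' PaperTalker implementation: it assumes each slide's subtitle, speech, and cursor assets can be produced independently of the other slides. All names here (`SlideAssets`, `build_slide_assets`, `generate_video_plan`) are hypothetical, and the per-slide steps are placeholders rather than real model calls.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class SlideAssets:
    """Per-slide outputs; field names are illustrative assumptions."""
    index: int
    subtitle: str
    speech: str
    cursor: tuple


def build_slide_assets(index, slide_text):
    # Placeholder steps standing in for subtitling, speech synthesis,
    # and cursor grounding. A real system would call models here.
    subtitle = slide_text.strip()
    speech = f"<audio for slide {index}>"
    cursor = (0.5, 0.5)  # dummy normalized (x, y) position
    return SlideAssets(index, subtitle, speech, cursor)


def generate_video_plan(slides, max_workers=4):
    # Slides are assumed independent, so each one is processed in its
    # own task. Executor.map returns results in input order, which keeps
    # the slides in presentation order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(build_slide_assets, range(len(slides)), slides))
```

Calling `generate_video_plan([" Intro ", "Method", "Results"])` returns one `SlideAssets` record per slide, in the original slide order. Threads suit this sketch because such per-slide steps would typically wait on external model or API calls; a process pool may fit better if the work is CPU-bound.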

Comments:	Project Page: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA); Multimedia (cs.MM)
Cite as:	arXiv:2510.05096 [cs.CV]
	(or arXiv:2510.05096v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2510.05096

Submission history

From: Zeyu Zhu Mr.
[v1] Mon, 6 Oct 2025 17:58:02 UTC (6,815 KB)
[v2] Thu, 9 Oct 2025 17:29:00 UTC (6,817 KB)
