Ingredients: Blending Custom Photos with Video Diffusion Transformers

Fei, Zhengcong; Li, Debang; Qiu, Di; Yu, Changqian; Fan, Mingyuan

arXiv:2501.01790 (cs)

[Submitted on 3 Jan 2025 (v1), last revised 18 Mar 2025 (this version, v2)]

Abstract: This paper presents Ingredients, a framework for customizing video creation by incorporating multiple specific identity (ID) photos with video diffusion Transformers. The method consists of three primary modules: (i) a facial extractor that captures versatile and precise facial features for each human ID from both global and local perspectives; (ii) a multi-scale projector that maps face embeddings into the contextual space of the image query in video diffusion Transformers; and (iii) an ID router that dynamically combines and allocates multiple ID embeddings to the corresponding space-time regions. Leveraging a meticulously curated text-video dataset and a multi-stage training protocol, Ingredients demonstrates superior performance in turning custom photos into dynamic and personalized video content. Qualitative evaluations highlight the advantages of the proposed method over existing methods, positioning it as a significant advance toward more effective generative video control tools in Transformer-based architectures. The data, code, and model weights are publicly available at: this https URL.
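
The abstract does not spell out how the ID router allocates embeddings, so the following is only a toy sketch of the general idea of routing, not the authors' implementation. It assumes each space-time token is scored against every ID embedding by dot product, the scores are normalized with a softmax, and the token receives the embedding of its highest-scoring ID by simple addition. The names `route_ids`, `softmax`, and `dot`, the additive fusion, and the two-dimensional example vectors are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def route_ids(tokens, id_embeddings):
    """Assign each space-time token to one ID and add that ID's embedding.

    tokens:        list of token feature vectors, one per space-time position
    id_embeddings: list of ID feature vectors, one per reference identity
    Returns (assignments, fused_tokens).

    Assumption: dot-product scoring and additive fusion are illustrative
    choices; the source only states that IDs are allocated to regions.
    """
    assignments, fused = [], []
    for tok in tokens:
        # Probability that this token belongs to each identity.
        probs = softmax([dot(tok, emb) for emb in id_embeddings])
        # Hard routing: pick the single most likely identity.
        best = max(range(len(probs)), key=probs.__getitem__)
        assignments.append(best)
        fused.append([t + e for t, e in zip(tok, id_embeddings[best])])
    return assignments, fused


if __name__ == "__main__":
    ids = [[1.0, 0.0], [0.0, 1.0]]               # two reference identities
    toks = [[2.0, 0.1], [0.2, 3.0], [1.5, 0.5]]  # three space-time tokens
    assignments, fused = route_ids(toks, ids)
    print(assignments)
```

In this toy setup, tokens whose features align with the first ID vector are routed to the first identity and the remainder to the second, which mirrors the stated goal of sending each ID's features to its own space-time region.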

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2501.01790 [cs.CV]
	(or arXiv:2501.01790v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2501.01790

Submission history

From: Zhengcong Fei
[v1] Fri, 3 Jan 2025 12:45:22 UTC (7,352 KB)
[v2] Tue, 18 Mar 2025 10:47:27 UTC (7,350 KB)
