TTOM: Test-Time Optimization and Memorization for Compositional Video Generation

Qu, Leigang; Wang, Ziyang; Zheng, Na; Wang, Wenjie; Nie, Liqiang; Chua, Tat-Seng

arXiv:2510.07940 (cs)

[Submitted on 9 Oct 2025]

Abstract: Video Foundation Models (VFMs) exhibit remarkable visual generation performance but struggle in compositional scenarios (e.g., motion, numeracy, and spatial relations). In this work, we introduce Test-Time Optimization and Memorization (TTOM), a training-free framework that aligns VFM outputs with spatiotemporal layouts during inference for better text-image alignment. Rather than intervening directly on latents or attention for each sample, as existing work does, we integrate and optimize new parameters guided by a general layout-attention objective. Furthermore, we formulate video generation in a streaming setting and maintain historical optimization contexts with a parametric memory mechanism that supports flexible operations such as insert, read, update, and delete. Notably, we find that TTOM disentangles compositional world knowledge, showing strong transferability and generalization. Experimental results on the T2V-CompBench and VBench benchmarks establish TTOM as an effective, practical, scalable, and efficient framework for achieving cross-modal alignment in compositional video generation on the fly.
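
The abstract names two ingredients without giving their form: a layout-attention objective and a parametric memory with insert, read, update, and delete operations. The Python sketch below is only an illustration of what such pieces might look like, not the authors' method. Everything in it is an assumption: the names `layout_attention_loss` and `ParametricMemory`, the choice of loss (one minus the share of attention mass inside a layout box), the prompt-string keys, the least-recently-used eviction, and the interpolation rule in `update`.

```python
from collections import OrderedDict


def layout_attention_loss(attn, box):
    """Hypothetical objective: 1 - (attention mass inside box / total mass).

    attn is a 2D list of non-negative weights; box is
    (top, left, bottom, right) with bottom and right exclusive.
    """
    top, left, bottom, right = box
    total = sum(sum(row) for row in attn)
    inside = sum(sum(row[left:right]) for row in attn[top:bottom])
    return 1.0 - inside / total if total > 0 else 1.0


class ParametricMemory:
    """Hypothetical bounded store mapping a prompt key to optimized parameters."""

    def __init__(self, capacity=2):
        self.capacity = capacity
        self.slots = OrderedDict()  # key -> list of floats, least recently used first

    def insert(self, key, params):
        # Assumed policy: evict the least recently used entry when full.
        if key not in self.slots and len(self.slots) >= self.capacity:
            self.delete(next(iter(self.slots)))
        self.slots[key] = list(params)
        self.slots.move_to_end(key)

    def read(self, key):
        # Returns None on a miss; a hit marks the entry as recently used.
        params = self.slots.get(key)
        if params is not None:
            self.slots.move_to_end(key)
        return params

    def update(self, key, new_params, rate=0.5):
        # Assumed rule: interpolate stored and new parameters.
        # Raises KeyError if the key was never inserted.
        old = self.slots[key]
        self.slots[key] = [(1 - rate) * o + rate * n for o, n in zip(old, new_params)]
        self.slots.move_to_end(key)

    def delete(self, key):
        self.slots.pop(key, None)
```

In a streaming setting, one plausible use of such a store is to `read` parameters for an incoming prompt, optimize them against the loss when nothing is found, and `insert` or `update` the result for later prompts. How TTOM actually keys, retrieves, and evicts entries is not stated in the abstract.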

Comments:	Project page: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
Cite as:	arXiv:2510.07940 [cs.CV]
	(or arXiv:2510.07940v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2510.07940

Submission history

From: Leigang Qu
[v1] Thu, 9 Oct 2025 08:37:00 UTC (2,533 KB)
