Improving Compositional Text-to-image Generation with Large Vision-Language Models

Wen, Song; Fang, Guian; Zhang, Renrui; Gao, Peng; Dong, Hao; Metaxas, Dimitris

Computer Science > Computer Vision and Pattern Recognition

arXiv:2310.06311 (cs)

[Submitted on 10 Oct 2023]

Abstract: Recent advancements in text-to-image models, particularly diffusion models, have shown significant promise. However, compositional text-to-image models frequently struggle to generate high-quality images that accurately align with input texts describing multiple objects, variable attributes, and intricate spatial relationships. To address this limitation, we employ large vision-language models (LVLMs) for multi-dimensional assessment of the alignment between generated images and their corresponding input texts. Using this assessment, we fine-tune the diffusion model to enhance its alignment capabilities. During inference, the fine-tuned diffusion model produces an initial image. The LVLM then pinpoints areas of misalignment in the initial image, which are corrected using an image editing algorithm until the LVLM detects no further misalignments. The resulting image is therefore more closely aligned with the input text. Our experimental results show that the proposed method significantly improves text-image alignment in compositional image generation, particularly with respect to object number, attribute binding, spatial relationships, and aesthetic quality.
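The inference procedure described in the abstract (generate, check with the LVLM, edit, repeat) can be read as a simple control loop. The following is a minimal sketch of that loop only, not the authors' implementation: the callables `generate`, `find_misalignments`, and `edit` are hypothetical placeholders for the fine-tuned diffusion model, the LVLM check, and the image editing algorithm, and the `max_rounds` cap is an added assumption to guarantee termination (the abstract only says editing continues until no misalignments are detected). The toy stand-ins represent an "image" as a dictionary of object counts purely for illustration.

```python
from typing import Callable, Dict, List, TypeVar

Image = TypeVar("Image")


def generate_with_correction(
    prompt: str,
    generate: Callable[[str], Image],                      # hypothetical: fine-tuned diffusion model
    find_misalignments: Callable[[Image, str], List[str]],  # hypothetical: LVLM check
    edit: Callable[[Image, str, List[str]], Image],         # hypothetical: image editing step
    max_rounds: int = 5,                                    # assumption: cap on editing rounds
) -> Image:
    """Generate an image, then edit it until the checker reports no issues."""
    image = generate(prompt)
    for _ in range(max_rounds):
        issues = find_misalignments(image, prompt)
        if not issues:
            break  # the checker found nothing left to correct
        image = edit(image, prompt, issues)
    return image


# Toy stand-ins: an "image" is a dict mapping object names to counts.
TOY_TARGET: Dict[str, int] = {"dog": 2, "cat": 1}


def toy_generate(prompt: str) -> Dict[str, int]:
    # Pretend the initial image has one dog and no cat.
    return {"dog": 1}


def toy_find_misalignments(image: Dict[str, int], prompt: str) -> List[str]:
    # Report every object whose count differs from the target.
    return [name for name, count in TOY_TARGET.items() if image.get(name, 0) != count]


def toy_edit(image: Dict[str, int], prompt: str, issues: List[str]) -> Dict[str, int]:
    # Correct only the first reported issue per round.
    fixed = dict(image)
    fixed[issues[0]] = TOY_TARGET[issues[0]]
    return fixed


if __name__ == "__main__":
    result = generate_with_correction(
        "two dogs and a cat", toy_generate, toy_find_misalignments, toy_edit
    )
    print(result)
```

Passing the three stages in as callables keeps the loop independent of any particular model or editing library, so real components could be substituted without changing the control flow.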

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Cite as:	arXiv:2310.06311 [cs.CV]
	(or arXiv:2310.06311v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2310.06311

Submission history

From: Song Wen
[v1] Tue, 10 Oct 2023 05:09:05 UTC (4,045 KB)
