Diffusion Classifiers Understand Compositionality, but Conditions Apply

Jeong, Yujin; Uselis, Arnas; Oh, Seong Joon; Rohrbach, Anna

Computer Science > Computer Vision and Pattern Recognition

arXiv:2505.17955 (cs)

[Submitted on 23 May 2025 (v1), last revised 29 May 2025 (this version, v2)]

Abstract: Understanding visual scenes is fundamental to human intelligence. While discriminative models have significantly advanced computer vision, they often struggle with compositional understanding. In contrast, recent generative text-to-image diffusion models excel at synthesizing complex scenes, suggesting inherent compositional capabilities. Building on this, zero-shot diffusion classifiers have been proposed to repurpose diffusion models for discriminative tasks. While prior work offered promising results in discriminative compositional scenarios, these results remain preliminary due to a small number of benchmarks and a relatively shallow analysis of conditions under which the models succeed. To address this, we present a comprehensive study of the discriminative capabilities of diffusion classifiers on a wide range of compositional tasks. Specifically, our study covers three diffusion models (SD 1.5, 2.0, and, for the first time, 3-m) spanning 10 datasets and over 30 tasks. Further, we shed light on the role that target dataset domains play in respective performance; to isolate the domain effects, we introduce a new diagnostic benchmark Self-Bench comprised of images created by diffusion models themselves. Finally, we explore the importance of timestep weighting and uncover a relationship between domain gap and timestep sensitivity, particularly for SD3-m. To sum up, diffusion classifiers understand compositionality, but conditions apply! Code and dataset are available at this https URL.

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2505.17955 [cs.CV]
	(or arXiv:2505.17955v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2505.17955

Submission history

From: Yujin Jeong
[v1] Fri, 23 May 2025 14:29:52 UTC (36,108 KB)
[v2] Thu, 29 May 2025 17:59:50 UTC (36,107 KB)
