Diffusion Classifiers Understand Compositionality, but Conditions Apply

Authors: Yujin Jeong, Arnas Uselis, Seong Joon Oh, Anna Rohrbach

Abstract: Understanding visual scenes is fundamental to human intelligence. While discriminative models have significantly advanced computer vision, they often struggle with compositional understanding. In contrast, recent generative text-to-image diffusion models excel at synthesizing complex scenes, suggesting inherent compositional capabilities. Building on this, zero-shot diffusion classifiers have been proposed to repurpose diffusion models for discriminative tasks. While prior work offered promising results in discriminative compositional scenarios, these results remain preliminary due to a small number of benchmarks and a relatively shallow analysis of conditions under which the models succeed. To address this, we present a comprehensive study of the discriminative capabilities of diffusion classifiers on a wide range of compositional tasks. Specifically, our study covers three diffusion models (SD 1.5, 2.0, and, for the first time, 3-m) spanning 10 datasets and over 30 tasks. Further, we shed light on the role that target dataset domains play in respective performance; to isolate the domain effects, we introduce a new diagnostic benchmark Self-Bench comprised of images created by diffusion models themselves. Finally, we explore the importance of timestep weighting and uncover a relationship between domain gap and timestep sensitivity, particularly for SD3-m. To sum up, diffusion classifiers understand compositionality, but conditions apply! Code and dataset are available at https://github.com/eugene6923/Diffusion-Classifiers-Compositionality.

Submitted 29 May, 2025; v1 submitted 23 May, 2025; originally announced May 2025.

arXiv:2504.05461

Intermediate Layer Classifiers for OOD generalization

Authors: Arnas Uselis, Seong Joon Oh

Abstract: Deep classifiers are known to be sensitive to data distribution shifts, primarily due to their reliance on spurious correlations in training data. It has been suggested that these classifiers can still find useful features in the network's last layer that hold up under such shifts. In this work, we question the use of last-layer representations for out-of-distribution (OOD) generalisation and explore the utility of intermediate layers. To this end, we introduce \textit{Intermediate Layer Classifiers} (ILCs). We discover that intermediate layer representations frequently offer substantially better generalisation than those from the penultimate layer. In many cases, zero-shot OOD generalisation using earlier-layer representations approaches the few-shot performance of retraining on penultimate layer representations. This is confirmed across multiple datasets, architectures, and types of distribution shifts. Our analysis suggests that intermediate layers are less sensitive to distribution shifts compared to the penultimate layer. These findings highlight the importance of understanding how information is distributed across network layers and its role in OOD generalisation, while also pointing to the limits of penultimate layer representation utility. Code is available at https://github.com/oshapio/intermediate-layer-generalization

Submitted 7 April, 2025; originally announced April 2025.

Comments: ICLR 2025

arXiv:2502.03566

CLIP Behaves like a Bag-of-Words Model Cross-modally but not Uni-modally

Authors: Darina Koishigarina, Arnas Uselis, Seong Joon Oh

Abstract: CLIP (Contrastive Language-Image Pretraining) has become a popular choice for various downstream tasks. However, recent studies have questioned its ability to represent compositional concepts effectively. These works suggest that CLIP often acts like a bag-of-words (BoW) model, interpreting images and text as sets of individual concepts without grasping the structural relationships. In particular, CLIP struggles to correctly bind attributes to their corresponding objects when multiple objects are present in an image or text. In this work, we investigate why CLIP exhibits this BoW-like behavior. We find that the correct attribute-object binding information is already present in individual text and image modalities. Instead, the issue lies in the cross-modal alignment, which relies on cosine similarity. To address this, we propose Linear Attribute Binding CLIP or LABCLIP. It applies a linear transformation to text embeddings before computing cosine similarity. This approach significantly improves CLIP's ability to bind attributes to correct objects, thereby enhancing its compositional understanding. The code is available at https://github.com/kdariina/CLIP-not-BoW-unimodally.

Submitted 8 February, 2025; v1 submitted 5 February, 2025; originally announced February 2025.
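
The scoring step described in the abstract above (a linear transformation applied to text embeddings before cosine similarity) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 2-dimensional toy embeddings, and the hand-set matrix are all hypothetical, and in the paper's setting the transformation would be learned rather than written by hand.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def cosine(a, b):
    """Cosine similarity of two non-zero vectors (zero vectors raise ZeroDivisionError)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def score(image_emb, text_emb, transform=None):
    """Image-text score; optionally map the text embedding through `transform` first."""
    if transform is not None:
        text_emb = matvec(transform, text_emb)
    return cosine(image_emb, text_emb)


def rank_captions(image_emb, caption_embs, transform=None):
    """Return (index of best-scoring caption, list of all scores)."""
    scores = [score(image_emb, c, transform) for c in caption_embs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


if __name__ == "__main__":
    # Toy values, purely illustrative (assumption): not real CLIP embeddings.
    image = [1.0, 0.0]
    captions = [[0.0, 1.0], [1.0, 0.0]]
    hand_set_transform = [[0.0, 1.0], [1.0, 0.0]]  # stands in for a learned matrix
    print(rank_captions(image, captions))
    print(rank_captions(image, captions, hand_set_transform))
```

Because only the text side is transformed, the image and text encoders stay untouched in this sketch; the matrix alone changes which caption scores highest.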

arXiv:2204.05192

Task-Synchronized Recurrent Neural Networks

Authors: Mantas Lukoševičius, Arnas Uselis

Abstract: Data are often sampled irregularly in time. Dealing with this using Recurrent Neural Networks (RNNs) traditionally involved ignoring the fact, feeding the time differences as additional inputs, or resampling the data. All these methods have their shortcomings. We propose an elegant straightforward alternative approach where instead the RNN is in effect resampled in time to match the time of the data or the task at hand. We use Echo State Network (ESN) and Gated Recurrent Unit (GRU) as the basis for our solution. Such RNNs can be seen as discretizations of continuous-time dynamical systems, which gives a solid theoretical ground to our approach. Our Task-Synchronized ESN (TSESN) and GRU (TSGRU) models allow for a direct model time setting and require no additional training, parameter tuning, or computation (solving differential equations or interpolating data) compared to their regular counterparts, thus retaining their original efficiency. We confirm empirically that our models can effectively compensate for the time-non-uniformity of the data and demonstrate that they compare favorably to data resampling, classical RNN methods, and alternative RNN models proposed to deal with time irregularities on several real-world nonuniform-time datasets. We open-source the code at https://github.com/oshapio/task-synchronized-RNNs.

Submitted 2 July, 2024; v1 submitted 11 April, 2022; originally announced April 2022.

Comments: The 1st version was written in May 2019 and double-blind reviewed for a prominent conference. A major update. We changed the name of the article and methods to an arguably more precise one, and because a very similar title has been published in the meantime. We've rewritten much of the text, connected to the current literature, redone some experiments, figures, discussion, published source code

MSC Class: 68T07; 68T05; 37M10

ACM Class: I.2.6; G.1.2

arXiv:2006.11282

doi 10.1007/s12559-021-09849-2

Efficient implementations of echo state network cross-validation

Authors: Mantas Lukoševičius, Arnas Uselis

Abstract: Background/introduction: Cross-Validation (CV) is still uncommon in time series modeling. Echo State Networks (ESNs), as a prime example of Reservoir Computing (RC) models, are known for their fast and precise one-shot learning, that often benefit from good hyper-parameter tuning. This makes them ideal to change the status quo. Methods: We discuss CV of time series for predicting a concrete time interval of interest, suggest several schemes for cross-validating ESNs and introduce an efficient algorithm for implementing them. This algorithm is presented as two levels of optimizations of doing $k$-fold CV. Training an RC model typically consists of two stages: (i) running the reservoir with the data and (ii) computing the optimal readouts. The first level of our optimization addresses the most computationally expensive part (i) and makes it remain constant irrespective of $k$. It dramatically reduces reservoir computations in any type of RC system and is enough if $k$ is small. The second level of optimization also makes the (ii) part remain constant irrespective of large $k$, as long as the dimension of the output is low. We discuss when the proposed validation schemes for ESNs could be beneficial, three options for producing the final model and empirically investigate them on six different real-world datasets, as well as do empirical computation time experiments. We provide the code in an online repository. Results: Proposed CV schemes give better and more stable test performance in all the six different real-world datasets, three task types. Empirical run times confirm our complexity analysis. Conclusions: In most situations $k$-fold CV of ESNs and many other RC models can be done for virtually the same time and space complexity as a simple single-split validation. This enables CV to become a standard practice in RC.

Submitted 3 December, 2020; v1 submitted 19 June, 2020; originally announced June 2020.

Comments: arXiv admin note: substantial text overlap with arXiv:1908.08450

MSC Class: 68T05 (Primary) 37M10; 15A06 (Secondary)

ACM Class: I.2.6

Journal ref: Cognitive Computation, 2021
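
The two training stages named in the abstract above, (i) running the reservoir and (ii) computing the readouts, suggest a simple way to see why $k$-fold CV need not cost $k$ reservoir runs. The sketch below is an illustration under stated assumptions, not the paper's algorithm: the reservoir is run once, per-fold Gram matrices are computed once, and each fold's ridge-regression readout is solved from the totals minus the held-out fold's contribution. The function names, reservoir size, leak rate, spectral radius, and ridge value are all hypothetical, and washout and state continuity across fold boundaries are ignored for brevity.

```python
import numpy as np


def run_reservoir(u, n_res=20, leak=0.3, rho=0.9, seed=0):
    """Drive a small leaky-tanh reservoir with a 1-D input; return states with a bias column."""
    rng = np.random.default_rng(seed)
    w_in = rng.uniform(-0.5, 0.5, size=(n_res, 2))  # columns: bias, input
    w = rng.uniform(-0.5, 0.5, size=(n_res, n_res))
    w *= rho / np.max(np.abs(np.linalg.eigvals(w)))  # rescale to spectral radius rho
    x = np.zeros(n_res)
    states = np.empty((len(u), n_res + 1))
    for t, u_t in enumerate(u):
        x = (1 - leak) * x + leak * np.tanh(w_in @ np.array([1.0, u_t]) + w @ x)
        states[t] = np.concatenate(([1.0], x))
    return states


def kfold_cv_errors(states, targets, k=5, ridge=1e-3):
    """Validation MSE per fold, reusing reservoir states computed once."""
    folds = np.array_split(np.arange(len(states)), k)  # contiguous time blocks
    # Per-fold sufficient statistics for ridge regression, each computed once.
    xtx = [states[f].T @ states[f] for f in folds]
    xty = [states[f].T @ targets[f] for f in folds]
    xtx_all, xty_all = sum(xtx), sum(xty)
    eye = np.eye(states.shape[1])
    errors = []
    for i, f in enumerate(folds):
        # Training statistics = totals minus the held-out fold's contribution.
        w_out = np.linalg.solve(xtx_all - xtx[i] + ridge * eye, xty_all - xty[i])
        errors.append(float(np.mean((states[f] @ w_out - targets[f]) ** 2)))
    return errors


if __name__ == "__main__":
    signal = np.sin(0.2 * np.arange(300))  # toy series; one-step-ahead prediction
    states = run_reservoir(signal[:-1])
    print(kfold_cv_errors(states, signal[1:], k=5))
```

In this sketch the reservoir cost is paid once regardless of $k$; only the small linear solve is repeated per fold.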

arXiv:2005.05930

doi 10.3390/en13133440

Localized convolutional neural networks for geospatial wind forecasting

Authors: Arnas Uselis, Mantas Lukoševičius, Lukas Stasytis

Abstract: Convolutional Neural Networks (CNN) possess many positive qualities when it comes to spatial raster data. Translation invariance enables CNNs to detect features regardless of their position in the scene. However, in some domains, like geospatial, not all locations are exactly equal. In this work, we propose localized convolutional neural networks that enable convolutional architectures to learn local features in addition to the global ones. We investigate their instantiations in the form of learnable inputs, local weights, and a more general form. They can be added to any convolutional layers, easily end-to-end trained, introduce minimal additional complexity, and let CNNs retain most of their benefits to the extent that they are needed. In this work we address spatio-temporal prediction: test the effectiveness of our methods on a synthetic benchmark dataset and tackle three real-world wind prediction datasets. For one of them, we propose a method to spatially order the unordered data. We compare the recent state-of-the-art spatio-temporal prediction models on the same data. Models that use convolutional layers can be and are extended with our localizations. In all these cases our extensions improve the results, and thus often the state-of-the-art. We share all the code at a public repository.

Submitted 10 July, 2020; v1 submitted 12 May, 2020; originally announced May 2020.

MSC Class: 68T05

ACM Class: I.2.6

Journal ref: Energies, 13 (13), pp. 3440, 2020

arXiv:1908.08450

doi 10.1007/978-3-030-30493-5_12

Efficient Cross-Validation of Echo State Networks

Authors: Mantas Lukoševičius, Arnas Uselis

Abstract: Echo State Networks (ESNs) are known for their fast and precise one-shot learning of time series. But they often need good hyper-parameter tuning for best performance. For this good validation is key, but usually, a single validation split is used. In this rather practical contribution we suggest several schemes for cross-validating ESNs and introduce an efficient algorithm for implementing them. The component that dominates the time complexity of the already quite fast ESN training remains constant (does not scale up with $k$) in our proposed method of doing $k$-fold cross-validation. The component that does scale linearly with $k$ starts dominating only in some not very common situations. Thus in many situations $k$-fold cross-validation of ESNs can be done for virtually the same time complexity as a simple single split validation. Space complexity can also remain the same. We also discuss when the proposed validation schemes for ESNs could be beneficial and empirically investigate them on several different real-world datasets.

Submitted 22 August, 2019; originally announced August 2019.

Comments: Accepted in ICANN'19 Workshop on Reservoir Computing

MSC Class: 68T05 (Primary) 37M10; 15A06 (Secondary)

ACM Class: I.2.6

Journal ref: Artificial Neural Networks and Machine Learning - ICANN 2019: Workshop and Special Sessions. ICANN 2019., pp. 121-133

Showing 1–7 of 7 results for author: Uselis, A