-
Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis
Authors:
Anton Voronov,
Denis Kuznedelev,
Mikhail Khoroshikh,
Valentin Khrulkov,
Dmitry Baranchuk
Abstract:
This work presents Switti, a scale-wise transformer for text-to-image generation. We start by adapting an existing next-scale prediction autoregressive (AR) architecture to T2I generation, investigating and mitigating training stability issues in the process. Next, we argue that scale-wise transformers do not require causality and propose a non-causal counterpart facilitating ~21% faster sampling…
▽ More
This work presents Switti, a scale-wise transformer for text-to-image generation. We start by adapting an existing next-scale prediction autoregressive (AR) architecture to T2I generation, investigating and mitigating training stability issues in the process. Next, we argue that scale-wise transformers do not require causality and propose a non-causal counterpart facilitating ~21% faster sampling and lower memory usage while also achieving slightly better generation quality. Furthermore, we reveal that classifier-free guidance at high-resolution scales is often unnecessary and can even degrade performance. By disabling guidance at these scales, we achieve an additional sampling acceleration of ~32% and improve the generation of fine-grained details. Extensive human preference studies and automated evaluations show that Switti outperforms existing T2I AR models and competes with state-of-the-art T2I diffusion models while being up to 7x faster.
△ Less
Submitted 20 March, 2025; v1 submitted 2 December, 2024;
originally announced December 2024.
-
Accurate Compression of Text-to-Image Diffusion Models via Vector Quantization
Authors:
Vage Egiazarian,
Denis Kuznedelev,
Anton Voronov,
Ruslan Svirschevski,
Michael Goin,
Daniil Pavlov,
Dan Alistarh,
Dmitry Baranchuk
Abstract:
Text-to-image diffusion models have emerged as a powerful framework for high-quality image generation given textual prompts. Their success has driven the rapid development of production-grade diffusion models that consistently increase in size and already contain billions of parameters. As a result, state-of-the-art text-to-image models are becoming less accessible in practice, especially in resou…
▽ More
Text-to-image diffusion models have emerged as a powerful framework for high-quality image generation given textual prompts. Their success has driven the rapid development of production-grade diffusion models that consistently increase in size and already contain billions of parameters. As a result, state-of-the-art text-to-image models are becoming less accessible in practice, especially in resource-limited environments. Post-training quantization (PTQ) tackles this issue by compressing the pretrained model weights into lower-bit representations. Recent diffusion quantization techniques primarily rely on uniform scalar quantization, providing decent performance for the models compressed to 4 bits. This work demonstrates that more versatile vector quantization (VQ) may achieve higher compression rates for large-scale text-to-image diffusion models. Specifically, we tailor vector-based PTQ methods to recent billion-scale text-to-image models (SDXL and SDXL-Turbo), and show that the diffusion models of 2B+ parameters compressed to around 3 bits using VQ exhibit the similar image quality and textual alignment as previous 4-bit compression techniques.
△ Less
Submitted 31 August, 2024;
originally announced September 2024.
-
Mind Your Format: Towards Consistent Evaluation of In-Context Learning Improvements
Authors:
Anton Voronov,
Lena Wolf,
Max Ryabinin
Abstract:
Large language models demonstrate a remarkable capability for learning to solve new tasks from a few examples. The prompt template, or the way the input examples are formatted to obtain the prompt, is an important yet often overlooked aspect of in-context learning. In this work, we conduct a comprehensive study of the template format's influence on the in-context learning performance. We evaluate…
▽ More
Large language models demonstrate a remarkable capability for learning to solve new tasks from a few examples. The prompt template, or the way the input examples are formatted to obtain the prompt, is an important yet often overlooked aspect of in-context learning. In this work, we conduct a comprehensive study of the template format's influence on the in-context learning performance. We evaluate the impact of the prompt template across 21 models (from 770M to 70B parameters) and 4 standard classification datasets. We show that a poor choice of the template can reduce the performance of the strongest models and inference methods to a random guess level. More importantly, the best templates do not transfer between different setups and even between models of the same family. Our findings show that the currently prevalent approach to evaluation, which ignores template selection, may give misleading results due to different templates in different works. As a first step towards mitigating this issue, we propose Template Ensembles that aggregate model predictions across several templates. This simple test-time augmentation boosts average performance while being robust to the choice of random set of templates.
△ Less
Submitted 6 June, 2024; v1 submitted 12 January, 2024;
originally announced January 2024.
-
Is This Loss Informative? Faster Text-to-Image Customization by Tracking Objective Dynamics
Authors:
Anton Voronov,
Mikhail Khoroshikh,
Artem Babenko,
Max Ryabinin
Abstract:
Text-to-image generation models represent the next step of evolution in image synthesis, offering a natural way to achieve flexible yet fine-grained control over the result. One emerging area of research is the fast adaptation of large text-to-image models to smaller datasets or new visual concepts. However, many efficient methods of adaptation have a long training time, which limits their practic…
▽ More
Text-to-image generation models represent the next step of evolution in image synthesis, offering a natural way to achieve flexible yet fine-grained control over the result. One emerging area of research is the fast adaptation of large text-to-image models to smaller datasets or new visual concepts. However, many efficient methods of adaptation have a long training time, which limits their practical applications, slows down experiments, and spends excessive GPU resources. In this work, we study the training dynamics of popular text-to-image personalization methods (such as Textual Inversion or DreamBooth), aiming to speed them up. We observe that most concepts are learned at early stages and do not improve in quality later, but standard training convergence metrics fail to indicate that. Instead, we propose a simple drop-in early stopping criterion that only requires computing the regular training objective on a fixed set of inputs for all training iterations. Our experiments on Stable Diffusion for 48 different concepts and three personalization methods demonstrate the competitive performance of our approach, which makes adaptation up to 8 times faster with no significant drops in quality.
△ Less
Submitted 1 November, 2023; v1 submitted 9 February, 2023;
originally announced February 2023.
-
Learning to Solve Voxel Building Embodied Tasks from Pixels and Natural Language Instructions
Authors:
Alexey Skrynnik,
Zoya Volovikova,
Marc-Alexandre Côté,
Anton Voronov,
Artem Zholus,
Negar Arabzadeh,
Shrestha Mohanty,
Milagro Teruel,
Ahmed Awadallah,
Aleksandr Panov,
Mikhail Burtsev,
Julia Kiseleva
Abstract:
The adoption of pre-trained language models to generate action plans for embodied agents is a promising research strategy. However, execution of instructions in real or simulated environments requires verification of the feasibility of actions as well as their relevance to the completion of a goal. We propose a new method that combines a language model and reinforcement learning for the task of bu…
▽ More
The adoption of pre-trained language models to generate action plans for embodied agents is a promising research strategy. However, execution of instructions in real or simulated environments requires verification of the feasibility of actions as well as their relevance to the completion of a goal. We propose a new method that combines a language model and reinforcement learning for the task of building objects in a Minecraft-like environment according to the natural language instructions. Our method first generates a set of consistently achievable sub-goals from the instructions and then completes associated sub-tasks with a pre-trained RL policy. The proposed method formed the RL baseline at the IGLU 2022 competition.
△ Less
Submitted 1 November, 2022;
originally announced November 2022.
-
Many Heads but One Brain: Fusion Brain -- a Competition and a Single Multimodal Multitask Architecture
Authors:
Daria Bakshandaeva,
Denis Dimitrov,
Vladimir Arkhipkin,
Alex Shonenkov,
Mark Potanin,
Denis Karachev,
Andrey Kuznetsov,
Anton Voronov,
Vera Davydova,
Elena Tutubalina,
Aleksandr Petiushko
Abstract:
Supporting the current trend in the AI community, we present the AI Journey 2021 Challenge called Fusion Brain, the first competition which is targeted to make the universal architecture which could process different modalities (in this case, images, texts, and code) and solve multiple tasks for vision and language. The Fusion Brain Challenge combines the following specific tasks: Code2code Transl…
▽ More
Supporting the current trend in the AI community, we present the AI Journey 2021 Challenge called Fusion Brain, the first competition which is targeted to make the universal architecture which could process different modalities (in this case, images, texts, and code) and solve multiple tasks for vision and language. The Fusion Brain Challenge combines the following specific tasks: Code2code Translation, Handwritten Text recognition, Zero-shot Object Detection, and Visual Question Answering. We have created datasets for each task to test the participants' submissions on it. Moreover, we have collected and made publicly available a new handwritten dataset in both English and Russian, which consists of 94,128 pairs of images and texts. We also propose a multimodal and multitask architecture - a baseline solution, in the center of which is a frozen foundation model and which has been trained in Fusion mode along with Single-task mode. The proposed Fusion approach proves to be competitive and more energy-efficient compared to the task-specific one.
△ Less
Submitted 28 December, 2022; v1 submitted 21 November, 2021;
originally announced November 2021.
-
Text Detoxification using Large Pre-trained Neural Models
Authors:
David Dale,
Anton Voronov,
Daryna Dementieva,
Varvara Logacheva,
Olga Kozlova,
Nikita Semenov,
Alexander Panchenko
Abstract:
We present two novel unsupervised methods for eliminating toxicity in text. Our first method combines two recent ideas: (1) guidance of the generation process with small style-conditional language models and (2) use of paraphrasing models to perform style transfer. We use a well-performing paraphraser guided by style-trained language models to keep the text content and remove toxicity. Our second…
▽ More
We present two novel unsupervised methods for eliminating toxicity in text. Our first method combines two recent ideas: (1) guidance of the generation process with small style-conditional language models and (2) use of paraphrasing models to perform style transfer. We use a well-performing paraphraser guided by style-trained language models to keep the text content and remove toxicity. Our second method uses BERT to replace toxic words with their non-offensive synonyms. We make the method more flexible by enabling BERT to replace mask tokens with a variable number of words. Finally, we present the first large-scale comparative study of style transfer models on the task of toxicity removal. We compare our models with a number of methods for style transfer. The models are evaluated in a reference-free way using a combination of unsupervised style transfer metrics. Both methods we suggest yield new SOTA results.
△ Less
Submitted 3 November, 2021; v1 submitted 18 September, 2021;
originally announced September 2021.
-
One-Point Feedback for Composite Optimization with Applications to Distributed and Federated Learning
Authors:
Aleksandr Beznosikov,
Ivan Stepanov,
Artyom Voronov,
Alexander Gasnikov
Abstract:
This work is devoted to solving the composite optimization problem with the mixture oracle: for the smooth part of the problem, we have access to the gradient, and for the non-smooth part, only the one-point zero-order oracle is available. For such a setup, we present a new method based on the sliding algorithm. Our method allows to separate the oracle complexities and to compute the gradient for…
▽ More
This work is devoted to solving the composite optimization problem with the mixture oracle: for the smooth part of the problem, we have access to the gradient, and for the non-smooth part, only the one-point zero-order oracle is available. For such a setup, we present a new method based on the sliding algorithm. Our method allows to separate the oracle complexities and to compute the gradient for one of the functions as rarely as possible. The paper also presents the applicability of our new method to the problems of distributed optimization and federated learning. Experimental results confirm the theory.
△ Less
Submitted 10 May, 2025; v1 submitted 13 July, 2021;
originally announced July 2021.