-
Object-level Visual Prompts for Compositional Image Generation
Authors:
Gaurav Parmar,
Or Patashnik,
Kuan-Chieh Wang,
Daniil Ostashev,
Srinivasa Narasimhan,
Jun-Yan Zhu,
Daniel Cohen-Or,
Kfir Aberman
Abstract:
We introduce a method for composing object-level visual prompts within a text-to-image diffusion model. Our approach addresses the task of generating semantically coherent compositions across diverse scenes and styles, similar to the versatility and expressiveness offered by text prompts. A key challenge in this task is to preserve the identity of the objects depicted in the input visual prompts, while also generating diverse compositions across different images. To address this challenge, we introduce a new KV-mixed cross-attention mechanism, in which keys and values are learned from distinct visual representations. The keys are derived from an encoder with a small bottleneck for layout control, whereas the values come from a larger bottleneck encoder that captures fine-grained appearance details. By mixing keys and values from these complementary sources, our model preserves the identity of the visual prompts while supporting flexible variations in object arrangement, pose, and composition. During inference, we further propose object-level compositional guidance to improve the method's identity preservation and layout correctness. Results show that our technique produces diverse scene compositions that preserve the unique characteristics of each visual prompt, expanding the creative potential of text-to-image generation.
Submitted 2 January, 2025;
originally announced January 2025.
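The KV-mixed cross-attention idea in the abstract above can be illustrated with a minimal sketch. It assumes ordinary scaled dot-product attention in which the keys are projected from one feature set (standing in for the small-bottleneck encoder) and the values from another (standing in for the large-bottleneck encoder). The function name `kv_mixed_cross_attention`, all dimensions, and the random features are hypothetical and are not taken from the paper.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def kv_mixed_cross_attention(queries, key_feats, value_feats, w_q, w_k, w_v):
    """Cross-attention whose keys and values come from different feature sets.

    key_feats and value_feats must describe the same tokens (same row count),
    but their feature widths may differ.
    """
    q = matmul(queries, w_q)        # (n_q, d)
    k = matmul(key_feats, w_k)      # (n_tok, d)
    v = matmul(value_feats, w_v)    # (n_tok, d_out)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)         # (n_q, n_tok)
    attn = [softmax([s / scale for s in row]) for row in scores]
    return matmul(attn, v), attn

def rand_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned weights or encoder outputs."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n_q, n_tok = 6, 4                          # image queries, visual-prompt tokens
d_img, d_small, d_large, d = 8, 2, 16, 4   # assumed feature widths

queries = rand_matrix(n_q, d_img, rng)
key_feats = rand_matrix(n_tok, d_small, rng)    # small bottleneck: coarse layout cue
value_feats = rand_matrix(n_tok, d_large, rng)  # large bottleneck: appearance detail
out, attn = kv_mixed_cross_attention(
    queries, key_feats, value_feats,
    rand_matrix(d_img, d, rng), rand_matrix(d_small, d, rng),
    rand_matrix(d_large, d_img, rng),
)
```

In this sketch the attention weights depend only on the low-dimensional key features, while the content that gets mixed into the output comes from the richer value features.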
-
On the Content Bias in Fréchet Video Distance
Authors:
Songwei Ge,
Aniruddha Mahapatra,
Gaurav Parmar,
Jun-Yan Zhu,
Jia-Bin Huang
Abstract:
Fréchet Video Distance (FVD), a prominent metric for evaluating video generation models, is known to occasionally conflict with human perception. In this paper, we aim to explore the extent of FVD's bias toward per-frame quality over temporal realism and identify its sources. We first quantify FVD's sensitivity to the temporal axis by decoupling the frame and motion quality and find that FVD increases only slightly with large temporal corruption. We then analyze the generated videos and show that, via careful sampling from a large set of generated videos that do not contain motion, one can drastically decrease FVD without improving the temporal quality. Both studies suggest FVD's bias toward the quality of individual frames. We further observe that the bias can be attributed to the features extracted from a supervised video classifier trained on a content-biased dataset. We show that FVD with features extracted from recent large-scale self-supervised video models is less biased toward image quality. Finally, we revisit a few real-world examples to validate our hypothesis.
Submitted 18 April, 2024;
originally announced April 2024.
-
One-Step Image Translation with Text-to-Image Models
Authors:
Gaurav Parmar,
Taesung Park,
Srinivasa Narasimhan,
Jun-Yan Zhu
Abstract:
In this work, we address two limitations of existing conditional diffusion models: their slow inference speed due to the iterative denoising process and their reliance on paired data for model fine-tuning. To tackle these issues, we introduce a general method for adapting a single-step diffusion model to new tasks and domains through adversarial learning objectives. Specifically, we consolidate various modules of the vanilla latent diffusion model into a single end-to-end generator network with small trainable weights, enhancing its ability to preserve the input image structure while reducing overfitting. We demonstrate that, for unpaired settings, our model CycleGAN-Turbo outperforms existing GAN-based and diffusion-based methods for various scene translation tasks, such as day-to-night conversion and adding/removing weather effects like fog, snow, and rain. We extend our method to paired settings, where our model pix2pix-Turbo is on par with recent works like Control-Net for Sketch2Photo and Edge2Image, but with a single-step inference. This work suggests that single-step diffusion models can serve as strong backbones for a range of GAN learning objectives. Our code and models are available at https://github.com/GaParmar/img2img-turbo.
Submitted 18 March, 2024;
originally announced March 2024.
-
CoFRIDA: Self-Supervised Fine-Tuning for Human-Robot Co-Painting
Authors:
Peter Schaldenbrand,
Gaurav Parmar,
Jun-Yan Zhu,
James McCann,
Jean Oh
Abstract:
Prior robot painting and drawing work, such as FRIDA, has focused on decreasing the sim-to-real gap and expanding input modalities for users, but the interaction with these systems generally exists only in the input stages. To support interactive, human-robot collaborative painting, we introduce the Collaborative FRIDA (CoFRIDA) robot painting framework, which can co-paint by modifying and engaging with content already painted by a human collaborator. To improve text-image alignment, FRIDA's major weakness, our system uses pre-trained text-to-image models; however, pre-trained models in the context of real-world co-painting do not perform well because they (1) do not understand the constraints and abilities of the robot and (2) cannot perform co-painting without making unrealistic edits to the canvas and overwriting content. We propose a self-supervised fine-tuning procedure that can tackle both issues, allowing the use of pre-trained state-of-the-art text-image alignment models with robots to enable co-painting in the physical world. Our open-source approach, CoFRIDA, creates paintings and drawings that match the input text prompt more clearly than FRIDA, both from a blank canvas and from one with human-created work. More generally, our fine-tuning procedure successfully encodes the robot's constraints and abilities into a foundation model, showcasing promising results as an effective method for reducing sim-to-real gaps.
Submitted 20 February, 2024;
originally announced February 2024.
-
Zero-shot Image-to-Image Translation
Authors:
Gaurav Parmar,
Krishna Kumar Singh,
Richard Zhang,
Yijun Li,
Jingwan Lu,
Jun-Yan Zhu
Abstract:
Large-scale text-to-image generative models have shown their remarkable ability to synthesize diverse and high-quality images. However, it is still challenging to directly apply these models for editing real images for two reasons. First, it is hard for users to come up with a perfect text prompt that accurately describes every visual detail in the input image. Second, while existing models can introduce desirable changes in certain regions, they often dramatically alter the input content and introduce unexpected changes in unwanted regions. In this work, we propose pix2pix-zero, an image-to-image translation method that can preserve the content of the original image without manual prompting. We first automatically discover editing directions that reflect desired edits in the text embedding space. To preserve the general content structure after editing, we further propose cross-attention guidance, which aims to retain the cross-attention maps of the input image throughout the diffusion process. In addition, our method does not need additional training for these edits and can directly use the existing pre-trained text-to-image diffusion model. We conduct extensive experiments and show that our method outperforms existing and concurrent works for both real and synthetic image editing.
Submitted 6 February, 2023;
originally announced February 2023.
-
Spatially-Adaptive Multilayer Selection for GAN Inversion and Editing
Authors:
Gaurav Parmar,
Yijun Li,
Jingwan Lu,
Richard Zhang,
Jun-Yan Zhu,
Krishna Kumar Singh
Abstract:
Existing GAN inversion and editing methods work well for aligned objects with a clean background, such as portraits and animal faces, but often struggle for more difficult categories with complex scene layouts and object occlusions, such as cars, animals, and outdoor images. We propose a new method to invert and edit such complex images in the latent space of GANs, such as StyleGAN2. Our key idea is to explore inversion with a collection of layers, spatially adapting the inversion process to the difficulty of the image. We learn to predict the "invertibility" of different image segments and project each segment into a latent layer. Easier regions can be inverted into an earlier layer in the generator's latent space, while more challenging regions can be inverted into a later feature space. Experiments show that our method obtains better inversion results compared to the recent approaches on complex categories, while maintaining downstream editability. Please refer to our project page at https://www.cs.cmu.edu/~SAMInversion.
Submitted 16 June, 2022;
originally announced June 2022.
-
On Aliased Resizing and Surprising Subtleties in GAN Evaluation
Authors:
Gaurav Parmar,
Richard Zhang,
Jun-Yan Zhu
Abstract:
Metrics for evaluating generative models aim to measure the discrepancy between real and generated images. The often-used Fréchet Inception Distance (FID) metric, for example, extracts "high-level" features using a deep network from the two sets. However, we find that the differences in "low-level" preprocessing, specifically image resizing and compression, can induce large variations and have unforeseen consequences. For instance, when resizing an image, e.g., with a bilinear or bicubic kernel, signal processing principles mandate adjusting prefilter width depending on the downsampling factor, to antialias to the appropriate bandwidth. However, commonly used implementations use a fixed-width prefilter, resulting in aliasing artifacts. Such aliasing leads to corruptions in the feature extraction downstream. Next, lossy compression, such as JPEG, is commonly used to reduce the file size of an image. Although designed to minimally degrade the perceptual quality of an image, the operation also produces variations downstream. Furthermore, we show that if compression is used on real training images, FID can actually improve if the generated images are also subsequently compressed. This paper shows that choices in low-level image processing have been an underappreciated aspect of generative modeling. We identify and characterize variations in generative modeling development pipelines, provide recommendations based on signal processing principles, and release a reference implementation to facilitate future comparisons.
Submitted 20 January, 2022; v1 submitted 22 April, 2021;
originally announced April 2021.
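The resizing point in the abstract above (the prefilter width should scale with the downsampling factor) can be illustrated with a small 1D sketch. It assumes a plain box average as the prefilter, which is only a stand-in for the bilinear or bicubic kernels the abstract mentions and is not the paper's reference implementation. The signal length, frequency, and function names are hypothetical choices for illustration.

```python
import math

def subsample(signal, factor):
    """Keep every `factor`-th sample with no prefilter (aliasing-prone)."""
    return signal[::factor]

def box_downsample(signal, factor):
    """Average each block of `factor` samples before keeping one value.

    This is a box prefilter whose width grows with the downsampling factor.
    """
    return [sum(signal[i:i + factor]) / factor
            for i in range(0, len(signal) - factor + 1, factor)]

# A cosine with 28 cycles over 64 samples: representable at the original
# rate (Nyquist limit 32 cycles) but not after 4x downsampling (limit 8).
n, cycles, factor = 64, 28, 4
signal = [math.cos(2 * math.pi * cycles * t / n) for t in range(n)]

naive = subsample(signal, factor)
filtered = box_downsample(signal, factor)

# Peak magnitude of each 16-sample result.
naive_peak = max(abs(v) for v in naive)
filtered_peak = max(abs(v) for v in filtered)
print(f"no prefilter: {naive_peak:.3f}, box prefilter: {filtered_peak:.3f}")
```

Without a prefilter, the unrepresentable frequency folds back into the output at full amplitude as a spurious low-frequency pattern; the factor-wide box average suppresses most of it.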
-
Dual Contradistinctive Generative Autoencoder
Authors:
Gaurav Parmar,
Dacheng Li,
Kwonjoon Lee,
Zhuowen Tu
Abstract:
We present a new generative autoencoder model with dual contradistinctive losses to improve generative autoencoders that perform simultaneous inference (reconstruction) and synthesis (sampling). Our model, named dual contradistinctive generative autoencoder (DC-VAE), integrates an instance-level discriminative loss (maintaining the instance-level fidelity for the reconstruction/synthesis) with a set-level adversarial loss (encouraging the set-level fidelity for the reconstruction/synthesis), both being contradistinctive. Extensive experimental results by DC-VAE across different resolutions including 32x32, 64x64, 128x128, and 512x512 are reported. The two contradistinctive losses in VAE work harmoniously in DC-VAE, leading to a significant qualitative and quantitative performance enhancement over the baseline VAEs without architectural changes. State-of-the-art or competitive results among generative autoencoders for image reconstruction, image synthesis, image interpolation, and representation learning are observed. DC-VAE is a general-purpose VAE model, applicable to a wide variety of downstream tasks in computer vision and machine learning.
Submitted 19 November, 2020;
originally announced November 2020.
-
Guided Variational Autoencoder for Disentanglement Learning
Authors:
Zheng Ding,
Yifan Xu,
Weijian Xu,
Gaurav Parmar,
Yang Yang,
Max Welling,
Zhuowen Tu
Abstract:
We propose an algorithm, guided variational autoencoder (Guided-VAE), that is able to learn a controllable generative model by performing latent representation disentanglement learning. The learning objective is achieved by providing signals to the latent encoding/embedding in VAE without changing its main backbone architecture, hence retaining the desirable properties of the VAE. We design an unsupervised strategy and a supervised strategy in Guided-VAE and observe enhanced modeling and controlling capability over the vanilla VAE. In the unsupervised strategy, we guide the VAE learning by introducing a lightweight decoder that learns latent geometric transformation and principal components; in the supervised strategy, we use an adversarial excitation and inhibition mechanism to encourage the disentanglement of the latent variables. Guided-VAE is transparent and simple, both for the general representation learning task and for disentanglement learning. In a number of experiments for representation learning, improved synthesis/sampling, better disentanglement for classification, and reduced classification errors in meta-learning have been observed.
Submitted 2 April, 2020;
originally announced April 2020.
-
Dispersion-managed soliton fiber laser with random dispersion, multiphoton absorption and gain dispersion
Authors:
Gurkirpal Singh Parmar,
Rajib Pradhan,
B. A. Malomed,
Soumendu Jana
Abstract:
We address the generation and interaction of dispersion-managed dissipative solitons (DMDS) in a model of fiber lasers with the cubic-quintic nonlinearity, multiphoton absorption and gain dispersion. Both anomalous and normal segments of the dispersion map include random dispersion fluctuations. Effects of the gain dispersion, higher-order nonlinearity and randomness on the generation of DMDS are demonstrated. The solitons exhibit breather-like evolution, and are found to be robust, up to a certain critical level of the random-dispersion component, which is sufficiently high. The roles of multiphoton absorption, gain dispersion and nonlinearity on the DMDS are also identified in the absence of randomness. Pairwise interactions of solitons lead, most typically, to their merger, with breaking of the left-right symmetry. The outcome of the collisions is more sensitive to the initial temporal separation between the solitons than to their phase difference.
Submitted 6 September, 2018;
originally announced September 2018.
-
Dissipative Soliton Fiber Lasers with Higher-Order Nonlinearity, Multiphoton Absorption and Emission, and Random Dispersion
Authors:
Gurkirpal Singh Parmar,
Soumendu Jana,
Boris A. Malomed
Abstract:
We study the generation of dissipative solitons (DSs) in the model of the fiber-laser cavities under the combined action of cubic-quintic nonlinearity, multiphoton absorption and/or multiphoton emission (nonlinear gain) and gain dispersion. A random component of the group-velocity dispersion (GVD) is included too. The DS creation and propagation are studied by means of a variational approximation and direct simulations, which are found to be in reasonable agreement. With a proper choice of the gain, robust DS operation regimes are predicted for different combinations of multiphoton absorption and emission, in spite of the presence of the perturbation in the form of the random GVD. Importantly, the zero background around the solitons remains stable in the presence of the (necessary) linear gain. The solitons are stable too against a certain (realistic) level of noise. Another essential finding is that the quintic gain in the form of three-photon emission (3PE) offers an alternative mechanism for supporting stable solitons, provided that it is not too strong. The DSs coexist in low- and high-amplitude forms, for a given value of their width. The low-amplitude DS is stable, while its high-amplitude counterpart is subject to the blowup instability, in the presence of the 3PE. Interactions between DSs show various scenarios of the creation of breather states through merger of the two solitons.
Submitted 7 March, 2017;
originally announced March 2017.