-
Edify Image: High-Quality Image Generation with Pixel Space Laplacian Diffusion Models
Authors:
NVIDIA,
Yuval Atzmon,
Maciej Bala,
Yogesh Balaji,
Tiffany Cai,
Yin Cui,
Jiaojiao Fan,
Yunhao Ge,
Siddharth Gururani,
Jacob Huffman,
Ronald Isaac,
Pooya Jannaty,
Tero Karras,
Grace Lam,
J. P. Lewis,
Aaron Licata,
Yen-Chen Lin,
Ming-Yu Liu,
Qianli Ma,
Arun Mallya,
Ashlee Martino-Tarr,
Doug Mendez,
Seungjun Nah,
Chris Pruett,
et al. (7 additional authors not shown)
Abstract:
We introduce Edify Image, a family of diffusion models capable of generating photorealistic image content with pixel-perfect accuracy. Edify Image utilizes cascaded pixel-space diffusion models trained using a novel Laplacian diffusion process, in which image signals at different frequency bands are attenuated at varying rates. Edify Image supports a wide range of applications, including text-to-image synthesis, 4K upsampling, ControlNets, 360 HDR panorama generation, and finetuning for image customization.
Submitted 11 November, 2024;
originally announced November 2024.
-
Guiding a Diffusion Model with a Bad Version of Itself
Authors:
Tero Karras,
Miika Aittala,
Tuomas Kynkäänniemi,
Jaakko Lehtinen,
Timo Aila,
Samuli Laine
Abstract:
The primary axes of interest in image-generating diffusion models are image quality, the amount of variation in the results, and how well the results align with a given condition, e.g., a class label or a text prompt. The popular classifier-free guidance approach uses an unconditional model to guide a conditional model, leading to simultaneously better prompt alignment and higher-quality images at the cost of reduced variation. These effects seem inherently entangled, and thus hard to control. We make the surprising observation that it is possible to obtain disentangled control over image quality without compromising the amount of variation by guiding generation using a smaller, less-trained version of the model itself rather than an unconditional model. This leads to significant improvements in ImageNet generation, setting record FIDs of 1.01 for 64x64 and 1.25 for 512x512, using publicly available networks. Furthermore, the method is also applicable to unconditional diffusion models, drastically improving their quality.
Submitted 19 December, 2024; v1 submitted 4 June, 2024;
originally announced June 2024.
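The guidance idea described in this abstract can be sketched as a simple extrapolation between two model predictions. This is a minimal illustration, not the authors' code: the function name `guide`, the list-based denoiser outputs, and the weight `w` are all assumptions. With classifier-free guidance the guiding prediction would come from an unconditional model; the abstract proposes using a smaller, less-trained version of the model itself instead.

```python
def guide(main_pred, guiding_pred, w):
    """Extrapolate away from the guiding prediction toward the main one.

    w = 1 returns the main prediction unchanged; w > 1 pushes the result
    further away from the guiding model's output.
    """
    return [g + w * (m - g) for m, g in zip(main_pred, guiding_pred)]


# Hypothetical denoiser outputs for a two-pixel "image".
main_pred = [0.8, 0.2]     # main model
guiding_pred = [0.5, 0.4]  # unconditional model (CFG) or weaker model (autoguidance)

print(guide(main_pred, guiding_pred, 1.0))  # no guidance
print(guide(main_pred, guiding_pred, 2.0))  # guided
```

The arithmetic is the same in both schemes; only the source of `guiding_pred` changes.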
-
Applying Guidance in a Limited Interval Improves Sample and Distribution Quality in Diffusion Models
Authors:
Tuomas Kynkäänniemi,
Miika Aittala,
Tero Karras,
Samuli Laine,
Timo Aila,
Jaakko Lehtinen
Abstract:
Guidance is a crucial technique for extracting the best performance out of image-generating diffusion models. Traditionally, a constant guidance weight has been applied throughout the sampling chain of an image. We show that guidance is clearly harmful toward the beginning of the chain (high noise levels), largely unnecessary toward the end (low noise levels), and only beneficial in the middle. We thus restrict it to a specific range of noise levels, improving both the inference speed and result quality. This limited guidance interval improves the record FID in ImageNet-512 significantly, from 1.81 to 1.40. We show that it is quantitatively and qualitatively beneficial across different sampler parameters, network architectures, and datasets, including the large-scale setting of Stable Diffusion XL. We thus suggest exposing the guidance interval as a hyperparameter in all diffusion models that use guidance.
Submitted 6 November, 2024; v1 submitted 11 April, 2024;
originally announced April 2024.
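Restricting guidance to a range of noise levels, as this abstract proposes, might look like the sketch below. The function names, the closed interval `[sigma_lo, sigma_hi]`, and the numeric values are assumptions chosen for illustration; the abstract does not specify them.

```python
def guidance_weight(sigma, w, sigma_lo, sigma_hi):
    """Use weight w only for noise levels inside [sigma_lo, sigma_hi]."""
    return w if sigma_lo <= sigma <= sigma_hi else 1.0


def guided_prediction(cond_pred, uncond_pred, sigma, w, sigma_lo, sigma_hi):
    """Blend conditional and unconditional predictions at noise level sigma."""
    weight = guidance_weight(sigma, w, sigma_lo, sigma_hi)
    if weight == 1.0:
        # Outside the interval the conditional prediction is used as-is.
        return list(cond_pred)
    return [u + weight * (c - u) for c, u in zip(cond_pred, uncond_pred)]


# Hypothetical values: guidance weight 2.0, active for 0.3 <= sigma <= 3.0.
for sigma in (10.0, 1.0, 0.1):
    print(sigma, guided_prediction([0.8, 0.2], [0.5, 0.4], sigma, 2.0, 0.3, 3.0))
```

In a real sampler the unconditional network would not need to be evaluated outside the interval, which is presumably where the reported speed improvement comes from.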
-
Analyzing and Improving the Training Dynamics of Diffusion Models
Authors:
Tero Karras,
Miika Aittala,
Jaakko Lehtinen,
Janne Hellsten,
Timo Aila,
Samuli Laine
Abstract:
Diffusion models currently dominate the field of data-driven image synthesis with their unparalleled scaling to large datasets. In this paper, we identify and rectify several causes for uneven and ineffective training in the popular ADM diffusion model architecture, without altering its high-level structure. Observing uncontrolled magnitude changes and imbalances in both the network activations and weights over the course of training, we redesign the network layers to preserve activation, weight, and update magnitudes on expectation. We find that systematic application of this philosophy eliminates the observed drifts and imbalances, resulting in considerably better networks at equal computational complexity. Our modifications improve the previous record FID of 2.41 in ImageNet-512 synthesis to 1.81, achieved using fast deterministic sampling.
As an independent contribution, we present a method for setting the exponential moving average (EMA) parameters post-hoc, i.e., after completing the training run. This allows precise tuning of EMA length without the cost of performing several training runs, and reveals its surprising interactions with network architecture, training time, and guidance.
Submitted 20 March, 2024; v1 submitted 5 December, 2023;
originally announced December 2023.
-
Generative Novel View Synthesis with 3D-Aware Diffusion Models
Authors:
Eric R. Chan,
Koki Nagano,
Matthew A. Chan,
Alexander W. Bergman,
Jeong Joon Park,
Axel Levy,
Miika Aittala,
Shalini De Mello,
Tero Karras,
Gordon Wetzstein
Abstract:
We present a diffusion-based model for 3D-aware generative novel view synthesis from as few as a single input image. Our model samples from the distribution of possible renderings consistent with the input and, even in the presence of ambiguity, is capable of rendering diverse and plausible novel views. To achieve this, our method makes use of existing 2D diffusion backbones but, crucially, incorporates geometry priors in the form of a 3D feature volume. This latent feature field captures the distribution over possible scene representations and improves our method's ability to generate view-consistent novel renderings. In addition to generating novel views, our method has the ability to autoregressively synthesize 3D-consistent sequences. We demonstrate state-of-the-art results on synthetic renderings and room-scale scenes; we also show compelling results for challenging, real-world objects.
Submitted 5 April, 2023;
originally announced April 2023.
-
StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis
Authors:
Axel Sauer,
Tero Karras,
Samuli Laine,
Andreas Geiger,
Timo Aila
Abstract:
Text-to-image synthesis has recently seen significant progress thanks to large pretrained language models, large-scale training data, and the introduction of scalable model families such as diffusion and autoregressive models. However, the best-performing models require iterative evaluation to generate a single sample. In contrast, generative adversarial networks (GANs) only need a single forward pass. They are thus much faster, but they currently remain far behind the state-of-the-art in large-scale text-to-image synthesis. This paper aims to identify the necessary steps to regain competitiveness. Our proposed model, StyleGAN-T, addresses the specific requirements of large-scale text-to-image synthesis, such as large capacity, stable training on diverse datasets, strong text alignment, and controllable variation vs. text alignment tradeoff. StyleGAN-T significantly improves over previous GANs and outperforms distilled diffusion models - the previous state-of-the-art in fast text-to-image synthesis - in terms of sample quality and speed.
Submitted 23 January, 2023;
originally announced January 2023.
-
Simulator-Based Self-Supervision for Learned 3D Tomography Reconstruction
Authors:
Onni Kosomaa,
Samuli Laine,
Tero Karras,
Miika Aittala,
Jaakko Lehtinen
Abstract:
We propose a deep learning method for 3D volumetric reconstruction in low-dose helical cone-beam computed tomography. Prior machine learning approaches require reference reconstructions computed by another algorithm for training. In contrast, we train our model in a fully self-supervised manner using only noisy 2D X-ray data. This is enabled by incorporating a fast differentiable CT simulator in the training loop. As we do not rely on reference reconstructions, the fidelity of our results is not limited by their potential shortcomings. We evaluate our method on real helical cone-beam projections and simulated phantoms. Our results show significantly higher visual fidelity and better PSNR over techniques that rely on existing reconstructions. When applied to full-dose data, our method produces high-quality results orders of magnitude faster than iterative techniques.
Submitted 26 May, 2023; v1 submitted 14 December, 2022;
originally announced December 2022.
-
eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers
Authors:
Yogesh Balaji,
Seungjun Nah,
Xun Huang,
Arash Vahdat,
Jiaming Song,
Qinsheng Zhang,
Karsten Kreis,
Miika Aittala,
Timo Aila,
Samuli Laine,
Bryan Catanzaro,
Tero Karras,
Ming-Yu Liu
Abstract:
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis. Starting from random noise, such text-to-image diffusion models gradually synthesize images in an iterative fashion while conditioning on text prompts. We find that their synthesis behavior qualitatively changes throughout this process: Early in sampling, generation strongly relies on the text prompt to generate text-aligned content, while later, the text conditioning is almost entirely ignored. This suggests that sharing model parameters throughout the entire generation process may not be ideal. Therefore, in contrast to existing works, we propose to train an ensemble of text-to-image diffusion models specialized for different synthesis stages. To maintain training efficiency, we initially train a single model, which is then split into specialized models that are trained for the specific stages of the iterative generation process. Our ensemble of diffusion models, called eDiff-I, results in improved text alignment while maintaining the same inference computation cost and preserving high visual quality, outperforming previous large-scale text-to-image diffusion models on the standard benchmark. In addition, we train our model to exploit a variety of embeddings for conditioning, including the T5 text, CLIP text, and CLIP image embeddings. We show that these different embeddings lead to different behaviors. Notably, the CLIP image embedding allows an intuitive way of transferring the style of a reference image to the target text-to-image output. Lastly, we show a technique that enables eDiff-I's "paint-with-words" capability. A user can select the word in the input text and paint it in a canvas to control the output, which is very handy for crafting the desired image in mind. The project page is available at https://deepimagination.cc/eDiff-I/
Submitted 13 March, 2023; v1 submitted 2 November, 2022;
originally announced November 2022.
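The ensemble idea in this abstract, where different denoisers handle different stages of sampling, can be sketched as a lookup by noise level. The function `select_expert`, the three expert names, and the noise-level boundaries are hypothetical; the abstract does not say how many experts are used or where the stages are split.

```python
def select_expert(sigma, experts):
    """Return the name of the expert whose noise-level range contains sigma.

    experts is a list of (sigma_min, sigma_max, name) tuples covering
    half-open ranges [sigma_min, sigma_max).
    """
    for sigma_min, sigma_max, name in experts:
        if sigma_min <= sigma < sigma_max:
            return name
    raise ValueError(f"no expert covers noise level {sigma}")


# Hypothetical split into three stages of the sampling process.
experts = [
    (0.0, 1.0, "low_noise_expert"),
    (1.0, 10.0, "mid_noise_expert"),
    (10.0, float("inf"), "high_noise_expert"),
]

for sigma in (50.0, 3.0, 0.2):
    print(sigma, select_expert(sigma, experts))
```

Because only one expert runs at each step, the per-step cost matches that of a single model, consistent with the abstract's claim of unchanged inference cost.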
-
Generating Long Videos of Dynamic Scenes
Authors:
Tim Brooks,
Janne Hellsten,
Miika Aittala,
Ting-Chun Wang,
Timo Aila,
Jaakko Lehtinen,
Ming-Yu Liu,
Alexei A. Efros,
Tero Karras
Abstract:
We present a video generation model that accurately reproduces object motion, changes in camera viewpoint, and new content that arises over time. Existing video generation methods often fail to produce new content as a function of time while maintaining consistencies expected in real environments, such as plausible dynamics and object persistence. A common failure case is for content to never change due to over-reliance on inductive biases to provide temporal consistency, such as a single latent code that dictates content for the entire video. On the other extreme, without long-term consistency, generated videos may morph unrealistically between different scenes. To address these limitations, we prioritize the time axis by redesigning the temporal latent representation and learning long-term consistency from data by training on longer videos. To this end, we leverage a two-phase training strategy, where we separately train using longer videos at a low resolution and shorter videos at a high resolution. To evaluate the capabilities of our model, we introduce two new benchmark datasets with explicit focus on long-term temporal dynamics.
Submitted 9 June, 2022; v1 submitted 7 June, 2022;
originally announced June 2022.
-
Elucidating the Design Space of Diffusion-Based Generative Models
Authors:
Tero Karras,
Miika Aittala,
Timo Aila,
Samuli Laine
Abstract:
We argue that the theory and practice of diffusion-based generative models are currently unnecessarily convoluted and seek to remedy the situation by presenting a design space that clearly separates the concrete design choices. This lets us identify several changes to both the sampling and training processes, as well as preconditioning of the score networks. Together, our improvements yield new state-of-the-art FID of 1.79 for CIFAR-10 in a class-conditional setting and 1.97 in an unconditional setting, with much faster sampling (35 network evaluations per image) than prior designs. To further demonstrate their modular nature, we show that our design changes dramatically improve both the efficiency and quality obtainable with pre-trained score networks from previous work, including improving the FID of a previously trained ImageNet-64 model from 2.07 to near-SOTA 1.55, and after re-training with our proposed improvements to a new SOTA of 1.36.
Submitted 11 October, 2022; v1 submitted 1 June, 2022;
originally announced June 2022.
-
The Role of ImageNet Classes in Fréchet Inception Distance
Authors:
Tuomas Kynkäänniemi,
Tero Karras,
Miika Aittala,
Timo Aila,
Jaakko Lehtinen
Abstract:
Fréchet Inception Distance (FID) is the primary metric for ranking models in data-driven generative modeling. While remarkably successful, the metric is known to sometimes disagree with human judgement. We investigate a root cause of these discrepancies, and visualize what FID "looks at" in generated images. We show that the feature space that FID is (typically) computed in is so close to the ImageNet classifications that aligning the histograms of Top-$N$ classifications between sets of generated and real images can reduce FID substantially -- without actually improving the quality of results. Thus, we conclude that FID is prone to intentional or accidental distortions. As a practical example of an accidental distortion, we discuss a case where an ImageNet pre-trained FastGAN achieves a FID comparable to StyleGAN2, while being worse in terms of human evaluation.
Submitted 14 February, 2023; v1 submitted 11 March, 2022;
originally announced March 2022.
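For reference, FID is the Fréchet distance between two Gaussians fitted to feature vectors of real and generated images. The symbols below are not defined in the abstract and are introduced here as notation: $\mu_r, \Sigma_r$ are the mean and covariance of the features of real images, and $\mu_g, \Sigma_g$ those of generated images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Since the metric depends only on these feature statistics, changes that shift the features toward those of the real set can lower FID without improving the images, which is the kind of distortion the abstract describes.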
-
Efficient Geometry-aware 3D Generative Adversarial Networks
Authors:
Eric R. Chan,
Connor Z. Lin,
Matthew A. Chan,
Koki Nagano,
Boxiao Pan,
Shalini De Mello,
Orazio Gallo,
Leonidas Guibas,
Jonathan Tremblay,
Sameh Khamis,
Tero Karras,
Gordon Wetzstein
Abstract:
Unsupervised generation of high-quality multi-view-consistent images and 3D shapes using only collections of single-view 2D photographs has been a long-standing challenge. Existing 3D GANs are either compute-intensive or make approximations that are not 3D-consistent; the former limits quality and resolution of the generated images and the latter adversely affects multi-view consistency and shape quality. In this work, we improve the computational efficiency and image quality of 3D GANs without overly relying on these approximations. We introduce an expressive hybrid explicit-implicit network architecture that, together with other design choices, synthesizes not only high-resolution multi-view-consistent images in real time but also produces high-quality 3D geometry. By decoupling feature generation and neural rendering, our framework is able to leverage state-of-the-art 2D CNN generators, such as StyleGAN2, and inherit their efficiency and expressiveness. We demonstrate state-of-the-art 3D-aware synthesis with FFHQ and AFHQ Cats, among other experiments.
Submitted 27 April, 2022; v1 submitted 15 December, 2021;
originally announced December 2021.
-
Alias-Free Generative Adversarial Networks
Authors:
Tero Karras,
Miika Aittala,
Samuli Laine,
Erik Härkönen,
Janne Hellsten,
Jaakko Lehtinen,
Timo Aila
Abstract:
We observe that despite their hierarchical convolutional nature, the synthesis process of typical generative adversarial networks depends on absolute pixel coordinates in an unhealthy manner. This manifests itself as, e.g., detail appearing to be glued to image coordinates instead of the surfaces of depicted objects. We trace the root cause to careless signal processing that causes aliasing in the generator network. Interpreting all signals in the network as continuous, we derive generally applicable, small architectural changes that guarantee that unwanted information cannot leak into the hierarchical synthesis process. The resulting networks match the FID of StyleGAN2 but differ dramatically in their internal representations, and they are fully equivariant to translation and rotation even at subpixel scales. Our results pave the way for generative models better suited for video and animation.
Submitted 18 October, 2021; v1 submitted 23 June, 2021;
originally announced June 2021.
-
Modular Primitives for High-Performance Differentiable Rendering
Authors:
Samuli Laine,
Janne Hellsten,
Tero Karras,
Yeongho Seol,
Jaakko Lehtinen,
Timo Aila
Abstract:
We present a modular differentiable renderer design that yields performance superior to previous methods by leveraging existing, highly optimized hardware graphics pipelines. Our design supports all crucial operations in a modern graphics pipeline: rasterizing large numbers of triangles, attribute interpolation, filtered texture lookups, as well as user-programmable shading and geometry processing, all in high resolutions. Our modular primitives allow custom, high-performance graphics pipelines to be built directly within automatic differentiation frameworks such as PyTorch or TensorFlow. As a motivating application, we formulate facial performance capture as an inverse rendering problem and show that it can be solved efficiently using our tools. Our results indicate that this simple and straightforward approach achieves excellent geometric correspondence between rendered results and reference imagery.
Submitted 6 November, 2020;
originally announced November 2020.
-
Training Generative Adversarial Networks with Limited Data
Authors:
Tero Karras,
Miika Aittala,
Janne Hellsten,
Samuli Laine,
Jaakko Lehtinen,
Timo Aila
Abstract:
Training generative adversarial networks (GAN) using too little data typically leads to discriminator overfitting, causing training to diverge. We propose an adaptive discriminator augmentation mechanism that significantly stabilizes training in limited data regimes. The approach does not require changes to loss functions or network architectures, and is applicable both when training from scratch and when fine-tuning an existing GAN on another dataset. We demonstrate, on several datasets, that good results are now possible using only a few thousand training images, often matching StyleGAN2 results with an order of magnitude fewer images. We expect this to open up new application domains for GANs. We also find that the widely used CIFAR-10 is, in fact, a limited data benchmark, and improve the record FID from 5.59 to 2.42.
Submitted 7 October, 2020; v1 submitted 11 June, 2020;
originally announced June 2020.
-
Semi-Supervised StyleGAN for Disentanglement Learning
Authors:
Weili Nie,
Tero Karras,
Animesh Garg,
Shoubhik Debnath,
Anjul Patney,
Ankit B. Patel,
Anima Anandkumar
Abstract:
Disentanglement learning is crucial for obtaining disentangled representations and controllable generation. Current disentanglement methods face several inherent limitations: difficulty with high-resolution images, primarily focusing on learning disentangled representations, and non-identifiability due to the unsupervised setting. To alleviate these limitations, we design new architectures and loss functions based on StyleGAN (Karras et al., 2019), for semi-supervised high-resolution disentanglement learning. We create two complex high-resolution synthetic datasets for systematic testing. We investigate the impact of limited supervision and find that using only 0.25%~2.5% of labeled data is sufficient for good disentanglement on both synthetic and real datasets. We propose new metrics to quantify generator controllability, and observe there may exist a crucial trade-off between disentangled representation learning and controllable generation. We also consider semantic fine-grained image editing to achieve better generalization to unseen images.
Submitted 25 November, 2020; v1 submitted 6 March, 2020;
originally announced March 2020.
-
Analyzing and Improving the Image Quality of StyleGAN
Authors:
Tero Karras,
Samuli Laine,
Miika Aittala,
Janne Hellsten,
Jaakko Lehtinen,
Timo Aila
Abstract:
The style-based GAN architecture (StyleGAN) yields state-of-the-art results in data-driven unconditional generative image modeling. We expose and analyze several of its characteristic artifacts, and propose changes in both model architecture and training methods to address them. In particular, we redesign the generator normalization, revisit progressive growing, and regularize the generator to encourage good conditioning in the mapping from latent codes to images. In addition to improving image quality, this path length regularizer yields the additional benefit that the generator becomes significantly easier to invert. This makes it possible to reliably attribute a generated image to a particular network. We furthermore visualize how well the generator utilizes its output resolution, and identify a capacity problem, motivating us to train larger models for additional quality improvements. Overall, our improved model redefines the state of the art in unconditional image modeling, both in terms of existing distribution quality metrics as well as perceived image quality.
Submitted 23 March, 2020; v1 submitted 3 December, 2019;
originally announced December 2019.
-
Few-Shot Unsupervised Image-to-Image Translation
Authors:
Ming-Yu Liu,
Xun Huang,
Arun Mallya,
Tero Karras,
Timo Aila,
Jaakko Lehtinen,
Jan Kautz
Abstract:
Unsupervised image-to-image translation methods learn to map images in a given class to an analogous image in a different class, drawing on unstructured (non-registered) datasets of images. While remarkably successful, current methods require access to many images in both source and destination classes at training time. We argue this greatly limits their use. Drawing inspiration from the human capability of picking up the essence of a novel object from a small number of examples and generalizing from there, we seek a few-shot, unsupervised image-to-image translation algorithm that works on previously unseen target classes that are specified, at test time, only by a few example images. Our model achieves this few-shot generation capability by coupling an adversarial training scheme with a novel network design. Through extensive experimental validation and comparisons to several baseline methods on benchmark datasets, we verify the effectiveness of the proposed framework. Our implementation and datasets are available at https://github.com/NVlabs/FUNIT .
Submitted 9 September, 2019; v1 submitted 5 May, 2019;
originally announced May 2019.
-
Improved Precision and Recall Metric for Assessing Generative Models
Authors:
Tuomas Kynkäänniemi,
Tero Karras,
Samuli Laine,
Jaakko Lehtinen,
Timo Aila
Abstract:
The ability to automatically estimate the quality and coverage of the samples produced by a generative model is a vital requirement for driving algorithm research. We present an evaluation metric that can separately and reliably measure both of these aspects in image generation tasks by forming explicit, non-parametric representations of the manifolds of real and generated data. We demonstrate the effectiveness of our metric in StyleGAN and BigGAN by providing several illustrative examples where existing metrics yield uninformative or contradictory results. Furthermore, we analyze multiple design variants of StyleGAN to better understand the relationships between the model architecture, training methods, and the properties of the resulting sample distribution. In the process, we identify new variants that improve the state-of-the-art. We also perform the first principled analysis of truncation methods and identify an improved method. Finally, we extend our metric to estimate the perceptual quality of individual samples, and use this to study latent space interpolations.
Submitted 30 October, 2019; v1 submitted 15 April, 2019;
originally announced April 2019.
-
High-Quality Self-Supervised Deep Image Denoising
Authors:
Samuli Laine,
Tero Karras,
Jaakko Lehtinen,
Timo Aila
Abstract:
We describe a novel method for training high-quality image denoising models based on unorganized collections of corrupted images. The training does not need access to clean reference images, or explicit pairs of corrupted images, and can thus be applied in situations where such data is unacceptably expensive or impossible to acquire. We build on a recent technique that removes the need for reference data by employing networks with a "blind spot" in the receptive field, and significantly improve two key aspects: image quality and training efficiency. Our result quality is on par with state-of-the-art neural network denoisers in the case of i.i.d. additive Gaussian noise, and not far behind with Poisson and impulse noise. We also successfully handle cases where parameters of the noise model are variable and/or unknown in both training and evaluation data.
Submitted 28 October, 2019; v1 submitted 29 January, 2019;
originally announced January 2019.
-
A Style-Based Generator Architecture for Generative Adversarial Networks
Authors:
Tero Karras,
Samuli Laine,
Timo Aila
Abstract:
We propose an alternative generator architecture for generative adversarial networks, borrowing from style transfer literature. The new architecture leads to an automatically learned, unsupervised separation of high-level attributes (e.g., pose and identity when trained on human faces) and stochastic variation in the generated images (e.g., freckles, hair), and it enables intuitive, scale-specific control of the synthesis. The new generator improves the state-of-the-art in terms of traditional distribution quality metrics, leads to demonstrably better interpolation properties, and also better disentangles the latent factors of variation. To quantify interpolation quality and disentanglement, we propose two new, automated methods that are applicable to any generator architecture. Finally, we introduce a new, highly varied and high-quality dataset of human faces.
Submitted 29 March, 2019; v1 submitted 12 December, 2018;
originally announced December 2018.
-
Noise2Noise: Learning Image Restoration without Clean Data
Authors:
Jaakko Lehtinen,
Jacob Munkberg,
Jon Hasselgren,
Samuli Laine,
Tero Karras,
Miika Aittala,
Timo Aila
Abstract:
We apply basic statistical reasoning to signal reconstruction by machine learning -- learning to map corrupted observations to clean signals -- with a simple and powerful conclusion: it is possible to learn to restore images by only looking at corrupted examples, at performance at and sometimes exceeding training using clean data, without explicit image priors or likelihood models of the corruption. In practice, we show that a single model learns photographic noise removal, denoising synthetic Monte Carlo images, and reconstruction of undersampled MRI scans -- all corrupted by different processes -- based on noisy data only.
Submitted 29 October, 2018; v1 submitted 12 March, 2018;
originally announced March 2018.
-
Progressive Growing of GANs for Improved Quality, Stability, and Variation
Authors:
Tero Karras,
Timo Aila,
Samuli Laine,
Jaakko Lehtinen
Abstract:
We describe a new training methodology for generative adversarial networks. The key idea is to grow both the generator and discriminator progressively: starting from a low resolution, we add new layers that model increasingly fine details as training progresses. This both speeds the training up and greatly stabilizes it, allowing us to produce images of unprecedented quality, e.g., CelebA images at 1024^2. We also propose a simple way to increase the variation in generated images, and achieve a record inception score of 8.80 in unsupervised CIFAR10. Additionally, we describe several implementation details that are important for discouraging unhealthy competition between the generator and discriminator. Finally, we suggest a new metric for evaluating GAN results, both in terms of image quality and variation. As an additional contribution, we construct a higher-quality version of the CelebA dataset.
Submitted 26 February, 2018; v1 submitted 27 October, 2017;
originally announced October 2017.
-
Pruning Convolutional Neural Networks for Resource Efficient Inference
Authors:
Pavlo Molchanov,
Stephen Tyree,
Tero Karras,
Timo Aila,
Jan Kautz
Abstract:
We propose a new formulation for pruning convolutional kernels in neural networks to enable efficient inference. We interleave greedy criteria-based pruning with fine-tuning by backpropagation - a computationally efficient procedure that maintains good generalization in the pruned network. We propose a new criterion based on Taylor expansion that approximates the change in the cost function induced by pruning network parameters. We focus on transfer learning, where large pretrained networks are adapted to specialized tasks. The proposed criterion demonstrates superior performance compared to other criteria, e.g. the norm of kernel weights or feature map activation, for pruning large CNNs after adaptation to fine-grained classification tasks (Birds-200 and Flowers-102), relying only on first-order gradient information. We also show that pruning can lead to more than 10x theoretical (5x practical) reduction in adapted 3D-convolutional filters with a small drop in accuracy in a recurrent gesture classifier. Finally, we show results for the large-scale ImageNet dataset to emphasize the flexibility of our approach.
Submitted 8 June, 2017; v1 submitted 19 November, 2016;
originally announced November 2016.
-
Production-Level Facial Performance Capture Using Deep Convolutional Neural Networks
Authors:
Samuli Laine,
Tero Karras,
Timo Aila,
Antti Herva,
Shunsuke Saito,
Ronald Yu,
Hao Li,
Jaakko Lehtinen
Abstract:
We present a real-time deep learning framework for video-based facial performance capture -- the dense 3D tracking of an actor's face given a monocular video. Our pipeline begins with accurately capturing a subject using a high-end production facial capture pipeline based on multi-view stereo tracking and artist-enhanced animations. With 5-10 minutes of captured footage, we train a convolutional neural network to produce high-quality output, including self-occluded regions, from a monocular video sequence of that subject. Since this 3D facial performance capture is fully automated, our system can drastically reduce the amount of labor involved in the development of modern narrative-driven video games or films involving realistic digital doubles of actors and potentially hours of animated dialogue per character. We compare our results with several state-of-the-art monocular real-time facial capture techniques and demonstrate compelling animation inference in challenging areas such as eyes and lips.
Submitted 2 June, 2017; v1 submitted 21 September, 2016;
originally announced September 2016.