-
UniRes: Universal Image Restoration for Complex Degradations
Authors:
Mo Zhou,
Keren Ye,
Mauricio Delbracio,
Peyman Milanfar,
Vishal M. Patel,
Hossein Talebi
Abstract:
Real-world image restoration is hampered by diverse degradations stemming from varying capture conditions, capture devices and post-processing pipelines. Existing works make improvements through simulating those degradations and leveraging image generative priors, however generalization to in-the-wild data remains an unresolved problem. In this paper, we focus on complex degradations, i.e., arbitr…
▽ More
Real-world image restoration is hampered by diverse degradations stemming from varying capture conditions, capture devices and post-processing pipelines. Existing works make improvements through simulating those degradations and leveraging image generative priors, however generalization to in-the-wild data remains an unresolved problem. In this paper, we focus on complex degradations, i.e., arbitrary mixtures of multiple types of known degradations, which is frequently seen in the wild. A simple yet flexible diffusionbased framework, named UniRes, is proposed to address such degradations in an end-to-end manner. It combines several specialized models during the diffusion sampling steps, hence transferring the knowledge from several well-isolated restoration tasks to the restoration of complex in-the-wild degradations. This only requires well-isolated training data for several degradation types. The framework is flexible as extensions can be added through a unified formulation, and the fidelity-quality trade-off can be adjusted through a new paradigm. Our proposed method is evaluated on both complex-degradation and single-degradation image restoration datasets. Extensive qualitative and quantitative experimental results show consistent performance gain especially for images with complex degradations.
△ Less
Submitted 5 June, 2025;
originally announced June 2025.
-
TextSR: Diffusion Super-Resolution with Multilingual OCR Guidance
Authors:
Keren Ye,
Ignacio Garcia Dorado,
Michalis Raptis,
Mauricio Delbracio,
Irene Zhu,
Peyman Milanfar,
Hossein Talebi
Abstract:
While recent advancements in Image Super-Resolution (SR) using diffusion models have shown promise in improving overall image quality, their application to scene text images has revealed limitations. These models often struggle with accurate text region localization and fail to effectively model image and multilingual character-to-shape priors. This leads to inconsistencies, the generation of hall…
▽ More
While recent advancements in Image Super-Resolution (SR) using diffusion models have shown promise in improving overall image quality, their application to scene text images has revealed limitations. These models often struggle with accurate text region localization and fail to effectively model image and multilingual character-to-shape priors. This leads to inconsistencies, the generation of hallucinated textures, and a decrease in the perceived quality of the super-resolved text.
To address these issues, we introduce TextSR, a multimodal diffusion model specifically designed for Multilingual Scene Text Image Super-Resolution. TextSR leverages a text detector to pinpoint text regions within an image and then employs Optical Character Recognition (OCR) to extract multilingual text from these areas. The extracted text characters are then transformed into visual shapes using a UTF-8 based text encoder and cross-attention. Recognizing that OCR may sometimes produce inaccurate results in real-world scenarios, we have developed two innovative methods to enhance the robustness of our model. By integrating text character priors with the low-resolution text images, our model effectively guides the super-resolution process, enhancing fine details within the text and improving overall legibility. The superior performance of our model on both the TextZoom and TextVQA datasets sets a new benchmark for STISR, underscoring the efficacy of our approach.
△ Less
Submitted 29 May, 2025;
originally announced May 2025.
-
Reference-Guided Identity Preserving Face Restoration
Authors:
Mo Zhou,
Keren Ye,
Viraj Shah,
Kangfu Mei,
Mauricio Delbracio,
Peyman Milanfar,
Vishal M. Patel,
Hossein Talebi
Abstract:
Preserving face identity is a critical yet persistent challenge in diffusion-based image restoration. While reference faces offer a path forward, existing reference-based methods often fail to fully exploit their potential. This paper introduces a novel approach that maximizes reference face utility for improved face restoration and identity preservation. Our method makes three key contributions:…
▽ More
Preserving face identity is a critical yet persistent challenge in diffusion-based image restoration. While reference faces offer a path forward, existing reference-based methods often fail to fully exploit their potential. This paper introduces a novel approach that maximizes reference face utility for improved face restoration and identity preservation. Our method makes three key contributions: 1) Composite Context, a comprehensive representation that fuses multi-level (high- and low-level) information from the reference face, offering richer guidance than prior singular representations. 2) Hard Example Identity Loss, a novel loss function that leverages the reference face to address the identity learning inefficiencies found in the existing identity loss. 3) A training-free method to adapt the model to multi-reference inputs during inference. The proposed method demonstrably restores high-quality faces and achieves state-of-the-art identity preserving restoration on benchmarks such as FFHQ-Ref and CelebA-Ref-Test, consistently outperforming previous work.
△ Less
Submitted 27 May, 2025;
originally announced May 2025.
-
The Power of Context: How Multimodality Improves Image Super-Resolution
Authors:
Kangfu Mei,
Hossein Talebi,
Mojtaba Ardakani,
Vishal M. Patel,
Peyman Milanfar,
Mauricio Delbracio
Abstract:
Single-image super-resolution (SISR) remains challenging due to the inherent difficulty of recovering fine-grained details and preserving perceptual quality from low-resolution inputs. Existing methods often rely on limited image priors, leading to suboptimal results. We propose a novel approach that leverages the rich contextual information available in multiple modalities -- including depth, seg…
▽ More
Single-image super-resolution (SISR) remains challenging due to the inherent difficulty of recovering fine-grained details and preserving perceptual quality from low-resolution inputs. Existing methods often rely on limited image priors, leading to suboptimal results. We propose a novel approach that leverages the rich contextual information available in multiple modalities -- including depth, segmentation, edges, and text prompts -- to learn a powerful generative prior for SISR within a diffusion model framework. We introduce a flexible network architecture that effectively fuses multimodal information, accommodating an arbitrary number of input modalities without requiring significant modifications to the diffusion process. Crucially, we mitigate hallucinations, often introduced by text prompts, by using spatial information from other modalities to guide regional text-based conditioning. Each modality's guidance strength can also be controlled independently, allowing steering outputs toward different directions, such as increasing bokeh through depth or adjusting object prominence via segmentation. Extensive experiments demonstrate that our model surpasses state-of-the-art generative SISR methods, achieving superior visual quality and fidelity. See project page at https://mmsr.kfmei.com/.
△ Less
Submitted 18 March, 2025;
originally announced March 2025.
-
Bigger is not Always Better: Scaling Properties of Latent Diffusion Models
Authors:
Kangfu Mei,
Zhengzhong Tu,
Mauricio Delbracio,
Hossein Talebi,
Vishal M. Patel,
Peyman Milanfar
Abstract:
We study the scaling properties of latent diffusion models (LDMs) with an emphasis on their sampling efficiency. While improved network architecture and inference algorithms have shown to effectively boost sampling efficiency of diffusion models, the role of model size -- a critical determinant of sampling efficiency -- has not been thoroughly examined. Through empirical analysis of established te…
▽ More
We study the scaling properties of latent diffusion models (LDMs) with an emphasis on their sampling efficiency. While improved network architecture and inference algorithms have shown to effectively boost sampling efficiency of diffusion models, the role of model size -- a critical determinant of sampling efficiency -- has not been thoroughly examined. Through empirical analysis of established text-to-image diffusion models, we conduct an in-depth investigation into how model size influences sampling efficiency across varying sampling steps. Our findings unveil a surprising trend: when operating under a given inference budget, smaller models frequently outperform their larger equivalents in generating high-quality results. Moreover, we extend our study to demonstrate the generalizability of the these findings by applying various diffusion samplers, exploring diverse downstream tasks, evaluating post-distilled models, as well as comparing performance relative to training compute. These findings open up new pathways for the development of LDM scaling strategies which can be employed to enhance generative capabilities within limited inference budgets.
△ Less
Submitted 10 December, 2024; v1 submitted 1 April, 2024;
originally announced April 2024.
-
SPIRE: Semantic Prompt-Driven Image Restoration
Authors:
Chenyang Qi,
Zhengzhong Tu,
Keren Ye,
Mauricio Delbracio,
Peyman Milanfar,
Qifeng Chen,
Hossein Talebi
Abstract:
Text-driven diffusion models have become increasingly popular for various image editing tasks, including inpainting, stylization, and object replacement. However, it still remains an open research problem to adopt this language-vision paradigm for more fine-level image processing tasks, such as denoising, super-resolution, deblurring, and compression artifact removal. In this paper, we develop SPI…
▽ More
Text-driven diffusion models have become increasingly popular for various image editing tasks, including inpainting, stylization, and object replacement. However, it still remains an open research problem to adopt this language-vision paradigm for more fine-level image processing tasks, such as denoising, super-resolution, deblurring, and compression artifact removal. In this paper, we develop SPIRE, a Semantic and restoration Prompt-driven Image Restoration framework that leverages natural language as a user-friendly interface to control the image restoration process. We consider the capacity of prompt information in two dimensions. First, we use content-related prompts to enhance the semantic alignment, effectively alleviating identity ambiguity in the restoration outcomes. Second, our approach is the first framework that supports fine-level instruction through language-based quantitative specification of the restoration strength, without the need for explicit task-specific design. In addition, we introduce a novel fusion mechanism that augments the existing ControlNet architecture by learning to rescale the generative prior, thereby achieving better restoration fidelity. Our extensive experiments demonstrate the superior restoration performance of SPIRE compared to the state of the arts, alongside offering the flexibility of text-based control over the restoration effects.
△ Less
Submitted 16 July, 2024; v1 submitted 18 December, 2023;
originally announced December 2023.
-
CoDi: Conditional Diffusion Distillation for Higher-Fidelity and Faster Image Generation
Authors:
Kangfu Mei,
Mauricio Delbracio,
Hossein Talebi,
Zhengzhong Tu,
Vishal M. Patel,
Peyman Milanfar
Abstract:
Large generative diffusion models have revolutionized text-to-image generation and offer immense potential for conditional generation tasks such as image enhancement, restoration, editing, and compositing. However, their widespread adoption is hindered by the high computational cost, which limits their real-time application. To address this challenge, we introduce a novel method dubbed CoDi, that…
▽ More
Large generative diffusion models have revolutionized text-to-image generation and offer immense potential for conditional generation tasks such as image enhancement, restoration, editing, and compositing. However, their widespread adoption is hindered by the high computational cost, which limits their real-time application. To address this challenge, we introduce a novel method dubbed CoDi, that adapts a pre-trained latent diffusion model to accept additional image conditioning inputs while significantly reducing the sampling steps required to achieve high-quality results. Our method can leverage architectures such as ControlNet to incorporate conditioning inputs without compromising the model's prior knowledge gained during large scale pre-training. Additionally, a conditional consistency loss enforces consistent predictions across diffusion steps, effectively compelling the model to generate high-quality images with conditions in a few steps. Our conditional-task learning and distillation approach outperforms previous distillation methods, achieving a new state-of-the-art in producing high-quality images with very few steps (e.g., 1-4) across multiple tasks, including super-resolution, text-guided image editing, and depth-to-image generation.
△ Less
Submitted 17 February, 2024; v1 submitted 2 October, 2023;
originally announced October 2023.
-
MULLER: Multilayer Laplacian Resizer for Vision
Authors:
Zhengzhong Tu,
Peyman Milanfar,
Hossein Talebi
Abstract:
Image resizing operation is a fundamental preprocessing module in modern computer vision. Throughout the deep learning revolution, researchers have overlooked the potential of alternative resizing methods beyond the commonly used resizers that are readily available, such as nearest-neighbors, bilinear, and bicubic. The key question of our interest is whether the front-end resizer affects the perfo…
▽ More
Image resizing operation is a fundamental preprocessing module in modern computer vision. Throughout the deep learning revolution, researchers have overlooked the potential of alternative resizing methods beyond the commonly used resizers that are readily available, such as nearest-neighbors, bilinear, and bicubic. The key question of our interest is whether the front-end resizer affects the performance of deep vision models? In this paper, we present an extremely lightweight multilayer Laplacian resizer with only a handful of trainable parameters, dubbed MULLER resizer. MULLER has a bandpass nature in that it learns to boost details in certain frequency subbands that benefit the downstream recognition models. We show that MULLER can be easily plugged into various training pipelines, and it effectively boosts the performance of the underlying vision task with little to no extra cost. Specifically, we select a state-of-the-art vision Transformer, MaxViT, as the baseline, and show that, if trained with MULLER, MaxViT gains up to 0.6% top-1 accuracy, and meanwhile enjoys 36% inference cost saving to achieve similar top-1 accuracy on ImageNet-1k, as compared to the standard training scheme. Notably, MULLER's performance also scales with model size and training data size such as ImageNet-21k and JFT, and it is widely applicable to multiple vision tasks, including image classification, object detection and segmentation, as well as image quality assessment.
△ Less
Submitted 6 April, 2023;
originally announced April 2023.
-
Time Series Anomaly Detection in Smart Homes: A Deep Learning Approach
Authors:
Somayeh Zamani,
Hamed Talebi,
Gunnar Stevens
Abstract:
Fixing energy leakage caused by different anomalies can result in significant energy savings and extended appliance life. Further, it assists grid operators in scheduling their resources to meet the actual needs of end users, while helping end users reduce their energy costs. In this paper, we analyze the patterns pertaining to the power consumption of dishwashers used in two houses of the REFIT d…
▽ More
Fixing energy leakage caused by different anomalies can result in significant energy savings and extended appliance life. Further, it assists grid operators in scheduling their resources to meet the actual needs of end users, while helping end users reduce their energy costs. In this paper, we analyze the patterns pertaining to the power consumption of dishwashers used in two houses of the REFIT dataset. Then two autoencoder (AEs) with 1D-CNN and TCN as backbones are trained to differentiate the normal patterns from the abnormal ones. Our results indicate that TCN outperforms CNN1D in detecting anomalies in energy consumption. Finally, the data from the Fridge_Freezer and the Freezer of house No. 3 in REFIT is also used to evaluate our approach.
△ Less
Submitted 28 February, 2023;
originally announced February 2023.
-
Multiscale Structure Guided Diffusion for Image Deblurring
Authors:
Mengwei Ren,
Mauricio Delbracio,
Hossein Talebi,
Guido Gerig,
Peyman Milanfar
Abstract:
Diffusion Probabilistic Models (DPMs) have recently been employed for image deblurring, formulated as an image-conditioned generation process that maps Gaussian noise to the high-quality image, conditioned on the blurry input. Image-conditioned DPMs (icDPMs) have shown more realistic results than regression-based methods when trained on pairwise in-domain data. However, their robustness in restori…
▽ More
Diffusion Probabilistic Models (DPMs) have recently been employed for image deblurring, formulated as an image-conditioned generation process that maps Gaussian noise to the high-quality image, conditioned on the blurry input. Image-conditioned DPMs (icDPMs) have shown more realistic results than regression-based methods when trained on pairwise in-domain data. However, their robustness in restoring images is unclear when presented with out-of-domain images as they do not impose specific degradation models or intermediate constraints. To this end, we introduce a simple yet effective multiscale structure guidance as an implicit bias that informs the icDPM about the coarse structure of the sharp image at the intermediate layers. This guided formulation leads to a significant improvement of the deblurring results, particularly on unseen domain. The guidance is extracted from the latent space of a regression network trained to predict the clean-sharp target at multiple lower resolutions, thus maintaining the most salient sharp structures. With both the blurry input and multiscale guidance, the icDPM model can better understand the blur and recover the clean image. We evaluate a single-dataset trained model on diverse datasets and demonstrate more robust deblurring results with fewer artifacts on unseen data. Our method outperforms existing baselines, achieving state-of-the-art perceptual quality while keeping competitive distortion metrics.
△ Less
Submitted 12 December, 2023; v1 submitted 4 December, 2022;
originally announced December 2022.
-
Soft Diffusion: Score Matching for General Corruptions
Authors:
Giannis Daras,
Mauricio Delbracio,
Hossein Talebi,
Alexandros G. Dimakis,
Peyman Milanfar
Abstract:
We define a broader family of corruption processes that generalizes previously known diffusion models. To reverse these general diffusions, we propose a new objective called Soft Score Matching that provably learns the score function for any linear corruption process and yields state of the art results for CelebA. Soft Score Matching incorporates the degradation process in the network. Our new los…
▽ More
We define a broader family of corruption processes that generalizes previously known diffusion models. To reverse these general diffusions, we propose a new objective called Soft Score Matching that provably learns the score function for any linear corruption process and yields state of the art results for CelebA. Soft Score Matching incorporates the degradation process in the network. Our new loss trains the model to predict a clean image, \textit{that after corruption}, matches the diffused observation. We show that our objective learns the gradient of the likelihood under suitable regularity conditions for a family of corruption processes. We further develop a principled way to select the corruption levels for general diffusion processes and a novel sampling method that we call Momentum Sampler. We show experimentally that our framework works for general linear corruption processes, such as Gaussian blur and masking. We achieve state-of-the-art FID score $1.85$ on CelebA-64, outperforming all previous linear diffusion models. We also show significant computational benefits compared to vanilla denoising diffusion.
△ Less
Submitted 4 October, 2022; v1 submitted 12 September, 2022;
originally announced September 2022.
-
MaxViT: Multi-Axis Vision Transformer
Authors:
Zhengzhong Tu,
Hossein Talebi,
Han Zhang,
Feng Yang,
Peyman Milanfar,
Alan Bovik,
Yinxiao Li
Abstract:
Transformers have recently gained significant attention in the computer vision community. However, the lack of scalability of self-attention mechanisms with respect to image size has limited their wide adoption in state-of-the-art vision backbones. In this paper we introduce an efficient and scalable attention model we call multi-axis attention, which consists of two aspects: blocked local and dil…
▽ More
Transformers have recently gained significant attention in the computer vision community. However, the lack of scalability of self-attention mechanisms with respect to image size has limited their wide adoption in state-of-the-art vision backbones. In this paper we introduce an efficient and scalable attention model we call multi-axis attention, which consists of two aspects: blocked local and dilated global attention. These design choices allow global-local spatial interactions on arbitrary input resolutions with only linear complexity. We also present a new architectural element by effectively blending our proposed attention model with convolutions, and accordingly propose a simple hierarchical vision backbone, dubbed MaxViT, by simply repeating the basic building block over multiple stages. Notably, MaxViT is able to ''see'' globally throughout the entire network, even in earlier, high-resolution stages. We demonstrate the effectiveness of our model on a broad spectrum of vision tasks. On image classification, MaxViT achieves state-of-the-art performance under various settings: without extra data, MaxViT attains 86.5% ImageNet-1K top-1 accuracy; with ImageNet-21K pre-training, our model achieves 88.7% top-1 accuracy. For downstream tasks, MaxViT as a backbone delivers favorable performance on object detection as well as visual aesthetic assessment. We also show that our proposed model expresses strong generative modeling capability on ImageNet, demonstrating the superior potential of MaxViT blocks as a universal vision module. The source code and trained models will be available at https://github.com/google-research/maxvit.
△ Less
Submitted 9 September, 2022; v1 submitted 4 April, 2022;
originally announced April 2022.
-
MAXIM: Multi-Axis MLP for Image Processing
Authors:
Zhengzhong Tu,
Hossein Talebi,
Han Zhang,
Feng Yang,
Peyman Milanfar,
Alan Bovik,
Yinxiao Li
Abstract:
Recent progress on Transformers and multi-layer perceptron (MLP) models provide new network architectural designs for computer vision tasks. Although these models proved to be effective in many vision tasks such as image recognition, there remain challenges in adapting them for low-level vision. The inflexibility to support high-resolution images and limitations of local attention are perhaps the…
▽ More
Recent progress on Transformers and multi-layer perceptron (MLP) models provide new network architectural designs for computer vision tasks. Although these models proved to be effective in many vision tasks such as image recognition, there remain challenges in adapting them for low-level vision. The inflexibility to support high-resolution images and limitations of local attention are perhaps the main bottlenecks. In this work, we present a multi-axis MLP based architecture called MAXIM, that can serve as an efficient and flexible general-purpose vision backbone for image processing tasks. MAXIM uses a UNet-shaped hierarchical structure and supports long-range interactions enabled by spatially-gated MLPs. Specifically, MAXIM contains two MLP-based building blocks: a multi-axis gated MLP that allows for efficient and scalable spatial mixing of local and global visual cues, and a cross-gating block, an alternative to cross-attention, which accounts for cross-feature conditioning. Both these modules are exclusively based on MLPs, but also benefit from being both global and `fully-convolutional', two properties that are desirable for image processing. Our extensive experimental results show that the proposed MAXIM model achieves state-of-the-art performance on more than ten benchmarks across a range of image processing tasks, including denoising, deblurring, deraining, dehazing, and enhancement while requiring fewer or comparable numbers of parameters and FLOPs than competitive models. The source code and trained models will be available at \url{https://github.com/google-research/maxim}.
△ Less
Submitted 1 April, 2022; v1 submitted 9 January, 2022;
originally announced January 2022.
-
Deblurring via Stochastic Refinement
Authors:
Jay Whang,
Mauricio Delbracio,
Hossein Talebi,
Chitwan Saharia,
Alexandros G. Dimakis,
Peyman Milanfar
Abstract:
Image deblurring is an ill-posed problem with multiple plausible solutions for a given input image. However, most existing methods produce a deterministic estimate of the clean image and are trained to minimize pixel-level distortion. These metrics are known to be poorly correlated with human perception, and often lead to unrealistic reconstructions. We present an alternative framework for blind d…
▽ More
Image deblurring is an ill-posed problem with multiple plausible solutions for a given input image. However, most existing methods produce a deterministic estimate of the clean image and are trained to minimize pixel-level distortion. These metrics are known to be poorly correlated with human perception, and often lead to unrealistic reconstructions. We present an alternative framework for blind deblurring based on conditional diffusion models. Unlike existing techniques, we train a stochastic sampler that refines the output of a deterministic predictor and is capable of producing a diverse set of plausible reconstructions for a given input. This leads to a significant improvement in perceptual quality over existing state-of-the-art methods across multiple standard benchmarks. Our predict-and-refine approach also enables much more efficient sampling compared to typical diffusion models. Combined with a carefully tuned network architecture and inference procedure, our method is competitive in terms of distortion metrics such as PSNR. These results show clear benefits of our diffusion-based method for deblurring and challenge the widely used strategy of producing a single, deterministic reconstruction.
△ Less
Submitted 28 December, 2021; v1 submitted 4 December, 2021;
originally announced December 2021.
-
Exponentiated Gradient Reweighting for Robust Training Under Label Noise and Beyond
Authors:
Negin Majidi,
Ehsan Amid,
Hossein Talebi,
Manfred K. Warmuth
Abstract:
Many learning tasks in machine learning can be viewed as taking a gradient step towards minimizing the average loss of a batch of examples in each training iteration. When noise is prevalent in the data, this uniform treatment of examples can lead to overfitting to noisy examples with larger loss values and result in poor generalization. Inspired by the expert setting in on-line learning, we prese…
▽ More
Many learning tasks in machine learning can be viewed as taking a gradient step towards minimizing the average loss of a batch of examples in each training iteration. When noise is prevalent in the data, this uniform treatment of examples can lead to overfitting to noisy examples with larger loss values and result in poor generalization. Inspired by the expert setting in on-line learning, we present a flexible approach to learning from noisy examples. Specifically, we treat each training example as an expert and maintain a distribution over all examples. We alternate between updating the parameters of the model using gradient descent and updating the example weights using the exponentiated gradient update. Unlike other related methods, our approach handles a general class of loss functions and can be applied to a wide range of noise types and applications. We show the efficacy of our approach for multiple learning settings, namely noisy principal component analysis and a variety of noisy classification problems.
△ Less
Submitted 3 April, 2021;
originally announced April 2021.
-
Learning to Resize Images for Computer Vision Tasks
Authors:
Hossein Talebi,
Peyman Milanfar
Abstract:
For all the ways convolutional neural nets have revolutionized computer vision in recent years, one important aspect has received surprisingly little attention: the effect of image size on the accuracy of tasks being trained for. Typically, to be efficient, the input images are resized to a relatively small spatial resolution (e.g. 224x224), and both training and inference are carried out at this…
▽ More
For all the ways convolutional neural nets have revolutionized computer vision in recent years, one important aspect has received surprisingly little attention: the effect of image size on the accuracy of tasks being trained for. Typically, to be efficient, the input images are resized to a relatively small spatial resolution (e.g. 224x224), and both training and inference are carried out at this resolution. The actual mechanism for this re-scaling has been an afterthought: Namely, off-the-shelf image resizers such as bilinear and bicubic are commonly used in most machine learning software frameworks. But do these resizers limit the on task performance of the trained networks? The answer is yes. Indeed, we show that the typical linear resizer can be replaced with learned resizers that can substantially improve performance. Importantly, while the classical resizers typically result in better perceptual quality of the downscaled images, our proposed learned resizers do not necessarily give better visual quality, but instead improve task performance. Our learned image resizer is jointly trained with a baseline vision model. This learned CNN-based resizer creates machine friendly visual manipulations that lead to a consistent improvement of the end task metric over the baseline model. Specifically, here we focus on the classification task with the ImageNet dataset, and experiment with four different models to learn resizers adapted to each model. Moreover, we show that the proposed resizer can also be useful for fine-tuning the classification baselines for other vision tasks. To this end, we experiment with three different baselines to develop image quality assessment (IQA) models on the AVA dataset.
△ Less
Submitted 17 August, 2021; v1 submitted 17 March, 2021;
originally announced March 2021.
-
Deep Perceptual Image Quality Assessment for Compression
Authors:
Juan Carlos Mier,
Eddie Huang,
Hossein Talebi,
Feng Yang,
Peyman Milanfar
Abstract:
Lossy Image compression is necessary for efficient storage and transfer of data. Typically the trade-off between bit-rate and quality determines the optimal compression level. This makes the image quality metric an integral part of any imaging system. While the existing full-reference metrics such as PSNR and SSIM may be less sensitive to perceptual quality, the recently introduced learning method…
▽ More
Lossy Image compression is necessary for efficient storage and transfer of data. Typically the trade-off between bit-rate and quality determines the optimal compression level. This makes the image quality metric an integral part of any imaging system. While the existing full-reference metrics such as PSNR and SSIM may be less sensitive to perceptual quality, the recently introduced learning methods may fail to generalize to unseen data. In this paper we propose the largest image compression quality dataset to date with human perceptual preferences, enabling the use of deep learning, and we develop a full reference perceptual quality assessment metric for lossy image compression that outperforms the existing state-of-the-art methods. We show that the proposed model can effectively learn from thousands of examples available in the new dataset, and consequently it generalizes better to other unseen datasets of human perceptual preference.
△ Less
Submitted 15 July, 2021; v1 submitted 1 March, 2021;
originally announced March 2021.
-
Projected Distribution Loss for Image Enhancement
Authors:
Mauricio Delbracio,
Hossein Talebi,
Peyman Milanfar
Abstract:
Features obtained from object recognition CNNs have been widely used for measuring perceptual similarities between images. Such differentiable metrics can be used as perceptual learning losses to train image enhancement models. However, the choice of the distance function between input and target features may have a consequential impact on the performance of the trained model. While using the norm…
▽ More
Features obtained from object recognition CNNs have been widely used for measuring perceptual similarities between images. Such differentiable metrics can be used as perceptual learning losses to train image enhancement models. However, the choice of the distance function between input and target features may have a consequential impact on the performance of the trained model. While using the norm of the difference between extracted features leads to limited hallucination of details, measuring the distance between distributions of features may generate more textures; yet also more unrealistic details and artifacts. In this paper, we demonstrate that aggregating 1D-Wasserstein distances between CNN activations is more reliable than the existing approaches, and it can significantly improve the perceptual performance of enhancement models. More explicitly, we show that in imaging applications such as denoising, super-resolution, demosaicing, deblurring and JPEG artifact removal, the proposed learning loss outperforms the current state-of-the-art on reference-based perceptual losses. This means that the proposed learning loss can be plugged into different imaging frameworks and produce perceptually realistic results.
△ Less
Submitted 17 May, 2021; v1 submitted 16 December, 2020;
originally announced December 2020.
-
Rank-smoothed Pairwise Learning In Perceptual Quality Assessment
Authors:
Hossein Talebi,
Ehsan Amid,
Peyman Milanfar,
Manfred K. Warmuth
Abstract:
Conducting pairwise comparisons is a widely used approach in curating human perceptual preference data. Typically raters are instructed to make their choices according to a specific set of rules that address certain dimensions of image quality and aesthetics. The outcome of this process is a dataset of sampled image pairs with their associated empirical preference probabilities. Training a model o…
▽ More
Conducting pairwise comparisons is a widely used approach in curating human perceptual preference data. Typically raters are instructed to make their choices according to a specific set of rules that address certain dimensions of image quality and aesthetics. The outcome of this process is a dataset of sampled image pairs with their associated empirical preference probabilities. Training a model on these pairwise preferences is a common deep learning approach. However, optimizing by gradient descent through mini-batch learning means that the "global" ranking of the images is not explicitly taken into account. In other words, each step of the gradient descent relies only on a limited number of pairwise comparisons. In this work, we demonstrate that regularizing the pairwise empirical probabilities with aggregated rankwise probabilities leads to a more reliable training loss. We show that training a deep image quality assessment model with our rank-smoothed loss consistently improves the accuracy of predicting human preferences.
△ Less
Submitted 21 November, 2020;
originally announced November 2020.
-
Control and implementation of fluid-driven soft gripper with dynamic uncertainty of object
Authors:
Amirhosein Alian,
Mohammad Zareinejad,
Heidar Ali Talebi
Abstract:
Soft grippers, for stable grasping of objects, with high compliance could be considered a suitable candidate for replacement of conventional rigid grippers, and in recent years, they have been emerging exponentially in industries. Not only are these highly adaptable grippers capable of static grasping of an object, but also they can be utilized for performing object manipulation tasks. Plenty of c…
▽ More
Soft grippers, for stable grasping of objects, with high compliance could be considered a suitable candidate for replacement of conventional rigid grippers, and in recent years, they have been emerging exponentially in industries. Not only are these highly adaptable grippers capable of static grasping of an object, but also they can be utilized for performing object manipulation tasks. Plenty of contemporary studies have been emphasizing on static grasping ability of soft grippers. However, in this thesis, planar in-hand object manipulation in a soft gripper which comprises a pair of soft finger is investigated.
△ Less
Submitted 21 November, 2020;
originally announced November 2020.
-
The Rate-Distortion-Accuracy Tradeoff: JPEG Case Study
Authors:
Xiyang Luo,
Hossein Talebi,
Feng Yang,
Michael Elad,
Peyman Milanfar
Abstract:
Handling digital images is almost always accompanied by a lossy compression in order to facilitate efficient transmission and storage. This introduces an unavoidable tension between the allocated bit-budget (rate) and the faithfulness of the resulting image to the original one (distortion). An additional complicating consideration is the effect of the compression on recognition performance by give…
▽ More
Handling digital images is almost always accompanied by a lossy compression in order to facilitate efficient transmission and storage. This introduces an unavoidable tension between the allocated bit-budget (rate) and the faithfulness of the resulting image to the original one (distortion). An additional complicating consideration is the effect of the compression on recognition performance by given classifiers (accuracy). This work aims to explore this rate-distortion-accuracy tradeoff. As a case study, we focus on the design of the quantization tables in the JPEG compression standard. We offer a novel optimal tuning of these tables via continuous optimization, leveraging a differential implementation of both the JPEG encoder-decoder and an entropy estimator. This enables us to offer a unified framework that considers the interplay between rate, distortion and classification accuracy. In all these fronts, we report a substantial boost in performance by a simple and easily implemented modification of these tables.
△ Less
Submitted 2 August, 2020;
originally announced August 2020.
-
Super-Resolving Commercial Satellite Imagery Using Realistic Training Data
Authors:
Xiang Zhu,
Hossein Talebi,
Xinwei Shi,
Feng Yang,
Peyman Milanfar
Abstract:
In machine learning based single image super-resolution, the degradation model is embedded in training data generation. However, most existing satellite image super-resolution methods use a simple down-sampling model with a fixed kernel to create training images. These methods work fine on synthetic data, but do not perform well on real satellite images. We propose a realistic training data genera…
▽ More
In machine learning based single image super-resolution, the degradation model is embedded in training data generation. However, most existing satellite image super-resolution methods use a simple down-sampling model with a fixed kernel to create training images. These methods work fine on synthetic data, but do not perform well on real satellite images. We propose a realistic training data generation model for commercial satellite imagery products, which includes not only the imaging process on satellites but also the post-process on the ground. We also propose a convolutional neural network optimized for satellite images. Experiments show that the proposed training data generation model is able to improve super-resolution performance on real satellite images.
△ Less
Submitted 25 February, 2020;
originally announced February 2020.
-
Better Compression with Deep Pre-Editing
Authors:
Hossein Talebi,
Damien Kelly,
Xiyang Luo,
Ignacio Garcia Dorado,
Feng Yang,
Peyman Milanfar,
Michael Elad
Abstract:
Could we compress images via standard codecs while avoiding visible artifacts? The answer is obvious -- this is doable as long as the bit budget is generous enough. What if the allocated bit-rate for compression is insufficient? Then unfortunately, artifacts are a fact of life. Many attempts were made over the years to fight this phenomenon, with various degrees of success. In this work we aim to…
▽ More
Could we compress images via standard codecs while avoiding visible artifacts? The answer is obvious -- this is doable as long as the bit budget is generous enough. What if the allocated bit-rate for compression is insufficient? Then unfortunately, artifacts are a fact of life. Many attempts were made over the years to fight this phenomenon, with various degrees of success. In this work we aim to break the unholy connection between bit-rate and image quality, and propose a way to circumvent compression artifacts by pre-editing the incoming image and modifying its content to fit the given bits. We design this editing operation as a learned convolutional neural network, and formulate an optimization problem for its training. Our loss takes into account a proximity between the original image and the edited one, a bit-budget penalty over the proposed image, and a no-reference image quality measure for forcing the outcome to be visually pleasing. The proposed approach is demonstrated on the popular JPEG compression, showing savings in bits and/or improvements in visual quality, obtained with intricate editing effects.
△ Less
Submitted 23 July, 2021; v1 submitted 31 January, 2020;
originally announced February 2020.
-
High Accuracy Classification of White Blood Cells using TSLDA Classifier and Covariance Features
Authors:
Hamed Talebi,
Amin Ranjbar,
Alireza Davoudi,
Hamed Gholami,
Mohammad Bagher Menhaj
Abstract:
creating automated processes in different areas of medical science with the application of engineering tools is a highly growing field over recent decades. In this context, many medical image processing and analyzing researchers use worthwhile methods in artificial intelligence, which can reduce necessary human power while increases accuracy of results. Among various medical images, blood microsco…
▽ More
creating automated processes in different areas of medical science with the application of engineering tools is a highly growing field over recent decades. In this context, many medical image processing and analyzing researchers use worthwhile methods in artificial intelligence, which can reduce necessary human power while increases accuracy of results. Among various medical images, blood microscopic images play a vital role in heart failure diagnosis, e.g., blood cancers. The prominent component in blood cancer diagnosis is white blood cells (WBCs) which due to its general characteristics in microscopic images sometimes make difficulties in recognition and classification tasks such as non-uniform colors/illuminances, different shapes, sizes, and textures. Moreover, overlapped WBCs in bone marrow images and neighboring to red blood cells are identified as reasons for errors in the classification task. In this paper, we have endeavored to segment various parts in medical images via Naïve Bayes clustering method and in next stage via TSLDA classifier, which is supplied by features acquired from covariance descriptor results in the accuracy of 98.02%. It seems that this result is delightful in WBCs recognition.
△ Less
Submitted 16 July, 2019; v1 submitted 12 June, 2019;
originally announced June 2019.
-
Learned Perceptual Image Enhancement
Authors:
Hossein Talebi,
Peyman Milanfar
Abstract:
Learning a typical image enhancement pipeline involves minimization of a loss function between enhanced and reference images. While L1 and L2 losses are perhaps the most widely used functions for this purpose, they do not necessarily lead to perceptually compelling results. In this paper, we show that adding a learned no-reference image quality metric to the loss can significantly improve enhancem…
▽ More
Learning a typical image enhancement pipeline involves minimization of a loss function between enhanced and reference images. While L1 and L2 losses are perhaps the most widely used functions for this purpose, they do not necessarily lead to perceptually compelling results. In this paper, we show that adding a learned no-reference image quality metric to the loss can significantly improve enhancement operators. This metric is implemented using a CNN (convolutional neural network) trained on a large-scale dataset labelled with aesthetic preferences of human raters. This loss allows us to conveniently perform back-propagation in our learning framework to simultaneously optimize for similarity to a given ground truth reference and perceptual quality. This perceptual loss is only used to train parameters of image processing operators, and does not impose any extra complexity at inference time. Our experiments demonstrate that this loss can be effective for tuning a variety of operators such as local tone mapping and dehazing.
△ Less
Submitted 7 December, 2017;
originally announced December 2017.
-
Stability and Transparency Analysis of a Bilateral Teleoperation in Presence of Data Loss
Authors:
A. Bakhshi,
H. A. Talebi,
A. A. Suratgar,
M. Abdeetedal
Abstract:
This paper presents a novel approach for stability and transparency analysis for bilateral teleoperation in the presence of data loss in communication media. A new model for data loss is proposed based on a set of periodic continuous pulses and its finite series representation. The passivity of the overall system is shown using wave variable approach including the newly defined model for data loss…
▽ More
This paper presents a novel approach for stability and transparency analysis for bilateral teleoperation in the presence of data loss in communication media. A new model for data loss is proposed based on a set of periodic continuous pulses and its finite series representation. The passivity of the overall system is shown using wave variable approach including the newly defined model for data loss. Simulation results are presented to show the effectiveness of the proposed approach.
△ Less
Submitted 9 November, 2017;
originally announced November 2017.
-
NIMA: Neural Image Assessment
Authors:
Hossein Talebi,
Peyman Milanfar
Abstract:
Automatically learned quality assessment for images has recently become a hot topic due to its usefulness in a wide variety of applications such as evaluating image capture pipelines, storage techniques and sharing media. Despite the subjective nature of this problem, most existing methods only predict the mean opinion score provided by datasets such as AVA [1] and TID2013 [2]. Our approach differ…
▽ More
Automatically learned quality assessment for images has recently become a hot topic due to its usefulness in a wide variety of applications such as evaluating image capture pipelines, storage techniques and sharing media. Despite the subjective nature of this problem, most existing methods only predict the mean opinion score provided by datasets such as AVA [1] and TID2013 [2]. Our approach differs from others in that we predict the distribution of human opinion scores using a convolutional neural network. Our architecture also has the advantage of being significantly simpler than other methods with comparable performance. Our proposed approach relies on the success (and retraining) of proven, state-of-the-art deep object recognition networks. Our resulting network can be used to not only score images reliably and with high correlation to human perception, but also to assist with adaptation and optimization of photo editing/enhancement algorithms in a photographic pipeline. All this is done without need for a "golden" reference image, consequently allowing for single-image, semantic- and perceptually-aware, no-reference quality assessment.
△ Less
Submitted 26 April, 2018; v1 submitted 15 September, 2017;
originally announced September 2017.
-
Fast Multi-Layer Laplacian Enhancement
Authors:
Hossein Talebi,
Peyman Milanfar
Abstract:
A novel, fast and practical way of enhancing images is introduced in this paper. Our approach builds on Laplacian operators of well-known edge-aware kernels, such as bilateral and nonlocal means, and extends these filter's capabilities to perform more effective and fast image smoothing, sharpening and tone manipulation. We propose an approximation of the Laplacian, which does not require normaliza…
▽ More
A novel, fast and practical way of enhancing images is introduced in this paper. Our approach builds on Laplacian operators of well-known edge-aware kernels, such as bilateral and nonlocal means, and extends these filter's capabilities to perform more effective and fast image smoothing, sharpening and tone manipulation. We propose an approximation of the Laplacian, which does not require normalization of the kernel weights. Multiple Laplacians of the affinity weights endow our method with progressive detail decomposition of the input image from fine to coarse scale. These image components are blended by a structure mask, which avoids noise/artifact magnification or detail loss in the output image. Contributions of the proposed method to existing image editing tools are: (1) Low computational and memory requirements, making it appropriate for mobile device implementations (e.g. as a finish step in a camera pipeline), (2) A range of filtering applications from detail enhancement to denoising with only a few control parameters, enabling the user to apply a combination of various (and even opposite) filtering effects.
△ Less
Submitted 23 June, 2016;
originally announced June 2016.