Do image and video quality metrics model low-level human vision?

Authors: Dounia Hammou, Yancheng Cai, Pavan Madhusudanarao, Christos G. Bampis, Rafał K. Mantiuk

Abstract: Image and video quality metrics, such as SSIM, LPIPS, and VMAF, aim to predict the perceived quality of the evaluated content and are often claimed to be "perceptual". Yet, few metrics directly model human visual perception, and most rely on hand-crafted formulas or training datasets to achieve alignment with perceptual data. In this paper, we propose a set of tests for full-reference quality metrics that examine their ability to model several aspects of low-level human vision: contrast sensitivity, contrast masking, and contrast matching. The tests are meant to provide additional scrutiny for newly proposed metrics. We use our tests to analyze 33 existing image and video quality metrics and find their strengths and weaknesses, such as the ability of LPIPS and MS-SSIM to predict contrast masking and the poor performance of VMAF in this task. We further find that the popular SSIM metric overemphasizes differences in high spatial frequencies, but its multi-scale counterpart, MS-SSIM, addresses this shortcoming. Such findings cannot be easily made using existing evaluation protocols.

Submitted 20 March, 2025; originally announced March 2025.

arXiv:2204.12022 [pdf, other]

Estimating the Resize Parameter in End-to-end Learned Image Compression

Authors: Li-Heng Chen, Christos G. Bampis, Zhi Li, Lukáš Krasula, Alan C. Bovik

Abstract: We describe a search-free resizing framework that can further improve the rate-distortion tradeoff of recent learned image compression models. Our approach is simple: compose a pair of differentiable downsampling/upsampling layers that sandwich a neural compression model. To determine resize factors for different inputs, we utilize another neural network jointly trained with the compression model, with the end goal of minimizing the rate-distortion objective. Our results suggest that "compression friendly" downsampled representations can be quickly determined during encoding by using an auxiliary network and differentiable image warping. Through extensive experimental tests on existing deep image compression models, we show that our new resizing parameter estimation framework can provide a Bjøntegaard-Delta rate (BD-rate) improvement of about 10% against leading perceptual quality engines. We also carried out a subjective quality study, the results of which show that our new approach yields favorable compressed images. To facilitate reproducible research in this direction, the implementation used in this paper is being made freely available online at: https://github.com/treammm/ResizeCompression.

Submitted 25 April, 2022; originally announced April 2022.

arXiv:2202.11038 [pdf, other]

Banding vs. Quality: Perceptual Impact and Objective Assessment

Authors: Lukáš Krasula, Zhi Li, Christos G. Bampis, Mariana Afonso, Nil Fons Miret, Joel Sole

Abstract: Staircase-like contours introduced to a video by quantization in flat areas, commonly known as banding, have been a long-standing problem in both video processing and quality assessment communities. The fact that even a relatively small change of the original pixel values can result in a strong impact on perceived quality makes banding especially difficult for objective quality metrics to detect. In this paper, we study how banding annoyance compares to more commonly studied scaling and compression artifacts with respect to the overall perceptual quality. We further propose a simple combination of VMAF and the recently developed banding index, CAMBI, into a banding-aware video quality metric showing improved correlation with overall perceived quality.

Submitted 22 February, 2022; originally announced February 2022.

Comments: Submitted to IEEE International Conference on Image Processing (ICIP) 2022

arXiv:2105.09999 [pdf, other]

Convolutional Block Design for Learned Fractional Downsampling

Authors: Li-Heng Chen, Christos G. Bampis, Zhi Li, Chao Chen, Alan C. Bovik

Abstract: The layers of convolutional neural networks (CNNs) can be used to alter the resolution of their inputs, but the scaling factors are limited to integer values. However, in many image and video processing applications, the ability to resize by a fractional factor would be advantageous. One example is conversion between resolutions standardized for video compression, such as from 1080p to 720p. To solve this problem, we propose an alternative building block, formulated as a conventional convolutional layer followed by a differentiable resizer. More concretely, the convolutional layer preserves the resolution of the input, while the resizing operation is fully handled by the resizer. In this way, any CNN architecture can be adapted for non-integer resizing. As an application, we replace the resizing convolutional layer of a modern deep downsampling model by the proposed building block, and apply it to an adaptive bitrate video streaming scenario. Our experimental results show that an improvement in coding efficiency over the conventional Lanczos algorithm is attained, in terms of PSNR, SSIM, and VMAF on test videos.

Submitted 20 May, 2021; originally announced May 2021.

Comments: 4 pages conference paper

arXiv:2102.00088 [pdf]

doi 10.1109/TIP.2021.3137658

A Subjective and Objective Study of Space-Time Subsampled Video Quality

Authors: Dae Yeol Lee, Somdyuti Paul, Christos G. Bampis, Hyunsuk Ko, Jongho Kim, Se Yoon Jeong, Blake Homan, Alan C. Bovik

Abstract: Video dimensions are continuously increasing to provide more realistic and immersive experiences to global streaming and social media viewers. However, increments in video parameters such as spatial resolution and frame rate are inevitably associated with larger data volumes. Transmitting increasingly voluminous videos through limited bandwidth networks in a perceptually optimal way is a current challenge affecting billions of viewers. One recent practice adopted by video service providers is space-time resolution adaptation in conjunction with video compression. Consequently, it is important to understand how different levels of space-time subsampling and compression affect the perceptual quality of videos. Towards making progress in this direction, we constructed a large new resource, called the ETRI-LIVE Space-Time Subsampled Video Quality (ETRI-LIVE STSVQ) database, containing 437 videos generated by applying various levels of combined space-time subsampling and video compression on 15 diverse video contents. We also conducted a large-scale human study on the new dataset, collecting about 15,000 subjective judgments of video quality. We provide a rate-distortion analysis of the collected subjective scores, enabling us to investigate the perceptual impact of space-time subsampling at different bit rates. We also evaluated and compared the performance of leading video quality models on the new database.

Submitted 29 January, 2021; originally announced February 2021.

arXiv:2009.11203 [pdf, other]

doi 10.1109/TIP.2020.3043127

Perceptual Video Quality Prediction Emphasizing Chroma Distortions

Authors: Li-Heng Chen, Christos G. Bampis, Zhi Li, Joel Sole, Alan C. Bovik

Abstract: Measuring the quality of digital videos viewed by human observers has become a common practice in numerous multimedia applications, such as adaptive video streaming, quality monitoring, and other digital TV applications. Here we explore a significant, yet relatively unexplored problem: measuring perceptual quality on videos arising from both luma and chroma distortions from compression. Toward investigating this problem, it is important to understand the kinds of chroma distortions that arise, how they relate to luma compression distortions, and how they can affect perceived quality. We designed and carried out a subjective experiment to measure subjective video quality on both luma and chroma distortions, introduced both in isolation as well as together. Specifically, the new subjective dataset comprises a total of $210$ videos afflicted by distortions caused by varying levels of luma quantization commingled with different amounts of chroma quantization. The subjective scores were collected from $34$ subjects in a controlled environmental setting. Using the newly collected subjective data, we were able to demonstrate important shortcomings of existing video quality models, especially with regard to chroma distortions. Further, we designed an objective video quality model which builds on existing video quality algorithms, by considering the fidelity of chroma channels in a principled way. We also found that this quality analysis implies that there is room for reducing bitrate consumption in modern video codecs by creatively increasing the compression factor on chroma channels. We believe that this work will both encourage further research in this direction, as well as advance progress on the ultimate goal of jointly optimizing luma and chroma compression in modern video encoders.

Submitted 24 September, 2020; v1 submitted 23 September, 2020; originally announced September 2020.

Comments: 14 pages

arXiv:2007.02711 [pdf, other]

Perceptually Optimizing Deep Image Compression

Authors: Li-Heng Chen, Christos G. Bampis, Zhi Li, Andrey Norkin, Alan C. Bovik

Abstract: Mean squared error (MSE) and $\ell_p$ norms have largely dominated the measurement of loss in neural networks due to their simplicity and analytical properties. However, when used to assess visual information loss, these simple norms are not highly consistent with human perception. Here, we propose a different proxy approach to optimize image analysis networks against quantitative perceptual models. Specifically, we construct a proxy network, which mimics the perceptual model while serving as a loss layer of the network. We experimentally demonstrate how this optimization framework can be applied to train an end-to-end optimized image compression network. By building on top of a modern deep image compression model, we are able to demonstrate an averaged bitrate reduction of $28.7\%$ over MSE optimization, given a specified perceptual quality (VMAF) level.

Submitted 9 July, 2020; v1 submitted 3 July, 2020; originally announced July 2020.

Comments: 7 pages, 6 figures. arXiv admin note: substantial text overlap with arXiv:1910.08845

arXiv:2004.02943 [pdf, other]

doi 10.1109/TIP.2021.3107213

Predicting the Quality of Compressed Videos with Pre-Existing Distortions

Authors: Xiangxu Yu, Neil Birkbeck, Yilin Wang, Christos G. Bampis, Balu Adsumilli, Alan C. Bovik

Abstract: Over the past decade, the online video industry has greatly expanded the volume of visual data that is streamed and shared over the Internet. Moreover, because of the increasing ease of video capture, many millions of consumers create and upload large volumes of User-Generated-Content (UGC) videos. Unlike streaming television or cinematic content produced by professional videographers and cinematographers, UGC videos are most commonly captured by naive users having limited skills and imperfect technique, and often are afflicted by highly diverse and mixed in-capture distortions. These UGC videos are then often uploaded for sharing onto cloud servers, where they are further compressed for storage and transmission. Our paper tackles the highly practical problem of predicting the quality of compressed videos (perhaps during the process of compression, to help guide it), with only (possibly severely) distorted UGC videos as references. To address this problem, we have developed a novel Video Quality Assessment (VQA) framework that we call 1stepVQA (to distinguish it from two-step methods that we discuss). 1stepVQA overcomes limitations of Full-Reference, Reduced-Reference and No-Reference VQA models by exploiting the statistical regularities of both natural videos and distorted videos. We show that 1stepVQA is able to more accurately predict the quality of compressed videos, given imperfect reference videos. We also describe a new dedicated video database which includes (typically distorted) UGC reference videos, and a large number of compressed versions of them. We show that the 1stepVQA model outperforms other VQA models in this scenario. We are providing the dedicated new database free of charge at https://live.ece.utexas.edu/research/onestep/index.html

Submitted 6 April, 2020; originally announced April 2020.

arXiv:2004.02067 [pdf, other]

A Simple Model for Subject Behavior in Subjective Experiments

Authors: Zhi Li, Christos G. Bampis, Lukáš Krasula, Lucjan Janowski, Ioannis Katsavounidis

Abstract: In a subjective experiment to evaluate the perceptual audiovisual quality of multimedia and television services, raw opinion scores collected from test subjects are often noisy and unreliable. To produce the final mean opinion scores (MOS), recommendations such as ITU-R BT.500, ITU-T P.910 and ITU-T P.913 standardize post-test screening procedures to clean up the raw opinion scores, using techniques such as subject outlier rejection and bias removal. In this paper, we analyze the prior standardized techniques to demonstrate their weaknesses. As an alternative, we propose a simple model to account for two of the most dominant behaviors of subject inaccuracy: bias and inconsistency. We further show that this model can also effectively deal with inattentive subjects that give random scores. We propose to use maximum likelihood estimation to jointly solve the model parameters, and present two numeric solvers: the first based on the Newton-Raphson method, and the second based on an alternating projection (AP). We show that the AP solver generalizes the ITU-T P.913 post-test screening procedure by weighting a subject's contribution to the true quality score by her consistency (thus, the quality scores estimated can be interpreted as bias-subtracted consistency-weighted MOS). We compare the proposed methods with the standardized techniques using real datasets and synthetic simulations, and demonstrate that the proposed methods are the most valuable when the test conditions are challenging (for example, crowdsourcing and cross-lab studies), offering advantages such as better model-data fit, tighter confidence intervals, better robustness against subject outliers, the absence of hard coded parameters and thresholds, and auxiliary information on test subjects. The code for this work is open-sourced at https://github.com/Netflix/sureal.

Submitted 6 May, 2021; v1 submitted 4 April, 2020; originally announced April 2020.

Comments: 14 pages, updated version of the original paper published in Human Vision and Electronic Imaging (HVEI) 2020
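
The idea of a bias-subtracted, consistency-weighted MOS from the abstract above can be illustrated with a short sketch. It assumes each raw score is the stimulus's true quality plus a per-subject bias plus per-subject noise, and alternates between estimating the subject parameters and the quality scores. This is a simplified illustration written for this listing, not the solver shipped in the sureal package; the function name, the fixed iteration count, and the `eps` floor are all assumptions.

```python
from statistics import mean


def consistency_weighted_mos(scores, iterations=50, eps=1e-6):
    """Hypothetical helper: scores[i][j] is the raw score subject i gave stimulus j."""
    n_subj, n_stim = len(scores), len(scores[0])
    # Start from the plain MOS for every stimulus.
    quality = [mean(scores[i][j] for i in range(n_subj)) for j in range(n_stim)]
    bias = [0.0] * n_subj
    incons = [1.0] * n_subj
    for _ in range(iterations):
        # Bias: a subject's average offset from the current quality estimates.
        bias = [
            mean(scores[i][j] - quality[j] for j in range(n_stim))
            for i in range(n_subj)
        ]
        # Inconsistency: RMS of the subject's residuals once bias is removed.
        # The eps floor (an assumption) keeps the weights below finite.
        incons = [
            max(eps, mean((scores[i][j] - quality[j] - bias[i]) ** 2
                          for j in range(n_stim)) ** 0.5)
            for i in range(n_subj)
        ]
        # Quality: bias-subtracted scores weighted by 1 / inconsistency^2.
        weights = [1.0 / s ** 2 for s in incons]
        total = sum(weights)
        quality = [
            sum(w * (scores[i][j] - bias[i]) for i, w in enumerate(weights)) / total
            for j in range(n_stim)
        ]
    return quality, bias, incons


# Made-up data: subject 1 is subject 0 shifted by +1; subject 2 is erratic.
raw = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 3.0, 4.0, 5.0, 6.0],
    [2.0, 1.0, 5.0, 3.0, 4.0],
]
quality, bias, incons = consistency_weighted_mos(raw)
```

In this toy run the erratic subject ends up with the largest inconsistency and therefore the smallest weight, while the constant offset of subject 1 is absorbed into its bias term rather than distorting the quality estimates. Note that bias and quality are only identified up to a common shift in this sketch.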

arXiv:1910.08845 [pdf, other]

doi 10.1109/TIP.2020.3036752

ProxIQA: A Proxy Approach to Perceptual Optimization of Learned Image Compression

Authors: Li-Heng Chen, Christos G. Bampis, Zhi Li, Andrey Norkin, Alan C. Bovik

Abstract: The use of $\ell_p$ $(p=1,2)$ norms has largely dominated the measurement of loss in neural networks due to their simplicity and analytical properties. However, when used to assess the loss of visual information, these simple norms are not very consistent with human perception. Here, we describe a different "proximal" approach to optimize image analysis networks against quantitative perceptual models. Specifically, we construct a proxy network, broadly termed ProxIQA, which mimics the perceptual model while serving as a loss layer of the network. We experimentally demonstrate how this optimization framework can be applied to train an end-to-end optimized image compression network. By building on top of an existing deep image compression model, we are able to demonstrate a bitrate reduction of as much as $31\%$ over MSE optimization, given a specified perceptual quality (VMAF) level.

Submitted 29 October, 2020; v1 submitted 19 October, 2019; originally announced October 2019.

Comments: 12 pages, 12 figures, 5 tables

arXiv:1811.10673 [pdf, other]

Adversarial Video Compression Guided by Soft Edge Detection

Authors: Sungsoo Kim, Jin Soo Park, Christos G. Bampis, Jaeseong Lee, Mia K. Markey, Alexandros G. Dimakis, Alan C. Bovik

Abstract: We propose a video compression framework using conditional Generative Adversarial Networks (GANs). We rely on two encoders: one that deploys a standard video codec and another which generates low-level maps via a pipeline of down-sampling, a newly devised soft edge detector, and a novel lossless compression scheme. For decoding, we use a standard video decoder as well as a neural network based one, which is trained using a conditional GAN. Recent "deep" approaches to video compression require multiple videos to pre-train generative networks to conduct interpolation. In contrast to this prior work, our scheme trains a generative decoder on pairs of a very limited number of key frames taken from a single video and corresponding low-level maps. The trained decoder produces reconstructed frames relying on the guidance of low-level maps, without any interpolation. Experiments on a diverse set of 131 videos demonstrate that our proposed GAN-based compression engine achieves much higher quality reconstructions at very low bitrates than prevailing standard codecs such as H.264 or HEVC.

Submitted 26 November, 2018; originally announced November 2018.

arXiv:1808.03898 [pdf, other]

Towards Perceptually Optimized End-to-end Adaptive Video Streaming

Authors: Christos G. Bampis, Zhi Li, Ioannis Katsavounidis, Te-Yuan Huang, Chaitanya Ekanadham, Alan C. Bovik

Abstract: Measuring Quality of Experience (QoE) and integrating these measurements into video streaming algorithms is a multi-faceted problem that fundamentally requires the design of comprehensive subjective QoE databases and metrics. To achieve this goal, we have recently designed the LIVE-NFLX-II database, a highly-realistic database which contains subjective QoE responses to various design dimensions, such as bitrate adaptation algorithms, network conditions and video content. Our database builds on recent advancements in content-adaptive encoding and incorporates actual network traces to capture realistic network variations on the client device. Using our database, we study the effects of multiple streaming dimensions on user experience and evaluate video quality and quality of experience models. We believe that the tools introduced here will help inspire further progress on the development of perceptually-optimized client adaptation and video streaming strategies. The database is publicly available at http://live.ece.utexas.edu/research/LIVE_NFLX_II/live_nflx_plus.html.

Submitted 12 August, 2018; originally announced August 2018.

arXiv:1804.04813 [pdf, ps, other]

SpatioTemporal Feature Integration and Model Fusion for Full Reference Video Quality Assessment

Authors: Christos G. Bampis, Zhi Li, Alan C. Bovik

Abstract: Perceptual video quality assessment models are either frame-based or video-based, i.e., they apply spatiotemporal filtering or motion estimation to capture temporal video distortions. Despite their good performance on video quality databases, video-based approaches are time-consuming and harder to efficiently deploy. To balance between high performance and computational efficiency, Netflix developed the Video Multi-method Assessment Fusion (VMAF) framework, which integrates multiple quality-aware features to predict video quality. Nevertheless, this fusion framework does not fully exploit temporal video quality measurements which are relevant to temporal video distortions. To this end, we propose two improvements to the VMAF framework: SpatioTemporal VMAF and Ensemble VMAF. Both algorithms exploit efficient temporal video features which are fed into a single or multiple regression models. To train our models, we designed a large subjective database and evaluated the proposed models against state-of-the-art approaches. The compared algorithms will be made available as part of the open source package at https://github.com/Netflix/vmaf.

Submitted 13 April, 2018; originally announced April 2018.

arXiv:1801.02016 [pdf, other]

Predicting Encoded Picture Quality in Two Steps is a Better Way

Authors: Xiangxu Yu, Christos G. Bampis, Praful Gupta, Alan C. Bovik

Abstract: Full-reference (FR) image quality assessment (IQA) models assume a high quality "pristine" image as a reference against which to measure perceptual image quality. In many applications, however, the assumption that the reference image is of high quality may be untrue, leading to incorrect perceptual quality predictions. To address this, we propose a new two-step image quality prediction approach which integrates both no-reference (NR) and full-reference perceptual quality measurements into the quality prediction process. The no-reference module accounts for the possibly imperfect quality of the source (reference) image, while the full-reference component measures the quality differences between the source image and its possibly further distorted version. A simple, yet very efficient, multiplication step fuses the two sources of information into a reliable objective prediction score. We evaluated our two-step approach on a recently designed subjective image database and achieved standout performance compared to full-reference approaches, especially when the reference images were of low quality. The proposed approach is made publicly available at https://github.com/xiangxuyu/2stepQA

Submitted 9 February, 2018; v1 submitted 6 January, 2018; originally announced January 2018.

Comments: fix the link in the abstract
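
The multiplicative fusion mentioned in the abstract above can be sketched as follows. The abstract does not say which NR and FR models are used or how their outputs are scaled, so this sketch assumes both scores have already been mapped to [0, 1] with higher meaning better; the function and parameter names are hypothetical and are not taken from the 2stepQA repository.

```python
def two_step_quality(nr_source_quality: float, fr_fidelity: float) -> float:
    """Fuse a no-reference and a full-reference score by multiplication.

    Assumption: both inputs lie in [0, 1], higher is better.
    """
    for name, value in (("nr_source_quality", nr_source_quality),
                        ("fr_fidelity", fr_fidelity)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    # A low-quality source caps the final score even when the
    # compressed version is faithful to that source.
    return nr_source_quality * fr_fidelity


# Same FR fidelity, different source quality (made-up numbers).
from_pristine_source = two_step_quality(0.95, 0.9)
from_poor_source = two_step_quality(0.40, 0.9)
```

With identical full-reference fidelity, the prediction for the poor source comes out lower, which is the behavior a purely full-reference score cannot express.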

Showing 1–14 of 14 results for author: Bampis, C G