-
NTIRE 2025 Challenge on RAW Image Restoration and Super-Resolution
Authors:
Marcos V. Conde,
Radu Timofte,
Zihao Lu,
Xiangyu Kong,
Xiaoxia Xing,
Fan Wang,
Suejin Han,
MinKyu Park,
Tianyu Zhang,
Xin Luo,
Yeda Chen,
Dong Liu,
Li Pang,
Yuhang Yang,
Hongzhong Wang,
Xiangyong Cao,
Ruixuan Jiang,
Senyan Xu,
Siyuan Jiang,
Xueyang Fu,
Zheng-Jun Zha,
Tianyu Hao,
Yuhong He,
Ruoqi Li,
Yueqi Yang
, et al. (14 additional authors not shown)
Abstract:
This paper reviews the NTIRE 2025 RAW Image Restoration and Super-Resolution Challenge, highlighting the proposed solutions and results. New methods for RAW Restoration and Super-Resolution could be essential in modern Image Signal Processing (ISP) pipelines, however, this problem is not as explored as in the RGB domain. The goal of this challenge is two fold, (i) restore RAW images with blur and…
▽ More
This paper reviews the NTIRE 2025 RAW Image Restoration and Super-Resolution Challenge, highlighting the proposed solutions and results. New methods for RAW Restoration and Super-Resolution could be essential in modern Image Signal Processing (ISP) pipelines, however, this problem is not as explored as in the RGB domain. The goal of this challenge is two fold, (i) restore RAW images with blur and noise degradations, (ii) upscale RAW Bayer images by 2x, considering unknown noise and blur. In the challenge, a total of 230 participants registered, and 45 submitted results during thee challenge period. This report presents the current state-of-the-art in RAW Restoration.
△ Less
Submitted 4 June, 2025; v1 submitted 2 June, 2025;
originally announced June 2025.
-
RAW Image Reconstruction from RGB on Smartphones. NTIRE 2025 Challenge Report
Authors:
Marcos V. Conde,
Radu Timofte,
Radu Berdan,
Beril Besbinar,
Daisuke Iso,
Pengzhou Ji,
Xiong Dun,
Zeying Fan,
Chen Wu,
Zhansheng Wang,
Pengbo Zhang,
Jiazi Huang,
Qinglin Liu,
Wei Yu,
Shengping Zhang,
Xiangyang Ji,
Kyungsik Kim,
Minkyung Kim,
Hwalmin Lee,
Hekun Ma,
Huan Zheng,
Yanyan Wei,
Zhao Zhang,
Jing Fang,
Meilin Gao
, et al. (8 additional authors not shown)
Abstract:
Numerous low-level vision tasks operate in the RAW domain due to its linear properties, bit depth, and sensor designs. Despite this, RAW image datasets are scarce and more expensive to collect than the already large and public sRGB datasets. For this reason, many approaches try to generate realistic RAW images using sensor information and sRGB images. This paper covers the second challenge on RAW…
▽ More
Numerous low-level vision tasks operate in the RAW domain due to its linear properties, bit depth, and sensor designs. Despite this, RAW image datasets are scarce and more expensive to collect than the already large and public sRGB datasets. For this reason, many approaches try to generate realistic RAW images using sensor information and sRGB images. This paper covers the second challenge on RAW Reconstruction from sRGB (Reverse ISP). We aim to recover RAW sensor images from smartphones given the corresponding sRGB images without metadata and, by doing this, ``reverse" the ISP transformation. Over 150 participants joined this NTIRE 2025 challenge and submitted efficient models. The proposed methods and benchmark establish the state-of-the-art for generating realistic RAW data.
△ Less
Submitted 2 June, 2025;
originally announced June 2025.
-
NTIRE 2025 the 2nd Restore Any Image Model (RAIM) in the Wild Challenge
Authors:
Jie Liang,
Radu Timofte,
Qiaosi Yi,
Zhengqiang Zhang,
Shuaizheng Liu,
Lingchen Sun,
Rongyuan Wu,
Xindong Zhang,
Hui Zeng,
Lei Zhang
Abstract:
In this paper, we present a comprehensive overview of the NTIRE 2025 challenge on the 2nd Restore Any Image Model (RAIM) in the Wild. This challenge established a new benchmark for real-world image restoration, featuring diverse scenarios with and without reference ground truth. Participants were tasked with restoring real-captured images suffering from complex and unknown degradations, where both…
▽ More
In this paper, we present a comprehensive overview of the NTIRE 2025 challenge on the 2nd Restore Any Image Model (RAIM) in the Wild. This challenge established a new benchmark for real-world image restoration, featuring diverse scenarios with and without reference ground truth. Participants were tasked with restoring real-captured images suffering from complex and unknown degradations, where both perceptual quality and fidelity were critically evaluated. The challenge comprised two tracks: (1) the low-light joint denoising and demosaicing (JDD) task, and (2) the image detail enhancement/generation task. Each track included two sub-tasks. The first sub-task involved paired data with available ground truth, enabling quantitative evaluation. The second sub-task dealt with real-world yet unpaired images, emphasizing restoration efficiency and subjective quality assessed through a comprehensive user study. In total, the challenge attracted nearly 300 registrations, with 51 teams submitting more than 600 results. The top-performing methods advanced the state of the art in image restoration and received unanimous recognition from all 20+ expert judges. The datasets used in Track 1 and Track 2 are available at https://drive.google.com/drive/folders/1Mgqve-yNcE26IIieI8lMIf-25VvZRs_J and https://drive.google.com/drive/folders/1UB7nnzLwqDZOwDmD9aT8J0KVg2ag4Qae, respectively. The official challenge pages for Track 1 and Track 2 can be found at https://codalab.lisn.upsaclay.fr/competitions/21334#learn_the_details and https://codalab.lisn.upsaclay.fr/competitions/21623#learn_the_details.
△ Less
Submitted 2 June, 2025;
originally announced June 2025.
-
Higher fidelity perceptual image and video compression with a latent conditioned residual denoising diffusion model
Authors:
Jonas Brenig,
Radu Timofte
Abstract:
Denoising diffusion models achieved impressive results on several image generation tasks often outperforming GAN based models. Recently, the generative capabilities of diffusion models have been employed for perceptual image compression, such as in CDC. A major drawback of these diffusion-based methods is that, while producing impressive perceptual quality images they are dropping in fidelity/incr…
▽ More
Denoising diffusion models achieved impressive results on several image generation tasks often outperforming GAN based models. Recently, the generative capabilities of diffusion models have been employed for perceptual image compression, such as in CDC. A major drawback of these diffusion-based methods is that, while producing impressive perceptual quality images they are dropping in fidelity/increasing the distortion to the original uncompressed images when compared with other traditional or learned image compression schemes aiming for fidelity. In this paper, we propose a hybrid compression scheme optimized for perceptual quality, extending the approach of the CDC model with a decoder network in order to reduce the impact on distortion metrics such as PSNR. After using the decoder network to generate an initial image, optimized for distortion, the latent conditioned diffusion model refines the reconstruction for perceptual quality by predicting the residual. On standard benchmarks, we achieve up to +2dB PSNR fidelity improvements while maintaining comparable LPIPS and FID perceptual scores when compared with CDC. Additionally, the approach is easily extensible to video compression, where we achieve similar results.
△ Less
Submitted 19 May, 2025;
originally announced May 2025.
-
NTIRE 2025 Challenge on Efficient Burst HDR and Restoration: Datasets, Methods, and Results
Authors:
Sangmin Lee,
Eunpil Park,
Angel Canelo,
Hyunhee Park,
Youngjo Kim,
Hyung-Ju Chun,
Xin Jin,
Chongyi Li,
Chun-Le Guo,
Radu Timofte,
Qi Wu,
Tianheng Qiu,
Yuchun Dong,
Shenglin Ding,
Guanghua Pan,
Weiyu Zhou,
Tao Hu,
Yixu Feng,
Duwei Dai,
Yu Cao,
Peng Wu,
Wei Dong,
Yanning Zhang,
Qingsen Yan,
Simon J. Larsen
, et al. (11 additional authors not shown)
Abstract:
This paper reviews the NTIRE 2025 Efficient Burst HDR and Restoration Challenge, which aims to advance efficient multi-frame high dynamic range (HDR) and restoration techniques. The challenge is based on a novel RAW multi-frame fusion dataset, comprising nine noisy and misaligned RAW frames with various exposure levels per scene. Participants were tasked with developing solutions capable of effect…
▽ More
This paper reviews the NTIRE 2025 Efficient Burst HDR and Restoration Challenge, which aims to advance efficient multi-frame high dynamic range (HDR) and restoration techniques. The challenge is based on a novel RAW multi-frame fusion dataset, comprising nine noisy and misaligned RAW frames with various exposure levels per scene. Participants were tasked with developing solutions capable of effectively fusing these frames while adhering to strict efficiency constraints: fewer than 30 million model parameters and a computational budget under 4.0 trillion FLOPs. A total of 217 participants registered, with six teams finally submitting valid solutions. The top-performing approach achieved a PSNR of 43.22 dB, showcasing the potential of novel methods in this domain. This paper provides a comprehensive overview of the challenge, compares the proposed solutions, and serves as a valuable reference for researchers and practitioners in efficient burst HDR and restoration.
△ Less
Submitted 17 May, 2025;
originally announced May 2025.
-
NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement: Methods and Results
Authors:
Xin Li,
Kun Yuan,
Bingchen Li,
Fengbin Guan,
Yizhen Shao,
Zihao Yu,
Xijun Wang,
Yiting Lu,
Wei Luo,
Suhang Yao,
Ming Sun,
Chao Zhou,
Zhibo Chen,
Radu Timofte,
Yabin Zhang,
Ao-Xiang Zhang,
Tianwu Zhi,
Jianzhao Liu,
Yang Li,
Jingwen Xu,
Yiting Liao,
Yushen Zuo,
Mingyang Wu,
Renjie Li,
Shengyun Zhong
, et al. (88 additional authors not shown)
Abstract:
This paper presents a review for the NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement. The challenge comprises two tracks: (i) Efficient Video Quality Assessment (KVQ), and (ii) Diffusion-based Image Super-Resolution (KwaiSR). Track 1 aims to advance the development of lightweight and efficient video quality assessment (VQA) models, with an emphasis on eliminating re…
▽ More
This paper presents a review for the NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement. The challenge comprises two tracks: (i) Efficient Video Quality Assessment (KVQ), and (ii) Diffusion-based Image Super-Resolution (KwaiSR). Track 1 aims to advance the development of lightweight and efficient video quality assessment (VQA) models, with an emphasis on eliminating reliance on model ensembles, redundant weights, and other computationally expensive components in the previous IQA/VQA competitions. Track 2 introduces a new short-form UGC dataset tailored for single image super-resolution, i.e., the KwaiSR dataset. It consists of 1,800 synthetically generated S-UGC image pairs and 1,900 real-world S-UGC images, which are split into training, validation, and test sets using a ratio of 8:1:1. The primary objective of the challenge is to drive research that benefits the user experience of short-form UGC platforms such as Kwai and TikTok. This challenge attracted 266 participants and received 18 valid final submissions with corresponding fact sheets, significantly contributing to the progress of short-form UGC VQA and image superresolution. The project is publicly available at https://github.com/lixinustc/KVQE- ChallengeCVPR-NTIRE2025.
△ Less
Submitted 17 April, 2025;
originally announced April 2025.
-
NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images: Methods and Results
Authors:
Xin Li,
Yeying Jin,
Xin Jin,
Zongwei Wu,
Bingchen Li,
Yufei Wang,
Wenhan Yang,
Yu Li,
Zhibo Chen,
Bihan Wen,
Robby T. Tan,
Radu Timofte,
Qiyu Rong,
Hongyuan Jing,
Mengmeng Zhang,
Jinglong Li,
Xiangyu Lu,
Yi Ren,
Yuting Liu,
Meng Zhang,
Xiang Chen,
Qiyuan Guan,
Jiangxin Dong,
Jinshan Pan,
Conglin Gou
, et al. (112 additional authors not shown)
Abstract:
This paper reviews the NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images. This challenge received a wide range of impressive solutions, which are developed and evaluated using our collected real-world Raindrop Clarity dataset. Unlike existing deraining datasets, our Raindrop Clarity dataset is more diverse and challenging in degradation types and contents, which includ…
▽ More
This paper reviews the NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images. This challenge received a wide range of impressive solutions, which are developed and evaluated using our collected real-world Raindrop Clarity dataset. Unlike existing deraining datasets, our Raindrop Clarity dataset is more diverse and challenging in degradation types and contents, which includes day raindrop-focused, day background-focused, night raindrop-focused, and night background-focused degradations. This dataset is divided into three subsets for competition: 14,139 images for training, 240 images for validation, and 731 images for testing. The primary objective of this challenge is to establish a new and powerful benchmark for the task of removing raindrops under varying lighting and focus conditions. There are a total of 361 participants in the competition, and 32 teams submitting valid solutions and fact sheets for the final testing phase. These submissions achieved state-of-the-art (SOTA) performance on the Raindrop Clarity dataset. The project can be found at https://lixinustc.github.io/CVPR-NTIRE2025-RainDrop-Competition.github.io/.
△ Less
Submitted 19 April, 2025; v1 submitted 17 April, 2025;
originally announced April 2025.
-
The Tenth NTIRE 2025 Efficient Super-Resolution Challenge Report
Authors:
Bin Ren,
Hang Guo,
Lei Sun,
Zongwei Wu,
Radu Timofte,
Yawei Li,
Yao Zhang,
Xinning Chai,
Zhengxue Cheng,
Yingsheng Qin,
Yucai Yang,
Li Song,
Hongyuan Yu,
Pufan Xu,
Cheng Wan,
Zhijuan Huang,
Peng Guo,
Shuyuan Cui,
Chenjun Li,
Xuehai Hu,
Pan Pan,
Xin Zhang,
Heng Zhang,
Qing Luo,
Linyan Jiang
, et al. (122 additional authors not shown)
Abstract:
This paper presents a comprehensive review of the NTIRE 2025 Challenge on Single-Image Efficient Super-Resolution (ESR). The challenge aimed to advance the development of deep models that optimize key computational metrics, i.e., runtime, parameters, and FLOPs, while achieving a PSNR of at least 26.90 dB on the $\operatorname{DIV2K\_LSDIR\_valid}$ dataset and 26.99 dB on the…
▽ More
This paper presents a comprehensive review of the NTIRE 2025 Challenge on Single-Image Efficient Super-Resolution (ESR). The challenge aimed to advance the development of deep models that optimize key computational metrics, i.e., runtime, parameters, and FLOPs, while achieving a PSNR of at least 26.90 dB on the $\operatorname{DIV2K\_LSDIR\_valid}$ dataset and 26.99 dB on the $\operatorname{DIV2K\_LSDIR\_test}$ dataset. A robust participation saw \textbf{244} registered entrants, with \textbf{43} teams submitting valid entries. This report meticulously analyzes these methods and results, emphasizing groundbreaking advancements in state-of-the-art single-image ESR techniques. The analysis highlights innovative approaches and establishes benchmarks for future research in the field.
△ Less
Submitted 14 April, 2025;
originally announced April 2025.
-
CTSR: Controllable Fidelity-Realness Trade-off Distillation for Real-World Image Super Resolution
Authors:
Runyi Li,
Bin Chen,
Jian Zhang,
Radu Timofte
Abstract:
Real-world image super-resolution is a critical image processing task, where two key evaluation criteria are the fidelity to the original image and the visual realness of the generated results. Although existing methods based on diffusion models excel in visual realness by leveraging strong priors, they often struggle to achieve an effective balance between fidelity and realness. In our preliminar…
▽ More
Real-world image super-resolution is a critical image processing task, where two key evaluation criteria are the fidelity to the original image and the visual realness of the generated results. Although existing methods based on diffusion models excel in visual realness by leveraging strong priors, they often struggle to achieve an effective balance between fidelity and realness. In our preliminary experiments, we observe that a linear combination of multiple models outperforms individual models, motivating us to harness the strengths of different models for a more effective trade-off. Based on this insight, we propose a distillation-based approach that leverages the geometric decomposition of both fidelity and realness, alongside the performance advantages of multiple teacher models, to strike a more balanced trade-off. Furthermore, we explore the controllability of this trade-off, enabling a flexible and adjustable super-resolution process, which we call CTSR (Controllable Trade-off Super-Resolution). Experiments conducted on several real-world image super-resolution benchmarks demonstrate that our method surpasses existing state-of-the-art approaches, achieving superior performance across both fidelity and realness metrics.
△ Less
Submitted 19 March, 2025; v1 submitted 18 March, 2025;
originally announced March 2025.
-
AdaptSR: Low-Rank Adaptation for Efficient and Scalable Real-World Super-Resolution
Authors:
Cansu Korkmaz,
Nancy Mehta,
Radu Timofte
Abstract:
Recovering high-frequency details and textures from low-resolution images remains a fundamental challenge in super-resolution (SR), especially when real-world degradations are complex and unknown. While GAN-based methods enhance realism, they suffer from training instability and introduce unnatural artifacts. Diffusion models, though promising, demand excessive computational resources, often requi…
▽ More
Recovering high-frequency details and textures from low-resolution images remains a fundamental challenge in super-resolution (SR), especially when real-world degradations are complex and unknown. While GAN-based methods enhance realism, they suffer from training instability and introduce unnatural artifacts. Diffusion models, though promising, demand excessive computational resources, often requiring multiple GPU days, even for single-step variants. Rather than naively fine-tuning entire models or adopting unstable generative approaches, we introduce AdaptSR, a low-rank adaptation (LoRA) framework that efficiently repurposes bicubic-trained SR models for real-world tasks. AdaptSR leverages architecture-specific insights and selective layer updates to optimize real SR adaptation. By updating only lightweight LoRA layers while keeping the pretrained backbone intact, it captures domain-specific adjustments without adding inference cost, as the adapted layers merge seamlessly post-training. This efficient adaptation not only reduces memory and compute requirements but also makes real-world SR feasible on lightweight hardware. Our experiments demonstrate that AdaptSR outperforms GAN and diffusion-based SR methods by up to 4 dB in PSNR and 2% in perceptual scores on real SR benchmarks. More impressively, it matches or exceeds full model fine-tuning while training 92% fewer parameters, enabling rapid adaptation to real SR tasks within minutes.
△ Less
Submitted 10 March, 2025;
originally announced March 2025.
-
INRetouch: Context Aware Implicit Neural Representation for Photography Retouching
Authors:
Omar Elezabi,
Marcos V. Conde,
Zongwei Wu,
Radu Timofte
Abstract:
Professional photo editing remains challenging, requiring extensive knowledge of imaging pipelines and significant expertise. While recent deep learning approaches, particularly style transfer methods, have attempted to automate this process, they often struggle with output fidelity, editing control, and complex retouching capabilities. We propose a novel retouch transfer approach that learns from…
▽ More
Professional photo editing remains challenging, requiring extensive knowledge of imaging pipelines and significant expertise. While recent deep learning approaches, particularly style transfer methods, have attempted to automate this process, they often struggle with output fidelity, editing control, and complex retouching capabilities. We propose a novel retouch transfer approach that learns from professional edits through before-after image pairs, enabling precise replication of complex editing operations. We develop a context-aware Implicit Neural Representation that learns to apply edits adaptively based on image content and context, and is capable of learning from a single example. Our method extracts implicit transformations from reference edits and adaptively applies them to new images. To facilitate this research direction, we introduce a comprehensive Photo Retouching Dataset comprising 100,000 high-quality images edited using over 170 professional Adobe Lightroom presets. Through extensive evaluation, we demonstrate that our approach not only surpasses existing methods in photo retouching but also enhances performance in related image reconstruction tasks like Gamut Mapping and Raw Reconstruction. By bridging the gap between professional editing capabilities and automated solutions, our work presents a significant step toward making sophisticated photo editing more accessible while maintaining high-fidelity results. Check the Project Page at https://omaralezaby.github.io/inretouch for more Results and information about Code and Dataset availability.
△ Less
Submitted 6 April, 2025; v1 submitted 4 December, 2024;
originally announced December 2024.
-
Enhanced Super-Resolution Training via Mimicked Alignment for Real-World Scenes
Authors:
Omar Elezabi,
Zongwei Wu,
Radu Timofte
Abstract:
Image super-resolution methods have made significant strides with deep learning techniques and ample training data. However, they face challenges due to inherent misalignment between low-resolution (LR) and high-resolution (HR) pairs in real-world datasets. In this study, we propose a novel plug-and-play module designed to mitigate these misalignment issues by aligning LR inputs with HR images dur…
▽ More
Image super-resolution methods have made significant strides with deep learning techniques and ample training data. However, they face challenges due to inherent misalignment between low-resolution (LR) and high-resolution (HR) pairs in real-world datasets. In this study, we propose a novel plug-and-play module designed to mitigate these misalignment issues by aligning LR inputs with HR images during training. Specifically, our approach involves mimicking a novel LR sample that aligns with HR while preserving the degradation characteristics of the original LR samples. This module seamlessly integrates with any SR model, enhancing robustness against misalignment. Importantly, it can be easily removed during inference, therefore without introducing any parameters on the conventional SR models. We comprehensively evaluate our method on synthetic and real-world datasets, demonstrating its effectiveness across a spectrum of SR models, including traditional CNNs and state-of-the-art Transformers. The source codes will be publicly made available at https://github.com/omarAlezaby/Mimicked_Ali .
△ Less
Submitted 7 October, 2024;
originally announced October 2024.
-
AIM 2024 Challenge on Video Super-Resolution Quality Assessment: Methods and Results
Authors:
Ivan Molodetskikh,
Artem Borisov,
Dmitriy Vatolin,
Radu Timofte,
Jianzhao Liu,
Tianwu Zhi,
Yabin Zhang,
Yang Li,
Jingwen Xu,
Yiting Liao,
Qing Luo,
Ao-Xiang Zhang,
Peng Zhang,
Haibo Lei,
Linyan Jiang,
Yaqing Li,
Yuqin Cao,
Wei Sun,
Weixia Zhang,
Yinan Sun,
Ziheng Jia,
Yuxin Zhu,
Xiongkuo Min,
Guangtao Zhai,
Weihua Luo
, et al. (2 additional authors not shown)
Abstract:
This paper presents the Video Super-Resolution (SR) Quality Assessment (QA) Challenge that was part of the Advances in Image Manipulation (AIM) workshop, held in conjunction with ECCV 2024. The task of this challenge was to develop an objective QA method for videos upscaled 2x and 4x by modern image- and video-SR algorithms. QA methods were evaluated by comparing their output with aggregate subjec…
▽ More
This paper presents the Video Super-Resolution (SR) Quality Assessment (QA) Challenge that was part of the Advances in Image Manipulation (AIM) workshop, held in conjunction with ECCV 2024. The task of this challenge was to develop an objective QA method for videos upscaled 2x and 4x by modern image- and video-SR algorithms. QA methods were evaluated by comparing their output with aggregate subjective scores collected from >150,000 pairwise votes obtained through crowd-sourced comparisons across 52 SR methods and 1124 upscaled videos. The goal was to advance the state-of-the-art in SR QA, which had proven to be a challenging problem with limited applicability of traditional QA methods. The challenge had 29 registered participants, and 5 teams had submitted their final results, all outperforming the current state-of-the-art. All data, including the private test subset, has been made publicly available on the challenge homepage at https://challenges.videoprocessing.ai/challenges/super-resolution-metrics-challenge.html
△ Less
Submitted 5 October, 2024;
originally announced October 2024.
-
Toward Efficient Deep Blind RAW Image Restoration
Authors:
Marcos V. Conde,
Florin Vasluianu,
Radu Timofte
Abstract:
Multiple low-vision tasks such as denoising, deblurring and super-resolution depart from RGB images and further reduce the degradations, improving the quality. However, modeling the degradations in the sRGB domain is complicated because of the Image Signal Processor (ISP) transformations. Despite of this known issue, very few methods in the literature work directly with sensor RAW images. In this…
▽ More
Multiple low-vision tasks such as denoising, deblurring and super-resolution depart from RGB images and further reduce the degradations, improving the quality. However, modeling the degradations in the sRGB domain is complicated because of the Image Signal Processor (ISP) transformations. Despite of this known issue, very few methods in the literature work directly with sensor RAW images. In this work we tackle image restoration directly in the RAW domain. We design a new realistic degradation pipeline for training deep blind RAW restoration models. Our pipeline considers realistic sensor noise, motion blur, camera shake, and other common degradations. The models trained with our pipeline and data from multiple sensors, can successfully reduce noise and blur, and recover details in RAW images captured from different cameras. To the best of our knowledge, this is the most exhaustive analysis on RAW image restoration. Code available at https://github.com/mv-lab/AISP
△ Less
Submitted 26 September, 2024;
originally announced September 2024.
-
AIM 2024 Challenge on Efficient Video Super-Resolution for AV1 Compressed Content
Authors:
Marcos V Conde,
Zhijun Lei,
Wen Li,
Christos Bampis,
Ioannis Katsavounidis,
Radu Timofte
Abstract:
Video super-resolution (VSR) is a critical task for enhancing low-bitrate and low-resolution videos, particularly in streaming applications. While numerous solutions have been developed, they often suffer from high computational demands, resulting in low frame rates (FPS) and poor power efficiency, especially on mobile platforms. In this work, we compile different methods to address these challeng…
▽ More
Video super-resolution (VSR) is a critical task for enhancing low-bitrate and low-resolution videos, particularly in streaming applications. While numerous solutions have been developed, they often suffer from high computational demands, resulting in low frame rates (FPS) and poor power efficiency, especially on mobile platforms. In this work, we compile different methods to address these challenges, the solutions are end-to-end real-time video super-resolution frameworks optimized for both high performance and low runtime. We also introduce a new test set of high-quality 4K videos to further validate the approaches. The proposed solutions tackle video up-scaling for two applications: 540p to 4K (x4) as a general case, and 360p to 1080p (x3) more tailored towards mobile devices. In both tracks, the solutions have a reduced number of parameters and operations (MACs), allow high FPS, and improve VMAF and PSNR over interpolation baselines. This report gauges some of the most efficient video super-resolution methods to date.
△ Less
Submitted 25 September, 2024;
originally announced September 2024.
-
Streaming Neural Images
Authors:
Marcos V. Conde,
Andy Bigos,
Radu Timofte
Abstract:
Implicit Neural Representations (INRs) are a novel paradigm for signal representation that have attracted considerable interest for image compression. INRs offer unprecedented advantages in signal resolution and memory efficiency, enabling new possibilities for compression techniques. However, the existing limitations of INRs for image compression have not been sufficiently addressed in the litera…
▽ More
Implicit Neural Representations (INRs) are a novel paradigm for signal representation that have attracted considerable interest for image compression. INRs offer unprecedented advantages in signal resolution and memory efficiency, enabling new possibilities for compression techniques. However, the existing limitations of INRs for image compression have not been sufficiently addressed in the literature. In this work, we explore the critical yet overlooked limiting factors of INRs, such as computational cost, unstable performance, and robustness. Through extensive experiments and empirical analysis, we provide a deeper and more nuanced understanding of implicit neural image compression methods such as Fourier Feature Networks and Siren. Our work also offers valuable insights for future research in this area.
△ Less
Submitted 25 September, 2024;
originally announced September 2024.
-
Compressed Depth Map Super-Resolution and Restoration: AIM 2024 Challenge Results
Authors:
Marcos V. Conde,
Florin-Alexandru Vasluianu,
Jinhui Xiong,
Wei Ye,
Rakesh Ranjan,
Radu Timofte
Abstract:
The increasing demand for augmented reality (AR) and virtual reality (VR) applications highlights the need for efficient depth information processing. Depth maps, essential for rendering realistic scenes and supporting advanced functionalities, are typically large and challenging to stream efficiently due to their size. This challenge introduces a focus on developing innovative depth upsampling te…
▽ More
The increasing demand for augmented reality (AR) and virtual reality (VR) applications highlights the need for efficient depth information processing. Depth maps, essential for rendering realistic scenes and supporting advanced functionalities, are typically large and challenging to stream efficiently due to their size. This challenge introduces a focus on developing innovative depth upsampling techniques to reconstruct high-quality depth maps from compressed data. These techniques are crucial for overcoming the limitations posed by depth compression, which often degrades quality, loses scene details and introduces artifacts. By enhancing depth upsampling methods, this challenge aims to improve the efficiency and quality of depth map reconstruction. Our goal is to advance the state-of-the-art in depth processing technologies, thereby enhancing the overall user experience in AR and VR applications.
△ Less
Submitted 24 September, 2024;
originally announced September 2024.
-
AIM 2024 Challenge on Compressed Video Quality Assessment: Methods and Results
Authors:
Maksim Smirnov,
Aleksandr Gushchin,
Anastasia Antsiferova,
Dmitry Vatolin,
Radu Timofte,
Ziheng Jia,
Zicheng Zhang,
Wei Sun,
Jiaying Qian,
Yuqin Cao,
Yinan Sun,
Yuxin Zhu,
Xiongkuo Min,
Guangtao Zhai,
Kanjar De,
Qing Luo,
Ao-Xiang Zhang,
Peng Zhang,
Haibo Lei,
Linyan Jiang,
Yaqing Li,
Wenhui Meng,
Zhenzhong Chen,
Zhengxue Cheng,
Jiahao Xiao
, et al. (7 additional authors not shown)
Abstract:
Video quality assessment (VQA) is a crucial task in the development of video compression standards, as it directly impacts the viewer experience. This paper presents the results of the Compressed Video Quality Assessment challenge, held in conjunction with the Advances in Image Manipulation (AIM) workshop at ECCV 2024. The challenge aimed to evaluate the performance of VQA methods on a diverse dat…
▽ More
Video quality assessment (VQA) is a crucial task in the development of video compression standards, as it directly impacts the viewer experience. This paper presents the results of the Compressed Video Quality Assessment challenge, held in conjunction with the Advances in Image Manipulation (AIM) workshop at ECCV 2024. The challenge aimed to evaluate the performance of VQA methods on a diverse dataset of 459 videos, encoded with 14 codecs of various compression standards (AVC/H.264, HEVC/H.265, AV1, and VVC/H.266) and containing a comprehensive collection of compression artifacts. To measure the methods performance, we employed traditional correlation coefficients between their predictions and subjective scores, which were collected via large-scale crowdsourced pairwise human comparisons. For training purposes, participants were provided with the Compressed Video Quality Assessment Dataset (CVQAD), a previously developed dataset of 1022 videos. Up to 30 participating teams registered for the challenge, while we report the results of 6 teams, which submitted valid final solutions and code for reproducing the results. Moreover, we calculated and present the performance of state-of-the-art VQA methods on the developed dataset, providing a comprehensive benchmark for future research. The dataset, results, and online leaderboard are publicly available at https://challenges.videoprocessing.ai/challenges/compressedvideo-quality-assessment.html.
△ Less
Submitted 22 October, 2024; v1 submitted 21 August, 2024;
originally announced August 2024.
-
Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer
Authors:
Maxime Burchi,
Krishna C. Puvvada,
Jagadeesh Balam,
Boris Ginsburg,
Radu Timofte
Abstract:
Humans are adept at leveraging visual cues from lip movements for recognizing speech in adverse listening conditions. Audio-Visual Speech Recognition (AVSR) models follow similar approach to achieve robust speech recognition in noisy conditions. In this work, we present a multilingual AVSR model incorporating several enhancements to improve performance and audio noise robustness. Notably, we adapt…
▽ More
Humans are adept at leveraging visual cues from lip movements for recognizing speech in adverse listening conditions. Audio-Visual Speech Recognition (AVSR) models follow similar approach to achieve robust speech recognition in noisy conditions. In this work, we present a multilingual AVSR model incorporating several enhancements to improve performance and audio noise robustness. Notably, we adapt the recently proposed Fast Conformer model to process both audio and visual modalities using a novel hybrid CTC/RNN-T architecture. We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets (VoxCeleb2 and AVSpeech). Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching WER of 0.8%. On the recently introduced MuAViC benchmark, our model yields an absolute average-WER reduction of 11.9% in comparison to the original baseline. Finally, we demonstrate the ability of the proposed model to perform audio-only, visual-only, and audio-visual speech recognition at test time.
△ Less
Submitted 13 March, 2024;
originally announced May 2024.
-
NTIRE 2024 Restore Any Image Model (RAIM) in the Wild Challenge
Authors:
Jie Liang,
Radu Timofte,
Qiaosi Yi,
Shuaizheng Liu,
Lingchen Sun,
Rongyuan Wu,
Xindong Zhang,
Hui Zeng,
Lei Zhang
Abstract:
In this paper, we review the NTIRE 2024 challenge on Restore Any Image Model (RAIM) in the Wild. The RAIM challenge constructed a benchmark for image restoration in the wild, including real-world images with/without reference ground truth in various scenarios from real applications. The participants were required to restore the real-captured images from complex and unknown degradation, where gener…
▽ More
In this paper, we review the NTIRE 2024 challenge on Restore Any Image Model (RAIM) in the Wild. The RAIM challenge constructed a benchmark for image restoration in the wild, including real-world images with/without reference ground truth in various scenarios from real applications. The participants were required to restore the real-captured images from complex and unknown degradation, where generative perceptual quality and fidelity are desired in the restoration result. The challenge consisted of two tasks. Task one employed real referenced data pairs, where quantitative evaluation is available. Task two used unpaired images, and a comprehensive user study was conducted. The challenge attracted more than 200 registrations, where 39 of them submitted results with more than 400 submissions. Top-ranked methods improved the state-of-the-art restoration performance and obtained unanimous recognition from all 18 judges. The proposed datasets are available at https://drive.google.com/file/d/1DqbxUoiUqkAIkExu3jZAqoElr_nu1IXb/view?usp=sharing and the homepage of this challenge is at https://codalab.lisn.upsaclay.fr/competitions/17632.
△ Less
Submitted 16 May, 2024;
originally announced May 2024.
-
Real-Time 4K Super-Resolution of Compressed AVIF Images. AIS 2024 Challenge Survey
Authors:
Marcos V. Conde,
Zhijun Lei,
Wen Li,
Cosmin Stejerean,
Ioannis Katsavounidis,
Radu Timofte,
Kihwan Yoon,
Ganzorig Gankhuyag,
Jiangtao Lv,
Long Sun,
Jinshan Pan,
Jiangxin Dong,
Jinhui Tang,
Zhiyuan Li,
Hao Wei,
Chenyang Ge,
Dongyang Zhang,
Tianle Liu,
Huaian Chen,
Yi Jin,
Menghan Zhou,
Yiqiang Yan,
Si Gao,
Biao Wu,
Shaoli Liu
, et al. (50 additional authors not shown)
Abstract:
This paper introduces a novel benchmark as part of the AIS 2024 Real-Time Image Super-Resolution (RTSR) Challenge, which aims to upscale compressed images from 540p to 4K resolution (4x factor) in real-time on commercial GPUs. For this, we use a diverse test set containing a variety of 4K images ranging from digital art to gaming and photography. The images are compressed using the modern AVIF cod…
▽ More
This paper introduces a novel benchmark as part of the AIS 2024 Real-Time Image Super-Resolution (RTSR) Challenge, which aims to upscale compressed images from 540p to 4K resolution (4x factor) in real-time on commercial GPUs. For this, we use a diverse test set containing a variety of 4K images ranging from digital art to gaming and photography. The images are compressed using the modern AVIF codec, instead of JPEG. All the proposed methods improve PSNR fidelity over Lanczos interpolation, and process images under 10ms. Out of the 160 participants, 25 teams submitted their code and models. The solutions present novel designs tailored for memory-efficiency and runtime on edge devices. This survey describes the best solutions for real-time SR of compressed high-resolution images.
△ Less
Submitted 25 April, 2024;
originally announced April 2024.
-
Deep RAW Image Super-Resolution. A NTIRE 2024 Challenge Survey
Authors:
Marcos V. Conde,
Florin-Alexandru Vasluianu,
Radu Timofte,
Jianxing Zhang,
Jia Li,
Fan Wang,
Xiaopeng Li,
Zikun Liu,
Hyunhee Park,
Sejun Song,
Changho Kim,
Zhijuan Huang,
Hongyuan Yu,
Cheng Wan,
Wending Xiang,
Jiamin Lin,
Hang Zhong,
Qiaosong Zhang,
Yue Sun,
Xuanwu Yin,
Kunlong Zuo,
Senyan Xu,
Siyuan Jiang,
Zhijing Sun,
Jiaying Zhu
, et al. (10 additional authors not shown)
Abstract:
This paper reviews the NTIRE 2024 RAW Image Super-Resolution Challenge, highlighting the proposed solutions and results. New methods for RAW Super-Resolution could be essential in modern Image Signal Processing (ISP) pipelines, however, this problem is not as explored as in the RGB domain. Th goal of this challenge is to upscale RAW Bayer images by 2x, considering unknown degradations such as nois…
▽ More
This paper reviews the NTIRE 2024 RAW Image Super-Resolution Challenge, highlighting the proposed solutions and results. New methods for RAW Super-Resolution could be essential in modern Image Signal Processing (ISP) pipelines, however, this problem is not as explored as in the RGB domain. Th goal of this challenge is to upscale RAW Bayer images by 2x, considering unknown degradations such as noise and blur. In the challenge, a total of 230 participants registered, and 45 submitted results during thee challenge period. The performance of the top-5 submissions is reviewed and provided here as a gauge for the current state-of-the-art in RAW Image Super-Resolution.
△ Less
Submitted 24 April, 2024;
originally announced April 2024.
-
Simple Image Signal Processing using Global Context Guidance
Authors:
Omar Elezabi,
Marcos V. Conde,
Radu Timofte
Abstract:
In modern smartphone cameras, the Image Signal Processor (ISP) is the core element that converts the RAW readings from the sensor into perceptually pleasant RGB images for the end users. The ISP is typically proprietary and handcrafted and consists of several blocks such as white balance, color correction, and tone mapping. Deep learning-based ISPs aim to transform RAW images into DSLR-like RGB im…
▽ More
In modern smartphone cameras, the Image Signal Processor (ISP) is the core element that converts the RAW readings from the sensor into perceptually pleasant RGB images for the end users. The ISP is typically proprietary and handcrafted and consists of several blocks such as white balance, color correction, and tone mapping. Deep learning-based ISPs aim to transform RAW images into DSLR-like RGB images using deep neural networks. However, most learned ISPs are trained using patches (small regions) due to computational limitations. Such methods lack global context, which limits their efficacy on full-resolution images and harms their ability to capture global properties such as color constancy or illumination. First, we propose a novel module that can be integrated into any neural ISP to capture the global context information from the full RAW images. Second, we propose an efficient and simple neural ISP that utilizes our proposed module. Our model achieves state-of-the-art results on different benchmarks using diverse and real smartphone images.
△ Less
Submitted 25 September, 2024; v1 submitted 17 April, 2024;
originally announced April 2024.
-
NTIRE 2024 Challenge on Short-form UGC Video Quality Assessment: Methods and Results
Authors:
Xin Li,
Kun Yuan,
Yajing Pei,
Yiting Lu,
Ming Sun,
Chao Zhou,
Zhibo Chen,
Radu Timofte,
Wei Sun,
Haoning Wu,
Zicheng Zhang,
Jun Jia,
Zhichao Zhang,
Linhan Cao,
Qiubo Chen,
Xiongkuo Min,
Weisi Lin,
Guangtao Zhai,
Jianhui Sun,
Tianyi Wang,
Lei Li,
Han Kong,
Wenxuan Wang,
Bing Li,
Cheng Luo
, et al. (43 additional authors not shown)
Abstract:
This paper reviews the NTIRE 2024 Challenge on Shortform UGC Video Quality Assessment (S-UGC VQA), where various excellent solutions are submitted and evaluated on the collected dataset KVQ from popular short-form video platform, i.e., Kuaishou/Kwai Platform. The KVQ database is divided into three parts, including 2926 videos for training, 420 videos for validation, and 854 videos for testing. The…
▽ More
This paper reviews the NTIRE 2024 Challenge on Shortform UGC Video Quality Assessment (S-UGC VQA), where various excellent solutions are submitted and evaluated on the collected dataset KVQ from popular short-form video platform, i.e., Kuaishou/Kwai Platform. The KVQ database is divided into three parts, including 2926 videos for training, 420 videos for validation, and 854 videos for testing. The purpose is to build new benchmarks and advance the development of S-UGC VQA. The competition had 200 participants and 13 teams submitted valid solutions for the final testing phase. The proposed solutions achieved state-of-the-art performances for S-UGC VQA. The project can be found at https://github.com/lixinustc/KVQChallenge-CVPR-NTIRE2024.
△ Less
Submitted 17 April, 2024;
originally announced April 2024.
-
Rawformer: Unpaired Raw-to-Raw Translation for Learnable Camera ISPs
Authors:
Georgy Perevozchikov,
Nancy Mehta,
Mahmoud Afifi,
Radu Timofte
Abstract:
Modern smartphone camera quality heavily relies on the image signal processor (ISP) to enhance captured raw images, utilizing carefully designed modules to produce final output images encoded in a standard color space (e.g., sRGB). Neural-based end-to-end learnable ISPs offer promising advancements, potentially replacing traditional ISPs with their ability to adapt without requiring extensive tuni…
▽ More
Modern smartphone camera quality heavily relies on the image signal processor (ISP) to enhance captured raw images, utilizing carefully designed modules to produce final output images encoded in a standard color space (e.g., sRGB). Neural-based end-to-end learnable ISPs offer promising advancements, potentially replacing traditional ISPs with their ability to adapt without requiring extensive tuning for each new camera model, as is often the case for nearly every module in traditional ISPs. However, the key challenge with the recent learning-based ISPs is the urge to collect large paired datasets for each distinct camera model due to the influence of intrinsic camera characteristics on the formation of input raw images. This paper tackles this challenge by introducing a novel method for unpaired learning of raw-to-raw translation across diverse cameras. Specifically, we propose Rawformer, an unsupervised Transformer-based encoder-decoder method for raw-to-raw translation. It accurately maps raw images captured by a certain camera to the target camera, facilitating the generalization of learnable ISPs to new unseen cameras. Our method demonstrates superior performance on real camera datasets, achieving higher accuracy compared to previous state-of-the-art techniques, and preserving a more robust correlation between the original and translated raw images. The codes and the pretrained models are available at https://github.com/gosha20777/rawformer.
△ Less
Submitted 15 July, 2024; v1 submitted 16 April, 2024;
originally announced April 2024.
-
The Ninth NTIRE 2024 Efficient Super-Resolution Challenge Report
Authors:
Bin Ren,
Yawei Li,
Nancy Mehta,
Radu Timofte,
Hongyuan Yu,
Cheng Wan,
Yuxin Hong,
Bingnan Han,
Zhuoyuan Wu,
Yajun Zou,
Yuqing Liu,
Jizhe Li,
Keji He,
Chao Fan,
Heng Zhang,
Xiaolin Zhang,
Xuanwu Yin,
Kunlong Zuo,
Bohao Liao,
Peizhe Xia,
Long Peng,
Zhibo Du,
Xin Di,
Wangkai Li,
Yang Wang
, et al. (109 additional authors not shown)
Abstract:
This paper provides a comprehensive review of the NTIRE 2024 challenge, focusing on efficient single-image super-resolution (ESR) solutions and their outcomes. The task of this challenge is to super-resolve an input image with a magnification factor of x4 based on pairs of low and corresponding high-resolution images. The primary objective is to develop networks that optimize various aspects such…
▽ More
This paper provides a comprehensive review of the NTIRE 2024 challenge, focusing on efficient single-image super-resolution (ESR) solutions and their outcomes. The task of this challenge is to super-resolve an input image with a magnification factor of x4 based on pairs of low and corresponding high-resolution images. The primary objective is to develop networks that optimize various aspects such as runtime, parameters, and FLOPs, while still maintaining a peak signal-to-noise ratio (PSNR) of approximately 26.90 dB on the DIV2K_LSDIR_valid dataset and 26.99 dB on the DIV2K_LSDIR_test dataset. In addition, this challenge has 4 tracks including the main track (overall performance), sub-track 1 (runtime), sub-track 2 (FLOPs), and sub-track 3 (parameters). In the main track, all three metrics (ie runtime, FLOPs, and parameter count) were considered. The ranking of the main track is calculated based on a weighted sum-up of the scores of all other sub-tracks. In sub-track 1, the practical runtime performance of the submissions was evaluated, and the corresponding score was used to determine the ranking. In sub-track 2, the number of FLOPs was considered. The score calculated based on the corresponding FLOPs was used to determine the ranking. In sub-track 3, the number of parameters was considered. The score calculated based on the corresponding parameters was used to determine the ranking. RLFN is set as the baseline for efficiency measurement. The challenge had 262 registered participants, and 34 teams made valid submissions. They gauge the state-of-the-art in efficient single-image super-resolution. To facilitate the reproducibility of the challenge and enable other researchers to build upon these findings, the code and the pre-trained model of validated solutions are made publicly available at https://github.com/Amazingren/NTIRE2024_ESR/.
△ Less
Submitted 25 June, 2024; v1 submitted 16 April, 2024;
originally announced April 2024.
-
See More Details: Efficient Image Super-Resolution by Experts Mining
Authors:
Eduard Zamfir,
Zongwei Wu,
Nancy Mehta,
Yulun Zhang,
Radu Timofte
Abstract:
Reconstructing high-resolution (HR) images from low-resolution (LR) inputs poses a significant challenge in image super-resolution (SR). While recent approaches have demonstrated the efficacy of intricate operations customized for various objectives, the straightforward stacking of these disparate operations can result in a substantial computational burden, hampering their practical utility. In re…
▽ More
Reconstructing high-resolution (HR) images from low-resolution (LR) inputs poses a significant challenge in image super-resolution (SR). While recent approaches have demonstrated the efficacy of intricate operations customized for various objectives, the straightforward stacking of these disparate operations can result in a substantial computational burden, hampering their practical utility. In response, we introduce SeemoRe, an efficient SR model employing expert mining. Our approach strategically incorporates experts at different levels, adopting a collaborative methodology. At the macro scale, our experts address rank-wise and spatial-wise informative features, providing a holistic understanding. Subsequently, the model delves into the subtleties of rank choice by leveraging a mixture of low-rank experts. By tapping into experts specialized in distinct key factors crucial for accurate SR, our model excels in uncovering intricate intra-feature details. This collaborative approach is reminiscent of the concept of "see more", allowing our model to achieve an optimal performance with minimal computational costs in efficient settings. The source will be publicly made available at https://github.com/eduardzamfir/seemoredetails
△ Less
Submitted 6 June, 2024; v1 submitted 5 February, 2024;
originally announced February 2024.
-
InstructIR: High-Quality Image Restoration Following Human Instructions
Authors:
Marcos V. Conde,
Gregor Geigle,
Radu Timofte
Abstract:
Image restoration is a fundamental problem that involves recovering a high-quality clean image from its degraded observation. All-In-One image restoration models can effectively restore images from various types and levels of degradation using degradation-specific information as prompts to guide the restoration model. In this work, we present the first approach that uses human-written instructions…
▽ More
Image restoration is a fundamental problem that involves recovering a high-quality clean image from its degraded observation. All-In-One image restoration models can effectively restore images from various types and levels of degradation using degradation-specific information as prompts to guide the restoration model. In this work, we present the first approach that uses human-written instructions to guide the image restoration model. Given natural language prompts, our model can recover high-quality images from their degraded counterparts, considering multiple degradation types. Our method, InstructIR, achieves state-of-the-art results on several restoration tasks including image denoising, deraining, deblurring, dehazing, and (low-light) image enhancement. InstructIR improves +1dB over previous all-in-one restoration methods. Moreover, our dataset and results represent a novel benchmark for new research on text-guided image restoration and enhancement. Our code, datasets and models are available at: https://github.com/mv-lab/InstructIR
△ Less
Submitted 25 September, 2024; v1 submitted 29 January, 2024;
originally announced January 2024.
-
BSRAW: Improving Blind RAW Image Super-Resolution
Authors:
Marcos V. Conde,
Florin Vasluianu,
Radu Timofte
Abstract:
In smartphones and compact cameras, the Image Signal Processor (ISP) transforms the RAW sensor image into a human-readable sRGB image. Most popular super-resolution methods depart from a sRGB image and upscale it further, improving its quality. However, modeling the degradations in the sRGB domain is complicated because of the non-linear ISP transformations. Despite this known issue, only a few me…
▽ More
In smartphones and compact cameras, the Image Signal Processor (ISP) transforms the RAW sensor image into a human-readable sRGB image. Most popular super-resolution methods depart from a sRGB image and upscale it further, improving its quality. However, modeling the degradations in the sRGB domain is complicated because of the non-linear ISP transformations. Despite this known issue, only a few methods work directly with RAW images and tackle real-world sensor degradations. We tackle blind image super-resolution in the RAW domain. We design a realistic degradation pipeline tailored specifically for training models with raw sensor data. Our approach considers sensor noise, defocus, exposure, and other common issues. Our BSRAW models trained with our pipeline can upscale real-scene RAW images and improve their quality. As part of this effort, we also present a new DSLM dataset and benchmark for this task.
△ Less
Submitted 24 December, 2023;
originally announced December 2023.
-
MoVideo: Motion-Aware Video Generation with Diffusion Models
Authors:
Jingyun Liang,
Yuchen Fan,
Kai Zhang,
Radu Timofte,
Luc Van Gool,
Rakesh Ranjan
Abstract:
While recent years have witnessed great progress on using diffusion models for video generation, most of them are simple extensions of image generation frameworks, which fail to explicitly consider one of the key differences between videos and images, i.e., motion. In this paper, we propose a novel motion-aware video generation (MoVideo) framework that takes motion into consideration from two aspe…
▽ More
While recent years have witnessed great progress on using diffusion models for video generation, most of them are simple extensions of image generation frameworks, which fail to explicitly consider one of the key differences between videos and images, i.e., motion. In this paper, we propose a novel motion-aware video generation (MoVideo) framework that takes motion into consideration from two aspects: video depth and optical flow. The former regulates motion by per-frame object distances and spatial layouts, while the later describes motion by cross-frame correspondences that help in preserving fine details and improving temporal consistency. More specifically, given a key frame that exists or generated from text prompts, we first design a diffusion model with spatio-temporal modules to generate the video depth and the corresponding optical flows. Then, the video is generated in the latent space by another spatio-temporal diffusion model under the guidance of depth, optical flow-based warped latent video and the calculated occlusion mask. Lastly, we use optical flows again to align and refine different frames for better video decoding from the latent space to the pixel space. In experiments, MoVideo achieves state-of-the-art results in both text-to-video and image-to-video generation, showing promising prompt consistency, frame consistency and visual quality.
△ Less
Submitted 29 July, 2024; v1 submitted 19 November, 2023;
originally announced November 2023.
-
A Study of Forward-Forward Algorithm for Self-Supervised Learning
Authors:
Jonas Brenig,
Radu Timofte
Abstract:
Self-supervised representation learning has seen remarkable progress in the last few years, with some of the recent methods being able to learn useful image representations without labels. These methods are trained using backpropagation, the de facto standard. Recently, Geoffrey Hinton proposed the forward-forward algorithm as an alternative training method. It utilizes two forward passes and a se…
▽ More
Self-supervised representation learning has seen remarkable progress in the last few years, with some of the recent methods being able to learn useful image representations without labels. These methods are trained using backpropagation, the de facto standard. Recently, Geoffrey Hinton proposed the forward-forward algorithm as an alternative training method. It utilizes two forward passes and a separate loss function for each layer to train the network without backpropagation.
In this study, for the first time, we study the performance of forward-forward vs. backpropagation for self-supervised representation learning and provide insights into the learned representation spaces. Our benchmark employs four standard datasets, namely MNIST, F-MNIST, SVHN and CIFAR-10, and three commonly used self-supervised representation learning techniques, namely rotation, flip and jigsaw.
Our main finding is that while the forward-forward algorithm performs comparably to backpropagation during (self-)supervised training, the transfer performance is significantly lagging behind in all the studied settings. This may be caused by a combination of factors, including having a loss function for each layer and the way the supervised training is realized in the forward-forward paradigm. In comparison to backpropagation, the forward-forward algorithm focuses more on the boundaries and drops part of the information unnecessary for making decisions which harms the representation learning goal. Further investigation and research are necessary to stabilize the forward-forward strategy for self-supervised learning, to work beyond the datasets and configurations demonstrated by Geoffrey Hinton.
△ Less
Submitted 8 December, 2023; v1 submitted 21 September, 2023;
originally announced September 2023.
-
Neural Gradient Regularizer
Authors:
Shuang Xu,
Yifan Wang,
Zixiang Zhao,
Jiangjun Peng,
Xiangyong Cao,
Deyu Meng,
Yulun Zhang,
Radu Timofte,
Luc Van Gool
Abstract:
Owing to its significant success, the prior imposed on gradient maps has consistently been a subject of great interest in the field of image processing. Total variation (TV), one of the most representative regularizers, is known for its ability to capture the intrinsic sparsity prior underlying gradient maps. Nonetheless, TV and its variants often underestimate the gradient maps, leading to the we…
▽ More
Owing to its significant success, the prior imposed on gradient maps has consistently been a subject of great interest in the field of image processing. Total variation (TV), one of the most representative regularizers, is known for its ability to capture the intrinsic sparsity prior underlying gradient maps. Nonetheless, TV and its variants often underestimate the gradient maps, leading to the weakening of edges and details whose gradients should not be zero in the original image (i.e., image structures is not describable by sparse priors of gradient maps). Recently, total deep variation (TDV) has been introduced, assuming the sparsity of feature maps, which provides a flexible regularization learned from large-scale datasets for a specific task. However, TDV requires to retrain the network with image/task variations, limiting its versatility. To alleviate this issue, in this paper, we propose a neural gradient regularizer (NGR) that expresses the gradient map as the output of a neural network. Unlike existing methods, NGR does not rely on any subjective sparsity or other prior assumptions on image gradient maps, thereby avoiding the underestimation of gradient maps. NGR is applicable to various image types and different image processing tasks, functioning in a zero-shot learning fashion, making it a versatile and plug-and-play regularizer. Extensive experimental results demonstrate the superior performance of NGR over state-of-the-art counterparts for a range of different tasks, further validating its effectiveness and versatility.
△ Less
Submitted 13 September, 2023; v1 submitted 31 August, 2023;
originally announced August 2023.
-
NTIRE 2023 Quality Assessment of Video Enhancement Challenge
Authors:
Xiaohong Liu,
Xiongkuo Min,
Wei Sun,
Yulun Zhang,
Kai Zhang,
Radu Timofte,
Guangtao Zhai,
Yixuan Gao,
Yuqin Cao,
Tengchuan Kou,
Yunlong Dong,
Ziheng Jia,
Yilin Li,
Wei Wu,
Shuming Hu,
Sibin Deng,
Pengxiang Xiao,
Ying Chen,
Kai Li,
Kai Zhao,
Kun Yuan,
Ming Sun,
Heng Cong,
Hao Wang,
Lingzhi Fu
, et al. (47 additional authors not shown)
Abstract:
This paper reports on the NTIRE 2023 Quality Assessment of Video Enhancement Challenge, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2023. This challenge is to address a major challenge in the field of video processing, namely, video quality assessment (VQA) for enhanced videos. The challenge uses the VQA Dataset for Perceptual…
▽ More
This paper reports on the NTIRE 2023 Quality Assessment of Video Enhancement Challenge, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2023. This challenge is to address a major challenge in the field of video processing, namely, video quality assessment (VQA) for enhanced videos. The challenge uses the VQA Dataset for Perceptual Video Enhancement (VDPVE), which has a total of 1211 enhanced videos, including 600 videos with color, brightness, and contrast enhancements, 310 videos with deblurring, and 301 deshaked videos. The challenge has a total of 167 registered participants. 61 participating teams submitted their prediction results during the development phase, with a total of 3168 submissions. A total of 176 submissions were submitted by 37 participating teams during the final testing phase. Finally, 19 participating teams submitted their models and fact sheets, and detailed the methods they used. Some methods have achieved better results than baseline methods, and the winning methods have demonstrated superior prediction performance.
△ Less
Submitted 18 July, 2023;
originally announced July 2023.
-
Denoising Diffusion Models for Plug-and-Play Image Restoration
Authors:
Yuanzhi Zhu,
Kai Zhang,
Jingyun Liang,
Jiezhang Cao,
Bihan Wen,
Radu Timofte,
Luc Van Gool
Abstract:
Plug-and-play Image Restoration (IR) has been widely recognized as a flexible and interpretable method for solving various inverse problems by utilizing any off-the-shelf denoiser as the implicit image prior. However, most existing methods focus on discriminative Gaussian denoisers. Although diffusion models have shown impressive performance for high-quality image synthesis, their potential to ser…
▽ More
Plug-and-play Image Restoration (IR) has been widely recognized as a flexible and interpretable method for solving various inverse problems by utilizing any off-the-shelf denoiser as the implicit image prior. However, most existing methods focus on discriminative Gaussian denoisers. Although diffusion models have shown impressive performance for high-quality image synthesis, their potential to serve as a generative denoiser prior to the plug-and-play IR methods remains to be further explored. While several other attempts have been made to adopt diffusion models for image restoration, they either fail to achieve satisfactory results or typically require an unacceptable number of Neural Function Evaluations (NFEs) during inference. This paper proposes DiffPIR, which integrates the traditional plug-and-play method into the diffusion sampling framework. Compared to plug-and-play IR methods that rely on discriminative Gaussian denoisers, DiffPIR is expected to inherit the generative ability of diffusion models. Experimental results on three representative IR tasks, including super-resolution, image deblurring, and inpainting, demonstrate that DiffPIR achieves state-of-the-art performance on both the FFHQ and ImageNet datasets in terms of reconstruction faithfulness and perceptual quality with no more than 100 NFEs. The source code is available at {\url{https://github.com/yuanzhi-zhu/DiffPIR}}
△ Less
Submitted 15 May, 2023;
originally announced May 2023.
-
Audio-Visual Efficient Conformer for Robust Speech Recognition
Authors:
Maxime Burchi,
Radu Timofte
Abstract:
End-to-end Automatic Speech Recognition (ASR) systems based on neural networks have seen large improvements in recent years. The availability of large scale hand-labeled datasets and sufficient computing resources made it possible to train powerful deep neural networks, reaching very low Word Error Rate (WER) on academic benchmarks. However, despite impressive performance on clean audio samples, a…
▽ More
End-to-end Automatic Speech Recognition (ASR) systems based on neural networks have seen large improvements in recent years. The availability of large scale hand-labeled datasets and sufficient computing resources made it possible to train powerful deep neural networks, reaching very low Word Error Rate (WER) on academic benchmarks. However, despite impressive performance on clean audio samples, a drop of performance is often observed on noisy speech. In this work, we propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification (CTC)-based architecture by processing both audio and visual modalities. We improve previous lip reading methods using an Efficient Conformer back-end on top of a ResNet-18 visual front-end and by adding intermediate CTC losses between blocks. We condition intermediate block features on early predictions using Inter CTC residual modules to relax the conditional independence assumption of CTC-based models. We also replace the Efficient Conformer grouped attention by a more efficient and simpler attention mechanism that we call patch attention. We experiment with publicly available Lip Reading Sentences 2 (LRS2) and Lip Reading Sentences 3 (LRS3) datasets. Our experiments show that using audio and visual modalities allows to better recognize speech in the presence of environmental noise and significantly accelerate training, reaching lower WER with 4 times less training steps. Our Audio-Visual Efficient Conformer (AVEC) model achieves state-of-the-art performance, reaching WER of 2.3% and 1.8% on LRS2 and LRS3 test sets. Code and pretrained models are available at https://github.com/burchim/AVEC.
△ Less
Submitted 4 January, 2023;
originally announced January 2023.
-
Knowledge Distillation based Degradation Estimation for Blind Super-Resolution
Authors:
Bin Xia,
Yulun Zhang,
Yitong Wang,
Yapeng Tian,
Wenming Yang,
Radu Timofte,
Luc Van Gool
Abstract:
Blind image super-resolution (Blind-SR) aims to recover a high-resolution (HR) image from its corresponding low-resolution (LR) input image with unknown degradations. Most of the existing works design an explicit degradation estimator for each degradation to guide SR. However, it is infeasible to provide concrete labels of multiple degradation combinations (e.g., blur, noise, jpeg compression) to…
▽ More
Blind image super-resolution (Blind-SR) aims to recover a high-resolution (HR) image from its corresponding low-resolution (LR) input image with unknown degradations. Most of the existing works design an explicit degradation estimator for each degradation to guide SR. However, it is infeasible to provide concrete labels of multiple degradation combinations (e.g., blur, noise, jpeg compression) to supervise the degradation estimator training. In addition, these special designs for certain degradation, such as blur, impedes the models from being generalized to handle different degradations. To this end, it is necessary to design an implicit degradation estimator that can extract discriminative degradation representation for all degradations without relying on the supervision of degradation ground-truth. In this paper, we propose a Knowledge Distillation based Blind-SR network (KDSR). It consists of a knowledge distillation based implicit degradation estimator network (KD-IDE) and an efficient SR network. To learn the KDSR model, we first train a teacher network: KD-IDE$_{T}$. It takes paired HR and LR patches as inputs and is optimized with the SR network jointly. Then, we further train a student network KD-IDE$_{S}$, which only takes LR images as input and learns to extract the same implicit degradation representation (IDR) as KD-IDE$_{T}$. In addition, to fully use extracted IDR, we design a simple, strong, and efficient IDR based dynamic convolution residual block (IDR-DCRB) to build an SR network. We conduct extensive experiments under classic and real-world degradation settings. The results show that KDSR achieves SOTA performance and can generalize to various degradation processes. The source codes and pre-trained models will be released.
△ Less
Submitted 16 February, 2023; v1 submitted 30 November, 2022;
originally announced November 2022.
-
Real-Time Under-Display Cameras Image Restoration and HDR on Mobile Devices
Authors:
Marcos V. Conde,
Florin Vasluianu,
Sabari Nathan,
Radu Timofte
Abstract:
The new trend of full-screen devices implies positioning the camera behind the screen to bring a larger display-to-body ratio, enhance eye contact, and provide a notch-free viewing experience on smartphones, TV or tablets. On the other hand, the images captured by under-display cameras (UDCs) are degraded by the screen in front of them. Deep learning methods for image restoration can significantly…
▽ More
The new trend of full-screen devices implies positioning the camera behind the screen to bring a larger display-to-body ratio, enhance eye contact, and provide a notch-free viewing experience on smartphones, TV or tablets. On the other hand, the images captured by under-display cameras (UDCs) are degraded by the screen in front of them. Deep learning methods for image restoration can significantly reduce the degradation of captured images, providing satisfying results for the human eyes. However, most proposed solutions are unreliable or efficient enough to be used in real-time on mobile devices.
In this paper, we aim to solve this image restoration problem using efficient deep learning methods capable of processing FHD images in real-time on commercial smartphones while providing high-quality results. We propose a lightweight model for blind UDC Image Restoration and HDR, and we also provide a benchmark comparing the performance and runtime of different methods on smartphones. Our models are competitive on UDC benchmarks while using x4 less operations than others. To the best of our knowledge, we are the first work to approach and analyze this real-world single image restoration problem from the efficiency and production point of view.
△ Less
Submitted 25 November, 2022;
originally announced November 2022.
-
Advancing Learned Video Compression with In-loop Frame Prediction
Authors:
Ren Yang,
Radu Timofte,
Luc Van Gool
Abstract:
Recent years have witnessed an increasing interest in end-to-end learned video compression. Most previous works explore temporal redundancy by detecting and compressing a motion map to warp the reference frame towards the target frame. Yet, it failed to adequately take advantage of the historical priors in the sequential reference frames. In this paper, we propose an Advanced Learned Video Compres…
▽ More
Recent years have witnessed an increasing interest in end-to-end learned video compression. Most previous works explore temporal redundancy by detecting and compressing a motion map to warp the reference frame towards the target frame. Yet, it failed to adequately take advantage of the historical priors in the sequential reference frames. In this paper, we propose an Advanced Learned Video Compression (ALVC) approach with the in-loop frame prediction module, which is able to effectively predict the target frame from the previously compressed frames, without consuming any bit-rate. The predicted frame can serve as a better reference than the previously compressed frame, and therefore it benefits the compression performance. The proposed in-loop prediction module is a part of the end-to-end video compression and is jointly optimized in the whole framework. We propose the recurrent and the bi-directional in-loop prediction modules for compressing P-frames and B-frames, respectively. The experiments show the state-of-the-art performance of our ALVC approach in learned video compression. We also outperform the default hierarchical B mode of x265 in terms of PSNR and beat the slowest mode of the SSIM-tuned x265 on MS-SSIM. The project page: https://github.com/RenYang-home/ALVC.
△ Less
Submitted 18 November, 2022; v1 submitted 13 November, 2022;
originally announced November 2022.
-
MicroISP: Processing 32MP Photos on Mobile Devices with Deep Learning
Authors:
Andrey Ignatov,
Anastasia Sycheva,
Radu Timofte,
Yu Tseng,
Yu-Syuan Xu,
Po-Hsiang Yu,
Cheng-Ming Chiang,
Hsien-Kai Kuo,
Min-Hung Chen,
Chia-Ming Cheng,
Luc Van Gool
Abstract:
While neural networks-based photo processing solutions can provide a better image quality compared to the traditional ISP systems, their application to mobile devices is still very limited due to their very high computational complexity. In this paper, we present a novel MicroISP model designed specifically for edge devices, taking into account their computational and memory limitations. The propo…
▽ More
While neural networks-based photo processing solutions can provide a better image quality compared to the traditional ISP systems, their application to mobile devices is still very limited due to their very high computational complexity. In this paper, we present a novel MicroISP model designed specifically for edge devices, taking into account their computational and memory limitations. The proposed solution is capable of processing up to 32MP photos on recent smartphones using the standard mobile ML libraries and requiring less than 1 second to perform the inference, while for FullHD images it achieves real-time performance. The architecture of the model is flexible, allowing to adjust its complexity to devices of different computational power. To evaluate the performance of the model, we collected a novel Fujifilm UltraISP dataset consisting of thousands of paired photos captured with a normal mobile camera sensor and a professional 102MP medium-format FujiFilm GFX100 camera. The experiments demonstrated that, despite its compact size, the MicroISP model is able to provide comparable or better visual results than the traditional mobile ISP systems, while outperforming the previously proposed efficient deep learning based solutions. Finally, this model is also compatible with the latest mobile AI accelerators, achieving good runtime and low power consumption on smartphone NPUs and APUs. The code, dataset and pre-trained models are available on the project website: https://people.ee.ethz.ch/~ihnatova/microisp.html
△ Less
Submitted 8 November, 2022;
originally announced November 2022.
-
Realistic Bokeh Effect Rendering on Mobile GPUs, Mobile AI & AIM 2022 challenge: Report
Authors:
Andrey Ignatov,
Radu Timofte,
Jin Zhang,
Feng Zhang,
Gaocheng Yu,
Zhe Ma,
Hongbin Wang,
Minsu Kwon,
Haotian Qian,
Wentao Tong,
Pan Mu,
Ziping Wang,
Guangjing Yan,
Brian Lee,
Lei Fei,
Huaijin Chen,
Hyebin Cho,
Byeongjun Kwon,
Munchurl Kim,
Mingyang Qian,
Huixin Ma,
Yanan Li,
Xiaotao Wang,
Lei Lei
Abstract:
As mobile cameras with compact optics are unable to produce a strong bokeh effect, lots of interest is now devoted to deep learning-based solutions for this task. In this Mobile AI challenge, the target was to develop an efficient end-to-end AI-based bokeh effect rendering approach that can run on modern smartphone GPUs using TensorFlow Lite. The participants were provided with a large-scale EBB!…
▽ More
As mobile cameras with compact optics are unable to produce a strong bokeh effect, lots of interest is now devoted to deep learning-based solutions for this task. In this Mobile AI challenge, the target was to develop an efficient end-to-end AI-based bokeh effect rendering approach that can run on modern smartphone GPUs using TensorFlow Lite. The participants were provided with a large-scale EBB! bokeh dataset consisting of 5K shallow / wide depth-of-field image pairs captured using the Canon 7D DSLR camera. The runtime of the resulting models was evaluated on the Kirin 9000's Mali GPU that provides excellent acceleration results for the majority of common deep learning ops. A detailed description of all models developed in this challenge is provided in this paper.
△ Less
Submitted 7 November, 2022;
originally announced November 2022.
-
PyNet-V2 Mobile: Efficient On-Device Photo Processing With Neural Networks
Authors:
Andrey Ignatov,
Grigory Malivenko,
Radu Timofte,
Yu Tseng,
Yu-Syuan Xu,
Po-Hsiang Yu,
Cheng-Ming Chiang,
Hsien-Kai Kuo,
Min-Hung Chen,
Chia-Ming Cheng,
Luc Van Gool
Abstract:
The increased importance of mobile photography created a need for fast and performant RAW image processing pipelines capable of producing good visual results in spite of the mobile camera sensor limitations. While deep learning-based approaches can efficiently solve this problem, their computational requirements usually remain too large for high-resolution on-device image processing. To address th…
▽ More
The increased importance of mobile photography created a need for fast and performant RAW image processing pipelines capable of producing good visual results in spite of the mobile camera sensor limitations. While deep learning-based approaches can efficiently solve this problem, their computational requirements usually remain too large for high-resolution on-device image processing. To address this limitation, we propose a novel PyNET-V2 Mobile CNN architecture designed specifically for edge devices, being able to process RAW 12MP photos directly on mobile phones under 1.5 second and producing high perceptual photo quality. To train and to evaluate the performance of the proposed solution, we use the real-world Fujifilm UltraISP dataset consisting on thousands of RAW-RGB image pairs captured with a professional medium-format 102MP Fujifilm camera and a popular Sony mobile camera sensor. The results demonstrate that the PyNET-V2 Mobile model can substantially surpass the quality of tradition ISP pipelines, while outperforming the previously introduced neural network-based solutions designed for fast image processing. Furthermore, we show that the proposed architecture is also compatible with the latest mobile AI accelerators such as NPUs or APUs that can be used to further reduce the latency of the model to as little as 0.5 second. The dataset, code and pre-trained models used in this paper are available on the project website: https://github.com/gmalivenko/PyNET-v2
△ Less
Submitted 8 November, 2022;
originally announced November 2022.
-
Efficient and Accurate Quantized Image Super-Resolution on Mobile NPUs, Mobile AI & AIM 2022 challenge: Report
Authors:
Andrey Ignatov,
Radu Timofte,
Maurizio Denna,
Abdel Younes,
Ganzorig Gankhuyag,
Jingang Huh,
Myeong Kyun Kim,
Kihwan Yoon,
Hyeon-Cheol Moon,
Seungho Lee,
Yoonsik Choe,
Jinwoo Jeong,
Sungjei Kim,
Maciej Smyl,
Tomasz Latkowski,
Pawel Kubik,
Michal Sokolski,
Yujie Ma,
Jiahao Chao,
Zhou Zhou,
Hongfan Gao,
Zhengfeng Yang,
Zhenbing Zeng,
Zhengyang Zhuge,
Chenghua Li
, et al. (71 additional authors not shown)
Abstract:
Image super-resolution is a common task on mobile and IoT devices, where one often needs to upscale and enhance low-resolution images and video frames. While numerous solutions have been proposed for this problem in the past, they are usually not compatible with low-power mobile NPUs having many computational and memory constraints. In this Mobile AI challenge, we address this problem and propose…
▽ More
Image super-resolution is a common task on mobile and IoT devices, where one often needs to upscale and enhance low-resolution images and video frames. While numerous solutions have been proposed for this problem in the past, they are usually not compatible with low-power mobile NPUs having many computational and memory constraints. In this Mobile AI challenge, we address this problem and propose the participants to design an efficient quantized image super-resolution solution that can demonstrate a real-time performance on mobile NPUs. The participants were provided with the DIV2K dataset and trained INT8 models to do a high-quality 3X image upscaling. The runtime of all models was evaluated on the Synaptics VS680 Smart Home board with a dedicated edge NPU capable of accelerating quantized neural networks. All proposed solutions are fully compatible with the above NPU, demonstrating an up to 60 FPS rate when reconstructing Full HD resolution images. A detailed description of all models developed in the challenge is provided in this paper.
△ Less
Submitted 7 November, 2022;
originally announced November 2022.
-
Power Efficient Video Super-Resolution on Mobile NPUs with Deep Learning, Mobile AI & AIM 2022 challenge: Report
Authors:
Andrey Ignatov,
Radu Timofte,
Cheng-Ming Chiang,
Hsien-Kai Kuo,
Yu-Syuan Xu,
Man-Yu Lee,
Allen Lu,
Chia-Ming Cheng,
Chih-Cheng Chen,
Jia-Ying Yong,
Hong-Han Shuai,
Wen-Huang Cheng,
Zhuang Jia,
Tianyu Xu,
Yijian Zhang,
Long Bao,
Heng Sun,
Diankai Zhang,
Si Gao,
Shaoli Liu,
Biao Wu,
Xiaofeng Zhang,
Chengjian Zheng,
Kaidi Lu,
Ning Wang
, et al. (29 additional authors not shown)
Abstract:
Video super-resolution is one of the most popular tasks on mobile devices, being widely used for an automatic improvement of low-bitrate and low-resolution video streams. While numerous solutions have been proposed for this problem, they are usually quite computationally demanding, demonstrating low FPS rates and power efficiency on mobile devices. In this Mobile AI challenge, we address this prob…
▽ More
Video super-resolution is one of the most popular tasks on mobile devices, being widely used for an automatic improvement of low-bitrate and low-resolution video streams. While numerous solutions have been proposed for this problem, they are usually quite computationally demanding, demonstrating low FPS rates and power efficiency on mobile devices. In this Mobile AI challenge, we address this problem and propose the participants to design an end-to-end real-time video super-resolution solution for mobile NPUs optimized for low energy consumption. The participants were provided with the REDS training dataset containing video sequences for a 4X video upscaling task. The runtime and power efficiency of all models was evaluated on the powerful MediaTek Dimensity 9000 platform with a dedicated AI processing unit capable of accelerating floating-point and quantized neural networks. All proposed solutions are fully compatible with the above NPU, demonstrating an up to 500 FPS rate and 0.2 [Watt / 30 FPS] power consumption. A detailed description of all models developed in the challenge is provided in this paper.
△ Less
Submitted 7 November, 2022;
originally announced November 2022.
-
Efficient Single-Image Depth Estimation on Mobile Devices, Mobile AI & AIM 2022 Challenge: Report
Authors:
Andrey Ignatov,
Grigory Malivenko,
Radu Timofte,
Lukasz Treszczotko,
Xin Chang,
Piotr Ksiazek,
Michal Lopuszynski,
Maciej Pioro,
Rafal Rudnicki,
Maciej Smyl,
Yujie Ma,
Zhenyu Li,
Zehui Chen,
Jialei Xu,
Xianming Liu,
Junjun Jiang,
XueChao Shi,
Difan Xu,
Yanan Li,
Xiaotao Wang,
Lei Lei,
Ziyu Zhang,
Yicheng Wang,
Zilong Huang,
Guozhong Luo
, et al. (14 additional authors not shown)
Abstract:
Various depth estimation models are now widely used on many mobile and IoT devices for image segmentation, bokeh effect rendering, object tracking and many other mobile tasks. Thus, it is very crucial to have efficient and accurate depth estimation models that can run fast on low-power mobile chipsets. In this Mobile AI challenge, the target was to develop deep learning-based single image depth es…
▽ More
Various depth estimation models are now widely used on many mobile and IoT devices for image segmentation, bokeh effect rendering, object tracking and many other mobile tasks. Thus, it is very crucial to have efficient and accurate depth estimation models that can run fast on low-power mobile chipsets. In this Mobile AI challenge, the target was to develop deep learning-based single image depth estimation solutions that can show a real-time performance on IoT platforms and smartphones. For this, the participants used a large-scale RGB-to-depth dataset that was collected with the ZED stereo camera capable to generated depth maps for objects located at up to 50 meters. The runtime of all models was evaluated on the Raspberry Pi 4 platform, where the developed solutions were able to generate VGA resolution depth maps at up to 27 FPS while achieving high fidelity results. All models developed in the challenge are also compatible with any Android or Linux-based mobile devices, their detailed description is provided in this paper.
△ Less
Submitted 7 November, 2022;
originally announced November 2022.
-
Learned Smartphone ISP on Mobile GPUs with Deep Learning, Mobile AI & AIM 2022 Challenge: Report
Authors:
Andrey Ignatov,
Radu Timofte,
Shuai Liu,
Chaoyu Feng,
Furui Bai,
Xiaotao Wang,
Lei Lei,
Ziyao Yi,
Yan Xiang,
Zibin Liu,
Shaoqing Li,
Keming Shi,
Dehui Kong,
Ke Xu,
Minsu Kwon,
Yaqi Wu,
Jiesi Zheng,
Zhihao Fan,
Xun Wu,
Feng Zhang,
Albert No,
Minhyeok Cho,
Zewen Chen,
Xiaze Zhang,
Ran Li
, et al. (13 additional authors not shown)
Abstract:
The role of mobile cameras increased dramatically over the past few years, leading to more and more research in automatic image quality enhancement and RAW photo processing. In this Mobile AI challenge, the target was to develop an efficient end-to-end AI-based image signal processing (ISP) pipeline replacing the standard mobile ISPs that can run on modern smartphone GPUs using TensorFlow Lite. Th…
▽ More
The role of mobile cameras increased dramatically over the past few years, leading to more and more research in automatic image quality enhancement and RAW photo processing. In this Mobile AI challenge, the target was to develop an efficient end-to-end AI-based image signal processing (ISP) pipeline replacing the standard mobile ISPs that can run on modern smartphone GPUs using TensorFlow Lite. The participants were provided with a large-scale Fujifilm UltraISP dataset consisting of thousands of paired photos captured with a normal mobile camera sensor and a professional 102MP medium-format FujiFilm GFX100 camera. The runtime of the resulting models was evaluated on the Snapdragon's 8 Gen 1 GPU that provides excellent acceleration results for the majority of common deep learning ops. The proposed solutions are compatible with all recent mobile GPUs, being able to process Full HD photos in less than 20-50 milliseconds while achieving high fidelity results. A detailed description of all models developed in this challenge is provided in this paper.
△ Less
Submitted 7 November, 2022;
originally announced November 2022.
-
Perceptual Image Enhancement for Smartphone Real-Time Applications
Authors:
Marcos V. Conde,
Florin Vasluianu,
Javier Vazquez-Corral,
Radu Timofte
Abstract:
Recent advances in camera designs and imaging pipelines allow us to capture high-quality images using smartphones. However, due to the small size and lens limitations of the smartphone cameras, we commonly find artifacts or degradation in the processed images. The most common unpleasant effects are noise artifacts, diffraction artifacts, blur, and HDR overexposure. Deep learning methods for image…
▽ More
Recent advances in camera designs and imaging pipelines allow us to capture high-quality images using smartphones. However, due to the small size and lens limitations of the smartphone cameras, we commonly find artifacts or degradation in the processed images. The most common unpleasant effects are noise artifacts, diffraction artifacts, blur, and HDR overexposure. Deep learning methods for image restoration can successfully remove these artifacts. However, most approaches are not suitable for real-time applications on mobile devices due to their heavy computation and memory requirements. In this paper, we propose LPIENet, a lightweight network for perceptual image enhancement, with the focus on deploying it on smartphones. Our experiments show that, with much fewer parameters and operations, our model can deal with the mentioned artifacts and achieve competitive performance compared with state-of-the-art methods on standard benchmarks. Moreover, to prove the efficiency and reliability of our approach, we deployed the model directly on commercial smartphones and evaluated its performance. Our model can process 2K resolution images under 1 second in mid-level commercial smartphones.
△ Less
Submitted 22 November, 2023; v1 submitted 24 October, 2022;
originally announced October 2022.
-
Reversed Image Signal Processing and RAW Reconstruction. AIM 2022 Challenge Report
Authors:
Marcos V. Conde,
Radu Timofte,
Yibin Huang,
Jingyang Peng,
Chang Chen,
Cheng Li,
Eduardo Pérez-Pellitero,
Fenglong Song,
Furui Bai,
Shuai Liu,
Chaoyu Feng,
Xiaotao Wang,
Lei Lei,
Yu Zhu,
Chenghua Li,
Yingying Jiang,
Yong A,
Peisong Wang,
Cong Leng,
Jian Cheng,
Xiaoyu Liu,
Zhicun Yin,
Zhilu Zhang,
Junyi Li,
Ming Liu
, et al. (18 additional authors not shown)
Abstract:
Cameras capture sensor RAW images and transform them into pleasant RGB images, suitable for the human eyes, using their integrated Image Signal Processor (ISP). Numerous low-level vision tasks operate in the RAW domain (e.g. image denoising, white balance) due to its linear relationship with the scene irradiance, wide-range of information at 12bits, and sensor designs. Despite this, RAW image data…
▽ More
Cameras capture sensor RAW images and transform them into pleasant RGB images, suitable for the human eyes, using their integrated Image Signal Processor (ISP). Numerous low-level vision tasks operate in the RAW domain (e.g. image denoising, white balance) due to its linear relationship with the scene irradiance, wide-range of information at 12bits, and sensor designs. Despite this, RAW image datasets are scarce and more expensive to collect than the already large and public RGB datasets.
This paper introduces the AIM 2022 Challenge on Reversed Image Signal Processing and RAW Reconstruction. We aim to recover raw sensor images from the corresponding RGBs without metadata and, by doing this, "reverse" the ISP transformation. The proposed methods and benchmark establish the state-of-the-art for this low-level vision inverse problem, and generating realistic raw sensor readings can potentially benefit other tasks such as denoising and super-resolution.
△ Less
Submitted 20 October, 2022;
originally announced October 2022.
-
AIM 2022 Challenge on Instagram Filter Removal: Methods and Results
Authors:
Furkan Kınlı,
Sami Menteş,
Barış Özcan,
Furkan Kıraç,
Radu Timofte,
Yi Zuo,
Zitao Wang,
Xiaowen Zhang,
Yu Zhu,
Chenghua Li,
Cong Leng,
Jian Cheng,
Shuai Liu,
Chaoyu Feng,
Furui Bai,
Xiaotao Wang,
Lei Lei,
Tianzhi Ma,
Zihan Gao,
Wenxin He,
Woon-Ha Yeo,
Wang-Taek Oh,
Young-Il Kim,
Han-Cheol Ryu,
Gang He
, et al. (8 additional authors not shown)
Abstract:
This paper introduces the methods and the results of AIM 2022 challenge on Instagram Filter Removal. Social media filters transform the images by consecutive non-linear operations, and the feature maps of the original content may be interpolated into a different domain. This reduces the overall performance of the recent deep learning strategies. The main goal of this challenge is to produce realis…
▽ More
This paper introduces the methods and the results of AIM 2022 challenge on Instagram Filter Removal. Social media filters transform the images by consecutive non-linear operations, and the feature maps of the original content may be interpolated into a different domain. This reduces the overall performance of the recent deep learning strategies. The main goal of this challenge is to produce realistic and visually plausible images where the impact of the filters applied is mitigated while preserving the content. The proposed solutions are ranked in terms of the PSNR value with respect to the original images. There are two prior studies on this task as the baseline, and a total of 9 teams have competed in the final phase of the challenge. The comparison of qualitative results of the proposed solutions and the benchmark for the challenge are presented in this report.
△ Less
Submitted 17 October, 2022;
originally announced October 2022.
-
Basic Binary Convolution Unit for Binarized Image Restoration Network
Authors:
Bin Xia,
Yulun Zhang,
Yitong Wang,
Yapeng Tian,
Wenming Yang,
Radu Timofte,
Luc Van Gool
Abstract:
Lighter and faster image restoration (IR) models are crucial for the deployment on resource-limited devices. Binary neural network (BNN), one of the most promising model compression methods, can dramatically reduce the computations and parameters of full-precision convolutional neural networks (CNN). However, there are different properties between BNN and full-precision CNN, and we can hardly use…
▽ More
Lighter and faster image restoration (IR) models are crucial for the deployment on resource-limited devices. Binary neural network (BNN), one of the most promising model compression methods, can dramatically reduce the computations and parameters of full-precision convolutional neural networks (CNN). However, there are different properties between BNN and full-precision CNN, and we can hardly use the experience of designing CNN to develop BNN. In this study, we reconsider components in binary convolution, such as residual connection, BatchNorm, activation function, and structure, for IR tasks. We conduct systematic analyses to explain each component's role in binary convolution and discuss the pitfalls. Specifically, we find that residual connection can reduce the information loss caused by binarization; BatchNorm can solve the value range gap between residual connection and binary convolution; The position of the activation function dramatically affects the performance of BNN. Based on our findings and analyses, we design a simple yet efficient basic binary convolution unit (BBCU). Furthermore, we divide IR networks into four parts and specially design variants of BBCU for each part to explore the benefit of binarizing these parts. We conduct experiments on different IR tasks, and our BBCU significantly outperforms other BNNs and lightweight models, which shows that BBCU can serve as a basic unit for binarized IR networks. All codes and models will be released.
△ Less
Submitted 16 February, 2023; v1 submitted 1 October, 2022;
originally announced October 2022.
-
Swin2SR: SwinV2 Transformer for Compressed Image Super-Resolution and Restoration
Authors:
Marcos V. Conde,
Ui-Jin Choi,
Maxime Burchi,
Radu Timofte
Abstract:
Compression plays an important role on the efficient transmission and storage of images and videos through band-limited systems such as streaming services, virtual reality or videogames. However, compression unavoidably leads to artifacts and the loss of the original information, which may severely degrade the visual quality. For these reasons, quality enhancement of compressed images has become a…
▽ More
Compression plays an important role on the efficient transmission and storage of images and videos through band-limited systems such as streaming services, virtual reality or videogames. However, compression unavoidably leads to artifacts and the loss of the original information, which may severely degrade the visual quality. For these reasons, quality enhancement of compressed images has become a popular research topic. While most state-of-the-art image restoration methods are based on convolutional neural networks, other transformers-based methods such as SwinIR, show impressive performance on these tasks.
In this paper, we explore the novel Swin Transformer V2, to improve SwinIR for image super-resolution, and in particular, the compressed input scenario. Using this method we can tackle the major issues in training transformer vision models, such as training instability, resolution gaps between pre-training and fine-tuning, and hunger on data. We conduct experiments on three representative tasks: JPEG compression artifacts removal, image super-resolution (classical and lightweight), and compressed image super-resolution. Experimental results demonstrate that our method, Swin2SR, can improve the training convergence and performance of SwinIR, and is a top-5 solution at the "AIM 2022 Challenge on Super-Resolution of Compressed Image and Video".
△ Less
Submitted 22 September, 2022;
originally announced September 2022.