
arXiv:2005.01996

NTIRE 2020 Challenge on Real-World Image Super-Resolution: Methods and Results

Authors: Andreas Lugmayr, Martin Danelljan, Radu Timofte, Namhyuk Ahn, Dongwoon Bai, Jie Cai, Yun Cao, Junyang Chen, Kaihua Cheng, SeYoung Chun, Wei Deng, Mostafa El-Khamy, Chiu Man Ho, Xiaozhong Ji, Amin Kheradmand, Gwantae Kim, Hanseok Ko, Kanghyu Lee, Jungwon Lee, Hao Li, Ziluan Liu, Zhi-Song Liu, Shuai Liu, Yunhua Lu, Zibo Meng, et al. (21 additional authors not shown)

Abstract: This paper reviews the NTIRE 2020 challenge on real-world super-resolution. It focuses on the participating methods and final results. The challenge addresses the real-world setting, where paired true high- and low-resolution images are unavailable. For training, only one set of source input images is therefore provided along with a set of unpaired high-quality target images. In Track 1: Image Processing artifacts, the aim is to super-resolve images with synthetically generated image processing artifacts. This allows for quantitative benchmarking of the approaches with respect to a ground-truth image. In Track 2: Smartphone Images, real low-quality smartphone images have to be super-resolved. In both tracks, the ultimate goal is to achieve the best perceptual quality, evaluated using a human study. This is the second challenge on the subject, following AIM 2019, aiming to advance the state of the art in super-resolution. To measure the performance we use the benchmark protocol from AIM 2019. In total 22 teams competed in the final testing phase, demonstrating new and innovative solutions to the problem.

Submitted 5 May, 2020; originally announced May 2020.

arXiv:2003.00830

GSANet: Semantic Segmentation with Global and Selective Attention

Authors: Qingfeng Liu, Mostafa El-Khamy, Dongwoon Bai, Jungwon Lee

Abstract: This paper proposes a novel deep learning architecture for semantic segmentation. The proposed Global and Selective Attention Network (GSANet) features Atrous Spatial Pyramid Pooling (ASPP) with a novel sparsemax global attention and a novel selective attention that deploys a condensation and diffusion mechanism to aggregate the multi-scale contextual information from the extracted deep features. A selective attention decoder is also proposed to process the GSA-ASPP outputs for optimizing the softmax volume. We are the first to benchmark the performance of semantic segmentation networks with the low-complexity feature extraction network (FXN) MobileNetEdge, which is optimized for low latency on edge devices. We show that GSANet can result in more accurate segmentation with MobileNetEdge, as well as with strong FXNs, such as Xception. GSANet improves the state-of-the-art semantic segmentation accuracy on both the ADE20k and the Cityscapes datasets.

Submitted 13 February, 2020; originally announced March 2020.

arXiv:1910.10707

End-to-End Multi-Task Denoising for the Joint Optimization of Perceptual Speech Metrics

Authors: Jaeyoung Kim, Mostafa El-Khamy, Jungwon Lee

Abstract: Although supervised learning based on a deep neural network has recently achieved substantial improvement on speech enhancement, the existing schemes have either of two critical issues: spectrum or metric mismatches. The spectrum mismatch is a well-known issue that any spectrum modification after short-time Fourier transform (STFT), in general, cannot be fully recovered after inverse short-time Fourier transform (ISTFT). The metric mismatch is that a conventional mean square error (MSE) loss function is typically sub-optimal for maximizing perceptual speech measures such as signal-to-distortion ratio (SDR), perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI). This paper presents a new end-to-end denoising framework. First, the network optimization is performed on the time-domain signals after ISTFT to avoid the spectrum mismatch. Second, three loss functions based on SDR, PESQ and STOI are proposed to minimize the metric mismatch. The experimental results showed that the proposed denoising scheme significantly improved SDR, PESQ and STOI performance over the existing methods. Moreover, the proposed scheme also provided good generalization performance over generative denoising models on the perceptual speech metrics not used as a loss function during training.

Submitted 5 May, 2020; v1 submitted 23 October, 2019; originally announced October 2019.

Comments: 5 pages, submitted to Interspeech 2020. arXiv admin note: substantial text overlap with arXiv:1901.09146

arXiv:1910.06762

T-GSA: Transformer with Gaussian-weighted self-attention for speech enhancement

Authors: Jaeyoung Kim, Mostafa El-Khamy, Jungwon Lee

Abstract: Transformer neural networks (TNNs) demonstrated state-of-the-art performance on many natural language processing (NLP) tasks, replacing recurrent neural networks (RNNs), such as LSTMs or GRUs. However, TNNs did not perform well in speech enhancement, whose contextual nature is different from that of NLP tasks, like machine translation. Self-attention is a core building block of the Transformer, which not only enables parallelization of sequence computation, but also provides the constant path length between symbols that is essential to learning long-range dependencies. In this paper, we propose a Transformer with Gaussian-weighted self-attention (T-GSA), whose attention weights are attenuated according to the distance between target and context symbols. The experimental results show that the proposed T-GSA has significantly improved speech-enhancement performance, compared to the Transformer and RNNs.

Submitted 11 February, 2020; v1 submitted 13 October, 2019; originally announced October 2019.

Comments: 5 pages, Submitted to ICASSP 2020

arXiv:1909.04802

Variable Rate Deep Image Compression With a Conditional Autoencoder

Authors: Yoojin Choi, Mostafa El-Khamy, Jungwon Lee

Abstract: In this paper, we propose a novel variable-rate learned image compression framework with a conditional autoencoder. Previous learning-based image compression methods mostly require training separate networks for different compression rates so they can yield compressed images of varying quality. In contrast, we train and deploy only one variable-rate image compression network implemented with a conditional autoencoder. We provide two rate control parameters, i.e., the Lagrange multiplier and the quantization bin size, which are given as conditioning variables to the network. Coarse rate adaptation to a target is performed by changing the Lagrange multiplier, while the rate can be further fine-tuned by adjusting the bin size used in quantizing the encoded representation. Our experimental results show that the proposed scheme provides a better rate-distortion trade-off than the traditional variable-rate image compression codecs such as JPEG2000 and BPG. Our model also shows comparable and sometimes better performance than the state-of-the-art learned image compression models that deploy multiple networks trained for varying rates.

Submitted 10 September, 2019; originally announced September 2019.

Comments: ICCV 2019

arXiv:1901.09146

End-to-End Multi-Task Denoising for joint SDR and PESQ Optimization

Authors: Jaeyoung Kim, Mostafa El-Khamy, Jungwon Lee

Abstract: Supervised learning based on a deep neural network has recently achieved substantial improvement on speech enhancement. Denoising networks learn a mapping from noisy speech either directly to clean speech, or to a spectrum mask which is the ratio between clean and noisy spectra. In either case, the network is optimized by minimizing mean square error (MSE) between ground-truth labels and time-domain or spectrum output. However, existing schemes have either of two critical issues: spectrum and metric mismatches. The spectrum mismatch is a well-known issue that any spectrum modification after short-time Fourier transform (STFT), in general, cannot be fully recovered after inverse short-time Fourier transform (ISTFT). The metric mismatch is that a conventional MSE metric is sub-optimal for maximizing our target metrics, signal-to-distortion ratio (SDR) and perceptual evaluation of speech quality (PESQ). This paper presents a new end-to-end denoising framework with the goal of joint SDR and PESQ optimization. First, the network optimization is performed on the time-domain signals after ISTFT to avoid spectrum mismatch. Second, two loss functions which have improved correlations with SDR and PESQ metrics are proposed to minimize metric mismatch. The experimental results showed that the proposed denoising scheme significantly improved both SDR and PESQ performance over the existing methods.

Submitted 8 March, 2023; v1 submitted 25 January, 2019; originally announced January 2019.

arXiv:1810.06766

DN-ResNet: Efficient Deep Residual Network for Image Denoising

Authors: Haoyu Ren, Mostafa El-Khamy, Jungwon Lee

Abstract: A deep learning approach to blind denoising of images without complete knowledge of the noise statistics is considered. We propose DN-ResNet, which is a deep convolutional neural network (CNN) consisting of several residual blocks (ResBlocks). With cascade training, DN-ResNet is more accurate and more computationally efficient than the state-of-the-art denoising networks. An edge-aware loss function is further utilized in training DN-ResNet, so that the denoising results have better perceptual quality compared to a conventional loss function. Next, we introduce the depthwise separable DN-ResNet (DS-DN-ResNet), utilizing the proposed Depthwise Separable ResBlock (DS-ResBlock) instead of the standard ResBlock, which has much lower computational cost. DS-DN-ResNet is incrementally evolved by replacing the ResBlocks in DN-ResNet with DS-ResBlocks stage by stage. As a result, high accuracy and good computational efficiency are achieved concurrently. Whereas previous state-of-the-art deep learning methods focused on denoising either Gaussian or Poisson corrupted images, we consider denoising images having the more practical Poisson with additive Gaussian noise as well. The results show that DN-ResNets are more efficient, robust, and perform better denoising than current state-of-the-art deep learning methods, as well as the popular variants of the BM3D algorithm, in cases of blind and non-blind denoising of images corrupted with Poisson, Gaussian or Poisson-Gaussian noise. Our network also works well for other image enhancement tasks such as compressed image restoration.

Submitted 15 October, 2018; originally announced October 2018.

Journal ref: Asian Conference on Computer Vision 2018

arXiv:1710.10224

BridgeNets: Student-Teacher Transfer Learning Based on Recursive Neural Networks and its Application to Distant Speech Recognition

Authors: Jaeyoung Kim, Mostafa El-Khamy, Jungwon Lee

Abstract: Despite the remarkable progress achieved on automatic speech recognition, recognizing far-field speech mixed with various noise sources is still a challenging task. In this paper, we introduce a novel student-teacher transfer learning framework, BridgeNet, which can provide a solution to improve distant speech recognition. There are two key features in BridgeNet. First, BridgeNet extends traditional student-teacher frameworks by providing multiple hints from a teacher network. Hints are not limited to the soft labels from a teacher network. The teacher's intermediate feature representations can better guide a student network to learn how to denoise or dereverberate noisy input. Second, the proposed recursive architecture in BridgeNet can iteratively improve denoising and recognition performance. The experimental results of BridgeNet showed significant improvements in tackling the distant speech recognition problem, where it achieved up to 13.24% relative WER reductions on the AMI corpus compared to a baseline neural network without the teacher's hints.

Submitted 21 February, 2018; v1 submitted 27 October, 2017; originally announced October 2017.

Comments: Accepted to 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018)

Showing 1–8 of 8 results for author: El-Khamy, M