Search | arXiv e-print repository

Single channel speech enhancement by colored spectrograms

Authors: Sania Gul, Muhammad Salman Khan, Muhammad Fazeel

Abstract: Speech enhancement concerns the processes required to remove unwanted background sounds from the target speech to improve its quality and intelligibility. In this paper, a novel approach for single-channel speech enhancement is presented, using colored spectrograms. We propose the use of a deep neural network (DNN) architecture adapted from the pix2pix generative adversarial network (GAN) and train it over colored spectrograms of speech to denoise them. After denoising, the colors of spectrograms are translated to magnitudes of short-time Fourier transform (STFT) using a shallow regression neural network. These estimated STFT magnitudes are later combined with the noisy phases to obtain an enhanced speech. The results show an improvement of almost 0.84 points in the perceptual evaluation of speech quality (PESQ) and 1% in the short-term objective intelligibility (STOI) over the unprocessed noisy data. The gain in quality and intelligibility over the unprocessed signal is almost equal to the gain achieved by the baseline methods used for comparison with the proposed model, but at a much reduced computational cost. The proposed solution offers a comparative PESQ score at almost 10 times reduced computational cost than a similar baseline model that has generated the highest PESQ score trained on grayscaled spectrograms, while it provides only a 1% deficit in STOI at 28 times reduced computational cost when compared to another baseline system based on convolutional neural network-GAN (CNN-GAN) that produces the most intelligible speech.

Submitted 26 October, 2023; originally announced October 2023.

Comments: 18 pages, 6 figures, 5 tables

arXiv:2208.05184 [pdf]

Preserving the beamforming effect for spatial cue-based pseudo-binaural dereverberation of a single source

Authors: Sania Gul, Muhammad Salman Khan, Syed Waqar Shah

Abstract: Reverberations are unavoidable in enclosures, resulting in reduced intelligibility for hearing impaired and non native listeners and even for the normal hearing listeners in noisy circumstances. It also degrades the performance of machine listening applications. In this paper, we propose a novel approach of binaural dereverberation of a single speech source, using the differences in the interaural cues of the direct path signal and the reverberations. Two beamformers, spaced at an interaural distance, are used to extract the reverberations from the reverberant speech. The interaural cues generated by these reverberations and those generated by the direct path signal act as a two class dataset, used for the training of U-Net (a deep convolutional neural network). After its training, the beamformers are removed and the trained U-Net along with the maximum likelihood estimation (MLE) algorithm is used to discriminate between the direct path cues from the reverberation cues, when the system is exposed to the interaural spectrogram of the reverberant speech signal. Our proposed model has outperformed the classical signal processing dereverberation model weighted prediction error in terms of cepstral distance (CEP), frequency weighted segmental signal to noise ratio (FWSEGSNR) and signal to reverberation modulation energy ratio (SRMR) by 1.4 points, 8 dB and 0.6 dB. It has achieved better performance than the deep learning based dereverberation model by gaining 1.3 points improvement in CEP with comparable FWSEGSNR, using training dataset which is almost 8 times smaller than required for that model. The proposed model also sustained its performance under relatively similar unseen acoustic conditions and at positions in the vicinity of its training position.

Submitted 10 August, 2022; originally announced August 2022.

Comments: 25 pages, 7 figures

arXiv:2208.04626 [pdf]

Recycling an anechoic pre-trained speech separation deep neural network for binaural dereverberation of a single source

Authors: Sania Gul, Muhammad Salman Khan, Syed Waqar Shah, Ata Ur-Rehman

Abstract: Reverberation results in reduced intelligibility for both normal and hearing-impaired listeners. This paper presents a novel psychoacoustic approach of dereverberation of a single speech source by recycling a pre-trained binaural anechoic speech separation neural network. As training the deep neural network (DNN) is a lengthy and computationally expensive process, the advantage of using a pre-trained separation network for dereverberation is that the network does not need to be retrained, saving both time and computational resources. The interaural cues of a reverberant source are given to this pretrained neural network to discriminate between the direct path signal and the reverberant speech. The results show an average improvement of 1.3% in signal intelligibility, 0.83 dB in SRMR (signal to reverberation energy ratio) and 0.16 points in perceptual evaluation of speech quality (PESQ) over other state-of-the-art signal processing dereverberation algorithms and 14% in intelligibility and 0.35 points in quality over orthogonal matching pursuit with spectral subtraction (OSS), a machine learning based dereverberation algorithm.

Submitted 9 August, 2022; originally announced August 2022.

Comments: 15 pages, 4 figures

arXiv:2107.03056 [pdf, ps, other]

Position Constrained, Adaptive Control of Robotic Manipulators without Velocity Measurements

Authors: Samet Gul, Erkan Zergeroglu, Enver Tatlicioglu

Abstract: This work presents the design and the corresponding stability analysis of a model based, joint position tracking error constrained, adaptive output feedback controller for robot manipulators. Specifically, provided that the initial joint position tracking error starts within a predefined region, the proposed controller algorithm ensures that the joint tracking error remains inside this region and asymptotically approaches to zero, despite the lack of joint velocity measurements and uncertainties associated with the system dynamics. The need for the joint velocity measurements are removed via the use of a surrogate filter formulation in conjunction with the use of desired model compensation. The stability and the convergence of the closed loop system are proved via a barrier Lyapunov function based argument. A simulation performed on a two-link robotic manipulator is provided in order to illustrate the feasibility and effectiveness of the proposed method.

Submitted 7 July, 2021; originally announced July 2021.

Comments: 10 pages, 3 figures

arXiv:2102.13334 [pdf]

Integration of deep learning with expectation maximization for spatial cue based speech separation in reverberant conditions

Authors: Sania Gul, Muhammad Salman Khan, Syed Waqar Shah

Abstract: In this paper, we formulate a blind source separation (BSS) framework, which allows integrating U-Net based deep learning source separation network with probabilistic spatial machine learning expectation maximization (EM) algorithm for separating speech in reverberant conditions. Our proposed model uses a pre-trained deep learning convolutional neural network, U-Net, for clustering the interaural level difference (ILD) cues and machine learning expectation maximization (EM) algorithm for clustering the interaural phase difference (IPD) cues. The integrated model exploits the complementary strengths of the two approaches to BSS: the strong modeling power of supervised neural networks and the ease of unsupervised machine learning algorithms, whose few parameters can be estimated on as little as a single segment of an audio mixture. The results show an average improvement of 4.3 dB in signal to distortion ratio (SDR) and 4.3% in short time speech intelligibility (STOI) over the EM based source separation algorithm MESSL-GS (model-based expectation-maximization source separation and localization with garbage source) and 4.5 dB in SDR and 8% in STOI over deep learning convolutional neural network (U-Net) based speech separation algorithm SONET under the reverberant conditions ranging from anechoic to those mostly encountered in the real world.

Submitted 26 February, 2021; originally announced February 2021.

arXiv:2012.01900 [pdf, other]

Light-field view synthesis using convolutional block attention module

Authors: M. Shahzeb Khan Gul, Umair Mukati, Michel Bätz, Søren Forchhammer, Joachim Keinert

Abstract: Consumer light-field (LF) cameras suffer from a low or limited resolution because of the angular-spatial trade-off. To alleviate this drawback, we propose a novel learning-based approach utilizing attention mechanism to synthesize novel views of a light-field image using a sparse set of input views (i.e., 4 corner views) from a camera array. In the proposed method, we divide the process into three stages, stereo-feature extraction, disparity estimation, and final image refinement. We use three sequential convolutional neural networks for each stage. A residual convolutional block attention module (CBAM) is employed for final adaptive image refinement. Attention modules are helpful in learning and focusing more on the important features of the image and are thus sequentially applied in the channel and spatial dimensions. Experimental results show the robustness of the proposed method. Our proposed network outperforms the state-of-the-art learning-based light-field view synthesis methods on two challenging real-world datasets by 0.5 dB on average. Furthermore, we provide an ablation study to substantiate our findings.

Submitted 31 May, 2021; v1 submitted 3 December, 2020; originally announced December 2020.

arXiv:2007.14084 [pdf, other]

doi 10.1145/3394171.3413699

Kalman Filter-based Head Motion Prediction for Cloud-based Mixed Reality

Authors: Serhan Gül, Sebastian Bosse, Dimitri Podborski, Thomas Schierl, Cornelius Hellge

Abstract: Volumetric video allows viewers to experience highly-realistic 3D content with six degrees of freedom in mixed reality (MR) environments. Rendering complex volumetric videos can require a prohibitively high amount of computational power for mobile devices. A promising technique to reduce the computational burden on mobile devices is to perform the rendering at a cloud server. However, cloud-based rendering systems suffer from an increased interaction (motion-to-photon) latency that may cause registration errors in MR environments. One way of reducing the effective latency is to predict the viewer's head pose and render the corresponding view from the volumetric video in advance. In this paper, we design a Kalman filter for head motion prediction in our cloud-based volumetric video streaming system. We analyze the performance of our approach using recorded head motion traces and compare its performance to an autoregression model for different prediction intervals (look-ahead times). Our results show that the Kalman filter can predict head orientations 0.5 degrees more accurately than the autoregression model for a look-ahead time of 60 ms.

Submitted 28 July, 2020; originally announced July 2020.

Comments: Accepted at the ACM Multimedia Conference (ACMMM) 2020. 9 pages, 9 figures

Journal ref: Proceedings of the 28th ACM International Conference on Multimedia (2020) 3632-3641

arXiv:2005.11413 [pdf, ps, other]

FPGA based design for online computation of Multivariate EMD (MEMD)

Authors: Sikender Gul, Muhammad Faisal Siddiqui, Naveed Ur Rehman

Abstract: Multivariate or multichannel data have become ubiquitous in many modern scientific and engineering applications, e.g., biomedical engineering, owing to recent advances in sensor and computing technology. Processing these data sets is challenging owing to: i) their large size and multidimensional nature, thus requiring specialized algorithms and efficient hardware designs for on-line and real-time processing; ii) the nonstationary nature of data arising in many real life applications demanding new extensions of standard multiscale non-stationary signal processing tools. In this paper, we address the former issue by proposing a fully FPGA based hardware architecture of a popular multi-scale and multivariate signal processing algorithm, termed as multivariate empirical mode decomposition (MEMD). MEMD is a data-driven method that extends the functionality of standard empirical mode decomposition (EMD) algorithm to multichannel or multivariate data sets. Since its inception in 2010, the algorithm has found wide spread applications spanning different engineering related fields. Yet, no parallel FPGA based hardware design of the algorithm is available for its on-line and real-time processing. Our proposed architecture for MEMD uses fixed point operations and employs cubic spline interpolation within the sifting process. Finally, examples of decomposition of multivariate synthetic and real world biological signals are provided.

Submitted 22 May, 2020; originally announced May 2020.

ACM Class: B.7.0; B.6.1

arXiv:2004.11277 [pdf, ps, other]

Desired Model Compensation based Position Constrained Control of Robotic Manipulators

Authors: Samet Gul, Erkan Zergeroglu, Enver Tatlicioglu, Mesih Veysi Kilinc

Abstract: This work presents the design and the corresponding stability analysis of desired model based, joint position constrained, robot controller. Specifically, provided that the initial joint position tracking error signal starts below some predefined value, the proposed controller ensures that the joint tracking error signal remains inside the region (defined by predefined upper-bound) and approaches to zero asymptotically.

Submitted 24 April, 2020; v1 submitted 23 April, 2020; originally announced April 2020.

Comments: 3 figures 2 tables and total 13 pages

arXiv:2003.02526 [pdf, other]

doi 10.1145/3339825.3393583

Cloud Rendering-based Volumetric Video Streaming System for Mixed Reality Services

Authors: Serhan Gül, Dimitri Podborski, Jangwoo Son, Gurdeep Singh Bhullar, Thomas Buchholz, Thomas Schierl, Cornelius Hellge

Abstract: Volumetric video is an emerging technology for immersive representation of 3D spaces that captures objects from all directions using multiple cameras and creates a dynamic 3D model of the scene. However, processing volumetric content requires high amounts of processing power and is still a very demanding task for today's mobile devices. To mitigate this, we propose a volumetric video streaming system that offloads the rendering to a powerful cloud/edge server and only sends the rendered 2D view to the client instead of the full volumetric content. We use 6DoF head movement prediction techniques, WebRTC protocol and hardware video encoding to ensure low-latency in different parts of the processing chain. We demonstrate our system using both a browser-based client and a Microsoft HoloLens client. Our application contains generic interfaces that allow for easy deployment of various augmented/mixed reality clients using the same server implementation.

Submitted 16 July, 2020; v1 submitted 5 March, 2020; originally announced March 2020.

Comments: 4 pages, 2 figures

Journal ref: 11th ACM Multimedia Systems Conference (MMSys) 2020

arXiv:2001.06466 [pdf, other]

doi 10.1145/3386290.3396933

Low-latency Cloud-based Volumetric Video Streaming Using Head Motion Prediction

Authors: Serhan Gül, Dimitri Podborski, Thomas Buchholz, Thomas Schierl, Cornelius Hellge

Abstract: Volumetric video is an emerging key technology for immersive representation of 3D spaces and objects. Rendering volumetric video requires lots of computational power which is challenging especially for mobile devices. To mitigate this, we developed a streaming system that renders a 2D view from the volumetric video at a cloud server and streams a 2D video stream to the client. However, such network-based processing increases the motion-to-photon (M2P) latency due to the additional network and processing delays. In order to compensate the added latency, prediction of the future user pose is necessary. We developed a head motion prediction model and investigated its potential to reduce the M2P latency for different look-ahead times. Our results show that the presented model reduces the rendering errors caused by the M2P latency compared to a baseline system in which no prediction is performed.

Submitted 16 July, 2020; v1 submitted 17 January, 2020; originally announced January 2020.

Comments: 7 pages, 4 figures

Journal ref: 30th ACM Workshop on Network and Operating Systems Support for Digital Audio and Video (NOSSDAV) 2020

arXiv:1909.10583 [pdf, other]

doi 10.1016/j.jksues.2019.07.001

High Impedance Fault Detection and Isolation in Power Distribution Networks using Support Vector Machines

Authors: Muhammad Sarwar, Faisal Mehmood, Muhammad Abid, Abdul Qayyum Khan, Sufi Tabassum Gul, Adil Sarwar Khan

Abstract: This paper proposes an accurate High Impedance Fault (HIF) detection and isolation scheme in a power distribution network. The proposed schemes utilize the data available from voltage and current sensors. The technique employs multiple algorithms consisting of Principal Component Analysis, Fisher Discriminant Analysis, Binary and Multiclass Support Vector Machine for detection and identification of the high impedance fault. These data driven techniques have been tested on IEEE 13-node distribution network for detection and identification of high impedance faults with broken and unbroken conductor. Further, the robustness of machine learning techniques has also been analysed by examining their performance with variation in loads for different faults. Simulation results for different faults at various locations have shown that proposed methods are fast and accurate in diagnosing high impedance faults. Multiclass Support Vector Machine gives the best result to detect and locate High Impedance Fault accurately. It ensures reliability, security and dependability of the distribution network.

Submitted 9 August, 2019; originally announced September 2019.

Comments: 16 pages, 19 figures, published in a journal

Journal ref: Journal of King Saud University - Engineering Sciences, July 2019

Showing 1–12 of 12 results for author: Gul, S