-
Multivariate Probabilistic Assessment of Speech Quality
Authors:
Fredrik Cumlin,
Xinyu Liang,
Victor Ungureanu,
Chandan K. A. Reddy,
Christian Schüldt,
Saikat Chatterjee
Abstract:
The mean opinion score (MOS) is a standard metric for assessing speech quality, but its singular focus fails to identify specific distortions when low scores are observed. The NISQA dataset addresses this limitation by providing ratings across four additional dimensions: noisiness, coloration, discontinuity, and loudness, alongside MOS. In this paper, we extend the explored univariate MOS estimation to a multivariate framework by modeling these dimensions jointly using a multivariate Gaussian distribution. Our approach utilizes Cholesky decomposition to predict covariances without imposing restrictive assumptions and extends probabilistic affine transformations to a multivariate context. Experimental results show that our model performs on par with state-of-the-art methods in point estimation, while uniquely providing uncertainty and correlation estimates across speech quality dimensions. This enables better diagnosis of poor speech quality and informs targeted improvements.
Submitted 5 June, 2025;
originally announced June 2025.
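The abstract above mentions using a Cholesky decomposition to predict covariances. The following is a minimal sketch of that general idea, not the authors' implementation: unconstrained numbers (as a network head might emit) fill a lower-triangular factor L, and the covariance is formed as L Lᵀ. The function names, the softplus on the diagonal, the small `eps` floor, and the row-major ordering of parameters are all assumptions made for illustration.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)); always strictly positive.
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)


def covariance_from_params(params, dim, eps=1e-6):
    """Map dim*(dim+1)/2 unconstrained numbers to a covariance matrix.

    Hypothetical layout: params fill the lower triangle row by row.
    Diagonal entries pass through softplus so the factor has a strictly
    positive diagonal, which makes L @ L.T symmetric positive definite.
    """
    expected = dim * (dim + 1) // 2
    if len(params) != expected:
        raise ValueError(f"expected {expected} parameters, got {len(params)}")

    lower = [[0.0] * dim for _ in range(dim)]
    k = 0
    for i in range(dim):
        for j in range(i + 1):
            value = params[k]
            k += 1
            lower[i][j] = softplus(value) + eps if i == j else value

    # Covariance = L @ L.T, written out with plain lists.
    return [
        [sum(lower[i][m] * lower[j][m] for m in range(dim)) for j in range(dim)]
        for i in range(dim)
    ]
```

For the five dimensions named in the abstract (MOS, noisiness, coloration, discontinuity, loudness), `dim` would be 5 and the factor would need 15 parameters. Any real-valued input yields a valid covariance, which is why no further constraint on the predicted values is needed in this sketch.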
-
Impairments are Clustered in Latents of Deep Neural Network-based Speech Quality Models
Authors:
Fredrik Cumlin,
Xinyu Liang,
Victor Ungureanu,
Chandan K. A. Reddy,
Christian Schüldt,
Saikat Chatterjee
Abstract:
In this article, we provide an experimental observation: Deep neural network (DNN) based speech quality assessment (SQA) models have inherent latent representations where many types of impairments are clustered. While DNN-based SQA models are not trained for impairment classification, our experiments show good impairment classification results in an appropriate SQA latent representation. We investigate the clustering of impairments using various kinds of audio degradations that include different types of noises, waveform clipping, gain transition, pitch shift, compression, reverberation, etc. To visualize the clusters we perform classification of impairments in the SQA-latent representation domain using a standard k-nearest neighbor (kNN) classifier. We also develop a new DNN-based SQA model, named DNSMOS+, to examine whether an improvement in SQA leads to an improvement in impairment classification. The classification accuracy is 94% for LibriAugmented dataset with 16 types of impairments and 54% for ESC-50 dataset with 50 types of real noises.
Submitted 30 April, 2025;
originally announced April 2025.
-
Towards Sub-millisecond Latency Real-Time Speech Enhancement Models on Hearables
Authors:
Artem Dementyev,
Chandan K. A. Reddy,
Scott Wisdom,
Navin Chatlani,
John R. Hershey,
Richard F. Lyon
Abstract:
Low latency models are critical for real-time speech enhancement applications, such as hearing aids and hearables. However, the sub-millisecond latency space for resource-constrained hearables remains underexplored. We demonstrate speech enhancement using a computationally efficient minimum-phase FIR filter, enabling sample-by-sample processing to achieve mean algorithmic latency of 0.32 ms to 1.25 ms. With a single microphone, we observe a mean SI-SDRi of 4.1 dB. The approach shows generalization with a DNSMOS increase of 0.2 on unseen audio recordings. We use a lightweight LSTM-based model of 626k parameters to generate FIR taps. Using a real hardware implementation on a low-power DSP, our system can run with 376 MIPS and a mean end-to-end latency of 3.35 ms. In addition, we provide a comparison with existing low-latency spectral masking techniques. We hope this work will enable a better understanding of latency and can be used to improve the comfort and usability of hearables.
Submitted 7 March, 2025; v1 submitted 26 September, 2024;
originally announced September 2024.
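The abstract above reports a mean SI-SDRi of 4.1 dB. As a reminder of what that quantity measures, here is a minimal sketch of the commonly used scale-invariant SDR definition and its improvement over the unprocessed input. It is not taken from the paper: the function names are hypothetical, and the signals are assumed to be equal-length, time-aligned, zero-mean lists of samples.

```python
import math


def si_sdr(estimate, reference):
    """Scale-invariant signal-to-distortion ratio in dB.

    The reference is rescaled by the projection coefficient alpha so that
    a pure gain change in the estimate does not affect the score.
    """
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    if ref_energy == 0.0:
        raise ValueError("reference signal has zero energy")
    alpha = dot / ref_energy
    target = [alpha * r for r in reference]
    target_energy = sum(t * t for t in target)
    noise_energy = sum((e - t) ** 2 for e, t in zip(estimate, target))
    if noise_energy == 0.0:
        return math.inf  # estimate is an exact scaled copy of the reference
    if target_energy == 0.0:
        return -math.inf  # estimate is orthogonal to the reference
    return 10.0 * math.log10(target_energy / noise_energy)


def si_sdr_improvement(enhanced, noisy, reference):
    # SI-SDRi: gain of the enhanced signal over the unprocessed noisy input.
    return si_sdr(enhanced, reference) - si_sdr(noisy, reference)
```

A positive improvement means the enhanced signal is closer to the clean reference than the noisy input was, independent of any overall gain applied by the enhancement system.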
-
EEG-Based Reaction Time Prediction with Fuzzy Common Spatial Patterns and Phase Cohesion using Deep Autoencoder Based Data Fusion
Authors:
Vivek Singh,
Tharun Kumar Reddy
Abstract:
Drowsiness state of a driver is a topic of extensive discussion due to its significant role in causing traffic accidents. This research presents a novel approach that combines Fuzzy Common Spatial Patterns (CSP) optimised Phase Cohesive Sequence (PCS) representations and fuzzy CSP-optimized signal amplitude representations. The research aims to examine alterations in Electroencephalogram (EEG) synchronisation between a state of alertness and drowsiness, forecast drivers' reaction times by analysing EEG data, and subsequently identify the presence of drowsiness. The study's findings indicate that this approach successfully distinguishes between alert and drowsy mental states. By employing a Deep Autoencoder-based data fusion technique and a regression model such as Support Vector Regression (SVR) or Least Absolute Shrinkage and Selection Operator (LASSO), the proposed method outperforms using individual feature sets in combination with a regressor model. This superiority is measured by evaluating the Root Mean Squared Error (RMSE), Mean Absolute Percentage Error (MAPE), and Correlation Coefficient (CC). In other words, the fusion of autoencoder-based amplitude EEG power features and PCS features, when used in regression, outperforms using either of these features alone in a regressor model. Specifically, the proposed data fusion method achieves a 14.36% reduction in RMSE, a 25.12% reduction in MAPE, and a 10.12% increase in CC compared to the baseline model using only individual amplitude EEG power features and regression.
Submitted 1 December, 2023;
originally announced December 2023.
-
A Comparative Study of Filters and Deep Learning Models to predict Diabetic Retinopathy
Authors:
Roshan Vasu Muddaluru,
Sharvaani Ravikumar Thoguluva,
Shruti Prabha,
Tanuja Konda Reddy,
Suja Palaniswamy
Abstract:
The retina is an essential component of the visual system, and maintaining eyesight depends on the timely and accurate detection of disorders. The early-stage detection and severity classification of Diabetic Retinopathy (DR), a significant risk to public health, is the primary goal of this work. This study compares the outcomes of various deep learning models, including InceptionNetV3, DenseNet121, and other CNN-based models, utilizing a variety of image filters, including Gaussian, grayscale, and Gabor. These models could detect subtle pathological alterations and use that information to estimate the risk of retinal illnesses. The objective is to improve the diagnostic processes for DR, the primary cause of diabetes-related blindness, by utilizing deep learning models. A comparative analysis of the grayscale, Gaussian, and Gabor filters is provided after applying these filters to the retinal images. The Gaussian filter was identified as the most promising, yielding 96% accuracy with InceptionNetV3.
Submitted 9 January, 2024; v1 submitted 26 September, 2023;
originally announced September 2023.
-
MOSAIC: Multi-Object Segmented Arbitrary Stylization Using CLIP
Authors:
Prajwal Ganugula,
Y S S S Santosh Kumar,
N K Sagar Reddy,
Prabhath Chellingi,
Avinash Thakur,
Neeraj Kasera,
C Shyam Anand
Abstract:
Style transfer driven by text prompts paved a new path for creatively stylizing images without collecting an actual style image. Despite promising results, text-driven stylization gives the user no control over the stylization. If a user wants to create an artistic image, the user requires fine control over the stylization of various entities individually in the content image, which is not addressed by the current state-of-the-art approaches. Diffusion style transfer methods suffer from the same issue because the regional stylization control over the stylized output is ineffective. To address this problem, we propose a new method, Multi-Object Segmented Arbitrary Stylization Using CLIP (MOSAIC), that can apply styles to different objects in the image based on the context extracted from the input prompt. Text-based segmentation and stylization modules, which are based on a vision transformer architecture, are used to segment and stylize the objects. Our method can extend to arbitrary objects and styles and produces high-quality images compared to the current state-of-the-art methods. To our knowledge, this is the first attempt to perform text-guided arbitrary object-wise stylization. We demonstrate the effectiveness of our approach through qualitative and quantitative analysis, showing that it can generate visually appealing stylized images with enhanced control over stylization and the ability to generalize to unseen object classes.
Submitted 24 September, 2023;
originally announced September 2023.
-
Wavefront Engineering: Realizing Efficient Terahertz Band Communications in 6G and Beyond
Authors:
Arjun Singh,
Vitaly Petrov,
Hichem Guerboukha,
Innem V. A. K. Reddy,
Edward W. Knightly,
Daniel M. Mittleman,
Josep M. Jornet
Abstract:
Terahertz (THz) band communications is envisioned as a key technology for future wireless standards. Substantial progress has been made in this field, with advances in hardware design, channel models, and signal processing. High-rate backhaul links operating at sub-THz frequencies have been experimentally demonstrated. However, there are inherent challenges in making the next great leap for adopting the THz band in widespread communication systems, such as cellular access and wireless local area networks. Primarily, such systems have to be both: (i) wideband, to maintain desired data rate and sensing resolution; and, more importantly, (ii) operate in the massive near field of the high-gain devices required to overcome the propagation losses. In this article, it is first explained why the state-of-the-art techniques from lower frequencies, including millimeter-wave, cannot be simply repurposed to realize THz band communication systems. Then, a vision of wavefront engineering is presented to address these shortfalls. Further, it is illustrated how novel implementations of specific wavefronts, such as Bessel beams and Airy beams, offer attractive advantages in creating THz links over state-of-the-art far-field beamforming and near-field beamfocusing techniques. The paper ends by discussing novel problems and challenges in this new and exciting research area.
Index Terms - Terahertz Communications; 6G; Wavefront Engineering; Bessel beams; Near field; Orbital Angular Momentum
Submitted 21 May, 2023;
originally announced May 2023.
-
Identifying TBI Physiological States by Clustering Multivariate Clinical Time-Series Data
Authors:
Hamid Ghaderi,
Brandon Foreman,
Amin Nayebi,
Sindhu Tipirneni,
Chandan K. Reddy,
Vignesh Subbian
Abstract:
Determining clinically relevant physiological states from multivariate time series data with missing values is essential for providing appropriate treatment for acute conditions such as Traumatic Brain Injury (TBI), respiratory failure, and heart failure. Utilizing non-temporal clustering or data imputation and aggregation techniques may lead to loss of valuable information and biased analyses. In…
▽ More
Determining clinically relevant physiological states from multivariate time series data with missing values is essential for providing appropriate treatment for acute conditions such as Traumatic Brain Injury (TBI), respiratory failure, and heart failure. Utilizing non-temporal clustering or data imputation and aggregation techniques may lead to loss of valuable information and biased analyses. In our study, we apply the SLAC-Time algorithm, an innovative self-supervision-based approach that maintains data integrity by avoiding imputation or aggregation, offering a more useful representation of acute patient states. By using SLAC-Time to cluster data in a large research dataset, we identified three distinct TBI physiological states and their specific feature profiles. We employed various clustering evaluation metrics and incorporated input from a clinical domain expert to validate and interpret the identified physiological states. Further, we discovered how specific clinical events and interventions can influence patient states and state transitions.
△ Less
Submitted 17 July, 2023; v1 submitted 23 March, 2023;
originally announced March 2023.
-
Evince the artifacts of Spoof Speech by blending Vocal Tract and Voice Source Features
Authors:
Tadipatri Uday Kiran Reddy,
Sahukari Chaitanya Varun,
Kota Pranav Kumar Sankala Sreekanth,
Kodukula Sri Rama Murty
Abstract:
With the rapid advancement in synthetic speech generation technologies, great interest in differentiating spoof speech from natural speech is emerging in the research community. The identification of these synthetic signals is a difficult task not only for cutting-edge classification models but also for humans themselves. To prevent potential adverse effects, it becomes crucial to detect spoof signals. From a forensics perspective, it is also important to predict the algorithm which generated them to identify the forger. This needs an understanding of the underlying attributes of spoof signals which serve as a signature for the synthesizer. This study emphasizes the segments of speech signals critical in identifying their authenticity by utilizing the Vocal Tract System (VTS) and Voice Source (VS) features.
In this paper, we propose a system that detects spoof signals as well as identifies the corresponding speech-generating algorithm. We achieve 99.58% algorithm classification accuracy. From experiments, we found that a VS feature-based system gives more attention to the transition of phonemes, while a VTS feature-based system gives more attention to stationary segments of speech signals. We perform model fusion techniques on the VS-based and VTS-based systems to exploit the complementary information to develop a robust classifier. Upon analyzing the confusion plots, we found that WaveRNN is poorly classified, suggesting greater naturalness. On the other hand, we identified that synthesizers such as Waveform Concatenation and Neural Source Filter are classified with the highest accuracy. Practical implications of this work can aid researchers from both the forensics (leverage artifacts) and the speech (mitigate artifacts) communities.
Submitted 4 December, 2022;
originally announced December 2022.
-
Using Rater and System Metadata to Explain Variance in the VoiceMOS Challenge 2022 Dataset
Authors:
Michael Chinen,
Jan Skoglund,
Chandan K A Reddy,
Alessandro Ragano,
Andrew Hines
Abstract:
Non-reference speech quality models are important for a growing number of applications. The VoiceMOS 2022 challenge provided a dataset of synthetic voice conversion and text-to-speech samples with subjective labels. This study looks at the amount of variance that can be explained in subjective ratings of speech quality from metadata and the distribution imbalances of the dataset. Speech quality models were constructed using wav2vec 2.0 with additional metadata features that included rater groups and system identifiers and obtained competitive metrics including a Spearman rank correlation coefficient (SRCC) of 0.934 and MSE of 0.088 at the system-level, and 0.877 and 0.198 at the utterance-level. Using data and metadata that the test restricted or blinded further improved the metrics. A metadata analysis showed that the system-level metrics do not represent the model's system-level prediction as a result of the wide variation in the number of utterances used for each system on the validation and test datasets. We conclude that, in general, conditions should have enough utterances in the test set to bound the sample mean error, and be relatively balanced in utterance count between systems, otherwise the utterance-level metrics may be more reliable and interpretable.
Submitted 13 September, 2022;
originally announced September 2022.
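The abstract above reports SRCC and MSE at both the utterance level and the system level, and notes that uneven utterance counts per system can distort the system-level figures. The sketch below shows one plain way such metrics could be computed. It is an illustration under assumptions, not the challenge's scoring code: all function names are hypothetical, ties are handled with average ranks, and system-level scores are taken to be per-system means.

```python
import math
from collections import defaultdict


def average_ranks(values):
    # Rank from 1..n; tied values share the mean of their rank positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def spearman(x, y):
    # Spearman rank correlation = Pearson correlation of the ranks.
    return pearson(average_ranks(x), average_ranks(y))


def mse(predicted, target):
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def system_means(scores, system_ids):
    # Collapse utterance-level scores to one mean per system (sorted by id).
    buckets = defaultdict(list)
    for score, system in zip(scores, system_ids):
        buckets[system].append(score)
    return [sum(buckets[s]) / len(buckets[s]) for s in sorted(buckets)]
```

Utterance-level metrics apply `spearman` and `mse` to the raw lists; system-level metrics apply them to the outputs of `system_means` for predictions and labels. A system with a single utterance contributes a mean with no averaging at all, which is one way to see the imbalance concern raised in the abstract.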
-
A Comparison of Deep Learning MOS Predictors for Speech Synthesis Quality
Authors:
Alessandro Ragano,
Emmanouil Benetos,
Michael Chinen,
Helard B. Martinez,
Chandan K. A. Reddy,
Jan Skoglund,
Andrew Hines
Abstract:
Speech synthesis quality prediction has made remarkable progress with the development of supervised and self-supervised learning (SSL) MOS predictors but some aspects related to the data are still unclear and require further study. In this paper, we evaluate several MOS predictors based on wav2vec 2.0 and the NISQA speech quality prediction model to explore the role of the training data, the influence of the system type, and the role of cross-domain features in SSL models. Our evaluation is based on the VoiceMOS challenge dataset. Results show that SSL-based models show the highest correlation and lowest mean squared error compared to supervised models. The key point of this study is that benchmarking the statistical performance of MOS predictors alone is not sufficient to rank models since potential issues hidden in the data could bias the evaluated performances.
Submitted 24 November, 2023; v1 submitted 5 April, 2022;
originally announced April 2022.
-
MusicNet: Compact Convolutional Neural Network for Real-time Background Music Detection
Authors:
Chandan K. A. Reddy,
Vishak Gopa,
Harishchandra Dubey,
Sergiy Matusevych,
Ross Cutler,
Robert Aichner
Abstract:
With the recent growth of remote work, online meetings often encounter challenging audio contexts such as background noise, music, and echo. Accurate real-time detection of music events can help to improve the user experience. In this paper, we present MusicNet, a compact neural model for detecting background music in the real-time communications pipeline. In video meetings, music frequently co-occurs with speech and background noises, making the accurate classification quite challenging. We propose a compact convolutional neural network core preceded by an in-model featurization layer. MusicNet takes 9 seconds of raw audio as input and does not require any model-specific featurization in the product stack. We train our model on the balanced subset of the Audio Set data and validate it on 1000 crowd-sourced real test clips. Finally, we compare MusicNet performance with 20 state-of-the-art models. MusicNet has a true positive rate (TPR) of 81.3% at a 0.1% false positive rate (FPR), which is significantly better than state-of-the-art models included in our study. MusicNet is also 10x smaller and has 4x faster inference than the best performing models we benchmarked.
Submitted 15 April, 2022; v1 submitted 8 October, 2021;
originally announced October 2021.
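The abstract above quotes a true positive rate at a fixed false positive rate (81.3% TPR at 0.1% FPR). A minimal sketch of how such an operating point can be read off a set of detector scores follows. It is not the paper's evaluation code: the function name, the "score >= threshold means music" convention, and the 0/1 label encoding are assumptions.

```python
def tpr_at_fpr(scores, labels, max_fpr):
    """Highest TPR achievable while keeping FPR <= max_fpr.

    scores: detector outputs, higher means more likely positive.
    labels: 1 for positive clips, 0 for negative clips.
    """
    positives = [s for s, l in zip(scores, labels) if l == 1]
    negatives = [s for s, l in zip(scores, labels) if l == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    best_tpr = 0.0  # a threshold above every score gives TPR = FPR = 0
    # Sweep thresholds from strictest to loosest; both rates only grow.
    for threshold in sorted(set(scores), reverse=True):
        fpr = sum(s >= threshold for s in negatives) / len(negatives)
        if fpr > max_fpr:
            break
        best_tpr = sum(s >= threshold for s in positives) / len(positives)
    return best_tpr
```

Note that with 1000 test clips, a 0.1% FPR budget allows at most one false alarm among the negatives, so the reported operating point is sensitive to a handful of hard negative clips.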
-
DNSMOS P.835: A Non-Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors
Authors:
Chandan K A Reddy,
Vishak Gopal,
Ross Cutler
Abstract:
Human subjective evaluation is the gold standard to evaluate speech quality optimized for human perception. Perceptual objective metrics serve as a proxy for subjective scores. We have recently developed a non-intrusive speech quality metric called Deep Noise Suppression Mean Opinion Score (DNSMOS) using the scores from ITU-T Rec. P.808 subjective evaluation. The P.808 scores reflect the overall quality of the audio clip. ITU-T Rec. P.835 subjective evaluation framework gives the standalone quality scores of speech and background noise in addition to the overall quality. In this work, we train an objective metric based on P.835 human ratings that outputs 3 scores: i) speech quality (SIG), ii) background noise quality (BAK), and iii) the overall quality (OVRL) of the audio. The developed metric is highly correlated with human ratings, with a Pearson's Correlation Coefficient (PCC)=0.94 for SIG and PCC=0.98 for BAK and OVRL. This is the first non-intrusive P.835 predictor we are aware of. DNSMOS P.835 is made publicly available as an Azure service.
Submitted 4 February, 2022; v1 submitted 4 October, 2021;
originally announced October 2021.
-
The CORSMAL benchmark for the prediction of the properties of containers
Authors:
Alessio Xompero,
Santiago Donaher,
Vladimir Iashin,
Francesca Palermo,
Gökhan Solak,
Claudio Coppola,
Reina Ishikawa,
Yuichi Nagao,
Ryo Hachiuma,
Qi Liu,
Fan Feng,
Chuanlin Lan,
Rosa H. M. Chan,
Guilherme Christmann,
Jyun-Ting Song,
Gonuguntla Neeharika,
Chinnakotla Krishna Teja Reddy,
Dinesh Jain,
Bakhtawar Ur Rehman,
Andrea Cavallaro
Abstract:
The contactless estimation of the weight of a container and the amount of its content manipulated by a person are key pre-requisites for safe human-to-robot handovers. However, opaqueness and transparencies of the container and the content, and variability of materials, shapes, and sizes, make this estimation difficult. In this paper, we present a range of methods and an open framework to benchmark acoustic and visual perception for the estimation of the capacity of a container, and the type, mass, and amount of its content. The framework includes a dataset, specific tasks and performance measures. We conduct an in-depth comparative analysis of methods that used this framework and audio-only or vision-only baselines designed from related works. Based on this analysis, we can conclude that audio-only and audio-visual classifiers are suitable for the estimation of the type and amount of the content using different types of convolutional neural networks, combined with either recurrent neural networks or a majority voting strategy, whereas computer vision methods are suitable to determine the capacity of the container using regression and geometric approaches. Classifying the content type and level using only audio achieves a weighted average F1-score up to 81% and 97%, respectively. Estimating the container capacity with vision-only approaches and estimating the filling mass with audio-visual multi-stage approaches reach up to 65% weighted average capacity and mass scores. These results show that there is still room for improvement on the design of new methods. These new methods can be ranked and compared on the individual leaderboards provided by our open framework.
Submitted 21 April, 2022; v1 submitted 27 July, 2021;
originally announced July 2021.
-
Turbo Coded Single User Massive MIMO
Authors:
K. Vasudevan,
A. Phani Kumar Reddy,
Gyanesh Kumar Pathak,
Mahmoud Albreem
Abstract:
This work deals with turbo coded single user massive multiple input multiple output (SU-MMIMO) systems, with and without precoding. SU-MMIMO has a much higher spectral efficiency compared to multi-user massive MIMO (MU-MMIMO) since independent signals are transmitted from each of the antenna elements (spatial multiplexing). MU-MMIMO that uses beamforming has a much lower spectral efficiency, since the same signal (with a delay) is transmitted from each of the antenna elements. In this work, expressions for the upper bound on the average signal-to-noise ratio (SNR) per bit and spectral efficiency are derived for SU-MMIMO with and without precoding. We propose a performance index $f(N_t)$, which is a function of the number of transmit antennas $N_t$. Here $f(N_t)$ is the sum of the upper bound on the average SNR per bit and the spectral efficiency. We demonstrate that when the total number of antennas ($N_{\mathrm{tot}}$) in the transmitter and receiver is fixed, there exists a minimum value of $f(N_t)$, which has to be avoided. Computer simulations show that the bit-error-rate (BER) is nearly insensitive to a wide range of the number of transmit antennas and re-transmissions, when $N_{\mathrm{tot}}$ is large and kept constant. Thus, the spectral efficiency can be made as large as possible, for a given BER and $N_{\mathrm{tot}}$.
Submitted 6 July, 2021;
originally announced July 2021.
-
Towards efficient models for real-time deep noise suppression
Authors:
Sebastian Braun,
Hannes Gamper,
Chandan K. A. Reddy,
Ivan Tashev
Abstract:
With recent research advancements, deep learning models are becoming attractive and powerful choices for speech enhancement in real-time applications. While state-of-the-art models can achieve outstanding results in terms of speech quality and background noise reduction, the main challenge is to obtain models that are compact enough to be resource efficient at inference time. An important but often neglected aspect of data-driven methods is that results can only be convincing when tested on real-world data and evaluated with useful metrics. In this work, we investigate reasonably small recurrent and convolutional-recurrent network architectures for speech enhancement, trained on a large dataset that also includes reverberation. We show interesting tradeoffs between computational complexity and the achievable speech quality, measured on real recordings using a highly accurate MOS estimator. We show that the achievable speech quality is a function of network complexity, and identify which models have better tradeoffs.
Submitted 19 May, 2021; v1 submitted 22 January, 2021;
originally announced January 2021.
-
Interspeech 2021 Deep Noise Suppression Challenge
Authors:
Chandan K A Reddy,
Harishchandra Dubey,
Kazuhito Koishida,
Arun Nair,
Vishak Gopal,
Ross Cutler,
Sebastian Braun,
Hannes Gamper,
Robert Aichner,
Sriram Srinivasan
Abstract:
The Deep Noise Suppression (DNS) challenge is designed to foster innovation in the area of noise suppression to achieve superior perceptual speech quality. We recently organized a DNS challenge special session at INTERSPEECH and ICASSP 2020. We open-sourced training and test datasets for the wideband scenario. We also open-sourced a subjective evaluation framework based on ITU-T standard P.808, which was also used to evaluate participants of the challenge. Many researchers from academia and industry made significant contributions to push the field forward, yet even the best noise suppressor was far from achieving superior speech quality in challenging scenarios. In this version of the challenge organized at INTERSPEECH 2021, we are expanding both our training and test datasets to accommodate full band scenarios. The two tracks in this challenge will focus on real-time denoising for (i) wide band, and (ii) full band scenarios. We are also making available a reliable non-intrusive objective speech quality metric called DNSMOS for the participants to use during their development phase.
Submitted 4 April, 2021; v1 submitted 6 January, 2021;
originally announced January 2021.
-
DNSMOS: A Non-Intrusive Perceptual Objective Speech Quality metric to evaluate Noise Suppressors
Authors:
Chandan K A Reddy,
Vishak Gopal,
Ross Cutler
Abstract:
Human subjective evaluation is the gold standard to evaluate speech quality optimized for human perception. Perceptual objective metrics serve as a proxy for subjective scores. The conventional and widely used metrics require a reference clean speech signal, which is unavailable in real recordings. The no-reference approaches correlate poorly with human ratings and are not widely adopted in the research community. One of the biggest use cases of these perceptual objective metrics is to evaluate noise suppression algorithms. This paper introduces a multi-stage self-teaching based perceptual objective metric that is designed to evaluate noise suppressors. The proposed method generalizes well in challenging test conditions with a high correlation to human ratings.
Submitted 10 February, 2021; v1 submitted 28 October, 2020;
originally announced October 2020.
-
Latency Analysis for IMT-2020 Radio Interface Technology Evaluation
Authors:
A. Phani Kumar Reddy,
Navin Kumar,
Sri Sai Apoorva Tirumalasetty,
Srinivasan S,
Vinosh Babu James J
Abstract:
The International Telecommunication Union (ITU) is currently deliberating on the finalization of candidate radio interface technologies (RITs) for IMT-2020 (International Mobile Telecommunications) suitability. The candidate technologies are currently being evaluated and, after a couple of ITU Radiocommunication sector (ITU-R) working party (WP) meetings, they will become official. Although products based on the candidate technology from 3GPP (5G new radio (NR)) are already commercial in several operator networks, the ITU is yet to officially declare it as IMT-2020 qualified. Along with the evaluation of the 3GPP 5G NR specifications, our group has evaluated many other proponent technologies. The entire 3GPP specifications were examined and evaluated through simulation, using Matlab and our own simulator, which is written in the Go language. The simulator can evaluate complete 5G NR performance using the IMT-2020 evaluation framework. In this work, we present latency parameters that show some minor differences from the 3GPP report. The differences are observed especially for the time division duplexing (TDD) mode of operation. It is possible that the differences are due to assumptions made outside the scope of the evaluation. However, we considered the worst-case parameter. Although the report has been submitted to the ITU, it is also important for the research community to understand why the differences arise and what assumptions were made in the scenarios for which they are observed.
Submitted 26 September, 2020;
originally announced September 2020.
-
ICASSP 2021 Deep Noise Suppression Challenge
Authors:
Chandan K A Reddy,
Harishchandra Dubey,
Vishak Gopal,
Ross Cutler,
Sebastian Braun,
Hannes Gamper,
Robert Aichner,
Sriram Srinivasan
Abstract:
The Deep Noise Suppression (DNS) challenge is designed to foster innovation in the area of noise suppression to achieve superior perceptual speech quality. We recently organized a DNS challenge special session at INTERSPEECH 2020. We open sourced training and test datasets for researchers to train their noise suppression models. We also open sourced a subjective evaluation framework and used the tool to evaluate and pick the final winners. Many researchers from academia and industry made significant contributions to push the field forward. We also learned that as a research community, we still have a long way to go in achieving excellent speech quality in challenging noisy real-time conditions. In this challenge, we are expanding both our training and test datasets. There are two tracks with one focusing on real-time denoising and the other focusing on real-time personalized deep noise suppression. We also make a non-intrusive objective speech quality metric called DNSMOS available for participants to use during their development stages. The final evaluation will be based on subjective tests.
Submitted 26 October, 2020; v1 submitted 13 September, 2020;
originally announced September 2020.
-
Guided Policy Search Based Control of a High Dimensional Advanced Manufacturing Process
Authors:
Amit Surana,
Kishore Reddy,
Matthew Siopis
Abstract:
In this paper we apply a guided policy search (GPS) based reinforcement learning framework to a high dimensional optimal control problem arising in an additive manufacturing process. The problem consists of controlling the process parameters so that layer-wise deposition of material leads to desired geometric characteristics of the resulting part surface while minimizing the material deposited. A realistic simulation model of the deposition process, along with a carefully selected set of guiding distributions generated using the iterative Linear Quadratic Regulator, is used to train a neural network policy using GPS. A closed loop control based on the trained policy and in-situ measurement of the deposition profile is tested experimentally, and shows promising performance.
Submitted 12 September, 2020;
originally announced September 2020.
-
Turbo Coded Single User Massive MIMO with Precoding
Authors:
K. Vasudevan,
Gyanesh Kumar Pathak,
A. Phani Kumar Reddy
Abstract:
Precoding is a method of compensating for the channel at the transmitter. This work presents a novel method of data detection in turbo coded single user massive multiple input multiple output (MIMO) systems using precoding. We show via computer simulations that, when precoding is used, re-transmitting the data does not result in a significant reduction in bit-error-rate (BER), thus increasing the spectral efficiency, compared to the case without precoding. Moreover, increasing the number of transmit and receive antennas results in improved BER.
Submitted 31 July, 2020;
originally announced July 2020.
-
The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, Subjective Testing Framework, and Challenge Results
Authors:
Chandan K. A. Reddy,
Vishak Gopal,
Ross Cutler,
Ebrahim Beyrami,
Roger Cheng,
Harishchandra Dubey,
Sergiy Matusevych,
Robert Aichner,
Ashkan Aazami,
Sebastian Braun,
Puneet Rana,
Sriram Srinivasan,
Johannes Gehrke
Abstract:
The INTERSPEECH 2020 Deep Noise Suppression (DNS) Challenge is intended to promote collaborative research in real-time single-channel Speech Enhancement aimed at maximizing the subjective (perceptual) quality of the enhanced speech. A typical approach to evaluating noise suppression methods is to use objective metrics on a test set obtained by splitting the original dataset. While the performance is good on the synthetic test set, the model performance often degrades significantly on real recordings. Also, most of the conventional objective metrics do not correlate well with subjective tests, and lab subjective tests are not scalable for a large test set. In this challenge, we open-sourced a large clean speech and noise corpus for training the noise suppression models and a test set representative of real-world scenarios, consisting of both synthetic and real recordings. We also open-sourced an online subjective test framework based on ITU-T P.808 for researchers to reliably test their developments. We evaluated the results using P.808 on a blind test set. The results and the key learnings from the challenge are discussed. The datasets and scripts can be found at https://github.com/microsoft/DNS-Challenge.
Submitted 18 October, 2020; v1 submitted 16 May, 2020;
originally announced May 2020.
-
A CNN-LSTM Quantifier for Single Access Point CSI Indoor Localization
Authors:
Minh Tu Hoang,
Brosnan Yuen,
Kai Ren,
Xiaodai Dong,
Tao Lu,
Robert Westendorp,
Kishore Reddy
Abstract:
This paper proposes a network structure that combines a convolutional neural network (CNN) with a long short-term memory (LSTM) quantifier for WiFi fingerprinting indoor localization. In contrast to conventional methods that utilize only spatial data with classification models, our CNN-LSTM network extracts both space and time features of the received channel state information (CSI) from a single router. Furthermore, the proposed network builds a quantification model rather than a limited classification model as in most of the literature, which enables the estimation of testing points that are not identical to the reference points. We analyze the instability of CSI and demonstrate a mitigation solution using a comprehensive filter and normalization scheme. The localization accuracy is investigated through extensive on-site experiments with several mobile devices, including a mobile phone (Nexus 5) and a laptop (Intel 5300 NIC), on hundreds of testing locations. Using only a single WiFi router, our structure achieves an average localization error of 2.5 m with 80% of the errors under 4 m, which outperforms the other reported algorithms by approximately 50% under the same test environment.
Submitted 13 May, 2020;
originally announced May 2020.
-
Image Generation Via Minimizing Fréchet Distance in Discriminator Feature Space
Authors:
Khoa D. Doan,
Saurav Manchanda,
Fengjiao Wang,
Sathiya Keerthi,
Avradeep Bhowmik,
Chandan K. Reddy
Abstract:
For a given image generation problem, the intrinsic image manifold is often low dimensional. We use the intuition that it is much better to train the GAN generator by minimizing the distributional distance between real and generated images in a low-dimensional feature space representing such a manifold than in the original pixel space. We use the feature space of the GAN discriminator for such a representation. For distributional distance, we employ one of two choices: the Fréchet distance or direct optimal transport (OT); these respectively lead us to two new GAN methods: Fréchet-GAN and OT-GAN. The idea of employing Fréchet distance comes from the success of Fréchet Inception Distance as a solid evaluation metric in image generation. Fréchet-GAN is attractive in several ways. We propose an efficient, numerically stable approach to calculate the Fréchet distance and its gradient. The Fréchet distance estimation requires significantly less computation time than OT; this allows Fréchet-GAN to use a much larger mini-batch size in training than OT-GAN. More importantly, we conduct experiments on a number of benchmark datasets and show that Fréchet-GAN (in particular) and OT-GAN have significantly better image generation capabilities than the existing representative primal and dual GAN approaches based on the Wasserstein distance.
Submitted 30 March, 2020; v1 submitted 26 March, 2020;
originally announced March 2020.
-
Weighted Speech Distortion Losses for Neural-network-based Real-time Speech Enhancement
Authors:
Yangyang Xia,
Sebastian Braun,
Chandan K. A. Reddy,
Harishchandra Dubey,
Ross Cutler,
Ivan Tashev
Abstract:
This paper investigates several aspects of training an RNN (recurrent neural network) that impact the objective and subjective quality of enhanced speech for real-time single-channel speech enhancement. Specifically, we focus on an RNN that enhances short-time speech spectra on a single-frame-in, single-frame-out basis, a framework adopted by most classical signal processing methods. We propose two novel mean-squared-error-based learning objectives that enable separate control over the importance of speech distortion versus noise reduction. The proposed loss functions are evaluated by widely accepted objective quality and intelligibility measures and compared to other competitive online methods. In addition, we study the impact of feature normalization and varying batch sequence lengths on the objective quality of enhanced speech. Finally, we show subjective ratings for the proposed approach and a state-of-the-art real-time RNN-based method.
Submitted 12 February, 2020; v1 submitted 28 January, 2020;
originally announced January 2020.
-
Noise dependent Super Gaussian-Coherence based dual microphone Speech Enhancement for hearing aid application using smartphone
Authors:
Nikhil Shankar,
Gautam S Bhat,
Chandan K A Reddy,
Issa Panahi
Abstract:
In this paper, the coherence between speech and noise signals is used to obtain a Speech Enhancement (SE) gain function, in combination with a Super Gaussian Joint Maximum a Posteriori (SGJMAP) single microphone SE gain function. The proposed SE method can be implemented on a smartphone that works as an assistive device to hearing aids. Although the coherence SE gain function suppresses the background noise well, it distorts the speech. In contrast, SE using SGJMAP improves speech quality but adds musical noise, which we contain by using a post filter. The weighted union of these two gain functions strikes a balance between noise suppression and speech distortion. A 'weighting' parameter is introduced in the derived gain function to allow the smartphone user to control the weighting factor based on the background noise and their level of hearing comfort. Objective and subjective measures of the proposed method show effective improvement in comparison to standard techniques considered in this paper for several noisy conditions at signal to noise ratio levels of -5 dB, 0 dB and 5 dB.
Submitted 26 January, 2020;
originally announced January 2020.
-
The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, Subjective Speech Quality and Testing Framework
Authors:
Chandan K. A. Reddy,
Ebrahim Beyrami,
Harishchandra Dubey,
Vishak Gopal,
Roger Cheng,
Ross Cutler,
Sergiy Matusevych,
Robert Aichner,
Ashkan Aazami,
Sebastian Braun,
Puneet Rana,
Sriram Srinivasan,
Johannes Gehrke
Abstract:
The INTERSPEECH 2020 Deep Noise Suppression Challenge is intended to promote collaborative research in real-time single-channel Speech Enhancement aimed at maximizing the subjective (perceptual) quality of the enhanced speech. A typical approach to evaluating noise suppression methods is to use objective metrics on a test set obtained by splitting the original dataset. Many publications report reasonable performance on a synthetic test set drawn from the same distribution as the training set. However, the model performance often degrades significantly on real recordings. Also, most of the conventional objective metrics do not correlate well with subjective tests, and lab subjective tests are not scalable for a large test set. In this challenge, we open-source a large clean speech and noise corpus for training the noise suppression models and a test set representative of real-world scenarios, consisting of both synthetic and real recordings. We also open-source an online subjective test framework based on ITU-T P.808 for researchers to quickly test their developments. The winners of this challenge will be selected based on subjective evaluation on a representative test set using the P.808 framework.
Submitted 19 April, 2020; v1 submitted 23 January, 2020;
originally announced January 2020.
-
Semi-Sequential Probabilistic Model For Indoor Localization Enhancement
Authors:
Minh Tu Hoang,
Brosnan Yuen,
Xiaodai Dong,
Tao Lu,
Robert Westendorp,
Kishore Reddy
Abstract:
This paper proposes a semi-sequential probabilistic model (SSP) that applies an additional short-term memory to enhance the performance of probabilistic indoor localization. Conventional probabilistic methods normally treat the locations in the database indiscriminately. In contrast, SSP leverages the information of the previous position to determine the probable location, since the user's speed in an indoor environment is bounded and locations near the previous one have higher probability than the other locations. Although SSP utilizes the previous location information, it does not require the exact moving speed and direction of the user. On-site experiments using the received signal strength indicator (RSSI) and channel state information (CSI) fingerprints for localization demonstrate that SSP reduces the maximum error and boosts the performance of existing probabilistic approaches by 25% to 30%.
Submitted 8 January, 2020;
originally announced January 2020.
-
A scalable noisy speech dataset and online subjective test framework
Authors:
Chandan K. A. Reddy,
Ebrahim Beyrami,
Jamie Pool,
Ross Cutler,
Sriram Srinivasan,
Johannes Gehrke
Abstract:
Background noise is a major source of quality impairments in Voice over Internet Protocol (VoIP) and Public Switched Telephone Network (PSTN) calls. Recent work shows the efficacy of deep learning for noise suppression, but the datasets have been relatively small compared to those used in other domains (e.g., ImageNet) and the associated evaluations have been more focused. In order to better facilitate deep learning research in Speech Enhancement, we present a noisy speech dataset (MS-SNSD) that can scale to arbitrary sizes depending on the number of speakers, noise types, and Speech to Noise Ratio (SNR) levels desired. We show that increasing dataset sizes increases noise suppression performance as expected. In addition, we provide an open-source evaluation methodology to evaluate the results subjectively at scale using crowdsourcing, with a reference algorithm to normalize the results. To demonstrate the dataset and evaluation framework we apply it to several noise suppressors and compare the subjective Mean Opinion Score (MOS) with objective quality measures such as SNR, PESQ, POLQA, and VISQOL and show why MOS is still required. Our subjective MOS evaluation is the first large scale evaluation of Speech Enhancement algorithms that we are aware of.
Submitted 17 September, 2019;
originally announced September 2019.
-
DNN-based cross-lingual voice conversion using Bottleneck Features
Authors:
M Kiran Reddy,
K Sreenivasa Rao
Abstract:
Cross-lingual voice conversion (CLVC) is a challenging task since the source and target speakers speak different languages. This paper proposes a CLVC framework based on bottleneck features and a deep neural network (DNN). In the proposed method, the bottleneck features extracted from a deep auto-encoder (DAE) are used to represent speaker-independent features of speech signals from different languages. A DNN model is trained to learn the mapping between bottleneck features and the corresponding spectral features of the target speaker. The proposed method can capture speaker-specific characteristics of a target speaker, and hence requires no speech data from the source speaker during training. The performance of the proposed method is evaluated using data from three Indian languages: Telugu, Tamil and Malayalam. The experimental results show that the proposed method outperforms the baseline Gaussian mixture model (GMM)-based CLVC approach.
Submitted 10 September, 2019; v1 submitted 9 September, 2019;
originally announced September 2019.
-
Multilingual and Multimode Phone Recognition System for Indian Languages
Authors:
Kumud Tripathi,
M. Kiran Reddy,
K. Sreenivasa Rao
Abstract:
The aim of this paper is to develop a flexible framework capable of automatically recognizing phonetic units present in a speech utterance of any language spoken in any mode. In this study, we considered two modes of speech, conversation and read, in four Indian languages, namely Telugu, Kannada, Odia, and Bengali. The proposed approach consists of two stages: (1) automatic speech mode classification (SMC) and (2) automatic phonetic recognition using a mode-specific multilingual phone recognition system (MPRS). In this work, the vocal tract and excitation source features are considered for the SMC task. SMC systems are developed using a multilayer perceptron (MLP). Further, vocal tract, excitation source, and tandem features are used to build the deep neural network (DNN)-based MPRSs. The performance of the proposed approach is compared with mode-dependent MPRSs. Experimental results show that the proposed approach, which combines both SMC and MPRS into a single system, outperforms the baseline mode-dependent MPRSs.
Submitted 23 August, 2019;
originally announced August 2019.
-
Supervised Classifiers for Audio Impairments with Noisy Labels
Authors:
Chandan K A Reddy,
Ross Cutler,
Johannes Gehrke
Abstract:
Voice-over-Internet-Protocol (VoIP) calls are prone to various speech impairments due to environmental and network conditions resulting in bad user experience. A reliable audio impairment classifier helps to identify the cause for bad audio quality. The user feedback after the call can act as the ground truth labels for training a supervised classifier on a large audio dataset. However, the labels are noisy as most of the users lack the expertise to precisely articulate the impairment in the perceived speech. In this paper, we analyze the effects of massive noise in labels in training dense networks and Convolutional Neural Networks (CNN) using engineered features, spectrograms and raw audio samples as inputs. We demonstrate that CNN can generalize better on the training data with a large number of noisy labels and gives remarkably higher test performance. The classifiers were trained both on randomly generated label noise and the label noise introduced by human errors. We also show that training with noisy labels requires a significant increase in the training dataset size, which is in proportion to the amount of noise in the labels.
Submitted 3 July, 2019;
originally announced July 2019.
-
Recurrent Neural Networks For Accurate RSSI Indoor Localization
Authors:
Minh Tu Hoang,
Brosnan Yuen,
Xiaodai Dong,
Tao Lu,
Robert Westendorp,
Kishore Reddy
Abstract:
This paper proposes recurrent neural networks (RNNs) for fingerprinting indoor localization using WiFi. Instead of locating a user's position one at a time, as conventional algorithms do, our RNN solution aims at trajectory positioning and takes into account the relation among the received signal strength indicator (RSSI) measurements in a trajectory. Furthermore, a weighted average filter is proposed for both input RSSI data and sequential output locations to enhance the accuracy under the temporal fluctuations of RSSI. The results using different types of RNN, including vanilla RNN, long short-term memory (LSTM), gated recurrent unit (GRU) and bidirectional LSTM (BiLSTM), are presented. On-site experiments demonstrate that the proposed structure achieves an average localization error of $0.75$ m with $80\%$ of the errors under $1$ m, which outperforms the conventional KNN algorithms and probabilistic algorithms by approximately $30\%$ under the same test environment.
Submitted 22 October, 2019; v1 submitted 27 March, 2019;
originally announced March 2019.
-
Divergence Framework for EEG based Multiclass Motor Imagery Brain Computer Interface
Authors:
Satyam Kumar,
Tharun Kumar Reddy,
Laxmidhar Behera
Abstract:
As with most real-world data, the ubiquitous presence of non-stationarities in EEG signals significantly perturbs the feature distribution, thus deteriorating the performance of Brain Computer Interfaces. In this letter, a novel method is proposed based on Joint Approximate Diagonalization (JAD) to optimize stationarity for multiclass motor imagery Brain Computer Interface (BCI) in an information theoretic framework. Specifically, in the proposed method, we estimate the subspace that optimizes the discriminability between the classes and simultaneously preserves stationarity within the motor imagery classes. We determine the subspace for the proposed approach through optimization using gradient descent on an orthogonal manifold. The performance of the proposed stationarity enforcing algorithm is compared to that of baseline One-Versus-Rest (OVR)-CSP and JAD on the publicly available BCI competition IV dataset IIa. Results show an improvement in average classification accuracy across subjects over the baseline algorithms, and thus the importance of alleviating within-session non-stationarities.
Submitted 12 January, 2019;
originally announced January 2019.
-
An individualized super Gaussian single microphone Speech Enhancement for hearing aid users with smartphone as an assistive device
Authors:
Chandan K A Reddy,
Nikhil Shankar,
Gautam Bhat,
Ram Charan,
Issa Panahi
Abstract:
In this letter, we derive a new super Gaussian Joint Maximum a Posteriori based single microphone speech enhancement gain function. The developed Speech Enhancement method is implemented on a smartphone, and this arrangement functions as an assistive device to hearing aids. We introduce a tradeoff parameter in the derived gain function that allows the smartphone user to customize their listening preference by controlling the amount of noise suppression and speech distortion in real time, based on the level of hearing comfort they perceive in noisy real-world acoustic environments. Objective quality and intelligibility measures show the effectiveness of the proposed method in comparison to the benchmark techniques considered in this paper. Subjective results reflect the usefulness of the developed Speech Enhancement application in real-world noisy conditions at signal-to-noise ratio levels of 0 dB and 5 dB.
Submitted 10 December, 2018;
originally announced December 2018.
-
A Computationally Efficient and Practically Feasible Two Microphones Blind Speech Separation Method
Authors:
Chandan K A Reddy,
Gautam Bhat,
Nikhil Shankar,
Issa Panahi
Abstract:
Traditionally, Blind Speech Separation techniques are computationally expensive as they update the demixing matrix at every time frame index, making them impractical to use in many real-time applications. In this paper, a robust data-driven two-microphone sound source localization method is used as a criterion to reduce the computational complexity of the Independent Vector Analysis (IVA) Blind Speech Separation (BSS) method. IVA is used to separate convolutively mixed speech and noise sources. The practical feasibility of the proposed method is demonstrated by implementing it on a smartphone device to separate speech and noise in real-world scenarios for hearing-aid applications. The experimental results with objective and subjective tests reveal the practical usability of the developed method in many real-world applications.
Submitted 10 December, 2018;
originally announced December 2018.
-
Database Assisted Automatic Modulation Classification Using Sequential Minimal Optimization
Authors:
K. Pavan Kumar Reddy,
K. Lakhan Shiva,
K. Abhilash,
Y. Yoganandam
Abstract:
In this paper, we propose a novel algorithm for identifying the modulation scheme of an unknown incoming signal in order to mitigate interference with the primary user in Cognitive Radio systems, which is facilitated by using Automatic Modulation Classification (AMC) at the front end of Software Defined Radio (SDR). In this study, we used computer simulations of analog and digital modulations belonging to eleven classes. Spectral based features have been used as input features for Sequential Minimal Optimization (SMO). The features of primary users are stored in a database, and the unknown signal's features are then matched against those in the database. Built upon recently proposed AMC, our new database approach inherits the benefits of the SMO based approach and makes it much more time efficient in classifying an unknown signal, especially in the case of multiple modulation schemes, overcoming the issue of intense computations in constructing features. In various applications, primary users have frequent wireless transmissions, which limits their feature size and saves a few computations. The SMO based classification methodology proves to be over 99% accurate for an SNR of 15 dB, and the accuracy of classification is over 95% for low SNRs around 5 dB.
Submitted 20 June, 2018;
originally announced June 2018.
-
Time-resolved quantitative visualization of complex flow field emanating from an open-ended shock tube by using wavefront measuring camera
Authors:
Biswajit Medhi,
Gopalakrishna M. Hegde,
Kalidevapura Jagannath Reddy,
Debasish Roy,
Ram Mohan Vasu
Abstract:
Quantitative visualization of the shock-induced complex flow field emanating from the open end of a miniaturized hand-driven shock tube (Reddy tube) is presented. During operation, the planar shock wave of Mach number Mi = 1.3 is discharged through the low-pressure driven section, which is kept open to the ambient atmosphere. From the moment of shock discharge, the evolving flow field that follows is recorded quantitatively for 300 µs near the exit of the tube using our newly developed, in-house, high-resolution (16 Mpixel) wavefront measuring camera setup.
Submitted 28 May, 2018; v1 submitted 5 May, 2018;
originally announced May 2018.
-
Multi-Agent Q-Learning for Minimizing Demand-Supply Power Deficit in Microgrids
Authors:
Raghuram Bharadwaj Diddigi,
D. Sai Koti Reddy,
Shalabh Bhatnagar
Abstract:
We consider the problem of minimizing the difference between the demand and the supply of power using microgrids. We set up multiple microgrids that provide electricity to a village. They have access to batteries that can store renewable power and also to the electrical lines from the main grid. During each time period, these microgrids need to decide on the amount of renewable power to be used from the batteries as well as the amount of power needed from the main grid. We formulate this problem in the framework of a Markov Decision Process (MDP), similar to the one discussed in [1]. The power allotment to the village from the main grid is fixed and bounded, whereas the renewable energy generation is uncertain in nature. Therefore, we adapt a distributed version of the popular Reinforcement Learning technique, Multi-Agent Q-Learning, to the problem. Finally, we also consider a variant of this problem where the cost of power production at the main site is taken into consideration. In this scenario, the microgrids need to minimize the demand-supply deficit while maintaining the desired average cost of the power production.
Submitted 28 August, 2017; v1 submitted 25 August, 2017;
originally announced August 2017.
-
Generalized Deterministic Perturbations For Stochastic Gradient Search
Authors:
K. Chandramouli,
K. J. Prabuchandran,
D. Sai Koti Reddy,
Shalabh Bhatnagar
Abstract:
Stochastic optimization (SO) considers the problem of optimizing an objective function in the presence of noise. Most of the solution techniques in SO estimate gradients from noise-corrupted observations of the objective and adjust the parameters of the objective along the direction of the estimated gradients to obtain locally optimal solutions. Two prominent algorithms in SO, namely Random Direction Kiefer-Wolfowitz (RDKW) and Simultaneous Perturbation Stochastic Approximation (SPSA), obtain noisy gradient estimates by randomly perturbing all the parameters simultaneously. This forces the search direction to be random in these algorithms and causes them to suffer additional noise on top of the noise incurred from the samples of the objective. Owing to this additional noise, the idea of using deterministic perturbations instead of random perturbations for gradient estimation has also been studied. Two specific constructions of the deterministic perturbation sequence, using lexicographical ordering and Hadamard matrices, have been explored, and encouraging results have been reported in the literature. In this paper, we characterize the class of deterministic perturbation sequences that can be utilized in the RDKW algorithm. This class expands the set of known deterministic perturbation sequences available in the literature. Using our characterization, we propose a construction of a deterministic perturbation sequence that has the least possible cycle length among all deterministic perturbations. Through simulations, we illustrate the performance gain of the proposed deterministic perturbation sequence in the RDKW algorithm over the Hadamard and the random perturbation counterparts. We establish the convergence of the RDKW algorithm for the generalized class of deterministic perturbations.
Submitted 2 August, 2018; v1 submitted 20 February, 2017;
originally announced February 2017.