-
Towards sound based testing of COVID-19 -- Summary of the first Diagnostics of COVID-19 using Acoustics (DiCOVA) Challenge
Authors:
Neeraj Kumar Sharma,
Ananya Muguli,
Prashant Krishnan,
Rohit Kumar,
Srikanth Raj Chetupalli,
Sriram Ganapathy
Abstract:
The technology development for point-of-care tests (POCTs) targeting respiratory diseases has witnessed a growing demand in the recent past. Investigating the presence of acoustic biomarkers in modalities such as cough, breathing and speech sounds, and using them for building POCTs can offer fast, contactless and inexpensive testing. In view of this, over the past year, we launched the "Coswara" project to collect cough, breathing and speech sound recordings via worldwide crowdsourcing. With this data, a call for development of diagnostic tools was announced at Interspeech 2021 as a special session titled "Diagnostics of COVID-19 using Acoustics (DiCOVA) Challenge". The goal was to bring together researchers and practitioners interested in developing acoustics-based COVID-19 POCTs by enabling them to work on the same set of development and test datasets. As part of the challenge, datasets with breathing, cough, and speech sound samples from COVID-19 and non-COVID-19 individuals were released to the participants. The challenge consisted of two tracks. Track-1 focused only on cough sounds, and participants competed in a leaderboard setting. In Track-2, breathing and speech samples were provided to the participants, without a competitive leaderboard. The challenge attracted more than 85 registrations, with 29 final submissions for Track-1. This paper describes the challenge (datasets, tasks, baseline system), and presents a focused summary of the various systems submitted by the participating teams. An analysis of the results from the top four teams showed that a fusion of the scores from these teams yields an area-under-the-curve of 95.1% on the blind test data. By summarizing the lessons learned, we foresee the challenge overview in this paper helping to accelerate technology for acoustic-based POCTs.
Submitted 21 June, 2021;
originally announced June 2021.
-
Multi-modal Point-of-Care Diagnostics for COVID-19 Based On Acoustics and Symptoms
Authors:
Srikanth Raj Chetupalli,
Prashant Krishnan,
Neeraj Sharma,
Ananya Muguli,
Rohit Kumar,
Viral Nanda,
Lancelot Mark Pinto,
Prasanta Kumar Ghosh,
Sriram Ganapathy
Abstract:
The research direction of identifying acoustic bio-markers of respiratory diseases has received renewed interest following the onset of the COVID-19 pandemic. In this paper, we design an approach to COVID-19 diagnostics using crowd-sourced multi-modal data. The data resource, consisting of acoustic signals like cough, breathing, and speech signals, along with the data of symptoms, is recorded using a web application over a period of ten months. We investigate the use of statistical descriptors of simple time-frequency features for acoustic signals and binary features for the presence of symptoms. Unlike previous works, we primarily focus on the application of simple linear classifiers like logistic regression and support vector machines for acoustic data, while decision tree models are employed on the symptoms data. We show that a multi-modal integration of acoustics and symptoms classifiers achieves an area-under-curve (AUC) of 92.40, a significant improvement over any individual modality. Several ablation experiments are also provided which highlight the acoustic and symptom dimensions that are important for the task of COVID-19 diagnostics.
Submitted 5 June, 2021; v1 submitted 1 June, 2021;
originally announced June 2021.
-
DiCOVA Challenge: Dataset, task, and baseline system for COVID-19 diagnosis using acoustics
Authors:
Ananya Muguli,
Lancelot Pinto,
Nirmala R.,
Neeraj Sharma,
Prashant Krishnan,
Prasanta Kumar Ghosh,
Rohit Kumar,
Shrirama Bhat,
Srikanth Raj Chetupalli,
Sriram Ganapathy,
Shreyas Ramoji,
Viral Nanda
Abstract:
The DiCOVA challenge aims at accelerating research in diagnosing COVID-19 using acoustics (DiCOVA), a topic at the intersection of speech and audio processing, respiratory health diagnosis, and machine learning. This challenge is an open call for researchers to analyze a dataset of sound recordings collected from COVID-19 infected and non-COVID-19 individuals for a two-class classification. These recordings were collected via crowdsourcing from multiple countries, through a website application. The challenge features two tracks, one focusing on cough sounds, and the other on using a collection of breath, sustained vowel phonation, and number counting speech recordings. In this paper, we introduce the challenge and provide a detailed description of the task, and present a baseline system for the task.
Submitted 17 June, 2021; v1 submitted 16 March, 2021;
originally announced March 2021.
-
Neural PLDA Modeling for End-to-End Speaker Verification
Authors:
Shreyas Ramoji,
Prashant Krishnan,
Sriram Ganapathy
Abstract:
While deep learning models have made significant advances in supervised classification problems, the application of these models for out-of-set verification tasks like speaker recognition has been limited to deriving feature embeddings. The state-of-the-art x-vector PLDA based speaker verification systems use a generative model based on probabilistic linear discriminant analysis (PLDA) for computing the verification score. Recently, we had proposed a neural network approach for backend modeling in speaker verification called the neural PLDA (NPLDA) where the likelihood ratio score of the generative PLDA model is posed as a discriminative similarity function and the learnable parameters of the score function are optimized using a verification cost. In this paper, we extend this work to achieve joint optimization of the embedding neural network (x-vector network) with the NPLDA network in an end-to-end (E2E) fashion. This proposed end-to-end model is optimized directly from the acoustic features with a verification cost function and during testing, the model directly outputs the likelihood ratio score. With various experiments using the NIST speaker recognition evaluation (SRE) 2018 and 2019 datasets, we show that the proposed E2E model improves significantly over the x-vector PLDA baseline speaker verification system.
Submitted 11 August, 2020;
originally announced August 2020.
-
NISP: A Multi-lingual Multi-accent Dataset for Speaker Profiling
Authors:
Shareef Babu Kalluri,
Deepu Vijayasenan,
Sriram Ganapathy,
Ragesh Rajan M,
Prashant Krishnan
Abstract:
Many commercial and forensic applications of speech demand the extraction of information about the speaker characteristics, which falls into the broad category of speaker profiling. The speaker characteristics needed for profiling include physical traits such as height, age, and gender, along with the speaker's native language. Many of the datasets available have only partial information for speaker profiling. In this paper, we attempt to overcome this limitation by developing a new dataset which has speech data from five different Indian languages along with English. Metadata for speaker profiling applications, such as linguistic information, regional information, and physical characteristics of a speaker, is also collected. We call this dataset the NITK-IISc Multilingual Multi-accent Speaker Profiling (NISP) dataset. The description of the dataset, potential applications, and baseline results for speaker profiling on this dataset are provided in this paper.
Submitted 12 July, 2020;
originally announced July 2020.
-
Coswara -- A Database of Breathing, Cough, and Voice Sounds for COVID-19 Diagnosis
Authors:
Neeraj Sharma,
Prashant Krishnan,
Rohit Kumar,
Shreyas Ramoji,
Srikanth Raj Chetupalli,
Nirmala R.,
Prasanta Kumar Ghosh,
Sriram Ganapathy
Abstract:
The COVID-19 pandemic presents global challenges transcending boundaries of country, race, religion, and economy. The current gold standard method for COVID-19 detection is the reverse transcription polymerase chain reaction (RT-PCR) testing. However, this method is expensive, time-consuming, and violates social distancing. Also, as the pandemic is expected to stay for a while, there is a need for an alternate diagnosis tool which overcomes these limitations, and is deployable at a large scale. The prominent symptoms of COVID-19 include cough and breathing difficulties. We foresee that respiratory sounds, when analyzed using machine learning techniques, can provide useful insights, enabling the design of a diagnostic tool. Towards this, the paper presents an early effort in creating (and analyzing) a database, called Coswara, of respiratory sounds, namely, cough, breath, and voice. The sound samples are collected via worldwide crowdsourcing using a website application. The curated dataset is released as open access. As the pandemic is evolving, the data collection and analysis is a work in progress. We believe that insights from analysis of Coswara can be effective in enabling sound based technology solutions for point-of-care diagnosis of respiratory infection, and in the near future this can help to diagnose COVID-19.
Submitted 11 August, 2020; v1 submitted 21 May, 2020;
originally announced May 2020.
-
NPLDA: A Deep Neural PLDA Model for Speaker Verification
Authors:
Shreyas Ramoji,
Prashant Krishnan,
Sriram Ganapathy
Abstract:
The state-of-the-art approach for speaker verification consists of a neural network based embedding extractor along with a backend generative model such as the Probabilistic Linear Discriminant Analysis (PLDA). In this work, we propose a neural network approach for backend modeling in speaker recognition. The likelihood ratio score of the generative PLDA model is posed as a discriminative similarity function and the learnable parameters of the score function are optimized using a verification cost. The proposed model, termed neural PLDA (NPLDA), is initialized using the generative PLDA model parameters. The loss function for the NPLDA model is an approximation of the minimum detection cost function (DCF). The speaker recognition experiments using the NPLDA model are performed on the speaker verification task in the VOiCES datasets as well as the SITW challenge dataset. In these experiments, the NPLDA model optimized using the proposed loss function improves significantly over the state-of-the-art PLDA based speaker verification system.
Submitted 24 May, 2020; v1 submitted 10 February, 2020;
originally announced February 2020.
-
LEAP System for SRE19 CTS Challenge -- Improvements and Error Analysis
Authors:
Shreyas Ramoji,
Prashant Krishnan,
Bhargavram Mysore,
Prachi Singh,
Sriram Ganapathy
Abstract:
The NIST Speaker Recognition Evaluation - Conversational Telephone Speech (CTS) challenge 2019 was an open evaluation for the task of speaker verification in challenging conditions. In this paper, we provide a detailed account of the LEAP SRE system submitted to the CTS challenge focusing on the novel components in the back-end system modeling. All the systems used the time-delay neural network (TDNN) based x-vector embeddings. The x-vector system in our SRE19 submission used a large pool of training speakers (about 14k speakers). Following the x-vector extraction, we explored a neural network approach to backend score computation that was optimized for a speaker verification cost. The system combination of generative and neural PLDA models resulted in significant improvements for the SRE evaluation dataset. We also found additional gains for the SRE systems based on score normalization and calibration. Subsequent to the evaluations, we have performed a detailed analysis of the submitted systems. The analysis revealed the incremental gains obtained for different training dataset combinations as well as the modeling methods.
Submitted 24 May, 2020; v1 submitted 7 February, 2020;
originally announced February 2020.
-
Pairwise Discriminative Neural PLDA for Speaker Verification
Authors:
Shreyas Ramoji,
Prashant Krishnan V,
Prachi Singh,
Sriram Ganapathy
Abstract:
The state-of-art approach to speaker verification involves the extraction of discriminative embeddings like x-vectors followed by a generative model back-end using a probabilistic linear discriminant analysis (PLDA). In this paper, we propose a Pairwise neural discriminative model for the task of speaker verification which operates on a pair of speaker embeddings such as x-vectors/i-vectors and outputs a score that can be considered as a scaled log-likelihood ratio. We construct a differentiable cost function which approximates speaker verification loss, namely the minimum detection cost. The pre-processing steps of linear discriminant analysis (LDA), unit length normalization and within class covariance normalization are all modeled as layers of a neural model and the speaker verification cost functions can be back-propagated through these layers during training. We also explore regularization techniques to prevent overfitting, which is a major concern in using discriminative back-end models for verification tasks. The experiments are performed on the NIST SRE 2018 development and evaluation datasets. We observe average relative improvements of 8% in CMN2 condition and 30% in VAST condition over the PLDA baseline system.
Submitted 7 February, 2020; v1 submitted 20 January, 2020;
originally announced January 2020.
-
SURE-fuse WFF: A Multi-resolution Windowed Fourier Analysis for Interferometric Phase Denoising
Authors:
Joshin P. Krishnan,
Mário A. T. Figueiredo,
José M. Bioucas-Dias
Abstract:
Interferometric phase (InPhase) imaging is an important part of many present-day coherent imaging technologies. Often in such imaging techniques, the acquired images, known as interferograms, suffer from two major degradations: 1) phase wrapping caused by the fact that the sensing mechanism can only measure sinusoidal $2π$-periodic functions of the actual phase, and 2) noise introduced by the acquisition process or the system. This work focusses on InPhase denoising which is a fundamental restoration step to many posterior applications of InPhase, namely to phase unwrapping. The presence of sharp fringes that arises from phase wrapping makes InPhase denoising a hard-inverse problem. Motivated by the fact that the InPhase images are often locally sparse in Fourier domain, we propose a multi-resolution windowed Fourier filtering (WFF) analysis that fuses WFF estimates with different resolutions, thus overcoming the WFF fixed resolution limitation. The proposed fusion relies on an unbiased estimate of the mean square error derived using the Stein's lemma adapted to complex-valued signals. This estimate, known as SURE, is minimized using an optimization framework to obtain the fusion weights. Strong experimental evidence, using synthetic and real (InSAR & MRI) data, that the developed algorithm, termed as SURE-fuse WFF, outperforms the best hand-tuned fixed resolution WFF as well as other state-of-the-art InPhase denoising algorithms, is provided.
Submitted 26 February, 2019; v1 submitted 9 November, 2018;
originally announced November 2018.
-
Patch-based Interferometric Phase Estimation via Mixture of Gaussian Density Modelling & Non-local Averaging in the Complex Domain
Authors:
Joshin P. Krishnan,
José M. Bioucas-Dias
Abstract:
This paper addresses interferometric phase (InPhase) image denoising, i.e., the denoising of phase modulo-$2π$ images from sinusoidal $2π$-periodic and noisy observations. The wrapping discontinuities present in the InPhase images, which are to be preserved carefully, make InPhase denoising a challenging inverse problem. We propose a novel two-step algorithm to tackle this problem by exploiting the non-local self-similarity of the InPhase images. In the first step, the patches of the phase images are modelled using Mixture of Gaussian (MoG) densities in the complex domain. An Expectation Maximization (EM) algorithm is formulated to learn the parameters of the MoG from the noisy data. The learned MoG is used as a prior for estimating the InPhase images from the noisy images using Minimum Mean Square Error (MMSE) estimation. In the second step, an additional exploitation of non-local self-similarity is done by performing a type of non-local mean filtering. Experiments conducted on simulated and real (MRI and InSAR) datasets show results which are competitive with the state-of-the-art techniques.
Submitted 24 October, 2018;
originally announced October 2018.
-
Dictionary Learning Phase Retrieval from Noisy Diffraction Patterns
Authors:
Joshin P. Krishnan,
José M. Bioucas-Dias,
Vladimir Katkovnik
Abstract:
This paper proposes a novel algorithm for image phase retrieval, i.e., for recovering complex-valued images from the amplitudes of noisy linear combinations (often the Fourier transform) of the sought complex images. The algorithm is developed using the alternating projection framework and is aimed to obtain high performance for heavily noisy (Poissonian or Gaussian) observations. The estimation of the target images is reformulated as a sparse regression, often termed sparse coding, in the complex domain. This is accomplished by learning a complex domain dictionary from the data it represents via matrix factorization with sparsity constraints on the code (i.e., the regression coefficients). Our algorithm, termed dictionary learning phase retrieval (DLPR), jointly learns the referred to dictionary and reconstructs the unknown target image. The effectiveness of DLPR is illustrated through experiments conducted on complex images, simulated and real, where it shows noticeable advantages over the state-of-the-art competitors.
Submitted 18 October, 2018;
originally announced October 2018.
-
A Framework for Analysing Driver Interactions with Semi-Autonomous Vehicles
Authors:
Siraj Shaikh,
Padmanabhan Krishnan
Abstract:
Semi-autonomous vehicles are increasingly serving critical functions in various settings from mining to logistics to defence. A key characteristic of such systems is the presence of the human (driver) in the control loop. To ensure safety, the driver needs to be aware of the autonomous aspects of the vehicle, and the automated features of the vehicle need to be built to enable safer control. In this paper we propose a framework to combine empirical models describing human behaviour with the environment and system models. We then analyse, via model checking, interaction between the models for desired safety properties. The aim is to analyse the design for safe vehicle-driver interaction. We demonstrate the applicability of our approach using a case study involving semi-autonomous vehicles where driver fatigue is a factor critical to a safe journey.
Submitted 31 December, 2012;
originally announced January 2013.