-
Equivalent-Circuit Thermal Model for Batteries with One-Shot Parameter Identification
Authors:
Myisha A. Chowdhury,
Qiugang Lu
Abstract:
Accurate state of temperature (SOT) estimation for batteries is crucial for regulating their temperature within a desired range to ensure safe operation and optimal performance. The existing measurement-based methods often generate noisy signals and cannot scale up for large-scale battery packs. The electrochemical model-based methods, on the contrary, offer high accuracy but are computationally e…
▽ More
Accurate state of temperature (SOT) estimation for batteries is crucial for regulating their temperature within a desired range to ensure safe operation and optimal performance. The existing measurement-based methods often generate noisy signals and cannot scale up for large-scale battery packs. The electrochemical model-based methods, on the contrary, offer high accuracy but are computationally expensive. To tackle these issues, inspired by the equivalentcircuit voltage model for batteries, this paper presents a novel equivalent-circuit electro-thermal model (ECTM) for modeling battery surface temperature. By approximating the complex heat generation inside batteries with data-driven nonlinear (polynomial) functions of key measurable parameters such as state-of-charge (SOC), current, and terminal voltage, our ECTM is simplified into a linear form that admits rapid solutions. Such simplified ECTM can be readily identified with one single (one-shot) cycle data. The proposed model is extensively validated with benchmark NASA, MIT, and Oxford battery datasets. Simulation results verify the accuracy of the model, despite being identified with one-shot cycle data, in predicting battery temperatures robustly under different battery degradation status and ambient conditions.
△ Less
Submitted 16 March, 2025;
originally announced March 2025.
-
Lithium-ion Battery Capacity Prediction via Conditional Recurrent Generative Adversarial Network-based Time-Series Regeneration
Authors:
Myisha A. Chowdhury,
Gift Modekwe,
Qiugang Lu
Abstract:
Accurate capacity prediction is essential for the safe and reliable operation of batteries by anticipating potential failures beforehand. The performance of state-of-the-art capacity prediction methods is significantly hindered by the limited availability of training data, primarily attributed to the expensive experimentation and data sharing restrictions. To tackle this issue, this paper presents…
▽ More
Accurate capacity prediction is essential for the safe and reliable operation of batteries by anticipating potential failures beforehand. The performance of state-of-the-art capacity prediction methods is significantly hindered by the limited availability of training data, primarily attributed to the expensive experimentation and data sharing restrictions. To tackle this issue, this paper presents a recurrent conditional generative adversarial network (RCGAN) scheme to enrich the limited battery data by adding high-fidelity synthetic ones to improve the capacity prediction. The proposed RCGAN scheme consists of a generator network to generate synthetic samples that closely resemble the true data and a discriminator network to differentiate real and synthetic samples. Long shortterm memory (LSTM)-based generator and discriminator are leveraged to learn the temporal and spatial distributions in the multivariate time-series battery data. Moreover, the generator is conditioned on the capacity value to account for changes in battery dynamics due to the degradation over usage cycles. The effectiveness of the RCGAN is evaluated across six batteries from two benchmark datasets (NASA and MIT). The raw data is then augmented with synthetic samples from the RCGAN to train LSTM and gate recurrent unit (GRU) models for capacity prediction. Simulation results show that the models trained with augmented datasets significantly outperform those trained with the original datasets in capacity prediction.
△ Less
Submitted 15 March, 2025;
originally announced March 2025.
-
Contextual Checkerboard Denoise -- A Novel Neural Network-Based Approach for Classification-Aware OCT Image Denoising
Authors:
Md. Touhidul Islam,
Md. Abtahi M. Chowdhury,
Sumaiya Salekin,
Aye T. Maung,
Akil A. Taki,
Hafiz Imtiaz
Abstract:
In contrast to non-medical image denoising, where enhancing image clarity is the primary goal, medical image denoising warrants preservation of crucial features without introduction of new artifacts. However, many denoising methods that improve the clarity of the image, inadvertently alter critical information of the denoised images, potentially compromising classification performance and diagnost…
▽ More
In contrast to non-medical image denoising, where enhancing image clarity is the primary goal, medical image denoising warrants preservation of crucial features without introduction of new artifacts. However, many denoising methods that improve the clarity of the image, inadvertently alter critical information of the denoised images, potentially compromising classification performance and diagnostic quality. Additionally, supervised denoising methods are not very practical in medical image domain, since a \emph{ground truth} denoised version of a noisy medical image is often extremely challenging to obtain. In this paper, we tackle both of these problems by introducing a novel neural network based method -- \emph{Contextual Checkerboard Denoising}, that can learn denoising from only a dataset of noisy images, while preserving crucial anatomical details necessary for image classification/analysis. We perform our experimentation on real Optical Coherence Tomography (OCT) images, and empirically demonstrate that our proposed method significantly improves image quality, providing clearer and more detailed OCT images, while enhancing diagnostic accuracy.
△ Less
Submitted 29 November, 2024;
originally announced November 2024.
-
AttDiCNN: Attentive Dilated Convolutional Neural Network for Automatic Sleep Staging using Visibility Graph and Force-directed Layout
Authors:
Md Jobayer,
Md. Mehedi Hasan Shawon,
Tasfin Mahmud,
Md. Borhan Uddin Antor,
Arshad M. Chowdhury
Abstract:
Sleep stages play an essential role in the identification of sleep patterns and the diagnosis of sleep disorders. In this study, we present an automated sleep stage classifier termed the Attentive Dilated Convolutional Neural Network (AttDiCNN), which uses deep learning methodologies to address challenges related to data heterogeneity, computational complexity, and reliable automatic sleep staging…
▽ More
Sleep stages play an essential role in the identification of sleep patterns and the diagnosis of sleep disorders. In this study, we present an automated sleep stage classifier termed the Attentive Dilated Convolutional Neural Network (AttDiCNN), which uses deep learning methodologies to address challenges related to data heterogeneity, computational complexity, and reliable automatic sleep staging. We employed a force-directed layout based on the visibility graph to capture the most significant information from the EEG signals, representing the spatial-temporal features. The proposed network consists of three compositors: the Localized Spatial Feature Extraction Network (LSFE), the Spatio-Temporal-Temporal Long Retention Network (S2TLR), and the Global Averaging Attention Network (G2A). The LSFE is tasked with capturing spatial information from sleep data, the S2TLR is designed to extract the most pertinent information in long-term contexts, and the G2A reduces computational overhead by aggregating information from the LSFE and S2TLR. We evaluated the performance of our model on three comprehensive and publicly accessible datasets, achieving state-of-the-art accuracy of 98.56%, 99.66%, and 99.08% for the EDFX, HMC, and NCH datasets, respectively, yet maintaining a low computational complexity with 1.4 M parameters. The results substantiate that our proposed architecture surpasses existing methodologies in several performance metrics, thus proving its potential as an automated tool in clinical settings.
△ Less
Submitted 21 August, 2024;
originally announced September 2024.
-
Automatic Detection of COVID-19 from Chest X-ray Images Using Deep Learning Model
Authors:
Alloy Das,
Rohit Agarwal,
Rituparna Singh,
Arindam Chowdhury,
Debashis Nandi
Abstract:
The infectious disease caused by novel corona virus (2019-nCoV) has been widely spreading since last year and has shaken the entire world. It has caused an unprecedented effect on daily life, global economy and public health. Hence this disease detection has life-saving importance for both patients as well as doctors. Due to limited test kits, it is also a daunting task to test every patient with…
▽ More
The infectious disease caused by novel corona virus (2019-nCoV) has been widely spreading since last year and has shaken the entire world. It has caused an unprecedented effect on daily life, global economy and public health. Hence this disease detection has life-saving importance for both patients as well as doctors. Due to limited test kits, it is also a daunting task to test every patient with severe respiratory problems using conventional techniques (RT-PCR). Thus implementing an automatic diagnosis system is urgently required to overcome the scarcity problem of Covid-19 test kits at hospital, health care systems. The diagnostic approach is mainly classified into two categories-laboratory based and Chest radiography approach. In this paper, a novel approach for computerized corona virus (2019-nCoV) detection from lung x-ray images is presented. Here, we propose models using deep learning to show the effectiveness of diagnostic systems. In the experimental result, we evaluate proposed models on publicly available data-set which exhibit satisfactory performance and promising results compared with other previous existing methods.
△ Less
Submitted 27 August, 2024;
originally announced August 2024.
-
Beyond Orthography: Automatic Recovery of Short Vowels and Dialectal Sounds in Arabic
Authors:
Yassine El Kheir,
Hamdy Mubarak,
Ahmed Ali,
Shammur Absar Chowdhury
Abstract:
This paper presents a novel Dialectal Sound and Vowelization Recovery framework, designed to recognize borrowed and dialectal sounds within phonologically diverse and dialect-rich languages, that extends beyond its standard orthographic sound sets. The proposed framework utilized a quantized sequence of input with(out) continuous pretrained self-supervised representation. We show the efficacy of t…
▽ More
This paper presents a novel Dialectal Sound and Vowelization Recovery framework, designed to recognize borrowed and dialectal sounds within phonologically diverse and dialect-rich languages, that extends beyond its standard orthographic sound sets. The proposed framework utilized a quantized sequence of input with(out) continuous pretrained self-supervised representation. We show the efficacy of the pipeline using limited data for Arabic, a dialect-rich language containing more than 22 major dialects. Phonetically correct transcribed speech resources for dialectal Arabic are scarce. Therefore, we introduce ArabVoice15, a first-of-its-kind, curated test set featuring 5 hours of dialectal speech across 15 Arab countries, with phonetically accurate transcriptions, including borrowed and dialect-specific sounds. We described in detail the annotation guideline along with the analysis of the dialectal confusion pairs. Our extensive evaluation includes both subjective -- human perception tests and objective measures. Our empirical results, reported with three test sets, show that with only one and half hours of training data, our model improve character error rate by ~ 7\% in ArabVoice15 compared to the baseline.
△ Less
Submitted 5 August, 2024;
originally announced August 2024.
-
Investigating Confidence Estimation Measures for Speaker Diarization
Authors:
Anurag Chowdhury,
Abhinav Misra,
Mark C. Fuhs,
Monika Woszczyna
Abstract:
Speaker diarization systems segment a conversation recording based on the speakers' identity. Such systems can misclassify the speaker of a portion of audio due to a variety of factors, such as speech pattern variation, background noise, and overlapping speech. These errors propagate to, and can adversely affect, downstream systems that rely on the speaker's identity, such as speaker-adapted speec…
▽ More
Speaker diarization systems segment a conversation recording based on the speakers' identity. Such systems can misclassify the speaker of a portion of audio due to a variety of factors, such as speech pattern variation, background noise, and overlapping speech. These errors propagate to, and can adversely affect, downstream systems that rely on the speaker's identity, such as speaker-adapted speech recognition. One of the ways to mitigate these errors is to provide segment-level diarization confidence scores to downstream systems. In this work, we investigate multiple methods for generating diarization confidence scores, including those derived from the original diarization system and those derived from an external model. Our experiments across multiple datasets and diarization systems demonstrate that the most competitive confidence score methods can isolate ~30% of the diarization errors within segments with the lowest ~10% of confidence scores.
△ Less
Submitted 24 June, 2024;
originally announced June 2024.
-
Speech Representation Analysis based on Inter- and Intra-Model Similarities
Authors:
Yassine El Kheir,
Ahmed Ali,
Shammur Absar Chowdhury
Abstract:
Self-supervised models have revolutionized speech processing, achieving new levels of performance in a wide variety of tasks with limited resources. However, the inner workings of these models are still opaque. In this paper, we aim to analyze the encoded contextual representation of these foundation models based on their inter- and intra-model similarity, independent of any external annotation an…
▽ More
Self-supervised models have revolutionized speech processing, achieving new levels of performance in a wide variety of tasks with limited resources. However, the inner workings of these models are still opaque. In this paper, we aim to analyze the encoded contextual representation of these foundation models based on their inter- and intra-model similarity, independent of any external annotation and task-specific constraint. We examine different SSL models varying their training paradigm -- Contrastive (Wav2Vec2.0) and Predictive models (HuBERT); and model sizes (base and large). We explore these models on different levels of localization/distributivity of information including (i) individual neurons; (ii) layer representation; (iii) attention weights and (iv) compare the representations with their finetuned counterparts.Our results highlight that these models converge to similar representation subspaces but not to similar neuron-localized concepts\footnote{A concept represents a coherent fragment of knowledge, such as ``a class containing certain objects as elements, where the objects have certain properties. We made the code publicly available for facilitating further research, we publicly released our code.
△ Less
Submitted 23 June, 2024;
originally announced June 2024.
-
Children's Speech Recognition through Discrete Token Enhancement
Authors:
Vrunda N. Sukhadia,
Shammur Absar Chowdhury
Abstract:
Children's speech recognition is considered a low-resource task mainly due to the lack of publicly available data. There are several reasons for such data scarcity, including expensive data collection and annotation processes, and data privacy, among others. Transforming speech signals into discrete tokens that do not carry sensitive information but capture both linguistic and acoustic information…
▽ More
Children's speech recognition is considered a low-resource task mainly due to the lack of publicly available data. There are several reasons for such data scarcity, including expensive data collection and annotation processes, and data privacy, among others. Transforming speech signals into discrete tokens that do not carry sensitive information but capture both linguistic and acoustic information could be a solution for privacy concerns. In this study, we investigate the integration of discrete speech tokens into children's speech recognition systems as input without significantly degrading the ASR performance. Additionally, we explored single-view and multi-view strategies for creating these discrete labels. Furthermore, we tested the models for generalization capabilities with unseen domain and nativity dataset. Results reveal that the discrete token ASR for children achieves nearly equivalent performance with an approximate 83% reduction in parameters.
△ Less
Submitted 24 June, 2024; v1 submitted 19 June, 2024;
originally announced June 2024.
-
Adaptive Safe Reinforcement Learning-Enabled Optimization of Battery Fast-Charging Protocols
Authors:
Myisha A. Chowdhury,
Saif S. S. Al-Wahaibi,
Qiugang Lu
Abstract:
Optimizing charging protocols is critical for reducing battery charging time and decelerating battery degradation in applications such as electric vehicles. Recently, reinforcement learning (RL) methods have been adopted for such purposes. However, RL-based methods may not ensure system (safety) constraints, which can cause irreversible damages to batteries and reduce their lifetime. To this end,…
▽ More
Optimizing charging protocols is critical for reducing battery charging time and decelerating battery degradation in applications such as electric vehicles. Recently, reinforcement learning (RL) methods have been adopted for such purposes. However, RL-based methods may not ensure system (safety) constraints, which can cause irreversible damages to batteries and reduce their lifetime. To this end, this work proposes an adaptive and safe RL framework to optimize fast charging strategies while respecting safety constraints with a high probability. In our method, any unsafe action that the RL agent decides will be projected into a safety region by solving a constrained optimization problem. The safety region is constructed using adaptive Gaussian process (GP) models, consisting of static and dynamic GPs, that learn from online experience to adaptively account for any changes in battery dynamics. Simulation results show that our method can charge the batteries rapidly with constraint satisfaction under varying operating conditions.
△ Less
Submitted 18 June, 2024;
originally announced June 2024.
-
Transcription-Free Fine-Tuning of Speech Separation Models for Noisy and Reverberant Multi-Speaker Automatic Speech Recognition
Authors:
William Ravenscroft,
George Close,
Stefan Goetze,
Thomas Hain,
Mohammad Soleymanpour,
Anurag Chowdhury,
Mark C. Fuhs
Abstract:
One solution to automatic speech recognition (ASR) of overlapping speakers is to separate speech and then perform ASR on the separated signals. Commonly, the separator produces artefacts which often degrade ASR performance. Addressing this issue typically requires reference transcriptions to jointly train the separation and ASR networks. This is often not viable for training on real-world in-domai…
▽ More
One solution to automatic speech recognition (ASR) of overlapping speakers is to separate speech and then perform ASR on the separated signals. Commonly, the separator produces artefacts which often degrade ASR performance. Addressing this issue typically requires reference transcriptions to jointly train the separation and ASR networks. This is often not viable for training on real-world in-domain audio where reference transcript information is not always available. This paper proposes a transcription-free method for joint training using only audio signals. The proposed method uses embedding differences of pre-trained ASR encoders as a loss with a proposed modification to permutation invariant training (PIT) called guided PIT (GPIT). The method achieves a 6.4% improvement in word error rate (WER) measures over a signal-level loss and also shows enhancement improvements in perceptual measures such as short-time objective intelligibility (STOI).
△ Less
Submitted 13 June, 2024;
originally announced June 2024.
-
Analyzing Musical Characteristics of National Anthems in Relation to Global Indices
Authors:
S M Rakib Hasan,
Aakar Dhakal,
Ms. Ayesha Siddiqua,
Mohammad Mominur Rahman,
Md Maidul Islam,
Mohammed Arfat Raihan Chowdhury,
S M Masfequier Rahman Swapno,
SM Nuruzzaman Nobel
Abstract:
Music plays a huge part in shaping peoples' psychology and behavioral patterns. This paper investigates the connection between national anthems and different global indices with computational music analysis and statistical correlation analysis. We analyze national anthem musical data to determine whether certain musical characteristics are associated with peace, happiness, suicide rate, crime rate…
▽ More
Music plays a huge part in shaping peoples' psychology and behavioral patterns. This paper investigates the connection between national anthems and different global indices with computational music analysis and statistical correlation analysis. We analyze national anthem musical data to determine whether certain musical characteristics are associated with peace, happiness, suicide rate, crime rate, etc. To achieve this, we collect national anthems from 169 countries and use computational music analysis techniques to extract pitch, tempo, beat, and other pertinent audio features. We then compare these musical characteristics with data on different global indices to ascertain whether a significant correlation exists. Our findings indicate that there may be a correlation between the musical characteristics of national anthems and the indices we investigated. The implications of our findings for music psychology and policymakers interested in promoting social well-being are discussed. This paper emphasizes the potential of musical data analysis in social research and offers a novel perspective on the relationship between music and social indices. The source code and data are made open-access for reproducibility and future research endeavors. It can be accessed at http://bit.ly/na_code.
△ Less
Submitted 4 April, 2024;
originally announced April 2024.
-
Learning Non-myopic Power Allocation in Constrained Scenarios
Authors:
Arindam Chowdhury,
Santiago Paternain,
Gunjan Verma,
Ananthram Swami,
Santiago Segarra
Abstract:
We propose a learning-based framework for efficient power allocation in ad hoc interference networks under episodic constraints. The problem of optimal power allocation -- for maximizing a given network utility metric -- under instantaneous constraints has recently gained significant popularity. Several learnable algorithms have been proposed to obtain fast, effective, and near-optimal performance…
▽ More
We propose a learning-based framework for efficient power allocation in ad hoc interference networks under episodic constraints. The problem of optimal power allocation -- for maximizing a given network utility metric -- under instantaneous constraints has recently gained significant popularity. Several learnable algorithms have been proposed to obtain fast, effective, and near-optimal performance. However, a more realistic scenario arises when the utility metric has to be optimized for an entire episode under time-coupled constraints. In this case, the instantaneous power needs to be regulated so that the given utility can be optimized over an entire sequence of wireless network realizations while satisfying the constraint at all times. Solving each instance independently will be myopic as the long-term constraint cannot modulate such a solution. Instead, we frame this as a constrained and sequential decision-making problem, and employ an actor-critic algorithm to obtain the constraint-aware power allocation at each step. We present experimental analyses to illustrate the effectiveness of our method in terms of superior episodic network-utility performance and its efficiency in terms of time and computational complexity.
△ Less
Submitted 17 January, 2024;
originally announced January 2024.
-
Automatic Pronunciation Assessment -- A Review
Authors:
Yassine El Kheir,
Ahmed Ali,
Shammur Absar Chowdhury
Abstract:
Pronunciation assessment and its application in computer-aided pronunciation training (CAPT) have seen impressive progress in recent years. With the rapid growth in language processing and deep learning over the past few years, there is a need for an updated review. In this paper, we review methods employed in pronunciation assessment for both phonemic and prosodic. We categorize the main challeng…
▽ More
Pronunciation assessment and its application in computer-aided pronunciation training (CAPT) have seen impressive progress in recent years. With the rapid growth in language processing and deep learning over the past few years, there is a need for an updated review. In this paper, we review methods employed in pronunciation assessment for both phonemic and prosodic. We categorize the main challenges observed in prominent research trends, and highlight existing limitations, and available resources. This is followed by a discussion of the remaining challenges and possible directions for future work.
△ Less
Submitted 21 October, 2023;
originally announced October 2023.
-
The complementary roles of non-verbal cues for Robust Pronunciation Assessment
Authors:
Yassine El Kheir,
Shammur Absar Chowdhury,
Ahmed Ali
Abstract:
Research on pronunciation assessment systems focuses on utilizing phonetic and phonological aspects of non-native (L2) speech, often neglecting the rich layer of information hidden within the non-verbal cues. In this study, we proposed a novel pronunciation assessment framework, IntraVerbalPA. % The framework innovatively incorporates both fine-grained frame- and abstract utterance-level non-verba…
▽ More
Research on pronunciation assessment systems focuses on utilizing phonetic and phonological aspects of non-native (L2) speech, often neglecting the rich layer of information hidden within the non-verbal cues. In this study, we proposed a novel pronunciation assessment framework, IntraVerbalPA. % The framework innovatively incorporates both fine-grained frame- and abstract utterance-level non-verbal cues, alongside the conventional speech and phoneme representations. Additionally, we introduce ''Goodness of phonemic-duration'' metric to effectively model duration distribution within the framework. Our results validate the effectiveness of the proposed IntraVerbalPA framework and its individual components, yielding performance that either matches or outperforms existing research works.
△ Less
Submitted 14 September, 2023;
originally announced September 2023.
-
L1-aware Multilingual Mispronunciation Detection Framework
Authors:
Yassine El Kheir,
Shammur Absar Chowdhury,
Ahmed Ali
Abstract:
The phonological discrepancies between a speaker's native (L1) and the non-native language (L2) serves as a major factor for mispronunciation. This paper introduces a novel multilingual MDD architecture, L1-MultiMDD, enriched with L1-aware speech representation. An end-to-end speech encoder is trained on the input signal and its corresponding reference phoneme sequence. First, an attention mechani…
▽ More
The phonological discrepancies between a speaker's native (L1) and the non-native language (L2) serves as a major factor for mispronunciation. This paper introduces a novel multilingual MDD architecture, L1-MultiMDD, enriched with L1-aware speech representation. An end-to-end speech encoder is trained on the input signal and its corresponding reference phoneme sequence. First, an attention mechanism is deployed to align the input audio with the reference phoneme sequence. Afterwards, the L1-L2-speech embedding are extracted from an auxiliary model, pretrained in a multi-task setup identifying L1 and L2 language, and are infused with the primary network. Finally, the L1-MultiMDD is then optimized for a unified multilingual phoneme recognition task using connectionist temporal classification (CTC) loss for the target languages: English, Arabic, and Mandarin. Our experiments demonstrate the effectiveness of the proposed L1-MultiMDD framework on both seen -- L2-ARTIC, LATIC, and AraVoiceL2v2; and unseen -- EpaDB and Speechocean762 datasets. The consistent gains in PER, and false rejection rate (FRR) across all target languages confirm our approach's robustness, efficacy, and generalizability.
△ Less
Submitted 21 September, 2023; v1 submitted 14 September, 2023;
originally announced September 2023.
-
Voice Morphing: Two Identities in One Voice
Authors:
Sushanta K. Pani,
Anurag Chowdhury,
Morgan Sandler,
Arun Ross
Abstract:
In a biometric system, each biometric sample or template is typically associated with a single identity. However, recent research has demonstrated the possibility of generating "morph" biometric samples that can successfully match more than a single identity. Morph attacks are now recognized as a potential security threat to biometric systems. However, most morph attacks have been studied on biome…
▽ More
In a biometric system, each biometric sample or template is typically associated with a single identity. However, recent research has demonstrated the possibility of generating "morph" biometric samples that can successfully match more than a single identity. Morph attacks are now recognized as a potential security threat to biometric systems. However, most morph attacks have been studied on biometric modalities operating in the image domain, such as face, fingerprint, and iris. In this preliminary work, we introduce Voice Identity Morphing (VIM) - a voice-based morph attack that can synthesize speech samples that impersonate the voice characteristics of a pair of individuals. Our experiments evaluate the vulnerabilities of two popular speaker recognition systems, ECAPA-TDNN and x-vector, to VIM, with a success rate (MMPMR) of over 80% at a false match rate of 1% on the Librispeech dataset.
△ Less
Submitted 5 September, 2023;
originally announced September 2023.
-
Half-Duplex APs with Dynamic TDD vs. Full-Duplex APs in Cell-Free Systems
Authors:
Anubhab Chowdhury,
Chandra R. Murthy
Abstract:
In this paper, we present a comparative study of half-duplex (HD) access points (APs) with dynamic time-division duplex (DTDD) and full-duplex (FD) APs in cell-free (CF) systems. Although both DTDD and FD CF systems support concurrent downlink (DL) transmission and uplink (UL) reception capability, the sum spectral efficiency (SE) is limited by various cross-link interferences. We first present a…
▽ More
In this paper, we present a comparative study of half-duplex (HD) access points (APs) with dynamic time-division duplex (DTDD) and full-duplex (FD) APs in cell-free (CF) systems. Although both DTDD and FD CF systems support concurrent downlink (DL) transmission and uplink (UL) reception capability, the sum spectral efficiency (SE) is limited by various cross-link interferences. We first present a novel pilot allocation scheme that minimizes the pilot length required to ensure no pilot contamination among the user equipments (UEs) served by at least one common AP. Then, we derive the sum SE in closed form, considering zero-forcing combining and precoding along with the signal-to-interference plus noise ratio optimal weighting at the central processing unit. We also present a provably convergent algorithm for joint UL-DL power allocation and UL/DL mode scheduling of the APs (for DTDD) to maximize the sum SE. Further, the proposed algorithms are precoder and combiner agnostic and come with closed-form update equations for the UL and DL power control coefficients. Our numerical results illustrate the superiority of the proposed pilot allocation and power control algorithms over several benchmark schemes and show that the sum SE with DTDD can outperform an FD CF system with similar antenna density. Thus, DTDD combined with CF is a promising alternative to FD that attains the same performance using HD APs, while obviating the burden of intra-AP interference cancellation.
△ Less
Submitted 31 January, 2024; v1 submitted 4 September, 2023;
originally announced September 2023.
-
MyVoice: Arabic Speech Resource Collaboration Platform
Authors:
Yousseif Elshahawy,
Yassine El Kheir,
Shammur Absar Chowdhury,
Ahmed Ali
Abstract:
We introduce MyVoice, a crowdsourcing platform designed to collect Arabic speech to enhance dialectal speech technologies. This platform offers an opportunity to design large dialectal speech datasets; and makes them publicly available. MyVoice allows contributors to select city/country-level fine-grained dialect and record the displayed utterances. Users can switch roles between contributors and…
▽ More
We introduce MyVoice, a crowdsourcing platform designed to collect Arabic speech to enhance dialectal speech technologies. This platform offers an opportunity to design large dialectal speech datasets; and makes them publicly available. MyVoice allows contributors to select city/country-level fine-grained dialect and record the displayed utterances. Users can switch roles between contributors and annotators. The platform incorporates a quality assurance system that filters out low-quality and spurious recordings before sending them for validation. During the validation phase, contributors can assess the quality of recordings, annotate them, and provide feedback which is then reviewed by administrators. Furthermore, the platform offers flexibility to admin roles to add new data or tasks beyond dialectal speech and word collection, which are displayed to contributors. Thus, enabling collaborative efforts in gathering diverse and large Arabic speech data.
△ Less
Submitted 23 July, 2023;
originally announced August 2023.
-
Development Of Automated Cardiac Arrhythmia Detection Methods Using Single Channel ECG Signal
Authors:
Arpita Paul,
Avik Kumar Das,
Manas Rakshit,
Ankita Ray Chowdhury,
Susmita Saha,
Hrishin Roy,
Sajal Sarkar,
Dongiri Prasanth,
Eravelli Saicharan
Abstract:
Arrhythmia, an abnormal cardiac rhythm, is one of the most common types of cardiac disease. Automatic detection and classification of arrhythmia can be significant in reducing deaths due to cardiac diseases. This work proposes a multi-class arrhythmia detection algorithm using single channel electrocardiogram (ECG) signal. In this work, heart rate variability (HRV) along with morphological feature…
▽ More
Arrhythmia, an abnormal cardiac rhythm, is one of the most common types of cardiac disease. Automatic detection and classification of arrhythmia can be significant in reducing deaths due to cardiac diseases. This work proposes a multi-class arrhythmia detection algorithm using single channel electrocardiogram (ECG) signal. In this work, heart rate variability (HRV) along with morphological features and wavelet coefficient features are utilized for detection of 9 classes of arrhythmia. Statistical, entropy and energy-based features are extracted and applied to machine learning based random forest classifiers. Data used in both works is taken from 4 broad databases (CPSC and CPSC extra, PTB-XL, G12EC and Chapman-Shaoxing and Ningbo Database) made available by Physionet. With HRV and time domain morphological features, an average accuracy of 85.11%, sensitivity of 85.11%, precision of 85.07% and F1 score of 85.00% is obtained whereas with HRV and wavelet coefficient features, the performance obtained is 90.91% accuracy, 90.91% sensitivity, 90.96% precision and 90.87% F1 score. The detailed analysis of simulation results affirms that the presented scheme effectively detects broad categories of arrhythmia from single-channel ECG records. In the last part of the work, the proposed classification schemes are implemented on hardware using Raspberry Pi for real time ECG signal classification.
△ Less
Submitted 23 July, 2023;
originally announced August 2023.
-
Implicit spoken language diarization
Authors:
Jagabandhu Mishra,
Amartya Chowdhury,
S. R. Mahadeva Prasanna
Abstract:
Spoken language diarization (LD) and related tasks are mostly explored using the phonotactic approach. Phonotactic approaches mostly use explicit way of language modeling, hence requiring intermediate phoneme modeling and transcribed data. Alternatively, the ability of deep learning approaches to model temporal dynamics may help for the implicit modeling of language information through deep embedd…
▽ More
Spoken language diarization (LD) and related tasks are mostly explored using the phonotactic approach. Phonotactic approaches mostly use explicit way of language modeling, hence requiring intermediate phoneme modeling and transcribed data. Alternatively, the ability of deep learning approaches to model temporal dynamics may help for the implicit modeling of language information through deep embedding vectors. Hence this work initially explores the available speaker diarization frameworks that capture speaker information implicitly to perform LD tasks. The performance of the LD system on synthetic code-switch data using the end-to-end x-vector approach is 6.78% and 7.06%, and for practical data is 22.50% and 60.38%, in terms of diarization error rate and Jaccard error rate (JER), respectively. The performance degradation is due to the data imbalance and resolved to some extent by using pre-trained wave2vec embeddings that provide a relative improvement of 30.74% in terms of JER.
△ Less
Submitted 22 June, 2023;
originally announced June 2023.
-
Atmospheric Influence on the Path Loss at High Frequencies for Deployment of 5G Cellular Communication Networks
Authors:
Rashed Hasan Ratul,
S M Mehedi Zaman,
Hasib Arman Chowdhury,
Md. Zayed Hassan Sagor,
Mohammad Tawhid Kawser,
Mirza Muntasir Nishat
Abstract:
Over the past few decades, the development of cellular communication technology has spanned several generations in order to add sophisticated features in the updated versions. Moreover, different high-frequency bands are considered for advanced cellular generations. The presence of updated generations like 4G and 5G is driven by the rising demand for a greater data rate and a better experience for…
▽ More
Over the past few decades, the development of cellular communication technology has spanned several generations in order to add sophisticated features in the updated versions. Moreover, different high-frequency bands are considered for advanced cellular generations. The presence of updated generations like 4G and 5G is driven by the rising demand for a greater data rate and a better experience for end users. However, because 5G-NR operates at a high frequency and has significant propagation, atmospheric fluctuations like temperature, humidity, and rain rate might result in poorer signal reception, and higher path loss effects unlike the prior generation, which employed frequencies below 6 GHz. This paper makes an attempt to provide a comparative analysis about the influence of different relative atmospheric conditions on 5G cellular communication for various operating frequencies in any urban microcell (UMi) environment maintaining the real outdoor propagation conditions. In addition, the simulation dataset based on environmental factors has been validated by the prediction of path loss using multiple regression techniques. Consequently, this study also aims to address the performance analysis of regression techniques for stable estimations of path loss at high frequencies for different atmospheric conditions for 5G mobile generations due to various possible radio link quality issues and fluctuations in different seasons in South Asia. Furthermore, in comparison to contemporary studies, the Machine Learning models have outperformed in predicting the path loss for the four seasons in South Asian regions.
△ Less
Submitted 27 July, 2023; v1 submitted 2 June, 2023;
originally announced June 2023.
-
Multi-View Multi-Task Representation Learning for Mispronunciation Detection
Authors:
Yassine El Kheir,
Shammur Absar Chowdhury,
Ahmed Ali
Abstract:
The disparity in phonology between learner's native (L1) and target (L2) language poses a significant challenge for mispronunciation detection and diagnosis (MDD) systems. This challenge is further intensified by lack of annotated L2 data. This paper proposes a novel MDD architecture that exploits multiple `views' of the same input data assisted by auxiliary tasks to learn more distinctive phoneti…
▽ More
The disparity in phonology between learner's native (L1) and target (L2) language poses a significant challenge for mispronunciation detection and diagnosis (MDD) systems. This challenge is further intensified by lack of annotated L2 data. This paper proposes a novel MDD architecture that exploits multiple `views' of the same input data assisted by auxiliary tasks to learn more distinctive phonetic representation in a low-resource setting. Using the mono- and multilingual encoders, the model learn multiple views of the input, and capture the sound properties across diverse languages and accents. These encoded representations are further enriched by learning articulatory features in a multi-task setup. Our reported results using the L2-ARCTIC data outperformed the SOTA models, with a phoneme error rate reduction of 11.13% and 8.60% and absolute F1 score increase of 5.89%, and 2.49% compared to the single-view mono- and multilingual systems, with a limited L2 dataset.
△ Less
Submitted 7 August, 2023; v1 submitted 2 June, 2023;
originally announced June 2023.
-
Automated Grain Boundary (GB) Segmentation and Microstructural Analysis in 347H Stainless Steel Using Deep Learning and Multimodal Microscopy
Authors:
Shoieb Ahmed Chowdhury,
M. F. N. Taufique,
Jing Wang,
Marissa Masden,
Madison Wenzlick,
Ram Devanathan,
Alan L Schemer-Kohrn,
Keerti S Kappagantula
Abstract:
Austenitic 347H stainless steel offers superior mechanical properties and corrosion resistance required for extreme operating conditions such as high temperature. The change in microstructure due to composition and process variations is expected to impact material properties. Identifying microstructural features such as grain boundaries thus becomes an important task in the process-microstructure-…
▽ More
Austenitic 347H stainless steel offers superior mechanical properties and corrosion resistance required for extreme operating conditions such as high temperature. The change in microstructure due to composition and process variations is expected to impact material properties. Identifying microstructural features such as grain boundaries thus becomes an important task in the process-microstructure-properties loop. Applying convolutional neural network (CNN) based deep-learning models is a powerful technique to detect features from material micrographs in an automated manner. Manual labeling of the images for the segmentation task poses a major bottleneck for generating training data and labels in a reliable and reproducible way within a reasonable timeframe. In this study, we attempt to overcome such limitations by utilizing multi-modal microscopy to generate labels directly instead of manual labeling. We combine scanning electron microscopy (SEM) images of 347H stainless steel as training data and electron backscatter diffraction (EBSD) micrographs as pixel-wise labels for grain boundary detection as a semantic segmentation task. We demonstrate that despite producing instrumentation drift during data collection between two modes of microscopy, this method performs comparably to similar segmentation tasks that used manual labeling. Additionally, we find that naïve pixel-wise segmentation results in small gaps and missing boundaries in the predicted grain boundary map. By incorporating topological information during model training, the connectivity of the grain boundary network and segmentation performance is improved. Finally, our approach is validated by accurate computation on downstream tasks of predicting the underlying grain morphology distributions which are the ultimate quantities of interest for microstructural characterization.
△ Less
Submitted 12 May, 2023;
originally announced May 2023.
-
QVoice: Arabic Speech Pronunciation Learning Application
Authors:
Yassine El Kheir,
Fouad Khnaisser,
Shammur Absar Chowdhury,
Hamdy Mubarak,
Shazia Afzal,
Ahmed Ali
Abstract:
This paper introduces a novel Arabic pronunciation learning application QVoice, powered with end-to-end mispronunciation detection and feedback generator module. The application is designed to support non-native Arabic speakers in enhancing their pronunciation skills, while also helping native speakers mitigate any potential influence from regional dialects on their Modern Standard Arabic (MSA) pr…
▽ More
This paper introduces a novel Arabic pronunciation learning application QVoice, powered with end-to-end mispronunciation detection and feedback generator module. The application is designed to support non-native Arabic speakers in enhancing their pronunciation skills, while also helping native speakers mitigate any potential influence from regional dialects on their Modern Standard Arabic (MSA) pronunciation. QVoice employs various learning cues to aid learners in comprehending meaning, drawing connections with their existing knowledge of English language, and offers detailed feedback for pronunciation correction, along with contextual examples showcasing word usage. The learning cues featured in QVoice encompass a wide range of meaningful information, such as visualizations of phrases/words and their translations, as well as phonetic transcriptions and transliterations. QVoice provides pronunciation feedback at the character level and assesses performance at the word level.
△ Less
Submitted 9 May, 2023;
originally announced May 2023.
-
Multilingual Word Error Rate Estimation: e-WER3
Authors:
Shammur Absar Chowdhury,
Ahmed Ali
Abstract:
The success of the multilingual automatic speech recognition systems empowered many voice-driven applications. However, measuring the performance of such systems remains a major challenge, due to its dependency on manually transcribed speech data in both mono- and multilingual scenarios. In this paper, we propose a novel multilingual framework -- eWER3 -- jointly trained on acoustic and lexical re…
▽ More
The success of the multilingual automatic speech recognition systems empowered many voice-driven applications. However, measuring the performance of such systems remains a major challenge, due to its dependency on manually transcribed speech data in both mono- and multilingual scenarios. In this paper, we propose a novel multilingual framework -- eWER3 -- jointly trained on acoustic and lexical representation to estimate word error rate. We demonstrate the effectiveness of eWER3 to (i) predict WER without using any internal states from the ASR and (ii) use the multilingual shared latent space to push the performance of the close-related languages. We show our proposed multilingual model outperforms the previous monolingual word error rate estimation method (eWER2) by an absolute 9\% increase in Pearson correlation coefficient (PCC), with better overall estimation between the predicted and reference WER.
△ Less
Submitted 2 April, 2023;
originally announced April 2023.
-
Deep Graph Unfolding for Beamforming in MU-MIMO Interference Networks
Authors:
Arindam Chowdhury,
Gunjan Verma,
Ananthram Swami,
Santiago Segarra
Abstract:
We develop an efficient and near-optimal solution for beamforming in multi-user multiple-input-multiple-output single-hop wireless ad-hoc interference networks. Inspired by the weighted minimum mean squared error (WMMSE) method, a classical approach to solving this problem, and the principle of algorithm unfolding, we present unfolded WMMSE (UWMMSE) for MU-MIMO. This method learns a parameterized…
▽ More
We develop an efficient and near-optimal solution for beamforming in multi-user multiple-input-multiple-output single-hop wireless ad-hoc interference networks. Inspired by the weighted minimum mean squared error (WMMSE) method, a classical approach to solving this problem, and the principle of algorithm unfolding, we present unfolded WMMSE (UWMMSE) for MU-MIMO. This method learns a parameterized functional transformation of key WMMSE parameters using graph neural networks (GNNs), where the channel and interference components of a wireless network constitute the underlying graph. These GNNs are trained through gradient descent on a network utility metric using multiple instances of the beamforming problem. Comprehensive experimental analyses illustrate the superiority of UWMMSE over the classical WMMSE and state-of-the-art learning-based methods in terms of performance, generalizability, and robustness.
△ Less
Submitted 2 April, 2023;
originally announced April 2023.
-
Transthoracic super-resolution ultrasound localisation microscopy of myocardial vasculature in patients
Authors:
Jipeng Yan,
Biao Huang,
Johanna Tonko,
Matthieu Toulemonde,
Joseph Hansen-Shearer,
Qingyuan Tan,
Kai Riemer,
Konstantinos Ntagiantas,
Rasheda A Chowdhury,
Pier Lambiase,
Roxy Senior,
Meng-Xing Tang
Abstract:
Micro-vascular flow in the myocardium is of significant importance clinically but remains poorly understood. Up to 25% of patients with symptoms of coronary heart diseases have no obstructive coronary arteries and have suspected microvascular diseases. However, such microvasculature is difficult to image in vivo with existing modalities due to the lack of resolution and sensitivity. Here, we demon…
▽ More
Micro-vascular flow in the myocardium is of significant importance clinically but remains poorly understood. Up to 25% of patients with symptoms of coronary heart diseases have no obstructive coronary arteries and have suspected microvascular diseases. However, such microvasculature is difficult to image in vivo with existing modalities due to the lack of resolution and sensitivity. Here, we demonstrate the feasibility of transthoracic super-resolution ultrasound localisation microscopy (SRUS/ULM) of myocardial microvasculature and hemodynamics in a large animal model and in patients, using a cardiac phased array probe with a customised data acquisition and processing pipeline. A multi-level motion correction strategy was proposed. A tracking framework incorporating multiple features and automatic parameter initialisations was developed to reconstruct microcirculation. In two patients with impaired myocardial function, we have generated SRUS images of myocardial vascular structure and flow with a resolution that is beyond the wave-diffraction limit (half a wavelength), using data acquired within a breath hold. Myocardial SRUS/ULM has potential to improve the understanding of myocardial microcirculation and the management of patients with cardiac microvascular diseases.
△ Less
Submitted 28 March, 2023; v1 submitted 24 March, 2023;
originally announced March 2023.
-
Offline Learning of Closed-Loop Deep Brain Stimulation Controllers for Parkinson Disease Treatment
Authors:
Qitong Gao,
Stephen L. Schimdt,
Afsana Chowdhury,
Guangyu Feng,
Jennifer J. Peters,
Katherine Genty,
Warren M. Grill,
Dennis A. Turner,
Miroslav Pajic
Abstract:
Deep brain stimulation (DBS) has shown great promise toward treating motor symptoms caused by Parkinson's disease (PD), by delivering electrical pulses to the Basal Ganglia (BG) region of the brain. However, DBS devices approved by the U.S. Food and Drug Administration (FDA) can only deliver continuous DBS (cDBS) stimuli at a fixed amplitude; this energy inefficient operation reduces battery lifet…
▽ More
Deep brain stimulation (DBS) has shown great promise toward treating motor symptoms caused by Parkinson's disease (PD), by delivering electrical pulses to the Basal Ganglia (BG) region of the brain. However, DBS devices approved by the U.S. Food and Drug Administration (FDA) can only deliver continuous DBS (cDBS) stimuli at a fixed amplitude; this energy inefficient operation reduces battery lifetime of the device, cannot adapt treatment dynamically for activity, and may cause significant side-effects (e.g., gait impairment). In this work, we introduce an offline reinforcement learning (RL) framework, allowing the use of past clinical data to train an RL policy to adjust the stimulation amplitude in real time, with the goal of reducing energy use while maintaining the same level of treatment (i.e., control) efficacy as cDBS. Moreover, clinical protocols require the safety and performance of such RL controllers to be demonstrated ahead of deployments in patients. Thus, we also introduce an offline policy evaluation (OPE) method to estimate the performance of RL policies using historical data, before deploying them on patients. We evaluated our framework on four PD patients equipped with the RC+S DBS system, employing the RL controllers during monthly clinical visits, with the overall control efficacy evaluated by severity of symptoms (i.e., bradykinesia and tremor), changes in PD biomakers (i.e., local field potentials), and patient ratings. The results from clinical experiments show that our RL-based controller maintains the same level of control efficacy as cDBS, but with significantly reduced stimulation energy. Further, the OPE method is shown effective in accurately estimating and ranking the expected returns of RL controllers.
△ Less
Submitted 15 March, 2023; v1 submitted 5 February, 2023;
originally announced February 2023.
-
SpeechBlender: Speech Augmentation Framework for Mispronunciation Data Generation
Authors:
Yassine El Kheir,
Shammur Absar Chowdhury,
Ahmed Ali,
Hamdy Mubarak,
Shazia Afzal
Abstract:
The lack of labeled second language (L2) speech data is a major challenge in designing mispronunciation detection models. We introduce SpeechBlender - a fine-grained data augmentation pipeline for generating mispronunciation errors to overcome such data scarcity. The SpeechBlender utilizes varieties of masks to target different regions of phonetic units, and use the mixing factors to linearly inte…
▽ More
The lack of labeled second language (L2) speech data is a major challenge in designing mispronunciation detection models. We introduce SpeechBlender - a fine-grained data augmentation pipeline for generating mispronunciation errors to overcome such data scarcity. The SpeechBlender utilizes varieties of masks to target different regions of phonetic units, and use the mixing factors to linearly interpolate raw speech signals while augmenting pronunciation. The masks facilitate smooth blending of the signals, generating more effective samples than the `Cut/Paste' method. Our proposed technique achieves state-of-the-art results, with Speechocean762, on ASR dependent mispronunciation detection models at phoneme level, with a 2.0% gain in Pearson Correlation Coefficient (PCC) compared to the previous state-of-the-art [1]. Additionally, we demonstrate a 5.0% improvement at the phoneme level compared to our baseline. We also observed a 4.6% increase in F1-score with Arabic AraVoiceL2 testset.
△ Less
Submitted 12 July, 2023; v1 submitted 2 November, 2022;
originally announced November 2022.
-
A Novel Entropy-Maximizing TD3-based Reinforcement Learning for Automatic PID Tuning
Authors:
Myisha A. Chowdhury,
Qiugang Lu
Abstract:
Proportional-integral-derivative (PID) controllers have been widely used in the process industry. However, the satisfactory control performance of a PID controller depends strongly on the tuning parameters. Conventional PID tuning methods require extensive knowledge of the system model, which is not always known especially in the case of complex dynamical systems. In contrast, reinforcement learni…
▽ More
Proportional-integral-derivative (PID) controllers have been widely used in the process industry. However, the satisfactory control performance of a PID controller depends strongly on the tuning parameters. Conventional PID tuning methods require extensive knowledge of the system model, which is not always known especially in the case of complex dynamical systems. In contrast, reinforcement learning-based PID tuning has gained popularity since it can treat PID tuning as a black-box problem and deliver the optimal PID parameters without requiring explicit process models. In this paper, we present a novel entropy-maximizing twin-delayed deep deterministic policy gradient (EMTD3) method for automating the PID tuning. In the proposed method, an entropy-maximizing stochastic actor is employed at the beginning to encourage the exploration of the action space. Then a deterministic actor is deployed to focus on local exploitation and discover the optimal solution. The incorporation of the entropy-maximizing term can significantly improve the sample efficiency and assist in fast convergence to the global solution. Our proposed method is applied to the PID tuning of a second-order system to verify its effectiveness in improving the sample efficiency and discovering the optimal PID parameters compared to traditional TD3.
△ Less
Submitted 5 October, 2022;
originally announced October 2022.
-
What can Speech and Language Tell us About the Working Alliance in Psychotherapy
Authors:
Sebastian P. Bayerl,
Gabriel Roccabruna,
Shammur Absar Chowdhury,
Tommaso Ciulli,
Morena Danieli,
Korbinian Riedhammer,
Giuseppe Riccardi
Abstract:
We are interested in the problem of conversational analysis and its application to the health domain. Cognitive Behavioral Therapy is a structured approach in psychotherapy, allowing the therapist to help the patient to identify and modify the malicious thoughts, behavior, or actions. This cooperative effort can be evaluated using the Working Alliance Inventory Observer-rated Shortened - a 12 item…
▽ More
We are interested in the problem of conversational analysis and its application to the health domain. Cognitive Behavioral Therapy is a structured approach in psychotherapy, allowing the therapist to help the patient to identify and modify the malicious thoughts, behavior, or actions. This cooperative effort can be evaluated using the Working Alliance Inventory Observer-rated Shortened - a 12 items inventory covering task, goal, and relationship - which has a relevant influence on therapeutic outcomes. In this work, we investigate the relation between this alliance inventory and the spoken conversations (sessions) between the patient and the psychotherapist. We have delivered eight weeks of e-therapy, collected their audio and video call sessions, and manually transcribed them. The spoken conversations have been annotated and evaluated with WAI ratings by professional therapists. We have investigated speech and language features and their association with WAI items. The feature types include turn dynamics, lexical entrainment, and conversational descriptors extracted from the speech and language signals. Our findings provide strong evidence that a subset of these features are strong indicators of working alliance. To the best of our knowledge, this is the first and a novel study to exploit speech and language for characterising working alliance.
△ Less
Submitted 27 June, 2022; v1 submitted 17 June, 2022;
originally announced June 2022.
-
AI-based automated Meibomian gland segmentation, classification and reflection correction in infrared Meibography
Authors:
Ripon Kumar Saha,
A. M. Mahmud Chowdhury,
Kyung-Sun Na,
Gyu Deok Hwang,
Youngsub Eom,
Jaeyoung Kim,
Hae-Gon Jeon,
Ho Sik Hwang,
Euiheon Chung
Abstract:
Purpose: Develop a deep learning-based automated method to segment meibomian glands (MG) and eyelids, quantitatively analyze the MG area and MG ratio, estimate the meiboscore, and remove specular reflections from infrared images. Methods: A total of 1600 meibography images were captured in a clinical setting. 1000 images were precisely annotated with multiple revisions by investigators and graded…
▽ More
Purpose: Develop a deep learning-based automated method to segment meibomian glands (MG) and eyelids, quantitatively analyze the MG area and MG ratio, estimate the meiboscore, and remove specular reflections from infrared images. Methods: A total of 1600 meibography images were captured in a clinical setting. 1000 images were precisely annotated with multiple revisions by investigators and graded 6 times by meibomian gland dysfunction (MGD) experts. Two deep learning (DL) models were trained separately to segment areas of the MG and eyelid. Those segmentation were used to estimate MG ratio and meiboscores using a classification-based DL model. A generative adversarial network was implemented to remove specular reflections from original images. Results: The mean ratio of MG calculated by investigator annotation and DL segmentation was consistent 26.23% vs 25.12% in the upper eyelids and 32.34% vs. 32.29% in the lower eyelids, respectively. Our DL model achieved 73.01% accuracy for meiboscore classification on validation set and 59.17% accuracy when tested on images from independent center, compared to 53.44% validation accuracy by MGD experts. The DL-based approach successfully removes reflection from the original MG images without affecting meiboscore grading. Conclusions: DL with infrared meibography provides a fully automated, fast quantitative evaluation of MG morphology (MG Segmentation, MG area, MG ratio, and meiboscore) which are sufficiently accurate for diagnosing dry eye disease. Also, the DL removes specular reflection from images to be used by ophthalmologists for distraction-free assessment.
△ Less
Submitted 31 May, 2022;
originally announced May 2022.
-
Textual Data Augmentation for Arabic-English Code-Switching Speech Recognition
Authors:
Amir Hussein,
Shammur Absar Chowdhury,
Ahmed Abdelali,
Najim Dehak,
Ahmed Ali,
Sanjeev Khudanpur
Abstract:
The pervasiveness of intra-utterance code-switching (CS) in spoken content requires that speech recognition (ASR) systems handle mixed language. Designing a CS-ASR system has many challenges, mainly due to data scarcity, grammatical structure complexity, and domain mismatch. The most common method for addressing CS is to train an ASR system with the available transcribed CS speech, along with mono…
▽ More
The pervasiveness of intra-utterance code-switching (CS) in spoken content requires that speech recognition (ASR) systems handle mixed language. Designing a CS-ASR system has many challenges, mainly due to data scarcity, grammatical structure complexity, and domain mismatch. The most common method for addressing CS is to train an ASR system with the available transcribed CS speech, along with monolingual data. In this work, we propose a zero-shot learning methodology for CS-ASR by augmenting the monolingual data with artificially generating CS text. We based our approach on random lexical replacements and Equivalence Constraint (EC) while exploiting aligned translation pairs to generate random and grammatically valid CS content. Our empirical results show a 65.5% relative reduction in language model perplexity, and 7.7% in ASR WER on two ecologically valid CS test sets. The human evaluation of the generated text using EC suggests that more than 80% is of adequate quality.
△ Less
Submitted 11 January, 2023; v1 submitted 7 January, 2022;
originally announced January 2022.
-
Stability-Preserving Automatic Tuning of PID Control with Reinforcement Learning
Authors:
Ayub I. Lakhani,
Myisha A. Chowdhury,
Qiugang Lu
Abstract:
PID control has been the dominant control strategy in the process industry due to its simplicity in design and effectiveness in controlling a wide range of processes. However, traditional methods on PID tuning often require extensive domain knowledge and field experience. To address the issue, this work proposes an automatic PID tuning framework based on reinforcement learning (RL), particularly t…
▽ More
PID control has been the dominant control strategy in the process industry due to its simplicity in design and effectiveness in controlling a wide range of processes. However, traditional methods on PID tuning often require extensive domain knowledge and field experience. To address the issue, this work proposes an automatic PID tuning framework based on reinforcement learning (RL), particularly the deterministic policy gradient (DPG) method. Different from existing studies on using RL for PID tuning, in this work, we consider the closed-loop stability throughout the RL-based tuning process. In particular, we propose a novel episodic tuning framework that allows for an episodic closed-loop operation under selected PID parameters where the actor and critic networks are updated once at the end of each episode. To ensure the closed-loop stability during the tuning, we initialize the training with a conservative but stable baseline PID controller and the resultant reward is used as a benchmark score. A supervisor mechanism is used to monitor the running reward (e.g., tracking error) at each step in the episode. As soon as the running reward exceeds the benchmark score, the underlying controller is replaced by the baseline controller as an early correction to prevent instability. Moreover, we use layer normalization to standardize the input to each layer in actor and critic networks to overcome the issue of policy saturation at action bounds, to ensure the convergence to the optimum. The developed methods are validated through setpoint tracking experiments on a second-order plus dead-time system. Simulation results show that with our scheme, the closed-loop stability can be maintained throughout RL explorations and the explored PID parameters by the RL agent converge quickly to the optimum.
△ Less
Submitted 11 February, 2022; v1 submitted 30 December, 2021;
originally announced December 2021.
-
OpenABC-D: A Large-Scale Dataset For Machine Learning Guided Integrated Circuit Synthesis
Authors:
Animesh Basak Chowdhury,
Benjamin Tan,
Ramesh Karri,
Siddharth Garg
Abstract:
Logic synthesis is a challenging and widely-researched combinatorial optimization problem during integrated circuit (IC) design. It transforms a high-level description of hardware in a programming language like Verilog into an optimized digital circuit netlist, a network of interconnected Boolean logic gates, that implements the function. Spurred by the success of ML in solving combinatorial and g…
▽ More
Logic synthesis is a challenging and widely-researched combinatorial optimization problem during integrated circuit (IC) design. It transforms a high-level description of hardware in a programming language like Verilog into an optimized digital circuit netlist, a network of interconnected Boolean logic gates, that implements the function. Spurred by the success of ML in solving combinatorial and graph problems in other domains, there is growing interest in the design of ML-guided logic synthesis tools. Yet, there are no standard datasets or prototypical learning tasks defined for this problem domain. Here, we describe OpenABC-D,a large-scale, labeled dataset produced by synthesizing open source designs with a leading open-source logic synthesis tool and illustrate its use in developing, evaluating and benchmarking ML-guided logic synthesis. OpenABC-D has intermediate and final outputs in the form of 870,000 And-Inverter-Graphs (AIGs) produced from 1500 synthesis runs plus labels such as the optimized node counts, and de-lay. We define a generic learning problem on this dataset and benchmark existing solutions for it. The codes related to dataset creation and benchmark models are available athttps://github.com/NYU-MLDA/OpenABC.git. The dataset generated is available athttps://archive.nyu.edu/handle/2451/63311
△ Less
Submitted 21 October, 2021;
originally announced October 2021.
-
Can Dynamic TDD Enabled Half-Duplex Cell-Free Massive MIMO Outperform Full-Duplex Cellular Massive MIMO?
Authors:
Anubhab Chowdhury,
Ribhu Chopra,
Chandra R. Murthy
Abstract:
We consider a dynamic time division duplex (DTDD) enabled cell-free massive multiple-input multiple-output (CF-mMIMO) system, where each half-duplex (HD) access point (AP) is scheduled to operate in the uplink (UL) or downlink (DL) mode based on the data demands of the user equipments (UEs), with the goal of maximizing the sum UL-DL spectral efficiency (SE). We develop a new, low complexity, greed…
▽ More
We consider a dynamic time division duplex (DTDD) enabled cell-free massive multiple-input multiple-output (CF-mMIMO) system, where each half-duplex (HD) access point (AP) is scheduled to operate in the uplink (UL) or downlink (DL) mode based on the data demands of the user equipments (UEs), with the goal of maximizing the sum UL-DL spectral efficiency (SE). We develop a new, low complexity, greedy algorithm for the combinatorial AP scheduling problem, with an optimality guarantee theoretically established via showing that a lower bound of the sum UL-DL SE is sub-modular. We also consider pilot sequence reuse among the UEs to limit the channel estimation overhead. In CF systems, all the APs estimate the channel from every UE, making pilot allocation problem different from the cellular case. We develop a novel algorithm that iteratively minimizes the maximum pilot contamination across the UEs. We compare the performance of our solutions, both theoretically and via simulations, against a full duplex (FD) multi-cell mMIMO system. Our results show that, due to the joint processing of the signals at the central processing unit, CF-mMIMO with dynamic HD AP-scheduling significantly outperforms cellular FD-mMIMO in terms of the sum SE and 90% likely SE. Thus, DTDD enabled HD CF-mMIMO is a promising alternative to cellular FD-mMIMO, without the cost of hardware for self-interference suppression.
△ Less
Submitted 21 May, 2022; v1 submitted 19 October, 2021;
originally announced October 2021.
-
Stability Analysis of Unfolded WMMSE for Power Allocation
Authors:
Arindam Chowdhury,
Fernando Gama,
Santiago Segarra
Abstract:
Power allocation is one of the fundamental problems in wireless networks and a wide variety of algorithms address this problem from different perspectives. A common element among these algorithms is that they rely on an estimation of the channel state, which may be inaccurate on account of hardware defects, noisy feedback systems, and environmental and adversarial disturbances. Therefore, it is es…
▽ More
Power allocation is one of the fundamental problems in wireless networks and a wide variety of algorithms address this problem from different perspectives. A common element among these algorithms is that they rely on an estimation of the channel state, which may be inaccurate on account of hardware defects, noisy feedback systems, and environmental and adversarial disturbances. Therefore, it is essential that the output power allocation of these algorithms is stable with respect to input perturbations, to the extent that the variations in the output are bounded for bounded variations in the input. In this paper, we focus on UWMMSE -- a modern algorithm leveraging graph neural networks --, and illustrate its stability to additive input perturbations of bounded energy through both theoretical analysis and empirical validation.
△ Less
Submitted 9 January, 2022; v1 submitted 14 October, 2021;
originally announced October 2021.
-
Label Propagation across Graphs: Node Classification using Graph Neural Tangent Kernels
Authors:
Artun Bayer,
Arindam Chowdhury,
Santiago Segarra
Abstract:
Graph neural networks (GNNs) have achieved superior performance on node classification tasks in the last few years. Commonly, this is framed in a transductive semi-supervised learning setup wherein the entire graph, including the target nodes to be labeled, is available for training. Driven in part by scalability, recent works have focused on the inductive case where only the labeled portion of a…
▽ More
Graph neural networks (GNNs) have achieved superior performance on node classification tasks in the last few years. Commonly, this is framed in a transductive semi-supervised learning setup wherein the entire graph, including the target nodes to be labeled, is available for training. Driven in part by scalability, recent works have focused on the inductive case where only the labeled portion of a graph is available for training. In this context, our current work considers a challenging inductive setting where a set of labeled graphs are available for training while the unlabeled target graph is completely separate, i.e., there are no connections between labeled and unlabeled nodes. Under the implicit assumption that the testing and training graphs come from similar distributions, our goal is to develop a labeling function that generalizes to unobserved connectivity structures. To that end, we employ a graph neural tangent kernel (GNTK) that corresponds to infinitely wide GNNs to find correspondences between nodes in different graphs based on both the topology and the node features. We augment the capabilities of the GNTK with residual connections and empirically illustrate its performance gains on standard benchmarks.
△ Less
Submitted 7 October, 2021;
originally announced October 2021.
-
Moving Object Detection for Event-based vision using Graph Spectral Clustering
Authors:
Anindya Mondal,
Shashant R,
Jhony H. Giraldo,
Thierry Bouwmans,
Ananda S. Chowdhury
Abstract:
Moving object detection has been a central topic of discussion in computer vision for its wide range of applications like in self-driving cars, video surveillance, security, and enforcement. Neuromorphic Vision Sensors (NVS) are bio-inspired sensors that mimic the working of the human eye. Unlike conventional frame-based cameras, these sensors capture a stream of asynchronous 'events' that pose mu…
▽ More
Moving object detection has been a central topic of discussion in computer vision for its wide range of applications like in self-driving cars, video surveillance, security, and enforcement. Neuromorphic Vision Sensors (NVS) are bio-inspired sensors that mimic the working of the human eye. Unlike conventional frame-based cameras, these sensors capture a stream of asynchronous 'events' that pose multiple advantages over the former, like high dynamic range, low latency, low power consumption, and reduced motion blur. However, these advantages come at a high cost, as the event camera data typically contains more noise and has low resolution. Moreover, as event-based cameras can only capture the relative changes in brightness of a scene, event data do not contain usual visual information (like texture and color) as available in video data from normal cameras. So, moving object detection in event-based cameras becomes an extremely challenging task. In this paper, we present an unsupervised Graph Spectral Clustering technique for Moving Object Detection in Event-based data (GSCEventMOD). We additionally show how the optimum number of moving objects can be automatically determined. Experimental comparisons on publicly available datasets show that the proposed GSCEventMOD algorithm outperforms a number of state-of-the-art techniques by a maximum margin of 30%.
△ Less
Submitted 2 December, 2021; v1 submitted 30 September, 2021;
originally announced September 2021.
-
ML-aided power allocation for Tactical MIMO
Authors:
Arindam Chowdhury,
Gunjan Verma,
Chirag Rao,
Ananthram Swami,
Santiago Segarra
Abstract:
We study the problem of optimal power allocation in single-hop multi-antenna ad-hoc wireless networks. A standard technique to solve this problem involves optimizing a tri-convex function under power constraints using a block-coordinate-descent based iterative algorithm. This approach, termed WMMSE, tends to be computationally complex and time consuming. Several learning-based approaches have been…
▽ More
We study the problem of optimal power allocation in single-hop multi-antenna ad-hoc wireless networks. A standard technique to solve this problem involves optimizing a tri-convex function under power constraints using a block-coordinate-descent based iterative algorithm. This approach, termed WMMSE, tends to be computationally complex and time consuming. Several learning-based approaches have been proposed to speed up the power allocation process. A recent work, UWMMSE, learns an affine transformation of a WMMSE parameter in an unfolded structure to accelerate convergence. In spite of achieving promising results, its application is limited to single-antenna wireless networks. In this work, we present a UWMMSE framework for power allocation in (multiple-input multiple-output) MIMO interference networks. A major advantage of this method lies in its use of low-complexity learnable systems in which the number of parameters scales linearly with respect to the hidden layer size of embedded neural architectures and the product of the number of transmitter and receiver antennas only, fully independent of the number of transceivers in the network. We illustrate the superiority of our method through an empirical study of our approach in comparison to WMMSE and also analyze its robustness to changes in channel conditions and network size.
△ Less
Submitted 28 October, 2021; v1 submitted 14 September, 2021;
originally announced September 2021.
-
What do End-to-End Speech Models Learn about Speaker, Language and Channel Information? A Layer-wise and Neuron-level Analysis
Authors:
Shammur Absar Chowdhury,
Nadir Durrani,
Ahmed Ali
Abstract:
Deep neural networks are inherently opaque and challenging to interpret. Unlike hand-crafted feature-based models, we struggle to comprehend the concepts learned and how they interact within these models. This understanding is crucial not only for debugging purposes but also for ensuring fairness in ethical decision-making. In our study, we conduct a post-hoc functional interpretability analysis o…
▽ More
Deep neural networks are inherently opaque and challenging to interpret. Unlike hand-crafted feature-based models, we struggle to comprehend the concepts learned and how they interact within these models. This understanding is crucial not only for debugging purposes but also for ensuring fairness in ethical decision-making. In our study, we conduct a post-hoc functional interpretability analysis of pretrained speech models using the probing framework [1]. Specifically, we analyze utterance-level representations of speech models trained for various tasks such as speaker recognition and dialect identification. We conduct layer and neuron-wise analyses, probing for speaker, language, and channel properties. Our study aims to answer the following questions: i) what information is captured within the representations? ii) how is it represented and distributed? and iii) can we identify a minimal subset of the network that possesses this information?
Our results reveal several novel findings, including: i) channel and gender information are distributed across the network, ii) the information is redundantly available in neurons with respect to a task, iii) complex properties such as dialectal information are encoded only in the task-oriented pretrained network, iv) and is localised in the upper layers, v) we can extract a minimal subset of neurons encoding the pre-defined property, vi) salient neurons are sometimes shared between properties, vii) our analysis highlights the presence of biases (for example gender) in the network. Our cross-architectural comparison indicates that: i) the pretrained models capture speaker-invariant information, and ii) CNN models are competitive with Transformer models in encoding various understudied properties.
△ Less
Submitted 10 July, 2023; v1 submitted 1 July, 2021;
originally announced July 2021.
-
QASR: QCRI Aljazeera Speech Resource -- A Large Scale Annotated Arabic Speech Corpus
Authors:
Hamdy Mubarak,
Amir Hussein,
Shammur Absar Chowdhury,
Ahmed Ali
Abstract:
We introduce the largest transcribed Arabic speech corpus, QASR, collected from the broadcast domain. This multi-dialect speech dataset contains 2,000 hours of speech sampled at 16kHz crawled from Aljazeera news channel. The dataset is released with lightly supervised transcriptions, aligned with the audio segments. Unlike previous datasets, QASR contains linguistically motivated segmentation, pun…
▽ More
We introduce the largest transcribed Arabic speech corpus, QASR, collected from the broadcast domain. This multi-dialect speech dataset contains 2,000 hours of speech sampled at 16kHz crawled from Aljazeera news channel. The dataset is released with lightly supervised transcriptions, aligned with the audio segments. Unlike previous datasets, QASR contains linguistically motivated segmentation, punctuation, speaker information among others. QASR is suitable for training and evaluating speech recognition systems, acoustics- and/or linguistics- based Arabic dialect identification, punctuation restoration, speaker identification, speaker linking, and potentially other NLP modules for spoken data. In addition to QASR transcription, we release a dataset of 130M words to aid in designing and training a better language model. We show that end-to-end automatic speech recognition trained on QASR reports a competitive word error rate compared to the previous MGB-2 corpus. We report baseline results for downstream natural language processing tasks such as named entity recognition using speech transcript. We also report the first baseline for Arabic punctuation restoration. We make the corpus available for the research community.
△ Less
Submitted 24 June, 2021;
originally announced June 2021.
-
Ultrasound Classification of Breast Masses Using a Comprehensive Nakagami Imaging and Machine Learning Framework
Authors:
Ahmad Chowdhury,
Rezwana R. Razzaque,
Ahmad Shafiullah,
Sabiq Muhtadi,
Brian S. Garra,
S. Kaisar Alam
Abstract:
In this study we investigate the potential of parametric images formed from ultrasound B-mode scans using the Nakagami distribution for non-invasive classification of breast lesions. Through a sliding window technique, we generated seven types of parametric images from each patient scan in our dataset using basic and as well as derived parameters of the Nakagami distribution. To determine the most…
▽ More
In this study we investigate the potential of parametric images formed from ultrasound B-mode scans using the Nakagami distribution for non-invasive classification of breast lesions. Through a sliding window technique, we generated seven types of parametric images from each patient scan in our dataset using basic and as well as derived parameters of the Nakagami distribution. To determine the most suitable window size for image generation, we conducted an empirical analysis using three windows, and selected the best one for our study. From the parametric images formed for each patient, we extracted a total of 72 features. Feature selection was performed to find the optimum subset of features for the best classification performance. Incorporating the selected subset of features with the Support Vector Machine (SVM) classifier, and by tuning the decision threshold, we obtained a maximum classification accuracy of 93.08%, an Area under the ROC Curve (AUC) of 0.9712, a False Negative Rate of 0%, and a very low False Positive Rate of 8.65%. Our results indicate that the high accuracy of such a procedure may assist in the diagnostic process associated with detection of breast cancer, as well as help to reduce false positive diagnosis.
△ Less
Submitted 20 June, 2021; v1 submitted 13 June, 2021;
originally announced June 2021.
-
Towards One Model to Rule All: Multilingual Strategy for Dialectal Code-Switching Arabic ASR
Authors:
Shammur Absar Chowdhury,
Amir Hussein,
Ahmed Abdelali,
Ahmed Ali
Abstract:
With the advent of globalization, there is an increasing demand for multilingual automatic speech recognition (ASR), handling language and dialectal variation of spoken content. Recent studies show its efficacy over monolingual systems. In this study, we design a large multilingual end-to-end ASR using self-attention based conformer architecture. We trained the system using Arabic (Ar), English (E…
▽ More
With the advent of globalization, there is an increasing demand for multilingual automatic speech recognition (ASR), handling language and dialectal variation of spoken content. Recent studies show its efficacy over monolingual systems. In this study, we design a large multilingual end-to-end ASR using self-attention based conformer architecture. We trained the system using Arabic (Ar), English (En) and French (Fr) languages. We evaluate the system performance handling: (i) monolingual (Ar, En and Fr); (ii) multi-dialectal (Modern Standard Arabic, along with dialectal variation such as Egyptian and Moroccan); (iii) code-switching -- cross-lingual (Ar-En/Fr) and dialectal (MSA-Egyptian dialect) test cases, and compare with current state-of-the-art systems. Furthermore, we investigate the influence of different embedding/character representations including character vs word-piece; shared vs distinct input symbol per language. Our findings demonstrate the strength of such a model by outperforming state-of-the-art monolingual dialectal Arabic and code-switching Arabic ASR.
△ Less
Submitted 5 July, 2021; v1 submitted 31 May, 2021;
originally announced May 2021.
-
DeepTalk: Vocal Style Encoding for Speaker Recognition and Speech Synthesis
Authors:
Anurag Chowdhury,
Arun Ross,
Prabu David
Abstract:
Automatic speaker recognition algorithms typically characterize speech audio using short-term spectral features that encode the physiological and anatomical aspects of speech production. Such algorithms do not fully capitalize on speaker-dependent characteristics present in behavioral speech features. In this work, we propose a prosody encoding network called DeepTalk for extracting vocal style fe…
▽ More
Automatic speaker recognition algorithms typically characterize speech audio using short-term spectral features that encode the physiological and anatomical aspects of speech production. Such algorithms do not fully capitalize on speaker-dependent characteristics present in behavioral speech features. In this work, we propose a prosody encoding network called DeepTalk for extracting vocal style features directly from raw audio data. The DeepTalk method outperforms several state-of-the-art speaker recognition systems across multiple challenging datasets. The speaker recognition performance is further improved by combining DeepTalk with a state-of-the-art physiological speech feature-based speaker recognition system. We also integrate DeepTalk into a current state-of-the-art speech synthesizer to generate synthetic speech. A detailed analysis of the synthetic speech shows that the DeepTalk captures F0 contours essential for vocal style modeling. Furthermore, DeepTalk-based synthetic speech is shown to be almost indistinguishable from real speech in the context of speaker recognition.
△ Less
Submitted 14 February, 2021; v1 submitted 9 December, 2020;
originally announced December 2020.
-
Efficient power allocation using graph neural networks and deep algorithm unfolding
Authors:
Arindam Chowdhury,
Gunjan Verma,
Chirag Rao,
Ananthram Swami,
Santiago Segarra
Abstract:
We study the problem of optimal power allocation in a single-hop ad hoc wireless network. In solving this problem, we propose a hybrid neural architecture inspired by the algorithmic unfolding of the iterative weighted minimum mean squared error (WMMSE) method, that we denote as unfolded WMMSE (UWMMSE). The learnable weights within UWMMSE are parameterized using graph neural networks (GNNs), where…
▽ More
We study the problem of optimal power allocation in a single-hop ad hoc wireless network. In solving this problem, we propose a hybrid neural architecture inspired by the algorithmic unfolding of the iterative weighted minimum mean squared error (WMMSE) method, that we denote as unfolded WMMSE (UWMMSE). The learnable weights within UWMMSE are parameterized using graph neural networks (GNNs), where the time-varying underlying graphs are given by the fading interference coefficients in the wireless network. These GNNs are trained through a gradient descent approach based on multiple instances of the power allocation problem. Once trained, UWMMSE achieves performance comparable to that of WMMSE while significantly reducing the computational complexity. This phenomenon is illustrated through numerical experiments along with the robustness and generalization to wireless networks of different densities and sizes.
△ Less
Submitted 18 November, 2020;
originally announced December 2020.
-
Unfolding WMMSE using Graph Neural Networks for Efficient Power Allocation
Authors:
Arindam Chowdhury,
Gunjan Verma,
Chirag Rao,
Ananthram Swami,
Santiago Segarra
Abstract:
We study the problem of optimal power allocation in a single-hop ad hoc wireless network. In solving this problem, we depart from classical purely model-based approaches and propose a hybrid method that retains key modeling elements in conjunction with data-driven components. More precisely, we put forth a neural network architecture inspired by the algorithmic unfolding of the iterative weighted…
▽ More
We study the problem of optimal power allocation in a single-hop ad hoc wireless network. In solving this problem, we depart from classical purely model-based approaches and propose a hybrid method that retains key modeling elements in conjunction with data-driven components. More precisely, we put forth a neural network architecture inspired by the algorithmic unfolding of the iterative weighted minimum mean squared error (WMMSE) method, that we denote by unfolded WMMSE (UWMMSE). The learnable weights within UWMMSE are parameterized using graph neural networks (GNNs), where the time-varying underlying graphs are given by the fading interference coefficients in the wireless network. These GNNs are trained through a gradient descent approach based on multiple instances of the power allocation problem. We show that the proposed architecture is permutation equivariant, thus facilitating generalizability across network topologies. Comprehensive numerical experiments illustrate the performance attained by UWMMSE along with its robustness to hyper-parameter selection and generalizability to unseen scenarios such as different network densities and network sizes.
△ Less
Submitted 8 April, 2021; v1 submitted 22 September, 2020;
originally announced September 2020.
-
DeepVOX: Discovering Features from Raw Audio for Speaker Recognition in Non-ideal Audio Signals
Authors:
Anurag Chowdhury,
Arun Ross
Abstract:
Automatic speaker recognition algorithms typically use pre-defined filterbanks, such as Mel-Frequency and Gammatone filterbanks, for characterizing speech audio. However, it has been observed that the features extracted using these filterbanks are not resilient to diverse audio degradations. In this work, we propose a deep learning-based technique to deduce the filterbank design from vast amounts…
▽ More
Automatic speaker recognition algorithms typically use pre-defined filterbanks, such as Mel-Frequency and Gammatone filterbanks, for characterizing speech audio. However, it has been observed that the features extracted using these filterbanks are not resilient to diverse audio degradations. In this work, we propose a deep learning-based technique to deduce the filterbank design from vast amounts of speech audio. The purpose of such a filterbank is to extract features robust to non-ideal audio conditions, such as degraded, short duration, and multi-lingual speech. To this effect, a 1D convolutional neural network is designed to learn a time-domain filterbank called DeepVOX directly from raw speech audio. Secondly, an adaptive triplet mining technique is developed to efficiently mine the data samples best suited to train the filterbank. Thirdly, a detailed ablation study of the DeepVOX filterbanks reveals the presence of both vocal source and vocal tract characteristics in the extracted features. Experimental results on VOXCeleb2, NIST SRE 2008, 2010 and 2018, and Fisher speech datasets demonstrate the efficacy of the DeepVOX features across a variety of degraded, short duration, and multi-lingual speech. The DeepVOX features also shown to improve the performance of existing speaker recognition algorithms, such as the xVector-PLDA and the iVector-PLDA.
△ Less
Submitted 12 June, 2022; v1 submitted 26 August, 2020;
originally announced August 2020.
-
Symbolic Semantic Segmentation and Interpretation of COVID-19 Lung Infections in Chest CT volumes based on Emergent Languages
Authors:
Aritra Chowdhury,
Alberto Santamaria-Pang,
James R. Kubricht,
Jianwei Qiu,
Peter Tu
Abstract:
The coronavirus disease (COVID-19) has resulted in a pandemic crippling the a breadth of services critical to daily life. Segmentation of lung infections in computerized tomography (CT) slices could be be used to improve diagnosis and understanding of COVID-19 in patients. Deep learning systems lack interpretability because of their black box nature. Inspired by human communication of complex idea…
▽ More
The coronavirus disease (COVID-19) has resulted in a pandemic crippling the a breadth of services critical to daily life. Segmentation of lung infections in computerized tomography (CT) slices could be be used to improve diagnosis and understanding of COVID-19 in patients. Deep learning systems lack interpretability because of their black box nature. Inspired by human communication of complex ideas through language, we propose a symbolic framework based on emergent languages for the segmentation of COVID-19 infections in CT scans of lungs. We model the cooperation between two artificial agents - a Sender and a Receiver. These agents synergistically cooperate using emergent symbolic language to solve the task of semantic segmentation. Our game theoretic approach is to model the cooperation between agents unlike Generative Adversarial Networks (GANs). The Sender retrieves information from one of the higher layers of the deep network and generates a symbolic sentence sampled from a categorical distribution of vocabularies. The Receiver ingests the stream of symbols and cogenerates the segmentation mask. A private emergent language is developed that forms the communication channel used to describe the task of segmentation of COVID infections. We augment existing state of the art semantic segmentation architectures with our symbolic generator to form symbolic segmentation models. Our symbolic segmentation framework achieves state of the art performance for segmentation of lung infections caused by COVID-19. Our results show direct interpretation of symbolic sentences to discriminate between normal and infected regions, infection morphology and image characteristics. We show state of the art results for segmentation of COVID-19 lung infections in CT.
△ Less
Submitted 22 August, 2020;
originally announced August 2020.