-
Bronchovascular Tree-Guided Weakly Supervised Learning Method for Pulmonary Segment Segmentation
Authors:
Ruijie Zhao,
Zuopeng Tan,
Xiao Xue,
Longfei Zhao,
Bing Li,
Zicheng Liao,
Ying Ming,
Jiaru Wang,
Ran Xiao,
Sirong Piao,
Rui Zhao,
Qiqi Xu,
Wei Song
Abstract:
Pulmonary segment segmentation is crucial for cancer localization and surgical planning. However, the pixel-wise annotation of pulmonary segments is laborious, as the boundaries between segments are indistinguishable in medical images. To this end, we propose a weakly supervised learning (WSL) method, termed Anatomy-Hierarchy Supervised Learning (AHSL), which consults the precise clinical anatomic…
▽ More
Pulmonary segment segmentation is crucial for cancer localization and surgical planning. However, the pixel-wise annotation of pulmonary segments is laborious, as the boundaries between segments are indistinguishable in medical images. To this end, we propose a weakly supervised learning (WSL) method, termed Anatomy-Hierarchy Supervised Learning (AHSL), which consults the precise clinical anatomical definition of pulmonary segments to perform pulmonary segment segmentation. Since pulmonary segments reside within the lobes and are determined by the bronchovascular tree, i.e., artery, airway and vein, the design of the loss function is founded on two principles. First, segment-level labels are utilized to directly supervise the output of the pulmonary segments, ensuring that they accurately encompass the appropriate bronchovascular tree. Second, lobe-level supervision indirectly oversees the pulmonary segment, ensuring their inclusion within the corresponding lobe. Besides, we introduce a two-stage segmentation strategy that incorporates bronchovascular priori information. Furthermore, a consistency loss is proposed to enhance the smoothness of segment boundaries, along with an evaluation metric designed to measure the smoothness of pulmonary segment boundaries. Visual inspection and evaluation metrics from experiments conducted on a private dataset demonstrate the effectiveness of our method.
△ Less
Submitted 20 May, 2025;
originally announced May 2025.
-
F5R-TTS: Improving Flow-Matching based Text-to-Speech with Group Relative Policy Optimization
Authors:
Xiaohui Sun,
Ruitong Xiao,
Jianye Mo,
Bowen Wu,
Qun Yu,
Baoxun Wang
Abstract:
We present F5R-TTS, a novel text-to-speech (TTS) system that integrates Group Relative Policy Optimization (GRPO) into a flow-matching based architecture. By reformulating the deterministic outputs of flow-matching TTS into probabilistic Gaussian distributions, our approach enables seamless integration of reinforcement learning algorithms. During pretraining, we train a probabilistically reformula…
▽ More
We present F5R-TTS, a novel text-to-speech (TTS) system that integrates Group Relative Policy Optimization (GRPO) into a flow-matching based architecture. By reformulating the deterministic outputs of flow-matching TTS into probabilistic Gaussian distributions, our approach enables seamless integration of reinforcement learning algorithms. During pretraining, we train a probabilistically reformulated flow-matching based model which is derived from F5-TTS with an open-source dataset. In the subsequent reinforcement learning (RL) phase, we employ a GRPO-driven enhancement stage that leverages dual reward metrics: word error rate (WER) computed via automatic speech recognition and speaker similarity (SIM) assessed by verification models. Experimental results on zero-shot voice cloning demonstrate that F5R-TTS achieves significant improvements in both speech intelligibility (a 29.5% relative reduction in WER) and speaker similarity (a 4.6% relative increase in SIM score) compared to conventional flow-matching based TTS systems. Audio samples are available at https://frontierlabs.github.io/F5R.
△ Less
Submitted 22 April, 2025; v1 submitted 3 April, 2025;
originally announced April 2025.
-
Fusion of ECG Foundation Model Embeddings to Improve Early Detection of Acute Coronary Syndromes
Authors:
Zeyuan Meng,
Lovely Yeswanth Panchumarthi,
Saurabh Kataria,
Alex Fedorov,
Jessica Zègre-Hemsey,
Xiao Hu,
Ran Xiao
Abstract:
Acute Coronary Syndrome (ACS) is a life-threatening cardiovascular condition where early and accurate diagnosis is critical for effective treatment and improved patient outcomes. This study explores the use of ECG foundation models, specifically ST-MEM and ECG-FM, to enhance ACS risk assessment using prehospital ECG data collected in ambulances. Both models leverage self-supervised learning (SSL),…
▽ More
Acute Coronary Syndrome (ACS) is a life-threatening cardiovascular condition where early and accurate diagnosis is critical for effective treatment and improved patient outcomes. This study explores the use of ECG foundation models, specifically ST-MEM and ECG-FM, to enhance ACS risk assessment using prehospital ECG data collected in ambulances. Both models leverage self-supervised learning (SSL), with ST-MEM using a reconstruction-based approach and ECG-FM employing contrastive learning, capturing unique spatial and temporal ECG features. We evaluate the performance of these models individually and through a fusion approach, where their embeddings are combined for enhanced prediction. Results demonstrate that both foundation models outperform a baseline ResNet-50 model, with the fusion-based approach achieving the highest performance (AUROC: 0.843 +/- 0.006, AUCPR: 0.674 +/- 0.012). These findings highlight the potential of ECG foundation models for early ACS detection and motivate further exploration of advanced fusion strategies to maximize complementary feature utilization.
△ Less
Submitted 16 February, 2025;
originally announced February 2025.
-
Incomplete Graph Learning: A Comprehensive Survey
Authors:
Riting Xia,
Huibo Liu,
Anchen Li,
Xueyan Liu,
Yan Zhang,
Chunxu Zhang,
Bo Yang
Abstract:
Graph learning is a prevalent field that operates on ubiquitous graph data. Effective graph learning methods can extract valuable information from graphs. However, these methods are non-robust and affected by missing attributes in graphs, resulting in sub-optimal outcomes. This has led to the emergence of incomplete graph learning, which aims to process and learn from incomplete graphs to achieve…
▽ More
Graph learning is a prevalent field that operates on ubiquitous graph data. Effective graph learning methods can extract valuable information from graphs. However, these methods are non-robust and affected by missing attributes in graphs, resulting in sub-optimal outcomes. This has led to the emergence of incomplete graph learning, which aims to process and learn from incomplete graphs to achieve more accurate and representative results. In this paper, we conducted a comprehensive review of the literature on incomplete graph learning. Initially, we categorize incomplete graphs and provide precise definitions of relevant concepts, terminologies, and techniques, thereby establishing a solid understanding for readers. Subsequently, we classify incomplete graph learning methods according to the types of incompleteness: (1) attribute-incomplete graph learning methods, (2) attribute-missing graph learning methods, and (3) hybrid-absent graph learning methods. By systematically classifying and summarizing incomplete graph learning methods, we highlight the commonalities and differences among existing approaches, aiding readers in selecting methods and laying the groundwork for further advancements. In addition, we summarize the datasets, incomplete processing modes, evaluation metrics, and application domains used by the current methods. Lastly, we discuss the current challenges and propose future directions for incomplete graph learning, with the aim of stimulating further innovations in this crucial field. To our knowledge, this is the first review dedicated to incomplete graph learning, aiming to offer valuable insights for researchers in related fields.We developed an online resource to follow relevant research based on this review, available at https://github.com/cherry-a11y/Incomplete-graph-learning.git
△ Less
Submitted 17 February, 2025;
originally announced February 2025.
-
Baichuan-Omni-1.5 Technical Report
Authors:
Yadong Li,
Jun Liu,
Tao Zhang,
Tao Zhang,
Song Chen,
Tianpeng Li,
Zehuan Li,
Lijun Liu,
Lingfeng Ming,
Guosheng Dong,
Da Pan,
Chong Li,
Yuanbo Fang,
Dongdong Kuang,
Mingrui Wang,
Chenglin Zhu,
Youwei Zhang,
Hongyu Guo,
Fengyu Zhang,
Yuran Wang,
Bowen Ding,
Wei Song,
Xu Li,
Yuqi Huo,
Zheng Liang
, et al. (68 additional authors not shown)
Abstract:
We introduce Baichuan-Omni-1.5, an omni-modal model that not only has omni-modal understanding capabilities but also provides end-to-end audio generation capabilities. To achieve fluent and high-quality interaction across modalities without compromising the capabilities of any modality, we prioritized optimizing three key aspects. First, we establish a comprehensive data cleaning and synthesis pip…
▽ More
We introduce Baichuan-Omni-1.5, an omni-modal model that not only has omni-modal understanding capabilities but also provides end-to-end audio generation capabilities. To achieve fluent and high-quality interaction across modalities without compromising the capabilities of any modality, we prioritized optimizing three key aspects. First, we establish a comprehensive data cleaning and synthesis pipeline for multimodal data, obtaining about 500B high-quality data (text, audio, and vision). Second, an audio-tokenizer (Baichuan-Audio-Tokenizer) has been designed to capture both semantic and acoustic information from audio, enabling seamless integration and enhanced compatibility with MLLM. Lastly, we designed a multi-stage training strategy that progressively integrates multimodal alignment and multitask fine-tuning, ensuring effective synergy across all modalities. Baichuan-Omni-1.5 leads contemporary models (including GPT4o-mini and MiniCPM-o 2.6) in terms of comprehensive omni-modal capabilities. Notably, it achieves results comparable to leading models such as Qwen2-VL-72B across various multimodal medical benchmarks.
△ Less
Submitted 25 January, 2025;
originally announced January 2025.
-
Enhancing Scene Classification in Cloudy Image Scenarios: A Collaborative Transfer Method with Information Regulation Mechanism using Optical Cloud-Covered and SAR Remote Sensing Images
Authors:
Yuze Wang,
Rong Xiao,
Haifeng Li,
Mariana Belgiu,
Chao Tao
Abstract:
In remote sensing scene classification, leveraging the transfer methods with well-trained optical models is an efficient way to overcome label scarcity. However, cloud contamination leads to optical information loss and significant impacts on feature distribution, challenging the reliability and stability of transferred target models. Common solutions include cloud removal for optical data or dire…
▽ More
In remote sensing scene classification, leveraging the transfer methods with well-trained optical models is an efficient way to overcome label scarcity. However, cloud contamination leads to optical information loss and significant impacts on feature distribution, challenging the reliability and stability of transferred target models. Common solutions include cloud removal for optical data or directly using Synthetic aperture radar (SAR) data in the target domain. However, cloud removal requires substantial auxiliary data for support and pre-training, while directly using SAR disregards the unobstructed portions of optical data. This study presents a scene classification transfer method that synergistically combines multi-modality data, which aims to transfer the source domain model trained on cloudfree optical data to the target domain that includes both cloudy optical and SAR data at low cost. Specifically, the framework incorporates two parts: (1) the collaborative transfer strategy, based on knowledge distillation, enables the efficient prior knowledge transfer across heterogeneous data; (2) the information regulation mechanism (IRM) is proposed to address the modality imbalance issue during transfer. It employs auxiliary models to measure the contribution discrepancy of each modality, and automatically balances the information utilization of modalities during the target model learning process at the sample-level. The transfer experiments were conducted on simulated and real cloud datasets, demonstrating the superior performance of the proposed method compared to other solutions in cloud-covered scenarios. We also verified the importance and limitations of IRM, and further discussed and visualized the modality imbalance problem during the model transfer. Codes are available at https://github.com/wangyuze-csu/ESCCS
△ Less
Submitted 8 January, 2025;
originally announced January 2025.
-
Conformable Convolution for Topologically Aware Learning of Complex Anatomical Structures
Authors:
Yousef Yeganeh,
Rui Xiao,
Goktug Guvercin,
Nassir Navab,
Azade Farshad
Abstract:
While conventional computer vision emphasizes pixel-level and feature-based objectives, medical image analysis of intricate biological structures necessitates explicit representation of their complex topological properties. Despite their successes, deep learning models often struggle to accurately capture the connectivity and continuity of fine, sometimes pixel-thin, yet critical structures due to…
▽ More
While conventional computer vision emphasizes pixel-level and feature-based objectives, medical image analysis of intricate biological structures necessitates explicit representation of their complex topological properties. Despite their successes, deep learning models often struggle to accurately capture the connectivity and continuity of fine, sometimes pixel-thin, yet critical structures due to their reliance on implicit learning from data. Such shortcomings can significantly impact the reliability of analysis results and hinder clinical decision-making. To address this challenge, we introduce Conformable Convolution, a novel convolutional layer designed to explicitly enforce topological consistency. Conformable Convolution learns adaptive kernel offsets that preferentially focus on regions of high topological significance within an image. This prioritization is guided by our proposed Topological Posterior Generator (TPG) module, which leverages persistent homology. The TPG module identifies key topological features and guides the convolutional layers by applying persistent homology to feature maps transformed into cubical complexes. Our proposed modules are architecture-agnostic, enabling them to be integrated seamlessly into various architectures. We showcase the effectiveness of our framework in the segmentation task, where preserving the interconnectedness of structures is critical. Experimental results on three diverse datasets demonstrate that our framework effectively preserves the topology in the segmentation downstream task, both quantitatively and qualitatively.
△ Less
Submitted 29 December, 2024;
originally announced December 2024.
-
Acceleration and Parallelization Methods for ISRS EGN Model
Authors:
Ruiyang Xia,
Guanjun Gao,
Zanshan Zhao,
Haoyu Wang,
Kun Wen,
Daobin Wang
Abstract:
The enhanced Gaussian noise (EGN) model, which accounts for inter-channel stimulated Raman scattering (ISRS), has been extensively utilized for evaluating nonlinear interference (NLI) within the C+L band. Compared to closed-form expressions and machine learning-based NLI evaluation models, it demonstrates broader applicability and its accuracy is not dependent on the support of large-scale dataset…
▽ More
The enhanced Gaussian noise (EGN) model, which accounts for inter-channel stimulated Raman scattering (ISRS), has been extensively utilized for evaluating nonlinear interference (NLI) within the C+L band. Compared to closed-form expressions and machine learning-based NLI evaluation models, it demonstrates broader applicability and its accuracy is not dependent on the support of large-scale datasets. However, its high computational complexity often results in lengthy computation times. Through analysis, the high-frequency oscillations of the four-wave mixing (FWM) efficiency factor integrand were identified as a primary factor limiting the computational speed of the ISRS EGN model. To address this issue, we propose an accurate approximation method that enables the derivation of a closed-form expression for the FWM efficiency factor without imposing restrictive conditions. Thereby, the scheme proposed in this paper could significantly accelerate the computational speed. Numerical results demonstrate that method in this work could achieve low error levels under high ISRS influence levels, with an MAE of less than 0.001 dB, and no cumulative error over increasing transmission distances, while reducing computation time by over 97%. Furthermore, a parallel computation strategy targeting independent regions within the integration domain is proposed, which could further improve computational efficiency by nearly 11 times.
△ Less
Submitted 16 December, 2024;
originally announced December 2024.
-
Topology and Intersection-Union Constrained Loss Function for Multi-Region Anatomical Segmentation in Ocular Images
Authors:
Ruiyu Xia,
Jianqiang Li,
Xi Xu,
Guanghui Fu
Abstract:
Ocular Myasthenia Gravis (OMG) is a rare and challenging disease to detect in its early stages, but symptoms often first appear in the eye muscles, such as drooping eyelids and double vision. Ocular images can be used for early diagnosis by segmenting different regions, such as the sclera, iris, and pupil, which allows for the calculation of area ratios to support accurate medical assessments. How…
▽ More
Ocular Myasthenia Gravis (OMG) is a rare and challenging disease to detect in its early stages, but symptoms often first appear in the eye muscles, such as drooping eyelids and double vision. Ocular images can be used for early diagnosis by segmenting different regions, such as the sclera, iris, and pupil, which allows for the calculation of area ratios to support accurate medical assessments. However, no publicly available dataset and tools currently exist for this purpose. To address this, we propose a new topology and intersection-union constrained loss function (TIU loss) that improves performance using small training datasets. We conducted experiments on a public dataset consisting of 55 subjects and 2,197 images. Our proposed method outperformed two widely used loss functions across three deep learning networks, achieving a mean Dice score of 83.12% [82.47%, 83.81%] with a 95% bootstrap confidence interval. In a low-percentage training scenario (10% of the training data), our approach showed an 8.32% improvement in Dice score compared to the baseline. Additionally, we evaluated the method in a clinical setting with 47 subjects and 501 images, achieving a Dice score of 64.44% [63.22%, 65.62%]. We did observe some bias when applying the model in clinical settings. These results demonstrate that the proposed method is accurate, and our code along with the trained model is publicly available.
△ Less
Submitted 1 November, 2024;
originally announced November 2024.
-
EAR: Edge-Aware Reconstruction of 3-D vertebrae structures from bi-planar X-ray images
Authors:
Lixing Tan,
Shuang Song,
Yaofeng He,
Kangneng Zhou,
Tong Lu,
Ruoxiu Xiao
Abstract:
X-ray images ease the diagnosis and treatment process due to their rapid imaging speed and high resolution. However, due to the projection process of X-ray imaging, much spatial information has been lost. To accurately provide efficient spinal morphological and structural information, reconstructing the 3-D structures of the spine from the 2-D X-ray images is essential. It is challenging for curre…
▽ More
X-ray images ease the diagnosis and treatment process due to their rapid imaging speed and high resolution. However, due to the projection process of X-ray imaging, much spatial information has been lost. To accurately provide efficient spinal morphological and structural information, reconstructing the 3-D structures of the spine from the 2-D X-ray images is essential. It is challenging for current reconstruction methods to preserve the edge information and local shapes of the asymmetrical vertebrae structures. In this study, we propose a new Edge-Aware Reconstruction network (EAR) to focus on the performance improvement of the edge information and vertebrae shapes. In our network, by using the auto-encoder architecture as the backbone, the edge attention module and frequency enhancement module are proposed to strengthen the perception of the edge reconstruction. Meanwhile, we also combine four loss terms, including reconstruction loss, edge loss, frequency loss and projection loss. The proposed method is evaluated using three publicly accessible datasets and compared with four state-of-the-art models. The proposed method is superior to other methods and achieves 25.32%, 15.32%, 86.44%, 80.13%, 23.7612 and 0.3014 with regard to MSE, MAE, Dice, SSIM, PSNR and frequency distance. Due to the end-to-end and accurate reconstruction process, EAR can provide sufficient 3-D spatial information and precise preoperative surgical planning guidance.
△ Less
Submitted 4 August, 2024; v1 submitted 30 July, 2024;
originally announced July 2024.
-
Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition
Authors:
Ye Bai,
Jingping Chen,
Jitong Chen,
Wei Chen,
Zhuo Chen,
Chuang Ding,
Linhao Dong,
Qianqian Dong,
Yujiao Du,
Kepan Gao,
Lu Gao,
Yi Guo,
Minglun Han,
Ting Han,
Wenchao Hu,
Xinying Hu,
Yuxiang Hu,
Deyu Hua,
Lu Huang,
Mingkun Huang,
Youjia Huang,
Jishuo Jin,
Fanliu Kong,
Zongwei Lan,
Tianyu Li
, et al. (30 additional authors not shown)
Abstract:
Modern automatic speech recognition (ASR) model is required to accurately transcribe diverse speech signals (from different domains, languages, accents, etc) given the specific contextual information in various application scenarios. Classic end-to-end models fused with extra language models perform well, but mainly in data matching scenarios and are gradually approaching a bottleneck. In this wor…
▽ More
Modern automatic speech recognition (ASR) model is required to accurately transcribe diverse speech signals (from different domains, languages, accents, etc) given the specific contextual information in various application scenarios. Classic end-to-end models fused with extra language models perform well, but mainly in data matching scenarios and are gradually approaching a bottleneck. In this work, we introduce Seed-ASR, a large language model (LLM) based speech recognition model. Seed-ASR is developed based on the framework of audio conditioned LLM (AcLLM), leveraging the capabilities of LLMs by inputting continuous speech representations together with contextual information into the LLM. Through stage-wise large-scale training and the elicitation of context-aware capabilities in LLM, Seed-ASR demonstrates significant improvement over end-to-end models on comprehensive evaluation sets, including multiple domains, accents/dialects and languages. Additionally, Seed-ASR can be further deployed to support specific needs in various scenarios without requiring extra language models. Compared to recently released large ASR models, Seed-ASR achieves 10%-40% reduction in word (or character, for Chinese) error rates on Chinese and English public test sets, further demonstrating its powerful performance.
△ Less
Submitted 10 July, 2024; v1 submitted 5 July, 2024;
originally announced July 2024.
-
QUBIQ: Uncertainty Quantification for Biomedical Image Segmentation Challenge
Authors:
Hongwei Bran Li,
Fernando Navarro,
Ivan Ezhov,
Amirhossein Bayat,
Dhritiman Das,
Florian Kofler,
Suprosanna Shit,
Diana Waldmannstetter,
Johannes C. Paetzold,
Xiaobin Hu,
Benedikt Wiestler,
Lucas Zimmer,
Tamaz Amiranashvili,
Chinmay Prabhakar,
Christoph Berger,
Jonas Weidner,
Michelle Alonso-Basant,
Arif Rashid,
Ujjwal Baid,
Wesam Adel,
Deniz Ali,
Bhakti Baheti,
Yingbin Bai,
Ishaan Bhatt,
Sabri Can Cetindag
, et al. (55 additional authors not shown)
Abstract:
Uncertainty in medical image segmentation tasks, especially inter-rater variability, arising from differences in interpretations and annotations by various experts, presents a significant challenge in achieving consistent and reliable image segmentation. This variability not only reflects the inherent complexity and subjective nature of medical image interpretation but also directly impacts the de…
▽ More
Uncertainty in medical image segmentation tasks, especially inter-rater variability, arising from differences in interpretations and annotations by various experts, presents a significant challenge in achieving consistent and reliable image segmentation. This variability not only reflects the inherent complexity and subjective nature of medical image interpretation but also directly impacts the development and evaluation of automated segmentation algorithms. Accurately modeling and quantifying this variability is essential for enhancing the robustness and clinical applicability of these algorithms. We report the set-up and summarize the benchmark results of the Quantification of Uncertainties in Biomedical Image Quantification Challenge (QUBIQ), which was organized in conjunction with International Conferences on Medical Image Computing and Computer-Assisted Intervention (MICCAI) 2020 and 2021. The challenge focuses on the uncertainty quantification of medical image segmentation which considers the omnipresence of inter-rater variability in imaging datasets. The large collection of images with multi-rater annotations features various modalities such as MRI and CT; various organs such as the brain, prostate, kidney, and pancreas; and different image dimensions 2D-vs-3D. A total of 24 teams submitted different solutions to the problem, combining various baseline models, Bayesian neural networks, and ensemble model techniques. The obtained results indicate the importance of the ensemble models, as well as the need for further research to develop efficient 3D methods for uncertainty quantification methods in 3D segmentation tasks.
△ Less
Submitted 24 June, 2024; v1 submitted 19 March, 2024;
originally announced May 2024.
-
SQUWA: Signal Quality Aware DNN Architecture for Enhanced Accuracy in Atrial Fibrillation Detection from Noisy PPG Signals
Authors:
Runze Yan,
Cheng Ding,
Ran Xiao,
Aleksandr Fedorov,
Randall J Lee,
Fadi Nahab,
Xiao Hu
Abstract:
Atrial fibrillation (AF), a common cardiac arrhythmia, significantly increases the risk of stroke, heart disease, and mortality. Photoplethysmography (PPG) offers a promising solution for continuous AF monitoring, due to its cost efficiency and integration into wearable devices. Nonetheless, PPG signals are susceptible to corruption from motion artifacts and other factors often encountered in ambu…
▽ More
Atrial fibrillation (AF), a common cardiac arrhythmia, significantly increases the risk of stroke, heart disease, and mortality. Photoplethysmography (PPG) offers a promising solution for continuous AF monitoring, due to its cost efficiency and integration into wearable devices. Nonetheless, PPG signals are susceptible to corruption from motion artifacts and other factors often encountered in ambulatory settings. Conventional approaches typically discard corrupted segments or attempt to reconstruct original signals, allowing for the use of standard machine learning techniques. However, this reduces dataset size and introduces biases, compromising prediction accuracy and the effectiveness of continuous monitoring. We propose a novel deep learning model, Signal Quality Weighted Fusion of Attentional Convolution and Recurrent Neural Network (SQUWA), designed to learn how to retain accurate predictions from partially corrupted PPG. Specifically, SQUWA innovatively integrates an attention mechanism that directly considers signal quality during the learning process, dynamically adjusting the weights of time series segments based on their quality. This approach enhances the influence of higher-quality segments while reducing that of lower-quality ones, effectively utilizing partially corrupted segments. This approach represents a departure from the conventional methods that exclude such segments, enabling the utilization of a broader range of data, which has great implications for less disruption when monitoring of AF risks and more accurate estimation of AF burdens. Our extensive experiments show that SQUWA outperform existing PPG-based models, achieving the highest AUCPR of 0.89 with label noise mitigation. This also exceeds the 0.86 AUCPR of models trained with using both electrocardiogram (ECG) and PPG data.
△ Less
Submitted 14 April, 2024;
originally announced April 2024.
-
Multi-view X-ray Image Synthesis with Multiple Domain Disentanglement from CT Scans
Authors:
Lixing Tan,
Shuang Song,
Kangneng Zhou,
Chengbo Duan,
Lanying Wang,
Huayang Ren,
Linlin Liu,
Wei Zhang,
Ruoxiu Xiao
Abstract:
X-ray images play a vital role in the intraoperative processes due to their high resolution and fast imaging speed and greatly promote the subsequent segmentation, registration and reconstruction. However, over-dosed X-rays superimpose potential risks to human health to some extent. Data-driven algorithms from volume scans to X-ray images are restricted by the scarcity of paired X-ray and volume d…
▽ More
X-ray images play a vital role in the intraoperative processes due to their high resolution and fast imaging speed and greatly promote the subsequent segmentation, registration and reconstruction. However, over-dosed X-rays superimpose potential risks to human health to some extent. Data-driven algorithms from volume scans to X-ray images are restricted by the scarcity of paired X-ray and volume data. Existing methods are mainly realized by modelling the whole X-ray imaging procedure. In this study, we propose a learning-based approach termed CT2X-GAN to synthesize the X-ray images in an end-to-end manner using the content and style disentanglement from three different image domains. Our method decouples the anatomical structure information from CT scans and style information from unpaired real X-ray images/ digital reconstructed radiography (DRR) images via a series of decoupling encoders. Additionally, we introduce a novel consistency regularization term to improve the stylistic resemblance between synthesized X-ray images and real X-ray images. Meanwhile, we also impose a supervised process by computing the similarity of computed real DRR and synthesized DRR images. We further develop a pose attention module to fully strengthen the comprehensive information in the decoupled content code from CT scans, facilitating high-quality multi-view image synthesis in the lower 2D space. Extensive experiments were conducted on the publicly available CTSpine1K dataset and achieved 97.8350, 0.0842 and 3.0938 in terms of FID, KID and defined user-scored X-ray similarity, respectively. In comparison with 3D-aware methods ($π$-GAN, EG3D), CT2X-GAN is superior in improving the synthesis quality and realistic to the real X-ray images.
△ Less
Submitted 30 July, 2024; v1 submitted 18 April, 2024;
originally announced April 2024.
-
Reconsideration on evaluation of machine learning models in continuous monitoring using wearables
Authors:
Cheng Ding,
Zhicheng Guo,
Cynthia Rudin,
Ran Xiao,
Fadi B Nahab,
Xiao Hu
Abstract:
This paper explores the challenges in evaluating machine learning (ML) models for continuous health monitoring using wearable devices beyond conventional metrics. We state the complexities posed by real-world variability, disease dynamics, user-specific characteristics, and the prevalence of false notifications, necessitating novel evaluation strategies. Drawing insights from large-scale heart stu…
▽ More
This paper explores the challenges in evaluating machine learning (ML) models for continuous health monitoring using wearable devices beyond conventional metrics. We state the complexities posed by real-world variability, disease dynamics, user-specific characteristics, and the prevalence of false notifications, necessitating novel evaluation strategies. Drawing insights from large-scale heart studies, the paper offers a comprehensive guideline for robust ML model evaluation on continuous health monitoring.
△ Less
Submitted 4 December, 2023;
originally announced December 2023.
-
Audio Prompt Tuning for Universal Sound Separation
Authors:
Yuzhuo Liu,
Xubo Liu,
Yan Zhao,
Yuanyuan Wang,
Rui Xia,
Pingchuan Tain,
Yuxuan Wang
Abstract:
Universal sound separation (USS) is a task to separate arbitrary sounds from an audio mixture. Existing USS systems are capable of separating arbitrary sources, given a few examples of the target sources as queries. However, separating arbitrary sounds with a single system is challenging, and the robustness is not always guaranteed. In this work, we propose audio prompt tuning (APT), a simple yet…
▽ More
Universal sound separation (USS) is a task to separate arbitrary sounds from an audio mixture. Existing USS systems are capable of separating arbitrary sources, given a few examples of the target sources as queries. However, separating arbitrary sounds with a single system is challenging, and the robustness is not always guaranteed. In this work, we propose audio prompt tuning (APT), a simple yet effective approach to enhance existing USS systems. Specifically, APT improves the separation performance of specific sources through training a small number of prompt parameters with limited audio samples, while maintaining the generalization of the USS model by keeping its parameters frozen. We evaluate the proposed method on MUSDB18 and ESC-50 datasets. Compared with the baseline model, APT can improve the signal-to-distortion ratio performance by 0.67 dB and 2.06 dB using the full training set of two datasets. Moreover, APT with only 5 audio samples even outperforms the baseline systems utilizing full training data on the ESC-50 dataset, indicating the great potential of few-shot APT.
△ Less
Submitted 30 November, 2023;
originally announced November 2023.
-
Photoplethysmography based atrial fibrillation detection: an updated review from July 2019
Authors:
Cheng Ding,
Ran Xiao,
Weijia Wang,
Elizabeth Holdsworth,
Xiao Hu
Abstract:
Atrial fibrillation (AF) is a prevalent cardiac arrhythmia associated with significant health ramifications, including an elevated susceptibility to ischemic stroke, heart disease, and heightened mortality. Photoplethysmography (PPG) has emerged as a promising technology for continuous AF monitoring for its cost-effectiveness and widespread integration into wearable devices. Our team previously co…
▽ More
Atrial fibrillation (AF) is a prevalent cardiac arrhythmia associated with significant health ramifications, including an elevated susceptibility to ischemic stroke, heart disease, and heightened mortality. Photoplethysmography (PPG) has emerged as a promising technology for continuous AF monitoring for its cost-effectiveness and widespread integration into wearable devices. Our team previously conducted an exhaustive review on PPG-based AF detection before June 2019. However, since then, more advanced technologies have emerged in this field. This paper offers a comprehensive review of the latest advancements in PPG-based AF detection, utilizing digital health and artificial intelligence (AI) solutions, within the timeframe spanning from July 2019 to December 2022. Through extensive exploration of scientific databases, we have identified 59 pertinent studies. Our comprehensive review encompasses an in-depth assessment of the statistical methodologies, traditional machine learning techniques, and deep learning approaches employed in these studies. In addition, we address the challenges encountered in the domain of PPG-based AF detection. Furthermore, we maintain a dedicated website to curate the latest research in this area, with regular updates on a regular basis.
△ Less
Submitted 21 October, 2023;
originally announced October 2023.
-
GAEI-UNet: Global Attention and Elastic Interaction U-Net for Vessel Image Segmentation
Authors:
Ruiqiang Xiao,
Zhuoyue Wan
Abstract:
Vessel image segmentation plays a pivotal role in medical diagnostics, aiding in the early detection and treatment of vascular diseases. While segmentation based on deep learning has shown promising results, effectively segmenting small structures and maintaining connectivity between them remains challenging. To address these limitations, we propose GAEI-UNet, a novel model that combines global at…
▽ More
Vessel image segmentation plays a pivotal role in medical diagnostics, aiding in the early detection and treatment of vascular diseases. While segmentation based on deep learning has shown promising results, effectively segmenting small structures and maintaining connectivity between them remains challenging. To address these limitations, we propose GAEI-UNet, a novel model that combines global attention and elastic interaction-based techniques. GAEI-UNet leverages global spatial and channel context information to enhance high-level semantic understanding within the U-Net architecture, enabling precise segmentation of small vessels. Additionally, we adopt an elastic interaction-based loss function to improve connectivity among these fine structures. By capturing the forces generated by misalignment between target and predicted shapes, our model effectively learns to preserve the correct topology of vessel networks. Evaluation on retinal vessel dataset -- DRIVE demonstrates the superior performance of GAEI-UNet in terms of SE and connectivity of small structures, without significantly increasing computational complexity. This research aims to advance the field of vessel image segmentation, providing more accurate and reliable diagnostic tools for the medical community. The implementation code is available on Code.
△ Less
Submitted 22 August, 2023; v1 submitted 16 August, 2023;
originally announced August 2023.
-
Separate Anything You Describe
Authors:
Xubo Liu,
Qiuqiang Kong,
Yan Zhao,
Haohe Liu,
Yi Yuan,
Yuzhuo Liu,
Rui Xia,
Yuxuan Wang,
Mark D. Plumbley,
Wenwu Wang
Abstract:
Language-queried audio source separation (LASS) is a new paradigm for computational auditory scene analysis (CASA). LASS aims to separate a target sound from an audio mixture given a natural language query, which provides a natural and scalable interface for digital audio applications. Recent works on LASS, despite attaining promising separation performance on specific sources (e.g., musical instr…
▽ More
Language-queried audio source separation (LASS) is a new paradigm for computational auditory scene analysis (CASA). LASS aims to separate a target sound from an audio mixture given a natural language query, which provides a natural and scalable interface for digital audio applications. Recent works on LASS, despite attaining promising separation performance on specific sources (e.g., musical instruments, limited classes of audio events), are unable to separate audio concepts in the open domain. In this work, we introduce AudioSep, a foundation model for open-domain audio source separation with natural language queries. We train AudioSep on large-scale multimodal datasets and extensively evaluate its capabilities on numerous tasks including audio event separation, musical instrument separation, and speech enhancement. AudioSep demonstrates strong separation performance and impressive zero-shot generalization ability using audio captions or text labels as queries, substantially outperforming previous audio-queried and language-queried sound separation models. For reproducibility of this work, we will release the source code, evaluation benchmark and pre-trained model at: https://github.com/Audio-AGI/AudioSep.
△ Less
Submitted 1 December, 2024; v1 submitted 9 August, 2023;
originally announced August 2023.
-
Efficient Neural Music Generation
Authors:
Max W. Y. Lam,
Qiao Tian,
Tang Li,
Zongyu Yin,
Siyuan Feng,
Ming Tu,
Yuliang Ji,
Rui Xia,
Mingbo Ma,
Xuchen Song,
Jitong Chen,
Yuping Wang,
Yuxuan Wang
Abstract:
Recent progress in music generation has been remarkably advanced by the state-of-the-art MusicLM, which comprises a hierarchy of three LMs, respectively, for semantic, coarse acoustic, and fine acoustic modelings. Yet, sampling with the MusicLM requires processing through these LMs one by one to obtain the fine-grained acoustic tokens, making it computationally expensive and prohibitive for a real…
▽ More
Recent progress in music generation has been remarkably advanced by the state-of-the-art MusicLM, which comprises a hierarchy of three LMs, respectively, for semantic, coarse acoustic, and fine acoustic modelings. Yet, sampling with the MusicLM requires processing through these LMs one by one to obtain the fine-grained acoustic tokens, making it computationally expensive and prohibitive for a real-time generation. Efficient music generation with a quality on par with MusicLM remains a significant challenge. In this paper, we present MeLoDy (M for music; L for LM; D for diffusion), an LM-guided diffusion model that generates music audios of state-of-the-art quality meanwhile reducing 95.7% or 99.6% forward passes in MusicLM, respectively, for sampling 10s or 30s music. MeLoDy inherits the highest-level LM from MusicLM for semantic modeling, and applies a novel dual-path diffusion (DPD) model and an audio VAE-GAN to efficiently decode the conditioning semantic tokens into waveform. DPD is proposed to simultaneously model the coarse and fine acoustics by incorporating the semantic information into segments of latents effectively via cross-attention at each denoising step. Our experimental results suggest the superiority of MeLoDy, not only in its practical advantages on sampling speed and infinitely continuable generation, but also in its state-of-the-art musicality, audio quality, and text correlation.
Our samples are available at https://Efficient-MeLoDy.github.io/.
△ Less
Submitted 25 May, 2023;
originally announced May 2023.
-
Language-universal phonetic encoder for low-resource speech recognition
Authors:
Siyuan Feng,
Ming Tu,
Rui Xia,
Chuanzeng Huang,
Yuxuan Wang
Abstract:
Multilingual training is effective in improving low-resource ASR, which may partially be explained by phonetic representation sharing between languages. In end-to-end (E2E) ASR systems, graphemes are often used as basic modeling units, however graphemes may not be ideal for multilingual phonetic sharing. In this paper, we leverage International Phonetic Alphabet (IPA) based language-universal phon…
▽ More
Multilingual training is effective in improving low-resource ASR, which may partially be explained by phonetic representation sharing between languages. In end-to-end (E2E) ASR systems, graphemes are often used as basic modeling units, however graphemes may not be ideal for multilingual phonetic sharing. In this paper, we leverage International Phonetic Alphabet (IPA) based language-universal phonetic model to improve low-resource ASR performances, for the first time within the attention encoder-decoder architecture. We propose an adaptation method on the phonetic IPA model to further improve the proposed approach on extreme low-resource languages. Experiments carried out on the open-source MLS corpus and our internal databases show our approach outperforms baseline monolingual models and most state-of-the-art works. Our main approach and adaptation are effective on extremely low-resource languages, even within domain- and language-mismatched scenarios.
△ Less
Submitted 19 May, 2023;
originally announced May 2023.
-
Language-Universal Phonetic Representation in Multilingual Speech Pretraining for Low-Resource Speech Recognition
Authors:
Siyuan Feng,
Ming Tu,
Rui Xia,
Chuanzeng Huang,
Yuxuan Wang
Abstract:
We improve low-resource ASR by integrating the ideas of multilingual training and self-supervised learning. Concretely, we leverage an International Phonetic Alphabet (IPA) multilingual model to create frame-level pseudo labels for unlabeled speech, and use these pseudo labels to guide hidden-unit BERT (HuBERT) based speech pretraining in a phonetically-informed manner. The experiments on the Mult…
▽ More
We improve low-resource ASR by integrating the ideas of multilingual training and self-supervised learning. Concretely, we leverage an International Phonetic Alphabet (IPA) multilingual model to create frame-level pseudo labels for unlabeled speech, and use these pseudo labels to guide hidden-unit BERT (HuBERT) based speech pretraining in a phonetically-informed manner. The experiments on the Multilingual Speech (MLS) Corpus show that the proposed approach consistently outperforms the standard HuBERT on all the target languages. Moreover, on 3 of the 4 languages, comparing to the standard HuBERT, the approach performs better, meanwhile is able to save supervised training data by 1.5k hours (75%) at most. Our approach outperforms most of the state of the arts, with much less pretraining data in terms of hours and language diversity. Compared to XLSR-53 and a retraining based multilingual method, our approach performs better with full and limited finetuning data scenarios.
△ Less
Submitted 19 May, 2023;
originally announced May 2023.
-
Memory Augmented Lookup Dictionary based Language Modeling for Automatic Speech Recognition
Authors:
Yukun Feng,
Ming Tu,
Rui Xia,
Chuanzeng Huang,
Yuxuan Wang
Abstract:
Recent studies have shown that using an external Language Model (LM) benefits the end-to-end Automatic Speech Recognition (ASR). However, predicting tokens that appear less frequently in the training set is still quite challenging. The long-tail prediction problems have been widely studied in many applications, but only been addressed by a few studies for ASR and LMs. In this paper, we propose a n…
▽ More
Recent studies have shown that using an external Language Model (LM) benefits the end-to-end Automatic Speech Recognition (ASR). However, predicting tokens that appear less frequently in the training set is still quite challenging. The long-tail prediction problems have been widely studied in many applications, but only been addressed by a few studies for ASR and LMs. In this paper, we propose a new memory augmented lookup dictionary based Transformer architecture for LM. The newly introduced lookup dictionary incorporates rich contextual information in training set, which is vital to correctly predict long-tail tokens. With intensive experiments on Chinese and English data sets, our proposed method is proved to outperform the baseline Transformer LM by a great margin on both word/character error rate and tail tokens error rate. This is achieved without impact on the decoding efficiency. Overall, we demonstrate the effectiveness of our proposed method in boosting the ASR decoding performance, especially for long-tail tokens.
△ Less
Submitted 30 December, 2022;
originally announced January 2023.
-
Encoding feature supervised UNet++: Redesigning Supervision for liver and tumor segmentation
Authors:
Jiahao Cui,
Ruoxin Xiao,
Shiyuan Fang,
Minnan Pei,
Yixuan Yu
Abstract:
Liver tumor segmentation in CT images is a critical step in the diagnosis, surgical planning and postoperative evaluation of liver disease. An automatic liver and tumor segmentation method can greatly relieve physicians of the heavy workload of examining CT images and better improve the accuracy of diagnosis. In the last few decades, many modifications based on U-Net model have been proposed in th…
▽ More
Liver tumor segmentation in CT images is a critical step in the diagnosis, surgical planning and postoperative evaluation of liver disease. An automatic liver and tumor segmentation method can greatly relieve physicians of the heavy workload of examining CT images and better improve the accuracy of diagnosis. In the last few decades, many modifications based on U-Net model have been proposed in the literature. However, there are relatively few improvements for the advanced UNet++ model. In our paper, we propose an encoding feature supervised UNet++(ES-UNet++) and apply it to the liver and tumor segmentation. ES-UNet++ consists of an encoding UNet++ and a segmentation UNet++. The well-trained encoding UNet++ can extract the encoding features of label map which are used to additionally supervise the segmentation UNet++. By adding supervision to the each encoder of segmentation UNet++, U-Nets of different depths that constitute UNet++ outperform the original version by average 5.7% in dice score and the overall dice score is thus improved by 2.1%. ES-UNet++ is evaluated with dataset LiTS, achieving 95.6% for liver segmentation and 67.4% for tumor segmentation in dice score. In this paper, we also concluded some valuable properties of ES-UNet++ by conducting comparative anaylsis between ES-UNet++ and UNet++:(1) encoding feature supervision can accelerate the convergence of the model.(2) encoding feature supervision enhances the effect of model pruning by achieving huge speedup while providing pruned models with fairly good performance.
△ Less
Submitted 15 November, 2022;
originally announced November 2022.
-
Learning From Alarms: A Robust Learning Approach for Accurate Photoplethysmography-Based Atrial Fibrillation Detection using Eight Million Samples Labeled with Imprecise Arrhythmia Alarms
Authors:
Cheng Ding,
Zhicheng Guo,
Cynthia Rudin,
Ran Xiao,
Amit Shah,
Duc H. Do,
Randall J Lee,
Gari Clifford,
Fadi B Nahab,
Xiao Hu
Abstract:
Atrial fibrillation (AF) is a common cardiac arrhythmia with serious health consequences if not detected and treated early. Detecting AF using wearable devices with photoplethysmography (PPG) sensors and deep neural networks has demonstrated some success using proprietary algorithms in commercial solutions. However, further advancement of this paradigm of continuous AF detection in ambulatory sett…
▽ More
Atrial fibrillation (AF) is a common cardiac arrhythmia with serious health consequences if not detected and treated early. Detecting AF using wearable devices with photoplethysmography (PPG) sensors and deep neural networks has demonstrated some success using proprietary algorithms in commercial solutions. However, further advancement of this paradigm of continuous AF detection in ambulatory settings, towards a population-wide screening use case, still faces several challenges, one of which is the lack of large-scale labeled training data. To address this challenge, in this study, we propose to leverage AF alarms from bedside patient monitors to label concurrent PPG signals, resulting in the largest PPG-AF dataset so far (8.5M 30-second records from 24100 patients) and demonstrating a practical approach to build large labeled PPG datasets. Furthermore, we recognize that the AF labels thus obtained contain errors because of false AF alarms generated from imperfect built-in algorithms from bedside monitors. Dealing with label noise with unknown distribution characteristics in this case requires advanced algorithms. We, therefore, introduce and open source a novel loss design, the cluster membership consistency (CMC) loss, to mitigate label errors. By comparing CMC with state-of-the-art methods selected from a noisy label competition, we demonstrate its superiority in multiple aspects including handling label noise in PPG data, resilience to poor-quality signals, and computational efficiency.
△ Less
Submitted 12 November, 2023; v1 submitted 7 November, 2022;
originally announced November 2022.
-
Degradation-invariant Enhancement of Fundus Images via Pyramid Constraint Network
Authors:
Haofeng Liu,
Heng Li,
Huazhu Fu,
Ruoxiu Xiao,
Yunshu Gao,
Yan Hu,
Jiang Liu
Abstract:
As an economical and efficient fundus imaging modality, retinal fundus images have been widely adopted in clinical fundus examination. Unfortunately, fundus images often suffer from quality degradation caused by imaging interferences, leading to misdiagnosis. Despite impressive enhancement performances that state-of-the-art methods have achieved, challenges remain in clinical scenarios. For boosti…
▽ More
As an economical and efficient fundus imaging modality, retinal fundus images have been widely adopted in clinical fundus examination. Unfortunately, fundus images often suffer from quality degradation caused by imaging interferences, leading to misdiagnosis. Despite impressive enhancement performances that state-of-the-art methods have achieved, challenges remain in clinical scenarios. For boosting the clinical deployment of fundus image enhancement, this paper proposes the pyramid constraint to develop a degradation-invariant enhancement network (PCE-Net), which mitigates the demand for clinical data and stably enhances unknown data. Firstly, high-quality images are randomly degraded to form sequences of low-quality ones sharing the same content (SeqLCs). Then individual low-quality images are decomposed to Laplacian pyramid features (LPF) as the multi-level input for the enhancement. Subsequently, a feature pyramid constraint (FPC) for the sequence is introduced to enforce the PCE-Net to learn a degradation-invariant model. Extensive experiments have been conducted under the evaluation metrics of enhancement and segmentation. The effectiveness of the PCE-Net was demonstrated in comparison with state-of-the-art methods and the ablation study. The source code of this study is publicly available at https://github.com/HeverLaw/PCENet-Image-Enhancement.
△ Less
Submitted 18 October, 2022;
originally announced October 2022.
-
Deepfake Detection System for the ADD Challenge Track 3.2 Based on Score Fusion
Authors:
Yuxiang Zhang,
Jingze Lu,
Xingming Wang,
Zhuo Li,
Runqiu Xiao,
Wenchao Wang,
Ming Li,
Pengyuan Zhang
Abstract:
This paper describes the deepfake audio detection system submitted to the Audio Deep Synthesis Detection (ADD) Challenge Track 3.2 and gives an analysis of score fusion. The proposed system is a score-level fusion of several light convolutional neural network (LCNN) based models. Various front-ends are used as input features, including low-frequency short-time Fourier transform and Constant Q tran…
▽ More
This paper describes the deepfake audio detection system submitted to the Audio Deep Synthesis Detection (ADD) Challenge Track 3.2 and gives an analysis of score fusion. The proposed system is a score-level fusion of several light convolutional neural network (LCNN) based models. Various front-ends are used as input features, including low-frequency short-time Fourier transform and Constant Q transform. Due to the complex noise and rich synthesis algorithms, it is difficult to obtain the desired performance using the training set directly. Online data augmentation methods effectively improve the robustness of fake audio detection systems. In particular, the reasons for the poor improvement of score fusion are explored through visualization of the score distributions and comparison with score distribution on another dataset. The overfitting of the model to the training set leads to extreme values of the scores and low correlation of the score distributions, which makes score fusion difficult. Fusion with partially fake audio detection system improves system performance further. The submission on track 3.2 obtained the weighted equal error rate (WEER) of 11.04\%, which is one of the best performing systems in the challenge.
△ Less
Submitted 13 October, 2022;
originally announced October 2022.
-
The HCCL System for the NIST SRE21
Authors:
Zhuo Li,
Runqiu Xiao,
Hangting Chen,
Zhenduo Zhao,
Zihan Zhang,
Wenchao Wang
Abstract:
This paper describes the systems developed by the HCCL team for the NIST 2021 speaker recognition evaluation (NIST SRE21).We first explore various state-of-the-art speaker embedding extractors combined with a novel circle loss to obtain discriminative deep speaker embeddings. Considering that cross-channel and cross-linguistic speaker recognition are the key challenges of SRE21, we introduce sever…
▽ More
This paper describes the systems developed by the HCCL team for the NIST 2021 speaker recognition evaluation (NIST SRE21).We first explore various state-of-the-art speaker embedding extractors combined with a novel circle loss to obtain discriminative deep speaker embeddings. Considering that cross-channel and cross-linguistic speaker recognition are the key challenges of SRE21, we introduce several techniques to reduce the cross-domain mismatch. Specifically, Codec and speech enhancement are directly applied to the raw speech to eliminate the codecs and the environment noise mismatch. We denote the methods that work directly on speech to eliminate the relatively explicit mismatches collectively as data adaptation methods. Experiments show that data adaption methods achieve 15\% improvements over our baseline. Furthermore, some popular back-ends domain adaptation algorithms are deployed on speaker embeddings to alleviate speaker performance degradation caused by the implicit mismatch. Score calibration is a major failure for us in SRE21. The reason is that score calibration with too many parameters easily lead to overfitting problems.
△ Less
Submitted 11 July, 2022;
originally announced July 2022.
-
Back-ends Selection for Deep Speaker Embeddings
Authors:
Zhuo Li,
Runqiu Xiao,
Zihan Zhang,
Zhenduo Zhao,
Wenchao Wang,
Pengyuan Zhang
Abstract:
Probabilistic Linear Discriminant Analysis (PLDA) was the dominant and necessary back-end for early speaker recognition approaches, like i-vector and x-vector. However, with the development of neural networks and margin-based loss functions, we can obtain deep speaker embeddings (DSEs), which have advantages of increased inter-class separation and smaller intra-class distances. In this case, PLDA…
▽ More
Probabilistic Linear Discriminant Analysis (PLDA) was the dominant and necessary back-end for early speaker recognition approaches, like i-vector and x-vector. However, with the development of neural networks and margin-based loss functions, we can obtain deep speaker embeddings (DSEs), which have advantages of increased inter-class separation and smaller intra-class distances. In this case, PLDA seems unnecessary or even counterproductive for the discriminative embeddings, and cosine similarity scoring (Cos) achieves better performance than PLDA in some situations. Motivated by this, in this paper, we systematically explore how to select back-ends (Cos or PLDA) for deep speaker embeddings to achieve better performance in different situations. By analyzing PLDA and the properties of DSEs extracted from models with different numbers of segment-level layers, we make the conjecture that Cos is better in same-domain situations and PLDA is better in cross-domain situations. We conduct experiments on VoxCeleb and NIST SRE datasets in four application situations, single-/multi-domain training and same-/cross-domain test, to validate our conjecture and briefly explain why back-ends adaption algorithms work.
△ Less
Submitted 24 April, 2022;
originally announced April 2022.
-
DGC-vector: A new speaker embedding for zero-shot voice conversion
Authors:
Ruitong Xiao,
Haitong Zhang,
Yue Lin
Abstract:
Recently, more and more zero-shot voice conversion algorithms have been proposed. As a fundamental part of zero-shot voice conversion, speaker embeddings are the key to improving the converted speech's speaker similarity. In this paper, we study the impact of speaker embeddings on zero-shot voice conversion performance. To better represent the characteristics of the target speaker and improve the…
▽ More
Recently, more and more zero-shot voice conversion algorithms have been proposed. As a fundamental part of zero-shot voice conversion, speaker embeddings are the key to improving the converted speech's speaker similarity. In this paper, we study the impact of speaker embeddings on zero-shot voice conversion performance. To better represent the characteristics of the target speaker and improve the speaker similarity in zero-shot voice conversion, we propose a novel speaker representation method in this paper. Our method combines the advantages of D-vector, global style token (GST) based speaker representation and auxiliary supervision. Objective and subjective evaluations show that the proposed method achieves a decent performance on zero-shot voice conversion and significantly improves speaker similarity over D-vector and GST-based speaker embedding.
△ Less
Submitted 17 March, 2022;
originally announced March 2022.
-
Cloning one's voice using very limited data in the wild
Authors:
Dongyang Dai,
Yuanzhe Chen,
Li Chen,
Ming Tu,
Lu Liu,
Rui Xia,
Qiao Tian,
Yuping Wang,
Yuxuan Wang
Abstract:
With the increasing popularity of speech synthesis products, the industry has put forward more requirements for personalized speech synthesis: (1) How to use low-resource, easily accessible data to clone a person's voice. (2) How to clone a person's voice while controlling the style and prosody. To solve the above two problems, we proposed the Hieratron model framework in which the prosody and tim…
▽ More
With the increasing popularity of speech synthesis products, the industry has put forward more requirements for personalized speech synthesis: (1) How to use low-resource, easily accessible data to clone a person's voice. (2) How to clone a person's voice while controlling the style and prosody. To solve the above two problems, we proposed the Hieratron model framework in which the prosody and timbre are modeled separately using two modules, therefore, the independent control of timbre and the other characteristics of audio can be achieved while generating speech. The practice shows that, for very limited target speaker data in the wild, Hieratron has obvious advantages over the traditional method, in addition to controlling the style and language of the generated speech, the mean opinion score on speech quality of the generated speech has also been improved by more than 0.2 points.
△ Less
Submitted 8 October, 2021; v1 submitted 7 October, 2021;
originally announced October 2021.
-
The ByteDance Speaker Diarization System for the VoxCeleb Speaker Recognition Challenge 2021
Authors:
Keke Wang,
Xudong Mao,
Hao Wu,
Chen Ding,
Chuxiang Shang,
Rui Xia,
Yuxuan Wang
Abstract:
This paper describes the ByteDance speaker diarization system for the fourth track of the VoxCeleb Speaker Recognition Challenge 2021 (VoxSRC-21). The VoxSRC-21 provides both the dev set and test set of VoxConverse for use in validation and a standalone test set for evaluation. We first collect the duration and signal-to-noise ratio (SNR) of all audio and find that the distribution of the VoxConve…
▽ More
This paper describes the ByteDance speaker diarization system for the fourth track of the VoxCeleb Speaker Recognition Challenge 2021 (VoxSRC-21). The VoxSRC-21 provides both the dev set and test set of VoxConverse for use in validation and a standalone test set for evaluation. We first collect the duration and signal-to-noise ratio (SNR) of all audio and find that the distribution of the VoxConverse's test set and the VoxSRC-21's test set is more closer. Our system consists of voice active detection (VAD), speaker embedding extraction, spectral clustering followed by a re-clustering step based on agglomerative hierarchical clustering (AHC) and overlapped speech detection and handling. Finally, we integrate systems with different time scales using DOVER-Lap. Our best system achieves 5.15\% of the diarization error rate (DER) on evaluation set, ranking the second at the diarization track of the challenge.
△ Less
Submitted 5 September, 2021;
originally announced September 2021.
-
Log-Spectral Matching GAN: PPG-based Atrial Fibrillation Detection can be Enhanced by GAN-based Data Augmentation with Integration of Spectral Loss
Authors:
Cheng Ding,
Ran Xiao,
Duc Do,
David Scott Lee,
Shadi Kalantarian,
Randall J Lee,
Xiao Hu
Abstract:
Photoplethysmography (PPG) is a ubiquitous physiological measurement that detects beat-to-beat pulsatile blood volume changes and hence has a potential for monitoring cardiovascular conditions, particularly in ambulatory settings. A PPG dataset that is created for a particular use case is often imbalanced, due to a low prevalence of the pathological condition it targets to predict and the paroxysm…
▽ More
Photoplethysmography (PPG) is a ubiquitous physiological measurement that detects beat-to-beat pulsatile blood volume changes and hence has a potential for monitoring cardiovascular conditions, particularly in ambulatory settings. A PPG dataset that is created for a particular use case is often imbalanced, due to a low prevalence of the pathological condition it targets to predict and the paroxysmal nature of the condition as well. To tackle this problem, we propose log-spectral matching GAN (LSM-GAN), a generative model that can be used as a data augmentation technique to alleviate the class imbalance in a PPG dataset to train a classifier. LSM-GAN utilizes a novel generator that generates a synthetic signal without a up-sampling process of input white noises, as well as adds the mismatch between real and synthetic signals in frequency domain to the conventional adversarial loss. In this study, experiments are designed focusing on examining how the influence of LSM-GAN as a data augmentation technique on one specific classification task - atrial fibrillation (AF) detection using PPG. We show that by taking spectral information into consideration, LSM-GAN as a data augmentation solution can generate more realistic PPG signals. The code of LSM-GAN is available at https://github.com/chengding0713/Log-Spectral-matching-GAN.
△ Less
Submitted 31 January, 2022; v1 submitted 11 August, 2021;
originally announced August 2021.
-
The HCCL Speaker Verification System for Far-Field Speaker Verification Challenge
Authors:
Zhuo Li,
Ce Fang,
Runqiu Xiao,
Zhigao Chen,
Wenchao Wang,
Yonghong Yan
Abstract:
This paper describes the systems submitted by team HCCL to the Far-Field Speaker Verification Challenge. Our previous work in the AIshell Speaker Verification Challenge 2019 shows that the powerful modeling abilities of Neural Network architectures can provide exceptional performance for this kind of task. Therefore, in this challenge, we focus on constructing deep Neural Network architectures bas…
▽ More
This paper describes the systems submitted by team HCCL to the Far-Field Speaker Verification Challenge. Our previous work in the AIshell Speaker Verification Challenge 2019 shows that the powerful modeling abilities of Neural Network architectures can provide exceptional performance for this kind of task. Therefore, in this challenge, we focus on constructing deep Neural Network architectures based on TDNN, Resnet and Res2net blocks. Most of the developed systems consist of Neural Network embeddings are applied with PLDA backend. Firstly, the speed perturbation method is applied to augment data and significant performance improvements are achieved. Then, we explore the use of AMsoftmax loss function and propose to join a CE-loss branch when we train model using AMsoftmax loss. In addition, the impact of score normalization on performance is also investigated. The final system, a fusion of four systems, achieves minDCF 0.5342, EER 5.05\% on task1 eval set, and achieves minDCF 0.5193, EER 5.47\% on task3 eval set.
△ Less
Submitted 2 July, 2021;
originally announced July 2021.
-
Adaptive Margin Circle Loss for Speaker Verification
Authors:
Runqiu Xiao
Abstract:
Deep-Neural-Network (DNN) based speaker verification sys-tems use the angular softmax loss with margin penalties toenhance the intra-class compactness of speaker embeddings,which achieved remarkable performance. In this paper, we pro-pose a novel angular loss function called adaptive margin cir-cle loss for speaker verification. The stage-based margin andchunk-based margin are applied to improve t…
▽ More
Deep-Neural-Network (DNN) based speaker verification sys-tems use the angular softmax loss with margin penalties toenhance the intra-class compactness of speaker embeddings,which achieved remarkable performance. In this paper, we pro-pose a novel angular loss function called adaptive margin cir-cle loss for speaker verification. The stage-based margin andchunk-based margin are applied to improve the angular discrim-ination of circle loss on the training set. The analysis on gradi-ents shows that, compared with the previous angular loss likeAdditive Margin Softmax(Am-Softmax), circle loss has flexi-ble optimization and definite convergence status. Experimentsare carried out on the Voxceleb and SITW. By applying adap-tive margin circle loss, our best system achieves 1.31%EER onVoxceleb1 and 2.13% on SITW core-core.
△ Less
Submitted 15 June, 2021;
originally announced June 2021.
-
Speech enhancement with weakly labelled data from AudioSet
Authors:
Qiuqiang Kong,
Haohe Liu,
Xingjian Du,
Li Chen,
Rui Xia,
Yuxuan Wang
Abstract:
Speech enhancement is a task to improve the intelligibility and perceptual quality of degraded speech signal. Recently, neural networks based methods have been applied to speech enhancement. However, many neural network based methods require noisy and clean speech pairs for training. We propose a speech enhancement framework that can be trained with large-scale weakly labelled AudioSet dataset. We…
▽ More
Speech enhancement is a task to improve the intelligibility and perceptual quality of degraded speech signal. Recently, neural networks based methods have been applied to speech enhancement. However, many neural network based methods require noisy and clean speech pairs for training. We propose a speech enhancement framework that can be trained with large-scale weakly labelled AudioSet dataset. Weakly labelled data only contain audio tags of audio clips, but not the onset or offset times of speech. We first apply pretrained audio neural networks (PANNs) to detect anchor segments that contain speech or sound events in audio clips. Then, we randomly mix two detected anchor segments containing speech and sound events as a mixture, and build a conditional source separation network using PANNs predictions as soft conditions for speech enhancement. In inference, we input a noisy speech signal with the one-hot encoding of "Speech" as a condition to the trained system to predict enhanced speech. Our system achieves a PESQ of 2.28 and an SSNR of 8.75 dB on the VoiceBank-DEMAND dataset, outperforming the previous SEGAN system of 2.16 and 7.73 dB respectively.
△ Less
Submitted 19 February, 2021;
originally announced February 2021.
-
Noise Robust TTS for Low Resource Speakers using Pre-trained Model and Speech Enhancement
Authors:
Dongyang Dai,
Li Chen,
Yuping Wang,
Mu Wang,
Rui Xia,
Xuchen Song,
Zhiyong Wu,
Yuxuan Wang
Abstract:
With the popularity of deep neural network, speech synthesis task has achieved significant improvements based on the end-to-end encoder-decoder framework in the recent days. More and more applications relying on speech synthesis technology have been widely used in our daily life. Robust speech synthesis model depends on high quality and customized data which needs lots of collecting efforts. It is…
▽ More
With the popularity of deep neural network, speech synthesis task has achieved significant improvements based on the end-to-end encoder-decoder framework in the recent days. More and more applications relying on speech synthesis technology have been widely used in our daily life. Robust speech synthesis model depends on high quality and customized data which needs lots of collecting efforts. It is worth investigating how to take advantage of low-quality and low resource voice data which can be easily obtained from the Internet for usage of synthesizing personalized voice. In this paper, the proposed end-to-end speech synthesis model uses both speaker embedding and noise representation as conditional inputs to model speaker and noise information respectively. Firstly, the speech synthesis model is pre-trained with both multi-speaker clean data and noisy augmented data; then the pre-trained model is adapted on noisy low-resource new speaker data; finally, by setting the clean speech condition, the model can synthesize the new speaker's clean voice. Experimental results show that the speech generated by the proposed approach has better subjective evaluation results than the method directly fine-tuning pre-trained multi-speaker speech synthesis model with denoised new speaker data.
△ Less
Submitted 22 October, 2020; v1 submitted 26 May, 2020;
originally announced May 2020.
-
A Generative Learning Approach for Spatio-temporal Modeling in Connected Vehicular Network
Authors:
Rong Xia,
Yong Xiao,
Yingyu Li,
Marwan Krunz,
Dusit Niyato
Abstract:
Spatio-temporal modeling of wireless access latency is of great importance for connected-vehicular systems. The quality of the molded results rely heavily on the number and quality of samples which can vary significantly due to the sensor deployment density as well as traffic volume and density. This paper proposes LaMI (Latency Model Inpainting), a novel framework to generate a comprehensive spat…
▽ More
Spatio-temporal modeling of wireless access latency is of great importance for connected-vehicular systems. The quality of the molded results rely heavily on the number and quality of samples which can vary significantly due to the sensor deployment density as well as traffic volume and density. This paper proposes LaMI (Latency Model Inpainting), a novel framework to generate a comprehensive spatio-temporal of wireless access latency of a connected vehicles across a wide geographical area. LaMI adopts the idea from image inpainting and synthesizing and can reconstruct the missing latency samples by a two-step procedure. In particular, it first discovers the spatial correlation between samples collected in various regions using a patching-based approach and then feeds the original and highly correlated samples into a Variational Autoencoder (VAE), a deep generative model, to create latency samples with similar probability distribution with the original samples. Finally, LaMI establishes the empirical PDF of latency performance and maps the PDFs into the confidence levels of different vehicular service requirements. Extensive performance evaluation has been conducted using the real traces collected in a commercial LTE network in a university campus. Simulation results show that our proposed model can significantly improve the accuracy of latency modeling especially compared to existing popular solutions such as interpolation and nearest neighbor-based methods.
△ Less
Submitted 15 March, 2020;
originally announced March 2020.
-
Machine Learning Techniques for Biomedical Image Segmentation: An Overview of Technical Aspects and Introduction to State-of-Art Applications
Authors:
Hyunseok Seo,
Masoud Badiei Khuzani,
Varun Vasudevan,
Charles Huang,
Hongyi Ren,
Ruoxiu Xiao,
Xiao Jia,
Lei Xing
Abstract:
In recent years, significant progress has been made in developing more accurate and efficient machine learning algorithms for segmentation of medical and natural images. In this review article, we highlight the imperative role of machine learning algorithms in enabling efficient and accurate segmentation in the field of medical imaging. We specifically focus on several key studies pertaining to th…
▽ More
In recent years, significant progress has been made in developing more accurate and efficient machine learning algorithms for segmentation of medical and natural images. In this review article, we highlight the imperative role of machine learning algorithms in enabling efficient and accurate segmentation in the field of medical imaging. We specifically focus on several key studies pertaining to the application of machine learning methods to biomedical image segmentation. We review classical machine learning algorithms such as Markov random fields, k-means clustering, random forest, etc. Although such classical learning models are often less accurate compared to the deep learning techniques, they are often more sample efficient and have a less complex structure. We also review different deep learning architectures, such as the artificial neural networks (ANNs), the convolutional neural networks (CNNs), and the recurrent neural networks (RNNs), and present the segmentation results attained by those learning models that were published in the past three years. We highlight the successes and limitations of each machine learning paradigm. In addition, we discuss several challenges related to the training of different machine learning models, and we present some heuristics to address those challenges.
△ Less
Submitted 6 November, 2019;
originally announced November 2019.
-
Modified U-Net (mU-Net) with Incorporation of Object-Dependent High Level Features for Improved Liver and Liver-Tumor Segmentation in CT Images
Authors:
Hyunseok Seo,
Charles Huang,
Maxime Bassenne,
Ruoxiu Xiao,
Lei Xing
Abstract:
Segmentation of livers and liver tumors is one of the most important steps in radiation therapy of hepatocellular carcinoma. The segmentation task is often done manually, making it tedious, labor intensive, and subject to intra-/inter- operator variations. While various algorithms for delineating organ-at-risks (OARs) and tumor targets have been proposed, automatic segmentation of livers and liver…
▽ More
Segmentation of livers and liver tumors is one of the most important steps in radiation therapy of hepatocellular carcinoma. The segmentation task is often done manually, making it tedious, labor intensive, and subject to intra-/inter- operator variations. While various algorithms for delineating organ-at-risks (OARs) and tumor targets have been proposed, automatic segmentation of livers and liver tumors remains intractable due to their low tissue contrast with respect to the surrounding organs and their deformable shape in CT images. The U-Net has gained increasing popularity recently for image analysis tasks and has shown promising results. Conventional U-Net architectures, however, suffer from three major drawbacks. To cope with these problems, we added a residual path with deconvolution and activation operations to the skip connection of the U-Net to avoid duplication of low resolution information of features. In the case of small object inputs, features in the skip connection are not incorporated with features in the residual path. Furthermore, the proposed architecture has additional convolution layers in the skip connection in order to extract high level global features of small object inputs as well as high level features of high resolution edge information of large object inputs. Efficacy of the modified U-Net (mU-Net) was demonstrated using the public dataset of Liver tumor segmentation (LiTS) challenge 2017. The proposed mU-Net outperformed existing state-of-art networks.
△ Less
Submitted 31 October, 2019;
originally announced November 2019.
-
A Ranking Model Motivated by Nonnegative Matrix Factorization with Applications to Tennis Tournaments
Authors:
Rui Xia,
Vincent Y. F. Tan,
Louis Filstroff,
Cédric Févotte
Abstract:
We propose a novel ranking model that combines the Bradley-Terry-Luce probability model with a nonnegative matrix factorization framework to model and uncover the presence of latent variables that influence the performance of top tennis players. We derive an efficient, provably convergent, and numerically stable majorization-minimization-based algorithm to maximize the likelihood of datasets under…
▽ More
We propose a novel ranking model that combines the Bradley-Terry-Luce probability model with a nonnegative matrix factorization framework to model and uncover the presence of latent variables that influence the performance of top tennis players. We derive an efficient, provably convergent, and numerically stable majorization-minimization-based algorithm to maximize the likelihood of datasets under the proposed statistical model. The model is tested on datasets involving the outcomes of matches between 20 top male and female tennis players over 14 major tournaments for men (including the Grand Slams and the ATP Masters 1000) and 16 major tournaments for women over the past 10 years. Our model automatically infers that the surface of the court (e.g., clay or hard court) is a key determinant of the performances of male players, but less so for females. Top players on various surfaces over this longitudinal period are also identified in an objective manner.
△ Less
Submitted 12 June, 2019; v1 submitted 15 March, 2019;
originally announced March 2019.