-
SHAP-AAD: DeepSHAP-Guided Channel Reduction for EEG Auditory Attention Detection
Authors:
Rayan Salmi,
Guorui Lu,
Qinyu Chen
Abstract:
Electroencephalography (EEG)-based auditory attention detection (AAD) offers a non-invasive way to enhance hearing aids, but conventional methods rely on too many electrodes, limiting wearability and comfort. This paper presents SHAP-AAD, a two-stage framework that combines DeepSHAP-based channel selection with a lightweight temporal convolutional network (TCN) for efficient AAD using fewer channels. DeepSHAP, an explainable AI technique, is applied to a Convolutional Neural Network (CNN) trained on topographic alpha-power maps to rank channel importance, and the top-k EEG channels are used to train a compact TCN. Experiments on the DTU dataset show that, on average, using 32 channels yields accuracy comparable to the full 64-channel setup (79.21% vs. 81.06%). In some cases, even 8 channels can deliver satisfactory accuracy. These results demonstrate the effectiveness of SHAP-AAD in reducing complexity while preserving high detection performance.
Submitted 4 July, 2025;
originally announced July 2025.
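The top-k channel selection step described in the SHAP-AAD abstract can be illustrated with a minimal sketch. It assumes a per-channel importance vector is already available (for example, mean absolute attribution per channel); the function names, the array layout, and the scoring are hypothetical and are not taken from the paper.

```python
import numpy as np

def select_top_k_channels(importance, k):
    """Return the indices of the k highest-scoring channels, best first.

    `importance` is a hypothetical 1-D vector of per-channel scores.
    """
    importance = np.asarray(importance, dtype=float)
    if not 1 <= k <= importance.size:
        raise ValueError("k must be between 1 and the number of channels")
    # argsort is ascending, so reverse it and keep the first k entries.
    order = np.argsort(importance, kind="stable")[::-1]
    return order[:k]

def reduce_channels(eeg, importance, k):
    """Keep only the top-k channels of an EEG array shaped (channels, samples).

    The kept indices are sorted so the original channel order is preserved.
    """
    keep = np.sort(select_top_k_channels(importance, k))
    return eeg[keep, :], keep
```

With 64 input channels and k = 32, `reduce_channels` would return a (32, samples) array that a downstream model could be trained on.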
-
PanTS: The Pancreatic Tumor Segmentation Dataset
Authors:
Wenxuan Li,
Xinze Zhou,
Qi Chen,
Tianyu Lin,
Pedro R. A. S. Bassi,
Szymon Plotka,
Jaroslaw B. Cwikla,
Xiaoxi Chen,
Chen Ye,
Zheren Zhu,
Kai Ding,
Heng Li,
Kang Wang,
Yang Yang,
Yucheng Tang,
Daguang Xu,
Alan L. Yuille,
Zongwei Zhou
Abstract:
PanTS is a large-scale, multi-institutional dataset curated to advance research in pancreatic CT analysis. It contains 36,390 CT scans from 145 medical centers, with expert-validated, voxel-wise annotations of over 993,000 anatomical structures, covering pancreatic tumors, pancreas head, body, and tail, and 24 surrounding anatomical structures such as vascular/skeletal structures and abdominal/thoracic organs. Each scan includes metadata such as patient age, sex, diagnosis, contrast phase, in-plane spacing, slice thickness, etc. AI models trained on PanTS achieve significantly better performance in pancreatic tumor detection, localization, and segmentation compared to those trained on existing public datasets. Our analysis indicates that these gains are directly attributable to the 16x larger-scale tumor annotations and indirectly supported by the 24 additional surrounding anatomical structures. As the largest and most comprehensive resource of its kind, PanTS offers a new benchmark for developing and evaluating AI models in pancreatic CT analysis.
Submitted 1 July, 2025;
originally announced July 2025.
-
ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing
Authors:
Huadai Liu,
Jialei Wang,
Kaicheng Luo,
Wen Wang,
Qian Chen,
Zhou Zhao,
Wei Xue
Abstract:
While end-to-end video-to-audio generation has greatly improved, producing high-fidelity audio that authentically captures the nuances of visual content remains challenging. As with the work of professionals in the creative industries, such generation requires sophisticated reasoning about aspects such as visual dynamics, acoustic environments, and temporal relationships. We present ThinkSound, a novel framework that leverages Chain-of-Thought (CoT) reasoning to enable stepwise, interactive audio generation and editing for videos. Our approach decomposes the process into three complementary stages: foundational foley generation that creates semantically coherent soundscapes, interactive object-centric refinement through precise user interactions, and targeted editing guided by natural language instructions. At each stage, a multimodal large language model generates contextually aligned CoT reasoning that guides a unified audio foundation model. Furthermore, we introduce AudioCoT, a comprehensive dataset with structured reasoning annotations that establishes connections between visual content, textual descriptions, and sound synthesis. Experiments demonstrate that ThinkSound achieves state-of-the-art performance in video-to-audio generation across both audio metrics and CoT metrics and excels on the out-of-distribution Movie Gen Audio benchmark. The demo page is available at https://ThinkSound-Project.github.io.
Submitted 28 June, 2025; v1 submitted 26 June, 2025;
originally announced June 2025.
-
MTSIC: Multi-stage Transformer-based GAN for Spectral Infrared Image Colorization
Authors:
Tingting Liu,
Yuan Liu,
Jinhui Tang,
Liyin Yuan,
Chengyu Liu,
Chunlai Li,
Xiubao Sui,
Qian Chen
Abstract:
Thermal infrared (TIR) images, acquired through thermal radiation imaging, are unaffected by variations in lighting conditions and atmospheric haze. However, TIR images inherently lack color and texture information, limiting downstream tasks and potentially causing visual fatigue. Existing colorization methods primarily rely on single-band images with limited spectral information and insufficient feature extraction capabilities, which often result in image distortion and semantic ambiguity. In contrast, multiband infrared imagery provides richer spectral data, facilitating the preservation of finer details and enhancing semantic accuracy. In this paper, we propose a generative adversarial network (GAN)-based framework designed to integrate spectral information to enhance the colorization of infrared images. The framework employs a multi-stage spectral self-attention Transformer network (MTSIC) as the generator. Each spectral feature is treated as a token for self-attention computation, and a multi-head self-attention mechanism forms a spatial-spectral attention residual block (SARB), achieving multi-band feature mapping and reducing semantic confusion. Multiple SARB units are integrated into a Transformer-based single-stage network (STformer), which uses a U-shaped architecture to extract contextual information, combined with multi-scale wavelet blocks (MSWB) to align semantic information in the spatial-frequency dual domain. Multiple STformer modules are cascaded to form MTSIC, progressively optimizing the reconstruction quality. Experimental results demonstrate that the proposed method significantly outperforms traditional techniques and effectively enhances the visual quality of infrared images.
Submitted 20 June, 2025;
originally announced June 2025.
-
TCN-DPD: Parameter-Efficient Temporal Convolutional Networks for Wideband Digital Predistortion
Authors:
Huanqiang Duan,
Manno Versluis,
Qinyu Chen,
Leo C. N. de Vreede,
Chang Gao
Abstract:
Digital predistortion (DPD) is essential for mitigating nonlinearity in RF power amplifiers, particularly for wideband applications. This paper presents TCN-DPD, a parameter-efficient architecture based on temporal convolutional networks, integrating noncausal dilated convolutions with optimized activation functions. Evaluated on the OpenDPD framework with the DPA_200MHz dataset, TCN-DPD achieves simulated ACPRs of -51.58/-49.26 dBc (L/R), EVM of -47.52 dB, and NMSE of -44.61 dB with 500 parameters and maintains better linearization than prior models down to 200 parameters, making it promising for efficient wideband PA linearization.
Submitted 13 June, 2025;
originally announced June 2025.
-
AS-ASR: A Lightweight Framework for Aphasia-Specific Automatic Speech Recognition
Authors:
Chen Bao,
Chuanbing Huo,
Qinyu Chen,
Chang Gao
Abstract:
This paper proposes AS-ASR, a lightweight aphasia-specific speech recognition framework based on Whisper-tiny, tailored for low-resource deployment on edge devices. Our approach introduces a hybrid training strategy that systematically combines standard and aphasic speech at varying ratios, enabling robust generalization, and a GPT-4-based reference enhancement method that refines noisy aphasic transcripts, improving supervision quality. We conduct extensive experiments across multiple data mixing configurations and evaluation settings. Results show that our fine-tuned model significantly outperforms the zero-shot baseline, reducing WER on aphasic speech by over 30% while preserving performance on standard speech. The proposed framework offers a scalable, efficient solution for real-world disordered speech recognition.
Submitted 6 June, 2025;
originally announced June 2025.
-
Are Pixel-Wise Metrics Reliable for Sparse-View Computed Tomography Reconstruction?
Authors:
Tianyu Lin,
Xinran Li,
Chuntung Zhuang,
Qi Chen,
Yuanhao Cai,
Kai Ding,
Alan L. Yuille,
Zongwei Zhou
Abstract:
Widely adopted evaluation metrics for sparse-view CT reconstruction--such as Structural Similarity Index Measure and Peak Signal-to-Noise Ratio--prioritize pixel-wise fidelity but often fail to capture the completeness of critical anatomical structures, particularly small or thin regions that are easily missed. To address this limitation, we propose a suite of novel anatomy-aware evaluation metrics designed to assess structural completeness across anatomical structures, including large organs, small organs, intestines, and vessels. Building on these metrics, we introduce CARE, a Completeness-Aware Reconstruction Enhancement framework that incorporates structural penalties during training to encourage anatomical preservation of significant structures. CARE is model-agnostic and can be seamlessly integrated into analytical, implicit, and generative methods. When applied to these methods, CARE substantially improves structural completeness in CT reconstructions, achieving up to +32% improvement for large organs, +22% for small organs, +40% for intestines, and +36% for vessels.
Submitted 2 June, 2025;
originally announced June 2025.
-
Speech Token Prediction via Compressed-to-fine Language Modeling for Speech Generation
Authors:
Wenrui Liu,
Qian Chen,
Wen Wang,
Yafeng Chen,
Jin Xu,
Zhifang Guo,
Guanrou Yang,
Weiqin Li,
Xiaoda Yang,
Tao Jin,
Minghui Fang,
Jialong Zuo,
Bai Jionghao,
Zemin Liu
Abstract:
Neural audio codecs, used as speech tokenizers, have demonstrated remarkable potential in the field of speech generation. However, to ensure high-fidelity audio reconstruction, neural audio codecs typically encode audio into long sequences of speech tokens, posing a significant challenge for downstream language models in long-context modeling. We observe that speech token sequences exhibit short-range dependency: due to the monotonic alignment between text and speech in text-to-speech (TTS) tasks, the prediction of the current token primarily relies on its local context, while long-range tokens contribute less to the current token prediction and often contain redundant information. Inspired by this observation, we propose a compressed-to-fine language modeling approach to address the challenge of long sequence speech tokens within neural codec language models: (1) Fine-grained Initial and Short-range Information: Our approach retains the prompt and local tokens during prediction to ensure text alignment and the integrity of paralinguistic information; (2) Compressed Long-range Context: Our approach compresses long-range token spans into compact representations to reduce redundant information while preserving essential semantics. Extensive experiments on various neural audio codecs and downstream language models validate the effectiveness and generalizability of the proposed approach, highlighting the importance of token compression in improving speech generation within neural codec language models. The demo of audio samples will be available at https://anonymous.4open.science/r/SpeechTokenPredictionViaCompressedToFinedLM.
Submitted 30 May, 2025;
originally announced May 2025.
-
CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training
Authors:
Zhihao Du,
Changfeng Gao,
Yuxuan Wang,
Fan Yu,
Tianyu Zhao,
Hao Wang,
Xiang Lv,
Hui Wang,
Chongjia Ni,
Xian Shi,
Keyu An,
Guanrou Yang,
Yabin Li,
Yanni Chen,
Zhifu Gao,
Qian Chen,
Yue Gu,
Mengzhe Chen,
Yafeng Chen,
Shiliang Zhang,
Wen Wang,
Jieping Ye
Abstract:
In our prior works, we introduced a scalable streaming speech synthesis model, CosyVoice 2, which integrates a large language model (LLM) and a chunk-aware flow matching (FM) model, and achieves low-latency bi-streaming speech synthesis and human-parity quality. Despite these advancements, CosyVoice 2 exhibits limitations in language coverage, domain diversity, data volume, text formats, and post-training techniques. In this paper, we present CosyVoice 3, an improved model designed for zero-shot multilingual speech synthesis in the wild, surpassing its predecessor in content consistency, speaker similarity, and prosody naturalness. Key features of CosyVoice 3 include: 1) A novel speech tokenizer to improve prosody naturalness, developed via supervised multi-task training, including automatic speech recognition, speech emotion recognition, language identification, audio event detection, and speaker analysis. 2) A new differentiable reward model for post-training applicable not only to CosyVoice 3 but also to other LLM-based speech synthesis models. 3) Dataset Size Scaling: Training data is expanded from ten thousand hours to one million hours, encompassing 9 languages and 18 Chinese dialects across various domains and text formats. 4) Model Size Scaling: Model parameters are increased from 0.5 billion to 1.5 billion, resulting in enhanced performance on our multilingual benchmark due to the larger model capacity. These advancements contribute significantly to the progress of speech synthesis in the wild. We encourage readers to listen to the demo at https://funaudiollm.github.io/cosyvoice3.
Submitted 27 May, 2025; v1 submitted 23 May, 2025;
originally announced May 2025.
-
Pushing the Frontiers of Self-Distillation Prototypes Network with Dimension Regularization and Score Normalization
Authors:
Yafeng Chen,
Chong Deng,
Hui Wang,
Yiheng Jiang,
Han Yin,
Qian Chen,
Wen Wang
Abstract:
Developing robust speaker verification (SV) systems without speaker labels has been a longstanding challenge. Earlier research has highlighted a considerable performance gap between self-supervised and fully supervised approaches. In this paper, we enhance the non-contrastive self-supervised framework, Self-Distillation Prototypes Network (SDPN), by introducing dimension regularization that explicitly addresses the collapse problem through the application of regularization terms to speaker embeddings. Moreover, we integrate score normalization techniques from fully supervised SV to further bridge the gap toward supervised verification performance. SDPN with dimension regularization and score normalization sets a new state-of-the-art on the VoxCeleb1 speaker verification evaluation benchmark, achieving Equal Error Rate 1.29%, 1.60%, and 2.80% for trial VoxCeleb1-{O,E,H} respectively. These results demonstrate relative improvements of 28.3%, 19.6%, and 22.6% over the current best self-supervised methods, thereby advancing the frontiers of SV technology.
Submitted 19 May, 2025;
originally announced May 2025.
-
DeltaDPD: Exploiting Dynamic Temporal Sparsity in Recurrent Neural Networks for Energy-Efficient Wideband Digital Predistortion
Authors:
Yizhuo Wu,
Yi Zhu,
Kun Qian,
Qinyu Chen,
Anding Zhu,
John Gajadharsing,
Leo C. N. de Vreede,
Chang Gao
Abstract:
Digital Predistortion (DPD) is a popular technique to enhance signal quality in wideband RF power amplifiers (PAs). With increasing bandwidth and data rates, DPD faces significant energy consumption challenges during deployment, contrasting with its efficiency goals. State-of-the-art DPD models rely on recurrent neural networks (RNN), whose computational complexity hinders system efficiency. This paper introduces DeltaDPD, exploring the dynamic temporal sparsity of input signals and neuronal hidden states in RNNs for energy-efficient DPD, reducing arithmetic operations and memory accesses while preserving satisfactory linearization performance. Applying a TM3.1a 200MHz-BW 256-QAM OFDM signal to a 3.5 GHz GaN Doherty RF PA, DeltaDPD achieves -50.03 dBc in Adjacent Channel Power Ratio (ACPR), -37.22 dB in Normalized Mean Square Error (NMSE) and -38.52 dBc in Error Vector Magnitude (EVM) with 52% temporal sparsity, leading to a 1.8X reduction in estimated inference power. The DeltaDPD code will be released after formal publication at https://www.opendpd.com.
Submitted 29 April, 2025;
originally announced May 2025.
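The idea of temporal sparsity mentioned in the DeltaDPD abstract can be illustrated with a generic delta-thresholding sketch: a value is propagated only when it has changed by at least a threshold since it was last propagated, so small changes become zeros that need no arithmetic. This is an assumption about the general technique, not the paper's implementation; `delta_encode` and its threshold parameter are hypothetical names.

```python
import numpy as np

def delta_encode(x, threshold):
    """Threshold the temporal changes of a sequence shaped (time, features).

    Returns the sparse delta sequence and the fraction of entries that are
    zero (the temporal sparsity). Hypothetical helper for illustration only.
    """
    x = np.asarray(x, dtype=float)
    ref = np.zeros(x.shape[1])      # last value that was actually propagated
    deltas = np.zeros_like(x)
    for t in range(x.shape[0]):
        change = x[t] - ref
        fire = np.abs(change) >= threshold
        deltas[t, fire] = change[fire]   # propagate only significant changes
        ref[fire] = x[t, fire]           # update the reference where we fired
    sparsity = 1.0 - np.count_nonzero(deltas) / deltas.size
    return deltas, sparsity
```

Summing the deltas over time reconstructs the input to within the threshold, which is why a larger threshold trades accuracy for fewer operations.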
-
Rapid diagnostics of reconfigurable intelligent surfaces using space-time-coding modulation
Authors:
Yi Ning Zheng,
Lei Zhang,
Xiao Qing Chen,
Marco Rossi,
Giuseppe Castaldi,
Shuo Liu,
Tie Jun Cui,
Vincenzo Galdi
Abstract:
Reconfigurable intelligent surfaces (RISs) have emerged as a key technology for shaping smart wireless environments in next-generation wireless communication systems. To support the large-scale deployment of RISs, a reliable and efficient diagnostic method is essential to ensure optimal performance. In this work, a robust and efficient approach for RIS diagnostics is proposed using a space-time coding strategy with orthogonal codes. The method encodes the reflected signals from individual RIS elements into distinct code channels, enabling the recovery of channel power at the receiving terminals for fault identification. Theoretical analysis shows that the normally functioning elements generate high power in their respective code channels, whereas the faulty elements exhibit significantly lower power. This distinction enables rapid and accurate diagnostics of elements' operational states through simple signal processing techniques. Simulation results validate the effectiveness of the proposed method, even under high fault ratios and varying reception angles. Proof-of-principle experiments on two RIS prototypes are conducted, implementing two coding strategies: direct and segmented. Experimental results in a realistic scenario confirm the reliability of the diagnostic method, demonstrating its potential for large-scale RIS deployment in future wireless communication systems and radar applications.
Submitted 6 May, 2025;
originally announced May 2025.
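The orthogonal-code principle in the RIS diagnostics abstract can be sketched under a strongly simplified, noise-free model: each element's reflection is multiplied by its own ±1 code, the receiver sees the sum, and correlating with each code recovers that element's power. The Walsh-Hadamard codes, the relative threshold, and all names below are assumptions for illustration, not the paper's signal model.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h

def diagnose(received, codes, rel_threshold=0.5):
    """Recover per-element code-channel power and flag low-power elements.

    received: (L,) complex samples over one code period.
    codes:    (N, L) matrix whose rows are mutually orthogonal +/-1 codes.
    rel_threshold is a hypothetical cut-off relative to the strongest channel.
    """
    length = codes.shape[1]
    amplitudes = codes @ received / length   # correlate with each code
    power = np.abs(amplitudes) ** 2
    faulty = power < rel_threshold * power.max()
    return power, faulty

# Toy usage: 8 elements, two of which reflect only weakly.
codes = hadamard(8)
gains = np.ones(8, dtype=complex)
gains[[2, 5]] = 0.1
received = gains @ codes                     # superposition of coded reflections
power, faulty = diagnose(received, codes)
```

In this toy run the healthy elements show unit power in their code channels and elements 2 and 5 are flagged, mirroring the high-power versus low-power distinction the abstract describes.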
-
OmniAudio: Generating Spatial Audio from 360-Degree Video
Authors:
Huadai Liu,
Tianyi Luo,
Kaicheng Luo,
Qikai Jiang,
Peiwen Sun,
Jialei Wang,
Rongjie Huang,
Qian Chen,
Wen Wang,
Xiangtai Li,
Shiliang Zhang,
Zhijie Yan,
Zhou Zhao,
Wei Xue
Abstract:
Traditional video-to-audio generation techniques primarily focus on perspective video and non-spatial audio, often missing the spatial cues necessary for accurately representing sound sources in 3D environments. To address this limitation, we introduce a novel task, 360V2SA, to generate spatial audio from 360-degree videos, specifically producing First-order Ambisonics (FOA) audio - a standard format for representing 3D spatial audio that captures sound directionality and enables realistic 3D audio reproduction. We first create Sphere360, a novel dataset tailored for this task that is curated from real-world data. We also design an efficient semi-automated pipeline for collecting and cleaning paired video-audio data. To generate spatial audio from 360-degree video, we propose a novel framework OmniAudio, which leverages self-supervised pre-training using both spatial audio data (in FOA format) and large-scale non-spatial data. Furthermore, OmniAudio features a dual-branch framework that utilizes both panoramic and perspective video inputs to capture comprehensive local and global information from 360-degree videos. Experimental results demonstrate that OmniAudio achieves state-of-the-art performance across both objective and subjective metrics on Sphere360. Code and datasets are available at https://github.com/liuhuadai/OmniAudio. The project website is available at https://OmniAudio-360V2SA.github.io.
Submitted 2 June, 2025; v1 submitted 21 April, 2025;
originally announced April 2025.
-
EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting
Authors:
Guanrou Yang,
Chen Yang,
Qian Chen,
Ziyang Ma,
Wenxi Chen,
Wen Wang,
Tianrui Wang,
Yifan Yang,
Zhikang Niu,
Wenrui Liu,
Fan Yu,
Zhihao Du,
Zhifu Gao,
ShiLiang Zhang,
Xie Chen
Abstract:
Human speech goes beyond the mere transfer of information; it is a profound exchange of emotions and a connection between individuals. While Text-to-Speech (TTS) models have made huge progress, they still face challenges in controlling the emotional expression in the generated speech. In this work, we propose EmoVoice, a novel emotion-controllable TTS model that exploits large language models (LLMs) to enable fine-grained freestyle natural language emotion control, and a phoneme boost variant design that makes the model output phoneme tokens and audio tokens in parallel to enhance content consistency, inspired by chain-of-thought (CoT) and chain-of-modality (CoM) techniques. Besides, we introduce EmoVoice-DB, a high-quality 40-hour English emotion dataset featuring expressive speech and fine-grained emotion labels with natural language descriptions. EmoVoice achieves state-of-the-art performance on the English EmoVoice-DB test set using only synthetic training data, and on the Chinese Secap test set using our in-house data. We further investigate the reliability of existing emotion evaluation metrics and their alignment with human perceptual preferences, and explore using SOTA multimodal LLMs GPT-4o-audio and Gemini to assess emotional speech. Demo samples are available at https://yanghaha0908.github.io/EmoVoice/. Dataset, code, and checkpoints will be released.
Submitted 21 April, 2025; v1 submitted 17 April, 2025;
originally announced April 2025.
-
Mixture-of-Shape-Experts (MoSE): End-to-End Shape Dictionary Framework to Prompt SAM for Generalizable Medical Segmentation
Authors:
Jia Wei,
Xiaoqi Zhao,
Jonghye Woo,
Jinsong Ouyang,
Georges El Fakhri,
Qingyu Chen,
Xiaofeng Liu
Abstract:
Single domain generalization (SDG) has recently attracted growing attention in medical image segmentation. One promising strategy for SDG is to leverage consistent semantic shape priors across different imaging protocols, scanner vendors, and clinical sites. However, existing dictionary learning methods that encode shape priors often suffer from limited representational power with a small set of offline computed shape elements, or overfitting when the dictionary size grows. Moreover, they are not readily compatible with large foundation models such as the Segment Anything Model (SAM). In this paper, we propose a novel Mixture-of-Shape-Experts (MoSE) framework that seamlessly integrates the idea of mixture-of-experts (MoE) training into dictionary learning to efficiently capture diverse and robust shape priors. Our method conceptualizes each dictionary atom as a shape expert, which specializes in encoding distinct semantic shape information. A gating network dynamically fuses these shape experts into a robust shape map, with sparse activation guided by SAM encoding to prevent overfitting. We further provide this shape map as a prompt to SAM, utilizing the powerful generalization capability of SAM through bidirectional integration. All modules, including the shape dictionary, are trained in an end-to-end manner. Extensive experiments on multiple public datasets demonstrate its effectiveness.
Submitted 13 April, 2025;
originally announced April 2025.
-
Deep Learning-Based Quantitative Assessment of Renal Chronicity Indices in Lupus Nephritis
Authors:
Tianqi Tu,
Hui Wang,
Jiangbo Pei,
Xiaojuan Yu,
Aidong Men,
Suxia Wang,
Qingchao Chen,
Ying Tan,
Feng Yu,
Minghui Zhao
Abstract:
Background: Renal chronicity indices (CI) have been identified as strong predictors of long-term outcomes in lupus nephritis (LN) patients. However, assessment by pathologists is hindered by challenges such as substantial time requirements, high interobserver variation, and susceptibility to fatigue. This study aims to develop an effective deep learning (DL) pipeline that automates the assessment of CI and provides valuable prognostic insights from a disease-specific perspective. Methods: We curated a dataset comprising 282 slides obtained from 141 patients across two independent cohorts with a complete 10-year follow-up. Our DL pipeline was developed on 60 slides (22,410 patch images) from 30 patients in the training cohort and evaluated on both an internal testing set (148 slides, 77,605 patch images) and an external testing set (74 slides, 27,522 patch images). Results: The study included two cohorts with slight demographic differences, particularly in age and hemoglobin levels. The DL pipeline showed high segmentation performance across tissue compartments and histopathologic lesions, outperforming state-of-the-art methods. The DL pipeline also demonstrated a strong correlation with pathologists in assessing CI, significantly improving interobserver agreement. Additionally, the DL pipeline enhanced prognostic accuracy, particularly in outcome prediction, when combined with clinical parameters and pathologist-assessed CIs. Conclusions: The DL pipeline demonstrated accuracy and efficiency in assessing CI in LN, showing promise in improving interobserver agreement among pathologists. It also exhibited significant value in prognostic analysis and enhancing outcome prediction in LN patients, offering a valuable tool for clinical decision-making.
Submitted 26 March, 2025;
originally announced March 2025.
-
A Language Vision Model Approach for Automated Tumor Contouring in Radiation Oncology
Authors:
Yi Luo,
Hamed Hooshangnejad,
Xue Feng,
Gaofeng Huang,
Xiaojian Chen,
Rui Zhang,
Quan Chen,
Wil Ngwa,
Kai Ding
Abstract:
Background: Lung cancer ranks as the leading cause of cancer-related mortality worldwide. The complexity of tumor delineation, crucial for radiation therapy, requires expertise often unavailable in resource-limited settings. Artificial Intelligence (AI), particularly with advancements in deep learning (DL) and natural language processing (NLP), offers potential solutions yet is challenged by high false positive rates. Purpose: The Oncology Contouring Copilot (OCC) system is developed to leverage oncologist expertise for precise tumor contouring using textual descriptions, aiming to increase the efficiency of oncological workflows by combining the strengths of AI with human oversight. Methods: Our OCC system initially identifies nodule candidates from CT scans. Employing Language Vision Models (LVMs) like GPT-4V, OCC then effectively reduces false positives with clinical descriptive texts, merging textual and visual data to automate tumor delineation, designed to elevate the quality of oncology care by incorporating knowledge from experienced domain experts. Results: Deployments of the OCC system resulted in a significant reduction in the false discovery rate by 35.0%, a 72.4% decrease in false positives per scan, and an F1-score of 0.652 across our dataset for unbiased evaluation. Conclusions: OCC represents a significant advance in oncology care, particularly through the use of the latest LVMs to improve contouring results by (1) streamlining oncology treatment workflows by optimizing tumor delineation and reducing manual processes; (2) offering a scalable and intuitive framework to reduce false positives in radiotherapy planning using LVMs; (3) introducing novel medical language vision prompt techniques to minimize LVM hallucinations, supported by an ablation study; and (4) conducting a comparative analysis of LVMs, highlighting their potential in addressing medical language vision challenges.
Submitted 19 March, 2025;
originally announced March 2025.
-
Adaptive Mixture of Low-Rank Experts for Robust Audio Spoofing Detection
Authors:
Qixian Chen,
Yuxiong Xu,
Sara Mandelli,
Sheng Li,
Bin Li
Abstract:
In audio spoofing detection, most studies rely on clean datasets, making models susceptible to real-world post-processing attacks, such as channel compression and noise. To overcome this challenge, we propose the Adaptive MixtUre Low-rank ExperTs (AMULET) framework, which enhances resilience by leveraging attack-specific knowledge and dynamically adapting to varied attack conditions. Specifically, AMULET employs Attack-Specific Experts (ASEs) fine-tuned with Low-Rank Adaptation (LoRA), allowing each expert to focus on distinct post-processing patterns using just 1.13% of the parameters required for full fine-tuning. Furthermore, we introduce Adaptive Expert Fusion (AEF), which adaptively selects and integrates expert knowledge to enhance the robustness of spoofing detection. Experimental results demonstrate that AMULET significantly enhances robustness by improving noise resilience and exhibiting greater adaptability to unseen post-processing methods compared to models trained with full fine-tuning. Additionally, our framework outperforms both single expert and other expert aggregation strategies under various mixed attacks, demonstrating its superior robustness and adaptability in managing complex real-world scenarios.
Submitted 10 May, 2025; v1 submitted 15 March, 2025;
originally announced March 2025.
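Note: the abstract above reports that each LoRA expert trains only a small fraction of the parameters needed for full fine-tuning. The sketch below shows where such a fraction comes from for a single weight matrix. The layer dimensions and rank are hypothetical values chosen for illustration, not settings from the paper, so the result will not match the reported 1.13%.

```python
def lora_param_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Trainable-parameter fraction for one weight matrix under LoRA.

    Full fine-tuning updates all d_out * d_in entries. LoRA instead trains
    two low-rank factors, B (d_out x rank) and A (rank x d_in), giving
    rank * (d_out + d_in) trainable values.
    """
    full_params = d_out * d_in
    lora_params = rank * (d_out + d_in)
    return lora_params / full_params


# Hypothetical layer: a 1024 x 1024 projection adapted with rank 8.
fraction = lora_param_fraction(d_out=1024, d_in=1024, rank=8)
print(f"LoRA trains {fraction:.2%} of this layer's parameters")
```

The fraction shrinks as the layer grows or the rank falls, which is why a set of attack-specific experts can stay small relative to the backbone.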
-
AudioX: Diffusion Transformer for Anything-to-Audio Generation
Authors:
Zeyue Tian,
Yizhu Jin,
Zhaoyang Liu,
Ruibin Yuan,
Xu Tan,
Qifeng Chen,
Wei Xue,
Yike Guo
Abstract:
Audio and music generation have emerged as crucial tasks in many applications, yet existing approaches face significant limitations: they operate in isolation without unified capabilities across modalities, suffer from scarce high-quality, multi-modal training data, and struggle to effectively integrate diverse inputs. In this work, we propose AudioX, a unified Diffusion Transformer model for Anything-to-Audio and Music Generation. Unlike previous domain-specific models, AudioX can generate both general audio and music with high quality, while offering flexible natural language control and seamless processing of various modalities including text, video, image, music, and audio. Its key innovation is a multi-modal masked training strategy that masks inputs across modalities and forces the model to learn from masked inputs, yielding robust and unified cross-modal representations. To address data scarcity, we curate two comprehensive datasets: vggsound-caps with 190K audio captions based on the VGGSound dataset, and V2M-caps with 6 million music captions derived from the V2M dataset. Extensive experiments demonstrate that AudioX not only matches or outperforms state-of-the-art specialized models, but also offers remarkable versatility in handling diverse input modalities and generation tasks within a unified architecture. The code and datasets will be available at https://zeyuet.github.io/AudioX/
Submitted 23 April, 2025; v1 submitted 13 March, 2025;
originally announced March 2025.
-
FlowDec: A flow-based full-band general audio codec with high perceptual quality
Authors:
Simon Welker,
Matthew Le,
Ricky T. Q. Chen,
Wei-Ning Hsu,
Timo Gerkmann,
Alexander Richard,
Yi-Chiao Wu
Abstract:
We propose FlowDec, a neural full-band audio codec for general audio sampled at 48 kHz that combines non-adversarial codec training with a stochastic postfilter based on a novel conditional flow matching method. Compared to the prior work ScoreDec which is based on score matching, we generalize from speech to general audio and move from 24 kbit/s to as low as 4 kbit/s, while improving output quality and reducing the required postfilter DNN evaluations from 60 to 6 without any fine-tuning or distillation techniques. We provide theoretical insights and geometric intuitions for our approach in comparison to ScoreDec as well as another recent work that uses flow matching, and conduct ablation studies on our proposed components. We show that FlowDec is a competitive alternative to the recent GAN-dominated stream of neural codecs, achieving FAD scores better than those of the established GAN-based codec DAC and listening test scores that are on par, and producing qualitatively more natural reconstructions for speech and harmonic structures in music.
Submitted 3 March, 2025;
originally announced March 2025.
-
InspireMusic: Integrating Super Resolution and Large Language Model for High-Fidelity Long-Form Music Generation
Authors:
Chong Zhang,
Yukun Ma,
Qian Chen,
Wen Wang,
Shengkui Zhao,
Zexu Pan,
Hao Wang,
Chongjia Ni,
Trung Hieu Nguyen,
Kun Zhou,
Yidi Jiang,
Chaohong Tan,
Zhifu Gao,
Zhihao Du,
Bin Ma
Abstract:
We introduce InspireMusic, a framework that integrates super resolution and a large language model for high-fidelity long-form music generation. A unified framework generates high-fidelity music, songs, and audio, which incorporates an autoregressive transformer with a super-resolution flow-matching model. This framework enables the controllable generation of high-fidelity long-form music at a higher sampling rate from both text and audio prompts. Our model differs from previous approaches, as we utilize an audio tokenizer with one codebook that contains richer semantic information, thereby reducing training costs and enhancing efficiency. This combination enables us to achieve high-quality audio generation with long-form coherence of up to 8 minutes. Then, an autoregressive transformer model based on Qwen 2.5 predicts audio tokens. Next, we employ a super-resolution flow-matching model to generate high-sampling-rate audio with fine-grained details learned from an acoustic codec model. Comprehensive experiments show that the InspireMusic-1.5B-Long model performs comparably to recent top-tier open-source systems, including MusicGen and Stable Audio 2.0, on subjective and objective evaluations. The code and pre-trained models are released at https://github.com/FunAudioLLM/InspireMusic.
Submitted 28 February, 2025;
originally announced March 2025.
-
UniCodec: Unified Audio Codec with Single Domain-Adaptive Codebook
Authors:
Yidi Jiang,
Qian Chen,
Shengpeng Ji,
Yu Xi,
Wen Wang,
Chong Zhang,
Xianghu Yue,
ShiLiang Zhang,
Haizhou Li
Abstract:
The emergence of audio language models is empowered by neural audio codecs, which establish critical mappings between continuous waveforms and discrete tokens compatible with language model paradigms. The evolutionary trends from multi-layer residual vector quantizer to single-layer quantizer are beneficial for language-autoregressive decoding. However, the capability to handle multi-domain audio signals through a single codebook remains constrained by inter-domain distribution discrepancies. In this work, we introduce UniCodec, a unified audio codec with a single codebook to support multi-domain audio data, including speech, music, and sound. To achieve this, we propose a partitioned domain-adaptive codebook method and domain Mixture-of-Experts strategy to capture the distinct characteristics of each audio domain. Furthermore, to enrich the semantic density of the codec without auxiliary modules, we propose a self-supervised mask prediction modeling approach. Comprehensive objective and subjective evaluations demonstrate that UniCodec achieves excellent audio reconstruction performance across the three audio domains, outperforming existing unified neural codecs with a single codebook, and even surpasses state-of-the-art domain-specific codecs on both acoustic and semantic representation capabilities.
Submitted 27 February, 2025;
originally announced February 2025.
-
CSSSTN: A Class-sensitive Subject-to-subject Semantic Style Transfer Network for EEG Classification in RSVP Tasks
Authors:
Ziyue Yang,
Chengrui Chen,
Yong Peng,
Qiong Chen,
Wanzeng Kong
Abstract:
The Rapid Serial Visual Presentation (RSVP) paradigm represents a promising application of electroencephalography (EEG) in Brain-Computer Interface (BCI) systems. However, cross-subject variability remains a critical challenge, particularly for BCI-illiterate users who struggle to effectively interact with these systems. To address this issue, we propose the Class-Sensitive Subject-to-Subject Semantic Style Transfer Network (CSSSTN), which incorporates a class-sensitive approach to align feature distributions between golden subjects (BCI experts) and target (BCI-illiterate) users on a class-by-class basis. Building on the SSSTN framework, CSSSTN incorporates three key components: (1) subject-specific classifier training, (2) a unique style loss to transfer class-discriminative features while preserving semantic information through a modified content loss, and (3) an ensemble approach to integrate predictions from both source and target domains. We evaluated CSSSTN using both a publicly available dataset and a self-collected dataset. Experimental results demonstrate that CSSSTN outperforms state-of-the-art methods, achieving mean balanced accuracy improvements of 6.4% on the Tsinghua dataset and 3.5% on the HDU dataset, with notable benefits for BCI-illiterate users. Ablation studies confirm the effectiveness of each component, particularly the class-sensitive transfer and the use of lower-layer features, which enhance transfer performance and mitigate negative transfer. Additionally, CSSSTN achieves competitive results with minimal target data, reducing calibration time and effort. These findings highlight the practical potential of CSSSTN for real-world BCI applications, offering a robust and scalable solution to improve the performance of BCI-illiterate users while minimizing reliance on extensive training data. Our code is available at https://github.com/ziyuey/CSSSTN.
Submitted 12 February, 2025;
originally announced February 2025.
-
Is an Ultra Large Natural Image-Based Foundation Model Superior to a Retina-Specific Model for Detecting Ocular and Systemic Diseases?
Authors:
Qingshan Hou,
Yukun Zhou,
Jocelyn Hui Lin Goh,
Ke Zou,
Samantha Min Er Yew,
Sahana Srinivasan,
Meng Wang,
Thaddaeus Lo,
Xiaofeng Lei,
Siegfried K. Wagner,
Mark A. Chia,
Dawei Yang,
Hongyang Jiang,
AnRan Ran,
Rui Santos,
Gabor Mark Somfai,
Juan Helen Zhou,
Haoyu Chen,
Qingyu Chen,
Carol Yim-Lui Cheung,
Pearse A. Keane,
Yih Chung Tham
Abstract:
The advent of foundation models (FMs) is transforming the medical domain. In ophthalmology, RETFound, a retina-specific FM pre-trained sequentially on 1.4 million natural images and 1.6 million retinal images, has demonstrated high adaptability across clinical applications. Conversely, DINOv2, a general-purpose vision FM pre-trained on 142 million natural images, has shown promise in non-medical domains. However, its applicability to clinical tasks remains underexplored. To address this, we conducted head-to-head evaluations by fine-tuning RETFound and three DINOv2 models (large, base, small) for ocular disease detection and systemic disease prediction tasks, across eight standardized open-source ocular datasets, as well as the Moorfields AlzEye and the UK Biobank datasets. The DINOv2-large model outperformed RETFound in detecting diabetic retinopathy (AUROC=0.850-0.952 vs. 0.823-0.944, across three datasets, all P<=0.007) and multi-class eye diseases (AUROC=0.892 vs. 0.846, P<0.001). In glaucoma, the DINOv2-base model outperformed RETFound (AUROC=0.958 vs. 0.940, P<0.001). Conversely, RETFound achieved superior performance over all DINOv2 models in predicting heart failure, myocardial infarction, and ischaemic stroke (AUROC=0.732-0.796 vs. 0.663-0.771, all P<0.001). These trends persisted even with 10% of the fine-tuning data. These findings showcase the distinct scenarios where general-purpose and domain-specific FMs excel, highlighting the importance of aligning FM selection with task-specific requirements to optimise clinical performance.
Submitted 10 February, 2025;
originally announced February 2025.
-
Exploiting the Hidden Capacity of MMC Through Accurate Quantification of Modulation Indices
Authors:
Qianhao Sun,
Jingwei Meng,
Ruofan Li,
Mingchao Xia,
Qifang Chen,
Jiejie Zhou,
Meiqi Fan,
Peiqian Guo
Abstract:
The modular multilevel converter (MMC) has become increasingly important in voltage-source converter-based high-voltage direct current (VSC-HVDC) systems. Direct and indirect modulation are widely used as mainstream modulation techniques in MMCs. However, due to the challenge of quantitatively evaluating the operation of different modulation schemes, the academic and industrial communities still hold differing opinions on their performance. To address this controversy, this paper employs state-of-the-art computational methods and quantitative metrics to compare the performance of different modulation schemes. The findings indicate that direct modulation offers superior modulation potential for MMCs, highlighting its higher ac voltage output capability and broader linear PQ operation region. Conversely, indirect modulation is disadvantaged in linear modulation, which indicates inferior output voltage capability. Furthermore, this paper examines the conditions under which direct and indirect modulation techniques become equivalent in steady state. The study findings suggest that the modulation capability of direct modulation is the same as that of indirect modulation in steady state when additional controls, including closed-loop capacitor voltage control and circulating current suppression control (CCSC), are simultaneously active. Simulations and experiments verify the correctness and validity of these findings.
Submitted 9 February, 2025;
originally announced February 2025.
-
A Grid-Forming HVDC Series Tapping Converter Using Extended Techniques of Flex-LCC
Authors:
Qianhao Sun,
Ruofan Li,
Jichen Wang,
Mingchao Xia,
Qifang Chen,
Meiqi Fan,
Gen Li,
Xuebo Qiao
Abstract:
This paper discusses an extension technology for the previously proposed Flexible Line-Commutated Converter (Flex-LCC) [1]. The proposed extension involves modifying the arm internal-electromotive-force control, redesigning the main-circuit parameters, and integrating a low-power coordination strategy. As a result, the Flex-LCC transforms from a grid-forming (GFM) voltage source converter (VSC) based on series-connected LCC and FBMMC into a novel GFM HVDC series tapping converter, referred to as the Extended Flex-LCC (EFLCC). The EFLCC provides dc characteristics resembling those of current source converters (CSCs) and ac characteristics resembling those of GFM VSCs. This makes it easier to integrate relatively small renewable energy sources (RESs) that operate in islanded or weak-grid supported conditions with an existing LCC-HVDC. Meanwhile, the EFLCC distinguishes itself by requiring fewer full-controlled switches and less energy storage, resulting in lower losses and costs compared to the FBMMC HVDC series tap solution. In particular, the reduced capacity requirement and the wide allowable range of valve-side ac voltages in the FBMMC part facilitate the matching of current-carrying capacities between full-controlled switches and thyristors. The application scenario, system-level analysis, implementation, converter-level operation, and comparison of the EFLCC are presented in detail in this paper. The theoretical analysis is confirmed by experimental and simulation results.
Submitted 9 February, 2025;
originally announced February 2025.
-
Towards Lightweight and Stable Zero-shot TTS with Self-distilled Representation Disentanglement
Authors:
Qianniu Chen,
Xiaoyang Hao,
Bowen Li,
Yue Liu,
Li Lu
Abstract:
Zero-shot Text-To-Speech (TTS) synthesis shows great promise for personalized voice customization through voice cloning. However, current methods for achieving zero-shot TTS heavily rely on large model scales and extensive training datasets to ensure satisfactory performance and generalizability across various speakers. This raises concerns regarding both deployment costs and data security. In this paper, we present a lightweight and stable zero-shot TTS system. We introduce a novel TTS architecture designed to effectively model linguistic content and various speaker attributes from source speech and prompt speech, respectively. Furthermore, we present a two-stage self-distillation framework that constructs parallel data pairs for effectively disentangling linguistic content and speakers from the perspective of training data. Extensive experiments show that our system exhibits excellent performance and superior stability on the zero-shot TTS tasks. Moreover, it shows markedly superior computational efficiency, with RTFs of 0.13 and 0.012 on the CPU and GPU, respectively.
Submitted 14 January, 2025;
originally announced January 2025.
-
MinMo: A Multimodal Large Language Model for Seamless Voice Interaction
Authors:
Qian Chen,
Yafeng Chen,
Yanni Chen,
Mengzhe Chen,
Yingda Chen,
Chong Deng,
Zhihao Du,
Ruize Gao,
Changfeng Gao,
Zhifu Gao,
Yabin Li,
Xiang Lv,
Jiaqing Liu,
Haoneng Luo,
Bin Ma,
Chongjia Ni,
Xian Shi,
Jialong Tang,
Hui Wang,
Hao Wang,
Wen Wang,
Yuxuan Wang,
Yunlan Xu,
Fan Yu,
Zhijie Yan
, et al. (11 additional authors not shown)
Abstract:
Recent advancements in large language models (LLMs) and multimodal speech-text models have laid the groundwork for seamless voice interactions, enabling real-time, natural, and human-like conversations. Previous models for voice interactions are categorized as native and aligned. Native models integrate speech and text processing in one framework but struggle with issues like differing sequence lengths and insufficient pre-training. Aligned models maintain text LLM capabilities but are often limited by small datasets and a narrow focus on speech tasks. In this work, we introduce MinMo, a Multimodal Large Language Model with approximately 8B parameters for seamless voice interaction. We address the main limitations of prior aligned multimodal models. We train MinMo through multiple stages of speech-to-text alignment, text-to-speech alignment, speech-to-speech alignment, and duplex interaction alignment, on 1.4 million hours of diverse speech data and a broad range of speech tasks. After the multi-stage training, MinMo achieves state-of-the-art performance across various benchmarks for voice comprehension and generation while maintaining the capabilities of text LLMs, and also facilitates full-duplex conversation, that is, simultaneous two-way communication between the user and the system. Moreover, we propose a novel and simple voice decoder that outperforms prior models in voice generation. The enhanced instruction-following capabilities of MinMo support controlling speech generation based on user instructions, with various nuances including emotions, dialects, and speaking rates, and mimicking specific voices. For MinMo, the speech-to-text latency is approximately 100 ms, and the full-duplex latency is approximately 600 ms in theory and 800 ms in practice. The MinMo project web page is https://funaudiollm.github.io/minmo, and the code and models will be released soon.
Submitted 10 January, 2025;
originally announced January 2025.
-
Language-based Audio Retrieval with Co-Attention Networks
Authors:
Haoran Sun,
Zimu Wang,
Qiuyi Chen,
Jianjun Chen,
Jia Wang,
Haiyang Zhang
Abstract:
In recent years, user-generated audio content has proliferated across various media platforms, creating a growing need for efficient retrieval methods that allow users to search for audio clips using natural language queries. This task, known as language-based audio retrieval, presents significant challenges due to the complexity of learning semantic representations from heterogeneous data across both text and audio modalities. In this work, we introduce a novel framework for the language-based audio retrieval task that leverages a co-attention mechanism to jointly learn meaningful representations from both modalities. To enhance the model's ability to capture fine-grained cross-modal interactions, we propose a cascaded co-attention architecture, where co-attention modules are stacked or iterated to progressively refine the semantic alignment between text and audio. Experiments conducted on two public datasets show that the proposed method can achieve better performance than the state-of-the-art method. Specifically, our best-performing co-attention model achieves a 16.6% improvement in mean Average Precision on the Clotho dataset, and a 15.1% improvement on AudioCaps.
Submitted 30 December, 2024;
originally announced December 2024.
-
Text-Driven Tumor Synthesis
Authors:
Xinran Li,
Yi Shuai,
Chen Liu,
Qi Chen,
Qilong Wu,
Pengfei Guo,
Dong Yang,
Can Zhao,
Pedro R. A. S. Bassi,
Daguang Xu,
Kang Wang,
Yang Yang,
Alan Yuille,
Zongwei Zhou
Abstract:
Tumor synthesis can generate examples that AI often misses or over-detects, improving AI performance by training on these challenging cases. However, existing synthesis methods, which are typically unconditional -- generating images from random variables -- or conditioned only by tumor shapes, lack controllability over specific tumor characteristics such as texture, heterogeneity, boundaries, and pathology type. As a result, the generated tumors may be overly similar to, or duplicates of, existing training data, failing to effectively address AI's weaknesses. We propose a new text-driven tumor synthesis approach, termed TextoMorph, that provides textual control over tumor characteristics. This is particularly beneficial for examples that confuse the AI the most, such as early tumor detection (increasing Sensitivity by +8.5%), tumor segmentation for precise radiotherapy (increasing DSC by +6.3%), and classification between benign and malignant tumors (improving Sensitivity by +8.2%). By incorporating text mined from radiology reports into the synthesis process, we increase the variability and controllability of the synthetic tumors to target AI's failure cases more precisely. Moreover, TextoMorph uses contrastive learning across different texts and CT scans, significantly reducing dependence on scarce image-report pairs (only 141 pairs used in this study) by leveraging a large corpus of 34,035 radiology reports. Finally, we have developed rigorous tests to evaluate synthetic tumors, including Text-Driven Visual Turing Test and Radiomics Pattern Analysis, showing that our synthetic tumors are realistic and diverse in texture, heterogeneity, boundaries, and pathology.
Submitted 24 December, 2024;
originally announced December 2024.
-
CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models
Authors:
Zhihao Du,
Yuxuan Wang,
Qian Chen,
Xian Shi,
Xiang Lv,
Tianyu Zhao,
Zhifu Gao,
Yexin Yang,
Changfeng Gao,
Hui Wang,
Fan Yu,
Huadai Liu,
Zhengyan Sheng,
Yue Gu,
Chong Deng,
Wen Wang,
Shiliang Zhang,
Zhijie Yan,
Jingren Zhou
Abstract:
In our previous work, we introduced CosyVoice, a multilingual speech synthesis model based on supervised discrete speech tokens. By employing progressive semantic decoding with two popular generative models, language models (LMs) and Flow Matching, CosyVoice demonstrated high prosody naturalness, content consistency, and speaker similarity in speech in-context learning. Recently, significant progress has been made in multi-modal large language models (LLMs), where the response latency and real-time factor of speech synthesis play a crucial role in the interactive experience. Therefore, in this report, we present an improved streaming speech synthesis model, CosyVoice 2, which incorporates comprehensive and systematic optimizations. Specifically, we introduce finite-scalar quantization to improve the codebook utilization of speech tokens. For the text-speech LM, we streamline the model architecture to allow direct use of a pre-trained LLM as the backbone. In addition, we develop a chunk-aware causal flow matching model to support various synthesis scenarios, enabling both streaming and non-streaming synthesis within a single model. By training on a large-scale multilingual dataset, CosyVoice 2 achieves human-parity naturalness, minimal response latency, and virtually lossless synthesis quality in the streaming mode. We invite readers to listen to the demos at https://funaudiollm.github.io/cosyvoice2.
Submitted 25 December, 2024; v1 submitted 13 December, 2024;
originally announced December 2024.
-
KNN-MMD: Cross Domain Wireless Sensing via Local Distribution Alignment
Authors:
Zijian Zhao,
Zhijie Cai,
Tingwei Chen,
Xiaoyang Li,
Hang Li,
Qimei Chen,
Guangxu Zhu
Abstract:
Wireless sensing has recently found widespread applications in diverse environments, including homes, offices, and public spaces. By analyzing patterns in channel state information (CSI), it is possible to infer human actions for tasks such as person identification, gesture recognition, and fall detection. However, CSI is highly sensitive to environmental changes, where even minor alterations can significantly distort the CSI patterns. This sensitivity often leads to performance degradation or outright failure when applying wireless sensing models trained in one environment to another. To address this challenge, Domain Alignment (DAL) has been widely adopted for cross-domain classification tasks, as it focuses on aligning the global distributions of the source and target domains in feature space. Despite its popularity, DAL often neglects inter-category relationships, which can lead to misalignment between categories across domains, even when global alignment is achieved. To overcome these limitations, we propose K-Nearest Neighbors Maximum Mean Discrepancy (KNN-MMD), a novel few-shot method for cross-domain wireless sensing. Our approach begins by constructing a help set using KNN from the target domain, enabling local alignment between the source and target domains within each category using MMD. Additionally, we address a key instability issue commonly observed in cross-domain methods, where model performance fluctuates sharply between epochs. Further, most existing methods struggle to determine an optimal stopping point during training due to the absence of labeled data from the target domain. Our method resolves this by excluding the support set from the target domain during training and employing it as a validation set to determine the stopping criterion. The dataset and code are publicly available at https://github.com/RS2002/KNN-MMD .
Submitted 27 June, 2025; v1 submitted 6 December, 2024;
originally announced December 2024.
-
Towards Clinical Practice in CT-Based Pulmonary Disease Screening: An Efficient and Reliable Framework
Authors:
Qian Shao,
Bang Du,
Kai Zhang,
Yixuan Wu,
Zepeng Li,
Qiyuan Chen,
Qianqian Tang,
Jian Wu,
Jintai Chen,
Honghao Gao,
Hongxia Xu
Abstract:
Deep learning models for pulmonary disease screening from Computed Tomography (CT) scans promise to alleviate the immense workload on radiologists. Still, their high computational cost, stemming from processing entire 3D volumes, remains a major barrier to widespread clinical adoption. Current sub-sampling techniques often compromise diagnostic integrity by introducing artifacts or discarding critical information. To overcome these limitations, we propose an Efficient and Reliable Framework (ERF) that fundamentally improves the practicality of automated CT analysis. Our framework introduces two core innovations: (1) A Cluster-based Sub-Sampling (CSS) method that efficiently selects a compact yet comprehensive subset of CT slices by optimizing for both representativeness and diversity. By integrating an efficient k-Nearest Neighbor (k-NN) search with an iterative refinement process, CSS bypasses the computational bottlenecks of previous methods while preserving vital diagnostic features. (2) A lightweight Hybrid Uncertainty Quantification (HUQ) mechanism, which uniquely assesses both Aleatoric Uncertainty (AU) and Epistemic Uncertainty (EU) with minimal computational overhead. By maximizing the discrepancy between auxiliary classifiers, HUQ provides a robust reliability score, which is crucial for building trust in automated systems operating on partial data. Validated on two public datasets with 2,654 CT volumes across diagnostic tasks for 3 pulmonary diseases, our proposed ERF achieves diagnostic performance comparable to the full-volume analysis (over 90% accuracy and recall) while reducing processing time by more than 60%. This work represents a significant step towards deploying fast, accurate, and trustworthy AI-powered screening tools in time-sensitive clinical settings.
Submitted 12 June, 2025; v1 submitted 2 December, 2024;
originally announced December 2024.
-
G3Flow: Generative 3D Semantic Flow for Pose-aware and Generalizable Object Manipulation
Authors:
Tianxing Chen,
Yao Mu,
Zhixuan Liang,
Zanxin Chen,
Shijia Peng,
Qiangyu Chen,
Mingkun Xu,
Ruizhen Hu,
Hongyuan Zhang,
Xuelong Li,
Ping Luo
Abstract:
Recent advances in imitation learning for 3D robotic manipulation have shown promising results with diffusion-based policies. However, achieving human-level dexterity requires seamless integration of geometric precision and semantic understanding. We present G3Flow, a novel framework that constructs real-time semantic flow, a dynamic, object-centric 3D semantic representation by leveraging foundation models. Our approach uniquely combines 3D generative models for digital twin creation, vision foundation models for semantic feature extraction, and robust pose tracking for continuous semantic flow updates. This integration enables complete semantic understanding even under occlusions while eliminating manual annotation requirements. By incorporating semantic flow into diffusion policies, we demonstrate significant improvements in both terminal-constrained manipulation and cross-object generalization. Extensive experiments across five simulation tasks show that G3Flow consistently outperforms existing approaches, achieving up to 68.3% and 50.1% average success rates on terminal-constrained manipulation and cross-object generalization tasks respectively. Our results demonstrate the effectiveness of G3Flow in enhancing real-time dynamic semantic feature understanding for robotic manipulation policies.
Submitted 21 June, 2025; v1 submitted 27 November, 2024;
originally announced November 2024.
-
Structure-Guided MR-to-CT Synthesis with Spatial and Semantic Alignments for Attenuation Correction of Whole-Body PET/MR Imaging
Authors:
Jiaxu Zheng,
Zhenrong Shen,
Lichi Zhang,
Qun Chen
Abstract:
Deep-learning-based MR-to-CT synthesis can estimate the electron density of tissues, thereby facilitating PET attenuation correction in whole-body PET/MR imaging. However, whole-body MR-to-CT synthesis faces several challenges including the issue of spatial misalignment and the complexity of intensity mapping, primarily due to the variety of tissues and organs throughout the whole body. Here we propose a novel whole-body MR-to-CT synthesis framework, which consists of three novel modules to tackle these challenges: (1) Structure-Guided Synthesis module leverages structure-guided attention gates to enhance synthetic image quality by diminishing unnecessary contours of soft tissues; (2) Spatial Alignment module yields precise registration between paired MR and CT images by taking into account the impacts of tissue volumes and respiratory movements, thus providing well-aligned ground-truth CT images during training; (3) Semantic Alignment module utilizes contrastive learning to constrain organ-related semantic information, thereby ensuring the semantic authenticity of synthetic CT images. We conduct extensive experiments to demonstrate that the proposed whole-body MR-to-CT framework can produce visually plausible and semantically realistic CT images, and validate its utility in PET attenuation correction.
Submitted 26 November, 2024;
originally announced November 2024.
-
Computation-power Coupled Modeling for IDCs and Collaborative Optimization in ADNs
Authors:
Chuyi Li,
Kedi Zheng,
Hongye Guo,
Chongqing Kang,
Qixin Chen
Abstract:
The batch and online workload of Internet data centers (IDCs) offer temporal and spatial scheduling flexibility. Given that power generation costs vary over time and location, harnessing the flexibility of IDCs' energy consumption through workload regulation can optimize the power flow within the system. This paper focuses on multi-geographically distributed IDCs managed by an Internet service company (ISC), which are aggregated as a controllable load. The load flexibility resulting from spatial load regulation of online workload is taken into account. A two-step workload scheduling mechanism is adopted, and a computation-power coupling model of ISC is established to facilitate collaborative optimization in active distribution networks (ADNs). To address the model-solving problem based on the assumption of scheduling homogeneity, a model reconstruction method is proposed. An efficient iterative algorithm is designed to solve the reconstructed model. Furthermore, the Nash bargaining solution is employed to coordinate the different optimization objectives of ISC and power system operators, thereby avoiding subjective arbitrariness. Experimental cases based on a 33-node distribution system are designed to verify the effectiveness of the model and algorithm in optimizing ISC's energy consumption and power flow within the system.
Submitted 25 November, 2024;
originally announced November 2024.
-
Optimal Energy Dispatch of Grid-Connected Electric Vehicle Considering Lithium Battery Electrochemical Model
Authors:
Yuanbo Chen,
Kedi Zheng,
Yuxuan Gu,
Jianxiao Wang,
Qixin Chen
Abstract:
Grid-connected electric vehicles (EVs) serve as a promising regulating resource in the distribution grid with Vehicle-to-Grid (V2G) facilities. In the day-ahead stage, electric vehicle batteries (EVBs) need to be precisely dispatched and controlled to ensure high efficiency and prevent degradation. This article incorporates a refined battery model, i.e., the electrochemical model (EM), into the optimal dispatch of a local energy system with high penetration of EVs, which replenish energy through V2G-equipped charging stations and a battery swapping station (BSS). To utilize the EM efficiently, recursive EVB constraints and a corresponding matrix-based state update method are proposed based on EM power characterization. The charging EV state distribution is profiled, and a multi-layer BSS model along with binary aggregation is proposed in order to overcome the computational complexity of combining the refined battery constraints with the mixed-integer optimization. Finally, a local energy system scenario is investigated for evaluation. The efficiency and effectiveness of EM consideration are assessed from the perspective of both the system and the battery.
Submitted 21 November, 2024;
originally announced November 2024.
-
A Data-Driven Pool Strategy for Price-Makers Under Imperfect Information
Authors:
Kedi Zheng,
Hongye Guo,
Qixin Chen
Abstract:
This paper studies the pool strategy for price-makers under imperfect information. In this setting, market participants cannot obtain essential transmission parameters of the power system. Thus, price-makers must estimate the market results with respect to their offer curves using available historical information. The linear programming model of economic dispatch is analyzed with the theory of rim multi-parametric linear programming (rim-MPLP). The characteristics of system patterns (combinations of status flags for generating units and transmission lines) are revealed. A multi-class classification model based on a support vector machine (SVM) is trained to map the offer curves to system patterns, and is then integrated into the decision framework of the price-maker. The performance of the proposed method is validated on the IEEE 30-bus system, Illinois synthetic 200-bus system, and South Carolina synthetic 500-bus system.
Submitted 21 November, 2024;
originally announced November 2024.
-
LiTformer: Efficient Modeling and Analysis of High-Speed Link Transmitters Using Non-Autoregressive Transformer
Authors:
Songyu Sun,
Xiao Dong,
Yanliang Sha,
Quan Chen,
Cheng Zhuo
Abstract:
High-speed serial links are fundamental to energy-efficient and high-performance computing systems such as artificial intelligence, 5G mobile and automotive, enabling low-latency and high-bandwidth communication. Transmitters (TXs) within these links are key to signal quality, while their modeling presents challenges due to nonlinear behavior and dynamic interactions with links. In this paper, we propose LiTformer: a Transformer-based model for high-speed link TXs, with a non-sequential encoder and a Transformer decoder to incorporate link parameters and capture long-range dependencies of output signals. We employ a non-autoregressive mechanism in model training and inference for parallel prediction of the signal sequence. LiTformer achieves precise TX modeling considering link impacts including crosstalk from multiple links, and provides fast prediction for various long-sequence signals with high data rates. Experimental results show that LiTformer achieves 148-456$\times$ speedup for 2-link TXs and 404-944$\times$ speedup for 16-link with mean relative errors of 0.68-1.25%, supporting 4-bit signals at Gbps data rates of single-ended and differential TXs, as well as PAM4 TXs.
Submitted 18 November, 2024;
originally announced November 2024.
-
Unsupervised Congestion Status Identification Using LMP Data
Authors:
Kedi Zheng,
Qixin Chen,
Yi Wang,
Chongqing Kang,
Le Xie
Abstract:
Having a better understanding of how locational marginal prices (LMPs) change helps in price forecasting and market strategy making. This paper investigates the fundamental distribution of the congestion part of LMPs in high-dimensional Euclidean space using an unsupervised approach. LMP models based on the lossless and lossy DC optimal power flow (DC-OPF) are analyzed to show the overlapping subspace property of the LMP data. The congestion part of LMPs is spanned by certain row vectors of the power transfer distribution factor (PTDF) matrix, and the subspace attributes of an LMP vector are found to uniquely reflect the instantaneous congestion status of all the transmission lines. The proposed method searches hierarchically for the basis vectors that span the subspaces of congestion LMP data. In the bottom-up search, the data belonging to 1-dimensional subspaces are detected, and other data are projected onto the orthogonal subspaces. This procedure is repeated until all the basis vectors are found or the basis gap appears. Top-down searching is used to address the basis gap by hyperplane detection with outliers. Once all the basis vectors are detected, the congestion status can be identified. Numerical experiments based on the IEEE 30-bus system, IEEE 118-bus system, Illinois 200-bus system, and Southwest Power Pool are conducted to show the performance of the proposed method.
Submitted 15 November, 2024;
originally announced November 2024.
-
A Novel Combined Data-Driven Approach for Electricity Theft Detection
Authors:
Kedi Zheng,
Qixin Chen,
Yi Wang,
Chongqing Kang,
Qing Xia
Abstract:
The two-way flow of information and energy is an important feature of the Energy Internet. Data analytics is a powerful tool in the information flow that aims to solve practical problems using data mining techniques. As the problem of electricity thefts via tampering with smart meters continues to increase, the abnormal behaviors of thefts become more diversified and more difficult to detect. Thus, a data analytics method for detecting various types of electricity thefts is required. However, existing methods either require a labeled dataset or additional system information, which is difficult to obtain in practice, or have poor detection accuracy. In this paper, we combine two novel data mining techniques to solve the problem. One technique is the Maximum Information Coefficient (MIC), which can find the correlations between the non-technical loss (NTL) and a certain electricity behavior of the consumer. MIC can be used to precisely detect thefts that appear normal in shape. The other technique is clustering by fast search and find of density peaks (CFSFDP). CFSFDP finds the abnormal users among thousands of load profiles, making it well suited to detecting electricity thefts with arbitrary shapes. Next, a framework for combining the advantages of the two techniques is proposed. Numerical experiments on the Irish smart meter dataset are conducted to show the good performance of the combined method.
Submitted 10 November, 2024;
originally announced November 2024.
-
Coherent Hierarchical Probabilistic Forecasting of Electric Vehicle Charging Demand
Authors:
Kedi Zheng,
Hanwei Xu,
Zeyang Long,
Yi Wang,
Qixin Chen
Abstract:
The growing penetration of electric vehicles (EVs) significantly changes typical load curves in smart grids. With the development of fast charging technology, the volatility of EV charging demand is increasing, which requires additional flexibility for real-time power balance. The forecasting of EV charging demand involves probabilistic modeling of high-dimensional time series dynamics across diverse electric vehicle charging stations (EVCSs). This paper studies the forecasting problem of multiple EVCSs in a hierarchical probabilistic manner. For each charging station, a deep learning model based on a partial input convex neural network (PICNN) is trained to predict the day-ahead charging demand's conditional distribution, preventing the common quantile crossing problem in traditional quantile regression models. Then, differentiable convex optimization layers (DCLs) are used to reconcile the scenarios sampled from the distributions to yield coherent scenarios that satisfy the hierarchical constraint. Compared to traditional optimization-based hierarchical reconciliation methods, this machine-learning approach learns a better weight matrix for adjusting the forecasting results of different targets. Numerical experiments based on real-world EV charging data are conducted to demonstrate the efficacy of the proposed method.
Submitted 3 November, 2024; v1 submitted 31 October, 2024;
originally announced November 2024.
-
Intelligent Angle Map-based Beam Alignment for RIS-aided mmWave Communication Networks
Authors:
Hao Xia,
Qing Xue,
Yanping Liu,
Binggui Zhou,
Meng Hua,
Qianbin Chen
Abstract:
Recently, reconfigurable intelligent surfaces (RISs) have been widely used to enhance the performance of millimeter wave (mmWave) communication systems, making beam alignment more challenging. To ensure efficient communication, this paper proposes a novel intelligent angle map-based beam alignment scheme that serves both general user equipments (UEs) and RIS-aided UEs simultaneously in a fast and effective way. Specifically, we construct a beam alignment architecture that utilizes only angular information. To obtain the angle information, a popular seq2seq model, the Transformer, is introduced to learn offline the relationship between UE geographic location and the corresponding optimal beam direction. Based on this machine learning model, the location-angle mapping function, i.e., the angle map, can be built. As long as the location information of UEs is available, the angle map makes the acquisition of beam alignment angles effortless. In the simulation, we utilize a ray-tracing-based dataset to verify the performance of the proposed scheme. It is demonstrated that the proposed scheme can achieve high-precision beam alignment and remarkable system performance without any beam scanning.
Submitted 31 October, 2024;
originally announced October 2024.
-
Threshold-Based Automated Pest Detection System for Sustainable Agriculture
Authors:
Tianle Li,
Jia Shu,
Qinghong Chen,
Murad Mehrab Abrar,
John Raiti
Abstract:
This paper presents a threshold-based automated pea weevil detection system, developed as part of the Microsoft FarmVibes project. Based on Internet-of-Things (IoT) and computer vision, the system is designed to monitor and manage pea weevil populations in agricultural settings, with the goal of enhancing crop production and promoting sustainable farming practices. Unlike the machine learning-based approaches, our detection approach relies on binary grayscale thresholding and contour detection techniques determined by the pea weevil sizes. We detail the design of the product, the system architecture, the integration of hardware and software components, and the overall technology strategy. Our test results demonstrate significant effectiveness in weevil management and offer promising scalability for deployment in resource-constrained environments. In addition, the software has been open-sourced for the global research community.
Submitted 17 October, 2024;
originally announced October 2024.
-
OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation
Authors:
Qinglin Zhang,
Luyao Cheng,
Chong Deng,
Qian Chen,
Wen Wang,
Siqi Zheng,
Jiaqing Liu,
Hai Yu,
Chaohong Tan,
Zhihao Du,
Shiliang Zhang
Abstract:
Full-duplex spoken dialogue systems significantly surpass traditional turn-based dialogue systems, as they allow simultaneous bidirectional communication, closely mirroring human-human interactions. However, achieving low latency and natural interactions in full-duplex dialogue systems remains a significant challenge, especially considering human conversation dynamics such as interruptions, backchannels, and overlapping speech. In this paper, we introduce a novel End-to-End GPT-based model OmniFlatten for full-duplex conversation, capable of effectively modeling the complex behaviors inherent to natural conversations with low latency. To achieve full-duplex conversation capabilities, we propose a multi-stage post-training scheme that progressively adapts a text large language model (LLM) backbone into a speech-text dialogue LLM, capable of generating text and speech in real time, without modifying the architecture of the backbone LLM. The training process comprises three stages: modality alignment, half-duplex dialogue learning, and full-duplex dialogue learning. In all training stages, we standardize the data using a flattening operation, which enables unifying the training methods and the GPT backbone across different modalities and tasks. Our approach offers a simple modeling technique and a promising research direction for developing efficient and natural end-to-end full-duplex spoken dialogue systems. Audio samples of dialogues generated by OmniFlatten can be found at this web site (https://omniflatten.github.io/).
Submitted 3 January, 2025; v1 submitted 23 October, 2024;
originally announced October 2024.
-
DRACO: A Denoising-Reconstruction Autoencoder for Cryo-EM
Authors:
Yingjun Shen,
Haizhao Dai,
Qihe Chen,
Yan Zeng,
Jiakai Zhang,
Yuan Pei,
Jingyi Yu
Abstract:
Foundation models in computer vision have demonstrated exceptional performance in zero-shot and few-shot tasks by extracting multi-purpose features from large-scale datasets through self-supervised pre-training methods. However, these models often overlook the severe corruption of cryogenic electron microscopy (cryo-EM) images by high levels of noise. We introduce DRACO, a Denoising-Reconstruction Autoencoder for CryO-EM, inspired by the Noise2Noise (N2N) approach. By processing cryo-EM movies into odd and even images and treating them as independent noisy observations, we apply a denoising-reconstruction hybrid training scheme. We mask both images to create denoising and reconstruction tasks. Because the quality of the dataset is essential for DRACO's pre-training, we build a high-quality, diverse dataset from an uncurated public database, including over 270,000 movies or micrographs. After pre-training, DRACO naturally serves as a generalizable cryo-EM image denoiser and a foundation model for various cryo-EM downstream tasks. DRACO demonstrates the best performance in denoising, micrograph curation, and particle picking tasks compared to state-of-the-art baselines.
Submitted 28 October, 2024; v1 submitted 15 October, 2024;
originally announced October 2024.
-
Reinforcement Learning Based Bidding Framework with High-dimensional Bids in Power Markets
Authors:
Jinyu Liu,
Hongye Guo,
Yun Li,
Qinghu Tang,
Fuquan Huang,
Tunan Chen,
Haiwang Zhong,
Qixin Chen
Abstract:
Over the past decade, bidding in power markets has attracted widespread attention. Reinforcement Learning (RL) has been widely used for power market bidding as a powerful AI tool to make decisions under real-world uncertainties. However, current RL methods mostly employ low-dimensional bids, which significantly diverge from the N price-power pairs commonly used in current power markets. The N-pair bidding format, denoted as High Dimensional Bids (HDBs), has not been fully integrated into the existing RL-based bidding methods. The resulting loss of flexibility in current RL bidding methods could greatly limit the bidding profits and make it difficult to tackle the rising uncertainties brought by renewable energy generation. In this paper, we propose a framework to fully utilize HDBs for RL-based bidding methods. First, we employ a special type of neural network called Neural Network Supply Functions (NNSFs) to generate HDBs in the form of N price-power pairs. Second, we embed the NNSF into a Markov Decision Process (MDP) to make it compatible with most existing RL methods. Finally, experiments on Energy Storage Systems (ESSs) in the PJM Real-Time (RT) power market show that the proposed bidding method with HDBs can significantly improve bidding flexibility, thereby improving the profit of the state-of-the-art RL bidding methods.
Submitted 14 October, 2024;
originally announced October 2024.
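To make the idea of a bid made of N price-power pairs concrete, the following is a minimal sketch, not the authors' NNSF. It assumes that a policy emits two vectors of unconstrained real numbers and that a valid bid needs prices and cumulative power quantities that never decrease. The function names, the softplus-plus-cumulative-sum construction, and the price and power limits are all hypothetical choices made for illustration.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)); the result is always >= 0.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def monotone_fractions(raw):
    # Map unconstrained values to strictly increasing fractions ending at 1.
    steps = [softplus(r) + 1e-9 for r in raw]  # strictly positive increments
    total = sum(steps)
    fractions, acc = [], 0.0
    for s in steps:
        acc += s
        fractions.append(acc / total)
    return fractions


def to_price_power_pairs(raw_prices, raw_powers, price_min, price_max, power_max):
    # Hypothetical mapping from raw policy outputs to N (price, power) pairs.
    if len(raw_prices) != len(raw_powers):
        raise ValueError("price and power vectors must have the same length")
    price_fracs = monotone_fractions(raw_prices)
    power_fracs = monotone_fractions(raw_powers)
    return [
        (price_min + pf * (price_max - price_min), qf * power_max)
        for pf, qf in zip(price_fracs, power_fracs)
    ]


# Illustrative inputs only; these numbers do not come from the paper.
pairs = to_price_power_pairs(
    [0.3, -1.2, 2.0, 0.1], [1.0, 0.0, -0.5, 0.7],
    price_min=10.0, price_max=50.0, power_max=20.0,
)
```

Because every increment is positive, any raw output maps to a monotone bid curve, which is one simple way to keep a high-dimensional action inside the feasible bid format.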
-
CleanUMamba: A Compact Mamba Network for Speech Denoising using Channel Pruning
Authors:
Sjoerd Groot,
Qinyu Chen,
Jan C. van Gemert,
Chang Gao
Abstract:
This paper presents CleanUMamba, a time-domain neural network architecture designed for real-time causal audio denoising directly applied to raw waveforms. CleanUMamba leverages a U-Net encoder-decoder structure, incorporating the Mamba state-space model in the bottleneck layer. By replacing conventional self-attention and LSTM mechanisms with Mamba, our architecture offers superior denoising performance while maintaining a constant memory footprint, enabling streaming operation. To enhance efficiency, we applied structured channel pruning, achieving an 8X reduction in model size without compromising audio quality. Our model demonstrates strong results in the Interspeech 2020 Deep Noise Suppression challenge. Specifically, CleanUMamba achieves a PESQ score of 2.42 and STOI of 95.1% with only 442K parameters and 468M MACs, matching or outperforming larger models in real-time performance. Code will be available at: https://github.com/lab-emi/CleanUMamba
Submitted 10 February, 2025; v1 submitted 14 October, 2024;
originally announced October 2024.
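As a rough illustration of structured channel pruning in general, the sketch below ranks the output channels of a single layer by the L1 norm of their weights and keeps the strongest ones. This magnitude criterion is an assumption chosen for simplicity; the abstract does not state which importance measure CleanUMamba uses, and the function name and weight values are hypothetical.

```python
def prune_channels(weights, keep):
    # weights: one flat list of floats per output channel.
    # Returns the kept channel indices (in original order) and their weights.
    if not 0 < keep <= len(weights):
        raise ValueError("keep must be between 1 and the number of channels")
    scores = [sum(abs(w) for w in channel) for channel in weights]
    ranked = sorted(range(len(weights)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])
    return kept, [weights[i] for i in kept]


# Illustrative layer with four output channels of two weights each.
layer = [[0.1, -0.1], [2.0, 1.0], [0.0, 0.05], [-1.5, 0.5]]
kept_indices, pruned_layer = prune_channels(layer, keep=2)
```

Removing whole channels, rather than individual weights, shrinks the tensors themselves, which is why structured pruning reduces parameter count and MACs without needing sparse kernels. In a real network the input channels of the following layer would have to be sliced with the same indices.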
-
Mean Age of Information in Partial Offloading Mobile Edge Computing Networks
Authors:
Ying Dong,
Hang Xiao,
Haonan Hu,
Jiliang Zhang,
Qianbin Chen,
Jie Zhang
Abstract:
Age of information (AoI) performance analysis is essential for evaluating information freshness in large-scale mobile edge computing (MEC) networks. This work presents the first analysis of the mean AoI (MAoI) performance of large-scale partial offloading MEC networks. First, we derive and validate closed-form expressions for the MAoI using queueing theory and stochastic geometry. Based on these expressions, we analyse the effects of the computing offloading ratio (COR) and task generation rate (TGR) on the MAoI performance and compare the MAoI performance under the local computing, remote computing, and partial offloading schemes. The results show that by jointly optimising the COR and TGR, the partial offloading scheme outperforms the local and remote computing schemes in terms of the MAoI, which can be improved by up to 51% and 61%, respectively. This encourages MEC networks to adopt the partial offloading scheme to improve the MAoI performance.
Submitted 24 September, 2024;
originally announced September 2024.
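For readers unfamiliar with the metric, the sketch below computes an empirical mean AoI from a trace of task generation and completion times. It is not the closed-form expression derived in the paper; it only applies the usual time-average definition of age, assuming tasks complete in the order they were generated. The function name and the timestamps are hypothetical.

```python
def mean_aoi(gen_times, done_times):
    # Time-averaged age between the first and the last completion.
    # Age at time t is t minus the generation time of the newest completed task.
    if len(gen_times) != len(done_times) or len(done_times) < 2:
        raise ValueError("need at least two matched generation/completion times")
    area = 0.0
    for i in range(len(done_times) - 1):
        start_age = done_times[i] - gen_times[i]    # age right after completion i
        end_age = done_times[i + 1] - gen_times[i]  # age just before completion i+1
        # Age grows linearly in between, so the area is a trapezoid.
        area += 0.5 * (start_age + end_age) * (done_times[i + 1] - done_times[i])
    return area / (done_times[-1] - done_times[0])


# Illustrative trace: tasks generated at t = 0, 1, 2 and completed one time unit later.
example_maoi = mean_aoi([0.0, 1.0, 2.0], [1.0, 2.0, 3.0])
```

In this trace the age ramps from 1 to 2 between consecutive completions, so the average is 1.5. Shorter processing delays or more frequent task generation pull that average down, which is the trade-off the COR and TGR control.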
-
Autotuning Bipedal Locomotion MPC with GRFM-Net for Efficient Sim-to-Real Transfer
Authors:
Qianzhong Chen,
Junheng Li,
Sheng Cheng,
Naira Hovakimyan,
Quan Nguyen
Abstract:
Bipedal locomotion control is essential for humanoid robots to navigate complex, human-centric environments. While optimization-based control designs are popular for integrating sophisticated models of humanoid robots, they often require labor-intensive manual tuning. In this work, we address the challenges of parameter selection in bipedal locomotion control using DiffTune, a model-based autotuning method that leverages differential programming for efficient parameter learning. A major difficulty lies in balancing model fidelity with differentiability. We address this difficulty using a low-fidelity model for differentiability, enhanced by a Ground Reaction Force-and-Moment Network (GRFM-Net) to capture discrepancies between MPC commands and actual control effects. We validate the parameters learned by DiffTune with GRFM-Net in hardware experiments, which demonstrate the parameters' optimality in a multi-objective setting compared with baseline parameters, reducing the total loss by up to 40.5% compared with the expert-tuned parameters. The results confirm the GRFM-Net's effectiveness in mitigating the sim-to-real gap, improving the transferability of simulation-learned parameters to real hardware.
Submitted 23 September, 2024;
originally announced September 2024.
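The general idea of tuning a controller parameter by differentiating through a model can be shown on a toy problem. The sketch below is not DiffTune and contains no GRFM-Net or MPC; it assumes a hypothetical scalar system with a proportional gain, propagates the sensitivity of the state with respect to that gain alongside the simulation, and applies plain gradient descent to a tracking loss. All names, dynamics, and step sizes are illustrative assumptions.

```python
def rollout(gain, x0=1.0, ref=0.0, dt=0.1, steps=20):
    # Simulate x_{k+1} = x_k - dt * gain * (x_k - ref).
    # Returns the summed squared tracking error and its derivative w.r.t. gain.
    x, dx_dgain = x0, 0.0
    loss, dloss_dgain = 0.0, 0.0
    for _ in range(steps):
        err = x - ref
        # Chain rule on the update, using the state before it is advanced.
        dx_dgain = dx_dgain - dt * err - dt * gain * dx_dgain
        x = x - dt * gain * err
        loss += (x - ref) ** 2
        dloss_dgain += 2.0 * (x - ref) * dx_dgain
    return loss, dloss_dgain


def tune(gain, lr=0.05, iters=50):
    # Gradient descent on the gain using the analytic sensitivity.
    for _ in range(iters):
        _, grad = rollout(gain)
        gain -= lr * grad
    return gain


initial_gain = 0.5
initial_loss, initial_grad = rollout(initial_gain)
tuned_gain = tune(initial_gain)
tuned_loss, _ = rollout(tuned_gain)
```

The sketch also hints at why model fidelity matters: the gradient is only as good as the model it is taken through, which is the gap the abstract says GRFM-Net is meant to narrow.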