Search | arXiv e-print repository

MTCNet: Motion and Topology Consistency Guided Learning for Mitral Valve Segmentation in 4D Ultrasound

Authors: Rusi Chen, Yuanting Yang, Jiezhi Yao, Hongning Song, Ji Zhang, Yongsong Zhou, Yuhao Huang, Ronghao Yang, Dan Jia, Yuhan Zhang, Xing Tao, Haoran Dou, Qing Zhou, Xin Yang, Dong Ni

Abstract: Mitral regurgitation is one of the most prevalent cardiac disorders. Four-dimensional (4D) ultrasound has emerged as the primary imaging modality for assessing dynamic valvular morphology. However, 4D mitral valve (MV) analysis remains challenging due to limited phase annotations, severe motion artifacts, and poor imaging quality. Yet, the absence of inter-phase dependency in existing methods hinders 4D MV analysis. To bridge this gap, we propose a Motion-Topology guided consistency network (MTCNet) for accurate 4D MV ultrasound segmentation in semi-supervised learning (SSL). MTCNet requires only sparse end-diastolic and end-systolic annotations. First, we design a cross-phase motion-guided consistency learning strategy, utilizing a bi-directional attention memory bank to propagate spatio-temporal features. This enables MTCNet to achieve excellent performance both per- and inter-phase. Second, we devise a novel topology-guided correlation regularization that explores physical prior knowledge to maintain anatomical plausibility. Therefore, MTCNet can effectively leverage structural correspondence between labeled and unlabeled phases. Extensive evaluations on the first largest 4D MV dataset, with 1408 phases from 160 patients, show that MTCNet achieves superior cross-phase consistency compared to other advanced methods (Dice: 87.30%, HD: 1.75mm). Both the code and the dataset are available at https://github.com/crs524/MTCNet.

Submitted 3 July, 2025; v1 submitted 1 July, 2025; originally announced July 2025.

Comments: Accepted by MICCAI 2025

arXiv:2507.00185 [pdf]

Multimodal, Multi-Disease Medical Imaging Foundation Model (MerMED-FM)

Authors: Yang Zhou, Chrystie Wan Ning Quek, Jun Zhou, Yan Wang, Yang Bai, Yuhe Ke, Jie Yao, Laura Gutierrez, Zhen Ling Teo, Darren Shu Jeng Ting, Brian T. Soetikno, Christopher S. Nielsen, Tobias Elze, Zengxiang Li, Linh Le Dinh, Lionel Tim-Ee Cheng, Tran Nguyen Tuan Anh, Chee Leong Cheng, Tien Yin Wong, Nan Liu, Iain Beehuat Tan, Tony Kiat Hon Lim, Rick Siow Mong Goh, Yong Liu, Daniel Shu Wei Ting

Abstract: Current artificial intelligence models for medical imaging are predominantly single modality and single disease. Attempts to create multimodal and multi-disease models have resulted in inconsistent clinical accuracy. Furthermore, training these models typically requires large, labour-intensive, well-labelled datasets. We developed MerMED-FM, a state-of-the-art multimodal, multi-specialty foundation model trained using self-supervised learning and a memory module. MerMED-FM was trained on 3.3 million medical images from over ten specialties and seven modalities, including computed tomography (CT), chest X-rays (CXR), ultrasound (US), pathology patches, color fundus photography (CFP), optical coherence tomography (OCT) and dermatology images. MerMED-FM was evaluated across multiple diseases and compared against existing foundational models. Strong performance was achieved across all modalities, with AUROCs of 0.988 (OCT); 0.982 (pathology); 0.951 (US); 0.943 (CT); 0.931 (skin); 0.894 (CFP); 0.858 (CXR). MerMED-FM has the potential to be a highly adaptable, versatile, cross-specialty foundation model that enables robust medical imaging interpretation across diverse medical disciplines.

Submitted 30 June, 2025; originally announced July 2025.

Comments: 42 pages, 3 composite figures, 4 tables

arXiv:2506.23986 [pdf, ps, other]

StreamFlow: Streaming Flow Matching with Block-wise Guided Attention Mask for Speech Token Decoding

Authors: Dake Guo, Jixun Yao, Linhan Ma, He Wang, Lei Xie

Abstract: Recent advancements in discrete token-based speech generation have highlighted the importance of token-to-waveform generation for audio quality, particularly in real-time interactions. Traditional frameworks integrating semantic tokens with flow matching (FM) struggle with streaming capabilities due to their reliance on a global receptive field. Additionally, directly implementing token-by-token streaming speech generation often results in degraded audio quality. To address these challenges, we propose StreamFlow, a novel neural architecture that facilitates streaming flow matching with diffusion transformers (DiT). To mitigate the long-sequence extrapolation issues arising from lengthy historical dependencies, we design a local block-wise receptive field strategy. Specifically, the sequence is first segmented into blocks, and we introduce block-wise attention masks that enable the current block to receive information from the previous or subsequent block. These attention masks are combined hierarchically across different DiT-blocks to regulate the receptive field of DiTs. Both subjective and objective experimental results demonstrate that our approach achieves performance comparable to non-streaming methods while surpassing other streaming methods in terms of speech quality, all the while effectively managing inference time during long-sequence generation. Furthermore, our method achieves a notable first-packet latency of only 180 ms. Speech samples: https://dukguo.github.io/StreamFlow/

Submitted 1 July, 2025; v1 submitted 30 June, 2025; originally announced June 2025.
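
The block-wise attention masks described in the StreamFlow abstract above can be illustrated with a small sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name and the parameters `block_size`, `n_prev`, and `n_next` are hypothetical, and the hierarchical combination of masks across DiT blocks is not shown.

```python
def blockwise_attention_mask(seq_len, block_size, n_prev=1, n_next=0):
    """Return mask[q][k] == True when query position q may attend to key k.

    A position sees its own block, up to `n_prev` earlier blocks, and up to
    `n_next` later blocks. All parameter names are illustrative assumptions.
    """
    block_id = [i // block_size for i in range(seq_len)]  # block index per position
    return [
        [-n_prev <= block_id[k] - block_id[q] <= n_next for k in range(seq_len)]
        for q in range(seq_len)
    ]

# Six positions in blocks of two: one mask looks back one block, one looks ahead.
backward = blockwise_attention_mask(6, 2, n_prev=1, n_next=0)
lookahead = blockwise_attention_mask(6, 2, n_prev=0, n_next=1)
```

Stacking layers that use the backward mask widens the effective receptive field by one block per layer, which is one plausible reading of how such masks regulate the receptive field.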

arXiv:2506.12325 [pdf, ps, other]

GSDNet: Revisiting Incomplete Multimodal-Diffusion from Graph Spectrum Perspective for Conversation Emotion Recognition

Authors: Yuntao Shou, Jun Yao, Tao Meng, Wei Ai, Cen Chen, Keqin Li

Abstract: Multimodal emotion recognition in conversations (MERC) aims to infer the speaker's emotional state by analyzing utterance information from multiple sources (i.e., video, audio, and text). Compared with unimodality, a more robust utterance representation can be obtained by fusing complementary semantic information from different modalities. However, the modality missing problem severely limits the performance of MERC in practical scenarios. Recent work has achieved impressive performance on modality completion using graph neural networks and diffusion models, respectively. This inspires us to combine these two dimensions through the graph diffusion model to obtain more powerful modal recovery capabilities. Unfortunately, existing graph diffusion models may destroy the connectivity and local structure of the graph by directly adding Gaussian noise to the adjacency matrix, resulting in the generated graph data being unable to retain the semantic and topological information of the original graph. To this end, we propose a novel Graph Spectral Diffusion Network (GSDNet), which maps Gaussian noise to the graph spectral space of missing modalities and recovers the missing data according to its original distribution. Compared with previous graph diffusion methods, GSDNet only affects the eigenvalues of the adjacency matrix instead of destroying the adjacency matrix directly, which can maintain the global topological information and important spectral features during the diffusion process. Extensive experiments have demonstrated that GSDNet achieves state-of-the-art emotion recognition performance in various modality loss scenarios.

Submitted 13 June, 2025; originally announced June 2025.
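
The GSDNet abstract above contrasts adding Gaussian noise directly to the adjacency matrix with perturbing only its eigenvalues. The sketch below shows that spectral-domain idea on a toy graph. It is an illustration only: the function name, the noise level, and the example graph are assumptions, and the paper's actual noise schedule and recovery network are not described in the abstract.

```python
import numpy as np

def perturb_spectrum(adj, noise_std, rng):
    """Add Gaussian noise to the eigenvalues of a symmetric adjacency matrix.

    The eigenvectors are reused unchanged, so the noise acts in the graph
    spectral domain instead of on individual edges.
    """
    eigvals, eigvecs = np.linalg.eigh(adj)  # adj = V diag(w) V^T
    noisy_vals = eigvals + noise_std * rng.standard_normal(eigvals.shape)
    noisy_adj = eigvecs @ np.diag(noisy_vals) @ eigvecs.T
    return noisy_adj, noisy_vals

# Hypothetical 4-node undirected graph.
rng = np.random.default_rng(0)
adj = np.array([[0.0, 1.0, 1.0, 0.0],
                [1.0, 0.0, 1.0, 0.0],
                [1.0, 1.0, 0.0, 1.0],
                [0.0, 0.0, 1.0, 0.0]])
noisy_adj, noisy_vals = perturb_spectrum(adj, noise_std=0.1, rng=rng)
```

The perturbed matrix stays symmetric and keeps the original eigenvectors, which is the property the abstract attributes to noising in the spectral space.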

arXiv:2506.01023 [pdf, ps, other]

A Two-Stage Hierarchical Deep Filtering Framework for Real-Time Speech Enhancement

Authors: Shenghui Lu, Hukai Huang, Jinanglong Yao, Kaidi Wang, Qingyang Hong, Lin Li

Abstract: This paper proposes a model that integrates sub-band processing and deep filtering to fully exploit information from the target time-frequency (TF) bin and its surrounding TF bins for single-channel speech enhancement. The sub-band module captures surrounding frequency bin information at the input, while the deep filtering module applies filtering at the output to both the target TF bin and its surrounding TF bins. To further improve the model performance, we decouple deep filtering into temporal and frequency components and introduce a two-stage framework, reducing the complexity of filter coefficient prediction at each stage. Additionally, we propose the TAConv module to strengthen convolutional feature extraction. Experimental results demonstrate that the proposed hierarchical deep filtering network (HDF-Net) effectively utilizes surrounding TF bin information and outperforms other advanced systems while using fewer resources.

Submitted 1 June, 2025; originally announced June 2025.

Comments: 5 pages, 2 figures, accepted by Interspeech 2025
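
Deep filtering, as used in the entry above, estimates each enhanced TF bin as a weighted sum of neighbouring noisy bins. The sketch below shows only a temporal stage, where each bin is filtered over its own past frames. It is a generic illustration, not HDF-Net itself: the function name and array layout are assumptions, and in a real system the coefficients would be predicted by a network. A frequency stage would apply the same idea across adjacent frequency bins.

```python
import numpy as np

def temporal_deep_filter(spec, coefs):
    """Apply a per-bin complex FIR filter along the time axis.

    spec:  complex array of shape (T, F), the noisy spectrogram.
    coefs: complex array of shape (N, T, F); coefs[i, t, f] weights spec[t - i, f].
    Frames before t = 0 are treated as zeros (causal filtering).
    """
    n_taps, n_frames, _ = coefs.shape
    out = np.zeros_like(spec)
    for i in range(n_taps):
        # Delay the spectrogram by i frames, then weight and accumulate.
        out[i:] += coefs[i, i:] * spec[:n_frames - i]
    return out

# Toy input with two taps; the coefficients are hand-set for illustration.
rng = np.random.default_rng(0)
spec = rng.standard_normal((5, 3)) + 1j * rng.standard_normal((5, 3))
identity = np.zeros((2, 5, 3), dtype=complex)
identity[0] = 1.0                      # tap 0 only: output equals input
delay = np.zeros((2, 5, 3), dtype=complex)
delay[1] = 1.0                         # tap 1 only: output is input delayed one frame
same = temporal_deep_filter(spec, identity)
shifted = temporal_deep_filter(spec, delay)
```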

arXiv:2506.00987 [pdf, ps, other]

doi 10.1109/LWC.2025.3576288

Blind Passive Beamforming for MIMO System

Authors: Wenhai Lai, Jiawei Yao, Kaiming Shen

Abstract: Passive beamforming for the intelligent surface (IS)-aided multiple-input multiple-output (MIMO) communication is a difficult nonconvex problem. It becomes even more challenging under the practical discrete constraints on phase shifts. Unlike most of the existing approaches that rely on the channel state information (CSI), this work advocates a blind beamforming strategy without any CSI. Simply put, we propose a statistical method that learns the main feature of the wireless environment from the random samples of received signal power. Field tests in the 5G commercial network demonstrate the superiority of the proposed blind passive beamforming method.

Submitted 1 June, 2025; originally announced June 2025.

Comments: 6 pages

Journal ref: IEEE Wireless Communications Letters 2025
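
One simple way to learn from random samples of received power without CSI is a conditional sample mean: for each surface element and each discrete phase level, average the power over the samples in which that element used that level, then keep the level with the highest average. The sketch below is a single-antenna toy version of that statistic. Whether the paper above uses exactly this statistic is an assumption, as the abstract does not say. The channel values, function names, and sample counts are all hypothetical, and the simulated channel is used only to generate power readings.

```python
import cmath
import math
import random

def received_power(direct, reflected, phases):
    """Power of the direct path plus phase-shifted reflected paths."""
    total = direct + sum(h * cmath.exp(1j * p) for h, p in zip(reflected, phases))
    return abs(total) ** 2

def conditional_sample_mean_config(samples, n_elements, n_levels):
    """Pick, per element, the phase level with the highest average power.

    samples: list of (levels, power) pairs, where levels[n] is the phase
    level index element n used when `power` was measured.
    """
    sums = [[0.0] * n_levels for _ in range(n_elements)]
    counts = [[0] * n_levels for _ in range(n_elements)]
    for levels, power in samples:
        for n, k in enumerate(levels):
            sums[n][k] += power
            counts[n][k] += 1
    best = []
    for n in range(n_elements):
        means = [s / c if c else float("-inf") for s, c in zip(sums[n], counts[n])]
        best.append(means.index(max(means)))
    return best

# Toy setup (all values hypothetical).
rng = random.Random(0)
n_elements, n_levels, n_samples = 8, 4, 4000
phase_set = [2 * math.pi * k / n_levels for k in range(n_levels)]
direct = 1.0 + 0.0j
reflected = [0.5 * cmath.exp(1j * rng.uniform(0, 2 * math.pi)) for _ in range(n_elements)]

samples = []
for _ in range(n_samples):
    levels = [rng.randrange(n_levels) for _ in range(n_elements)]
    power = received_power(direct, reflected, [phase_set[k] for k in levels])
    samples.append((levels, power))

chosen = conditional_sample_mean_config(samples, n_elements, n_levels)
chosen_power = received_power(direct, reflected, [phase_set[k] for k in chosen])
average_random_power = sum(p for _, p in samples) / n_samples
```

Only power measurements and the phase levels in use enter the estimator, so no channel estimate is needed.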

arXiv:2505.15004 [pdf, ps, other]

EASY: Emotion-aware Speaker Anonymization via Factorized Distillation

Authors: Jixun Yao, Hexin Liu, Eng Siong Chng, Lei Xie

Abstract: Emotion plays a significant role in speech interaction, conveyed through tone, pitch, and rhythm, enabling the expression of feelings and intentions beyond words to create a more personalized experience. However, most existing speaker anonymization systems employ parallel disentanglement methods, which only separate speech into linguistic content and speaker identity, often neglecting the preservation of the original emotional state. In this study, we introduce EASY, an emotion-aware speaker anonymization framework. EASY employs a novel sequential disentanglement process to disentangle speaker identity, linguistic content, and emotional representation, modeling each speech attribute in distinct subspaces through a factorized distillation approach. By independently constraining speaker identity and emotional representation, EASY minimizes information leakage, enhancing privacy protection while preserving original linguistic content and emotional state. Experimental results on the VoicePrivacy Challenge official datasets demonstrate that our proposed approach outperforms all baseline systems, effectively protecting speaker privacy while maintaining linguistic content and emotional state.

Submitted 30 May, 2025; v1 submitted 20 May, 2025; originally announced May 2025.

Comments: Accepted by INTERSPEECH 2025

arXiv:2505.13805 [pdf, ps, other]

ClapFM-EVC: High-Fidelity and Flexible Emotional Voice Conversion with Dual Control from Natural Language and Speech

Authors: Yu Pan, Yanni Hu, Yuguang Yang, Jixun Yao, Jianhao Ye, Hongbin Zhou, Lei Ma, Jianjun Zhao

Abstract: Despite great advances, achieving high-fidelity emotional voice conversion (EVC) with flexible and interpretable control remains challenging. This paper introduces ClapFM-EVC, a novel EVC framework capable of generating high-quality converted speech driven by natural language prompts or reference speech with adjustable emotion intensity. We first propose EVC-CLAP, an emotional contrastive language-audio pre-training model, guided by natural language prompts and categorical labels, to extract and align fine-grained emotional elements across speech and text modalities. Then, a FuEncoder with an adaptive intensity gate is presented to seamlessly fuse emotional features with Phonetic PosteriorGrams from a pre-trained ASR model. To further improve emotion expressiveness and speech naturalness, we propose a flow matching model conditioned on these captured features to reconstruct the Mel-spectrogram of source speech. Subjective and objective evaluations validate the effectiveness of ClapFM-EVC.

Submitted 19 May, 2025; originally announced May 2025.

Comments: Accepted by InterSpeech 2025

arXiv:2505.10793 [pdf, ps, other]

SongEval: A Benchmark Dataset for Song Aesthetics Evaluation

Authors: Jixun Yao, Guobin Ma, Huixin Xue, Huakang Chen, Chunbo Hao, Yuepeng Jiang, Haohe Liu, Ruibin Yuan, Jin Xu, Wei Xue, Hao Liu, Lei Xie

Abstract: Aesthetics serve as an implicit and important criterion in song generation tasks that reflect human perception beyond objective metrics. However, evaluating the aesthetics of generated songs remains a fundamental challenge, as the appreciation of music is highly subjective. Existing evaluation metrics, such as embedding-based distances, are limited in reflecting the subjective and perceptual aspects that define musical appeal. To address this issue, we introduce SongEval, the first open-source, large-scale benchmark dataset for evaluating the aesthetics of full-length songs. SongEval includes over 2,399 songs in full length, summing up to more than 140 hours, with aesthetic ratings from 16 professional annotators with musical backgrounds. Each song is evaluated across five key dimensions: overall coherence, memorability, naturalness of vocal breathing and phrasing, clarity of song structure, and overall musicality. The dataset covers both English and Chinese songs, spanning nine mainstream genres. Moreover, to assess the effectiveness of song aesthetic evaluation, we conduct experiments using SongEval to predict aesthetic scores and demonstrate better performance than existing objective evaluation metrics in predicting human-perceived musical quality.

Submitted 15 May, 2025; originally announced May 2025.

arXiv:2505.09141 [pdf, ps, other]

Sensing-Assisted Channel Prediction in Complex Wireless Environments: An LLM-Based Approach

Authors: Junjie He, Zixiang Ren, Jianping Yao, Han Hu, Tony Xiao Han, Jie Xu

Abstract: This letter studies the sensing-assisted channel prediction for a multi-antenna orthogonal frequency division multiplexing (OFDM) system operating in realistic and complex wireless environments. In this system, an integrated sensing and communication (ISAC) transmitter leverages the mono-static sensing capability to facilitate the prediction of its bi-static communication channel, by exploiting the fact that the sensing and communication channels share the same physical environment involving shared scatterers. Specifically, we propose a novel large language model (LLM)-based channel prediction approach, which adapts a pre-trained text-based LLM to handle the complex-matrix-form channel state information (CSI) data. This approach utilizes the LLM's strong ability to capture the intricate spatiotemporal relationships between the multi-path sensing and communication channels, and thus efficiently predicts upcoming communication CSI based on historical communication and sensing CSI data. Experimental results show that the proposed LLM-based approach significantly outperforms conventional deep learning-based methods and the benchmark scheme without sensing assistance.

Submitted 14 May, 2025; originally announced May 2025.

arXiv:2503.19368

RIS-Assisted Passive Localization (RAPL): An Efficient Zero-Overhead Framework Using Conditional Sample Mean

Authors: Jiawei Yao, Yijie Mao, Mingzhe Chen, Ye Hu

Abstract: Reconfigurable Intelligent Surface (RIS) has been recognized as a promising solution for enhancing localization accuracy. Traditional RIS-based localization methods typically rely on prior channel knowledge, beam scanning, and pilot-based assistance. These approaches often result in substantial energy and computational overhead, and require real-time coordination between the base station (BS) and the RIS. To address these challenges, in this work, we move beyond conventional methods and introduce a novel data-driven, multiple RISs-assisted passive localization approach (RAPL). The proposed method has two stages: in the first stage, the angle-of-directions (AoDs) between the RISs and the user are estimated using the conditional sample mean, and in the second stage, the user's position is determined based on the estimated multiple AoD pairs. This approach only utilizes the existing communication signals between the user and the BS, relying solely on the measurement of received signal power at each BS antenna for a set of randomly generated phase shifts across all RISs. Moreover, by obviating the need for real-time RIS phase shift optimization or user-to-BS pilot transmissions, the method introduces no additional communication overhead, making it highly suitable for deployment in real-world networks. The proposed scheme is then extended to multi-RIS scenarios considering both parallel and cascaded RIS topologies. Numerical results show that the proposed RAPL improves localization accuracy while significantly reducing energy and signaling overhead compared to conventional methods.

Submitted 25 March, 2025; originally announced March 2025.

Comments: arXiv admin comment: This version has been removed by arXiv administrators as the submitter did not have the rights to agree to the license at the time of submission

arXiv:2503.18610

RIS-Assisted Localization: A Novel Conditional Sample Mean Approach without CSI

Authors: Jiawei Yao, Yijie Mao, Mingzhe Chen

Abstract: Reconfigurable intelligent surface (RIS) has been recognized as a promising solution for enhancing localization accuracy. Traditional RIS-based localization methods typically rely on prior channel knowledge, beam scanning, and pilot-based assistance. These approaches often result in substantial energy and computational overhead, and require real-time coordination between the base station (BS) and the RIS. In this work, we propose a novel multiple RISs aided localization approach to address these challenges. The proposed method first estimates the angle-of-directions (AoDs) between the RISs and the user using the conditional sample mean approach, and then uses the estimated multiple AoD pairs to determine the user's position. This approach only requires measuring the received signal strength at the BS for a set of randomly generated phase shifts across all RISs, thereby eliminating the need for real-time RIS phase shift design or user-to-BS pilot transmissions. Numerical results show that the proposed localization approach improves localization accuracy while significantly reducing energy and signaling overhead compared to conventional methods.

Submitted 24 March, 2025; originally announced March 2025.

Comments: arXiv admin comment: This version has been removed by arXiv administrators as the submitter did not have the rights to agree to the license at the time of submission

arXiv:2503.17649 [pdf, ps, other]

Quantized Analog Beamforming Enabled Multi-task Federated Learning Over-the-air

Authors: Jiacheng Yao, Wei Xu, Guangxu Zhu, Zhaohui Yang, Kaibin Huang, Dusit Niyato

Abstract: Over-the-air computation (AirComp) has recently emerged as a pivotal technique for communication-efficient federated learning (FL) in resource-constrained wireless networks. Though AirComp leverages the superposition property of multiple access channels for computation, it inherently limits its ability to manage inter-task interference in multi-task computing. In this paper, we propose a quantized analog beamforming scheme at the receiver to enable simultaneous multi-task FL. Specifically, inspired by the favorable propagation and channel hardening properties of large-scale antenna arrays, a targeted analog beamforming method in closed form is proposed for statistical interference elimination. Analytical results reveal that the interference power vanishes by an order of $\mathcal{O}\left(1/N_r\right)$ with the number of analog phase shifters, $N_r$, irrespective of their quantization precision. Numerical results demonstrate the effectiveness of the proposed analog beamforming method and show that the performance upper bound of ideal learning without errors can be achieved by increasing the number of low-precision analog phase shifters.

Submitted 22 March, 2025; originally announced March 2025.

Comments: Accepted by IEEE VTC-Spring 2025

arXiv:2503.03560 [pdf, ps, other]

Optimal Beamforming for Multi-Target Multi-User ISAC Exploiting Prior Information: How Many Sensing Beams Are Needed?

Authors: Jiayi Yao, Shuowen Zhang

Abstract: This paper studies a multi-target multi-user integrated sensing and communication (ISAC) system where a multi-antenna base station (BS) communicates with multiple single-antenna users in the downlink and senses the unknown and random angle information of multiple targets based on their reflected echo signals at the BS receiver as well as their prior probability information. We focus on a general beamforming structure with both communication beams and dedicated sensing beams, whose design is highly non-trivial as more sensing beams provide more flexibility in sensing, but introduce extra interference to communication. To resolve this trade-off, we first characterize the periodic posterior Cramér-Rao bound (PCRB) as a lower bound of the mean-cyclic error (MCE) in multi-target sensing. Then, we optimize the beamforming to minimize the maximum periodic PCRB among all targets to ensure fairness, subject to individual communication rate constraints at multiple users. Despite the non-convexity of this problem, we propose a general construction method for the optimal solution by leveraging semi-definite relaxation (SDR), and derive a general bound on the number of sensing beams needed. Moreover, we unveil specific structures of the optimal solution in various cases, where tighter bounds on the number of sensing beams needed are derived (e.g., no or at most one sensing beam is needed under stringent rate constraints or with homogeneous targets). Next, we study the beamforming optimization to minimize the sum periodic PCRB under user rate constraints. By applying SDR, we propose a general construction method for the optimal solution and its specific structures which yield lower computational complexities. We derive a general bound and various tighter bounds on the number of sensing beams needed. Numerical results validate our analysis and the effectiveness of our proposed beamforming designs.

Submitted 28 June, 2025; v1 submitted 5 March, 2025; originally announced March 2025.

Comments: This is the longer version of a paper submitted for possible journal publication

arXiv:2503.01183 [pdf, other]

DiffRhythm: Blazingly Fast and Embarrassingly Simple End-to-End Full-Length Song Generation with Latent Diffusion

Authors: Ziqian Ning, Huakang Chen, Yuepeng Jiang, Chunbo Hao, Guobin Ma, Shuai Wang, Jixun Yao, Lei Xie

Abstract: Recent advancements in music generation have garnered significant attention, yet existing approaches face critical limitations. Some current generative models can only synthesize either the vocal track or the accompaniment track. While some models can generate combined vocal and accompaniment, they typically rely on meticulously designed multi-stage cascading architectures and intricate data pipelines, hindering scalability. Additionally, most systems are restricted to generating short musical segments rather than full-length songs. Furthermore, widely used language model-based methods suffer from slow inference speeds. To address these challenges, we propose DiffRhythm, the first latent diffusion-based song generation model capable of synthesizing complete songs with both vocal and accompaniment for durations of up to 4m45s in only ten seconds, maintaining high musicality and intelligibility. Despite its remarkable capabilities, DiffRhythm is designed to be simple and elegant: it eliminates the need for complex data preparation, employs a straightforward model structure, and requires only lyrics and a style prompt during inference. Additionally, its non-autoregressive structure ensures fast inference speeds. This simplicity guarantees the scalability of DiffRhythm. Moreover, we release the complete training code along with the pre-trained model on large-scale data to promote reproducibility and further research.

Submitted 3 March, 2025; originally announced March 2025.

arXiv:2503.00298 [pdf, other]

Energy-Efficient Edge Inference in Integrated Sensing, Communication, and Computation Networks

Authors: Jiacheng Yao, Wei Xu, Guangxu Zhu, Kaibin Huang, Shuguang Cui

Abstract: Task-oriented integrated sensing, communication, and computation (ISCC) is a key technology for achieving low-latency edge inference and enabling efficient implementation of artificial intelligence (AI) in industrial cyber-physical systems (ICPS). However, the constrained energy supply at edge devices has emerged as a critical bottleneck. In this paper, we propose a novel energy-efficient ISCC framework for AI inference at resource-constrained edge devices, where adjustable split inference, model pruning, and feature quantization are jointly designed to adapt to diverse task requirements. A joint resource allocation design problem for the proposed ISCC framework is formulated to minimize the energy consumption under stringent inference accuracy and latency constraints. To address the challenge of characterizing inference accuracy, we derive an explicit approximation for it by analyzing the impact of sensing, communication, and computation processes on the inference performance. Building upon the analytical results, we propose an iterative algorithm employing alternating optimization to solve the resource allocation problem. In each subproblem, the optimal solutions are available by respectively applying a golden section search method and checking the Karush-Kuhn-Tucker (KKT) conditions, thereby ensuring the convergence to a local optimum of the original problem. Numerical results demonstrate the effectiveness of the proposed ISCC design, showing a significant reduction in energy consumption of up to 40% compared to existing methods, particularly in low-latency scenarios.

Submitted 28 February, 2025; originally announced March 2025.

Comments: Accepted by IEEE JSAC

arXiv:2502.02950 [pdf, other]

Fine-grained Preference Optimization Improves Zero-shot Text-to-Speech

Authors: Jixun Yao, Yuguang Yang, Yu Pan, Yuan Feng, Ziqian Ning, Jianhao Ye, Hongbin Zhou, Lei Xie

Abstract: Integrating human feedback to align text-to-speech (TTS) system outputs with human preferences has proven to be an effective approach for enhancing the robustness of language model-based TTS systems. Current approaches primarily focus on using preference data annotated at the utterance level. However, frequent issues that affect the listening experience often only arise in specific segments of audio samples, while other segments are well-generated. In this study, we propose a fine-grained preference optimization approach (FPO) to enhance the robustness of TTS systems. FPO focuses on addressing localized issues in generated samples rather than uniformly optimizing the entire utterance. Specifically, we first analyze the types of issues in generated samples, categorize them into two groups, and propose a selective training loss strategy to optimize preferences based on fine-grained labels for each issue type. Experimental results show that FPO enhances the robustness of zero-shot TTS systems by effectively addressing local issues, significantly reducing the bad case ratio, and improving intelligibility. Furthermore, FPO exhibits superior data efficiency compared with baseline systems, achieving similar performance with fewer training samples.

Submitted 5 February, 2025; originally announced February 2025.

Comments: WIP

arXiv:2502.02942 [pdf, other]

GenSE: Generative Speech Enhancement via Language Models using Hierarchical Modeling

Authors: Jixun Yao, Hexin Liu, Chen Chen, Yuchen Hu, EngSiong Chng, Lei Xie

Abstract: Semantic information refers to the meaning conveyed through words, phrases, and contextual relationships within a given linguistic structure. Humans can leverage semantic information, such as familiar linguistic patterns and contextual cues, to reconstruct incomplete or masked speech signals in noisy environments. However, existing speech enhancement (SE) approaches often overlook the rich semantic information embedded in speech, which is crucial for improving intelligibility, speaker consistency, and overall quality of enhanced speech signals. To enrich the SE model with semantic information, we employ language models as an efficient semantic learner and propose a comprehensive framework tailored for language model-based speech enhancement, called GenSE. Specifically, we approach SE as a conditional language modeling task rather than a continuous signal regression problem defined in existing works. This is achieved by tokenizing speech signals into semantic tokens using a pre-trained self-supervised model and into acoustic tokens using a custom-designed single-quantizer neural codec model. To improve the stability of language model predictions, we propose a hierarchical modeling method that decouples the generation of clean semantic tokens and clean acoustic tokens into two distinct stages. Moreover, we introduce a token chain prompting mechanism during the acoustic token generation stage to ensure timbre consistency throughout the speech enhancement process. Experimental results on benchmark datasets demonstrate that our proposed approach outperforms state-of-the-art SE systems in terms of speech quality and generalization capability.

Submitted 5 February, 2025; originally announced February 2025.

Comments: Accepted by ICLR 2025

arXiv:2501.16081 [pdf, ps, other]

doi 10.1109/TSP.2025.3536023

Combating Interference for Over-the-Air Federated Learning: A Statistical Approach via RIS

Authors: Wei Shi, Jiacheng Yao, Wei Xu, Jindan Xu, Xiaohu You, Yonina C. Eldar, Chunming Zhao

Abstract: Over-the-air computation (AirComp) integrates analog communication with task-oriented computation, serving as a key enabling technique for communication-efficient federated learning (FL) over wireless networks. However, owing to its analog characteristics, AirComp-enabled FL (AirFL) is vulnerable to both unintentional and intentional interference. In this paper, we aim to attain robustness in AirComp aggregation against interference via reconfigurable intelligent surface (RIS) technology to artificially reconstruct wireless environments. Concretely, we establish performance objectives tailored for interference suppression in wireless FL systems, aiming to achieve unbiased gradient estimation and reduce its mean square error (MSE). Oriented at these objectives, we introduce the concept of phase-manipulated favorable propagation and channel hardening for AirFL, which relies on the adjustment of RIS phase shifts to realize statistical interference elimination and reduce the error variance of gradient estimation. Building upon this concept, we propose two robust aggregation schemes of power control and RIS phase shifts design, both ensuring unbiased gradient estimation in the presence of interference. Theoretical analysis of the MSE and FL convergence affirms the anti-interference capability of the proposed schemes. It is observed that computation and interference errors diminish by an order of $\mathcal{O}\left(\frac{1}{N}\right)$ where $N$ is the number of RIS elements, and the ideal convergence rate without interference can be asymptotically achieved by increasing $N$. Numerical results confirm the analytical results and validate the superior performance of the proposed schemes over existing baselines.

Submitted 27 January, 2025; originally announced January 2025.

Comments: Accepted by IEEE Transactions on Signal Processing

arXiv:2501.05127 [pdf, other]

DiffAttack: Diffusion-based Timbre-reserved Adversarial Attack in Speaker Identification

Authors: Qing Wang, Jixun Yao, Zhaokai Sun, Pengcheng Guo, Lei Xie, John H. L. Hansen

Abstract: Being a form of biometric identification, the security of the speaker identification (SID) system is of utmost importance. To better understand the robustness of SID systems, we aim to perform more realistic attacks in SID, which are challenging for both humans and machines to detect. In this study, we propose DiffAttack, a novel timbre-reserved adversarial attack approach that exploits the capability of a diffusion-based voice conversion (DiffVC) model to generate adversarial fake audio with distinct target speaker attribution. By introducing adversarial constraints into the generative process of the diffusion-based voice conversion model, we craft fake samples that effectively mislead target models while preserving speaker-wise characteristics. Specifically, inspired by the use of randomly sampled Gaussian noise in conventional adversarial attacks and diffusion processes, we incorporate adversarial constraints into the reverse diffusion process. These constraints subtly guide the reverse diffusion process toward aligning with the target speaker distribution. Our experiments on the LibriTTS dataset indicate that DiffAttack significantly improves the attack success rate compared to vanilla DiffVC and other methods. Moreover, objective and subjective evaluations demonstrate that introducing adversarial constraints does not compromise the speech quality generated by the DiffVC model.

Submitted 9 January, 2025; originally announced January 2025.

Comments: 5 pages, 4 figures, accepted by ICASSP 2025

arXiv:2501.01281 [pdf, other]

Towards Intelligent Antenna Positioning: Leveraging DRL for FAS-Aided ISAC Systems

Authors: Shunxing Yang, Junteng Yao, Jie Tang, Tuo Wu, Maged Elkashlan, Chau Yuen, Merouane Debbah, Hyundong Shin, Matthew Valenti

Abstract: Fluid antenna systems (FAS) enable dynamic antenna positioning, offering new opportunities to enhance integrated sensing and communication (ISAC) performance. However, existing studies primarily focus on communication enhancement or single-target sensing, leaving multi-target scenarios underexplored. Additionally, the joint optimization of beamforming and antenna positions poses a highly non-convex problem, with traditional methods becoming impractical as the number of fluid antennas increases. To address these challenges, this letter proposes a block coordinate descent (BCD) framework integrated with a deep reinforcement learning (DRL)-based approach for intelligent antenna positioning. By leveraging the deep deterministic policy gradient (DDPG) algorithm, the proposed framework efficiently balances sensing and communication performance. Simulation results demonstrate the scalability and effectiveness of the proposed approach.

Submitted 2 January, 2025; originally announced January 2025.

arXiv:2412.19748 [pdf, ps, other]

UAV-Enabled Secure ISAC Against Dual Eavesdropping Threats: Joint Beamforming and Trajectory Design

Authors: Jianping Yao, Zeyu Yang, Zai Yang, Jie Xu, Tony Q. S. Quek

Abstract: In this work, we study an unmanned aerial vehicle (UAV)-enabled secure integrated sensing and communication (ISAC) system, where a UAV serves as an aerial base station (BS) to simultaneously perform communication with a user and detect a target on the ground, while a dual-functional eavesdropper attempts to intercept the signals for both sensing and communication. Facing the dual eavesdropping threats, we aim to enhance the average achievable secrecy rate for the communication user by jointly designing the UAV trajectory together with the transmit information and sensing beamforming, while satisfying the requirements on sensing performance and sensing security, as well as the UAV power and flight constraints. To address the non-convex nature of the optimization problem, we employ the alternating optimization (AO) strategy, jointly with the successive convex approximation (SCA) and semidefinite relaxation (SDR) methods. Numerical results validate the proposed approach, demonstrating its ability to achieve a high secrecy rate while meeting the required sensing and security constraints.

Submitted 27 May, 2025; v1 submitted 27 December, 2024; originally announced December 2024.

Comments: 8 pages, 6 figures, submitted for possible publication. It overlaps with the former version (arXiv:2412.19748)

arXiv:2412.15843 [pdf, other]

Rethinking Hardware Impairments in Multi-User Systems: Can FAS Make a Difference?

Authors: Junteng Yao, Tuo Wu, Liaoshi Zhou, Ming Jin, Cunhua Pan, Maged Elkashlan, Fumiyuki Adachi, George K. Karagiannidis, Naofal Al-Dhahir, Chau Yuen

Abstract: In this paper, we analyze the role of fluid antenna systems (FAS) in multi-user systems with hardware impairments (HIs). Specifically, we investigate a scenario where a base station (BS) equipped with multiple fluid antennas communicates with multiple users (CUs), each equipped with a single fluid antenna. Our objective is to maximize the minimum communication rate among all users by jointly optimizing the BS's transmit beamforming, the positions of its transmit fluid antennas, and the positions of the CUs' receive fluid antennas. To address this non-convex problem, we propose a block coordinate descent (BCD) algorithm integrating semidefinite relaxation (SDR), rank-one constraint relaxation (SRCR), successive convex approximation (SCA), and majorization-minimization (MM). Simulation results demonstrate that FAS significantly enhances system performance and robustness, with notable gains when both the BS and CUs are equipped with fluid antennas. Even under low transmit power conditions, deploying FAS at the BS alone yields substantial performance gains. However, the effectiveness of FAS depends on the availability of sufficient movement space, as space constraints may limit its benefits compared to fixed antenna strategies. Our findings highlight the potential of FAS to mitigate HIs and enhance multi-user system performance, while emphasizing the need for practical deployment considerations.

Submitted 20 December, 2024; originally announced December 2024.

arXiv:2412.10844 [pdf, other]

Lyapunov-based reinforcement learning for distributed control with stability guarantee

Authors: Jingshi Yao, Minghao Han, Xunyuan Yin

Abstract: In this paper, we propose a Lyapunov-based reinforcement learning method for distributed control of nonlinear systems comprising interacting subsystems with guaranteed closed-loop stability. Specifically, we conduct a detailed stability analysis and derive sufficient conditions that ensure closed-loop stability under a model-free distributed control scheme based on the Lyapunov theorem. The Lyapunov-based conditions are leveraged to guide the design of local reinforcement learning control policies for each subsystem. The local controllers only exchange scalar-valued information during the training phase, yet they do not need to communicate once the training is completed and the controllers are implemented online. The effectiveness and performance of the proposed method are evaluated using a benchmark chemical process that contains two reactors and one separator.

Submitted 14 December, 2024; originally announced December 2024.

Comments: 28 pages, 10 figures, journal, Computers and Chemical Engineering

arXiv:2412.04724 [pdf, other]

StableVC: Style Controllable Zero-Shot Voice Conversion with Conditional Flow Matching

Authors: Jixun Yao, Yuguang Yang, Yu Pan, Ziqian Ning, Jiaohao Ye, Hongbin Zhou, Lei Xie

Abstract: Zero-shot voice conversion (VC) aims to transfer the timbre from the source speaker to an arbitrary unseen speaker while preserving the original linguistic content. Despite recent advancements in zero-shot VC using language model-based or diffusion-based approaches, several challenges remain: 1) current approaches primarily focus on adapting timbre from unseen speakers and are unable to transfer style and timbre to different unseen speakers independently; 2) these approaches often suffer from slower inference speeds due to the autoregressive modeling methods or the need for numerous sampling steps; 3) the quality and similarity of the converted samples are still not fully satisfactory. To address these challenges, we propose a style controllable zero-shot VC approach named StableVC, which aims to transfer timbre and style from source speech to different unseen target speakers. Specifically, we decompose speech into linguistic content, timbre, and style, and then employ a conditional flow matching module to reconstruct the high-quality mel-spectrogram based on these decomposed features. To effectively capture timbre and style in a zero-shot manner, we introduce a novel dual attention mechanism with an adaptive gate, rather than using conventional feature concatenation. With this non-autoregressive design, StableVC can efficiently capture the intricate timbre and style from different unseen speakers and generate high-quality speech significantly faster than real-time. Experiments demonstrate that our proposed StableVC outperforms state-of-the-art baseline systems in zero-shot VC and achieves flexible control over timbre and style from different unseen speakers. Moreover, StableVC offers approximately 25x and 1.65x faster sampling compared to autoregressive and diffusion-based baselines.

Submitted 10 December, 2024; v1 submitted 5 December, 2024; originally announced December 2024.

Comments: Accepted by AAAI 2025

arXiv:2412.03839 [pdf, other]

Fluid Antenna Systems Enabling 6G: Principles, Applications, and Research Directions

Authors: Tuo Wu, Kangda Zhi, Junteng Yao, Xiazhi Lai, Jianchao Zheng, Hong Niu, Maged Elkashlan, Kai-Kit Wong, Chan-Byoung Chae, Zhiguo Ding, George K. Karagiannidis, Merouane Debbah, Chau Yuen

Abstract: Fluid antenna system (FAS), as a new version of reconfigurable antenna technologies promoting shape and position flexibility, has emerged as an exciting and possibly transformative technology for wireless communications systems. FAS represents any software-controlled fluidic, conductive or dielectric structure that can dynamically alter the antenna's shape and position to change the gain, the radiation pattern, the operating frequency, and other critical radiation characteristics. With its capability, it is highly anticipated that FAS can contribute greatly to the upcoming sixth generation (6G) wireless networks. This article substantiates this thought by addressing four major questions: 1) Is FAS crucial to 6G? 2) How to characterize FAS? 3) What are the applications of FAS? 4) What are the relevant challenges and future research directions? In particular, five promising research directions that underscore the potential of FAS are discussed. We conclude this article by showcasing the impressive performance of FAS.

Submitted 4 December, 2024; originally announced December 2024.

arXiv:2411.18918 [pdf, other]

CoDiff-VC: A Codec-Assisted Diffusion Model for Zero-shot Voice Conversion

Authors: Yuke Li, Xinfa Zhu, Hanzhao Li, JiXun Yao, WenJie Tian, XiPeng Yang, YunLin Chen, Zhifei Li, Lei Xie

Abstract: Zero-shot voice conversion (VC) aims to convert the original speaker's timbre to any target speaker while keeping the linguistic content. Current mainstream zero-shot voice conversion approaches depend on pre-trained recognition models to disentangle linguistic content and speaker representation. This results in a timbre residue within the decoupled linguistic content and inadequacies in speaker representation modeling. In this study, we propose CoDiff-VC, an end-to-end framework for zero-shot voice conversion that integrates a speech codec and a diffusion model to produce high-fidelity waveforms. Our approach involves employing a single-codebook codec to separate linguistic content from the source speech. To enhance content disentanglement, we introduce Mix-Style layer normalization (MSLN) to perturb the original timbre. Additionally, we incorporate a multi-scale speaker timbre modeling approach to ensure timbre consistency and improve voice detail similarity. To improve speech quality and speaker similarity, we introduce dual classifier-free guidance, providing both content and timbre guidance during the generation process. Objective and subjective experiments affirm that CoDiff-VC significantly improves speaker similarity, generating natural and higher-quality speech.

Submitted 3 December, 2024; v1 submitted 28 November, 2024; originally announced November 2024.

arXiv:2411.09235 [pdf, ps, other]

FAS for Secure and Covert Communications

Authors: Junteng Yao, Liangxiao Xin, Tuo Wu, Ming Jin, Kai-Kit Wong, Chau Yuen, Hyundong Shin

Abstract: This letter considers a fluid antenna system (FAS)-aided secure and covert communication system, where the transmitter adjusts multiple fluid antennas' positions to achieve secure and covert transmission under the threat of an eavesdropper and the detection of a warden. This letter aims to maximize the secrecy rate while satisfying the covertness constraint. Unfortunately, the optimization problem is non-convex due to the coupled variables. To tackle this, we propose an alternating optimization (AO) algorithm to alternately optimize the optimization variables in an iterative manner. In particular, we use a penalty-based method and the majorization-minimization (MM) algorithm to optimize the transmit beamforming and fluid antennas' positions, respectively. Simulation results show that FAS can significantly improve the performance of secrecy and covertness compared to the fixed-position antenna (FPA)-based schemes.

Submitted 14 November, 2024; originally announced November 2024.

arXiv:2411.08386 [pdf, ps, other]

A Secure Beamforming Design: When Fluid Antenna Meets NOMA

Authors: Lifeng Mai, Junteng Yao, Jie Tang, Tuo Wu, Kai-Kit Wong, Hyundong Shin, Fumiyuki Adachi

Abstract: This letter proposes a secure beamforming design for downlink non-orthogonal multiple access (NOMA) systems utilizing fluid antenna systems (FAS). We consider a setup where a base station (BS) with $M$ fluid antennas (FAs) communicates to a cell-center user (CU) and a cell-edge user (CEU), each with a FA. The CU is the intended recipient while the CEU is regarded as a potential eavesdropper. Our aim is to maximize the achievable secrecy rate by jointly optimizing the secure beamforming vectors and the positions of FAs. To tackle this, we adopt an alternating optimization (AO) algorithm that optimizes secure beamforming and the positions of the FAs iteratively while keeping the other variables fixed. Numerical results illustrate that when FAs meet NOMA, the proposed scheme greatly enhances the secrecy rate compared to conventional multiple-input single-output (MISO) fixed antenna NOMA systems and other benchmark schemes.

Submitted 13 November, 2024; originally announced November 2024.

arXiv:2411.08383 [pdf, other]

FAS-Driven Spectrum Sensing for Cognitive Radio Networks

Authors: Junteng Yao, Ming Jin, Tuo Wu, Maged Elkashlan, Chau Yuen, Kai-Kit Wong, George K. Karagiannidis, Hyundong Shin

Abstract: Cognitive radio (CR) networks face significant challenges in spectrum sensing, especially under spectrum scarcity. Fluid antenna systems (FAS) can offer an unorthodox solution due to their ability to dynamically adjust antenna positions for improved channel gain. In this letter, we study a FAS-driven CR setup where a secondary user (SU) adjusts the positions of fluid antennas to detect signals from the primary user (PU). We aim to maximize the detection probability under the constraints of the false alarm probability and the received beamforming of the SU. To address this problem, we first derive a closed-form expression for the optimal detection threshold and reformulate the problem to find its solution. Then an alternating optimization (AO) scheme is proposed to decompose the problem into several sub-problems, addressing both the received beamforming and the antenna positions at the SU. The beamforming subproblem is addressed using a closed-form solution, while the fluid antenna positions are solved by successive convex approximation (SCA). Simulation results reveal that the proposed algorithm provides significant improvements over traditional fixed-position antenna (FPA) schemes in terms of spectrum sensing performance.

Submitted 13 November, 2024; originally announced November 2024.

arXiv:2411.02026 [pdf, other]

CTEFM-VC: Zero-Shot Voice Conversion Based on Content-Aware Timbre Ensemble Modeling and Flow Matching

Authors: Yu Pan, Yuguang Yang, Jixun Yao, Jianhao Ye, Hongbin Zhou, Lei Ma, Jianjun Zhao

Abstract: Zero-shot voice conversion (VC) aims to transform the timbre of a source speaker into any previously unseen target speaker, while preserving the original linguistic content. Despite notable progress, attaining a degree of speaker similarity and naturalness on par with ground truth recordings continues to pose a great challenge. In this paper, we propose CTEFM-VC, a zero-shot VC framework that leverages Content-aware Timbre Ensemble modeling and Flow Matching. Specifically, CTEFM-VC disentangles utterances into linguistic content and timbre representations, subsequently utilizing a conditional flow matching model and a vocoder to reconstruct the mel-spectrogram and waveform. To enhance its timbre modeling capability and the naturalness of generated speech, we propose a context-aware timbre ensemble modeling approach that adaptively integrates diverse speaker verification embeddings and enables the joint utilization of linguistic and timbre features through a cross-attention module. Experiments show that our CTEFM-VC system surpasses state-of-the-art VC methods in both speaker similarity and naturalness by at least 18.5% and 7.0%, respectively.

Submitted 4 November, 2024; originally announced November 2024.

Comments: Work in progress; 5 pages

arXiv:2411.01398 [pdf, ps, other]

Paving the Way to 6G: Outage Probability Analysis for FAS-ARIS Systems

Authors: Jianchao Zheng, Xiazhi Lai, Junteng Yao, Jie Tang, Yijin Pan, Tuo Wu, Chau Yuen

Abstract: In this paper, we pave the way to sixth-generation (6G) by investigating the outage probability (OP) of fluid antenna system (FAS)-active reconfigurable intelligent surface (ARIS) communication systems. We consider a FAS-ARIS setup consisting of a base station (BS) with a single fixed-position antenna and a receiver equipped with a fluid antenna (FA). Utilizing the block-correlation model, we derive a closed-form expression for the OP. Our analysis, supported by numerical results, confirms the accuracy and effectiveness of the derivation. Furthermore, the results demonstrate that the FAS-ARIS system significantly outperforms other configurations in terms of OP, highlighting its potential to enhance communication performance and reliability in future 6G networks.

Submitted 2 November, 2024; originally announced November 2024.

arXiv:2410.23815 [pdf, other]

The NPU-HWC System for the ISCSLP 2024 Inspirational and Convincing Audio Generation Challenge

Authors: Dake Guo, Jixun Yao, Xinfa Zhu, Kangxiang Xia, Zhao Guo, Ziyu Zhang, Yao Wang, Jie Liu, Lei Xie

Abstract: This paper presents the NPU-HWC system submitted to the ISCSLP 2024 Inspirational and Convincing Audio Generation Challenge (ICAGC). Our system consists of two modules: a speech generator for Track 1 and a background audio generator for Track 2. In Track 1, we employ Single-Codec to tokenize the speech into discrete tokens and use a language-model-based approach to achieve zero-shot speaking style cloning. The Single-Codec effectively decouples timbre and speaking style at the token level, reducing the acoustic modeling burden on the autoregressive language model. Additionally, we use DSPGAN to upsample 16 kHz mel-spectrograms to high-fidelity 48 kHz waveforms. In Track 2, we propose a background audio generator based on large language models (LLMs). This system produces scene-appropriate accompaniment descriptions, synthesizes background audio with Tango 2, and integrates it with the speech generated by our Track 1 system. Our submission achieves second place in Track 1 and first place in Track 2.

Submitted 31 October, 2024; originally announced October 2024.

Comments: accepted by ISCSLP 2024

arXiv:2410.17609 [pdf, other]

Exploring the Impact of RIS on Cooperative NOMA URLLC Systems: A Theoretical Perspective

Authors: Jianchao Zheng, Tuo Wu, Junteng Yao, Chau Yuen, Zhiguo Ding, Fumiyuki Adachi

Abstract: In this paper, we conduct a theoretical analysis of how to integrate reconfigurable intelligent surfaces (RIS) with cooperative non-orthogonal multiple access (NOMA), considering URLLC. We consider a downlink two-user cooperative NOMA system employing short-packet communications, where the two users are denoted by the central user (CU) and the cell-edge user (CEU), respectively, and an RIS is deployed to enhance signal quality. Specifically, compared to the CEU, the CU lies nearer to the BS and enjoys higher channel gains. Closed-form expressions for the CU's average block error rate (BLER) are derived. Furthermore, we evaluate the CEU's BLER performance utilizing selective combining (SC) and derive a tight lower bound under maximum ratio combining (MRC). Simulation results are provided to validate our analyses and demonstrate that the RIS-assisted system significantly outperforms its counterpart without RIS in terms of BLER. Notably, MRC achieves a squared multiple of the diversity gain of the SC, leading to more reliable performance, especially for the CEU. Furthermore, by dividing the RIS into two zones, each dedicated to a specific user, the average BLER can be further reduced, particularly for the CEU.

Submitted 23 October, 2024; originally announced October 2024.

arXiv:2410.02371 [pdf, other]

doi 10.21437/SPSC.2024-13

NTU-NPU System for Voice Privacy 2024 Challenge

Authors: Nikita Kuzmin, Hieu-Thi Luong, Jixun Yao, Lei Xie, Kong Aik Lee, Eng Siong Chng

Abstract: In this work, we describe our submissions for the Voice Privacy Challenge 2024. Rather than proposing a novel speech anonymization system, we enhance the provided baselines to meet all required conditions and improve evaluated metrics. Specifically, we implement emotion embedding and experiment with WavLM and ECAPA2 speaker embedders for the B3 baseline. Additionally, we compare different speaker and prosody anonymization techniques. Furthermore, we introduce Mean Reversion F0 for B5, which helps to enhance privacy without a loss in utility. Finally, we explore disentanglement models, namely $β$-VAE and NaturalSpeech3 FACodec.

Submitted 3 October, 2024; originally announced October 2024.

Comments: System description for VPC 2024

Journal ref: 2024 Challenge. Proc. 4th Symposium on Security and Privacy in Speech Communication, 72-79

arXiv:2410.01350 [pdf, other]

Takin-VC: Expressive Zero-Shot Voice Conversion via Adaptive Hybrid Content Encoding and Enhanced Timbre Modeling

Authors: Yuguang Yang, Yu Pan, Jixun Yao, Xiang Zhang, Jianhao Ye, Hongbin Zhou, Lei Xie, Lei Ma, Jianjun Zhao

Abstract: Expressive zero-shot voice conversion (VC) is a critical and challenging task that aims to transform the source timbre into an arbitrary unseen speaker while preserving the original content and expressive qualities. Despite recent progress in zero-shot VC, there remains considerable potential for improvements in speaker similarity and speech naturalness. Moreover, existing zero-shot VC systems struggle to fully reproduce paralinguistic information in highly expressive speech, such as breathing, crying, and emotional nuances, limiting their practical applicability. To address these issues, we propose Takin-VC, a novel expressive zero-shot VC framework via adaptive hybrid content encoding and memory-augmented context-aware timbre modeling. Specifically, we introduce an innovative hybrid content encoder that incorporates an adaptive fusion module, capable of effectively integrating quantized features of the pre-trained WavLM and HybridFormer in an implicit manner, so as to extract precise linguistic features while enriching paralinguistic elements. For timbre modeling, we propose advanced memory-augmented and context-aware modules to generate high-quality target timbre features and fused representations that seamlessly align source content with target timbre. To enhance real-time performance, we advocate a conditional flow matching model to reconstruct the Mel-spectrogram of the source speech. Experimental results show that our Takin-VC consistently surpasses state-of-the-art VC systems, achieving notable improvements in terms of speech naturalness, speech expressiveness, and speaker similarity, while offering enhanced inference speed.

Submitted 10 January, 2025; v1 submitted 2 October, 2024; originally announced October 2024.

Comments: Work in Progress; Under Review

arXiv:2409.12139 [pdf, other]

Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models

Authors: Sijing Chen, Yuan Feng, Laipeng He, Tianwei He, Wendi He, Yanni Hu, Bin Lin, Yiting Lin, Yu Pan, Pengfei Tan, Chengwei Tian, Chen Wang, Zhicheng Wang, Ruoye Xie, Jixun Yao, Quanlei Yan, Yuguang Yang, Jianhao Ye, Jingjing Yin, Yanzhen Yu, Huimin Zhang, Xiang Zhang, Guangcheng Zhao, Hongbin Zhou, Pengpeng Zou

Abstract: With the advent of the big data and large language model era, zero-shot personalized rapid customization has emerged as a significant trend. In this report, we introduce Takin AudioLLM, a series of techniques and models, mainly including Takin TTS, Takin VC, and Takin Morphing, specifically designed for audiobook production. These models are capable of zero-shot speech production, generating high-quality speech that is nearly indistinguishable from real human speech and facilitating individuals to customize the speech content according to their own needs. Specifically, we first introduce Takin TTS, a neural codec language model that builds upon an enhanced neural speech codec and a multi-task training framework, capable of generating high-fidelity natural speech in a zero-shot way. For Takin VC, we advocate an effective content and timbre joint modeling approach to improve the speaker similarity, while advocating for a conditional flow matching based decoder to further enhance its naturalness and expressiveness. Last, we propose the Takin Morphing system with highly decoupled and advanced timbre and prosody modeling approaches, which enables individuals to customize speech production with their preferred timbre and prosody in a precise and controllable manner. Extensive experiments validate the effectiveness and robustness of our Takin AudioLLM series models. For detailed demos, please refer to https://everest-ai.github.io/takinaudiollm/.

Submitted 23 September, 2024; v1 submitted 18 September, 2024; originally announced September 2024.

Comments: Technical Report; 18 pages; typos corrected, references added, demo url modified, author name modified;

arXiv:2409.04173 [pdf, other]

NPU-NTU System for Voice Privacy 2024 Challenge

Authors: Jixun Yao, Nikita Kuzmin, Qing Wang, Pengcheng Guo, Ziqian Ning, Dake Guo, Kong Aik Lee, Eng-Siong Chng, Lei Xie

Abstract: Speaker anonymization is an effective privacy protection solution that conceals the speaker's identity while preserving the linguistic content and paralinguistic information of the original speech. To establish a fair benchmark and facilitate comparison of speaker anonymization systems, the VoicePrivacy Challenge (VPC) was held in 2020 and 2022, with a new edition planned for 2024. In this paper, we describe our proposed speaker anonymization system for VPC 2024. Our system employs a disentangled neural codec architecture and a serial disentanglement strategy to gradually disentangle the global speaker identity and time-variant linguistic content and paralinguistic information. We introduce multiple distillation methods to disentangle linguistic content, speaker identity, and emotion. These methods include semantic distillation, supervised speaker distillation, and frame-level emotion distillation. Based on these distillations, we anonymize the original speaker identity using a weighted sum of a set of candidate speaker identities and a randomly generated speaker identity. Our system achieves the best trade-off of privacy protection and emotion preservation in VPC 2024.

Submitted 4 February, 2025; v1 submitted 6 September, 2024; originally announced September 2024.

Comments: System description for VPC 2024

arXiv:2408.15474 [pdf, other]

Drop the beat! Freestyler for Accompaniment Conditioned Rapping Voice Generation

Authors: Ziqian Ning, Shuai Wang, Yuepeng Jiang, Jixun Yao, Lei He, Shifeng Pan, Jie Ding, Lei Xie

Abstract: Rap, a prominent genre of vocal performance, remains underexplored in vocal generation. General vocal synthesis depends on precise note and duration inputs, requiring users to have related musical knowledge, which limits flexibility. In contrast, rap typically features simpler melodies, with a core focus on a strong rhythmic sense that harmonizes with accompanying beats. In this paper, we propose Freestyler, the first system that generates rapping vocals directly from lyrics and accompaniment inputs. Freestyler utilizes language model-based token generation, followed by a conditional flow matching model to produce spectrograms and a neural vocoder to restore audio. It allows a 3-second prompt to enable zero-shot timbre control. Due to the scarcity of publicly available rap datasets, we also present RapBank, a rap song dataset collected from the internet, alongside a meticulously designed processing pipeline. Experimental results show that Freestyler produces high-quality rapping voice generation with enhanced naturalness and strong alignment with accompanying beats, both stylistically and rhythmically.

Submitted 27 August, 2024; originally announced August 2024.

arXiv:2408.13447 [pdf, ps, other]

FAS-RIS Communication: Model, Analysis, and Optimization

Authors: Junteng Yao, Jianchao Zheng, Tuo Wu, Ming Jin, Chau Yuen, Kai-Kit Wong, Fumiyuki Adachi

Abstract: This correspondence investigates the novel fluid antenna system (FAS) technology, combined with a reconfigurable intelligent surface (RIS) for wireless communications, where a base station (BS) communicates with a FAS-enabled user with the assistance of a RIS. To analyze this technology, we derive the outage probability based on the block-diagonal matrix approximation (BDMA) model. With this, we obtain the upper bound, lower bound, and asymptotic approximation of the outage probability to gain more insights. Moreover, we design the phase shift matrix of the RIS in order to minimize the system outage probability. Simulation results confirm the accuracy of our approximations and that the proposed schemes outperform benchmarks significantly.

Submitted 23 August, 2024; originally announced August 2024.

arXiv:2408.13444 [pdf, ps, other]

FAS-RIS: A Block-Correlation Model Analysis

Authors: Xiazhi Lai, Junteng Yao, Kangda Zhi, Tuo Wu, David Morales-Jimenez, Kai-Kit Wong

Abstract: In this correspondence, we analyze the performance of a reconfigurable intelligent surface (RIS)-aided communication system that involves a fluid antenna system (FAS)-enabled receiver. By applying the central limit theorem (CLT), we derive approximate expressions for the system outage probability when the RIS has a large number of elements. Also, we adopt the block-correlation channel model to simplify the outage probability expressions, reducing the computational complexity and shedding light on the impact of the number of ports. Numerical results validate the effectiveness of our analysis, especially in scenarios with a large number of RIS elements.

Submitted 23 August, 2024; originally announced August 2024.

arXiv:2408.12162 [pdf, ps, other]

doi 10.1007/s11432-024-4160-3

Empowering Over-the-Air Personalized Federated Learning via RIS

Authors: Wei Shi, Jiacheng Yao, Jindan Xu, Wei Xu, Lexi Xu, Chunming Zhao

Abstract: Over-the-air computation (AirComp) integrates analog communication with task-oriented computation, serving as a key enabling technique for communication-efficient federated learning (FL) over wireless networks. However, AirComp-enabled FL (AirFL) with a single global consensus model fails to address the data heterogeneity in real-life FL scenarios with non-independent and identically distributed local datasets. In this paper, we introduce reconfigurable intelligent surface (RIS) technology to enable efficient personalized AirFL, mitigating the data heterogeneity issue. First, we achieve statistical interference elimination across different clusters in the personalized AirFL framework via RIS phase shift configuration. Then, we propose two personalized aggregation schemes involving power control and denoising factor design from the perspectives of first- and second-order moments, respectively, to enhance the FL convergence. Numerical results validate the superior performance of our proposed schemes over existing baselines.

Submitted 22 August, 2024; originally announced August 2024.

Comments: Accepted by SCIENCE CHINA Information Sciences

arXiv:2408.09067 [pdf, ps, other]

FAS vs. ARIS: Which Is More Important for FAS-ARIS Communication Systems?

Authors: Junteng Yao, Liaoshi Zhou, Tuo Wu, Ming Jin, Chongwen Huang, Chau Yuen

Abstract: In this paper, we investigate the question of which technology, fluid antenna systems (FAS) or active reconfigurable intelligent surfaces (ARIS), plays a more crucial role in FAS-ARIS wireless communication systems. To address this, we develop a comprehensive system model and explore the problem from an optimization perspective. We introduce an alternating optimization (AO) algorithm incorporating majorization-minimization (MM), successive convex approximation (SCA), and sequential rank-one constraint relaxation (SRCR) to tackle the non-convex challenges inherent in these systems. Specifically, for optimizing the transmit beamforming of the BS, we propose a closed-form rank-one solution with low complexity. For optimizing the positions of the fluid antennas (FAs) of the BS, the Taylor expansions and MM algorithm are utilized to construct effective lower bounds and upper bounds of the objective function and constraints, transforming the non-convex optimization problem into a convex one. Furthermore, we use the SCA and SRCR to optimize the reflection coefficient matrix of the ARIS and effectively solve the rank-one constraint. Simulation results reveal that the relative importance of FAS and ARIS varies depending on the scenario: FAS proves more critical in simpler models with fewer reflecting elements or limited transmission paths, while ARIS becomes more significant in complex scenarios with a higher number of reflecting elements or transmission paths. Ultimately, the integration of both FAS and ARIS creates a win-win scenario, resulting in a more robust and efficient communication system. This study underscores the importance of combining FAS with ARIS, as their complementary use provides the most substantial benefits across different communication environments.

Submitted 16 August, 2024; originally announced August 2024.

arXiv:2407.18054 [pdf, other]

LKCell: Efficient Cell Nuclei Instance Segmentation with Large Convolution Kernels

Authors: Ziwei Cui, Jingfeng Yao, Lunbin Zeng, Juan Yang, Wenyu Liu, Xinggang Wang

Abstract: The segmentation of cell nuclei in tissue images stained with the blood dye hematoxylin and eosin (H$\&$E) is essential for various clinical applications and analyses. Due to the complex characteristics of cellular morphology, a large receptive field is considered crucial for generating high-quality segmentation. However, previous methods face challenges in achieving a balance between the receptive field and computational burden. To address this issue, we propose LKCell, a high-accuracy and efficient cell segmentation method. Its core insight lies in unleashing the potential of large convolution kernels to achieve computationally efficient large receptive fields. Specifically, (1) We transfer pre-trained large convolution kernel models to the medical domain for the first time, demonstrating their effectiveness in cell segmentation. (2) We analyze the redundancy of previous methods and design a new segmentation decoder based on large convolution kernels. It achieves higher performance while significantly reducing the number of parameters. We evaluate our method on the most challenging benchmark and achieve state-of-the-art results (0.5080 mPQ) in cell nuclei instance segmentation with only 21.6% FLOPs compared with the previous leading method. Our source code and models are available at https://github.com/hustvl/LKCell.

Submitted 25 July, 2024; originally announced July 2024.

arXiv:2407.17460 [pdf, other]

SoNIC: Safe Social Navigation with Adaptive Conformal Inference and Constrained Reinforcement Learning

Authors: Jianpeng Yao, Xiaopan Zhang, Yu Xia, Zejin Wang, Amit K. Roy-Chowdhury, Jiachen Li

Abstract: Reinforcement learning (RL) enables social robots to generate trajectories without relying on human-designed rules or interventions, making it generally more effective than rule-based systems in adapting to complex, dynamic real-world scenarios. However, social navigation is a safety-critical task that requires robots to avoid collisions with pedestrians, whereas existing RL-based solutions often fall short of ensuring safety in complex environments. In this paper, we propose SoNIC, which to the best of our knowledge is the first algorithm that integrates adaptive conformal inference (ACI) with constrained reinforcement learning (CRL) to enable safe policy learning for social navigation. Specifically, our method not only augments RL observations with ACI-generated nonconformity scores, which inform the agent of the quantified uncertainty but also employs these uncertainty estimates to effectively guide the behaviors of RL agents by using constrained reinforcement learning. This integration regulates the behaviors of RL agents and enables them to handle safety-critical situations. On the standard CrowdNav benchmark, our method achieves a success rate of 96.93%, which is 11.67% higher than the previous state-of-the-art RL method and results in 4.5 times fewer collisions and 2.8 times fewer intrusions to ground-truth human future trajectories as well as enhanced robustness in out-of-distribution scenarios. To further validate our approach, we deploy our algorithm on a real robot by developing a ROS2-based navigation system. Our experiments demonstrate that the system can generate robust and socially polite decision-making when interacting with both sparse and dense crowds. The video demos can be found on our project website: https://sonic-social-nav.github.io/.

Submitted 6 February, 2025; v1 submitted 24 July, 2024; originally announced July 2024.

Comments: Project website: https://sonic-social-nav.github.io/; 16 pages

arXiv:2407.12648 [pdf, ps, other]

Blind Beamforming for Coverage Enhancement with Intelligent Reflecting Surface

Authors: Fan Xu, Jiawei Yao, Wenhai Lai, Kaiming Shen, Xin Li, Xin Chen, Zhi-Quan Luo

Abstract: Conventional policy for configuring an intelligent reflecting surface (IRS) typically requires channel state information (CSI), thus incurring substantial overhead costs and facing incompatibility with the current network protocols. This paper proposes a blind beamforming strategy in the absence of CSI, aiming to boost the minimum signal-to-noise ratio (SNR) among all the receiver positions, namely the coverage enhancement. Although some existing works already consider the IRS-assisted coverage enhancement without CSI, they assume certain position-channel models through which the channels can be recovered from the geographic locations. In contrast, our approach solely relies on the received signal power data, not assuming any position-channel model. We examine the achievability and converse of the proposed blind beamforming method. If the IRS has $N$ reflective elements and there are $U$ receiver positions, then our method guarantees the minimum SNR of $Ω(N^2/U)$ -- which is fairly close to the upper bound $O(N+N^2\sqrt{\ln (NU)}/\sqrt[4]{U})$. Aside from the simulation results, we justify the practical use of blind beamforming in a field test at 2.6 GHz. According to the real-world experiment, the proposed blind beamforming method boosts the minimum SNR across seven random positions in a conference room by 18.22 dB, while the position-based method yields a boost of 12.08 dB.

Submitted 17 July, 2024; originally announced July 2024.

Comments: 17 pages

arXiv:2407.11629 [pdf, other]

MUSA: Multi-lingual Speaker Anonymization via Serial Disentanglement

Authors: Jixun Yao, Qing Wang, Pengcheng Guo, Ziqian Ning, Yuguang Yang, Yu Pan, Lei Xie

Abstract: Speaker anonymization is an effective privacy protection solution designed to conceal the speaker's identity while preserving the linguistic content and para-linguistic information of the original speech. While most prior studies focus solely on a single language, an ideal speaker anonymization system should be capable of handling multiple languages. This paper proposes MUSA, a Multi-lingual Speaker Anonymization approach that employs a serial disentanglement strategy to perform a step-by-step disentanglement from a global time-invariant representation to a temporal time-variant representation. By utilizing semantic distillation and self-supervised speaker distillation, the serial disentanglement strategy can avoid strong inductive biases and exhibit superior generalization performance across different languages. Meanwhile, we propose a straightforward anonymization strategy that employs an empty embedding with zero values to simulate the speaker identity concealment process, eliminating the need for conversion to a pseudo-speaker identity and thereby reducing the complexity of the speaker anonymization process. Experimental results on VoicePrivacy official datasets and multi-lingual datasets demonstrate that MUSA can effectively protect speaker privacy while preserving linguistic content and para-linguistic information.

Submitted 16 July, 2024; originally announced July 2024.

Comments: Submitted to TASLP

arXiv:2407.11307 [pdf, ps, other]

Fluid Antenna-Assisted Simultaneous Wireless Information and Power Transfer Systems

Authors: Liaoshi Zhou, Junteng Yao, Tuo Wu, Ming Jin, Chau Yuen, Fumiyuki Adachi

Abstract: This paper examines a fluid antenna (FA)-assisted simultaneous wireless information and power transfer (SWIPT) system. Unlike traditional SWIPT systems with fixed-position antennas (FPAs), our FA-assisted system enables dynamic reconfiguration of the radio propagation environment by adjusting the positions of FAs. This capability enhances both energy harvesting and communication performance. The system comprises a base station (BS) equipped with multiple FAs that transmit signals to an energy receiver (ER) and an information receiver (IR), both equipped with a single FA. Our objective is to maximize the communication rate between the BS and the IR while satisfying the harvested power requirement of the ER. This involves jointly optimizing the BS's transmit beamforming and the positions of all FAs. To address this complex convex optimization problem, we employ an alternating optimization (AO) approach, decomposing it into three sub-problems and solving them iteratively using first and second-order Taylor expansions. Simulation results validate the effectiveness of our proposed FA-assisted SWIPT system, demonstrating significant performance improvements over traditional FPA-based systems.

Submitted 23 July, 2024; v1 submitted 15 July, 2024; originally announced July 2024.

arXiv:2407.08141 [pdf, ps, other]

A Framework of FAS-RIS Systems: Performance Analysis and Throughput Optimization

Authors: Junteng Yao, Xiazhi Lai, Kangda Zhi, Tuo Wu, Ming Jin, Cunhua Pan, Maged Elkashlan, Chau Yuen, Kai-Kit Wong

Abstract: In this paper, we investigate reconfigurable intelligent surface (RIS)-assisted communication systems which involve a fixed-antenna base station (BS) and a mobile user (MU) that is equipped with a fluid antenna system (FAS). Specifically, the RIS is utilized to enable communication for the user whose direct link from the base station is blocked by obstacles. We propose a comprehensive framework that provides transmission design for both static scenarios with the knowledge of channel state information (CSI) and harsh environments where CSI is hard to acquire. It leads to two approaches: a CSI-based scheme where CSI is available, and a CSI-free scheme when CSI is inaccessible. Given the complex spatial correlations in FAS, we employ block-diagonal matrix approximation and independent antenna equivalent models to simplify the derivation of outage probabilities in both cases. Based on the derived outage probabilities, we then optimize the throughput of the FAS-RIS system. For the CSI-based scheme, we first propose a gradient ascent-based algorithm to obtain a near-optimal solution. Then, to address the possible high computational complexity in the gradient algorithm, we approximate the objective function and confirm a unique optimal solution accessible through a bisection search method. For the CSI-free scheme, we apply the partial gradient ascent algorithm, reducing complexity further than full gradient algorithms. We also approximate the objective function and derive a locally optimal closed-form solution to maximize throughput. Simulation results validate the effectiveness of the proposed framework for the transmission design in FAS-RIS systems.

Submitted 10 July, 2024; originally announced July 2024.

Comments: submitted to IEEE journal for possible publication

arXiv:2407.00718 [pdf, other]

ASPS: Augmented Segment Anything Model for Polyp Segmentation

Authors: Huiqian Li, Dingwen Zhang, Jieru Yao, Longfei Han, Zhongyu Li, Junwei Han

Abstract: Polyp segmentation plays a pivotal role in colorectal cancer diagnosis. Recently, the emergence of the Segment Anything Model (SAM) has introduced unprecedented potential for polyp segmentation, leveraging its powerful pre-training capability on large-scale datasets. However, due to the domain gap between natural and endoscopy images, SAM encounters two limitations in achieving effective performance in polyp segmentation. Firstly, its Transformer-based structure prioritizes global and low-frequency information, potentially overlooking local details, and introducing bias into the learned features. Secondly, when applied to endoscopy images, its poor out-of-distribution (OOD) performance results in substandard predictions and biased confidence output. To tackle these challenges, we introduce a novel approach named Augmented SAM for Polyp Segmentation (ASPS), equipped with two modules: Cross-branch Feature Augmentation (CFA) and Uncertainty-guided Prediction Regularization (UPR). CFA integrates a trainable CNN encoder branch with a frozen ViT encoder, enabling the integration of domain-specific knowledge while enhancing local features and high-frequency details. Moreover, UPR ingeniously leverages SAM's IoU score to mitigate uncertainty during the training procedure, thereby improving OOD performance and domain generalization. Extensive experimental results demonstrate the effectiveness and utility of the proposed method in improving SAM's performance in polyp segmentation. Our code is available at https://github.com/HuiqianLi/ASPS.

Submitted 30 June, 2024; originally announced July 2024.

Comments: Accepted by MICCAI 2024

Showing 1–50 of 145 results for author: Yao, J