Search | arXiv e-print repository

Joint Beamforming for NOMA Assisted Pinching Antenna Systems (PASS)

Authors: Deqiao Gan, Xiaoxia Xu, Jiakuo Zuo, Xiaohu Ge, Yuanwei Liu

Abstract: Pinching antenna system (PASS) configures the positions of pinching antennas (PAs) along dielectric waveguides to change both large-scale fading and small-scale scattering, which is known as pinching beamforming. A novel non-orthogonal multiple access (NOMA) assisted PASS framework is proposed for downlink multi-user multiple-input multiple-output (MIMO) communications. The transmit power minimization problem is formulated to jointly optimize the transmit beamforming, pinching beamforming, and power allocation. To solve this highly nonconvex problem, both gradient-based and swarm-based optimization methods are developed. 1) For the gradient-based method, a majorization-minimization and penalty dual decomposition (MM-PDD) algorithm is developed. The Lipschitz gradient surrogate function is constructed based on MM to tackle the nonconvex terms of this problem. Then, the joint optimization problem is decomposed into subproblems that are alternately optimized based on PDD to obtain stationary closed-form solutions. 2) For the swarm-based method, a fast-convergent particle swarm optimization and zero forcing (PSO-ZF) algorithm is proposed. Specifically, the PA position-seeking particles are constructed to explore high-quality pinching beamforming solutions. Moreover, ZF-based transmit beamforming is utilized by each particle for fast fitness function evaluation. Simulation results demonstrate that: i) The proposed NOMA assisted PASS and algorithms outperform the conventional NOMA assisted massive antenna system. The proposed framework reduces transmit power by over 95.22% compared to conventional massive MIMO-NOMA systems. ii) Swarm-based optimization outperforms gradient-based optimization by searching an effective solution subspace to avoid getting stuck in undesirable local optima.

Submitted 3 June, 2025; originally announced June 2025.

arXiv:2506.01014 [pdf, ps, other]

Rhythm Controllable and Efficient Zero-Shot Voice Conversion via Shortcut Flow Matching

Authors: Jialong Zuo, Shengpeng Ji, Minghui Fang, Mingze Li, Ziyue Jiang, Xize Cheng, Xiaoda Yang, Chen Feiyang, Xinyu Duan, Zhou Zhao

Abstract: Zero-Shot Voice Conversion (VC) aims to transform the source speaker's timbre into an arbitrary unseen one while retaining speech content. Most prior work focuses on preserving the source's prosody, while fine-grained timbre information may leak through prosody, and transferring target prosody to synthesized speech is rarely studied. In light of this, we propose R-VC, a rhythm-controllable and efficient zero-shot voice conversion model. R-VC employs data perturbation techniques and discretizes source speech into Hubert content tokens, eliminating much content-irrelevant information. By leveraging a Mask Generative Transformer for in-context duration modeling, our model adapts the linguistic content duration to the desired target speaking style, facilitating the transfer of the target speaker's rhythm. Furthermore, R-VC introduces a powerful Diffusion Transformer (DiT) with shortcut flow matching during training, conditioning the network not only on the current noise level but also on the desired step size, enabling high timbre similarity and quality speech generation in fewer sampling steps, even in just two, thus minimizing latency. Experimental results show that R-VC achieves comparable speaker similarity to state-of-the-art VC methods with a smaller dataset, and surpasses them in terms of speech naturalness, intelligibility and style transfer performance.

Submitted 1 June, 2025; originally announced June 2025.

Comments: Accepted by ACL 2025 (Main Conference)

arXiv:2505.24496 [pdf, other]

Speech Token Prediction via Compressed-to-fine Language Modeling for Speech Generation

Authors: Wenrui Liu, Qian Chen, Wen Wang, Yafeng Chen, Jin Xu, Zhifang Guo, Guanrou Yang, Weiqin Li, Xiaoda Yang, Tao Jin, Minghui Fang, Jialong Zuo, Bai Jionghao, Zemin Liu

Abstract: Neural audio codecs, used as speech tokenizers, have demonstrated remarkable potential in the field of speech generation. However, to ensure high-fidelity audio reconstruction, neural audio codecs typically encode audio into long sequences of speech tokens, posing a significant challenge for downstream language models in long-context modeling. We observe that speech token sequences exhibit short-range dependency: due to the monotonic alignment between text and speech in text-to-speech (TTS) tasks, the prediction of the current token primarily relies on its local context, while long-range tokens contribute less to the current token prediction and often contain redundant information. Inspired by this observation, we propose a compressed-to-fine language modeling approach to address the challenge of long sequence speech tokens within neural codec language models: (1) Fine-grained Initial and Short-range Information: Our approach retains the prompt and local tokens during prediction to ensure text alignment and the integrity of paralinguistic information; (2) Compressed Long-range Context: Our approach compresses long-range token spans into compact representations to reduce redundant information while preserving essential semantics. Extensive experiments on various neural audio codecs and downstream language models validate the effectiveness and generalizability of the proposed approach, highlighting the importance of token compression in improving speech generation within neural codec language models. The demo of audio samples will be available at https://anonymous.4open.science/r/SpeechTokenPredictionViaCompressedToFinedLM.

Submitted 30 May, 2025; originally announced May 2025.

arXiv:2505.09558 [pdf, other]

WavReward: Spoken Dialogue Models With Generalist Reward Evaluators

Authors: Shengpeng Ji, Tianle Liang, Yangzhuo Li, Jialong Zuo, Minghui Fang, Jinzheng He, Yifu Chen, Zhengqing Liu, Ziyue Jiang, Xize Cheng, Siqi Zheng, Jin Xu, Junyang Lin, Zhou Zhao

Abstract: End-to-end spoken dialogue models such as GPT-4o-audio have recently garnered significant attention in the speech domain. However, the evaluation of spoken dialogue models' conversational performance has largely been overlooked. This is primarily because intelligent chatbots convey a wealth of non-textual information that cannot be easily measured using text-based language models like ChatGPT. To address this gap, we propose WavReward, a reward feedback model based on audio language models that can evaluate both the IQ and EQ of spoken dialogue systems with speech input. Specifically, 1) based on audio language models, WavReward incorporates the deep reasoning process and the nonlinear reward mechanism for post-training. By utilizing multi-sample feedback via the reinforcement learning algorithm, we construct a specialized evaluator tailored to spoken dialogue models. 2) We introduce ChatReward-30K, a preference dataset used to train WavReward. ChatReward-30K includes both comprehension and generation aspects of spoken dialogue models. These scenarios span various tasks, such as text-based chats, nine acoustic attributes of instruction chats, and implicit chats. WavReward outperforms previous state-of-the-art evaluation models across multiple spoken dialogue scenarios, achieving a substantial improvement over Qwen2.5-Omni in objective accuracy from 55.1% to 91.5%. In subjective A/B testing, WavReward also leads by a margin of 83%. Comprehensive ablation studies confirm the necessity of each component of WavReward. All data and code will be made publicly available at https://github.com/jishengpeng/WavReward after the paper is accepted.

Submitted 14 May, 2025; originally announced May 2025.

arXiv:2504.20653 [pdf, other]

ComplexVCoder: An LLM-Driven Framework for Systematic Generation of Complex Verilog Code

Authors: Jian Zuo, Junzhe Liu, Xianyong Wang, Yicheng Liu, Navya Goli, Tong Xu, Hao Zhang, Umamaheswara Rao Tida, Zhenge Jia, Mengying Zhao

Abstract: Recent advances have demonstrated the promising capabilities of large language models (LLMs) in generating register-transfer level (RTL) code, such as Verilog. However, existing LLM-based frameworks still face significant challenges in accurately handling the complexity of real-world RTL designs, particularly those that are large-scale and involve multi-level module instantiations. To address this issue, we present ComplexVCoder, an open-source LLM-driven framework that enhances both the generation quality and efficiency of complex Verilog code. Specifically, we introduce a two-stage generation mechanism, which leverages an intermediate representation to enable a more accurate and structured transition from natural language descriptions to intricate Verilog designs. In addition, we introduce a rule-based alignment method and a domain-specific retrieval-augmented generation (RAG) to further improve the correctness of the synthesized code by incorporating relevant design knowledge during generation. To evaluate our approach, we construct a comprehensive dataset comprising 55 complex Verilog designs derived from real-world implementations. We also release an open-source benchmark suite for systematically assessing the quality of auto-generated RTL code together with the ComplexVCoder framework. Experimental results show that ComplexVCoder outperforms SOTA frameworks such as CodeV and RTLCoder by 14.6% and 22.2%, respectively, in terms of functional correctness on complex Verilog benchmarks. Furthermore, ComplexVCoder achieves comparable generation performance in terms of functional correctness using a lightweight 32B model (Qwen2.5), rivaling larger-scale models such as GPT-3.5 and DeepSeek-V3.

Submitted 29 April, 2025; originally announced April 2025.

arXiv:2502.18924 [pdf, other]

MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis

Authors: Ziyue Jiang, Yi Ren, Ruiqi Li, Shengpeng Ji, Boyang Zhang, Zhenhui Ye, Chen Zhang, Bai Jionghao, Xiaoda Yang, Jialong Zuo, Yu Zhang, Rui Liu, Xiang Yin, Zhou Zhao

Abstract: While recent zero-shot text-to-speech (TTS) models have significantly improved speech quality and expressiveness, mainstream systems still suffer from issues related to speech-text alignment modeling: 1) models without explicit speech-text alignment modeling exhibit less robustness, especially for hard sentences in practical applications; 2) predefined alignment-based models suffer from naturalness constraints of forced alignments. This paper introduces MegaTTS 3, a TTS system featuring an innovative sparse alignment algorithm that guides the latent diffusion transformer (DiT). Specifically, we provide sparse alignment boundaries to MegaTTS 3 to reduce the difficulty of alignment without limiting the search space, thereby achieving high naturalness. Moreover, we employ a multi-condition classifier-free guidance strategy for accent intensity adjustment and adopt the piecewise rectified flow technique to accelerate the generation process. Experiments demonstrate that MegaTTS 3 achieves state-of-the-art zero-shot TTS speech quality and supports highly flexible control over accent intensity. Notably, our system can generate high-quality one-minute speech with only 8 sampling steps. Audio samples are available at https://sditdemo.github.io/sditdemo/.

Submitted 28 March, 2025; v1 submitted 26 February, 2025; originally announced February 2025.

arXiv:2502.05471 [pdf, other]

Enhancing Expressive Voice Conversion with Discrete Pitch-Conditioned Flow Matching Model

Authors: Jialong Zuo, Shengpeng Ji, Minghui Fang, Ziyue Jiang, Xize Cheng, Qian Yang, Wenrui Liu, Guangyan Zhang, Zehai Tu, Yiwen Guo, Zhou Zhao

Abstract: This paper introduces PFlow-VC, a conditional flow matching voice conversion model that leverages fine-grained discrete pitch tokens and target speaker prompt information for expressive voice conversion (VC). Previous VC works primarily focus on speaker conversion, with further exploration needed in enhancing expressiveness (such as prosody and emotion) for timbre conversion. Unlike previous methods, we adopt a simple and efficient approach to enhance the style expressiveness of voice conversion models. Specifically, we pretrain a self-supervised pitch VQVAE model to discretize speaker-irrelevant pitch information and leverage a masked pitch-conditioned flow matching model for Mel-spectrogram synthesis, which provides in-context pitch modeling capabilities for the speaker conversion model, effectively improving the voice style transfer capacity. Additionally, we improve timbre similarity by combining global timbre embeddings with time-varying timbre tokens. Experiments on unseen LibriTTS test-clean and emotional speech dataset ESD show the superiority of the PFlow-VC model in both timbre conversion and style transfer. Audio samples are available on the demo page https://speechai-demo.github.io/PFlow-VC/.

Submitted 8 February, 2025; originally announced February 2025.

Comments: Accepted by ICASSP 2025

arXiv:2412.13917 [pdf, other]

Speech Watermarking with Discrete Intermediate Representations

Authors: Shengpeng Ji, Ziyue Jiang, Jialong Zuo, Minghui Fang, Yifu Chen, Tao Jin, Zhou Zhao

Abstract: Speech watermarking techniques can proactively mitigate the potential harmful consequences of instant voice cloning techniques. These techniques involve the insertion of signals into speech that are imperceptible to humans but can be detected by algorithms. Previous approaches typically embed watermark messages into continuous space. However, intuitively, embedding watermark information into robust discrete latent space can significantly improve the robustness of watermarking systems. In this paper, we propose DiscreteWM, a novel speech watermarking framework that injects watermarks into the discrete intermediate representations of speech. Specifically, we map speech into discrete latent space with a vector-quantized autoencoder and inject watermarks by changing the modular arithmetic relation of discrete IDs. To ensure the imperceptibility of watermarks, we also propose a manipulator model to select the candidate tokens for watermark embedding. Experimental results demonstrate that our framework achieves state-of-the-art performance in robustness and imperceptibility, simultaneously. Moreover, our flexible frame-wise approach can serve as an efficient solution for both voice cloning detection and information hiding. Additionally, DiscreteWM can encode 1 to 150 bits of watermark information within a 1-second speech clip, indicating its encoding capacity. Audio samples are available at https://DiscreteWM.github.io/discrete_wm.

Submitted 18 December, 2024; originally announced December 2024.

Comments: Accepted by AAAI 2025

arXiv:2411.13577 [pdf, other]

WavChat: A Survey of Spoken Dialogue Models

Authors: Shengpeng Ji, Yifu Chen, Minghui Fang, Jialong Zuo, Jingyu Lu, Hanting Wang, Ziyue Jiang, Long Zhou, Shujie Liu, Xize Cheng, Xiaoda Yang, Zehan Wang, Qian Yang, Jian Li, Yidi Jiang, Jingzhen He, Yunfei Chu, Jin Xu, Zhou Zhao

Abstract: Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o, have captured significant attention in the speech domain. Compared to traditional three-tier cascaded spoken dialogue models that comprise speech recognition (ASR), large language models (LLMs), and text-to-speech (TTS), modern spoken dialogue models exhibit greater intelligence. These advanced spoken dialogue models not only comprehend audio, music, and other speech-related features, but also capture stylistic and timbral characteristics in speech. Moreover, they generate high-quality, multi-turn speech responses with low latency, enabling real-time interaction through simultaneous listening and speaking capability. Despite the progress in spoken dialogue systems, there is a lack of comprehensive surveys that systematically organize and analyze these systems and the underlying technologies. To address this, we first compile existing spoken dialogue systems in chronological order and categorize them into the cascaded and end-to-end paradigms. We then provide an in-depth overview of the core technologies in spoken dialogue models, covering aspects such as speech representation, training paradigm, streaming, duplex, and interaction capabilities. Each section discusses the limitations of these technologies and outlines considerations for future research. Additionally, we present a thorough review of relevant datasets, evaluation metrics, and benchmarks from the perspectives of training and evaluating spoken dialogue systems. We hope this survey will contribute to advancing both academic research and industrial applications in the field of spoken dialogue systems. The related material is available at https://github.com/jishengpeng/WavChat.

Submitted 26 November, 2024; v1 submitted 14 November, 2024; originally announced November 2024.

Comments: 60 pages, work in progress

arXiv:2410.21269 [pdf, other]

OmniSep: Unified Omni-Modality Sound Separation with Query-Mixup

Authors: Xize Cheng, Siqi Zheng, Zehan Wang, Minghui Fang, Ziang Zhang, Rongjie Huang, Ziyang Ma, Shengpeng Ji, Jialong Zuo, Tao Jin, Zhou Zhao

Abstract: Scaling up has brought tremendous success in the fields of vision and language in recent years. When it comes to audio, however, researchers encounter a major challenge in scaling up the training data, as most natural audio contains diverse interfering signals. To address this limitation, we introduce Omni-modal Sound Separation (OmniSep), a novel framework capable of isolating clean soundtracks based on omni-modal queries, encompassing both single-modal and multi-modal composed queries. Specifically, we introduce the Query-Mixup strategy, which blends query features from different modalities during training. This enables OmniSep to optimize multiple modalities concurrently, effectively bringing all modalities under a unified framework for sound separation. We further enhance this flexibility by allowing queries to influence sound separation positively or negatively, facilitating the retention or removal of specific sounds as desired. Finally, OmniSep employs a retrieval-augmented approach known as Query-Aug, which enables open-vocabulary sound separation. Experimental evaluations on MUSIC, VGGSOUND-CLEAN+, and MUSIC-CLEAN+ datasets demonstrate the effectiveness of OmniSep, achieving state-of-the-art performance in text-, image-, and audio-queried sound separation tasks. For samples and further information, please visit the demo page at https://omnisep.github.io/.

Submitted 28 October, 2024; originally announced October 2024.

Comments: Work in progress

arXiv:2408.16532 [pdf, other]

WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling

Authors: Shengpeng Ji, Ziyue Jiang, Wen Wang, Yifu Chen, Minghui Fang, Jialong Zuo, Qian Yang, Xize Cheng, Zehan Wang, Ruiqi Li, Ziang Zhang, Xiaoda Yang, Rongjie Huang, Yidi Jiang, Qian Chen, Siqi Zheng, Zhou Zhao

Abstract: Language models have been effectively applied to modeling natural signals, such as images, video, speech, and audio. A crucial component of these models is the codec tokenizer, which compresses high-dimensional natural signals into lower-dimensional discrete tokens. In this paper, we introduce WavTokenizer, which offers several advantages over previous SOTA acoustic codec models in the audio domain: 1) Extreme compression. By compressing the layers of quantizers and the temporal dimension of the discrete codec, one second of audio at a 24 kHz sampling rate requires only a single quantizer with 40 or 75 tokens. 2) Improved subjective quality. Despite the reduced number of tokens, WavTokenizer achieves state-of-the-art reconstruction quality with outstanding UTMOS scores and inherently contains richer semantic information. Specifically, we achieve these results by designing a broader VQ space, extended contextual windows, and improved attention networks, as well as introducing a powerful multi-scale discriminator and an inverse Fourier transform structure. We conducted extensive reconstruction experiments in the domains of speech, audio, and music. WavTokenizer exhibited strong performance across various objective and subjective metrics compared to state-of-the-art models. We also tested semantic information, VQ utilization, and adaptability to generative models. Comprehensive ablation studies confirm the necessity of each module in WavTokenizer. The related code, demos, and pre-trained models are available at https://github.com/jishengpeng/WavTokenizer.

Submitted 25 February, 2025; v1 submitted 29 August, 2024; originally announced August 2024.

Comments: Accepted by ICLR 2025
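
The compression figures quoted in the WavTokenizer abstract (24 kHz audio, one quantizer, 40 or 75 tokens per second) can be turned into a quick back-of-the-envelope calculation. The sketch below is only illustrative arithmetic and not the authors' code. The helper names are hypothetical, and the codebook size of 4096 is an assumed value chosen for the example, since the abstract does not state it.

```python
import math

def samples_per_token(sample_rate_hz, tokens_per_second):
    """Number of raw audio samples that each discrete token stands for."""
    return sample_rate_hz / tokens_per_second

def bitrate_bps(tokens_per_second, codebook_size, num_quantizers=1):
    """Bits per second needed to store the token stream.

    codebook_size is an ASSUMPTION here; the abstract does not give it.
    """
    return tokens_per_second * num_quantizers * math.log2(codebook_size)

SAMPLE_RATE_HZ = 24_000        # from the abstract
ASSUMED_CODEBOOK_SIZE = 4096   # hypothetical value, for illustration only

for tps in (40, 75):           # token rates from the abstract
    print(
        f"{tps} tokens/s: "
        f"{samples_per_token(SAMPLE_RATE_HZ, tps):.0f} samples per token, "
        f"{bitrate_bps(tps, ASSUMED_CODEBOOK_SIZE):.0f} bit/s"
    )
```

Under these assumptions, each token covers 600 samples at 40 tokens per second and 320 samples at 75 tokens per second. The bitrate figures change with whatever codebook size is actually used.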

arXiv:2407.14006 [pdf, other]

MSceneSpeech: A Multi-Scene Speech Dataset For Expressive Speech Synthesis

Authors: Qian Yang, Jialong Zuo, Zhe Su, Ziyue Jiang, Mingze Li, Zhou Zhao, Feiyang Chen, Zhefeng Wang, Baoxing Huai

Abstract: We introduce an open source high-quality Mandarin TTS dataset MSceneSpeech (Multiple Scene Speech Dataset), which is intended to provide resources for expressive speech synthesis. MSceneSpeech comprises numerous audio recordings and texts performed and recorded according to daily life scenarios. Each scenario includes multiple speakers and a diverse range of prosodic styles, making it suitable for speech synthesis that entails multi-speaker style and prosody modeling. We have established a robust baseline, through the prompting mechanism, that can effectively synthesize speech characterized by both user-specific timbre and scene-specific prosody with arbitrary text input. The open source MSceneSpeech Dataset and audio samples of our baseline are available at https://speechai-demo.github.io/MSceneSpeech/.

Submitted 18 July, 2024; originally announced July 2024.

Comments: Accepted by INTERSPEECH 2024

arXiv:2406.01205 [pdf, ps, other]

ControlSpeech: Towards Simultaneous and Independent Zero-shot Speaker Cloning and Zero-shot Language Style Control

Authors: Shengpeng Ji, Qian Chen, Wen Wang, Jialong Zuo, Minghui Fang, Ziyue Jiang, Hai Huang, Zehan Wang, Xize Cheng, Siqi Zheng, Zhou Zhao

Abstract: In this paper, we present ControlSpeech, a text-to-speech (TTS) system capable of fully cloning the speaker's voice and enabling arbitrary control and adjustment of speaking style. Prior zero-shot TTS models only mimic the speaker's voice without further control and adjustment capabilities while prior controllable TTS models cannot perform speaker-specific voice generation. Therefore, ControlSpeech focuses on a more challenging task: a TTS system with controllable timbre, content, and style at the same time. ControlSpeech takes speech prompts, content prompts, and style prompts as inputs and utilizes bidirectional attention and mask-based parallel decoding to capture codec representations corresponding to timbre, content, and style in a discrete decoupling codec space. Moreover, we analyze the many-to-many issue in textual style control and propose the Style Mixture Semantic Density (SMSD) module, which is based on Gaussian mixture density networks, to resolve this problem. To facilitate empirical validations, we make available a new style controllable dataset called VccmDataset. Our experimental results demonstrate that ControlSpeech exhibits comparable or state-of-the-art (SOTA) performance in terms of controllability, timbre similarity, audio quality, robustness, and generalizability. The relevant code and demo are available at https://github.com/jishengpeng/ControlSpeech.

Submitted 4 June, 2025; v1 submitted 3 June, 2024; originally announced June 2024.

Comments: ACL 2025 Main

arXiv:2405.00842 [pdf, other]

Quickest Change Detection with Confusing Change

Authors: Yu-Zhen Janice Chen, Jinhang Zuo, Venugopal V. Veeravalli, Don Towsley

Abstract: In the problem of quickest change detection (QCD), a change occurs at some unknown time in the distribution of a sequence of independent observations. This work studies a QCD problem where the change is either a bad change, which we aim to detect, or a confusing change, which is not of interest. Our objective is to detect a bad change as quickly as possible while avoiding raising a false alarm for pre-change or a confusing change. We identify a specific set of pre-change, bad change, and confusing change distributions that pose challenges beyond the capabilities of standard Cumulative Sum (CuSum) procedures. Proposing novel CuSum-based detection procedures, S-CuSum and J-CuSum, leveraging two CuSum statistics, we offer solutions applicable across all kinds of pre-change, bad change, and confusing change distributions. For both S-CuSum and J-CuSum, we provide analytical performance guarantees and validate them by numerical results. Furthermore, both procedures are computationally efficient as they only require simple recursive updates.

Submitted 1 May, 2024; originally announced May 2024.
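
For readers unfamiliar with the "simple recursive updates" mentioned in the quickest change detection abstract, here is a minimal sketch of the standard single-statistic CuSum recursion, W_n = max(0, W_{n-1} + log-likelihood ratio of x_n). It is not the paper's S-CuSum or J-CuSum, which combine two CuSum statistics in ways the abstract does not spell out. The function names, the Gaussian mean-shift model, and the threshold value are all assumptions made for illustration.

```python
def cusum_stopping_time(samples, log_lr, threshold):
    """Return the 1-based index of the first alarm, or None if no alarm.

    Standard CuSum recursion: W_n = max(0, W_{n-1} + log_lr(x_n)).
    An alarm is raised the first time W_n reaches the threshold.
    """
    w = 0.0
    for n, x in enumerate(samples, start=1):
        w = max(0.0, w + log_lr(x))
        if w >= threshold:
            return n
    return None

def gaussian_mean_shift_llr(mu0, mu1, sigma=1.0):
    """Log-likelihood ratio of N(mu1, sigma^2) against N(mu0, sigma^2).

    Hypothetical pre-change / post-change model chosen for this example.
    """
    def llr(x):
        return ((mu1 - mu0) / sigma**2) * (x - (mu0 + mu1) / 2.0)
    return llr

# Toy, noise-free data: mean 0 for 5 samples, then mean 1.
llr = gaussian_mean_shift_llr(mu0=0.0, mu1=1.0)
samples = [0.0] * 5 + [1.0] * 20
alarm = cusum_stopping_time(samples, llr, threshold=3.0)
print("alarm raised at sample", alarm)
```

Each step costs one addition and one comparison, which is the sense in which recursions of this kind are computationally cheap. The threshold sets the usual trade-off between detection delay and false alarm rate.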

arXiv:2403.05557 [pdf, other]

Re-thinking Human Activity Recognition with Hierarchy-aware Label Relationship Modeling

Authors: Jingwei Zuo, Hakim Hacid

Abstract: Human Activity Recognition (HAR) has been studied for decades, from data collection, learning models, to post-processing and result interpretations. However, the inherent hierarchy in the activities remains relatively under-explored, despite its significant impact on model performance and interpretation. In this paper, we propose H-HAR, which rethinks HAR tasks from a fresh perspective by delving into their intricate global label relationships. Rather than building multiple classifiers separately for multi-layered activities, we explore the efficacy of a flat model enhanced with graph-based label relationship modeling. Being hierarchy-aware, the graph-based label modeling enhances the fundamental HAR model by incorporating intricate label relationships into the model. We validate the proposal with a multi-label classifier on complex human activity data. The results highlight the advantages of the proposal, which can be vertically integrated into advanced HAR models to further enhance their performance.

Submitted 11 February, 2024; originally announced March 2024.

Comments: Accepted by PAKDD 2024

arXiv:2402.12208 [pdf, ps, other]

Language-Codec: Bridging Discrete Codec Representations and Speech Language Models

Authors: Shengpeng Ji, Minghui Fang, Jialong Zuo, Ziyue Jiang, Dingdong Wang, Hanting Wang, Hai Huang, Zhou Zhao

Abstract: In recent years, large language models have achieved significant success in generative tasks related to speech, audio, music, and other signal domains. A crucial element of these models is the discrete acoustic codecs, which serve as an intermediate representation replacing the mel-spectrogram. However, there exist several gaps between discrete codecs and downstream speech language models. Specifically, 1) due to the reconstruction paradigm of the Codec model and the structure of residual vector quantization, the initial channel of the codebooks contains excessive information, making it challenging to directly generate acoustic tokens from weakly supervised signals such as text in downstream tasks; 2) numerous codebooks increase the burden on downstream speech language models. Consequently, leveraging the characteristics of speech language models, we propose Language-Codec. In the Language-Codec, we introduce a Masked Channel Residual Vector Quantization (MCRVQ) mechanism along with improved Fourier transform structures and attention blocks, and a refined discriminator design, to address the aforementioned gaps. We compare our method with competing audio compression algorithms and observe significant outperformance across extensive evaluations. Furthermore, we also validate the efficiency of the Language-Codec on downstream speech language models. The source code and pre-trained models can be accessed at https://github.com/jishengpeng/languagecodec.

Submitted 4 June, 2025; v1 submitted 19 February, 2024; originally announced February 2024.

Comments: ACL 2025 Main

arXiv:2402.09378 [pdf, other]

MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech

Authors: Shengpeng Ji, Ziyue Jiang, Hanting Wang, Jialong Zuo, Zhou Zhao

Abstract: Zero-shot text-to-speech (TTS) has gained significant attention due to its powerful voice cloning capabilities, requiring only a few seconds of unseen speaker voice prompts. However, all previous work has been developed for cloud-based systems. Taking autoregressive models as an example, although these approaches achieve high-fidelity voice cloning, they fall short in terms of inference speed, model size, and robustness. Therefore, we propose MobileSpeech, which is a fast, lightweight, and robust zero-shot text-to-speech system based on mobile devices for the first time. Specifically: 1) leveraging discrete codec, we design a parallel speech mask decoder module called SMD, which incorporates hierarchical information from the speech codec and weight mechanisms across different codec layers during the generation process. Moreover, to bridge the gap between text and speech, we introduce a high-level probabilistic mask that simulates the progression of information flow from less to more during speech generation. 2) For speaker prompts, we extract fine-grained prompt duration from the prompt speech and incorporate text, prompt speech by cross attention in SMD. We demonstrate the effectiveness of MobileSpeech on multilingual datasets at different levels, achieving state-of-the-art results in terms of generating speed and speech quality. MobileSpeech achieves RTF of 0.09 on a single A100 GPU and we have successfully deployed MobileSpeech on mobile devices. Audio samples are available at https://mobilespeech.github.io/.

Submitted 2 June, 2024; v1 submitted 14 February, 2024; originally announced February 2024.

Comments: Accepted by ACL 2024 (Main Conference)

arXiv:2311.03175 [pdf]

Frequency Domain Decomposition Translation for Enhanced Medical Image Translation Using GANs

Authors: Zhuhui Wang, Jianwei Zuo, Xuliang Deng, Jiajia Luo

Abstract: Medical Image-to-image translation is a key task in computer vision and generative artificial intelligence, and it is highly applicable to medical image analysis. GAN-based methods are the mainstream image translation methods, but they often ignore the variation and distribution of images in the frequency domain, or only take simple measures to align high-frequency information, which can lead to distortion and low quality of the generated images. To solve these problems, we propose a novel method called frequency domain decomposition translation (FDDT). This method decomposes the original image into a high-frequency component and a low-frequency component, with the high-frequency component containing the details and identity information, and the low-frequency component containing the style information. Next, the high-frequency and low-frequency components of the transformed image are aligned with the transformed results of the high-frequency and low-frequency components of the original image in the same frequency band in the spatial domain, thus preserving the identity information of the image while destroying as little stylistic information of the image as possible. We conduct extensive experiments on MRI images and natural images with FDDT and several mainstream baseline models, and we use four evaluation metrics to assess the quality of the generated images. Compared with the baseline models, optimally, FDDT can reduce Fréchet inception distance by up to 24.4%, structural similarity by up to 4.4%, peak signal-to-noise ratio by up to 5.8%, and mean squared error by up to 31%. Compared with the previous method, optimally, FDDT can reduce Fréchet inception distance by up to 23.7%, structural similarity by up to 1.8%, peak signal-to-noise ratio by up to 6.8%, and mean squared error by up to 31.6%.

Submitted 6 November, 2023; originally announced November 2023.

arXiv:2308.14430 [pdf, other]

doi 10.1109/ICASSP48485.2024.10445879

TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models

Authors: Shengpeng Ji, Jialong Zuo, Minghui Fang, Ziyue Jiang, Feiyang Chen, Xinyu Duan, Baoxing Huai, Zhou Zhao

Abstract: Recently, there has been a growing interest in the field of controllable Text-to-Speech (TTS). While previous studies have relied on users providing specific style factor values based on acoustic knowledge or selecting reference speeches that meet certain requirements, generating speech solely from natural text prompts has emerged as a new challenge for researchers. This challenge arises due to the scarcity of high-quality speech datasets with natural text style prompt and the absence of advanced text-controllable TTS models. In light of this, 1) we propose TextrolSpeech, which is the first large-scale speech emotion dataset annotated with rich text attributes. The dataset comprises 236,220 pairs of style prompt in natural text descriptions with five style factors and corresponding speech samples. Through iterative experimentation, we introduce a multi-stage prompt programming approach that effectively utilizes the GPT model for generating natural style descriptions in large volumes. 2) Furthermore, to address the need for generating audio with greater style diversity, we propose an efficient architecture called Salle. This architecture treats text controllable TTS as a language model task, utilizing audio codec codes as an intermediate representation to replace the conventional mel-spectrogram. Finally, we successfully demonstrate the ability of the proposed model by showing a comparable performance in the controllable TTS task. Audio samples are available at https://sall-e.github.io/

Submitted 28 August, 2023; originally announced August 2023.

Journal ref: 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

arXiv:2308.11691 [pdf, other]

Practical Insights on Incremental Learning of New Human Physical Activity on the Edge

Authors: George Arvanitakis, Jingwei Zuo, Mthandazo Ndhlovu, Hakim Hacid

Abstract: Edge Machine Learning (Edge ML), which shifts computational intelligence from cloud-based systems to edge devices, is attracting significant interest due to its evident benefits including reduced latency, enhanced data privacy, and decreased connectivity reliance. While these advantages are compelling, they introduce unique challenges absent in traditional cloud-based approaches. In this paper, we delve into the intricacies of Edge-based learning, examining the interdependencies among: (i) constrained data storage on Edge devices, (ii) limited computational power for training, and (iii) the number of learning classes. Through experiments conducted using our MAGNETO system, which focused on learning human activities via data collected from mobile sensors, we highlight these challenges and offer valuable perspectives on Edge ML.

Submitted 22 August, 2023; originally announced August 2023.

Comments: Accepted by DSAA 2023 (Industrial Track)

arXiv:2308.10124 [pdf, other]

doi 10.1109/SECON58729.2023.10287493

Intelligent Communication Planning for Constrained Environmental IoT Sensing with Reinforcement Learning

Authors: Yi Hu, Jinhang Zuo, Bob Iannucci, Carlee Joe-Wong

Abstract: Internet of Things (IoT) technologies have enabled numerous data-driven mobile applications and have the potential to significantly improve environmental monitoring and hazard warnings through the deployment of a network of IoT sensors. However, these IoT devices are often power-constrained and utilize wireless communication schemes with limited bandwidth. Such power constraints limit the amount of information each device can share across the network, while bandwidth limitations hinder sensors' coordination of their transmissions. In this work, we formulate the communication planning problem of IoT sensors that track the state of the environment. We seek to optimize sensors' decisions in collecting environmental data under stringent resource constraints. We propose a multi-agent reinforcement learning (MARL) method to find the optimal communication policies for each sensor that maximize the tracking accuracy subject to the power and bandwidth limitations. MARL learns and exploits the spatial-temporal correlation of the environmental data at each sensor's location to reduce the redundant reports from the sensors. Experiments on wildfire spread with LoRA wireless network simulators show that our MARL method can learn to balance the need to collect enough data to predict wildfire spread with unknown bandwidth limitations.

Submitted 19 August, 2023; originally announced August 2023.

Comments: To be published in the 20th Annual IEEE International Conference on Sensing, Communication, and Networking (SECON 2023)

arXiv:2306.12572 [pdf, ps, other]

Uniqueness of Iris Pattern Based on AR Model

Authors: Katelyn M. Hampel, Jinyu Zuo, Priyanka Das, Natalia A. Schmid, Stephanie Schuckers, Joseph Skufca, Matthew C. Valenti

Abstract: The assessment of iris uniqueness plays a crucial role in analyzing the capabilities and limitations of iris recognition systems. Among the various methodologies proposed, Daugman's approach to iris uniqueness stands out as one of the most widely accepted. According to Daugman, uniqueness refers to the iris recognition system's ability to enroll an increasing number of classes while maintaining a near-zero probability of collision between new and enrolled classes. Daugman's approach involves creating distinct IrisCode templates for each iris class within the system and evaluating the sustainable population under a fixed Hamming distance between codewords. In our previous work [23], we utilized Rate-Distortion Theory (as it pertains to the limits of error-correction codes) to establish boundaries for the maximum possible population of iris classes supported by Daugman's IrisCode, given the constraint of a fixed Hamming distance between codewords. Building upon that research, we propose a novel methodology to evaluate the scalability of an iris recognition system, while also measuring iris quality. We achieve this by employing a sphere-packing bound for Gaussian codewords and adopting an approach similar to Daugman's, which utilizes relative entropy as a distance measure between iris classes. To demonstrate the efficacy of our methodology, we illustrate its application on two small datasets of iris images. We determine the sustainable maximum population for each dataset based on the quality of the images. By providing these illustrations, we aim to assist researchers in comprehending the limitations inherent in their recognition systems, depending on the quality of their iris databases.

Submitted 21 June, 2023; originally announced June 2023.

arXiv:2305.13612 [pdf, other]

FluentSpeech: Stutter-Oriented Automatic Speech Editing with Context-Aware Diffusion Models

Authors: Ziyue Jiang, Qian Yang, Jialong Zuo, Zhenhui Ye, Rongjie Huang, Yi Ren, Zhou Zhao

Abstract: Stutter removal is an essential scenario in the field of speech editing. However, when the speech recording contains stutters, the existing text-based speech editing approaches still suffer from: 1) the over-smoothing problem in the edited speech; 2) lack of robustness due to the noise introduced by stutter; 3) to remove the stutters, users are required to determine the edited region manually. To tackle the challenges in stutter removal, we propose FluentSpeech, a stutter-oriented automatic speech editing model. Specifically, 1) we propose a context-aware diffusion model that iteratively refines the modified mel-spectrogram with the guidance of context features; 2) we introduce a stutter predictor module to inject the stutter information into the hidden sequence; 3) we also propose a stutter-oriented automatic speech editing (SASE) dataset that contains spontaneous speech recordings with time-aligned stutter labels to train the automatic stutter localization model. Experimental results on VCTK and LibriTTS datasets demonstrate that our model achieves state-of-the-art performance on speech editing. Further experiments on our SASE dataset show that FluentSpeech can effectively improve the fluency of stuttering speech in terms of objective and subjective metrics. Code and audio samples can be found at https://github.com/Zain-Jiang/Speech-Editing-Toolkit.

Submitted 22 May, 2023; originally announced May 2023.

Comments: Accepted by ACL 2023 (Findings)

arXiv:2304.13185 [pdf, ps, other]

Non-Orthogonal Multiple Access For Near-Field Communications

Authors: Jiakuo Zuo, Xidong Mu, Yuanwei Liu

Abstract: The novel concept of near-field non-orthogonal multiple access (NF-NOMA) communications is proposed. The near-field beamfocusing enables NOMA to be carried out in both angular and distance domains. Two novel frameworks are proposed, namely, single-location-beamfocusing NF-NOMA (SLB-NF-NOMA) and multiple-location-beamfocusing NF-NOMA (MLB-NF-NOMA). 1) For SLB-NF-NOMA, two NOMA users in the same angular direction with distinct quality of service (QoS) requirements can be grouped into one cluster. The hybrid beamformer design and power allocation problem is formulated to maximize the sum rate of the users with higher QoS (H-QoS) requirements. To solve this problem, the analog beamformer is first designed to focus the energy on the H-QoS users and the zero-forcing (ZF) digital beamformer is employed. Then, the optimal power allocation is obtained. 2) For MLB-NF-NOMA, the two NOMA users in the same cluster can have different angular directions. The analog beamformer is first designed to focus the energy on both NOMA users. Then, a singular value decomposition (SVD) based ZF (SVD-ZF) digital beamformer is designed. Furthermore, a novel antenna allocation algorithm is proposed. Finally, a suboptimal power allocation algorithm is proposed. Numerical results demonstrate that the NF-NOMA can achieve a higher spectral efficiency and provide a higher flexibility than conventional far-field NOMA.

Submitted 18 May, 2023; v1 submitted 25 April, 2023; originally announced April 2023.

arXiv:2302.10771 [pdf]

doi 10.1016/j.ress.2023.109123

Data-driven prognostics based on time-frequency analysis and symbolic recurrent neural network for fuel cells under dynamic load

Authors: Chu Wang, Manfeng Dou, Zhongliang Li, Rachid Outbib, Dongdong Zhao, Jian Zuo, Yuanlin Wang, Bin Liang, Peng Wang

Abstract: Data-centric prognostics is beneficial to improve the reliability and safety of proton exchange membrane fuel cell (PEMFC). For the prognostics of PEMFC operating under dynamic load, the challenges come from extracting degradation features, improving prediction accuracy, expanding the prognostics horizon, and reducing computational cost. To address these issues, this work proposes a data-driven PEMFC prognostics approach, in which the Hilbert-Huang transform is used to extract a health indicator in dynamic operating conditions and a symbolic-based gated recurrent unit model is used to enhance the accuracy of life prediction. Compared with other state-of-the-art methods, the proposed data-driven prognostics approach provides a competitive prognostics horizon with lower computational cost. The prognostics performance shows consistency and generalizability under different failure threshold settings.

Submitted 3 February, 2023; originally announced February 2023.

arXiv:2302.04432 [pdf, ps, other]

Active Simultaneously Transmitting and Reflecting (STAR)-RISs: Modelling and Analysis

Authors: Jiaqi Xu, Jiakuo Zuo, Joey Tianyi Zhou, Yuanwei Liu

Abstract: A hardware model for active simultaneously transmitting and reflecting reconfigurable intelligent surfaces (STAR-RISs) is proposed consisting of reflection-type amplifiers. The amplitude gains of the STAR element are derived for both coupled and independent phase-shift scenarios. Based on the proposed hardware model, an active STAR-RIS-aided two-user downlink communication system is investigated. Closed-form expressions are obtained for the outage probabilities of both the coupled and independent phase-shift scenarios. To obtain further insights, scaling laws and diversity orders are derived for both users. Analytical results confirm that active STAR-RIS achieves the same diversity orders as passive ones while their scaling laws are different. It is proved that average received SNRs scale with M and M^2 for active and passive STAR-RISs, respectively. Numerical results show that active STAR-RISs outperform passive STAR-RISs in terms of outage probability especially when the number of elements is small.

Submitted 8 February, 2023; originally announced February 2023.

Comments: 13 pages

arXiv:2210.02725 [pdf, ps, other]

Exploiting NOMA and RIS in Integrated Sensing and Communication

Authors: Jiakuo Zuo, Yuanwei Liu, Chenming Zhu, Yixuan Zou, Dengyin Zhang, Naofal Al-Dhahir

Abstract: A novel integrated sensing and communication (ISAC) system is proposed, where a dual-functional base station is utilized to transmit the superimposed non-orthogonal multiple access (NOMA) communication signal for serving communication users and sensing targets simultaneously. Furthermore, a new reconfigurable intelligent surface (RIS)-aided-sensing structure is also proposed to address the significant path loss or blockage of LoS links for the sensing task. Based on this setup, the beampattern gain at the RIS for the radar target is derived and adopted as a sensing metric. The objective of this paper is to maximize the minimum beampattern gain by jointly optimizing active beamforming, power allocation coefficients and passive beamforming. To tackle the non-convexity of the formulated optimization problem, the beampattern gain and constraints are first transformed into more tractable forms. Then, an iterative block coordinate descent (IBCD) algorithm is proposed by employing successive convex approximation (SCA), Schur complement, semidefinite relaxation (SDR) and sequential rank-one constraint relaxation (SRCR) methods. To reduce the complexity of the proposed IBCD algorithm, a low-complexity iterative alternating optimization (IAO) algorithm is proposed. Particularly, the active beamforming is optimized by solving a semidefinite programming (SDP) problem and the closed-form solutions of the power allocation coefficients are derived. Numerical results show that: i) the proposed RIS-NOMA-ISAC system always outperforms the RIS-ISAC system without NOMA in beampattern gain and illumination power; ii) the low-complexity IAO algorithm achieves a comparable performance to that achieved by the IBCD algorithm; iii) high beampattern gain can be achieved by the proposed joint optimization algorithms in underloaded and overloaded communication scenarios.

Submitted 6 October, 2022; originally announced October 2022.

Comments: arXiv admin note: text overlap with arXiv:2208.04786

arXiv:2208.04786 [pdf, ps, other]

Reconfigurable Intelligent Surface Assisted NOMA Empowered Integrated Sensing and Communication

Authors: Jiakuo Zuo, Yuanwei Liu

Abstract: This paper exploits the potential of reconfigurable intelligent surface (RIS) to improve radar sensing in a non-orthogonal multiple access (NOMA) empowered integrated sensing and communication (NOMA-ISAC) network. The objective is to maximize the minimum radar beampattern gain by jointly optimizing the active beamforming, power allocation coefficients and passive beamforming. To tackle the formulated non-convex problem, we propose an efficient joint optimization algorithm by invoking alternating optimization, successive convex approximation (SCA) and sequential rank-one constraint relaxation (SRCR) algorithm. Numerical results show that the proposed RIS assisted NOMA-ISAC system, with the aid of proposed scheme, outperforms the RIS assisted ISAC system without NOMA.

Submitted 28 September, 2022; v1 submitted 9 August, 2022; originally announced August 2022.

arXiv:2112.05240 [pdf]

doi 10.34133/2022/9786242

Label-free virtual HER2 immunohistochemical staining of breast tissue using deep learning

Authors: Bijie Bai, Hongda Wang, Yuzhu Li, Kevin de Haan, Francesco Colonnese, Yujie Wan, Jingyi Zuo, Ngan B. Doan, Xiaoran Zhang, Yijie Zhang, Jingxi Li, Wenjie Dong, Morgan Angus Darrow, Elham Kamangar, Han Sung Lee, Yair Rivenson, Aydogan Ozcan

Abstract: The immunohistochemical (IHC) staining of the human epidermal growth factor receptor 2 (HER2) biomarker is widely practiced in breast tissue analysis, preclinical studies and diagnostic decisions, guiding cancer treatment and investigation of pathogenesis. HER2 staining demands laborious tissue treatment and chemical processing performed by a histotechnologist, which typically takes one day to prepare in a laboratory, increasing analysis time and associated costs. Here, we describe a deep learning-based virtual HER2 IHC staining method using a conditional generative adversarial network that is trained to rapidly transform autofluorescence microscopic images of unlabeled/label-free breast tissue sections into bright-field equivalent microscopic images, matching the standard HER2 IHC staining that is chemically performed on the same tissue sections. The efficacy of this virtual HER2 staining framework was demonstrated by quantitative analysis, in which three board-certified breast pathologists blindly graded the HER2 scores of virtually stained and immunohistochemically stained HER2 whole slide images (WSIs) to reveal that the HER2 scores determined by inspecting virtual IHC images are as accurate as their immunohistochemically stained counterparts. A second quantitative blinded study performed by the same diagnosticians further revealed that the virtually stained HER2 images exhibit a comparable staining quality in the level of nuclear detail, membrane clearness, and absence of staining artifacts with respect to their immunohistochemically stained counterparts. This virtual HER2 staining framework bypasses the costly, laborious, and time-consuming IHC staining procedures in laboratory, and can be extended to other types of biomarkers to accelerate the IHC tissue staining used in life sciences and biomedical workflow.

Submitted 8 December, 2021; originally announced December 2021.

Comments: 26 Pages, 5 Figures

Journal ref: BME Frontiers (2022)

arXiv:2107.01601 [pdf, other]

doi 10.1016/j.actaastro.2021.07.046

Applications And Potentials Of Intelligent Swarms For Magnetospheric Studies

Authors: Raj Thilak Rajan, Shoshana Ben-Maor, Shaziana Kaderali, Calum Turner, Mohammed Milhim, Catrina Melograna, Dawn Haken, Gary Paul, Vedant, Sreekumar V, Johannes Weppler, Yosephine Gumulya, Riccardo Bunt, Asia Bulgarini, Maurice Marnat, Kadri Bussov, Frederick Pringle, Jusha Ma, Rushanka Amrutkar, Miguel Coto, Jiang He, Zijian Shi, Shahd Hayder, Dina Saad Fayez Jaber, Junchao Zuo, et al. (10 additional authors not shown)

Abstract: Earth's magnetosphere is vital for today's technologically dependent society. To date, numerous design studies have been conducted and over a dozen science missions have flown to study the magnetosphere. However, a majority of these solutions relied on large monolithic satellites, which limited the spatial resolution of these investigations, as did the technological limitations of the past. To counter these limitations, we propose the use of a satellite swarm carrying numerous and distributed payloads for magnetospheric measurements. Our mission is named APIS (Applications and Potentials of Intelligent Swarms), which aims to characterize fundamental plasma processes in the Earth's magnetosphere and measure the effect of the solar wind on our magnetosphere. We propose a swarm of 40 CubeSats in two highly-elliptical orbits around the Earth, which perform radio tomography in the magnetotail at 8-12 Earth Radii (RE) downstream, and the subsolar magnetosphere at 8-12RE upstream. In addition, in-situ measurements of the magnetic and electric fields, plasma density will be performed by on-board instruments. In this article, we present an outline of previous missions and designs for magnetospheric studies, along with the science drivers and motivation for the APIS mission. Furthermore, preliminary design results are included to show the feasibility of such a mission. The science requirements drive the APIS mission design, the mission operation and the system requirements. In addition to the various science payloads, critical subsystems of the satellites are investigated, e.g., navigation, communication, processing and power systems. We summarize our findings, along with the potential next steps to strengthen our design study.

Submitted 4 July, 2021; originally announced July 2021.

Comments: Accepted in Acta Astronautica

Journal ref: Acta Astronautica, Elsevier, 2021

arXiv:2106.03001 [pdf, ps, other]

doi 10.1109/TWC.2022.3197079

Joint Design for Simultaneously Transmitting And Reflecting (STAR) RIS Assisted NOMA Systems

Authors: Jiakuo Zuo, Yuanwei Liu, Zhiguo Ding, Lingyang Song, H. Vincent Poor

Abstract: Different from traditional reflection-only reconfigurable intelligent surfaces (RISs), simultaneously transmitting and reflecting RISs (STAR-RISs) represent a novel technology, which extends the half-space coverage to full-space coverage by simultaneously transmitting and reflecting incident signals. STAR-RISs provide new degrees-of-freedom (DoF) for manipulating signal propagation. Motivated by the above, a novel STAR-RIS assisted non-orthogonal multiple access (NOMA) (STAR-RIS-NOMA) system is proposed in this paper. Our objective is to maximize the achievable sum rate by jointly optimizing the decoding order, power allocation coefficients, active beamforming, and transmission and reflection beamforming. However, the formulated problem is non-convex with intricately coupled variables. To tackle this challenge, a suboptimal two-layer iterative algorithm is proposed. Specifically, in the inner-layer iteration, for a given decoding order, the power allocation coefficients, active beamforming, transmission and reflection beamforming are optimized alternatingly. For the outer-layer iteration, the decoding order of NOMA users in each cluster is updated with the solutions obtained from the inner-layer iteration. Moreover, an efficient decoding order determination scheme is proposed based on the equivalent-combined channel gains. Simulation results are provided to demonstrate that the proposed STAR-RIS-NOMA system, aided by our proposed algorithm, outperforms conventional RIS-NOMA and RIS assisted orthogonal multiple access (RIS-OMA) systems.

Submitted 17 September, 2022; v1 submitted 5 June, 2021; originally announced June 2021.

arXiv:2012.10111 [pdf, ps, other]

Reconfigurable Intelligent Surface Enhanced NOMA Assisted Backscatter Communication System

Authors: Jiakuo Zuo, Yuanwei Liu, Liang Yang, Lingyang Song, Ying-Chang Liang

Abstract: A reconfigurable intelligent surface (RIS) enhanced non-orthogonal multiple access assisted backscatter communication (RIS-NOMABC) system is considered. A joint optimization problem over power reflection coefficients and phase shifts is formulated. To solve this non-convex problem, a low complexity algorithm is proposed by invoking the alternative optimization, successive convex approximation and manifold optimization algorithms. Numerical results corroborate that the proposed RIS-NOMABC system outperforms the conventional non-orthogonal multiple access assisted backscatter communication (NOMABC) system without RIS, and demonstrate the feasibility and effectiveness of the proposed algorithm.

Submitted 18 December, 2020; originally announced December 2020.

arXiv:2011.08975 [pdf, ps, other]

Reconfigurable Intelligent Surface Assisted Cooperative Non-orthogonal Multiple Access Systems

Authors: Jiakuo Zuo, Yuanwei Liu, Naofal Al-Dhahir

Abstract: This paper considers the downlink of reconfigurable intelligent surface (RIS) assisted cooperative non-orthogonal multiple access (CNOMA) systems. Our objective is to minimize the total transmit power by jointly optimizing the active beamforming vectors, transmit-relaying power, and RIS phase shifts. The formulated problem is a mixed-integer nonlinear programming (MINLP) problem. To tackle this problem, the alternating optimization approach is utilized to decouple the variables. In each alternative procedure, the optimal solutions for the active beamforming vectors, transmit-relaying power and phase shifts are obtained. However, the proposed algorithm has high complexity since the optimal phase shifts are solved by integer linear programming (ILP) whose computational complexity is exponential in the number of variables. To strike a good computational complexity-optimality trade-off, a low-complexity suboptimal algorithm is proposed by invoking the iterative penalty function based semidefinite programming (SDP) and the successive refinement approaches. Numerical results illustrate that: i) the proposed RIS-CNOMA system, aided by our proposed algorithms, outperforms the conventional CNOMA system; ii) the proposed low-complexity suboptimal algorithm can achieve near-optimal performance; iii) whether the RIS-CNOMA system outperforms the RIS assisted non-orthogonal multiple access (RIS-NOMA) system depends not only on the users' locations but also on the RIS's location.

Submitted 5 December, 2020; v1 submitted 17 November, 2020; originally announced November 2020.

arXiv:2005.01562 [pdf, ps, other]

Intelligent Reflecting Surface Enhanced Millimeter-Wave NOMA Systems

Authors: Jiakuo Zuo, Yuanwei Liu, Ertugrul Basar, Octavia A. Dobre

Abstract: In this paper, a downlink intelligent reflecting surface (IRS) enhanced millimeter-wave (mmWave) non-orthogonal multiple access (NOMA) system is considered. A joint optimization problem over active beamforming, passive beamforming and power allocation is formulated. Due to the highly coupled variables, the formulated optimization problem is non-convex. To solve this problem, an alternative optimization and successive convex approximation based iterative algorithm is proposed. Numerical results illustrate that: 1) the proposed scheme offers significant sum-rate gains, which confirms the effectiveness of introducing IRS for mmWave-NOMA systems; 2) the proposed algorithm with discrete phase shifts can achieve close performance to that of continuous phase shifts.

Submitted 4 May, 2020; originally announced May 2020.

arXiv:2003.08923 [pdf, other]

RF-Rhythm: Secure and Usable Two-Factor RFID Authentication

Authors: Jiawei Li, Chuyu Wang, Ang Li, Dianqi Han, Yan Zhang, Jinhang Zuo, Rui Zhang, Lei Xie, Yanchao Zhang

Abstract: Passive RFID technology is widely used in user authentication and access control. We propose RF-Rhythm, a secure and usable two-factor RFID authentication system with strong resilience to lost/stolen/cloned RFID cards. In RF-Rhythm, each legitimate user performs a sequence of taps on his/her RFID card according to a self-chosen secret melody. Such rhythmic taps can induce phase changes in the backscattered signals, which the RFID reader can detect to recover the user's tapping rhythm. In addition to verifying the RFID card's identification information as usual, the backend server compares the extracted tapping rhythm with what it acquires in the user enrollment phase. The user passes authentication checks if and only if both verifications succeed. We also propose a novel phase-hopping protocol in which the RFID reader emits Continuous Wave (CW) with random phases for extracting the user's secret tapping rhythm. Our protocol can prevent a capable adversary from extracting and then replaying a legitimate tapping rhythm from sniffed RFID signals. Comprehensive user experiments confirm the high security and usability of RF-Rhythm with false-positive and false-negative rates close to zero.

Submitted 19 March, 2020; originally announced March 2020.

Comments: To appear at IEEE INFOCOM 2020

arXiv:2002.01765 [pdf, ps, other]

Resource Allocation in Intelligent Reflecting Surface Assisted NOMA Systems

Authors: Jiakuo Zuo, Yuanwei Liu, Zhijin Qin, Naofal Al-Dhahir

Abstract: This paper investigates the downlink communications of intelligent reflecting surface (IRS) assisted non-orthogonal multiple access (NOMA) systems. To maximize the system throughput, we formulate a joint optimization problem over the channel assignment, decoding order of NOMA users, power allocation, and reflection coefficients. The formulated problem is proved to be NP-hard. To tackle this problem, a three-step novel resource allocation algorithm is proposed. Firstly, the channel assignment problem is solved by a many-to-one matching algorithm. Secondly, by considering the IRS reflection coefficients design, a low-complexity decoding order optimization algorithm is proposed. Thirdly, given a channel assignment and decoding order, a joint optimization algorithm is proposed for solving the joint power allocation and reflection coefficient design problem. Numerical results illustrate that: i) with the aid of IRS, the proposed IRS-NOMA system outperforms the conventional NOMA system without the IRS in terms of system throughput; ii) the proposed IRS-NOMA system achieves higher system throughput than the IRS assisted orthogonal multiple access (IRS-OMA) systems; iii) the performance gains of the IRS-NOMA and the IRS-OMA systems can be enhanced by carefully choosing the location of the IRS.

Submitted 5 February, 2020; originally announced February 2020.
