Showing 1–50 of 389 results for author: Guo, Y

Searching in archive eess.
  1. arXiv:2507.04383

    eess.IV cs.CV

    ViTaL: A Multimodality Dataset and Benchmark for Multi-pathological Ovarian Tumor Recognition

    Authors: You Zhou, Lijiang Chen, Guangxia Cui, Wenpei Bai, Yu Guo, Shuchang Lyu, Guangliang Cheng, Qi Zhao

    Abstract: Ovarian tumor, as a common gynecological disease, can rapidly deteriorate into serious health crises when undetected early, thus posing significant threats to the health of women. Deep neural networks have the potential to identify ovarian tumors, thereby reducing mortality rates, but limited public datasets hinder its progress. To address this gap, we introduce a vital ovarian tumor pathological…

    Submitted 6 July, 2025; originally announced July 2025.

  2. arXiv:2507.01348

    eess.AS cs.SD

    SpeechAccentLLM: A Unified Framework for Foreign Accent Conversion and Text to Speech

    Authors: Zhuangfei Cheng, Guangyan Zhang, Zehai Tu, Yangyang Song, Shuiyang Mao, Xiaoqi Jiao, Jingyu Li, Yiwen Guo, Jiasong Wu

    Abstract: Foreign accent conversion (FAC) in speech processing remains a challenging task. Building on the remarkable success of large language models (LLMs) in Text-to-Speech (TTS) tasks, this study investigates the adaptation of LLM-based techniques for FAC, which we term SpeechAccentLLM. At the core of this framework, we introduce SpeechCodeVAE, the first model to integrate connectionist temporal classif…

    Submitted 8 July, 2025; v1 submitted 2 July, 2025; originally announced July 2025.

    Comments: 10 pages, includes references, 4 figures, 4 tables

    ACM Class: I.2.7

  3. arXiv:2506.22023

    cs.SD cs.CL eess.AS

    Robust and Efficient Autoregressive Speech Synthesis with Dynamic Chunk-wise Prediction Policy

    Authors: Bohan Li, Zhihan Li, Haoran Wang, Hanglei Zhang, Yiwei Guo, Hankun Wang, Xie Chen, Kai Yu

    Abstract: Recently, autoregressive (AR) language models have emerged as a dominant approach in speech synthesis, offering expressive generation and scalable training. However, conventional AR speech synthesis models relying on the next-token prediction paradigm often encounter significant challenges when handling long speech sequences. These models often struggle to construct stable frame-to-frame attention…

    Submitted 27 June, 2025; originally announced June 2025.

    Comments: 17 pages, 8 figures, 5 tables

  4. arXiv:2506.21074

    eess.AS cs.SD

    CodecSlime: Temporal Redundancy Compression of Neural Speech Codec via Dynamic Frame Rate

    Authors: Hankun Wang, Yiwei Guo, Chongtian Shao, Bohan Li, Xie Chen, Kai Yu

    Abstract: Neural speech codecs have been widely used in audio compression and various downstream tasks. Current mainstream codecs are fixed-frame-rate (FFR), which allocate the same number of tokens to every equal-duration slice. However, speech is inherently non-uniform in temporal information density. As a result, many tokens are wasted on steady-state segments like long vowels and silences. To address th…

    Submitted 26 June, 2025; originally announced June 2025.

    Comments: 16 pages, 5 figures, 9 tables

  5. arXiv:2506.10562

    eess.SY

    Joint System Modeling Approach for Fault Simulation of Start-er/Generator and Gas Generator in All-Electric APU

    Authors: Haotian Mao, Yingqing Guo

    Abstract: This paper presents a joint system modeling approach for fault simulation of all-electric auxiliary power unit (APU), integrating starter/generator turn-to-turn short circuit (TTSC) faults with gas generator gas-path faults. To address challenges in electromechanical coupling, simulation precision and computational efficiency balance, we propose a multi-rate continuous-discrete hybrid simulation ar…

    Submitted 12 June, 2025; originally announced June 2025.

  6. arXiv:2506.08404

    eess.SY

    Compact Amplified Laser Power Stabilization Using Robust Active Disturbance Rejection Control with Sensor Noise Decoupling

    Authors: Yanpei Shi, Jingxuan Zhang, Zhuo Shi, Chenyao Zhang, Yuze Guo, Rui Feng

    Abstract: Laser power instability, encompassing random jitter and slow drift, severely limits the performance of optically pumped magnetometers (OPMs) in detecting ultra-weak magnetic fields, especially in large-scale OPM arrays for magnetoencephalography. Although a unified amplified laser (AL) architecture improves integration, fluctuations in the pump beam progressively degrade performance across all cha…

    Submitted 9 June, 2025; originally announced June 2025.

  7. arXiv:2506.07358

    cs.SD cs.AI cs.LG eess.AS

    Lightweight Joint Audio-Visual Deepfake Detection via Single-Stream Multi-Modal Learning Framework

    Authors: Kuiyuan Zhang, Wenjie Pei, Rushi Lan, Yifang Guo, Zhongyun Hua

    Abstract: Deepfakes are AI-synthesized multimedia data that may be abused for spreading misinformation. Deepfake generation involves both visual and audio manipulation. To detect audio-visual deepfakes, previous studies commonly employ two relatively independent sub-models to learn audio and visual features, respectively, and fuse them subsequently for deepfake detection. However, this may underutilize the…

    Submitted 8 June, 2025; originally announced June 2025.

  8. arXiv:2506.00358

    cs.SD cs.AI cs.LG eess.AS

    AVROBUSTBENCH: Benchmarking the Robustness of Audio-Visual Recognition Models at Test-Time

    Authors: Sarthak Kumar Maharana, Saksham Singh Kushwaha, Baoming Zhang, Adrian Rodriguez, Songtao Wei, Yapeng Tian, Yunhui Guo

    Abstract: While recent audio-visual models have demonstrated impressive performance, their robustness to distributional shifts at test-time remains not fully understood. Existing robustness benchmarks mainly focus on single modalities, making them insufficient for thoroughly assessing the robustness of audio-visual models. Motivated by real-world scenarios where shifts can occur simultaneously in…

    Submitted 30 May, 2025; originally announced June 2025.

    Comments: Under review. For uniformity, all TTA experiments are done with a batch size of 16

  9. arXiv:2505.23379

    eess.AS cs.SD

    Vision-Integrated High-Quality Neural Speech Coding

    Authors: Yao Guo, Yang Ai, Rui-Chen Zheng, Hui-Peng Du, Xiao-Hang Jiang, Zhen-Hua Ling

    Abstract: This paper proposes a novel vision-integrated neural speech codec (VNSC), which aims to enhance speech coding quality by leveraging visual modality information. In VNSC, the image analysis-synthesis module extracts visual features from lip images, while the feature fusion module facilitates interaction between the image analysis-synthesis module and the speech coding module, transmitting visual in…

    Submitted 29 May, 2025; originally announced May 2025.

    Comments: Accepted by Interspeech 2025

  10. arXiv:2505.22515

    cs.SD eess.AS

    Towards General Discrete Speech Codec for Complex Acoustic Environments: A Study of Reconstruction and Downstream Task Consistency

    Authors: Haoran Wang, Guanyu Chen, Bohan Li, Hankun Wang, Yiwei Guo, Zhihan Li, Xie Chen, Kai Yu

    Abstract: Neural speech codecs excel in reconstructing clean speech signals; however, their efficacy in complex acoustic environments and downstream signal processing tasks remains underexplored. In this study, we introduce a novel benchmark named Environment-Resilient Speech Codec Benchmark (ERSB) to systematically evaluate whether neural speech codecs are environment-resilient. Specifically, we assess two…

    Submitted 28 May, 2025; originally announced May 2025.

    Comments: Initial Upload

  11. arXiv:2505.19539

    eess.SP

    Water Level Sensing via Communication Signals in a Bi-Static System

    Authors: Zhongqin Wang, J. Andrew Zhang, Kai Wu, Y. Jay Guo

    Abstract: Accurate water level sensing is essential for flood monitoring, agricultural irrigation, and water resource optimization. Traditional methods require dedicated sensor deployments, leading to high installation costs, vulnerability to interference, and limited resolution. This work proposes PMNs-WaterSense, a novel scheme leveraging Channel State Information (CSI) from existing mobile networks for w…

    Submitted 26 May, 2025; originally announced May 2025.

  12. arXiv:2505.18641

    eess.SP

    FDMA-Based Passive Multiple Users SWIPT Utilizing Resonant Beams

    Authors: Yixuan Guo, Mingliang Xiong, Wen Fang, Qingwei Jiang, Qingwen Liu, Gang Yan

    Abstract: The rapid development of IoT technology has led to a shortage of spectrum resources and energy, giving rise to simultaneous wireless information and power transfer (SWIPT) technology. However, traditional multiple input multiple output (MIMO)-based SWIPT faces challenges in target detection. We have designed a passive multi-user resonant beam system (MU-RBS) that can achieve efficient power transf…

    Submitted 24 May, 2025; originally announced May 2025.

  13. arXiv:2505.16845

    eess.AS cs.AI cs.SD

    Unlocking Temporal Flexibility: Neural Speech Codec with Variable Frame Rate

    Authors: Hanglei Zhang, Yiwei Guo, Zhihan Li, Xiang Hao, Xie Chen, Kai Yu

    Abstract: Most neural speech codecs achieve bitrate adjustment through intra-frame mechanisms, such as codebook dropout, at a Constant Frame Rate (CFR). However, speech segments inherently have time-varying information density (e.g., silent intervals versus voiced regions). This property makes CFR not optimal in terms of bitrate and token sequence length, hindering efficiency in real-time applications. In t…

    Submitted 22 May, 2025; originally announced May 2025.

    Comments: Accepted to Interspeech 2025

  14. arXiv:2505.16091

    eess.IV cs.CV

    OSCAR: One-Step Diffusion Codec for Image Compression Across Multiple Bit-rates

    Authors: Jinpei Guo, Yifei Ji, Zheng Chen, Kai Liu, Min Liu, Wang Rao, Wenbo Li, Yong Guo, Yulun Zhang

    Abstract: Pretrained latent diffusion models have shown strong potential for lossy image compression, owing to their powerful generative priors. Most existing diffusion-based methods reconstruct images by iteratively denoising from random noise, guided by compressed latent representations. While these approaches have achieved high reconstruction quality, their multi-step sampling process incurs substantial…

    Submitted 28 May, 2025; v1 submitted 21 May, 2025; originally announced May 2025.

  15. arXiv:2505.11516

    cs.RO eess.IV

    SELECT: A Submodular Approach for Active LiDAR Semantic Segmentation

    Authors: Ruiyu Mao, Sarthak Kumar Maharana, Xulong Tang, Yunhui Guo

    Abstract: LiDAR-based semantic segmentation plays a vital role in autonomous driving by enabling detailed understanding of 3D environments. However, annotating LiDAR point clouds is extremely costly and requires assigning semantic labels to millions of points with complex geometric structures. Active Learning (AL) has emerged as a promising approach to reduce labeling costs by querying only the most informa…

    Submitted 6 May, 2025; originally announced May 2025.

  16. arXiv:2505.10577

    eess.IV cs.AI cs.CV

    GRNN:Recurrent Neural Network based on Ghost Features for Video Super-Resolution

    Authors: Yutong Guo

    Abstract: Modern video super-resolution (VSR) systems based on convolutional neural networks (CNNs) require huge computational costs. The problem of feature redundancy is present in most models in many domains, but is rarely discussed in VSR. We experimentally observe that many features in VSR models are also similar to each other, so we propose to use "Ghost features" to reduce this redundancy. We also ana…

    Submitted 13 May, 2025; originally announced May 2025.

    Comments: Accepted by 2023 IEEE International Conference on Multimedia and Expo (ICME 2023)

  17. arXiv:2505.03244

    cs.SD eess.AS

    SonicRAG : High Fidelity Sound Effects Synthesis Based on Retrival Augmented Generation

    Authors: Yu-Ren Guo, Wen-Kai Tai

    Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language processing (NLP) and multimodal learning, with successful applications in text generation and speech synthesis, enabling a deeper understanding and generation of multimodal content. In the field of sound effects (SFX) generation, LLMs have been leveraged to orchestrate multiple models for audio synthesis. Ho…

    Submitted 13 May, 2025; v1 submitted 6 May, 2025; originally announced May 2025.

    Comments: 8 pages, 5 figures

  18. arXiv:2504.10836

    eess.SP cs.AI

    Uplink Assisted Joint Channel Estimation and CSI Feedback: An Approach Based on Deep Joint Source-Channel Coding

    Authors: Yiran Guo, Wei Chen, Bo Ai

    Abstract: In frequency division duplex (FDD) multiple-input multiple-output (MIMO) wireless communication systems, the acquisition of downlink channel state information (CSI) is essential for maximizing spatial resource utilization and improving system spectral efficiency. The separate design of modules in AI-based CSI feedback architectures under traditional modular communication frameworks, including chan…

    Submitted 14 April, 2025; originally announced April 2025.

  19. arXiv:2504.07119

    eess.SP

    UAV-Assisted MEC for Disaster Response: Stackelberg Game-Based Resource Optimization

    Authors: Yafei Guo, Ziye Jia, Lei Zhang, Jia He, Yu Zhang, Qihui Wu

    Abstract: The unmanned aerial vehicle assisted multi-access edge computing (UAV-MEC) technology has been widely applied in the sixth-generation era. However, due to the limitations of energy and computing resources in disaster areas, how to efficiently offload the tasks of damaged user equipments (UEs) to UAVs is a key issue. In this work, we consider a multiple UAVMECs assisted task offloading scenario, wh…

    Submitted 26 March, 2025; originally announced April 2025.

  20. arXiv:2504.02061

    cs.CV cs.MM cs.SD eess.AS

    Aligned Better, Listen Better for Audio-Visual Large Language Models

    Authors: Yuxin Guo, Shuailei Ma, Shijie Ma, Xiaoyi Bao, Chen-Wei Xie, Kecheng Zheng, Tingyu Weng, Siyang Sun, Yun Zheng, Wei Zou

    Abstract: Audio is essential for multimodal video understanding. On the one hand, video inherently contains audio, which supplies complementary information to vision. Besides, video large language models (Video-LLMs) can encounter many audio-centric settings. However, existing Video-LLMs and Audio-Visual Large Language Models (AV-LLMs) exhibit deficiencies in exploiting audio information, leading to weak un…

    Submitted 2 April, 2025; originally announced April 2025.

    Comments: Accepted to ICLR 2025

  21. arXiv:2504.01038

    eess.IV cs.CV cs.HC

    An Integrated AI-Enabled System Using One Class Twin Cross Learning (OCT-X) for Early Gastric Cancer Detection

    Authors: Xian-Xian Liu, Yuanyuan Wei, Mingkun Xu, Yongze Guo, Hongwei Zhang, Huicong Dong, Qun Song, Qi Zhao, Wei Luo, Feng Tien, Juntao Gao, Simon Fong

    Abstract: Early detection of gastric cancer, a leading cause of cancer-related mortality worldwide, remains hampered by the limitations of current diagnostic technologies, leading to high rates of misdiagnosis and missed diagnoses. To address these challenges, we propose an integrated system that synergizes advanced hardware and software technologies to balance speed-accuracy. Our study introduces the One C…

    Submitted 31 March, 2025; originally announced April 2025.

    Comments: 26 pages, 4 figures, 6 tables

  22. arXiv:2503.24086

    math.OC eess.SY

    Distributed AC Optimal Power Flow: A Scalable Solution for Large-Scale Problems

    Authors: Xinliang Dai, Yuning Jiang, Yi Guo, Colin N. Jones, Moritz Diehl, Veit Hagenmeyer

    Abstract: This paper introduces a novel distributed optimization framework for large-scale AC Optimal Power Flow (OPF) problems, offering both theoretical convergence guarantees and rapid convergence in practice. By integrating smoothing techniques and the Schur complement, the proposed approach addresses the scalability challenges and reduces communication overhead in distributed AC OPF. Additionally, opti…

    Submitted 4 April, 2025; v1 submitted 31 March, 2025; originally announced March 2025.

  23. arXiv:2503.21942

    cs.NI eess.SP

    Enhancing Mobile Crowdsensing Efficiency: A Coverage-aware Resource Allocation Approach

    Authors: Yaru Fu, Yue Zhang, Zheng Shi, Yongna Guo, Yalin Liu

    Abstract: In this study, we investigate the resource management challenges in next-generation mobile crowdsensing networks with the goal of minimizing task completion latency while ensuring coverage performance, i.e., an essential metric to ensure comprehensive data collection across the monitored area, yet it has been commonly overlooked in existing studies. To this end, we formulate a weighted latency and…

    Submitted 27 March, 2025; originally announced March 2025.

  24. arXiv:2503.14986

    eess.SY

    Enhancing Fault Detection and Isolation in an All-Electric Auxiliary Power Unit (APU) Gas Generator by Utilizing Starter/Generator Signal

    Authors: Haotian Mao, Khashayar Khorasani, Yingqing Guo

    Abstract: This study proposes a novel paradigm for enhancing fault detection and isolation (FDI) of gas generators in all-electric auxiliary power unit (APU) by utilizing shaft power information from the starter/generator. First, we conduct a pioneering investigation into the challenges and opportunities for FDI brought about by APU electrification. Our analysis reveals that the electrification of APU opens…

    Submitted 19 March, 2025; originally announced March 2025.

  25. arXiv:2503.14892

    eess.IV cs.CV

    Degradation Alchemy: Self-Supervised Unknown-to-Known Transformation for Blind Hyperspectral Image Fusion

    Authors: He Huang, Yong Chen, Yujun Guo, Wei He

    Abstract: Hyperspectral image (HSI) fusion is an efficient technique that combines low-resolution HSI (LR-HSI) and high-resolution multispectral images (HR-MSI) to generate high-resolution HSI (HR-HSI). Existing supervised learning methods (SLMs) can yield promising results when test data degradation matches the training ones, but they face challenges in generalizing to unknown degradations. To unleash the…

    Submitted 19 March, 2025; originally announced March 2025.

  26. arXiv:2503.10522

    cs.MM cs.CV cs.LG cs.SD eess.AS

    AudioX: Diffusion Transformer for Anything-to-Audio Generation

    Authors: Zeyue Tian, Yizhu Jin, Zhaoyang Liu, Ruibin Yuan, Xu Tan, Qifeng Chen, Wei Xue, Yike Guo

    Abstract: Audio and music generation have emerged as crucial tasks in many applications, yet existing approaches face significant limitations: they operate in isolation without unified capabilities across modalities, suffer from scarce high-quality, multi-modal training data, and struggle to effectively integrate diverse inputs. In this work, we propose AudioX, a unified Diffusion Transformer model for Anyt…

    Submitted 23 April, 2025; v1 submitted 13 March, 2025; originally announced March 2025.

    Comments: The code and datasets will be available at https://zeyuet.github.io/AudioX/

  27. arXiv:2503.08638

    eess.AS cs.AI cs.MM cs.SD

    YuE: Scaling Open Foundation Models for Long-Form Music Generation

    Authors: Ruibin Yuan, Hanfeng Lin, Shuyue Guo, Ge Zhang, Jiahao Pan, Yongyi Zang, Haohe Liu, Yiming Liang, Wenye Ma, Xingjian Du, Xinrun Du, Zhen Ye, Tianyu Zheng, Yinghao Ma, Minghao Liu, Zeyue Tian, Ziya Zhou, Liumeng Xue, Xingwei Qu, Yizhi Li, Shangda Wu, Tianhao Shen, Ziyang Ma, Jun Zhan, Chunhui Wang, et al. (32 additional authors not shown)

    Abstract: We tackle the task of long-form music generation--particularly the challenging lyrics-to-song problem--by introducing YuE, a family of open foundation models based on the LLaMA2 architecture. Specifically, YuE scales to trillions of tokens and generates up to five minutes of music while maintaining lyrical alignment, coherent musical structure, and engaging vocal melodies with appropriate…

    Submitted 11 March, 2025; originally announced March 2025.

    Comments: https://github.com/multimodal-art-projection/YuE

  28. arXiv:2503.05843

    physics.ao-ph cs.CV eess.IV eess.SP physics.geo-ph

    Decadal analysis of sea surface temperature patterns, climatology, and anomalies in temperate coastal waters with Landsat-8 TIRS observations

    Authors: Yiqing Guo, Nagur Cherukuru, Eric Lehmann, Xiubin Qi, Mark Doubelld, S. L. Kesav Unnithan, Ming Feng

    Abstract: Sea surface temperature (SST) is a fundamental physical parameter characterising the thermal state of sea surface. Due to the intricate thermal interactions between land, sea, and atmosphere, the spatial gradients of SST in coastal waters often appear at finer spatial scales than those in open ocean waters. The Thermal Infrared Sensor (TIRS) onboard Landsat-8, with its 100-meter spatial resolution…

    Submitted 13 May, 2025; v1 submitted 6 March, 2025; originally announced March 2025.

    Comments: Submitted to GIScience & Remote Sensing

  29. arXiv:2503.04157

    eess.SP

    Deep Joint CSI Estimation-Feedback-Precoding for MU-MIMO OFDM Systems

    Authors: Yiran Guo, Wei Chen, Bo Ai, Lun Li

    Abstract: As the number of antennas in frequency-division duplex (FDD) multiple-input multiple-output (MIMO) systems increases, acquiring channel state information (CSI) becomes increasingly challenging due to limited spectral resources and feedback overhead. In this paper, we propose an end-to-end network that conducts joint design with pilot design, CSI estimation, CSI feedback, and precoding design in th…

    Submitted 6 March, 2025; originally announced March 2025.

  30. arXiv:2503.01710

    cs.SD cs.AI eess.AS

    Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens

    Authors: Xinsheng Wang, Mingqi Jiang, Ziyang Ma, Ziyu Zhang, Songxiang Liu, Linqin Li, Zheng Liang, Qixi Zheng, Rui Wang, Xiaoqin Feng, Weizhen Bian, Zhen Ye, Sitong Cheng, Ruibin Yuan, Zhixian Zhao, Xinfa Zhu, Jiahao Pan, Liumeng Xue, Pengcheng Zhu, Yunlin Chen, Zhifei Li, Xie Chen, Lei Xie, Yike Guo, Wei Xue

    Abstract: Recent advancements in large language models (LLMs) have driven significant progress in zero-shot text-to-speech (TTS) synthesis. However, existing foundation models rely on multi-stage processing or complex architectures for predicting multiple codebooks, limiting efficiency and integration flexibility. To overcome these challenges, we introduce Spark-TTS, a novel system powered by BiCodec, a sin…

    Submitted 3 March, 2025; originally announced March 2025.

    Comments: Submitted to ACL 2025

  31. arXiv:2503.01265

    eess.IV cs.CV

    Interactive Gadolinium-Free MRI Synthesis: A Transformer with Localization Prompt Learning

    Authors: Linhao Li, Changhui Su, Yu Guo, Huimao Zhang, Dong Liang, Kun Shang

    Abstract: Contrast-enhanced magnetic resonance imaging (CE-MRI) is crucial for tumor detection and diagnosis, but the use of gadolinium-based contrast agents (GBCAs) in clinical settings raises safety concerns due to potential health risks. To circumvent these issues while preserving diagnostic accuracy, we propose a novel Transformer with Localization Prompts (TLP) framework for synthesizing CE-MRI from no…

    Submitted 3 March, 2025; originally announced March 2025.

  32. arXiv:2502.18913

    cs.CL cs.SD eess.AS

    CS-Dialogue: A 104-Hour Dataset of Spontaneous Mandarin-English Code-Switching Dialogues for Speech Recognition

    Authors: Jiaming Zhou, Yujie Guo, Shiwan Zhao, Haoqin Sun, Hui Wang, Jiabei He, Aobo Kong, Shiyao Wang, Xi Yang, Yequan Wang, Yonghua Lin, Yong Qin

    Abstract: Code-switching (CS), the alternation between two or more languages within a single conversation, presents significant challenges for automatic speech recognition (ASR) systems. Existing Mandarin-English code-switching datasets often suffer from limitations in size, spontaneity, and the lack of full-length dialogue recordings with transcriptions, hindering the development of robust ASR models for r…

    Submitted 11 March, 2025; v1 submitted 26 February, 2025; originally announced February 2025.

  33. arXiv:2502.16584

    cs.SD cs.AI cs.CL cs.MM eess.AS

    Audio-FLAN: A Preliminary Release

    Authors: Liumeng Xue, Ziya Zhou, Jiahao Pan, Zixuan Li, Shuai Fan, Yinghao Ma, Sitong Cheng, Dongchao Yang, Haohan Guo, Yujia Xiao, Xinsheng Wang, Zixuan Shen, Chuanbo Zhu, Xinshen Zhang, Tianchi Liu, Ruibin Yuan, Zeyue Tian, Haohe Liu, Emmanouil Benetos, Ge Zhang, Yike Guo, Wei Xue

    Abstract: Recent advancements in audio tokenization have significantly enhanced the integration of audio capabilities into large language models (LLMs). However, audio understanding and generation are often treated as distinct tasks, hindering the development of truly unified audio-language models. While instruction tuning has demonstrated remarkable success in improving generalization and zero-shot learnin…

    Submitted 23 February, 2025; originally announced February 2025.

  34. arXiv:2502.06490

    eess.AS cs.AI cs.MM cs.SD eess.SP

    Recent Advances in Discrete Speech Tokens: A Review

    Authors: Yiwei Guo, Zhihan Li, Hankun Wang, Bohan Li, Chongtian Shao, Hanglei Zhang, Chenpeng Du, Xie Chen, Shujie Liu, Kai Yu

    Abstract: The rapid advancement of speech generation technologies in the era of large language models (LLMs) has established discrete speech tokens as a foundational paradigm for speech representation. These tokens, characterized by their discrete, compact, and concise nature, are not only advantageous for efficient transmission and storage, but also inherently compatible with the language modeling framewor…

    Submitted 16 February, 2025; v1 submitted 10 February, 2025; originally announced February 2025.

    Comments: 23 pages, 8 figures, 3 tables. Work in progress

  35. arXiv:2502.05471

    cs.SD eess.AS

    Enhancing Expressive Voice Conversion with Discrete Pitch-Conditioned Flow Matching Model

    Authors: Jialong Zuo, Shengpeng Ji, Minghui Fang, Ziyue Jiang, Xize Cheng, Qian Yang, Wenrui Liu, Guangyan Zhang, Zehai Tu, Yiwen Guo, Zhou Zhao

    Abstract: This paper introduces PFlow-VC, a conditional flow matching voice conversion model that leverages fine-grained discrete pitch tokens and target speaker prompt information for expressive voice conversion (VC). Previous VC works primarily focus on speaker conversion, with further exploration needed in enhancing expressiveness (such as prosody and emotion) for timbre conversion. Unlike previous metho…

    Submitted 8 February, 2025; originally announced February 2025.

    Comments: Accepted by ICASSP 2025

  36. arXiv:2502.04988

    eess.IV cs.CV

    CMamba: Learned Image Compression with State Space Models

    Authors: Zhuojie Wu, Heming Du, Shuyun Wang, Ming Lu, Haiyang Sun, Yandong Guo, Xin Yu

    Abstract: Learned Image Compression (LIC) has explored various architectures, such as Convolutional Neural Networks (CNNs) and transformers, in modeling image content distributions in order to achieve compression effectiveness. However, achieving high rate-distortion performance while maintaining low computational complexity (i.e., parameters, FLOPs, and latency) remains challenging. In this paper, we propos…

    Submitted 7 February, 2025; originally announced February 2025.

  37. arXiv:2502.04128

    eess.AS cs.AI cs.CL cs.MM cs.SD

    Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis

    Authors: Zhen Ye, Xinfa Zhu, Chi-Min Chan, Xinsheng Wang, Xu Tan, Jiahe Lei, Yi Peng, Haohe Liu, Yizhu Jin, Zheqi Dai, Hongzhan Lin, Jianyi Chen, Xingjian Du, Liumeng Xue, Yunlin Chen, Zhifei Li, Lei Xie, Qiuqiang Kong, Yike Guo, Wei Xue

    Abstract: Recent advances in text-based large language models (LLMs), particularly in the GPT series and the o1 model, have demonstrated the effectiveness of scaling both training-time and inference-time compute. However, current state-of-the-art TTS systems leveraging LLMs are often multi-stage, requiring separate models (e.g., diffusion models after LLM), complicating the decision of whether to scale a pa…

    Submitted 22 February, 2025; v1 submitted 6 February, 2025; originally announced February 2025.

  38. arXiv:2502.03496

    eess.IV cs.GR

    FreqPrior: Improving Video Diffusion Models with Frequency Filtering Gaussian Noise

    Authors: Yunlong Yuan, Yuanfan Guo, Chunwei Wang, Wei Zhang, Hang Xu, Li Zhang

    Abstract: Text-driven video generation has advanced significantly due to developments in diffusion models. Beyond the training and sampling phases, recent studies have investigated noise priors of diffusion models, as improved noise priors yield better generation results. One recent approach employs the Fourier transform to manipulate noise, marking the initial exploration of frequency operations in this co…

    Submitted 19 February, 2025; v1 submitted 5 February, 2025; originally announced February 2025.

    Comments: ICLR 2025

  39. arXiv:2502.03338

    eess.SY

    Optimal PMU Placement for Kalman Filtering of DAE Power System Models

    Authors: Milos Katanic, Yi Guo, John Lygeros, Gabriela Hug

    Abstract: Optimal sensor placement is essential for minimizing costs and ensuring accurate state estimation in power systems. This paper introduces a novel method for optimal sensor placement for dynamic state estimation of power systems modeled by differential-algebraic equations. The method identifies optimal sensor locations by minimizing the steady-state covariance matrix of the Kalman filter, thus mini…

    Submitted 5 February, 2025; originally announced February 2025.

  40. arXiv:2502.00699

    eess.SP cs.ET

    Measurement and Analysis of Scattering From Building Surfaces at Millimeter-Wave Frequency

    Authors: Yulu Guo, Tongjia Zhang, Shu Sun, Meixia Tao, Ruifeng Gao

    Abstract: In future air-to-ground integrated networks, the scattering effects from ground-based scatterers, such as buildings, cannot be neglected in millimeter-wave and higher frequency bands, and have a significant impact on channel characteristics. However, current scattering measurement studies primarily focus on single incident angles within the incident plane, leading to insufficient characterization…

    Submitted 2 February, 2025; originally announced February 2025.

    Comments: 6 pages, 7 figures. 2025 IEEE Wireless Communications and Networking Conference Workshops (WCNC Wkshps), Milan, Italy, 2025

  41. arXiv:2502.00358

    cs.SD cs.AI cs.LG cs.MM eess.AS

    Do Audio-Visual Segmentation Models Truly Segment Sounding Objects?

    Authors: Jia Li, Wenjie Zhao, Ziru Huang, Yunhui Guo, Yapeng Tian

    Abstract: Unlike traditional visual segmentation, audio-visual segmentation (AVS) requires the model not only to identify and segment objects but also to determine whether they are sound sources. Recent AVS approaches, leveraging transformer architectures and powerful foundation models like SAM, have achieved impressive performance on standard benchmarks. Yet, an important question remains: Do these models…

    Submitted 20 February, 2025; v1 submitted 1 February, 2025; originally announced February 2025.

  42. arXiv:2502.00043  [pdf, other]

    eess.SY cs.AI

    Mitigating Traffic Oscillations in Mixed Traffic Flow with Scalable Deep Koopman Predictive Control

    Authors: Hao Lyu, Yanyong Guo, Pan Liu, Nan Zheng, Ting Wang, Quansheng Yue

    Abstract: The use of connected automated vehicles (CAVs) is advocated to mitigate traffic oscillations in mixed traffic flow consisting of CAVs and human-driven vehicles (HDVs). This study proposes an adaptive deep Koopman predictive control framework (AdapKoopPC) for regulating mixed traffic flow. Firstly, a Koopman theory-based adaptive trajectory prediction deep network (AdapKoopnet) is designed for modeli…

    Submitted 22 April, 2025; v1 submitted 27 January, 2025; originally announced February 2025.

  43. arXiv:2501.18878  [pdf, ps, other]

    eess.SP

    Integrated Sensing and Communication System Based on Radio Frequency Resonance Beam

    Authors: Yixuan Guo, Shuaifan Xia, Mingliang Xiong, Qingwen Liu, Wen Fang, Qingwei Jiang, Gang Yan, Jiangchuan Mu

    Abstract: To address the complex beam control in traditional multiple-input multiple-output (MIMO) systems, researchers have proposed adaptive beam alignment using retro-directive antenna (RDA) arrays. This approach creates echo resonance between the base station (BS) and user equipment (UE), significantly reducing computational load. However, conventional resonant beam systems (RBS) suffer from echo interf…

    Submitted 5 June, 2025; v1 submitted 30 January, 2025; originally announced January 2025.

  44. arXiv:2501.16471  [pdf, other]

    cs.LG cs.AI eess.AS eess.IV q-bio.NC

    SIM: Surface-based fMRI Analysis for Inter-Subject Multimodal Decoding from Movie-Watching Experiments

    Authors: Simon Dahan, Gabriel Bénédict, Logan Z. J. Williams, Yourong Guo, Daniel Rueckert, Robert Leech, Emma C. Robinson

    Abstract: Current AI frameworks for brain decoding and encoding typically train and test models within the same datasets. This limits their utility for brain-computer interfaces (BCI) or neurofeedback, for which it would be useful to pool experiences across individuals to better simulate stimuli not sampled during training. A key obstacle to model generalisation is the degree of variability of inter-subjec…

    Submitted 27 January, 2025; originally announced January 2025.

    Comments: 27 pages, accepted to ICLR 2025

  45. arXiv:2501.14367  [pdf, other]

    cs.NI eess.SP

    Joint System Latency and Data Freshness Optimization for Cache-enabled Mobile Crowdsensing Networks

    Authors: Kexin Shi, Yaru Fu, Yongna Guo, Fu Lee Wang, Yan Zhang

    Abstract: Mobile crowdsensing (MCS) networks enable large-scale data collection by leveraging the ubiquity of mobile devices. However, frequent sensing and data transmission can lead to significant resource consumption. To mitigate this issue, edge caching has been proposed as a solution for storing recently collected data. Nonetheless, this approach may compromise data freshness. In this paper, we investig…

    Submitted 24 January, 2025; originally announced January 2025.

  46. arXiv:2501.08139  [pdf, other]

    eess.SP cs.AI cs.LG

    EEG-ReMinD: Enhancing Neurodegenerative EEG Decoding through Self-Supervised State Reconstruction-Primed Riemannian Dynamics

    Authors: Zirui Wang, Zhenxi Song, Yi Guo, Yuxin Liu, Guoyang Xu, Min Zhang, Zhiguo Zhang

    Abstract: The development of EEG decoding algorithms confronts challenges such as data sparsity, subject variability, and the need for precise annotations, all of which are vital for advancing brain-computer interfaces and enhancing the diagnosis of diseases. To address these issues, we propose a novel two-stage approach named Self-Supervised State Reconstruction-Primed Riemannian Dynamics (EEG-ReMinD), wh…

    Submitted 14 January, 2025; originally announced January 2025.

  47. arXiv:2501.07057  [pdf, other]

    math.OC eess.SY

    Optimization with Multi-sourced Reference Information and Unknown Trust: A Distributionally Robust Approach

    Authors: Yanru Guo, Ruiwei Jiang, Siqian Shen

    Abstract: In problems that involve input parameter information gathered from multiple data sources with varying reliability, incorporating users' trust about different sources in decision-optimization models can potentially improve solution performance and reliability. In this work, we propose a novel multi-reference distributionally robust optimization (MR-DRO) framework, where the model inputs are uncerta…

    Submitted 12 January, 2025; originally announced January 2025.

    Comments: 38 pages, 9 figures, 7 tables

  48. arXiv:2501.01684  [pdf]

    eess.SP

    Millimeter-Wave Energy-Efficient Hybrid Beamforming Architecture and Algorithm

    Authors: Hongpu Zhang, Yulu Guo, Liuxun Xue, Xingchen Liu, Shu Sun, Ruifeng Gao, Xianghao Yu, Meixia Tao

    Abstract: This paper studies energy-efficient hybrid beamforming architectures and their algorithm design in millimeter-wave communication systems, aiming to address the challenges faced by existing hybrid beamforming due to low hardware flexibility and high power consumption. To solve the problems of existing hybrid beamforming, a novel energy-efficient hybrid beamforming architecture is proposed, where radi…

    Submitted 3 January, 2025; originally announced January 2025.

    Comments: 21 pages, in Chinese language, 8 figures, published to Mobile Communications

    Journal ref: Mobile Communications, vol. 48, no. 12, pp. 86-96, December 2024

  49. arXiv:2412.17048  [pdf, other]

    eess.AS cs.CL cs.SD

    Why Do Speech Language Models Fail to Generate Semantically Coherent Outputs? A Modality Evolving Perspective

    Authors: Hankun Wang, Haoran Wang, Yiwei Guo, Zhihan Li, Chenpeng Du, Xie Chen, Kai Yu

    Abstract: Although text-based large language models exhibit human-level writing ability and remarkable intelligence, speech language models (SLMs) still struggle to generate semantically coherent outputs. There are several potential reasons for this performance degradation: (A) speech tokens mainly provide phonetic information rather than semantic information, (B) the length of speech sequences is much long…

    Submitted 22 December, 2024; originally announced December 2024.

  50. arXiv:2412.15597  [pdf, other]

    eess.SP

    Resonant Beam Multi-Target DOA Estimation

    Authors: Yixuan Guo, Qingwei Jiang, Mingliang Xiong, Wen Fang, Mingqing Liu, Qingqing Zhang, Qingwen Liu, Gang Yan

    Abstract: With the increasing demand for internet of things (IoT) applications, especially for location-based services, how to locate passive mobile targets (MTs) with minimal beam control has become a challenge. Resonant beam systems are considered promising IoT technologies with advantages such as beam self-alignment and energy concentration. To establish a resonant system in the radio frequency (RF) band…

    Submitted 13 February, 2025; v1 submitted 20 December, 2024; originally announced December 2024.