Showing 1–50 of 156 results for author: Ma, L

Searching in archive eess.
  1. arXiv:2507.05727  [pdf, ps, other]

    eess.AS cs.CL cs.SD

    ContextASR-Bench: A Massive Contextual Speech Recognition Benchmark

    Authors: He Wang, Linhan Ma, Dake Guo, Xiong Wang, Lei Xie, Jin Xu, Junyang Lin

    Abstract: Automatic Speech Recognition (ASR) has been extensively investigated, yet prior evaluative efforts have largely been restricted to contextless paradigms. This constraint stems from the limited proficiency of conventional ASR models in context modeling and their deficiency in memory and reasoning based on world knowledge. Recent breakthroughs in the development of Large Language Models (LLMs) and c…

    Submitted 8 July, 2025; originally announced July 2025.

    Comments: 18 pages, 4 figures

  2. arXiv:2506.23986  [pdf, ps, other]

    cs.SD eess.AS

    StreamFlow: Streaming Flow Matching with Block-wise Guided Attention Mask for Speech Token Decoding

    Authors: Dake Guo, Jixun Yao, Linhan Ma, He Wang, Lei Xie

    Abstract: Recent advancements in discrete token-based speech generation have highlighted the importance of token-to-waveform generation for audio quality, particularly in real-time interactions. Traditional frameworks integrating semantic tokens with flow matching (FM) struggle with streaming capabilities due to their reliance on a global receptive field. Additionally, directly implementing token-by-token s…

    Submitted 1 July, 2025; v1 submitted 30 June, 2025; originally announced June 2025.

  3. arXiv:2506.20945  [pdf, ps, other]

    cs.SD eess.AS

    A Multi-Stage Framework for Multimodal Controllable Speech Synthesis

    Authors: Rui Niu, Weihao Wu, Jie Chen, Long Ma, Zhiyong Wu

    Abstract: Controllable speech synthesis aims to control the style of generated speech using reference input, which can be of various modalities. Existing face-based methods struggle with robustness and generalization due to data quality constraints, while text prompt methods offer limited diversity and fine-grained control. Although multimodal approaches aim to integrate various modalities, their reliance o…

    Submitted 25 June, 2025; originally announced June 2025.

    Comments: Accepted by ICME 2025

  4. arXiv:2506.11160  [pdf, ps, other]

    eess.AS cs.SD

    S2ST-Omni: An Efficient Multilingual Speech-to-Speech Translation Framework via Seamless Speech-Text Alignment and Progressive Fine-tuning

    Authors: Yu Pan, Yuguang Yang, Yanni Hu, Jianhao Ye, Xiang Zhang, Hongbin Zhou, Lei Ma, Jianjun Zhao

    Abstract: Despite recent advances in multilingual speech-to-speech translation (S2ST), several critical challenges persist: 1) achieving high-quality translation remains a major hurdle, and 2) most existing methods heavily rely on large-scale parallel speech corpora, which are costly and difficult to obtain. To address these issues, we propose S2ST-Omni, an efficient and scalable framework for mult…

    Submitted 8 July, 2025; v1 submitted 11 June, 2025; originally announced June 2025.

    Comments: Work in progress

  5. arXiv:2506.11119  [pdf]

    cs.CL cs.SD eess.AS

    Benchmarking Foundation Speech and Language Models for Alzheimer's Disease and Related Dementia Detection from Spontaneous Speech

    Authors: Jingyu Li, Lingchao Mao, Hairong Wang, Zhendong Wang, Xi Mao, Xuelei Sherry Ni

    Abstract: Background: Alzheimer's disease and related dementias (ADRD) are progressive neurodegenerative conditions where early detection is vital for timely intervention and care. Spontaneous speech contains rich acoustic and linguistic markers that may serve as non-invasive biomarkers for cognitive decline. Foundation models, pre-trained on large-scale audio or text data, produce high-dimensional embeddin…

    Submitted 9 June, 2025; originally announced June 2025.

    MSC Class: 68T10 (Primary); 68U99 (Secondary) ACM Class: I.2.1; J.3

  6. arXiv:2506.01039  [pdf, ps, other]

    eess.AS cs.SD

    PseudoVC: Improving One-shot Voice Conversion with Pseudo Paired Data

    Authors: Songjun Cao, Qinghua Wu, Jie Chen, Jin Li, Long Ma

    Abstract: As parallel training data is scarce for one-shot voice conversion (VC) tasks, waveform reconstruction is typically performed by various VC systems. A typical one-shot VC system comprises a content encoder and a speaker encoder. However, two types of mismatches arise: one for the inputs to the content encoder during training and inference, and another for the inputs to the speaker encoder. To addre…

    Submitted 1 June, 2025; originally announced June 2025.

    Comments: 5 pages, 3 figures

  7. arXiv:2505.20902  [pdf]

    eess.IV cs.CV

    Multitemporal Latent Dynamical Framework for Hyperspectral Images Unmixing

    Authors: Ruiying Li, Bin Pan, Lan Ma, Xia Xu, Zhenwei Shi

    Abstract: Multitemporal hyperspectral unmixing can capture dynamical evolution of materials. Despite its capability, current methods emphasize variability of endmembers while neglecting dynamics of abundances, which motivates our adoption of neural ordinary differential equations to model abundances temporally. However, this motivation is hindered by two challenges: the inherent complexity in defining, mode…

    Submitted 27 May, 2025; originally announced May 2025.

    Comments: 11 pages, 8 figures

    MSC Class: 68T07 ACM Class: I.4.10

  8. arXiv:2505.18453  [pdf, other]

    cs.SD cs.AI eess.AS

    MPE-TTS: Customized Emotion Zero-Shot Text-To-Speech Using Multi-Modal Prompt

    Authors: Zhichao Wu, Yueteng Kang, Songjun Cao, Long Ma, Qiulin Li, Qun Yang

    Abstract: Most existing Zero-Shot Text-To-Speech (ZS-TTS) systems generate unseen speech based on a single prompt, such as reference speech or text descriptions, which limits their flexibility. We propose a customized emotion ZS-TTS system based on multi-modal prompts. The system disentangles speech into content, timbre, emotion and prosody, allowing emotion prompts to be provided as text, image or spee…

    Submitted 23 May, 2025; originally announced May 2025.

    Comments: Accepted by InterSpeech

  9. arXiv:2505.13805  [pdf, ps, other]

    cs.SD cs.AI eess.AS

    ClapFM-EVC: High-Fidelity and Flexible Emotional Voice Conversion with Dual Control from Natural Language and Speech

    Authors: Yu Pan, Yanni Hu, Yuguang Yang, Jixun Yao, Jianhao Ye, Hongbin Zhou, Lei Ma, Jianjun Zhao

    Abstract: Despite great advances, achieving high-fidelity emotional voice conversion (EVC) with flexible and interpretable control remains challenging. This paper introduces ClapFM-EVC, a novel EVC framework capable of generating high-quality converted speech driven by natural language prompts or reference speech with adjustable emotion intensity. We first propose EVC-CLAP, an emotional contrastive language…

    Submitted 19 May, 2025; originally announced May 2025.

    Comments: Accepted by InterSpeech 2025

  10. arXiv:2505.09939  [pdf, other]

    cs.CV eess.IV

    Non-Registration Change Detection: A Novel Change Detection Task and Benchmark Dataset

    Authors: Zhe Shan, Lei Zhou, Liu Mao, Shaofan Chen, Chuanqiu Ren, Xia Xie

    Abstract: In this study, we propose a novel remote sensing change detection task, non-registration change detection, to address the increasing number of emergencies such as natural disasters, anthropogenic accidents, and military strikes. First, in light of the limited discourse on the issue of non-registration change detection, we systematically propose eight scenarios that could arise in the real world an…

    Submitted 14 May, 2025; originally announced May 2025.

    Comments: Accepted to IGARSS 2025

  11. arXiv:2505.05159  [pdf, other]

    eess.AS

    FlexSpeech: Towards Stable, Controllable and Expressive Text-to-Speech

    Authors: Linhan Ma, Dake Guo, He Wang, Jin Xu, Lei Xie

    Abstract: Current speech generation research can be categorized into two primary classes: non-autoregressive and autoregressive. The fundamental distinction between these approaches lies in the duration prediction strategy employed for predictable-length sequences. The NAR methods ensure stability in speech generation by explicitly and independently modeling the duration of each phonetic unit. Conversely, A…

    Submitted 15 May, 2025; v1 submitted 8 May, 2025; originally announced May 2025.

    Comments: 10 pages, 5 figures

  12. arXiv:2504.10974  [pdf, ps, other]

    cs.CV eess.IV

    Self-Supervised Enhancement of Forward-Looking Sonar Images: Bridging Cross-Modal Degradation Gaps through Feature Space Transformation and Multi-Frame Fusion

    Authors: Zhisheng Zhang, Peng Zhang, Fengxiang Wang, Liangli Ma, Fuchun Sun

    Abstract: Enhancing forward-looking sonar images is critical for accurate underwater target detection. Current deep learning methods mainly rely on supervised training with simulated data, but the difficulty in obtaining high-quality real-world paired data limits their practical use and generalization. Although self-supervised approaches from remote sensing partially alleviate data shortages, they neglect t…

    Submitted 29 May, 2025; v1 submitted 15 April, 2025; originally announced April 2025.

  13. arXiv:2504.04705  [pdf, other]

    math.OC eess.SY

    Trajectory Optimization of Stochastic Systems under Chance Constraints via Set Erosion

    Authors: Zishun Liu, Liqian Ma, Yongxin Chen

    Abstract: We study the trajectory optimization problem under chance constraints for continuous-time stochastic systems. To address chance constraints imposed on the entire stochastic trajectory, we propose a framework based on the set erosion strategy, which converts the chance constraints into safety constraints on an eroded subset of the safe set along the corresponding deterministic trajectory. The depth…

    Submitted 6 April, 2025; originally announced April 2025.

  14. arXiv:2503.17005  [pdf]

    cs.RO eess.SY

    Autonomous Exploration-Based Precise Mapping for Mobile Robots through Stepwise and Consistent Motions

    Authors: Muhua Zhang, Lei Ma, Ying Wu, Kai Shen, Yongkui Sun, Henry Leung

    Abstract: This paper presents an autonomous exploration framework. It is designed for indoor ground mobile robots that utilize laser Simultaneous Localization and Mapping (SLAM), ensuring process completeness and precise mapping results. For frontier search, the local-global sampling architecture based on multiple Rapidly Exploring Random Trees (RRTs) is employed. Traversability checks during RRT expansion…

    Submitted 21 March, 2025; originally announced March 2025.

    Comments: 8 pages, 11 figures. This work has been submitted to the IEEE for possible publication

  15. arXiv:2503.04762  [pdf, other]

    cs.LO cs.FL eess.SY

    Safety Verification of Stochastic Systems under Signal Temporal Logic Specifications

    Authors: Liqian Ma, Zishun Liu, Hongzhe Yu, Yongxin Chen

    Abstract: We study the verification problem of stochastic systems under signal temporal logic (STL) specifications. We propose a novel approach that enables the verification of the probabilistic satisfaction of STL specifications for nonlinear systems subject to both bounded deterministic disturbances and stochastic disturbances. Our method, referred to as the STL erosion strategy, reduces the probabilistic…

    Submitted 10 February, 2025; originally announced March 2025.

    Comments: 6 pages

  16. arXiv:2503.03691  [pdf, other]

    eess.SP

    Ambiguity-Free Broadband DOA Estimation Relying on Parameterized Time-Frequency Transform

    Authors: Wei Wang, Shefeng Yan, Linlin Mao, Zeping Sui, Jirui Yang

    Abstract: An ambiguity-free direction-of-arrival (DOA) estimation scheme is proposed for sparse uniform linear arrays under low signal-to-noise ratios (SNRs) and non-stationary broadband signals. First, for achieving better DOA estimation performance at low SNRs while using non-stationary signals compared to the conventional frequency-difference (FD) paradigms, we propose parameterized time-frequency transf…

    Submitted 5 March, 2025; originally announced March 2025.

    Comments: 6 figures

  17. arXiv:2502.19924  [pdf, other]

    cs.SD cs.AI eess.AS

    DiffCSS: Diverse and Expressive Conversational Speech Synthesis with Diffusion Models

    Authors: Weihao Wu, Zhiwei Lin, Yixuan Zhou, Jingbei Li, Rui Niu, Qinghua Wu, Songjun Cao, Long Ma, Zhiyong Wu

    Abstract: Conversational speech synthesis (CSS) aims to synthesize both contextually appropriate and expressive speech, and considerable efforts have been made to enhance the understanding of conversational context. However, existing CSS systems are limited to deterministic prediction, overlooking the diversity of potential responses. Moreover, they rarely employ language model (LM)-based TTS backbones, lim…

    Submitted 27 February, 2025; originally announced February 2025.

    Comments: Accepted by ICASSP 2025

  18. arXiv:2502.06098  [pdf, other]

    cs.SD eess.AS

    An adaptive filter bank based neural network approach for time delay estimation and speech enhancement

    Authors: Lu Ma

    Abstract: Time delay estimation (TDE) plays a key role in acoustic echo cancellation (AEC) using the adaptive filter method. Considerable residual echo will be left if an estimation error arises. In this paper, we propose an adaptive filter bank based neural network approach where the delay is estimated by a bank of adaptive filters with overlapped time scope, and all the energy of filter weights are concat…

    Submitted 9 February, 2025; originally announced February 2025.

    Comments: audio 3A

  19. arXiv:2502.00702  [pdf, other]

    cs.HC cs.NI cs.SD eess.AS eess.IV

    CardioLive: Empowering Video Streaming with Online Cardiac Monitoring

    Authors: Sheng Lyu, Ruiming Huang, Sijie Ji, Yasar Abbas Ur Rehman, Lan Ma, Chenshu Wu

    Abstract: Online Cardiac Monitoring (OCM) emerges as a compelling enhancement for the next-generation video streaming platforms. It enables various applications including remote health, online affective computing, and deepfake detection. Yet the physiological information encapsulated in the video streams has been long neglected. In this paper, we present the design and implementation of CardioLive, the firs…

    Submitted 2 February, 2025; originally announced February 2025.

    Comments: Preprint

  20. arXiv:2501.16327  [pdf, other]

    cs.CL cs.SD eess.AS

    LUCY: Linguistic Understanding and Control Yielding Early Stage of Her

    Authors: Heting Gao, Hang Shao, Xiong Wang, Chaofan Qiu, Yunhang Shen, Siqi Cai, Yuchen Shi, Zihan Xu, Zuwei Long, Yike Zhang, Shaoqi Dong, Chaoyou Fu, Ke Li, Long Ma, Xing Sun

    Abstract: The film Her features Samantha, a sophisticated AI audio agent who is capable of understanding both linguistic and paralinguistic information in human speech and delivering real-time responses that are natural, informative and sensitive to emotional subtleties. Moving one step toward a more sophisticated audio agent from recent advances in end-to-end (E2E) speech systems, we propose LUCY, an E2E s…

    Submitted 27 January, 2025; originally announced January 2025.

    Comments: Demo Link: https://github.com/VITA-MLLM/LUCY

  21. arXiv:2501.06514  [pdf, other]

    cs.SD cs.AI eess.AS

    Neural Codec Source Tracing: Toward Comprehensive Attribution in Open-Set Condition

    Authors: Yuankun Xie, Xiaopeng Wang, Zhiyong Wang, Ruibo Fu, Zhengqi Wen, Songjun Cao, Long Ma, Chenxing Li, Haonnan Cheng, Long Ye

    Abstract: Current research in audio deepfake detection is gradually transitioning from binary classification to multi-class tasks, referred to as the audio deepfake source tracing task. However, existing studies on source tracing consider only closed-set scenarios and have not considered the challenges posed by open-set conditions. In this paper, we define the Neural Codec Source Tracing (NCST) task, which is capa…

    Submitted 11 January, 2025; originally announced January 2025.

  22. arXiv:2501.01957  [pdf, other]

    cs.CV cs.SD eess.AS

    VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

    Authors: Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Haoyu Cao, Zuwei Long, Heting Gao, Ke Li, Long Ma, Xiawu Zheng, Rongrong Ji, Xing Sun, Caifeng Shan, Ran He

    Abstract: Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on the role of speech in enhancing interaction. However, speech plays a crucial role in multimodal dialogue systems, and achieving high performance in both vision and speech tasks remains a significant challenge due to the fundamental modality difference…

    Submitted 21 January, 2025; v1 submitted 3 January, 2025; originally announced January 2025.

    Comments: https://github.com/VITA-MLLM/VITA (2K+ Stars by now)

  23. arXiv:2412.18107  [pdf, other]

    eess.AS cs.AI cs.SD

    SongGLM: Lyric-to-Melody Generation with 2D Alignment Encoding and Multi-Task Pre-Training

    Authors: Jiaxing Yu, Xinda Wu, Yunfei Xu, Tieyao Zhang, Songruoyao Wu, Le Ma, Kejun Zhang

    Abstract: Lyric-to-melody generation aims to automatically create melodies based on given lyrics, requiring the capture of complex and subtle correlations between them. However, previous works usually suffer from two main challenges: 1) lyric-melody alignment modeling, which is often simplified to one-syllable/word-to-one-note alignment, while others have the problem of low alignment accuracy; 2) lyric-melo…

    Submitted 23 December, 2024; originally announced December 2024.

    Comments: Extended version of paper accepted to AAAI 2025

  24. arXiv:2412.03121  [pdf, other]

    cs.CV eess.IV

    Splats in Splats: Embedding Invisible 3D Watermark within Gaussian Splatting

    Authors: Yijia Guo, Wenkai Huang, Yang Li, Gaolei Li, Hang Zhang, Liwen Hu, Jianhua Li, Tiejun Huang, Lei Ma

    Abstract: 3D Gaussian splatting (3DGS) has demonstrated impressive 3D reconstruction performance with explicit scene representations. Given the widespread application of 3DGS in 3D reconstruction and generation tasks, there is an urgent need to protect the copyright of 3DGS assets. However, existing copyright protection techniques for 3DGS overlook the usability of 3D assets, posing challenges for practical…

    Submitted 4 December, 2024; originally announced December 2024.

  25. arXiv:2412.01053  [pdf, ps, other]

    cs.SD eess.AS

    FreeCodec: A disentangled neural speech codec with fewer tokens

    Authors: Youqiang Zheng, Weiping Tu, Yueteng Kang, Jie Chen, Yike Zhang, Li Xiao, Yuhong Yang, Long Ma

    Abstract: Neural speech codecs have gained great attention for their outstanding reconstruction with discrete token representations. They are a crucial component in generative tasks such as speech coding and large language models (LLMs). However, most works based on residual vector quantization perform worse with fewer tokens due to low coding efficiency for modeling complex coupled information. In this p…

    Submitted 28 June, 2025; v1 submitted 1 December, 2024; originally announced December 2024.

    Comments: 5 pages, 2 figures, 3 tables. Code and demo page: https://github.com/exercise-book-yq/FreeCodec. Accepted to Interspeech 2025

  26. arXiv:2411.19514  [pdf, other]

    eess.IV cs.CV cs.LG

    Enhancing AI microscopy for foodborne bacterial classification via adversarial domain adaptation across optical and biological variability

    Authors: Siddhartha Bhattacharya, Aarham Wasit, Mason Earles, Nitin Nitin, Luyao Ma, Jiyoon Yi

    Abstract: Rapid detection of foodborne bacteria is critical for food safety and quality, yet traditional culture-based methods require extended incubation and specialized sample preparation. This study addresses these challenges by i) enhancing the generalizability of AI-enabled microscopy for bacterial classification using adversarial domain adaptation and ii) comparing the performance of single-target and…

    Submitted 29 November, 2024; originally announced November 2024.

  27. Versatile Cataract Fundus Image Restoration Model Utilizing Unpaired Cataract and High-quality Images

    Authors: Zheng Gong, Zhuo Deng, Weihao Gao, Wenda Zhou, Yuhang Yang, Hanqing Zhao, Zhiyuan Niu, Lei Shao, Wenbin Wei, Lan Ma

    Abstract: Cataract is one of the most common blinding eye diseases and can be treated by surgery. However, because cataract patients may also suffer from other blinding eye diseases, ophthalmologists must diagnose them before surgery. The cloudy lens of cataract patients forms a hazy degeneration in the fundus images, making it challenging to observe the patient's fundus vessels, which brings difficulties t…

    Submitted 19 November, 2024; originally announced November 2024.

    Comments: 12 pages, 8 figures

  28. arXiv:2411.12273  [pdf, other]

    eess.IV cs.CV

    Acquire Precise and Comparable Fundus Image Quality Score: FTHNet and FQS Dataset

    Authors: Zheng Gong, Zhuo Deng, Run Gan, Zhiyuan Niu, Lu Chen, Canfeng Huang, Jia Liang, Weihao Gao, Fang Li, Shaochong Zhang, Lan Ma

    Abstract: Retinal fundus images are used extensively in diagnosis, and their quality can directly affect the diagnosis results. However, due to the insufficient dataset and algorithm application, current fundus image quality assessment (FIQA) methods are not powerful enough to meet ophthalmologists' demands. In this paper, we address the limitations of datasets and algorithms in FIQA. First, we…

    Submitted 19 November, 2024; originally announced November 2024.

    Comments: 11 pages, 7 figures

  29. arXiv:2411.03612  [pdf, other]

    eess.SP

    Multi-bit Distributed Detection of Sparse Stochastic Signals over Error-Prone Reporting Channels

    Authors: Linlin Mao, Shefeng Yan, Zeping Sui, Hongbin Li

    Abstract: We consider a distributed detection problem within a wireless sensor network (WSN), where a substantial number of sensors cooperate to detect the existence of sparse stochastic signals. To achieve a trade-off between detection performance and system constraints, multi-bit quantizers are employed at local sensors. Then, two quantization strategies, namely raw quantization (RQ) and likelihood ratio…

    Submitted 5 November, 2024; originally announced November 2024.

    Comments: Accepted by IEEE Transactions on Signal and Information Processing over Networks

  30. arXiv:2411.02026  [pdf, other]

    cs.SD cs.AI eess.AS

    CTEFM-VC: Zero-Shot Voice Conversion Based on Content-Aware Timbre Ensemble Modeling and Flow Matching

    Authors: Yu Pan, Yuguang Yang, Jixun Yao, Jianhao Ye, Hongbin Zhou, Lei Ma, Jianjun Zhao

    Abstract: Zero-shot voice conversion (VC) aims to transform the timbre of a source speaker into any previously unseen target speaker, while preserving the original linguistic content. Despite notable progress, attaining a degree of speaker similarity and naturalness on par with ground truth recordings continues to pose a great challenge. In this paper, we propose CTEFM-VC, a zero-shot VC framework that levera…

    Submitted 4 November, 2024; originally announced November 2024.

    Comments: Work in progress; 5 pages

  31. arXiv:2411.00774  [pdf, other]

    cs.SD cs.AI cs.CL eess.AS

    Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM

    Authors: Xiong Wang, Yangze Li, Chaoyou Fu, Yunhang Shen, Lei Xie, Ke Li, Xing Sun, Long Ma

    Abstract: Rapidly developing large language models (LLMs) have enabled a wide range of intelligent applications. In particular, GPT-4o's excellent duplex speech interaction ability has brought an impressive experience to users. Researchers have recently proposed several multi-modal LLMs in this direction that can achieve user-agent speech-to-speech conversations. This paper proposes a novel speech-text multimodal…

    Submitted 8 December, 2024; v1 submitted 1 November, 2024; originally announced November 2024.

    Comments: Project Page: https://freeze-omni.github.io/

  32. arXiv:2410.10005  [pdf]

    cs.LG eess.IV

    SmoothSegNet: A Global-Local Framework for Liver Tumor Segmentation with Clinical Knowledge-Informed Label Smoothing

    Authors: Hairong Wang, Lingchao Mao, Zihan Zhang, Jing Li

    Abstract: Liver cancer is a leading cause of mortality worldwide, and accurate Computed Tomography (CT)-based tumor segmentation is essential for diagnosis and treatment. Manual delineation is time-intensive, prone to variability, and highlights the need for reliable automation. While deep learning has shown promise for automated liver segmentation, precise liver tumor segmentation remains challenging due t…

    Submitted 29 April, 2025; v1 submitted 13 October, 2024; originally announced October 2024.

  33. arXiv:2410.01350  [pdf, other]

    cs.SD cs.AI eess.AS

    Takin-VC: Expressive Zero-Shot Voice Conversion via Adaptive Hybrid Content Encoding and Enhanced Timbre Modeling

    Authors: Yuguang Yang, Yu Pan, Jixun Yao, Xiang Zhang, Jianhao Ye, Hongbin Zhou, Lei Xie, Lei Ma, Jianjun Zhao

    Abstract: Expressive zero-shot voice conversion (VC) is a critical and challenging task that aims to transform the source timbre into an arbitrary unseen speaker while preserving the original content and expressive qualities. Despite recent progress in zero-shot VC, there remains considerable potential for improvements in speaker similarity and speech naturalness. Moreover, existing zero-shot VC systems str…

    Submitted 10 January, 2025; v1 submitted 2 October, 2024; originally announced October 2024.

    Comments: Work in Progress; Under Review

  34. arXiv:2408.13978  [pdf, other]

    eess.IV cs.CV

    Histology Virtual Staining with Mask-Guided Adversarial Transfer Learning for Tertiary Lymphoid Structure Detection

    Authors: Qiuli Wang, Yongxu Liu, Li Ma, Xianqi Wang, Wei Chen, Xiaohong Yao

    Abstract: Histological Tertiary Lymphoid Structures (TLSs) are increasingly recognized for their correlation with the efficacy of immunotherapy in various solid tumors. Traditionally, the identification and characterization of TLSs rely on immunohistochemistry (IHC) staining techniques, utilizing markers such as CD20 for B cells. Despite the specificity of IHC, Hematoxylin-Eosin (H&E) staining offers a more…

    Submitted 25 August, 2024; originally announced August 2024.

    Comments: 8 pages, 8 figures

  35. arXiv:2408.09491  [pdf, other]

    cs.SD eess.AS

    A Transcription Prompt-based Efficient Audio Large Language Model for Robust Speech Recognition

    Authors: Yangze Li, Xiong Wang, Songjun Cao, Yike Zhang, Long Ma, Lei Xie

    Abstract: Audio-LLM introduces audio modality into a large language model (LLM) to enable a powerful LLM to recognize, understand, and generate audio. However, during speech recognition in noisy environments, we observed the presence of illusions and repetition issues in audio-LLM, leading to substitution and insertion errors. This paper proposes a transcription prompt-based audio-LLM by introducing an ASR…

    Submitted 18 August, 2024; originally announced August 2024.

  36. arXiv:2407.07728   

    cs.SD cs.AI cs.MM eess.AS

    SaMoye: Zero-shot Singing Voice Conversion Model Based on Feature Disentanglement and Enhancement

    Authors: Zihao Wang, Le Ma, Yongsheng Feng, Xin Pan, Yuhang Jin, Kejun Zhang

    Abstract: Singing voice conversion (SVC) aims to convert a singer's voice to another singer's from a reference audio while keeping the original semantics. However, existing SVC methods can hardly perform zero-shot due to incomplete feature disentanglement or dependence on the speaker look-up table. We propose the first open-source high-quality zero-shot SVC model SaMoye that can convert singing to human and…

    Submitted 15 November, 2024; v1 submitted 10 July, 2024; originally announced July 2024.

    Comments: This paper needs major changes for resubmit

    MSC Class: 68Txx(Primary)14F05; 91Fxx(Secondary) ACM Class: I.2.7; J.5

  37. arXiv:2407.06662  [pdf, other]

    eess.SP

    Experimental Demonstration of 16D Voronoi Constellation with Two-Level Coding over 50km Four-Core Fiber

    Authors: Can Zhao, Bin Chen, Jiaqi Cai, Zhiwei Liang, Yi Lei, Junjie Xiong, Lin Ma, Daohui Hu, Lin Sun, Gangxiang Shen

    Abstract: A 16-dimensional Voronoi constellation concatenated with multilevel coding is experimentally demonstrated over a 50km four-core fiber transmission system. The proposed scheme reduces the required launch power by 6dB and provides a 17dB larger operating range than 16QAM with BICM at the outer HD-FEC BER threshold.

    Submitted 9 July, 2024; originally announced July 2024.

    Comments: 4 pages, 4 figures, accepted by 2024 European Conference on Optical Communication (ECOC)

  38. arXiv:2407.03374  [pdf]

    cs.AI cs.SE eess.SP eess.SY

    An Outline of Prognostics and Health Management Large Model: Concepts, Paradigms, and Challenges

    Authors: Laifa Tao, Shangyu Li, Haifei Liu, Qixuan Huang, Liang Ma, Guoao Ning, Yiling Chen, Yunlong Wu, Bin Li, Weiwei Zhang, Zhengduo Zhao, Wenchao Zhan, Wenyan Cao, Chao Wang, Hongmei Liu, Jian Ma, Mingliang Suo, Yujie Cheng, Yu Ding, Dengwei Song, Chen Lu

    Abstract: Prognosis and Health Management (PHM), critical for ensuring task completion by complex systems and preventing unexpected failures, is widely adopted in aerospace, manufacturing, maritime, rail, energy, etc. However, PHM's development is constrained by bottlenecks like generalization, interpretation and verification abilities. Presently, generative artificial intelligence (AI), represented by Larg…

    Submitted 1 July, 2024; originally announced July 2024.

  39. arXiv:2406.18102  [pdf]

    eess.IV cs.CV

    A Lung Nodule Dataset with Histopathology-based Cancer Type Annotation

    Authors: Muwei Jian, Hongyu Chen, Zaiyong Zhang, Nan Yang, Haorang Zhang, Lifu Ma, Wenjing Xu, Huixiang Zhi

    Abstract: Recently, Computer-Aided Diagnosis (CAD) systems have emerged as indispensable tools in clinical diagnostic workflows, significantly alleviating the burden on radiologists. Nevertheless, despite their integration into clinical settings, CAD systems encounter limitations. Specifically, while CAD systems can achieve high performance in the detection of lung nodules, they face challenges in accuratel…

    Submitted 26 June, 2024; originally announced June 2024.

  40. arXiv:2406.13145  [pdf, other]

    eess.SY cs.LG

    Constructing and Evaluating Digital Twins: An Intelligent Framework for DT Development

    Authors: Longfei Ma, Nan Cheng, Xiucheng Wang, Jiong Chen, Yinjun Gao, Dongxiao Zhang, Jun-Jie Zhang

    Abstract: The development of Digital Twins (DTs) represents a transformative advance for simulating and optimizing complex systems in a controlled digital space. Despite their potential, the challenge of constructing DTs that accurately replicate and predict the dynamics of real-world systems remains substantial. This paper introduces an intelligent framework for the construction and evaluation of DTs, spec…

    Submitted 18 June, 2024; originally announced June 2024.

  41. arXiv:2406.10236  [pdf, other]

    eess.IV cs.AI

    Lightening Anything in Medical Images

    Authors: Ben Fei, Yixuan Li, Weidong Yang, Hengjun Gao, Jingyi Xu, Lipeng Ma, Yatian Yang, Pinghong Zhou

    Abstract: The development of medical imaging techniques has made a significant contribution to clinical decision-making. However, the existence of suboptimal imaging quality, as indicated by irregular illumination or imbalanced intensity, presents significant obstacles in automating disease screening, analysis, and diagnosis. Existing approaches for natural image enhancement are mostly trained with numerous…

    Submitted 1 June, 2024; originally announced June 2024.

    Comments: 23 pages, 6 figures

  42. arXiv:2406.09844  [pdf, other]

    cs.SD eess.AS

    Vec-Tok-VC+: Residual-enhanced Robust Zero-shot Voice Conversion with Progressive Constraints in a Dual-mode Training Strategy

    Authors: Linhan Ma, Xinfa Zhu, Yuanjun Lv, Zhichao Wang, Ziqian Wang, Wendi He, Hongbin Zhou, Lei Xie

    Abstract: Zero-shot voice conversion (VC) aims to transform source speech into arbitrary unseen target voice while keeping the linguistic content unchanged. Recent VC methods have made significant progress, but semantic losses in the decoupling process as well as training-inference mismatch still hinder conversion performance. In this paper, we propose Vec-Tok-VC+, a novel prompt-based zero-shot VC model im…

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: Accepted by INTERSPEECH2024

  43. arXiv:2406.05763  [pdf, other]

    eess.AS

    WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark

    Authors: Linhan Ma, Dake Guo, Kun Song, Yuepeng Jiang, Shuai Wang, Liumeng Xue, Weiming Xu, Huan Zhao, Binbin Zhang, Lei Xie

    Abstract: With the development of large text-to-speech (TTS) models and scale-up of the training data, state-of-the-art TTS systems have achieved impressive performance. In this paper, we present WenetSpeech4TTS, a multi-domain Mandarin corpus derived from the open-sourced WenetSpeech dataset. Tailored for the text-to-speech tasks, we refined WenetSpeech by adjusting segment boundaries, enhancing the audio…

    Submitted 19 June, 2024; v1 submitted 9 June, 2024; originally announced June 2024.

    Comments: Accepted by INTERSPEECH2024

  44. arXiv:2406.03912  [pdf, other]

    cs.AI cs.LG cs.RO eess.SY

    GenSafe: A Generalizable Safety Enhancer for Safe Reinforcement Learning Algorithms Based on Reduced Order Markov Decision Process Model

    Authors: Zhehua Zhou, Xuan Xie, Jiayang Song, Zhan Shu, Lei Ma

    Abstract: Safe Reinforcement Learning (SRL) aims to realize a safe learning process for Deep Reinforcement Learning (DRL) algorithms by incorporating safety constraints. However, the efficacy of SRL approaches often relies on accurate function approximations, which are notably challenging to achieve in the early learning stages due to data insufficiency. To address this issue, we introduce in this work a no…

    Submitted 14 January, 2025; v1 submitted 6 June, 2024; originally announced June 2024.

  45. arXiv:2405.14802  [pdf, other]

    eess.IV cs.CV

    Fast-DDPM: Fast Denoising Diffusion Probabilistic Models for Medical Image-to-Image Generation

    Authors: Hongxu Jiang, Muhammad Imran, Linhai Ma, Teng Zhang, Yuyin Zhou, Muxuan Liang, Kuang Gong, Wei Shao

    Abstract: Denoising diffusion probabilistic models (DDPMs) have achieved unprecedented success in computer vision. However, they remain underutilized in medical imaging, a field crucial for disease diagnosis and treatment planning. This is primarily due to the high computational cost associated with (1) the use of large number of time steps (e.g., 1,000) in diffusion processes and (2) the increased dimensio…

    Submitted 23 May, 2024; v1 submitted 23 May, 2024; originally announced May 2024.

  46. arXiv:2405.11380  [pdf, other]

    cs.RO cs.AI eess.SY

    Meta-Control: Automatic Model-based Control Synthesis for Heterogeneous Robot Skills

    Authors: Tianhao Wei, Liqian Ma, Rui Chen, Weiye Zhao, Changliu Liu

    Abstract: The requirements for real-world manipulation tasks are diverse and often conflicting; some tasks require precise motion while others require force compliance; some tasks require avoidance of certain regions, while others require convergence to certain states. Satisfying these varied requirements with a fixed state-action representation and control strategy is challenging, impeding the development…

    Submitted 11 December, 2024; v1 submitted 18 May, 2024; originally announced May 2024.

  47. arXiv:2405.09291  [pdf, other]

    cs.CV cs.AI eess.IV

    Sensitivity Decouple Learning for Image Compression Artifacts Reduction

    Authors: Li Ma, Yifan Zhao, Peixi Peng, Yonghong Tian

    Abstract: With the benefit of deep learning techniques, recent researches have made significant progress in image compression artifacts reduction. Despite their improved performances, prevailing methods only focus on learning a mapping from the compressed image to the original one but ignore the intrinsic attributes of the given compressed images, which greatly harms the performance of downstream parsing ta…

    Submitted 15 May, 2024; originally announced May 2024.

    Comments: Accepted by Transactions on Image Processing

  48. arXiv:2405.04867  [pdf, other]

    eess.IV cs.CV

    MIPI 2024 Challenge on Demosaic for HybridEVS Camera: Methods and Results

    Authors: Yaqi Wu, Zhihao Fan, Xiaofeng Chu, Jimmy S. Ren, Xiaoming Li, Zongsheng Yue, Chongyi Li, Shangcheng Zhou, Ruicheng Feng, Yuekun Dai, Peiqing Yang, Chen Change Loy, Senyan Xu, Zhijing Sun, Jiaying Zhu, Yurui Zhu, Xueyang Fu, Zheng-Jun Zha, Jun Cao, Cheng Li, Shu Chen, Liang Ma, Shiyang Zhou, Haijin Zeng, Kai Feng, et al. (24 additional authors not shown)

    Abstract: The increasing demand for computational photography and imaging on mobile platforms has led to the widespread development and integration of advanced image sensors with novel algorithms in camera systems. However, the scarcity of high-quality data for research and the rare opportunity for in-depth exchange of views from industry and academia constrain the development of mobile intelligent photogra…

    Submitted 8 May, 2024; originally announced May 2024.

    Comments: MIPI@CVPR2024. Website: https://mipi-challenge.org/MIPI2024/

  49. arXiv:2405.04629  [pdf, other]

    eess.IV cs.AI physics.med-ph

    ResNCT: A Deep Learning Model for the Synthesis of Nephrographic Phase Images in CT Urography

    Authors: Syed Jamal Safdar Gardezi, Lucas Aronson, Peter Wawrzyn, Hongkun Yu, E. Jason Abel, Daniel D. Shapiro, Meghan G. Lubner, Joshua Warner, Giuseppe Toia, Lu Mao, Pallavi Tiwari, Andrew L. Wentland

    Abstract: Purpose: To develop and evaluate a transformer-based deep learning model for the synthesis of nephrographic phase images in CT urography (CTU) examinations from the unenhanced and urographic phases. Materials and Methods: This retrospective study was approved by the local Institutional Review Board. A dataset of 119 patients (mean $\pm$ SD age, 65 $\pm$ 12 years; 75/44 males/females) with three-…

    Submitted 28 May, 2024; v1 submitted 7 May, 2024; originally announced May 2024.

    Comments: 10 pages, 5 figures, 2 tables

    MSC Class: eess.IV; ACM Class: J.3

  50. arXiv:2405.02151  [pdf, other]

    cs.SD cs.AI eess.AS

    GMP-TL: Gender-augmented Multi-scale Pseudo-label Enhanced Transfer Learning for Speech Emotion Recognition

    Authors: Yu Pan, Yuguang Yang, Heng Lu, Lei Ma, Jianjun Zhao

    Abstract: The continuous evolution of pre-trained speech models has greatly advanced Speech Emotion Recognition (SER). However, current research typically relies on utterance-level emotion labels, inadequately capturing the complexity of emotions within a single utterance. In this paper, we introduce GMP-TL, a novel SER framework that employs gender-augmented multi-scale pseudo-label (GMP) based transfer le…

    Submitted 23 September, 2024; v1 submitted 3 May, 2024; originally announced May 2024.

    Comments: Accepted to SLT2024