Showing 1–50 of 333 results for author: Ma, Z

  1. arXiv:2507.01608

    cs.CV eess.IV

    Perception-Oriented Latent Coding for High-Performance Compressed Domain Semantic Inference

    Authors: Xu Zhang, Ming Lu, Yan Chen, Zhan Ma

    Abstract: In recent years, compressed domain semantic inference has primarily relied on learned image coding models optimized for mean squared error (MSE). However, MSE-oriented optimization tends to yield latent spaces with limited semantic richness, which hinders effective semantic inference in downstream tasks. Moreover, achieving high performance with these models often requires fine-tuning the entire v…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: International Conference on Multimedia and Expo (ICME), 2025

  2. arXiv:2506.23484

    cs.MM cs.CV eess.IV

    TAG-WM: Tamper-Aware Generative Image Watermarking via Diffusion Inversion Sensitivity

    Authors: Yuzhuo Chen, Zehua Ma, Han Fang, Weiming Zhang, Nenghai Yu

    Abstract: AI-generated content (AIGC) enables efficient visual creation but raises copyright and authenticity risks. As a common technique for integrity verification and source tracing, digital image watermarking is regarded as a potential solution to the above issues. Among these, watermarking methods capable of preserving the generation quality are receiving increased attention. However, the proliferation and…

    Submitted 29 June, 2025; originally announced June 2025.

    Comments: Accepted by ICCV 2025 (2025 IEEE/CVF International Conference on Computer Vision)

    ACM Class: I.3.3; I.4.9

  3. arXiv:2506.20400

    cs.MA cs.CE cs.HC eess.SY

    A Visualization Framework for Exploring Multi-Agent-Based Simulations: Case Study of an Electric Vehicle Home Charging Ecosystem

    Authors: Kristoffer Christensen, Bo Nørregaard Jørgensen, Zheng Grace Ma

    Abstract: Multi-agent-based simulations (MABS) of electric vehicle (EV) home charging ecosystems generate large, complex, and stochastic time-series datasets that capture interactions between households, grid infrastructure, and energy markets. These interactions can lead to unexpected system-level events, such as transformer overloads or consumer dissatisfaction, that are difficult to detect and explain th…

    Submitted 25 June, 2025; originally announced June 2025.

  4. arXiv:2506.16957

    eess.SP

    Wi-Fi Sensing Tool Release: Gathering 802.11ax Channel State Information from a Commercial Wi-Fi Access Point

    Authors: Zisheng Wang, Feng Li, Hangbin Zhao, Zihuan Mao, Yaodong Zhang, Qisheng Huang, Bo Cao, Mingming Cao, Baolin He, Qilin Hou

    Abstract: Wi-Fi sensing has emerged as a powerful technology, leveraging channel state information (CSI) extracted from wireless data packets to enable diverse applications, ranging from human presence detection to gesture recognition and health monitoring. However, tools for CSI extraction from commercial Wi-Fi access points are lacking and out of date. This paper introduces ZTECSITool, a toolkit designed to capture high-re…

    Submitted 20 June, 2025; originally announced June 2025.

    Comments: This work has been submitted to the IEEE for possible publication

  5. arXiv:2506.13339

    cs.CL eess.AS

    NTU Speechlab LLM-Based Multilingual ASR System for Interspeech MLC-SLM Challenge 2025

    Authors: Yizhou Peng, Bin Wang, Yi-Wen Chao, Ziyang Ma, Haoyang Zhang, Hexin Liu, Xie Chen, Eng Siong Chng

    Abstract: This report details the NTU Speechlab system developed for the Interspeech 2025 Multilingual Conversational Speech and Language Model (MLC-SLM) Challenge (Task I), where we achieved 5th place. We present comprehensive analyses of our multilingual automatic speech recognition system, highlighting key advancements in model architecture, data selection, and training strategies. In particular, languag…

    Submitted 4 July, 2025; v1 submitted 16 June, 2025; originally announced June 2025.

    Comments: Accepted by Interspeech 2025 MLC-SLM challenge (5th place). System report

  6. arXiv:2506.09344

    cs.AI cs.CL cs.CV cs.LG cs.SD eess.AS

    Ming-Omni: A Unified Multimodal Model for Perception and Generation

    Authors: Inclusion AI, Biao Gong, Cheng Zou, Chuanyang Zheng, Chunluan Zhou, Canxiang Yan, Chunxiang Jin, Chunjie Shen, Dandan Zheng, Fudong Wang, Furong Xu, GuangMing Yao, Jun Zhou, Jingdong Chen, Jianxin Sun, Jiajia Liu, Jianjiang Zhu, Jun Peng, Kaixiang Ji, Kaiyou Song, Kaimeng Ren, Libin Wang, Lixiang Ru, Lele Xie, Longhua Tan, et al. (33 additional authors not shown)

    Abstract: We propose Ming-Omni, a unified multimodal model capable of processing images, text, audio, and video, while demonstrating strong proficiency in both speech and image generation. Ming-Omni employs dedicated encoders to extract tokens from different modalities, which are then processed by Ling, an MoE architecture equipped with newly proposed modality-specific routers. This design enables a single…

    Submitted 10 June, 2025; originally announced June 2025.

    Comments: 18 pages, 8 figures

  7. arXiv:2506.00385

    cs.SD cs.AI cs.LG eess.AS

    MagiCodec: Simple Masked Gaussian-Injected Codec for High-Fidelity Reconstruction and Generation

    Authors: Yakun Song, Jiawei Chen, Xiaobin Zhuang, Chenpeng Du, Ziyang Ma, Jian Wu, Jian Cong, Dongya Jia, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xie Chen

    Abstract: Neural audio codecs have made significant strides in efficiently mapping raw audio waveforms into discrete token representations, which are foundational for contemporary audio generative models. However, most existing codecs are optimized primarily for reconstruction quality, often at the expense of the downstream modelability of the encoded tokens. Motivated by the need to overcome this bottlenec…

    Submitted 31 May, 2025; originally announced June 2025.

    Comments: 18 pages, 3 figures. The code and pre-trained models are available at https://github.com/Ereboas/MagiCodec

  8. arXiv:2505.24583

    cs.ET eess.SP

    Cognitive-Radio Functionality: A Novel Configuration for STAR-RIS assisted RSMA Networks

    Authors: Saeed Ibrahim, Yue Xiao, Dimitrios Tyrovolas, Sotiris A. Tegos, Panagiotis D. Diamantoulakis, Zheng Ma, George K. Karagiannidis, Pinghzi Fan

    Abstract: Cognitive radio rate-splitting multiple access (CR-RSMA) has emerged as a promising multiple access framework that can efficiently manage interference and adapt dynamically to heterogeneous quality-of-service (QoS) requirements. To effectively support such demanding access schemes, programmable wireless environments have attracted considerable attention, especially through simultaneously transmitt…

    Submitted 30 May, 2025; originally announced May 2025.

  9. arXiv:2505.19931

    eess.AS cs.SD

    Accelerating Flow-Matching-Based Text-to-Speech via Empirically Pruned Step Sampling

    Authors: Qixi Zheng, Yushen Chen, Zhikang Niu, Ziyang Ma, Xiaofei Wang, Kai Yu, Xie Chen

    Abstract: Flow-matching-based text-to-speech (TTS) models, such as Voicebox, E2 TTS, and F5-TTS, have attracted significant attention in recent years. These models require multiple sampling steps to reconstruct speech from noise, making inference speed a key challenge. Reducing the number of sampling steps can greatly improve inference efficiency. To this end, we introduce Fast F5-TTS, a training-free appro…

    Submitted 4 June, 2025; v1 submitted 26 May, 2025; originally announced May 2025.

  10. arXiv:2505.19294

    cs.SD cs.CL cs.HC cs.MM eess.AS

    Towards Reliable Large Audio Language Model

    Authors: Ziyang Ma, Xiquan Li, Yakun Song, Wenxi Chen, Chenpeng Du, Jian Wu, Yuanzhe Chen, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xie Chen

    Abstract: Recent advancements in large audio language models (LALMs) have demonstrated impressive results and promising prospects in universal understanding and reasoning across speech, music, and general sound. However, these models still lack the ability to recognize their knowledge boundaries and refuse to answer questions they don't know proactively. While there have been successful attempts to enhance…

    Submitted 25 May, 2025; originally announced May 2025.

    Comments: ACL 2025 Findings

  11. arXiv:2505.16211

    cs.SD cs.AI cs.CL eess.AS

    AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models

    Authors: Kai Li, Can Shen, Yile Liu, Jirui Han, Kelong Zheng, Xuechao Zou, Zhe Wang, Xingjian Du, Shun Zhang, Hanjun Luo, Yingbin Jin, Xinxin Xing, Ziyang Ma, Yue Liu, Xiaojun Jia, Yifan Zhang, Junfeng Fang, Kun Wang, Yibo Yan, Haoyang Li, Yiming Li, Xiaobin Zhuang, Yang Liu, Haibo Hu, Zhizheng Wu, et al. (6 additional authors not shown)

    Abstract: The rapid advancement and expanding applications of Audio Large Language Models (ALLMs) demand a rigorous understanding of their trustworthiness. However, systematic research on evaluating these models, particularly concerning risks unique to the audio modality, remains largely unexplored. Existing evaluation frameworks primarily focus on the text modality or address only a restricted set of safet…

    Submitted 1 July, 2025; v1 submitted 22 May, 2025; originally announced May 2025.

    Comments: Technical Report

  12. arXiv:2505.13181

    cs.CL cs.SD eess.AS

    Efficient Speech Language Modeling via Energy Distance in Continuous Latent Space

    Authors: Zhengrui Ma, Yang Feng, Chenze Shao, Fandong Meng, Jie Zhou, Min Zhang

    Abstract: We introduce SLED, an alternative approach to speech language modeling by encoding speech waveforms into sequences of continuous latent representations and modeling them autoregressively using an energy distance objective. The energy distance offers an analytical measure of the distributional gap by contrasting simulated and target samples, enabling efficient training to capture the underlying con…

    Submitted 19 May, 2025; originally announced May 2025.

    Comments: Demos and code are available at https://github.com/ictnlp/SLED-TTS

  13. arXiv:2505.13032

    cs.SD cs.CL cs.MM eess.AS

    MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix

    Authors: Ziyang Ma, Yinghao Ma, Yanqiao Zhu, Chen Yang, Yi-Wen Chao, Ruiyang Xu, Wenxi Chen, Yuanzhe Chen, Zhuo Chen, Jian Cong, Kai Li, Keliang Li, Siyou Li, Xinfeng Li, Xiquan Li, Zheng Lian, Yuzhe Liang, Minghao Liu, Zhikang Niu, Tianrui Wang, Yuping Wang, Yuxuan Wang, Yihao Wu, Guanrou Yang, Jianwei Yu, et al. (9 additional authors not shown)

    Abstract: We introduce MMAR, a new benchmark designed to evaluate the deep reasoning capabilities of Audio-Language Models (ALMs) across massive multi-disciplinary tasks. MMAR comprises 1,000 meticulously curated audio-question-answer triplets, collected from real-world internet videos and refined through iterative error corrections and quality checks to ensure high quality. Unlike existing benchmarks that…

    Submitted 19 May, 2025; originally announced May 2025.

    Comments: Open-source at https://github.com/ddlBoJack/MMAR

  14. arXiv:2505.12875

    eess.IV

    3D Gaussian Adaptive Reconstruction for Fourier Light-Field Microscopy

    Authors: Chenyu Xu, Zhouyu Jin, Chengkang Shen, Hao Zhu, Zhan Ma, Bo Xiong, You Zhou, Xun Cao, Ning Gu

    Abstract: Compared to light-field microscopy (LFM), which enables high-speed volumetric imaging but suffers from non-uniform spatial sampling, Fourier light-field microscopy (FLFM) introduces sub-aperture division at the pupil plane, thereby ensuring spatially invariant sampling and enhancing spatial resolution. Conventional FLFM reconstruction methods, such as Richardson-Lucy (RL) deconvolution, exhibit po…

    Submitted 19 May, 2025; originally announced May 2025.

  15. arXiv:2505.09433

    cs.CV eess.IV

    Efficient LiDAR Reflectance Compression via Scanning Serialization

    Authors: Jiahao Zhu, Kang You, Dandan Ding, Zhan Ma

    Abstract: Reflectance attributes in LiDAR point clouds provide essential information for downstream tasks but remain underexplored in neural compression methods. To address this, we introduce SerLiC, a serialization-based neural compression framework to fully exploit the intrinsic characteristics of LiDAR reflectance. SerLiC first transforms 3D LiDAR point clouds into 1D sequences via scan-order serializati…

    Submitted 27 May, 2025; v1 submitted 14 May, 2025; originally announced May 2025.

  16. arXiv:2505.08281

    cs.CV eess.IV

    Ultra Lowrate Image Compression with Semantic Residual Coding and Compression-aware Diffusion

    Authors: Anle Ke, Xu Zhang, Tong Chen, Ming Lu, Chao Zhou, Jiawen Gu, Zhan Ma

    Abstract: Existing multimodal large model-based image compression frameworks often rely on a fragmented integration of semantic retrieval, latent compression, and generative models, resulting in suboptimal performance in both reconstruction fidelity and coding efficiency. To address these challenges, we propose a residual-guided ultra lowrate image compression named ResULIC, which incorporates residual sign…

    Submitted 13 May, 2025; originally announced May 2025.

    Journal ref: ICML 2025

  17. arXiv:2505.05521

    eess.SY

    Model-Based Closed-Loop Control Algorithm for Stochastic Partial Differential Equation Control

    Authors: Peiyan Hu, Haodong Feng, Yue Wang, Zhiming Ma

    Abstract: Neural operators have demonstrated promise in modeling and controlling systems governed by Partial Differential Equations (PDEs). Beyond PDEs, Stochastic Partial Differential Equations (SPDEs) play a critical role in modeling systems influenced by randomness, with applications in finance, physics, and beyond. However, controlling SPDE-governed systems remains a significant challenge. On the one ha…

    Submitted 15 May, 2025; v1 submitted 8 May, 2025; originally announced May 2025.

  18. arXiv:2505.03186

    cs.SD cs.CV eess.AS

    CoGenAV: Versatile Audio-Visual Representation Learning via Contrastive-Generative Synchronization

    Authors: Detao Bai, Zhiheng Ma, Xihan Wei, Liefeng Bo

    Abstract: The inherent synchronization between a speaker's lip movements, voice, and the underlying linguistic content offers a rich source of information for improving speech processing tasks, especially in challenging conditions where traditional audio-only systems falter. We introduce CoGenAV, a powerful and data-efficient model designed to learn versatile audio-visual representations applicable across a…

    Submitted 15 May, 2025; v1 submitted 6 May, 2025; originally announced May 2025.

  19. arXiv:2505.01768

    eess.IV cs.CV

    Continuous Filtered Backprojection by Learnable Interpolation Network

    Authors: Hui Lin, Dong Zeng, Qi Xie, Zerui Mao, Jianhua Ma, Deyu Meng

    Abstract: Accurate reconstruction of computed tomography (CT) images is crucial in the medical imaging field. However, there are unavoidable interpolation errors in the backprojection step of the conventional reconstruction methods, i.e., filtered-back-projection based methods, which are detrimental to the accurate reconstruction. In this study, to address this issue, we propose a novel deep learning model, nam…

    Submitted 3 May, 2025; originally announced May 2025.

    Comments: 14 pages, 10 figures

  20. arXiv:2504.20334

    eess.AS

    Towards Flow-Matching-based TTS without Classifier-Free Guidance

    Authors: Yuzhe Liang, Wenzhe Liu, Chunyu Qiang, Zhikang Niu, Yushen Chen, Ziyang Ma, Wenxi Chen, Nan Li, Chen Zhang, Xie Chen

    Abstract: Flow matching has demonstrated strong generative capabilities and has become a core component in modern Text-to-Speech (TTS) systems. To ensure high-quality speech synthesis, Classifier-Free Guidance (CFG) is widely used during the inference of flow-matching-based TTS models. However, CFG incurs substantial computational cost as it requires two forward passes, which hinders its applicability in re…

    Submitted 2 May, 2025; v1 submitted 28 April, 2025; originally announced April 2025.

  21. arXiv:2504.12867

    eess.AS cs.AI cs.CL

    EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting

    Authors: Guanrou Yang, Chen Yang, Qian Chen, Ziyang Ma, Wenxi Chen, Wen Wang, Tianrui Wang, Yifan Yang, Zhikang Niu, Wenrui Liu, Fan Yu, Zhihao Du, Zhifu Gao, ShiLiang Zhang, Xie Chen

    Abstract: Human speech goes beyond the mere transfer of information; it is a profound exchange of emotions and a connection between individuals. While Text-to-Speech (TTS) models have made huge progress, they still face challenges in controlling the emotional expression in the generated speech. In this work, we propose EmoVoice, a novel emotion-controllable TTS model that exploits large language models (LLM…

    Submitted 21 April, 2025; v1 submitted 17 April, 2025; originally announced April 2025.

  22. arXiv:2504.12711

    cs.CV cs.AI eess.IV

    NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images: Methods and Results

    Authors: Xin Li, Yeying Jin, Xin Jin, Zongwei Wu, Bingchen Li, Yufei Wang, Wenhan Yang, Yu Li, Zhibo Chen, Bihan Wen, Robby T. Tan, Radu Timofte, Qiyu Rong, Hongyuan Jing, Mengmeng Zhang, Jinglong Li, Xiangyu Lu, Yi Ren, Yuting Liu, Meng Zhang, Xiang Chen, Qiyuan Guan, Jiangxin Dong, Jinshan Pan, Conglin Gou, et al. (112 additional authors not shown)

    Abstract: This paper reviews the NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images. This challenge received a wide range of impressive solutions, which are developed and evaluated using our collected real-world Raindrop Clarity dataset. Unlike existing deraining datasets, our Raindrop Clarity dataset is more diverse and challenging in degradation types and contents, which includ…

    Submitted 19 April, 2025; v1 submitted 17 April, 2025; originally announced April 2025.

    Comments: Challenge Report of CVPR NTIRE 2025; 26 pages; Methods from 32 teams

  23. arXiv:2504.07758

    cs.CV eess.IV

    PIDSR: Complementary Polarized Image Demosaicing and Super-Resolution

    Authors: Shuangfan Zhou, Chu Zhou, Youwei Lyu, Heng Guo, Zhanyu Ma, Boxin Shi, Imari Sato

    Abstract: Polarization cameras can capture multiple polarized images with different polarizer angles in a single shot, bringing convenience to polarization-based downstream tasks. However, their direct outputs are color-polarization filter array (CPFA) raw images, requiring demosaicing to reconstruct full-resolution, full-color polarized images; unfortunately, this necessary step introduces artifacts that m…

    Submitted 22 April, 2025; v1 submitted 10 April, 2025; originally announced April 2025.

    Comments: CVPR 2025

  24. arXiv:2504.00493

    eess.SY

    Perturbation-Based Pinning Control Strategy for Enhanced Synchronization in Complex Networks

    Authors: Ziang Mao, Tianlong Fan, Linyuan Lü

    Abstract: Synchronization is essential for the stability and coordinated operation of complex networked systems. Pinning control, which selectively controls a subset of nodes, provides a scalable solution to enhance network synchronizability. However, existing strategies face key limitations: heuristic centrality-based methods lack a direct connection to synchronization dynamics, while spectral approaches,…

    Submitted 10 April, 2025; v1 submitted 1 April, 2025; originally announced April 2025.

    Comments: This work has been submitted to the IEEE for possible publication

  25. arXiv:2503.22202

    eess.SP

    mmHRR: Monitoring Heart Rate Recovery with Millimeter Wave Radar

    Authors: Ziheng Mao, Yuan He, Jia Zhang, Yimiao Sun, Yadong Xie, Xiuzhen Guo

    Abstract: Heart rate recovery (HRR) within the initial minute following exercise is a widely utilized metric for assessing cardiac autonomic function in individuals and predicting mortality risk in patients with cardiovascular disease. However, prevailing solutions for HRR monitoring typically involve the use of specialized medical equipment or contact wearable sensors, resulting in high costs and poor user…

    Submitted 28 March, 2025; originally announced March 2025.

  26. arXiv:2503.12382

    cs.CV eess.IV

    RENO: Real-Time Neural Compression for 3D LiDAR Point Clouds

    Authors: Kang You, Tong Chen, Dandan Ding, M. Salman Asif, Zhan Ma

    Abstract: Despite the substantial advancements demonstrated by learning-based neural models in the LiDAR Point Cloud Compression (LPCC) task, realizing real-time compression - an indispensable criterion for numerous industrial applications - remains a formidable challenge. This paper proposes RENO, the first real-time neural codec for 3D LiDAR point clouds, achieving superior performance with a lightweight…

    Submitted 16 March, 2025; originally announced March 2025.

  27. arXiv:2503.11190

    cs.SD cs.AI cs.CL cs.MM eess.AS

    Cross-Modal Learning for Music-to-Music-Video Description Generation

    Authors: Zhuoyuan Mao, Mengjie Zhao, Qiyu Wu, Zhi Zhong, Wei-Hsiang Liao, Hiromi Wakaki, Yuki Mitsufuji

    Abstract: Music-to-music-video generation is a challenging task due to the intrinsic differences between the music and video modalities. The advent of powerful text-to-video diffusion models has opened a promising pathway for music-video (MV) generation by first addressing the music-to-MV description task and subsequently leveraging these models for video generation. In this study, we focus on the MV descri…

    Submitted 14 March, 2025; originally announced March 2025.

    Comments: Accepted by RepL4NLP 2025 @ NAACL 2025

  28. arXiv:2503.11075

    eess.SY

    Inverter Control with Time-Varying and Nonconvex State and Input Constraints

    Authors: Zixiao Ma, Baosen Zhang

    Abstract: The growing integration of inverter-based resources (IBRs) into modern power systems poses significant challenges for maintaining reliable operation under dynamic and constrained conditions. This paper focuses on the power tracking problem for grid-connected IBRs, addressing the complexities introduced by voltage and power factor constraints. Voltage constraints, being time-varying and nonlinear i…

    Submitted 14 March, 2025; originally announced March 2025.

  29. arXiv:2503.10047

    eess.IV cs.CV

    Dual-domain Modulation Network for Lightweight Image Super-Resolution

    Authors: Wenjie Li, Heng Guo, Yuefeng Hou, Guangwei Gao, Zhanyu Ma

    Abstract: Lightweight image super-resolution (SR) aims to reconstruct high-resolution images from low-resolution images with limited computational costs. We find existing frequency-based SR methods cannot balance the reconstruction of overall structures and high-frequency parts. Meanwhile, these methods are inefficient for handling frequency features and unsuitable for lightweight SR. In this paper, we show…

    Submitted 13 March, 2025; originally announced March 2025.

  30. arXiv:2503.08638

    eess.AS cs.AI cs.MM cs.SD

    YuE: Scaling Open Foundation Models for Long-Form Music Generation

    Authors: Ruibin Yuan, Hanfeng Lin, Shuyue Guo, Ge Zhang, Jiahao Pan, Yongyi Zang, Haohe Liu, Yiming Liang, Wenye Ma, Xingjian Du, Xinrun Du, Zhen Ye, Tianyu Zheng, Yinghao Ma, Minghao Liu, Zeyue Tian, Ziya Zhou, Liumeng Xue, Xingwei Qu, Yizhi Li, Shangda Wu, Tianhao Shen, Ziyang Ma, Jun Zhan, Chunhui Wang, et al. (32 additional authors not shown)

    Abstract: We tackle the task of long-form music generation--particularly the challenging lyrics-to-song problem--by introducing YuE, a family of open foundation models based on the LLaMA2 architecture. Specifically, YuE scales to trillions of tokens and generates up to five minutes of music while maintaining lyrical alignment, coherent musical structure, and engaging vocal melodies with appropriate…

    Submitted 11 March, 2025; originally announced March 2025.

    Comments: https://github.com/multimodal-art-projection/YuE

  31. arXiv:2503.01710

    cs.SD cs.AI eess.AS

    Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens

    Authors: Xinsheng Wang, Mingqi Jiang, Ziyang Ma, Ziyu Zhang, Songxiang Liu, Linqin Li, Zheng Liang, Qixi Zheng, Rui Wang, Xiaoqin Feng, Weizhen Bian, Zhen Ye, Sitong Cheng, Ruibin Yuan, Zhixian Zhao, Xinfa Zhu, Jiahao Pan, Liumeng Xue, Pengcheng Zhu, Yunlin Chen, Zhifei Li, Xie Chen, Lei Xie, Yike Guo, Wei Xue

    Abstract: Recent advancements in large language models (LLMs) have driven significant progress in zero-shot text-to-speech (TTS) synthesis. However, existing foundation models rely on multi-stage processing or complex architectures for predicting multiple codebooks, limiting efficiency and integration flexibility. To overcome these challenges, we introduce Spark-TTS, a novel system powered by BiCodec, a sin…

    Submitted 3 March, 2025; originally announced March 2025.

    Comments: Submitted to ACL 2025

  32. arXiv:2503.01116

    eess.SP cs.LG

    Large AI Model for Delay-Doppler Domain Channel Prediction in 6G OTFS-Based Vehicular Networks

    Authors: Jianzhe Xue, Dongcheng Yuan, Zhanxi Ma, Tiankai Jiang, Yu Sun, Haibo Zhou, Xuemin Shen

    Abstract: Channel prediction is crucial for high-mobility vehicular networks, as it enables the anticipation of future channel conditions and the proactive adjustment of communication strategies. However, achieving accurate vehicular channel prediction is challenging due to significant Doppler effects and rapid channel variations resulting from high-speed vehicle movement and complex propagation environment…

    Submitted 8 May, 2025; v1 submitted 2 March, 2025; originally announced March 2025.

    Comments: This manuscript has been accepted by SCIENCE CHINA Information Sciences

  33. arXiv:2502.17810

    cs.CL eess.AS

    URO-Bench: A Comprehensive Benchmark for End-to-End Spoken Dialogue Models

    Authors: Ruiqi Yan, Xiquan Li, Wenxi Chen, Zhikang Niu, Chen Yang, Ziyang Ma, Kai Yu, Xie Chen

    Abstract: In recent years, with advances in large language models (LLMs), end-to-end spoken dialogue models (SDMs) have made significant strides. Compared to text-based LLMs, the evaluation of SDMs needs to take speech-related aspects into account, such as paralinguistic information and speech quality. However, there is still a lack of comprehensive evaluations for SDMs in speech-to-speech (S2S) scenarios.…

    Submitted 1 March, 2025; v1 submitted 24 February, 2025; originally announced February 2025.

  34. arXiv:2502.12623

    cs.SD cs.AI cs.CL cs.MM eess.AS

    DeepResonance: Enhancing Multimodal Music Understanding via Music-centric Multi-way Instruction Tuning

    Authors: Zhuoyuan Mao, Mengjie Zhao, Qiyu Wu, Hiromi Wakaki, Yuki Mitsufuji

    Abstract: Recent advancements in music large language models (LLMs) have significantly improved music understanding tasks, which involve the model's ability to analyze and interpret various musical elements. These improvements primarily focused on integrating both music and text inputs. However, the potential of incorporating additional modalities such as images, videos and textual music features to enhance…

    Submitted 20 May, 2025; v1 submitted 18 February, 2025; originally announced February 2025.

  35. arXiv:2502.11729

    eess.IV

    On Quantizing Neural Representation for Variable-Rate Video Coding

    Authors: Junqi Shi, Zhujia Chen, Hanfei Li, Qi Zhao, Ming Lu, Tong Chen, Zhan Ma

    Abstract: This work introduces NeuroQuant, a novel post-training quantization (PTQ) approach tailored to non-generalized Implicit Neural Representations for variable-rate Video Coding (INR-VC). Unlike existing methods that require extensive weight retraining for each target bitrate, we hypothesize that variable-rate coding can be achieved by adjusting quantization parameters (QPs) of pre-trained weights. Ou…

    Submitted 17 February, 2025; originally announced February 2025.

    Comments: To be published in ICLR'25

  36. arXiv:2501.15368

    cs.CL cs.SD eess.AS

    Baichuan-Omni-1.5 Technical Report

    Authors: Yadong Li, Jun Liu, Tao Zhang, Tao Zhang, Song Chen, Tianpeng Li, Zehuan Li, Lijun Liu, Lingfeng Ming, Guosheng Dong, Da Pan, Chong Li, Yuanbo Fang, Dongdong Kuang, Mingrui Wang, Chenglin Zhu, Youwei Zhang, Hongyu Guo, Fengyu Zhang, Yuran Wang, Bowen Ding, Wei Song, Xu Li, Yuqi Huo, Zheng Liang, et al. (68 additional authors not shown)

    Abstract: We introduce Baichuan-Omni-1.5, an omni-modal model that not only has omni-modal understanding capabilities but also provides end-to-end audio generation capabilities. To achieve fluent and high-quality interaction across modalities without compromising the capabilities of any modality, we prioritized optimizing three key aspects. First, we establish a comprehensive data cleaning and synthesis pip…

    Submitted 25 January, 2025; originally announced January 2025.

  37. arXiv:2501.15091

    cs.IT eess.SP

    Deep Reinforcement Learning for Energy Efficiency Maximization in RSMA-IRS-Assisted ISAC System

    Authors: Zhangfeng Ma, Ruichen Zhang, Bo Ai, Zhuxian Lian, Linzhou Zeng, Dusit Niyato

    Abstract: This paper proposes a three-dimensional (3D) geometry-based channel model to accurately represent intelligent reflecting surfaces (IRS)-enhanced integrated sensing and communication (ISAC) networks using rate-splitting multiple access (RSMA) in practical urban environments. Based on this model, we formulate an energy efficiency (EE) maximization problem that incorporates transceiver beamforming co…

    Submitted 25 January, 2025; originally announced January 2025.

    Comments: 5 pages, 4 figures

  38. arXiv:2501.11263

    cs.CV eess.IV

    Towards Loss-Resilient Image Coding for Unstable Satellite Networks

    Authors: Hongwei Sha, Muchen Dong, Quanyou Luo, Ming Lu, Hao Chen, Zhan Ma

    Abstract: Geostationary Earth Orbit (GEO) satellite communication demonstrates significant advantages in emergency short burst data services. However, unstable satellite networks, particularly those with frequent packet loss, present a severe challenge to accurate image transmission. To address it, we propose a loss-resilient image coding approach that leverages end-to-end optimization in learned image comp…

    Submitted 19 January, 2025; originally announced January 2025.

    Comments: Accepted as a poster presentation at AAAI 2025

  39. arXiv:2501.07246

    cs.SD cs.CL cs.MM eess.AS

    Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model

    Authors: Ziyang Ma, Zhuo Chen, Yuping Wang, Eng Siong Chng, Xie Chen

    Abstract: Large Audio-Language Models (LALMs) have demonstrated remarkable performance in tasks involving audio perception and understanding, such as speech recognition and audio captioning. However, their reasoning capabilities - critical for solving complex real-world problems - remain underexplored. In this work, we conduct the first exploration into integrating Chain-of-Thought (CoT) reasoning into LALM…

    Submitted 13 January, 2025; originally announced January 2025.

  40. arXiv:2501.03592  [pdf, other]

    eess.IV cs.CV physics.optics

    A Value Mapping Virtual Staining Framework for Large-scale Histological Imaging

    Authors: Junjia Wang, Bo Xiong, You Zhou, Xun Cao, Zhan Ma

    Abstract: The emergence of virtual staining technology provides a rapid and efficient alternative for researchers in tissue pathology. It enables the utilization of unlabeled microscopic samples to generate virtual replicas of chemically stained histological slices, or facilitate the transformation of one staining type into another. The remarkable performance of generative networks, such as CycleGAN, offers…

    Submitted 7 January, 2025; originally announced January 2025.

  41. arXiv:2501.01108  [pdf, other]

    cs.SD cs.AI cs.CL cs.LG eess.AS

    MuQ: Self-Supervised Music Representation Learning with Mel Residual Vector Quantization

    Authors: Haina Zhu, Yizhi Zhou, Hangting Chen, Jianwei Yu, Ziyang Ma, Rongzhi Gu, Yi Luo, Wei Tan, Xie Chen

    Abstract: Recent years have witnessed the success of foundation models pre-trained with self-supervised learning (SSL) in various music informatics understanding tasks, including music tagging, instrument classification, key detection, and more. In this paper, we propose a self-supervised music representation learning model for music understanding. Distinguished from previous studies adopting random project…

    Submitted 3 January, 2025; v1 submitted 2 January, 2025; originally announced January 2025.

  42. arXiv:2501.00009  [pdf, other]

    eess.SP cs.AI

    Model-Driven Deep Neural Network for Enhanced AoA Estimation Using 5G gNB

    Authors: Shengheng Liu, Xingkang Li, Zihuan Mao, Peng Liu, Yongming Huang

    Abstract: High-accuracy positioning has become a fundamental enabler for intelligent connected devices. Nevertheless, the present wireless networks still rely on model-driven approaches to achieve positioning functionality, which are susceptible to performance degradation in practical scenarios, primarily due to hardware impairments. Integrating artificial intelligence into the positioning framework present…

    Submitted 9 December, 2024; originally announced January 2025.

    Comments: Presented at AAAI 2024 (Main Technical Track)

  43. arXiv:2412.20675  [pdf]

    cs.RO eess.SP physics.ins-det

    Improved ICNN-LSTM Model Classification Based on Attitude Sensor Data for Hazardous State Assessment of Magnetic Adhesion Climbing Wall Robots

    Authors: Zhen Ma, He Xu, Jielong Dou, Yi Qin, Xueyu Zhang

    Abstract: Magnetic adhesion tracked climbing robots are widely utilized in high-altitude inspection, welding, and cleaning tasks due to their ability to perform various operations against gravity on vertical or inclined walls. However, during operation, the robot may experience overturning torque caused by its own weight and load, which can lead to the detachment of magnetic plates and subsequently pose saf…

    Submitted 29 December, 2024; originally announced December 2024.

    Comments: 20 pages, 8 figures, manuscript for Journal of Autonomous Robots

    MSC Class: 68T05; 68T07; 68T40 ACM Class: I.2.6; I.2.7; K.6.7

  44. arXiv:2412.19841  [pdf]

    cs.CV eess.IV

    FlameGS: Reconstruct flame light field via Gaussian Splatting

    Authors: Yunhao Shui, Fuhao Zhang, Can Gao, Hao Xue, Zhiyin Ma, Gang Xun, Xuesong Li

    Abstract: To address the time-consuming and computationally intensive issues of traditional ART algorithms for flame combustion diagnosis, inspired by flame simulation technology, we propose a novel representation method for flames. By modeling the luminous process of flames and utilizing 2D projection images for supervision, our experimental validation shows that this model achieves an average structural s…

    Submitted 24 December, 2024; originally announced December 2024.

  45. arXiv:2412.16619  [pdf, ps, other]

    cs.CV cs.LG eess.IV math.AT math.GT

    Topology-Aware 3D Gaussian Splatting: Leveraging Persistent Homology for Optimized Structural Integrity

    Authors: Tianqi Shen, Shaohua Liu, Jiaqi Feng, Ziye Ma, Ning An

    Abstract: Gaussian Splatting (GS) has emerged as a crucial technique for representing discrete volumetric radiance fields. It leverages unique parametrization to mitigate computational demands in scene optimization. This work introduces Topology-Aware 3D Gaussian Splatting (Topology-GS), which addresses two key limitations in current approaches: compromised pixel-level structural integrity due to incomplete…

    Submitted 14 June, 2025; v1 submitted 21 December, 2024; originally announced December 2024.

    Comments: 18 pages, 12 figures, includes appendix. Accepted as oral presentation at AAAI 2025 (Conference on Artificial Intelligence). Official conference version: 10 pages, 6 figures. ISBN (Print): 978-1-57735-897-8. Conference website: https://aaai.org/conference/aaai/aaai-25/

    MSC Class: 55N31; 68T45 ACM Class: I.2.10; I.3.7; I.4.5

  46. arXiv:2412.16102  [pdf, other]

    eess.AS

    Interleaved Speech-Text Language Models are Simple Streaming Text to Speech Synthesizers

    Authors: Yifan Yang, Ziyang Ma, Shujie Liu, Jinyu Li, Hui Wang, Lingwei Meng, Haiyang Sun, Yuzhe Liang, Ruiyang Xu, Yuxuan Hu, Yan Lu, Rui Zhao, Xie Chen

    Abstract: This paper introduces Interleaved Speech-Text Language Model (IST-LM) for streaming zero-shot Text-to-Speech (TTS). Unlike many previous approaches, IST-LM is directly trained on interleaved sequences of text and speech tokens with a fixed ratio, eliminating the need for additional efforts in duration prediction and grapheme-to-phoneme alignment. The ratio of text chunk size to speech chunk size i…

    Submitted 23 December, 2024; v1 submitted 20 December, 2024; originally announced December 2024.

    Comments: Submitted to ICME 2025

  47. arXiv:2412.15649  [pdf, other]

    eess.AS

    SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training

    Authors: Wenxi Chen, Ziyang Ma, Ruiqi Yan, Yuzhe Liang, Xiquan Li, Ruiyang Xu, Zhikang Niu, Yanqiao Zhu, Yifan Yang, Zhanxun Liu, Kai Yu, Yuxuan Hu, Jinyu Li, Yan Lu, Shujie Liu, Xie Chen

    Abstract: Recent advancements highlight the potential of end-to-end real-time spoken dialogue systems, showcasing their low latency and high quality. In this paper, we introduce SLAM-Omni, a timbre-controllable, end-to-end voice interaction system with single-stage training. SLAM-Omni achieves zero-shot timbre control by modeling spoken language with semantic tokens and decoupling speaker information to a v…

    Submitted 20 December, 2024; originally announced December 2024.

  48. arXiv:2412.10644  [pdf, other]

    eess.SP cs.AI

    Model-driven deep neural network for enhanced direction finding with commodity 5G gNodeB

    Authors: Shengheng Liu, Zihuan Mao, Xingkang Li, Mengguan Pan, Peng Liu, Yongming Huang, Xiaohu You

    Abstract: Pervasive and high-accuracy positioning has become increasingly important as a fundamental enabler for intelligent connected devices in mobile networks. Nevertheless, current wireless networks heavily rely on pure model-driven techniques to achieve positioning functionality, often succumbing to performance deterioration due to hardware impairments in practical scenarios. Here we reformulate the di…

    Submitted 13 December, 2024; originally announced December 2024.

    Comments: To appear in ACM TOSN. A preliminary version of this article was presented at the AAAI'2024 Main Technical Track

  49. arXiv:2412.00721   

    cs.AI cs.CL cs.SD eess.AS

    A Comparative Study of LLM-based ASR and Whisper in Low Resource and Code Switching Scenario

    Authors: Zheshu Song, Ziyang Ma, Yifan Yang, Jianheng Zhuo, Xie Chen

    Abstract: Large Language Models (LLMs) have showcased exceptional performance across diverse NLP tasks, and their integration with speech encoder is rapidly emerging as a dominant trend in the Automatic Speech Recognition (ASR) field. Previous works mainly concentrated on leveraging LLMs for speech recognition in English and Chinese. However, their potential for addressing speech recognition challenges in l…

    Submitted 4 December, 2024; v1 submitted 1 December, 2024; originally announced December 2024.

    Comments: This work hasn't been finished yet

  50. arXiv:2411.17100  [pdf, other]

    eess.AS

    k2SSL: A Faster and Better Framework for Self-Supervised Speech Representation Learning

    Authors: Yifan Yang, Jianheng Zhuo, Zengrui Jin, Ziyang Ma, Xiaoyu Yang, Zengwei Yao, Liyong Guo, Wei Kang, Fangjun Kuang, Long Lin, Daniel Povey, Xie Chen

    Abstract: Self-supervised learning (SSL) has achieved great success in speech-related tasks. While Transformer and Conformer architectures have dominated SSL backbones, encoders like Zipformer, which excel in automatic speech recognition (ASR), remain unexplored in SSL. Concurrently, inefficiencies in data processing within existing SSL training frameworks, such as fairseq, pose challenges in managing the g…

    Submitted 22 March, 2025; v1 submitted 25 November, 2024; originally announced November 2024.

    Comments: Accepted in ICME 2025