Search | arXiv e-print repository

Perception-Oriented Latent Coding for High-Performance Compressed Domain Semantic Inference

Authors: Xu Zhang, Ming Lu, Yan Chen, Zhan Ma

Abstract: In recent years, compressed domain semantic inference has primarily relied on learned image coding models optimized for mean squared error (MSE). However, MSE-oriented optimization tends to yield latent spaces with limited semantic richness, which hinders effective semantic inference in downstream tasks. Moreover, achieving high performance with these models often requires fine-tuning the entire v… ▽ More In recent years, compressed domain semantic inference has primarily relied on learned image coding models optimized for mean squared error (MSE). However, MSE-oriented optimization tends to yield latent spaces with limited semantic richness, which hinders effective semantic inference in downstream tasks. Moreover, achieving high performance with these models often requires fine-tuning the entire vision model, which is computationally intensive, especially for large models. To address these problems, we introduce Perception-Oriented Latent Coding (POLC), an approach that enriches the semantic content of latent features for high-performance compressed domain semantic inference. With the semantically rich latent space, POLC requires only a plug-and-play adapter for fine-tuning, significantly reducing the parameter count compared to previous MSE-oriented methods. Experimental results demonstrate that POLC achieves rate-perception performance comparable to state-of-the-art generative image coding methods while markedly enhancing performance in vision tasks, with minimal fine-tuning overhead. Code is available at https://github.com/NJUVISION/POLC. △ Less

Submitted 2 July, 2025; originally announced July 2025.

Comments: International Conference on Multimedia and Expo (ICME), 2025

arXiv:2506.23484 [pdf, ps, other]

TAG-WM: Tamper-Aware Generative Image Watermarking via Diffusion Inversion Sensitivity

Authors: Yuzhuo Chen, Zehua Ma, Han Fang, Weiming Zhang, Nenghai Yu

Abstract: AI-generated content (AIGC) enables efficient visual creation but raises copyright and authenticity risks. As a common technique for integrity verification and source tracing, digital image watermarking is regarded as a potential solution to above issues. Among these, watermarking methods capable of preserving the generation quality are receiving increased attention. However, the proliferation and… ▽ More AI-generated content (AIGC) enables efficient visual creation but raises copyright and authenticity risks. As a common technique for integrity verification and source tracing, digital image watermarking is regarded as a potential solution to above issues. Among these, watermarking methods capable of preserving the generation quality are receiving increased attention. However, the proliferation and high performance of generative image editing applications have elevated the risks of malicious tampering, creating new demands. 1) The tamper robustness of current lossless visual quality watermarks remains constrained by the modification-sensitive diffusion inversion process, necessitating enhanced robustness. 2) The improved tampering quality and rapid iteration cycles render passive tampering detection methods inadequate, making proactive tampering localization capability a desired feature for watermarks. To address these requirements, this paper proposes a Tamper-Aware Generative image WaterMarking method named TAG-WM. The proposed method comprises four key modules: a dual-mark joint sampling (DMJS) algorithm for embedding copyright and localization watermarks into the latent space while preserving generative quality, the watermark latent reconstruction (WLR) utilizing reversed DMJS, a dense variation region detector (DVRD) leveraging diffusion inversion sensitivity to identify tampered areas via statistical deviation analysis, and the tamper-aware decoding (TAD) guided by localization results. The experimental results indicate that TAG-WM achieves SOTA tampering robustness and tampering localization capability with distortions while maintaining lossless generation quality and a considerable capacity of 256 bits. △ Less

Submitted 29 June, 2025; originally announced June 2025.

Comments: Accepted by ICCV 2025 (2025 IEEE/CVF International Conference on Computer Vision)

ACM Class: I.3.3; I.4.9

arXiv:2506.20400 [pdf]

A Visualization Framework for Exploring Multi-Agent-Based Simulations Case Study of an Electric Vehicle Home Charging Ecosystem

Authors: Kristoffer Christensen, Bo Nørregaard Jørgensen, Zheng Grace Ma

Abstract: Multi-agent-based simulations (MABS) of electric vehicle (EV) home charging ecosystems generate large, complex, and stochastic time-series datasets that capture interactions between households, grid infrastructure, and energy markets. These interactions can lead to unexpected system-level events, such as transformer overloads or consumer dissatisfaction, that are difficult to detect and explain th… ▽ More Multi-agent-based simulations (MABS) of electric vehicle (EV) home charging ecosystems generate large, complex, and stochastic time-series datasets that capture interactions between households, grid infrastructure, and energy markets. These interactions can lead to unexpected system-level events, such as transformer overloads or consumer dissatisfaction, that are difficult to detect and explain through static post-processing. This paper presents a modular, Python-based dashboard framework, built using Dash by Plotly, that enables efficient, multi-level exploration and root-cause analysis of emergent behavior in MABS outputs. The system features three coordinated views (System Overview, System Analysis, and Consumer Analysis), each offering high-resolution visualizations such as time-series plots, spatial heatmaps, and agent-specific drill-down tools. A case study simulating full EV adoption with smart charging in a Danish residential network demonstrates how the dashboard supports rapid identification and contextual explanation of anomalies, including clustered transformer overloads and time-dependent charging failures. The framework facilitates actionable insight generation for researchers and distribution system operators, and its architecture is adaptable to other distributed energy resources and complex energy systems. △ Less

Submitted 25 June, 2025; originally announced June 2025.

arXiv:2506.16957 [pdf, ps, other]

Wi-Fi Sensing Tool Release: Gathering 802.11ax Channel State Information from a Commercial Wi-Fi Access Point

Authors: Zisheng Wang, Feng Li, Hangbin Zhao, Zihuan Mao, Yaodong Zhang, Qisheng Huang, Bo Cao, Mingming Cao, Baolin He, Qilin Hou

Abstract: Wi-Fi sensing has emerged as a powerful technology, leveraging channel state information (CSI) extracted from wireless data packets to enable diverse applications, ranging from human presence detection to gesture recognition and health monitoring. However, CSI extraction from commercial Wi-Fi access point lacks and out of date. This paper introduces ZTECSITool,a toolkit designed to capture high-re… ▽ More Wi-Fi sensing has emerged as a powerful technology, leveraging channel state information (CSI) extracted from wireless data packets to enable diverse applications, ranging from human presence detection to gesture recognition and health monitoring. However, CSI extraction from commercial Wi-Fi access point lacks and out of date. This paper introduces ZTECSITool,a toolkit designed to capture high-resolution CSI measurements from commercial Wi-Fi 6 (802.11ax) access points, supporting bandwidths up to 160 MHz and 512 subcarriers. ZTECSITool bridges a critical gap in Wi-Fi sensing research, facilitating the development of next-generation sensing systems. The toolkit includes customized firmware and open-source software tools for configuring, collecting, and parsing CSI data, offering researchers a robust platform for advanced sensing applications. We detail the command protocols for CSI extraction, including band selection,STA filtering, and report configuration, and provide insights into the data structure of the reported CSI. Additionally, we present a Python-based graphical interface for real-time CSI visualization and analysis △ Less

Submitted 20 June, 2025; originally announced June 2025.

Comments: This work has been submitted to the IEEE for possible publication

arXiv:2506.13339 [pdf, ps, other]

NTU Speechlab LLM-Based Multilingual ASR System for Interspeech MLC-SLM Challenge 2025

Authors: Yizhou Peng, Bin Wang, Yi-Wen Chao, Ziyang Ma, Haoyang Zhang, Hexin Liu, Xie Chen, Eng Siong Chng

Abstract: This report details the NTU Speechlab system developed for the Interspeech 2025 Multilingual Conversational Speech and Language Model (MLC-SLM) Challenge (Task I), where we achieved 5th place. We present comprehensive analyses of our multilingual automatic speech recognition system, highlighting key advancements in model architecture, data selection, and training strategies. In particular, languag… ▽ More This report details the NTU Speechlab system developed for the Interspeech 2025 Multilingual Conversational Speech and Language Model (MLC-SLM) Challenge (Task I), where we achieved 5th place. We present comprehensive analyses of our multilingual automatic speech recognition system, highlighting key advancements in model architecture, data selection, and training strategies. In particular, language-specific prompts and model averaging techniques were instrumental in boosting system performance across diverse languages. Compared to the initial baseline system, our final model reduced the average Mix Error Rate from 20.2% to 10.6%, representing an absolute improvement of 9.6% (a relative improvement of 48%) on the evaluation set. Our results demonstrate the effectiveness of our approach and offer practical insights for future Speech Large Language Models. △ Less

Submitted 4 July, 2025; v1 submitted 16 June, 2025; originally announced June 2025.

Comments: Accepted by Interspeech 2025 MLC-SLM challenge (5th place). System report

arXiv:2506.09344 [pdf, ps, other]

Ming-Omni: A Unified Multimodal Model for Perception and Generation

Authors: Inclusion AI, Biao Gong, Cheng Zou, Chuanyang Zheng, Chunluan Zhou, Canxiang Yan, Chunxiang Jin, Chunjie Shen, Dandan Zheng, Fudong Wang, Furong Xu, GuangMing Yao, Jun Zhou, Jingdong Chen, Jianxin Sun, Jiajia Liu, Jianjiang Zhu, Jun Peng, Kaixiang Ji, Kaiyou Song, Kaimeng Ren, Libin Wang, Lixiang Ru, Lele Xie, Longhua Tan , et al. (33 additional authors not shown)

Abstract: We propose Ming-Omni, a unified multimodal model capable of processing images, text, audio, and video, while demonstrating strong proficiency in both speech and image generation. Ming-Omni employs dedicated encoders to extract tokens from different modalities, which are then processed by Ling, an MoE architecture equipped with newly proposed modality-specific routers. This design enables a single… ▽ More We propose Ming-Omni, a unified multimodal model capable of processing images, text, audio, and video, while demonstrating strong proficiency in both speech and image generation. Ming-Omni employs dedicated encoders to extract tokens from different modalities, which are then processed by Ling, an MoE architecture equipped with newly proposed modality-specific routers. This design enables a single model to efficiently process and fuse multimodal inputs within a unified framework, thereby facilitating diverse tasks without requiring separate models, task-specific fine-tuning, or structural redesign. Importantly, Ming-Omni extends beyond conventional multimodal models by supporting audio and image generation. This is achieved through the integration of an advanced audio decoder for natural-sounding speech and Ming-Lite-Uni for high-quality image generation, which also allow the model to engage in context-aware chatting, perform text-to-speech conversion, and conduct versatile image editing. Our experimental results showcase Ming-Omni offers a powerful solution for unified perception and generation across all modalities. Notably, our proposed Ming-Omni is the first open-source model we are aware of to match GPT-4o in modality support, and we release all code and model weights to encourage further research and development in the community. △ Less

Submitted 10 June, 2025; originally announced June 2025.

Comments: 18 pages,8 figures

arXiv:2506.00385 [pdf, ps, other]

MagiCodec: Simple Masked Gaussian-Injected Codec for High-Fidelity Reconstruction and Generation

Authors: Yakun Song, Jiawei Chen, Xiaobin Zhuang, Chenpeng Du, Ziyang Ma, Jian Wu, Jian Cong, Dongya Jia, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xie Chen

Abstract: Neural audio codecs have made significant strides in efficiently mapping raw audio waveforms into discrete token representations, which are foundational for contemporary audio generative models. However, most existing codecs are optimized primarily for reconstruction quality, often at the expense of the downstream modelability of the encoded tokens. Motivated by the need to overcome this bottlenec… ▽ More Neural audio codecs have made significant strides in efficiently mapping raw audio waveforms into discrete token representations, which are foundational for contemporary audio generative models. However, most existing codecs are optimized primarily for reconstruction quality, often at the expense of the downstream modelability of the encoded tokens. Motivated by the need to overcome this bottleneck, we introduce $\textbf{MagiCodec}$, a novel single-layer, streaming Transformer-based audio codec. MagiCodec is designed with a multistage training pipeline that incorporates Gaussian noise injection and latent regularization, explicitly targeting the enhancement of semantic expressiveness in the generated codes while preserving high reconstruction fidelity. We analytically derive the effect of noise injection in the frequency domain, demonstrating its efficacy in attenuating high-frequency components and fostering robust tokenization. Extensive experimental evaluations show that MagiCodec surpasses state-of-the-art codecs in both reconstruction quality and downstream tasks. Notably, the tokens produced by MagiCodec exhibit Zipf-like distributions, as observed in natural languages, thereby improving compatibility with language-model-based generative architectures. The code and pre-trained models are available at https://github.com/Ereboas/MagiCodec. △ Less

Submitted 31 May, 2025; originally announced June 2025.

Comments: 18 pages, 3 figures. The code and pre-trained models are available at https://github.com/Ereboas/MagiCodec

arXiv:2505.24583 [pdf, ps, other]

Cognitive-Radio Functionality: A Novel Configuration for STAR-RIS assisted RSMA Networks

Authors: Saeed Ibrahim, Yue Xiao, Dimitrios Tyrovolas, Sotiris A. Tegos, Panagiotis D. Diamantoulakis, Zheng Ma, George K. Karagiannidis, Pinghzi Fan

Abstract: Cognitive radio rate-splitting multiple access (CR-RSMA) has emerged as a promising multiple access framework that can efficiently manage interference and adapt dynamically to heterogeneous quality-of-service (QoS) requirements. To effectively support such demanding access schemes, programmable wireless environments have attracted considerable attention, especially through simultaneously transmitt… ▽ More Cognitive radio rate-splitting multiple access (CR-RSMA) has emerged as a promising multiple access framework that can efficiently manage interference and adapt dynamically to heterogeneous quality-of-service (QoS) requirements. To effectively support such demanding access schemes, programmable wireless environments have attracted considerable attention, especially through simultaneously transmitting and reflecting reconfigurable intelligent surfaces (STAR-RISs), which can enable full-space control of signal propagation in asymmetric user deployments. In this paper, we propose the cognitive radio (CR) functionality for STAR-RIS-assisted CR-RSMA systems, leveraging the unique capability of the STAR-RIS to combine element and power splitting for adaptive control of transmission and reflection in CR scenarios. Specifically, the proposed CR functionality partitions the STAR-RIS into two regions independently controlling the transmission and reflection of signals, simultaneously ensuring the required QoS for the primary user and enhancing the performance of the secondary user. To accurately characterize the system performance, we derive analytical expressions for the ergodic rate of the secondary user and the outage rate of the primary user under Nakagami-m fading. Finally, simulation results show that the proposed approach effectively manages interference, guarantees the QoS of the primary user, and significantly improves the throughput of the secondary user, highlighting STAR-RIS as an efficient solution for CR-RSMA-based services. △ Less

Submitted 30 May, 2025; originally announced May 2025.

arXiv:2505.19931 [pdf, ps, other]

Accelerating Flow-Matching-Based Text-to-Speech via Empirically Pruned Step Sampling

Authors: Qixi Zheng, Yushen Chen, Zhikang Niu, Ziyang Ma, Xiaofei Wang, Kai Yu, Xie Chen

Abstract: Flow-matching-based text-to-speech (TTS) models, such as Voicebox, E2 TTS, and F5-TTS, have attracted significant attention in recent years. These models require multiple sampling steps to reconstruct speech from noise, making inference speed a key challenge. Reducing the number of sampling steps can greatly improve inference efficiency. To this end, we introduce Fast F5-TTS, a training-free appro… ▽ More Flow-matching-based text-to-speech (TTS) models, such as Voicebox, E2 TTS, and F5-TTS, have attracted significant attention in recent years. These models require multiple sampling steps to reconstruct speech from noise, making inference speed a key challenge. Reducing the number of sampling steps can greatly improve inference efficiency. To this end, we introduce Fast F5-TTS, a training-free approach to accelerate the inference of flow-matching-based TTS models. By inspecting the sampling trajectory of F5-TTS, we identify redundant steps and propose Empirically Pruned Step Sampling (EPSS), a non-uniform time-step sampling strategy that effectively reduces the number of sampling steps. Our approach achieves a 7-step generation with an inference RTF of 0.030 on an NVIDIA RTX 3090 GPU, making it 4 times faster than the original F5-TTS while maintaining comparable performance. Furthermore, EPSS performs well on E2 TTS models, demonstrating its strong generalization ability. △ Less

Submitted 4 June, 2025; v1 submitted 26 May, 2025; originally announced May 2025.

arXiv:2505.19294 [pdf, other]

Towards Reliable Large Audio Language Model

Authors: Ziyang Ma, Xiquan Li, Yakun Song, Wenxi Chen, Chenpeng Du, Jian Wu, Yuanzhe Chen, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xie Chen

Abstract: Recent advancements in large audio language models (LALMs) have demonstrated impressive results and promising prospects in universal understanding and reasoning across speech, music, and general sound. However, these models still lack the ability to recognize their knowledge boundaries and refuse to answer questions they don't know proactively. While there have been successful attempts to enhance… ▽ More Recent advancements in large audio language models (LALMs) have demonstrated impressive results and promising prospects in universal understanding and reasoning across speech, music, and general sound. However, these models still lack the ability to recognize their knowledge boundaries and refuse to answer questions they don't know proactively. While there have been successful attempts to enhance the reliability of LLMs, reliable LALMs remain largely unexplored. In this paper, we systematically investigate various approaches towards reliable LALMs, including training-free methods such as multi-modal chain-of-thought (MCoT), and training-based methods such as supervised fine-tuning (SFT). Besides, we identify the limitations of previous evaluation metrics and propose a new metric, the Reliability Gain Index (RGI), to assess the effectiveness of different reliable methods. Our findings suggest that both training-free and training-based methods enhance the reliability of LALMs to different extents. Moreover, we find that awareness of reliability is a "meta ability", which can be transferred across different audio modalities, although significant structural and content differences exist among sound, music, and speech. △ Less

Submitted 25 May, 2025; originally announced May 2025.

Comments: ACL 2025 Findings

arXiv:2505.16211 [pdf, ps, other]

AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models

Authors: Kai Li, Can Shen, Yile Liu, Jirui Han, Kelong Zheng, Xuechao Zou, Zhe Wang, Xingjian Du, Shun Zhang, Hanjun Luo, Yingbin Jin, Xinxin Xing, Ziyang Ma, Yue Liu, Xiaojun Jia, Yifan Zhang, Junfeng Fang, Kun Wang, Yibo Yan, Haoyang Li, Yiming Li, Xiaobin Zhuang, Yang Liu, Haibo Hu, Zhizheng Wu , et al. (6 additional authors not shown)

Abstract: The rapid advancement and expanding applications of Audio Large Language Models (ALLMs) demand a rigorous understanding of their trustworthiness. However, systematic research on evaluating these models, particularly concerning risks unique to the audio modality, remains largely unexplored. Existing evaluation frameworks primarily focus on the text modality or address only a restricted set of safet… ▽ More The rapid advancement and expanding applications of Audio Large Language Models (ALLMs) demand a rigorous understanding of their trustworthiness. However, systematic research on evaluating these models, particularly concerning risks unique to the audio modality, remains largely unexplored. Existing evaluation frameworks primarily focus on the text modality or address only a restricted set of safety dimensions, failing to adequately account for the unique characteristics and application scenarios inherent to the audio modality. We introduce AudioTrust-the first multifaceted trustworthiness evaluation framework and benchmark specifically designed for ALLMs. AudioTrust facilitates assessments across six key dimensions: fairness, hallucination, safety, privacy, robustness, and authentication. To comprehensively evaluate these dimensions, AudioTrust is structured around 18 distinct experimental setups. Its core is a meticulously constructed dataset of over 4,420 audio/text samples, drawn from real-world scenarios (e.g., daily conversations, emergency calls, voice assistant interactions), specifically designed to probe the multifaceted trustworthiness of ALLMs. For assessment, the benchmark carefully designs 9 audio-specific evaluation metrics, and we employ a large-scale automated pipeline for objective and scalable scoring of model outputs. Experimental results reveal the trustworthiness boundaries and limitations of current state-of-the-art open-source and closed-source ALLMs when confronted with various high-risk audio scenarios, offering valuable insights for the secure and trustworthy deployment of future audio models. Our platform and benchmark are available at https://github.com/JusperLee/AudioTrust. △ Less

Submitted 1 July, 2025; v1 submitted 22 May, 2025; originally announced May 2025.

Comments: Technical Report

arXiv:2505.13181 [pdf, other]

Efficient Speech Language Modeling via Energy Distance in Continuous Latent Space

Authors: Zhengrui Ma, Yang Feng, Chenze Shao, Fandong Meng, Jie Zhou, Min Zhang

Abstract: We introduce SLED, an alternative approach to speech language modeling by encoding speech waveforms into sequences of continuous latent representations and modeling them autoregressively using an energy distance objective. The energy distance offers an analytical measure of the distributional gap by contrasting simulated and target samples, enabling efficient training to capture the underlying con… ▽ More We introduce SLED, an alternative approach to speech language modeling by encoding speech waveforms into sequences of continuous latent representations and modeling them autoregressively using an energy distance objective. The energy distance offers an analytical measure of the distributional gap by contrasting simulated and target samples, enabling efficient training to capture the underlying continuous autoregressive distribution. By bypassing reliance on residual vector quantization, SLED avoids discretization errors and eliminates the need for the complicated hierarchical architectures common in existing speech language models. It simplifies the overall modeling pipeline while preserving the richness of speech information and maintaining inference efficiency. Empirical results demonstrate that SLED achieves strong performance in both zero-shot and streaming speech synthesis, showing its potential for broader applications in general-purpose speech language models. △ Less

Submitted 19 May, 2025; originally announced May 2025.

Comments: Demos and code are available at https://github.com/ictnlp/SLED-TTS

arXiv:2505.13032 [pdf, other]

MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix

Authors: Ziyang Ma, Yinghao Ma, Yanqiao Zhu, Chen Yang, Yi-Wen Chao, Ruiyang Xu, Wenxi Chen, Yuanzhe Chen, Zhuo Chen, Jian Cong, Kai Li, Keliang Li, Siyou Li, Xinfeng Li, Xiquan Li, Zheng Lian, Yuzhe Liang, Minghao Liu, Zhikang Niu, Tianrui Wang, Yuping Wang, Yuxuan Wang, Yihao Wu, Guanrou Yang, Jianwei Yu , et al. (9 additional authors not shown)

Abstract: We introduce MMAR, a new benchmark designed to evaluate the deep reasoning capabilities of Audio-Language Models (ALMs) across massive multi-disciplinary tasks. MMAR comprises 1,000 meticulously curated audio-question-answer triplets, collected from real-world internet videos and refined through iterative error corrections and quality checks to ensure high quality. Unlike existing benchmarks that… ▽ More We introduce MMAR, a new benchmark designed to evaluate the deep reasoning capabilities of Audio-Language Models (ALMs) across massive multi-disciplinary tasks. MMAR comprises 1,000 meticulously curated audio-question-answer triplets, collected from real-world internet videos and refined through iterative error corrections and quality checks to ensure high quality. Unlike existing benchmarks that are limited to specific domains of sound, music, or speech, MMAR extends them to a broad spectrum of real-world audio scenarios, including mixed-modality combinations of sound, music, and speech. Each question in MMAR is hierarchically categorized across four reasoning layers: Signal, Perception, Semantic, and Cultural, with additional sub-categories within each layer to reflect task diversity and complexity. To further foster research in this area, we annotate every question with a Chain-of-Thought (CoT) rationale to promote future advancements in audio reasoning. Each item in the benchmark demands multi-step deep reasoning beyond surface-level understanding. Moreover, a part of the questions requires graduate-level perceptual and domain-specific knowledge, elevating the benchmark's difficulty and depth. We evaluate MMAR using a broad set of models, including Large Audio-Language Models (LALMs), Large Audio Reasoning Models (LARMs), Omni Language Models (OLMs), Large Language Models (LLMs), and Large Reasoning Models (LRMs), with audio caption inputs. The performance of these models on MMAR highlights the benchmark's challenging nature, and our analysis further reveals critical limitations of understanding and reasoning capabilities among current models. We hope MMAR will serve as a catalyst for future advances in this important but little-explored area. △ Less

Submitted 19 May, 2025; originally announced May 2025.

Comments: Open-source at https://github.com/ddlBoJack/MMAR

arXiv:2505.12875 [pdf]

3D Gaussian Adaptive Reconstruction for Fourier Light-Field Microscopy

Authors: Chenyu Xu, Zhouyu Jin, Chengkang Shen, Hao Zhu, Zhan Ma, Bo Xiong, You Zhou, Xun Cao, Ning Gu

Abstract: Compared to light-field microscopy (LFM), which enables high-speed volumetric imaging but suffers from non-uniform spatial sampling, Fourier light-field microscopy (FLFM) introduces sub-aperture division at the pupil plane, thereby ensuring spatially invariant sampling and enhancing spatial resolution. Conventional FLFM reconstruction methods, such as Richardson-Lucy (RL) deconvolution, exhibit po… ▽ More Compared to light-field microscopy (LFM), which enables high-speed volumetric imaging but suffers from non-uniform spatial sampling, Fourier light-field microscopy (FLFM) introduces sub-aperture division at the pupil plane, thereby ensuring spatially invariant sampling and enhancing spatial resolution. Conventional FLFM reconstruction methods, such as Richardson-Lucy (RL) deconvolution, exhibit poor axial resolution and signal degradation due to the ill-posed nature of the inverse problem. While data-driven approaches enhance spatial resolution by leveraging high-quality paired datasets or imposing structural priors, Neural Radiance Fields (NeRF)-based methods employ physics-informed self-supervised learning to overcome these limitations, yet they are hindered by substantial computational costs and memory demands. Therefore, we propose 3D Gaussian Adaptive Tomography (3DGAT) for FLFM, a 3D gaussian splatting based self-supervised learning framework that significantly improves the volumetric reconstruction quality of FLFM while maintaining computational efficiency. Experimental results indicate that our approach achieves higher resolution and improved reconstruction accuracy, highlighting its potential to advance FLFM imaging and broaden its applications in 3D optical microscopy. △ Less

Submitted 19 May, 2025; originally announced May 2025.

arXiv:2505.09433 [pdf, ps, other]

Efficient LiDAR Reflectance Compression via Scanning Serialization

Authors: Jiahao Zhu, Kang You, Dandan Ding, Zhan Ma

Abstract: Reflectance attributes in LiDAR point clouds provide essential information for downstream tasks but remain underexplored in neural compression methods. To address this, we introduce SerLiC, a serialization-based neural compression framework to fully exploit the intrinsic characteristics of LiDAR reflectance. SerLiC first transforms 3D LiDAR point clouds into 1D sequences via scan-order serializati… ▽ More Reflectance attributes in LiDAR point clouds provide essential information for downstream tasks but remain underexplored in neural compression methods. To address this, we introduce SerLiC, a serialization-based neural compression framework to fully exploit the intrinsic characteristics of LiDAR reflectance. SerLiC first transforms 3D LiDAR point clouds into 1D sequences via scan-order serialization, offering a device-centric perspective for reflectance analysis. Each point is then tokenized into a contextual representation comprising its sensor scanning index, radial distance, and prior reflectance, for effective dependencies exploration. For efficient sequential modeling, Mamba is incorporated with a dual parallelization scheme, enabling simultaneous autoregressive dependency capture and fast processing. Extensive experiments demonstrate that SerLiC attains over 2x volume reduction against the original reflectance data, outperforming the state-of-the-art method by up to 22% reduction of compressed bits while using only 2% of its parameters. Moreover, a lightweight version of SerLiC achieves > 10 fps (frames per second) with just 111K parameters, which is attractive for real-world applications. △ Less

Submitted 27 May, 2025; v1 submitted 14 May, 2025; originally announced May 2025.

arXiv:2505.08281 [pdf, ps, other]

Ultra Lowrate Image Compression with Semantic Residual Coding and Compression-aware Diffusion

Authors: Anle Ke, Xu Zhang, Tong Chen, Ming Lu, Chao Zhou, Jiawen Gu, Zhan Ma

Abstract: Existing multimodal large model-based image compression frameworks often rely on a fragmented integration of semantic retrieval, latent compression, and generative models, resulting in suboptimal performance in both reconstruction fidelity and coding efficiency. To address these challenges, we propose a residual-guided ultra lowrate image compression named ResULIC, which incorporates residual sign… ▽ More Existing multimodal large model-based image compression frameworks often rely on a fragmented integration of semantic retrieval, latent compression, and generative models, resulting in suboptimal performance in both reconstruction fidelity and coding efficiency. To address these challenges, we propose a residual-guided ultra lowrate image compression named ResULIC, which incorporates residual signals into both semantic retrieval and the diffusion-based generation process. Specifically, we introduce Semantic Residual Coding (SRC) to capture the semantic disparity between the original image and its compressed latent representation. A perceptual fidelity optimizer is further applied for superior reconstruction quality. Additionally, we present the Compression-aware Diffusion Model (CDM), which establishes an optimal alignment between bitrates and diffusion time steps, improving compression-reconstruction synergy. Extensive experiments demonstrate the effectiveness of ResULIC, achieving superior objective and subjective performance compared to state-of-the-art diffusion-based methods with - 80.7%, -66.3% BD-rate saving in terms of LPIPS and FID. Project page is available at https: //njuvision.github.io/ResULIC/. △ Less

Submitted 13 May, 2025; originally announced May 2025.

Journal ref: ICML 2025

arXiv:2505.05521 [pdf, other]

Model-Based Closed-Loop Control Algorithm for Stochastic Partial Differential Equation Control

Authors: Peiyan Hu, Haodong Feng, Yue Wang, Zhiming Ma

Abstract: Neural operators have demonstrated promise in modeling and controlling systems governed by Partial Differential Equations (PDEs). Beyond PDEs, Stochastic Partial Differential Equations (SPDEs) play a critical role in modeling systems influenced by randomness, with applications in finance, physics, and beyond. However, controlling SPDE-governed systems remains a significant challenge. On the one ha… ▽ More Neural operators have demonstrated promise in modeling and controlling systems governed by Partial Differential Equations (PDEs). Beyond PDEs, Stochastic Partial Differential Equations (SPDEs) play a critical role in modeling systems influenced by randomness, with applications in finance, physics, and beyond. However, controlling SPDE-governed systems remains a significant challenge. On the one hand, the regularity of the system's state (which can be intuitively understood as smoothness) deteriorates, making modeling and generalization more challenging. On the other hand, this stochasticity also renders control more unstable and thus less accurate. To address this gap, we propose the Model-Based Closed-Loop Control Algorithm (MB-CC), the first model-based closed-loop control method for SPDEs. MB-CC introduces two key innovations to enhance control robustness and efficiency: a Regularity Feature (RF) block and a closed-loop strategy with an operator-encoded policy network. The RF block, inspired by the regularity structure theory of SPDEs, addresses noise-induced irregularities by transforming the network's input, including the system state and noise-perturbed external forces, into a refined feature space for improved forward prediction. Compared to previous works using regularity features, we introduce a new parameterization, data augmentation, and extend the RF block as a plug-and-play component. Additionally, to achieve closed-loop control, we introduce an operator-encoded policy network to map the current state to optimal control, which integrates physical priors and swiftly makes decisions based on states returned by the environment. We conduct a systematic evaluation of MB-CC on two notable SPDEs, showcasing its effectiveness and efficiency. The ablation studies show its ability to handle stochasticity more effectively. △ Less

Submitted 15 May, 2025; v1 submitted 8 May, 2025; originally announced May 2025.

arXiv:2505.03186 [pdf, other]

CoGenAV: Versatile Audio-Visual Representation Learning via Contrastive-Generative Synchronization

Authors: Detao Bai, Zhiheng Ma, Xihan Wei, Liefeng Bo

Abstract: The inherent synchronization between a speaker's lip movements, voice, and the underlying linguistic content offers a rich source of information for improving speech processing tasks, especially in challenging conditions where traditional audio-only systems falter. We introduce CoGenAV, a powerful and data-efficient model designed to learn versatile audio-visual representations applicable across a… ▽ More The inherent synchronization between a speaker's lip movements, voice, and the underlying linguistic content offers a rich source of information for improving speech processing tasks, especially in challenging conditions where traditional audio-only systems falter. We introduce CoGenAV, a powerful and data-efficient model designed to learn versatile audio-visual representations applicable across a wide range of speech and audio-visual tasks. CoGenAV is trained by optimizing a dual objective derived from natural audio-visual synchrony, contrastive feature alignment and generative text prediction, using only 223 hours of labeled data from the LRS2 dataset. This contrastive-generative synchronization strategy effectively captures fundamental cross-modal correlations. We showcase the effectiveness and versatility of the learned CoGenAV representations on multiple benchmarks. When utilized for Audio-Visual Speech Recognition (AVSR) on LRS2, these representations contribute to achieving a state-of-the-art Word Error Rate (WER) of 1.27. They also enable strong performance in Visual Speech Recognition (VSR) with a WER of 20.5 on LRS2, and significantly improve performance in noisy environments by over 70%. Furthermore, CoGenAV representations benefit speech reconstruction tasks, boosting performance in Speech Enhancement and Separation, and achieve competitive results in audio-visual synchronization tasks like Active Speaker Detection (ASD). Our model will be open-sourced to facilitate further development and collaboration within both academia and industry. △ Less

Submitted 15 May, 2025; v1 submitted 6 May, 2025; originally announced May 2025.

arXiv:2505.01768 [pdf, ps, other]

Continuous Filtered Backprojection by Learnable Interpolation Network

Authors: Hui Lin, Dong Zeng, Qi Xie, Zerui Mao, Jianhua Ma, Deyu Meng

Abstract: Accurate reconstruction of computed tomography (CT) images is crucial in medical imaging field. However, there are unavoidable interpolation errors in the backprojection step of the conventional reconstruction methods, i.e., filtered-back-projection based methods, which are detrimental to the accurate reconstruction. In this study, to address this issue, we propose a novel deep learning model, nam… ▽ More Accurate reconstruction of computed tomography (CT) images is crucial in medical imaging field. However, there are unavoidable interpolation errors in the backprojection step of the conventional reconstruction methods, i.e., filtered-back-projection based methods, which are detrimental to the accurate reconstruction. In this study, to address this issue, we propose a novel deep learning model, named Leanable-Interpolation-based FBP or LInFBP shortly, to enhance the reconstructed CT image quality, which achieves learnable interpolation in the backprojection step of filtered backprojection (FBP) and alleviates the interpolation errors. Specifically, in the proposed LInFBP, we formulate every local piece of the latent continuous function of discrete sinogram data as a linear combination of selected basis functions, and learn this continuous function by exploiting a deep network to predict the linear combination coefficients. Then, the learned latent continuous function is exploited for interpolation in backprojection step, which first time takes the advantage of deep learning for the interpolation in FBP. Extensive experiments, which encompass diverse CT scenarios, demonstrate the effectiveness of the proposed LInFBP in terms of enhanced reconstructed image quality, plug-and-play ability and generalization capability. △ Less

Submitted 3 May, 2025; originally announced May 2025.

Comments: 14 pages, 10 figures

arXiv:2504.20334 [pdf, other]

Towards Flow-Matching-based TTS without Classifier-Free Guidance

Authors: Yuzhe Liang, Wenzhe Liu, Chunyu Qiang, Zhikang Niu, Yushen Chen, Ziyang Ma, Wenxi Chen, Nan Li, Chen Zhang, Xie Chen

Abstract: Flow matching has demonstrated strong generative capabilities and has become a core component in modern Text-to-Speech (TTS) systems. To ensure high-quality speech synthesis, Classifier-Free Guidance (CFG) is widely used during the inference of flow-matching-based TTS models. However, CFG incurs substantial computational cost as it requires two forward passes, which hinders its applicability in re… ▽ More Flow matching has demonstrated strong generative capabilities and has become a core component in modern Text-to-Speech (TTS) systems. To ensure high-quality speech synthesis, Classifier-Free Guidance (CFG) is widely used during the inference of flow-matching-based TTS models. However, CFG incurs substantial computational cost as it requires two forward passes, which hinders its applicability in real-time scenarios. In this paper, we explore removing CFG from flow-matching-based TTS models to improve inference efficiency, while maintaining performance. Specifically, we reformulated the flow matching training target to directly approximate the CFG optimization trajectory. This training method eliminates the need for unconditional model evaluation and guided tuning during inference, effectively cutting the computational overhead in half. Furthermore, It can be seamlessly integrated with existing optimized sampling strategies. We validate our approach using the F5-TTS model on the LibriTTS dataset. Experimental results show that our method achieves a 9$\times$ inference speed-up compared to the baseline F5-TTS, while preserving comparable speech quality. We will release the code and models to support reproducibility and foster further research in this area. △ Less

Submitted 2 May, 2025; v1 submitted 28 April, 2025; originally announced April 2025.

arXiv:2504.12867 [pdf, other]

EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting

Authors: Guanrou Yang, Chen Yang, Qian Chen, Ziyang Ma, Wenxi Chen, Wen Wang, Tianrui Wang, Yifan Yang, Zhikang Niu, Wenrui Liu, Fan Yu, Zhihao Du, Zhifu Gao, ShiLiang Zhang, Xie Chen

Abstract: Human speech goes beyond the mere transfer of information; it is a profound exchange of emotions and a connection between individuals. While Text-to-Speech (TTS) models have made huge progress, they still face challenges in controlling the emotional expression in the generated speech. In this work, we propose EmoVoice, a novel emotion-controllable TTS model that exploits large language models (LLM… ▽ More Human speech goes beyond the mere transfer of information; it is a profound exchange of emotions and a connection between individuals. While Text-to-Speech (TTS) models have made huge progress, they still face challenges in controlling the emotional expression in the generated speech. In this work, we propose EmoVoice, a novel emotion-controllable TTS model that exploits large language models (LLMs) to enable fine-grained freestyle natural language emotion control, and a phoneme boost variant design that makes the model output phoneme tokens and audio tokens in parallel to enhance content consistency, inspired by chain-of-thought (CoT) and chain-of-modality (CoM) techniques. Besides, we introduce EmoVoice-DB, a high-quality 40-hour English emotion dataset featuring expressive speech and fine-grained emotion labels with natural language descriptions. EmoVoice achieves state-of-the-art performance on the English EmoVoice-DB test set using only synthetic training data, and on the Chinese Secap test set using our in-house data. We further investigate the reliability of existing emotion evaluation metrics and their alignment with human perceptual preferences, and explore using SOTA multimodal LLMs GPT-4o-audio and Gemini to assess emotional speech. Demo samples are available at https://yanghaha0908.github.io/EmoVoice/. Dataset, code, and checkpoints will be released. △ Less

Submitted 21 April, 2025; v1 submitted 17 April, 2025; originally announced April 2025.

arXiv:2504.12711 [pdf, other]

NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images: Methods and Results

Authors: Xin Li, Yeying Jin, Xin Jin, Zongwei Wu, Bingchen Li, Yufei Wang, Wenhan Yang, Yu Li, Zhibo Chen, Bihan Wen, Robby T. Tan, Radu Timofte, Qiyu Rong, Hongyuan Jing, Mengmeng Zhang, Jinglong Li, Xiangyu Lu, Yi Ren, Yuting Liu, Meng Zhang, Xiang Chen, Qiyuan Guan, Jiangxin Dong, Jinshan Pan, Conglin Gou , et al. (112 additional authors not shown)

Abstract: This paper reviews the NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images. This challenge received a wide range of impressive solutions, which are developed and evaluated using our collected real-world Raindrop Clarity dataset. Unlike existing deraining datasets, our Raindrop Clarity dataset is more diverse and challenging in degradation types and contents, which includ… ▽ More This paper reviews the NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images. This challenge received a wide range of impressive solutions, which are developed and evaluated using our collected real-world Raindrop Clarity dataset. Unlike existing deraining datasets, our Raindrop Clarity dataset is more diverse and challenging in degradation types and contents, which includes day raindrop-focused, day background-focused, night raindrop-focused, and night background-focused degradations. This dataset is divided into three subsets for competition: 14,139 images for training, 240 images for validation, and 731 images for testing. The primary objective of this challenge is to establish a new and powerful benchmark for the task of removing raindrops under varying lighting and focus conditions. There are a total of 361 participants in the competition, and 32 teams submitting valid solutions and fact sheets for the final testing phase. These submissions achieved state-of-the-art (SOTA) performance on the Raindrop Clarity dataset. The project can be found at https://lixinustc.github.io/CVPR-NTIRE2025-RainDrop-Competition.github.io/. △ Less

Submitted 19 April, 2025; v1 submitted 17 April, 2025; originally announced April 2025.

Comments: Challenge Report of CVPR NTIRE 2025; 26 pages; Methods from 32 teams

arXiv:2504.07758 [pdf, other]

PIDSR: Complementary Polarized Image Demosaicing and Super-Resolution

Authors: Shuangfan Zhou, Chu Zhou, Youwei Lyu, Heng Guo, Zhanyu Ma, Boxin Shi, Imari Sato

Abstract: Polarization cameras can capture multiple polarized images with different polarizer angles in a single shot, bringing convenience to polarization-based downstream tasks. However, their direct outputs are color-polarization filter array (CPFA) raw images, requiring demosaicing to reconstruct full-resolution, full-color polarized images; unfortunately, this necessary step introduces artifacts that m… ▽ More Polarization cameras can capture multiple polarized images with different polarizer angles in a single shot, bringing convenience to polarization-based downstream tasks. However, their direct outputs are color-polarization filter array (CPFA) raw images, requiring demosaicing to reconstruct full-resolution, full-color polarized images; unfortunately, this necessary step introduces artifacts that make polarization-related parameters such as the degree of polarization (DoP) and angle of polarization (AoP) prone to error. Besides, limited by the hardware design, the resolution of a polarization camera is often much lower than that of a conventional RGB camera. Existing polarized image demosaicing (PID) methods are limited in that they cannot enhance resolution, while polarized image super-resolution (PISR) methods, though designed to obtain high-resolution (HR) polarized images from the demosaicing results, tend to retain or even amplify errors in the DoP and AoP introduced by demosaicing artifacts. In this paper, we propose PIDSR, a joint framework that performs complementary Polarized Image Demosaicing and Super-Resolution, showing the ability to robustly obtain high-quality HR polarized images with more accurate DoP and AoP from a CPFA raw image in a direct manner. Experiments show our PIDSR not only achieves state-of-the-art performance on both synthetic and real data, but also facilitates downstream tasks. △ Less

Submitted 22 April, 2025; v1 submitted 10 April, 2025; originally announced April 2025.

Comments: CVPR 2025

arXiv:2504.00493 [pdf, other]

Perturbation-Based Pinning Control Strategy for Enhanced Synchronization in Complex Networks

Authors: Ziang Mao, Tianlong Fan, Linyuan Lü

Abstract: Synchronization is essential for the stability and coordinated operation of complex networked systems. Pinning control, which selectively controls a subset of nodes, provides a scalable solution to enhance network synchronizability. However, existing strategies face key limitations: heuristic centrality-based methods lack a direct connection to synchronization dynamics, while spectral approaches,… ▽ More Synchronization is essential for the stability and coordinated operation of complex networked systems. Pinning control, which selectively controls a subset of nodes, provides a scalable solution to enhance network synchronizability. However, existing strategies face key limitations: heuristic centrality-based methods lack a direct connection to synchronization dynamics, while spectral approaches, though effective, are computationally intensive. To address these challenges, we propose a perturbation-based optimized strategy (PBO) that dynamically evaluates each node's spectral impact on the Laplacian matrix, achieving improved synchronizability with significantly reduced computational costs (with complexity O(kM)). Extensive experiments demonstrate that the proposed method outperforms traditional strategies in synchronizability, convergence rate, and pinning robustness to node failures. Notably, in all the empirical networks tested and some generated networks, PBO significantly outperforms the brute-force greedy strategy, demonstrating its ability to avoid local optima and adapt to complex connectivity patterns. Our study establishes the theoretical relationship between network synchronizability and convergence rate, offering new insights into efficient synchronization strategies for large-scale complex networks. △ Less

Submitted 10 April, 2025; v1 submitted 1 April, 2025; originally announced April 2025.

Comments: This work has been submitted to the IEEE for possible publication

arXiv:2503.22202 [pdf, other]

mmHRR: Monitoring Heart Rate Recovery with Millimeter Wave Radar

Authors: Ziheng Mao, Yuan He, Jia Zhang, Yimiao Sun, Yadong Xie, Xiuzhen Guo

Abstract: Heart rate recovery (HRR) within the initial minute following exercise is a widely utilized metric for assessing cardiac autonomic function in individuals and predicting mortality risk in patients with cardiovascular disease. However, prevailing solutions for HRR monitoring typically involve the use of specialized medical equipment or contact wearable sensors, resulting in high costs and poor user… ▽ More Heart rate recovery (HRR) within the initial minute following exercise is a widely utilized metric for assessing cardiac autonomic function in individuals and predicting mortality risk in patients with cardiovascular disease. However, prevailing solutions for HRR monitoring typically involve the use of specialized medical equipment or contact wearable sensors, resulting in high costs and poor user experience. In this paper, we propose a contactless HRR monitoring technique, mmHRR, which achieves accurate heart rate (HR) estimation with a commercial mmWave radar. Unlike HR estimation at rest, the HR varies quickly after exercise and the heartbeat signal entangles with the respiration harmonics. To overcome these hurdles and effectively estimate the HR from the weak and non-stationary heartbeat signal, we propose a novel signal processing pipeline, including dynamic target tracking, adaptive heartbeat signal extraction, and accurate HR estimation with composite sliding windows. Real-world experiments demonstrate that mmHRR exhibits exceptional robustness across diverse environmental conditions, and achieves an average HR estimation error of 3.31 bpm (beats per minute), 71% lower than that of the state-of-the-art method. △ Less

Submitted 28 March, 2025; originally announced March 2025.

arXiv:2503.12382 [pdf, other]

RENO: Real-Time Neural Compression for 3D LiDAR Point Clouds

Authors: Kang You, Tong Chen, Dandan Ding, M. Salman Asif, Zhan Ma

Abstract: Despite the substantial advancements demonstrated by learning-based neural models in the LiDAR Point Cloud Compression (LPCC) task, realizing real-time compression - an indispensable criterion for numerous industrial applications - remains a formidable challenge. This paper proposes RENO, the first real-time neural codec for 3D LiDAR point clouds, achieving superior performance with a lightweight… ▽ More Despite the substantial advancements demonstrated by learning-based neural models in the LiDAR Point Cloud Compression (LPCC) task, realizing real-time compression - an indispensable criterion for numerous industrial applications - remains a formidable challenge. This paper proposes RENO, the first real-time neural codec for 3D LiDAR point clouds, achieving superior performance with a lightweight model. RENO skips the octree construction and directly builds upon the multiscale sparse tensor representation. Instead of the multi-stage inferring, RENO devises sparse occupancy codes, which exploit cross-scale correlation and derive voxels' occupancy in a one-shot manner, greatly saving processing time. Experimental results demonstrate that the proposed RENO achieves real-time coding speed, 10 fps at 14-bit depth on a desktop platform (e.g., one RTX 3090 GPU) for both encoding and decoding processes, while providing 12.25% and 48.34% bit-rate savings compared to G-PCCv23 and Draco, respectively, at a similar quality. RENO model size is merely 1MB, making it attractive for practical applications. The source code is available at https://github.com/NJUVISION/RENO. △ Less

Submitted 16 March, 2025; originally announced March 2025.

arXiv:2503.11190 [pdf, other]

Cross-Modal Learning for Music-to-Music-Video Description Generation

Authors: Zhuoyuan Mao, Mengjie Zhao, Qiyu Wu, Zhi Zhong, Wei-Hsiang Liao, Hiromi Wakaki, Yuki Mitsufuji

Abstract: Music-to-music-video generation is a challenging task due to the intrinsic differences between the music and video modalities. The advent of powerful text-to-video diffusion models has opened a promising pathway for music-video (MV) generation by first addressing the music-to-MV description task and subsequently leveraging these models for video generation. In this study, we focus on the MV descri… ▽ More Music-to-music-video generation is a challenging task due to the intrinsic differences between the music and video modalities. The advent of powerful text-to-video diffusion models has opened a promising pathway for music-video (MV) generation by first addressing the music-to-MV description task and subsequently leveraging these models for video generation. In this study, we focus on the MV description generation task and propose a comprehensive pipeline encompassing training data construction and multimodal model fine-tuning. We fine-tune existing pre-trained multimodal models on our newly constructed music-to-MV description dataset based on the Music4All dataset, which integrates both musical and visual information. Our experimental results demonstrate that music representations can be effectively mapped to textual domains, enabling the generation of meaningful MV description directly from music inputs. We also identify key components in the dataset construction pipeline that critically impact the quality of MV description and highlight specific musical attributes that warrant greater focus for improved MV description generation. △ Less

Submitted 14 March, 2025; originally announced March 2025.

Comments: Accepted by RepL4NLP 2025 @ NAACL 2025

arXiv:2503.11075 [pdf, other]

Inverter Control with Time-Varying and Nonconvex State and Input Constraints

Authors: Zixiao Ma, Baosen Zhang

Abstract: The growing integration of inverter-based resources (IBRs) into modern power systems poses significant challenges for maintaining reliable operation under dynamic and constrained conditions. This paper focuses on the power tracking problem for grid-connected IBRs, addressing the complexities introduced by voltage and power factor constraints. Voltage constraints, being time-varying and nonlinear i… ▽ More The growing integration of inverter-based resources (IBRs) into modern power systems poses significant challenges for maintaining reliable operation under dynamic and constrained conditions. This paper focuses on the power tracking problem for grid-connected IBRs, addressing the complexities introduced by voltage and power factor constraints. Voltage constraints, being time-varying and nonlinear input constraints, often conflict with power factor constraints, which are state constraints. These conflicts, coupled with stability requirements, add substantial complexity to control design. To overcome these challenges, we propose a computationally efficient static state-feedback controller that guarantees stability and satisfies operational constraints. The concept of achievability is introduced to evaluate whether power setpoints can be accurately tracked while adhering to all constraints. Using a parameterization framework and the S-lemma, we develop criteria to assess and maximize the continuous achievable region for IBR operation. This framework allows system operators to ensure safety and stability by precomputing a finite set of control gains, significantly reducing online computational requirements. The proposed approach is validated through simulations, demonstrating its effectiveness in handling time-varying grid disturbances and achieving reliable control performance. △ Less

Submitted 14 March, 2025; originally announced March 2025.

arXiv:2503.10047 [pdf, other]

Dual-domain Modulation Network for Lightweight Image Super-Resolution

Authors: Wenjie Li, Heng Guo, Yuefeng Hou, Guangwei Gao, Zhanyu Ma

Abstract: Lightweight image super-resolution (SR) aims to reconstruct high-resolution images from low-resolution images with limited computational costs. We find existing frequency-based SR methods cannot balance the reconstruction of overall structures and high-frequency parts. Meanwhile, these methods are inefficient for handling frequency features and unsuitable for lightweight SR. In this paper, we show… ▽ More Lightweight image super-resolution (SR) aims to reconstruct high-resolution images from low-resolution images with limited computational costs. We find existing frequency-based SR methods cannot balance the reconstruction of overall structures and high-frequency parts. Meanwhile, these methods are inefficient for handling frequency features and unsuitable for lightweight SR. In this paper, we show introducing both wavelet and Fourier information allows our model to consider both high-frequency features and overall SR structure reconstruction while reducing costs. Specifically, we propose a dual-domain modulation network that utilize wavelet-domain modulation self-Transformer (WMT) plus Fourier supervision to modulate frequency features in addition to spatial domain modulation. Compared to existing frequency-based SR modules, our WMT is more suitable for frequency learning in lightweight SR. Experimental results show that our method achieves a comparable PSNR of SRFormer and MambaIR while with less than 50% and 60% of their FLOPs and achieving inference speeds 15.4x and 5.4x faster, respectively, demonstrating the effectiveness of our method on SR quality and lightweight. Codes will be released upon acceptance. △ Less

Submitted 13 March, 2025; originally announced March 2025.

arXiv:2503.08638 [pdf, other]

YuE: Scaling Open Foundation Models for Long-Form Music Generation

Authors: Ruibin Yuan, Hanfeng Lin, Shuyue Guo, Ge Zhang, Jiahao Pan, Yongyi Zang, Haohe Liu, Yiming Liang, Wenye Ma, Xingjian Du, Xinrun Du, Zhen Ye, Tianyu Zheng, Yinghao Ma, Minghao Liu, Zeyue Tian, Ziya Zhou, Liumeng Xue, Xingwei Qu, Yizhi Li, Shangda Wu, Tianhao Shen, Ziyang Ma, Jun Zhan, Chunhui Wang , et al. (32 additional authors not shown)

Abstract: We tackle the task of long-form music generation--particularly the challenging \textbf{lyrics-to-song} problem--by introducing YuE, a family of open foundation models based on the LLaMA2 architecture. Specifically, YuE scales to trillions of tokens and generates up to five minutes of music while maintaining lyrical alignment, coherent musical structure, and engaging vocal melodies with appropriate… ▽ More We tackle the task of long-form music generation--particularly the challenging \textbf{lyrics-to-song} problem--by introducing YuE, a family of open foundation models based on the LLaMA2 architecture. Specifically, YuE scales to trillions of tokens and generates up to five minutes of music while maintaining lyrical alignment, coherent musical structure, and engaging vocal melodies with appropriate accompaniment. It achieves this through (1) track-decoupled next-token prediction to overcome dense mixture signals, (2) structural progressive conditioning for long-context lyrical alignment, and (3) a multitask, multiphase pre-training recipe to converge and generalize. In addition, we redesign the in-context learning technique for music generation, enabling versatile style transfer (e.g., converting Japanese city pop into an English rap while preserving the original accompaniment) and bidirectional generation. Through extensive evaluation, we demonstrate that YuE matches or even surpasses some of the proprietary systems in musicality and vocal agility. In addition, fine-tuning YuE enables additional controls and enhanced support for tail languages. Furthermore, beyond generation, we show that YuE's learned representations can perform well on music understanding tasks, where the results of YuE match or exceed state-of-the-art methods on the MARBLE benchmark. Keywords: lyrics2song, song generation, long-form, foundation model, music generation △ Less

Submitted 11 March, 2025; originally announced March 2025.

Comments: https://github.com/multimodal-art-projection/YuE

arXiv:2503.01710 [pdf, other]

Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens

Authors: Xinsheng Wang, Mingqi Jiang, Ziyang Ma, Ziyu Zhang, Songxiang Liu, Linqin Li, Zheng Liang, Qixi Zheng, Rui Wang, Xiaoqin Feng, Weizhen Bian, Zhen Ye, Sitong Cheng, Ruibin Yuan, Zhixian Zhao, Xinfa Zhu, Jiahao Pan, Liumeng Xue, Pengcheng Zhu, Yunlin Chen, Zhifei Li, Xie Chen, Lei Xie, Yike Guo, Wei Xue

Abstract: Recent advancements in large language models (LLMs) have driven significant progress in zero-shot text-to-speech (TTS) synthesis. However, existing foundation models rely on multi-stage processing or complex architectures for predicting multiple codebooks, limiting efficiency and integration flexibility. To overcome these challenges, we introduce Spark-TTS, a novel system powered by BiCodec, a sin… ▽ More Recent advancements in large language models (LLMs) have driven significant progress in zero-shot text-to-speech (TTS) synthesis. However, existing foundation models rely on multi-stage processing or complex architectures for predicting multiple codebooks, limiting efficiency and integration flexibility. To overcome these challenges, we introduce Spark-TTS, a novel system powered by BiCodec, a single-stream speech codec that decomposes speech into two complementary token types: low-bitrate semantic tokens for linguistic content and fixed-length global tokens for speaker attributes. This disentangled representation, combined with the Qwen2.5 LLM and a chain-of-thought (CoT) generation approach, enables both coarse-grained control (e.g., gender, speaking style) and fine-grained adjustments (e.g., precise pitch values, speaking rate). To facilitate research in controllable TTS, we introduce VoxBox, a meticulously curated 100,000-hour dataset with comprehensive attribute annotations. Extensive experiments demonstrate that Spark-TTS not only achieves state-of-the-art zero-shot voice cloning but also generates highly customizable voices that surpass the limitations of reference-based synthesis. Source code, pre-trained models, and audio samples are available at https://github.com/SparkAudio/Spark-TTS. △ Less

Submitted 3 March, 2025; originally announced March 2025.

Comments: Submitted to ACL 2025

arXiv:2503.01116 [pdf, other]

Large AI Model for Delay-Doppler Domain Channel Prediction in 6G OTFS-Based Vehicular Networks

Authors: Jianzhe Xue, Dongcheng Yuan, Zhanxi Ma, Tiankai Jiang, Yu Sun, Haibo Zhou, Xuemin Shen

Abstract: Channel prediction is crucial for high-mobility vehicular networks, as it enables the anticipation of future channel conditions and the proactive adjustment of communication strategies. However, achieving accurate vehicular channel prediction is challenging due to significant Doppler effects and rapid channel variations resulting from high-speed vehicle movement and complex propagation environment… ▽ More Channel prediction is crucial for high-mobility vehicular networks, as it enables the anticipation of future channel conditions and the proactive adjustment of communication strategies. However, achieving accurate vehicular channel prediction is challenging due to significant Doppler effects and rapid channel variations resulting from high-speed vehicle movement and complex propagation environments. In this paper, we propose a novel delay-Doppler (DD) domain channel prediction framework tailored for high-mobility vehicular networks. By transforming the channel representation into the DD domain, we obtain an intuitive, sparse, and stable depiction that closely aligns with the underlying physical propagation processes, effectively reducing the complex vehicular channel to a set of time-series parameters with enhanced predictability. Furthermore, we leverage the large artificial intelligence (AI) model to predict these DD-domain time-series parameters, capitalizing on their advanced ability to model temporal correlations. The zero-shot capability of the pre-trained large AI model facilitates accurate channel predictions without requiring task-specific training, while subsequent fine-tuning on specific vehicular channel data further improves prediction accuracy. Extensive simulation results demonstrate the effectiveness of our DD-domain channel prediction framework and the superior accuracy of the large AI model in predicting time-series channel parameters, thereby highlighting the potential of our approach for robust vehicular communication systems. △ Less

Submitted 8 May, 2025; v1 submitted 2 March, 2025; originally announced March 2025.

Comments: This manuscript has been accepted by SCIENCE CHINA Information Sciences

arXiv:2502.17810 [pdf, other]

URO-Bench: A Comprehensive Benchmark for End-to-End Spoken Dialogue Models

Authors: Ruiqi Yan, Xiquan Li, Wenxi Chen, Zhikang Niu, Chen Yang, Ziyang Ma, Kai Yu, Xie Chen

Abstract: In recent years, with advances in large language models (LLMs), end-to-end spoken dialogue models (SDMs) have made significant strides. Compared to text-based LLMs, the evaluation of SDMs needs to take speech-related aspects into account, such as paralinguistic information and speech quality. However, there is still a lack of comprehensive evaluations for SDMs in speech-to-speech (S2S) scenarios.… ▽ More In recent years, with advances in large language models (LLMs), end-to-end spoken dialogue models (SDMs) have made significant strides. Compared to text-based LLMs, the evaluation of SDMs needs to take speech-related aspects into account, such as paralinguistic information and speech quality. However, there is still a lack of comprehensive evaluations for SDMs in speech-to-speech (S2S) scenarios. To address this gap, we propose URO-Bench, an extensive benchmark for SDMs. Notably, URO-Bench is the first S2S benchmark that covers evaluations about multilingualism, multi-round dialogues, and paralinguistics. Our benchmark is divided into two difficulty levels: basic track and pro track, consisting of 16 and 20 datasets respectively, evaluating the model's abilities in Understanding, Reasoning, and Oral conversation. Evaluations on our proposed benchmark reveal that current open-source SDMs perform rather well in daily QA tasks, but lag behind their backbone LLMs in terms of instruction-following ability and also suffer from catastrophic forgetting. Their performance in advanced evaluations of paralinguistic information and audio understanding remains subpar, highlighting the need for further research in this direction. We hope that URO-Bench can effectively facilitate the development of spoken dialogue models by providing a multifaceted evaluation of existing models and helping to track progress in this area. △ Less

Submitted 1 March, 2025; v1 submitted 24 February, 2025; originally announced February 2025.

arXiv:2502.12623 [pdf, other]

DeepResonance: Enhancing Multimodal Music Understanding via Music-centric Multi-way Instruction Tuning

Authors: Zhuoyuan Mao, Mengjie Zhao, Qiyu Wu, Hiromi Wakaki, Yuki Mitsufuji

Abstract: Recent advancements in music large language models (LLMs) have significantly improved music understanding tasks, which involve the model's ability to analyze and interpret various musical elements. These improvements primarily focused on integrating both music and text inputs. However, the potential of incorporating additional modalities such as images, videos and textual music features to enhance… ▽ More Recent advancements in music large language models (LLMs) have significantly improved music understanding tasks, which involve the model's ability to analyze and interpret various musical elements. These improvements primarily focused on integrating both music and text inputs. However, the potential of incorporating additional modalities such as images, videos and textual music features to enhance music understanding remains unexplored. To bridge this gap, we propose DeepResonance, a multimodal music understanding LLM fine-tuned via multi-way instruction tuning with multi-way aligned music, text, image, and video data. To this end, we construct Music4way-MI2T, Music4way-MV2T, and Music4way-Any2T, three 4-way training and evaluation datasets designed to enable DeepResonance to integrate both visual and textual music feature content. We also introduce multi-sampled ImageBind embeddings and a pre-LLM fusion Transformer to enhance modality fusion prior to input into text LLMs, tailoring DeepResonance for multi-way instruction tuning. Our model achieves state-of-the-art performances across six music understanding tasks, highlighting the benefits of the auxiliary modalities and the structural superiority of DeepResonance. We plan to open-source the models and the newly constructed datasets. △ Less

Submitted 20 May, 2025; v1 submitted 18 February, 2025; originally announced February 2025.

arXiv:2502.11729 [pdf, other]

On Quantizing Neural Representation for Variable-Rate Video Coding

Authors: Junqi Shi, Zhujia Chen, Hanfei Li, Qi Zhao, Ming Lu, Tong Chen, Zhan Ma

Abstract: This work introduces NeuroQuant, a novel post-training quantization (PTQ) approach tailored to non-generalized Implicit Neural Representations for variable-rate Video Coding (INR-VC). Unlike existing methods that require extensive weight retraining for each target bitrate, we hypothesize that variable-rate coding can be achieved by adjusting quantization parameters (QPs) of pre-trained weights. Ou… ▽ More This work introduces NeuroQuant, a novel post-training quantization (PTQ) approach tailored to non-generalized Implicit Neural Representations for variable-rate Video Coding (INR-VC). Unlike existing methods that require extensive weight retraining for each target bitrate, we hypothesize that variable-rate coding can be achieved by adjusting quantization parameters (QPs) of pre-trained weights. Our study reveals that traditional quantization methods, which assume inter-layer independence, are ineffective for non-generalized INR-VC models due to significant dependencies across layers. To address this, we redefine variable-rate INR-VC as a mixed-precision quantization problem and establish a theoretical framework for sensitivity criteria aimed at simplified, fine-grained rate control. Additionally, we propose network-wise calibration and channel-wise quantization strategies to minimize quantization-induced errors, arriving at a unified formula for representation-oriented PTQ calibration. Our experimental evaluations demonstrate that NeuroQuant significantly outperforms existing techniques in varying bitwidth quantization and compression efficiency, accelerating encoding by up to eight times and enabling quantization down to INT2 with minimal reconstruction loss. This work introduces variable-rate INR-VC for the first time and lays a theoretical foundation for future research in rate-distortion optimization, advancing the field of video coding technology. The materials will be available at https://github.com/Eric-qi/NeuroQuant. △ Less

Submitted 17 February, 2025; originally announced February 2025.

Comments: to be pulished in ICLR'25

arXiv:2501.15368 [pdf, other]

Baichuan-Omni-1.5 Technical Report

Authors: Yadong Li, Jun Liu, Tao Zhang, Tao Zhang, Song Chen, Tianpeng Li, Zehuan Li, Lijun Liu, Lingfeng Ming, Guosheng Dong, Da Pan, Chong Li, Yuanbo Fang, Dongdong Kuang, Mingrui Wang, Chenglin Zhu, Youwei Zhang, Hongyu Guo, Fengyu Zhang, Yuran Wang, Bowen Ding, Wei Song, Xu Li, Yuqi Huo, Zheng Liang , et al. (68 additional authors not shown)

Abstract: We introduce Baichuan-Omni-1.5, an omni-modal model that not only has omni-modal understanding capabilities but also provides end-to-end audio generation capabilities. To achieve fluent and high-quality interaction across modalities without compromising the capabilities of any modality, we prioritized optimizing three key aspects. First, we establish a comprehensive data cleaning and synthesis pip… ▽ More We introduce Baichuan-Omni-1.5, an omni-modal model that not only has omni-modal understanding capabilities but also provides end-to-end audio generation capabilities. To achieve fluent and high-quality interaction across modalities without compromising the capabilities of any modality, we prioritized optimizing three key aspects. First, we establish a comprehensive data cleaning and synthesis pipeline for multimodal data, obtaining about 500B high-quality data (text, audio, and vision). Second, an audio-tokenizer (Baichuan-Audio-Tokenizer) has been designed to capture both semantic and acoustic information from audio, enabling seamless integration and enhanced compatibility with MLLM. Lastly, we designed a multi-stage training strategy that progressively integrates multimodal alignment and multitask fine-tuning, ensuring effective synergy across all modalities. Baichuan-Omni-1.5 leads contemporary models (including GPT4o-mini and MiniCPM-o 2.6) in terms of comprehensive omni-modal capabilities. Notably, it achieves results comparable to leading models such as Qwen2-VL-72B across various multimodal medical benchmarks. △ Less

Submitted 25 January, 2025; originally announced January 2025.

arXiv:2501.15091 [pdf, other]

Deep Reinforcement Learning for Energy Efficiency Maximization in RSMA-IRS-Assisted ISAC System

Authors: Zhangfeng Ma, Ruichen Zhang, Bo Ai, Zhuxian Lian, Linzhou Zeng, Dusit Niyato

Abstract: This paper proposes a three-dimensional (3D) geometry-based channel model to accurately represent intelligent reflecting surfaces (IRS)-enhanced integrated sensing and communication (ISAC) networks using rate-splitting multiple access (RSMA) in practical urban environments. Based on this model, we formulate an energy efficiency (EE) maximization problem that incorporates transceiver beamforming co… ▽ More This paper proposes a three-dimensional (3D) geometry-based channel model to accurately represent intelligent reflecting surfaces (IRS)-enhanced integrated sensing and communication (ISAC) networks using rate-splitting multiple access (RSMA) in practical urban environments. Based on this model, we formulate an energy efficiency (EE) maximization problem that incorporates transceiver beamforming constraints, IRS phase adjustments, and quality-of-service (QoS) requirements to optimize communication and sensing functions. To solve this problem, we use the proximal policy optimization (PPO) algorithm within a deep reinforcement learning (DRL) framework. Our numerical results confirm the effectiveness of the proposed method in improving EE and satisfying QoS requirements. Additionally, we observe that system EE drops at higher frequencies, especially under double-Rayleigh fading. △ Less

Submitted 25 January, 2025; originally announced January 2025.

Comments: 5 pages, 4 figures

arXiv:2501.11263 [pdf, other]

Towards Loss-Resilient Image Coding for Unstable Satellite Networks

Authors: Hongwei Sha, Muchen Dong, Quanyou Luo, Ming Lu, Hao Chen, Zhan Ma

Abstract: Geostationary Earth Orbit (GEO) satellite communication demonstrates significant advantages in emergency short burst data services. However, unstable satellite networks, particularly those with frequent packet loss, present a severe challenge to accurate image transmission. To address it, we propose a loss-resilient image coding approach that leverages end-to-end optimization in learned image comp… ▽ More Geostationary Earth Orbit (GEO) satellite communication demonstrates significant advantages in emergency short burst data services. However, unstable satellite networks, particularly those with frequent packet loss, present a severe challenge to accurate image transmission. To address it, we propose a loss-resilient image coding approach that leverages end-to-end optimization in learned image compression (LIC). Our method builds on the channel-wise progressive coding framework, incorporating Spatial-Channel Rearrangement (SCR) on the encoder side and Mask Conditional Aggregation (MCA) on the decoder side to improve reconstruction quality with unpredictable errors. By integrating the Gilbert-Elliot model into the training process, we enhance the model's ability to generalize in real-world network conditions. Extensive evaluations show that our approach outperforms traditional and deep learning-based methods in terms of compression performance and stability under diverse packet loss, offering robust and efficient progressive transmission even in challenging environments. Code is available at https://github.com/NJUVISION/LossResilientLIC. △ Less

Submitted 19 January, 2025; originally announced January 2025.

Comments: Accepted as a poster presentation at AAAI 2025

arXiv:2501.07246 [pdf, other]

Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model

Authors: Ziyang Ma, Zhuo Chen, Yuping Wang, Eng Siong Chng, Xie Chen

Abstract: Large Audio-Language Models (LALMs) have demonstrated remarkable performance in tasks involving audio perception and understanding, such as speech recognition and audio captioning. However, their reasoning capabilities - critical for solving complex real-world problems - remain underexplored. In this work, we conduct the first exploration into integrating Chain-of-Thought (CoT) reasoning into LALM… ▽ More Large Audio-Language Models (LALMs) have demonstrated remarkable performance in tasks involving audio perception and understanding, such as speech recognition and audio captioning. However, their reasoning capabilities - critical for solving complex real-world problems - remain underexplored. In this work, we conduct the first exploration into integrating Chain-of-Thought (CoT) reasoning into LALMs to enhance their reasoning ability across auditory modalities. We evaluate representative CoT methods, analyzing their performance in both information extraction and reasoning tasks across sound, music, and speech domains. Our findings reveal that CoT methods significantly improve performance on easy and medium tasks but encounter challenges with hard tasks, where reasoning chains can confuse the model rather than improve accuracy. Additionally, we identify a positive correlation between reasoning path length and accuracy, demonstrating the potential of scaling inference for advanced instruction-following and reasoning. This study not only highlights the promise of CoT in enhancing LALM reasoning capabilities but also identifies key limitations and provides actionable directions for future research. △ Less

Submitted 13 January, 2025; originally announced January 2025.

arXiv:2501.03592 [pdf, other]

A Value Mapping Virtual Staining Framework for Large-scale Histological Imaging

Authors: Junjia Wang, Bo Xiong, You Zhou, Xun Cao, Zhan Ma

Abstract: The emergence of virtual staining technology provides a rapid and efficient alternative for researchers in tissue pathology. It enables the utilization of unlabeled microscopic samples to generate virtual replicas of chemically stained histological slices, or facilitate the transformation of one staining type into another. The remarkable performance of generative networks, such as CycleGAN, offers… ▽ More The emergence of virtual staining technology provides a rapid and efficient alternative for researchers in tissue pathology. It enables the utilization of unlabeled microscopic samples to generate virtual replicas of chemically stained histological slices, or facilitate the transformation of one staining type into another. The remarkable performance of generative networks, such as CycleGAN, offers an unsupervised learning approach for virtual coloring, overcoming the limitations of high-quality paired data required in supervised learning. Nevertheless, large-scale color transformation necessitates processing large field-of-view images in patches, often resulting in significant boundary inconsistency and artifacts. Additionally, the transformation between different colorized modalities typically needs further efforts to modify loss functions and tune hyperparameters for independent training of networks. In this study, we introduce a general virtual staining framework that is adaptable to various conditions. We propose a loss function based on the value mapping constraint to ensure the accuracy of virtual coloring between different pathological modalities, termed the Value Mapping Generative Adversarial Network (VM-GAN). Meanwhile, we present a confidence-based tiling method to address the challenge of boundary inconsistency arising from patch-wise processing. Experimental results on diverse data with varying staining protocols demonstrate that our method achieves superior quantitative indicators and improved visual perception. △ Less

Submitted 7 January, 2025; originally announced January 2025.

arXiv:2501.01108 [pdf, other]

MuQ: Self-Supervised Music Representation Learning with Mel Residual Vector Quantization

Authors: Haina Zhu, Yizhi Zhou, Hangting Chen, Jianwei Yu, Ziyang Ma, Rongzhi Gu, Yi Luo, Wei Tan, Xie Chen

Abstract: Recent years have witnessed the success of foundation models pre-trained with self-supervised learning (SSL) in various music informatics understanding tasks, including music tagging, instrument classification, key detection, and more. In this paper, we propose a self-supervised music representation learning model for music understanding. Distinguished from previous studies adopting random project… ▽ More Recent years have witnessed the success of foundation models pre-trained with self-supervised learning (SSL) in various music informatics understanding tasks, including music tagging, instrument classification, key detection, and more. In this paper, we propose a self-supervised music representation learning model for music understanding. Distinguished from previous studies adopting random projection or existing neural codec, the proposed model, named MuQ, is trained to predict tokens generated by Mel Residual Vector Quantization (Mel-RVQ). Our Mel-RVQ utilizes residual linear projection structure for Mel spectrum quantization to enhance the stability and efficiency of target extraction and lead to better performance. Experiments in a large variety of downstream tasks demonstrate that MuQ outperforms previous self-supervised music representation models with only 0.9K hours of open-source pre-training data. Scaling up the data to over 160K hours and adopting iterative training consistently improve the model performance. To further validate the strength of our model, we present MuQ-MuLan, a joint music-text embedding model based on contrastive learning, which achieves state-of-the-art performance in the zero-shot music tagging task on the MagnaTagATune dataset. Code and checkpoints are open source in https://github.com/tencent-ailab/MuQ. △ Less

Submitted 3 January, 2025; v1 submitted 2 January, 2025; originally announced January 2025.

arXiv:2501.00009 [pdf, other]

Model-Driven Deep Neural Network for Enhanced AoA Estimation Using 5G gNB

Authors: Shengheng Liu, Xingkang Li, Zihuan Mao, Peng Liu, Yongming Huang

Abstract: High-accuracy positioning has become a fundamental enabler for intelligent connected devices. Nevertheless, the present wireless networks still rely on model-driven approaches to achieve positioning functionality, which are susceptible to performance degradation in practical scenarios, primarily due to hardware impairments. Integrating artificial intelligence into the positioning framework present… ▽ More High-accuracy positioning has become a fundamental enabler for intelligent connected devices. Nevertheless, the present wireless networks still rely on model-driven approaches to achieve positioning functionality, which are susceptible to performance degradation in practical scenarios, primarily due to hardware impairments. Integrating artificial intelligence into the positioning framework presents a promising solution to revolutionize the accuracy and robustness of location-based services. In this study, we address this challenge by reformulating the problem of angle-of-arrival (AoA) estimation into image reconstruction of spatial spectrum. To this end, we design a model-driven deep neural network (MoD-DNN), which can automatically calibrate the angular-dependent phase error. The proposed MoD-DNN approach employs an iterative optimization scheme between a convolutional neural network and a sparse conjugate gradient algorithm. Simulation and experimental results are presented to demonstrate the effectiveness of the proposed method in enhancing spectrum calibration and AoA estimation. △ Less

Submitted 9 December, 2024; originally announced January 2025.

Comments: Presented at AAAI 2024 (Main Technical Track)

arXiv:2412.20675 [pdf]

Improved ICNN-LSTM Model Classification Based on Attitude Sensor Data for Hazardous State Assessment of Magnetic Adhesion Climbing Wall Robots

Authors: Zhen Ma, He Xu, Jielong Dou, Yi Qin, Xueyu Zhang

Abstract: Magnetic adhesion tracked climbing robots are widely utilized in high-altitude inspection, welding, and cleaning tasks due to their ability to perform various operations against gravity on vertical or inclined walls. However, during operation, the robot may experience overturning torque caused by its own weight and load, which can lead to the detachment of magnetic plates and subsequently pose saf… ▽ More Magnetic adhesion tracked climbing robots are widely utilized in high-altitude inspection, welding, and cleaning tasks due to their ability to perform various operations against gravity on vertical or inclined walls. However, during operation, the robot may experience overturning torque caused by its own weight and load, which can lead to the detachment of magnetic plates and subsequently pose safety risks. This paper proposes an improved ICNN-LSTM network classification method based on Micro-Electro-Mechanical Systems (MEMS) attitude sensor data for real-time monitoring and assessment of hazardous states in magnetic adhesion tracked climbing robots. Firstly, a data acquisition strategy for attitude sensors capable of capturing minute vibrations is designed. Secondly, a feature extraction and classification model combining an Improved Convolutional Neural Network (ICNN) with a Long Short-Term Memory (LSTM) network is proposed. Experimental validation demonstrates that the proposed minute vibration sensing method achieves significant results, and the proposed classification model consistently exhibits high accuracy compared to other models. The research findings provide effective technical support for the safe operation of climbing robots △ Less

Submitted 29 December, 2024; originally announced December 2024.

Comments: 20 pages, 8 figures, manuscript for Journal of Autonomous Robots

MSC Class: 68T05; 68T07; 68T40 ACM Class: I.2.6; I.2.7; K.6.7

arXiv:2412.19841 [pdf]

FlameGS: Reconstruct flame light field via Gaussian Splatting

Authors: Yunhao Shui, Fuhao Zhang, Can Gao, Hao Xue, Zhiyin Ma, Gang Xun, Xuesong Li

Abstract: To address the time-consuming and computationally intensive issues of traditional ART algorithms for flame combustion diagnosis, inspired by flame simulation technology, we propose a novel representation method for flames. By modeling the luminous process of flames and utilizing 2D projection images for supervision, our experimental validation shows that this model achieves an average structural s… ▽ More To address the time-consuming and computationally intensive issues of traditional ART algorithms for flame combustion diagnosis, inspired by flame simulation technology, we propose a novel representation method for flames. By modeling the luminous process of flames and utilizing 2D projection images for supervision, our experimental validation shows that this model achieves an average structural similarity index of 0.96 between actual images and predicted 2D projections, along with a Peak Signal-to-Noise Ratio of 39.05. Additionally, it saves approximately 34 times the computation time and about 10 times the memory compared to traditional algorithms. △ Less

Submitted 24 December, 2024; originally announced December 2024.

arXiv:2412.16619 [pdf, ps, other]

Topology-Aware 3D Gaussian Splatting: Leveraging Persistent Homology for Optimized Structural Integrity

Authors: Tianqi Shen, Shaohua Liu, Jiaqi Feng, Ziye Ma, Ning An

Abstract: Gaussian Splatting (GS) has emerged as a crucial technique for representing discrete volumetric radiance fields. It leverages unique parametrization to mitigate computational demands in scene optimization. This work introduces Topology-Aware 3D Gaussian Splatting (Topology-GS), which addresses two key limitations in current approaches: compromised pixel-level structural integrity due to incomplete… ▽ More Gaussian Splatting (GS) has emerged as a crucial technique for representing discrete volumetric radiance fields. It leverages unique parametrization to mitigate computational demands in scene optimization. This work introduces Topology-Aware 3D Gaussian Splatting (Topology-GS), which addresses two key limitations in current approaches: compromised pixel-level structural integrity due to incomplete initial geometric coverage, and inadequate feature-level integrity from insufficient topological constraints during optimization. To overcome these limitations, Topology-GS incorporates a novel interpolation strategy, Local Persistent Voronoi Interpolation (LPVI), and a topology-focused regularization term based on persistent barcodes, named PersLoss. LPVI utilizes persistent homology to guide adaptive interpolation, enhancing point coverage in low-curvature areas while preserving topological structure. PersLoss aligns the visual perceptual similarity of rendered images with ground truth by constraining distances between their topological features. Comprehensive experiments on three novel-view synthesis benchmarks demonstrate that Topology-GS outperforms existing methods in terms of PSNR, SSIM, and LPIPS metrics, while maintaining efficient memory usage. This study pioneers the integration of topology with 3D-GS, laying the groundwork for future research in this area. △ Less

Submitted 14 June, 2025; v1 submitted 21 December, 2024; originally announced December 2024.

Comments: 18 pages, 12 figures, includes appendix. Accepted as oral presentation at AAAI 2025 (Conference on Artificial Intelligence). Official conference version: 10 pages, 6 figures. ISBN (Print): 978-1-57735-897-8. Conference website: https://aaai.org/conference/aaai/aaai-25/

MSC Class: 55N31; 68T45 ACM Class: I.2.10; I.3.7; I.4.5

arXiv:2412.16102 [pdf, other]

Interleaved Speech-Text Language Models are Simple Streaming Text to Speech Synthesizers

Authors: Yifan Yang, Ziyang Ma, Shujie Liu, Jinyu Li, Hui Wang, Lingwei Meng, Haiyang Sun, Yuzhe Liang, Ruiyang Xu, Yuxuan Hu, Yan Lu, Rui Zhao, Xie Chen

Abstract: This paper introduces Interleaved Speech-Text Language Model (IST-LM) for streaming zero-shot Text-to-Speech (TTS). Unlike many previous approaches, IST-LM is directly trained on interleaved sequences of text and speech tokens with a fixed ratio, eliminating the need for additional efforts in duration prediction and grapheme-to-phoneme alignment. The ratio of text chunk size to speech chunk size i… ▽ More This paper introduces Interleaved Speech-Text Language Model (IST-LM) for streaming zero-shot Text-to-Speech (TTS). Unlike many previous approaches, IST-LM is directly trained on interleaved sequences of text and speech tokens with a fixed ratio, eliminating the need for additional efforts in duration prediction and grapheme-to-phoneme alignment. The ratio of text chunk size to speech chunk size is crucial for the performance of IST-LM. To explore this, we conducted a comprehensive series of statistical analyses on the training data and performed correlation analysis with the final performance, uncovering several key factors: 1) the distance between speech tokens and their corresponding text tokens, 2) the number of future text tokens accessible to each speech token, and 3) the frequency of speech tokens precedes their corresponding text tokens. Experimental results demonstrate how to achieve an optimal streaming TTS system without complicated engineering optimization, which has a limited gap with the non-streaming system. IST-LM is conceptually simple and empirically powerful, paving the way for streaming TTS with minimal overhead while largely maintaining performance, showcasing broad prospects coupled with real-time text stream from LLMs. △ Less

Submitted 23 December, 2024; v1 submitted 20 December, 2024; originally announced December 2024.

Comments: Submitted to ICME 2025

arXiv:2412.15649 [pdf, other]

SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training

Authors: Wenxi Chen, Ziyang Ma, Ruiqi Yan, Yuzhe Liang, Xiquan Li, Ruiyang Xu, Zhikang Niu, Yanqiao Zhu, Yifan Yang, Zhanxun Liu, Kai Yu, Yuxuan Hu, Jinyu Li, Yan Lu, Shujie Liu, Xie Chen

Abstract: Recent advancements highlight the potential of end-to-end real-time spoken dialogue systems, showcasing their low latency and high quality. In this paper, we introduce SLAM-Omni, a timbre-controllable, end-to-end voice interaction system with single-stage training. SLAM-Omni achieves zero-shot timbre control by modeling spoken language with semantic tokens and decoupling speaker information to a v… ▽ More Recent advancements highlight the potential of end-to-end real-time spoken dialogue systems, showcasing their low latency and high quality. In this paper, we introduce SLAM-Omni, a timbre-controllable, end-to-end voice interaction system with single-stage training. SLAM-Omni achieves zero-shot timbre control by modeling spoken language with semantic tokens and decoupling speaker information to a vocoder. By predicting grouped speech semantic tokens at each step, our method significantly reduces the sequence length of audio tokens, accelerating both training and inference. Additionally, we propose historical text prompting to compress dialogue history, facilitating efficient multi-round interactions. Comprehensive evaluations reveal that SLAM-Omni outperforms prior models of similar scale, requiring only 15 hours of training on 4 GPUs with limited data. Notably, it is the first spoken dialogue system to achieve competitive performance with a single-stage training approach, eliminating the need for pre-training on TTS or ASR tasks. Further experiments validate its multilingual and multi-turn dialogue capabilities on larger datasets. △ Less

Submitted 20 December, 2024; originally announced December 2024.

arXiv:2412.10644 [pdf, other]

Model-driven deep neural network for enhanced direction finding with commodity 5G gNodeB

Authors: Shengheng Liu, Zihuan Mao, Xingkang Li, Mengguan Pan, Peng Liu, Yongming Huang, Xiaohu You

Abstract: Pervasive and high-accuracy positioning has become increasingly important as a fundamental enabler for intelligent connected devices in mobile networks. Nevertheless, current wireless networks heavily rely on pure model-driven techniques to achieve positioning functionality, often succumbing to performance deterioration due to hardware impairments in practical scenarios. Here we reformulate the di… ▽ More Pervasive and high-accuracy positioning has become increasingly important as a fundamental enabler for intelligent connected devices in mobile networks. Nevertheless, current wireless networks heavily rely on pure model-driven techniques to achieve positioning functionality, often succumbing to performance deterioration due to hardware impairments in practical scenarios. Here we reformulate the direction finding or angle-of-arrival (AoA) estimation problem as an image recovery task of the spatial spectrum and propose a new model-driven deep neural network (MoD-DNN) framework. The proposed MoD-DNN scheme comprises three modules: a multi-task autoencoder-based beamformer, a coarray spectrum generation module, and a model-driven deep learning-based spatial spectrum reconstruction module. Our technique enables automatic calibration of angular-dependent phase error thereby enhancing the resilience of direction-finding precision against realistic system non-idealities. We validate the proposed scheme both using numerical simulations and field tests. The results show that the proposed MoD-DNN framework enables effective spectrum calibration and accurate AoA estimation. To the best of our knowledge, this study marks the first successful demonstration of hybrid data-and-model-driven direction finding utilizing readily available commodity 5G gNodeB. △ Less

Submitted 13 December, 2024; originally announced December 2024.

Comments: To appear in ACM TOSN. A preliminary version of this article was presented at the AAAI'2024 Main Technical Track

arXiv:2412.00721

A Comparative Study of LLM-based ASR and Whisper in Low Resource and Code Switching Scenario

Authors: Zheshu Song, Ziyang Ma, Yifan Yang, Jianheng Zhuo, Xie Chen

Abstract: Large Language Models (LLMs) have showcased exceptional performance across diverse NLP tasks, and their integration with speech encoder is rapidly emerging as a dominant trend in the Automatic Speech Recognition (ASR) field. Previous works mainly concentrated on leveraging LLMs for speech recognition in English and Chinese. However, their potential for addressing speech recognition challenges in l… ▽ More Large Language Models (LLMs) have showcased exceptional performance across diverse NLP tasks, and their integration with speech encoder is rapidly emerging as a dominant trend in the Automatic Speech Recognition (ASR) field. Previous works mainly concentrated on leveraging LLMs for speech recognition in English and Chinese. However, their potential for addressing speech recognition challenges in low resource settings remains underexplored. Hence, in this work, we aim to explore the capability of LLMs in low resource ASR and Mandarin-English code switching ASR. We also evaluate and compare the recognition performance of LLM-based ASR systems against Whisper model. Extensive experiments demonstrate that LLM-based ASR yields a relative gain of 12.8\% over the Whisper model in low resource ASR while Whisper performs better in Mandarin-English code switching ASR. We hope that this study could shed light on ASR for low resource scenarios. △ Less

Submitted 4 December, 2024; v1 submitted 1 December, 2024; originally announced December 2024.

Comments: This work hasn't been finished yet

arXiv:2411.17100 [pdf, other]

k2SSL: A Faster and Better Framework for Self-Supervised Speech Representation Learning

Authors: Yifan Yang, Jianheng Zhuo, Zengrui Jin, Ziyang Ma, Xiaoyu Yang, Zengwei Yao, Liyong Guo, Wei Kang, Fangjun Kuang, Long Lin, Daniel Povey, Xie Chen

Abstract: Self-supervised learning (SSL) has achieved great success in speech-related tasks. While Transformer and Conformer architectures have dominated SSL backbones, encoders like Zipformer, which excel in automatic speech recognition (ASR), remain unexplored in SSL. Concurrently, inefficiencies in data processing within existing SSL training frameworks, such as fairseq, pose challenges in managing the g… ▽ More Self-supervised learning (SSL) has achieved great success in speech-related tasks. While Transformer and Conformer architectures have dominated SSL backbones, encoders like Zipformer, which excel in automatic speech recognition (ASR), remain unexplored in SSL. Concurrently, inefficiencies in data processing within existing SSL training frameworks, such as fairseq, pose challenges in managing the growing volumes of training data. To address these issues, we propose k2SSL, an open-source framework that offers faster, more memory-efficient, and better-performing self-supervised speech representation learning, focusing on downstream ASR tasks. The optimized HuBERT and proposed Zipformer-based SSL systems exhibit substantial reductions in both training time and memory usage during SSL training. Experiments on LibriSpeech demonstrate that Zipformer Base significantly outperforms HuBERT and WavLM, achieving up to a 34.8% relative WER reduction compared to HuBERT Base after fine-tuning, along with a 3.5x pre-training speedup in GPU hours. When scaled to 60k hours of LibriLight data, Zipformer Large exhibits remarkable efficiency, matching HuBERT Large's performance while requiring only 5/8 pre-training steps. △ Less

Submitted 22 March, 2025; v1 submitted 25 November, 2024; originally announced November 2024.

Comments: Accepted in ICME 2025

Showing 1–50 of 333 results for author: Ma, Z