
Towards AI-Native RAN: An Operator's Perspective of 6G Day 1 Standardization

Authors: Nan Li, Qi Sun, Lehan Wang, Xiaofei Xu, Jinri Huang, Chunhui Liu, Jing Gao, Yuhong Huang, Chih-Lin I

Abstract: Artificial Intelligence/Machine Learning (AI/ML) has become the most certain and prominent feature of 6G mobile networks. Unlike 5G, where AI/ML was not natively integrated but rather an add-on feature over the existing architecture, 6G shall incorporate AI from the outset to address its complexity and support ubiquitous AI applications. Based on our extensive mobile network operation and standardization experience from 2G to 5G, this paper explores the design and standardization principles of AI-Native radio access networks (RAN) for 6G, with a particular focus on its critical Day 1 architecture, functionalities and capabilities. We investigate the framework of AI-Native RAN and present its three essential capabilities to shed some light on the standardization direction; namely, AI-driven RAN processing/optimization/automation, reliable AI lifecycle management (LCM), and AI-as-a-Service (AIaaS) provisioning. The standardization of AI-Native RAN, in particular the Day 1 features, including an AI-Native 6G RAN architecture, is proposed. For validation, a large-scale field trial with over 5000 5G-A base stations has been built and has delivered significant improvements in average air interface latency, root cause identification, and network energy consumption with the proposed architecture and the supporting AI functions. This paper aims to provide a Day 1 framework for 6G AI-Native RAN standardization design, balancing technical innovation with practical deployment.

Submitted 11 July, 2025; originally announced July 2025.

arXiv:2507.07105

4KAgent: Agentic Any Image to 4K Super-Resolution

Authors: Yushen Zuo, Qi Zheng, Mingyang Wu, Xinrui Jiang, Renjie Li, Jian Wang, Yide Zhang, Gengchen Mai, Lihong V. Wang, James Zou, Xiaoyu Wang, Ming-Hsuan Yang, Zhengzhong Tu

Abstract: We present 4KAgent, a unified agentic super-resolution generalist system designed to universally upscale any image to 4K resolution (and even higher, if applied iteratively). Our system can transform images from extremely low resolutions with severe degradations, for example, highly distorted inputs at 256x256, into crystal-clear, photorealistic 4K outputs. 4KAgent comprises three core components: (1) Profiling, a module that customizes the 4KAgent pipeline based on bespoke use cases; (2) A Perception Agent, which leverages vision-language models alongside image quality assessment experts to analyze the input image and make a tailored restoration plan; and (3) A Restoration Agent, which executes the plan, following a recursive execution-reflection paradigm, guided by a quality-driven mixture-of-expert policy to select the optimal output for each step. Additionally, 4KAgent embeds a specialized face restoration pipeline, significantly enhancing facial details in portrait and selfie photos. We rigorously evaluate our 4KAgent across 11 distinct task categories encompassing a total of 26 diverse benchmarks, setting new state-of-the-art on a broad spectrum of imaging domains. Our evaluations cover natural images, portrait photos, AI-generated content, satellite imagery, fluorescence microscopy, and medical imaging like fundoscopy, ultrasound, and X-ray, demonstrating superior performance in terms of both perceptual (e.g., NIQE, MUSIQ) and fidelity (e.g., PSNR) metrics. By establishing a novel agentic paradigm for low-level vision tasks, we aim to catalyze broader interest and innovation within vision-centric autonomous agents across diverse research communities. We will release all the code, models, and results at: https://4kagent.github.io.

Submitted 9 July, 2025; originally announced July 2025.

Comments: Project page: https://4kagent.github.io

arXiv:2507.06256

Attacker's Noise Can Manipulate Your Audio-based LLM in the Real World

Authors: Vinu Sankar Sadasivan, Soheil Feizi, Rajiv Mathews, Lun Wang

Abstract: This paper investigates the real-world vulnerabilities of audio-based large language models (ALLMs), such as Qwen2-Audio. We first demonstrate that an adversary can craft stealthy audio perturbations to manipulate ALLMs into exhibiting specific targeted behaviors, such as eliciting responses to wake-keywords (e.g., "Hey Qwen"), or triggering harmful behaviors (e.g., "Change my calendar event"). Subsequently, we show that playing adversarial background noise during user interaction with the ALLMs can significantly degrade the response quality. Crucially, our research illustrates the scalability of these attacks to real-world scenarios, impacting other innocent users when these adversarial noises are played through the air. Further, we discuss the transferability of the attack and potential defensive measures.

Submitted 7 July, 2025; originally announced July 2025.

arXiv:2507.05900

Stable Acoustic Relay Assignment with High Throughput via Laser Chaos-based Reinforcement Learning

Authors: Zengjing Chen, Lu Wang, Chengzhi Xing

Abstract: This study addresses the problem of stable acoustic relay assignment in an underwater acoustic network. Unlike the objectives of most existing literature, two distinct objectives, namely classical stable arrangement and ambiguous stable arrangement, are considered. To achieve these stable arrangements, a laser chaos-based multi-processing learning (LC-ML) method is introduced to efficiently obtain high throughput and rapidly attain stability. In order to sufficiently explore the relay's decision-making, this method uses random numbers generated by laser chaos to learn the assignment of relays to multiple source nodes. This study finds that the laser chaos-based random numbers and multi-processing in the exchange process contribute to higher throughput and strong adaptability to environmental changes over time. Meanwhile, ambiguous cognitions result in a stable configuration with less volatility compared to accurate ones. This provides a practical and useful method and can be the basis for relay selection in complex underwater environments.

Submitted 8 July, 2025; originally announced July 2025.

arXiv:2507.05317

PWD: Prior-Guided and Wavelet-Enhanced Diffusion Model for Limited-Angle CT

Authors: Yi Liu, Yiyang Wen, Zekun Zhou, Junqi Ma, Linghang Wang, Yucheng Yao, Liu Shi, Qiegen Liu

Abstract: Generative diffusion models have received increasing attention in medical imaging, particularly in limited-angle computed tomography (LACT). Standard diffusion models achieve high-quality image reconstruction but require a large number of sampling steps during inference, resulting in substantial computational overhead. Although skip-sampling strategies have been proposed to improve efficiency, they often lead to loss of fine structural details. To address this issue, we propose PWD, a fast-sampling diffusion model for LACT reconstruction that combines prior information embedding with wavelet feature fusion. PWD enables efficient sampling while preserving reconstruction fidelity in LACT, and effectively mitigates the degradation typically introduced by skip-sampling. Specifically, during the training phase, PWD maps the distribution of LACT images to that of fully sampled target images, enabling the model to learn structural correspondences between them. During inference, the LACT image serves as an explicit prior to guide the sampling trajectory, allowing for high-quality reconstruction with significantly fewer steps. In addition, PWD performs multi-scale feature fusion in the wavelet domain, effectively enhancing the reconstruction of fine details by leveraging both low-frequency and high-frequency information. Quantitative and qualitative evaluations on clinical dental arch CBCT and periapical datasets demonstrate that PWD outperforms existing methods under the same sampling condition. Using only 50 sampling steps, PWD achieves at least 1.7 dB improvement in PSNR and 10% gain in SSIM.

Submitted 10 July, 2025; v1 submitted 30 June, 2025; originally announced July 2025.

arXiv:2507.02666

ASDA: Audio Spectrogram Differential Attention Mechanism for Self-Supervised Representation Learning

Authors: Junyu Wang, Tianrui Wang, Meng Ge, Longbiao Wang, Jianwu Dang

Abstract: In recent advancements in audio self-supervised representation learning, the standard Transformer architecture has emerged as the predominant approach, yet its attention mechanism often allocates a portion of attention weights to irrelevant information, potentially impairing the model's discriminative ability. To address this, we introduce a differential attention mechanism, which effectively mitigates ineffective attention allocation through the integration of dual-softmax operations and appropriately tuned differential coefficients. Experimental results demonstrate that our ASDA model achieves state-of-the-art (SOTA) performance across multiple benchmarks, including audio classification (49.0% mAP on AS-2M, 41.5% mAP on AS20K), keyword spotting (98.3% accuracy on SPC-2), and environmental sound classification (96.1% accuracy on ESC-50). These results highlight ASDA's effectiveness in audio tasks, paving the way for broader applications.

Submitted 3 July, 2025; originally announced July 2025.

Comments: Accepted at Interspeech 2025
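
The "dual-softmax operations" and "differential coefficients" mentioned in the ASDA abstract can be illustrated with a minimal sketch. It assumes the general differential-attention form in which a second softmax attention map, scaled by a coefficient, is subtracted from the first before weighting the values. The function names, the tiny inputs, and the fixed coefficient `lam` are illustrative assumptions and not the authors' implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(q, k):
    """Row-wise softmax of scaled dot products between queries and keys."""
    scale = 1.0 / math.sqrt(len(q[0]))
    return [
        softmax([scale * sum(a * b for a, b in zip(qi, kj)) for kj in k])
        for qi in q
    ]


def differential_attention(q1, k1, q2, k2, v, lam):
    """Subtract a second attention map (scaled by lam) from the first.

    Returns the differential weights and the weighted sum of values.
    `lam` is a hypothetical fixed coefficient chosen for illustration.
    """
    a1 = attention_map(q1, k1)
    a2 = attention_map(q2, k2)
    diff = [[x - lam * y for x, y in zip(r1, r2)] for r1, r2 in zip(a1, a2)]
    out = [
        [sum(w * vj[c] for w, vj in zip(row, v)) for c in range(len(v[0]))]
        for row in diff
    ]
    return diff, out


# Toy usage: two tokens, two-dimensional projections.
q1 = [[1.0, 0.0], [0.0, 1.0]]
k1 = [[1.0, 0.0], [0.0, 1.0]]
q2 = [[0.5, 0.5], [0.5, 0.5]]
k2 = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
weights, output = differential_attention(q1, k1, q2, k2, values, lam=0.25)
```

Because each softmax row sums to one, every row of the differential weights sums to `1 - lam`; attention mass that both maps assign to the same positions is what gets cancelled.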

arXiv:2507.02584

Observer-Based Distributed Model Predictive Control for String-Stable Multi-vehicle Systems with Markovian Switching Topology

Authors: Wenwei Que, Yang Li, Lu Wang, Wentao Liu, Yougang Bian, Manjiang Hu, Yongfu Li

Abstract: Switching communication topologies can cause instability in vehicle platoons, as vehicle information may be lost during the dynamic switching process. This highlights the need to design a controller capable of maintaining the stability of vehicle platoons under dynamically changing topologies. However, capturing the dynamic characteristics of switching topologies and obtaining complete vehicle information for controller design while ensuring stability remains a significant challenge. In this study, we propose an observer-based distributed model predictive control (DMPC) method for vehicle platoons under directed Markovian switching topologies. Considering the stochastic nature of the switching topologies, we model the directed switching communication topologies using a continuous-time Markov chain. To obtain the leader vehicle's information for controller design, we develop a fully distributed adaptive observer that can quickly adapt to the randomly switching topologies, ensuring that the observed information is not affected by the dynamic topology switches. Additionally, a sufficient condition is derived to guarantee the mean-square stability of the observer. Furthermore, we construct the DMPC terminal update law based on the observer and formulate a string stability constraint based on the observed information. Numerical simulations demonstrate that our method can reduce tracking errors while ensuring string stability.

Submitted 3 July, 2025; originally announced July 2025.

Comments: 8 pages, 7 figures, conference
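
The abstract above models topology switching with a continuous-time Markov chain. As a rough illustration of what that means, the sketch below samples one switching path from a generator matrix: the chain holds in a topology for an exponentially distributed time and then jumps in proportion to the off-diagonal rates. The three-topology generator, the function name `simulate_ctmc`, and the time horizon are hypothetical values chosen for illustration, not taken from the paper.

```python
import random


def simulate_ctmc(generator, start, horizon, rng):
    """Sample one path of a continuous-time Markov chain.

    generator[i][j] (i != j) is the jump rate from topology i to j,
    and each diagonal entry is minus the sum of its row's other entries.
    Returns a list of (switch_time, topology_index) pairs.
    """
    t, state = 0.0, start
    path = [(0.0, start)]
    while True:
        exit_rate = -generator[state][state]
        if exit_rate <= 0.0:
            break  # absorbing topology: no further switches
        t += rng.expovariate(exit_rate)  # exponential holding time
        if t >= horizon:
            break
        others = [j for j in range(len(generator)) if j != state]
        rates = [generator[state][j] for j in others]
        state = rng.choices(others, weights=rates)[0]
        path.append((t, state))
    return path


# Hypothetical generator for three communication topologies.
Q = [
    [-1.0, 0.6, 0.4],
    [0.5, -1.5, 1.0],
    [0.3, 0.7, -1.0],
]
switches = simulate_ctmc(Q, start=0, horizon=10.0, rng=random.Random(0))
```

Each pair in `switches` marks the instant at which the active topology changes, which is the kind of signal a controller or observer under Markovian switching would have to tolerate.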

arXiv:2507.01428

DiffMark: Diffusion-based Robust Watermark Against Deepfakes

Authors: Chen Sun, Haiyang Sun, Zhiqing Guo, Yunfeng Diao, Liejun Wang, Dan Ma, Gaobo Yang, Keqin Li

Abstract: Deepfakes pose significant security and privacy threats through malicious facial manipulations. While robust watermarking can aid in authenticity verification and source tracking, existing methods often lack sufficient robustness against Deepfake manipulations. Diffusion models have demonstrated remarkable performance in image generation, enabling the seamless fusion of a watermark with an image during generation. In this study, we propose a novel robust watermarking framework based on a diffusion model, called DiffMark. By modifying the training and sampling scheme, we take the facial image and watermark as conditions to guide the diffusion model to progressively denoise and generate the corresponding watermarked image. In the construction of the facial condition, we weight the facial image by a timestep-dependent factor that gradually reduces the guidance intensity as the noise decreases, thus better adapting to the sampling process of the diffusion model. To achieve the fusion of the watermark condition, we introduce a cross information fusion (CIF) module that leverages a learnable embedding table to adaptively extract watermark features and integrates them with image features via cross-attention. To enhance the robustness of the watermark against Deepfake manipulations, we integrate a frozen autoencoder during the training phase to simulate Deepfake manipulations. Additionally, we introduce Deepfake-resistant guidance that employs a specific Deepfake model to adversarially guide the diffusion sampling process to generate more robust watermarked images. Experimental results demonstrate the effectiveness of the proposed DiffMark on typical Deepfakes. Our code will be available at https://github.com/vpsg-research/DiffMark.

Submitted 2 July, 2025; originally announced July 2025.

arXiv:2506.23568

A Fast and Accurate 3-D Reconstruction Algorithm for Near-Range Microwave Imaging with Handheld Synthetic Aperture Radar

Authors: Lei Wang, Xianxun Yao, Tiancheng Song, Guolin Sun

Abstract: The design of image reconstruction algorithms for near-range handheld synthetic aperture radar (SAR) systems has gained increasing popularity due to the promising performance of portable millimeter-wave (MMW) imaging devices in various application fields. Time domain imaging algorithms including the backprojection algorithm (BPA) and the Kirchhoff migration algorithm (KMA) are widely adopted due to their direct applicability to arbitrary scan trajectories. However, they suffer from time complexity issues that hinder their practical application. Wavenumber domain algorithms greatly improve the computational efficiency, but most of them are restricted to specific array topologies. Based on the factorization techniques adopted in far-field synthetic aperture radar imaging, the time domain fast factorized backprojection algorithm for handheld synthetic aperture radar (HHFFBPA) is proposed. The local spectral properties of the radar images for handheld systems are analyzed and analytical spectrum compression techniques are derived to realize efficient sampling of the subimages. Validated through numerical simulations and experiments, HHFFBPA achieves fast and accurate 3-D imaging for handheld synthetic aperture radar systems with arbitrary trajectories.

Submitted 30 June, 2025; originally announced June 2025.

arXiv:2506.19774

Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation

Authors: Jun Wang, Xijuan Zeng, Chunyu Qiang, Ruilong Chen, Shiyao Wang, Le Wang, Wangjing Zhou, Pengfei Cai, Jiahui Zhao, Nan Li, Zihan Li, Yuzhe Liang, Xiaopeng Wang, Haorui Zheng, Ming Wen, Kang Yin, Yiran Wang, Nan Li, Feng Deng, Liang Dong, Chen Zhang, Di Zhang, Kun Gai

Abstract: We propose Kling-Foley, a large-scale multimodal Video-to-Audio generation model that synthesizes high-quality audio synchronized with video content. In Kling-Foley, we introduce multimodal diffusion transformers to model the interactions between video, audio, and text modalities, and combine them with a visual semantic representation module and an audio-visual synchronization module to enhance alignment capabilities. Specifically, these modules align video conditions with latent audio elements at the frame level, thereby improving semantic alignment and audio-visual synchronization. Together with text conditions, this integrated approach enables precise generation of video-matching sound effects. In addition, we propose a universal latent audio codec that can achieve high-quality modeling in various scenarios such as sound effects, speech, singing, and music. We employ a stereo rendering method that imbues synthesized audio with a spatial presence. At the same time, to make up for the incomplete types and annotations of the open-source benchmark, we also open-source an industrial-level benchmark, Kling-Audio-Eval. Our experiments show that Kling-Foley trained with the flow matching objective achieves new audio-visual SOTA performance among public models in terms of distribution matching, semantic alignment, temporal alignment and audio quality.

Submitted 24 June, 2025; originally announced June 2025.

arXiv:2506.16961

Reversing Flow for Image Restoration

Authors: Haina Qin, Wenyang Luo, Libin Wang, Dandan Zheng, Jingdong Chen, Ming Yang, Bing Li, Weiming Hu

Abstract: Image restoration aims to recover high-quality (HQ) images from degraded low-quality (LQ) ones by reversing the effects of degradation. Existing generative models for image restoration, including diffusion and score-based models, often treat the degradation process as a stochastic transformation, which introduces inefficiency and complexity. In this work, we propose ResFlow, a novel image restoration framework that models the degradation process as a deterministic path using continuous normalizing flows. ResFlow augments the degradation process with an auxiliary process that disambiguates the uncertainty in HQ prediction to enable reversible modeling of the degradation process. ResFlow adopts entropy-preserving flow paths and learns the augmented degradation flow by matching the velocity field. ResFlow significantly improves the performance and speed of image restoration, completing the task in fewer than four sampling steps. Extensive experiments demonstrate that ResFlow achieves state-of-the-art results across various image restoration benchmarks, offering a practical and efficient solution for real-world applications.

Submitted 20 June, 2025; originally announced June 2025.

Comments: CVPR2025 Final Version; Corresponding Author: Bing Li

MSC Class: 68U10; ACM Class: I.4.4

arXiv:2506.15125

Fiber Signal Denoising Algorithm using Hybrid Deep Learning Networks

Authors: Linlin Wang, Wei Wang, Dezhao Wang, Shanwen Wang

Abstract: As optical fiber-based distributed acoustic sensing (DAS) systems become more widely applicable, effective signal processing and analysis approaches are needed to promote their adoption in the field of intelligent transportation systems (ITS). This paper presents a signal denoising algorithm using a hybrid deep-learning network (HDLNet). Without annotated data and time-consuming labeling, this self-supervised network runs in parallel, combining an autoencoder for denoising (DAE) and a long short-term memory (LSTM) for sequential processing. Additionally, a line-by-line matching algorithm for vehicle detection and tracking is introduced, thus realizing the complete processing of fiber signal denoising and feature extraction. Experiments were carried out on a self-established real highway tunnel dataset, showing that our proposed hybrid network yields more satisfactory denoising performance than Spatial-domain DAE.

Submitted 17 June, 2025; originally announced June 2025.

Comments: 15 pages, 10 figures

arXiv:2506.14494

Fronthaul-Aware User-Centric Generalized Cell-Free Massive MIMO Systems

Authors: Zahra Mobini, Ahmet Hasim Gokceoglu, Li Wang, Gunnar Peters, Hien Quoc Ngo

Abstract: We consider fronthaul-limited generalized zero-forcing-based cell-free massive multiple-input multiple-output (CF-mMIMO) systems with multiple-antenna users and multiple-antenna access points (APs) relying on both cooperative beamforming (CB) and user-centric (UC) clustering. The proposed framework is very general and can be reduced to different special cases, such as pure CB/pure UC clustering, or fully centralized CB/fully distributed beamforming. We comprehensively analyze the spectral efficiency (SE) performance of the system wherein the users use the minimum mean-squared error-based successive interference cancellation (MMSE-SIC) scheme to detect the desired signals. Specifically, we formulate an optimization problem for the user association and power control for maximizing the sum SE. The formulated problem is under per-AP transmit power and fronthaul constraints, and is based on only long-term channel state information (CSI). The challenging formulated problem is transformed into a tractable form and a novel algorithm is proposed to solve it using the minorization-maximization (MM) technique. We analyze the trade-offs provided by the CF-mMIMO system with different numbers of CB clusters, hence highlighting the importance of the appropriate choice of CB design for different system setups. Numerical results show that for the centralized CB, the proposed power optimization provides nearly 59% improvement in the average sum SE over the heuristic approach, and 312% improvement when distributed beamforming is employed.

Submitted 17 June, 2025; originally announced June 2025.

arXiv:2506.12463

Adding links wisely: how an influencer seeks for leadership in opinion dynamics?

Authors: Lingfei Wang, Yu Xing, Yuhao Yi, Ming Cao, Karl H. Johansson

Abstract: This paper investigates the problem of leadership development for an external influencer using the Friedkin-Johnsen (FJ) opinion dynamics model, where the influencer is modeled as a fully stubborn agent and leadership is quantified by social power. The influencer seeks to maximize her social power by strategically adding a limited number of links to regular agents. This optimization problem is shown to be equivalent to maximizing the absorbing probability to the influencer in an augmented Markov chain. The resulting objective function is both monotone and submodular, enabling the use of a greedy algorithm to compute an approximate solution. To handle large-scale networks efficiently, a random walk sampling over the Markov chain is employed to reduce computational complexity. Analytical characterizations of the solution are provided for both low and high stubbornness of regular agents. Specific network topologies are also examined: for complete graphs with rank-one weight matrices, the problem reduces to a hyperbolic 0-1 programming problem, which is solvable in polynomial time; for symmetric ring graphs with circulant weight matrices and uniform agent stubbornness, the optimal strategy involves selecting agents that are sufficiently dispersed across the network. Numerical simulations are presented for illustration.

Submitted 14 June, 2025; originally announced June 2025.
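
The abstract above relies on a greedy algorithm for a monotone submodular objective under a budget on the number of added links. The sketch below shows that generic greedy scheme only. It swaps in a toy set-coverage objective as a stand-in for the paper's absorbing-probability objective, and the names `greedy_maximize`, `coverage`, and `covered_count` are hypothetical. For monotone submodular objectives under a cardinality constraint, this greedy rule is known to achieve at least a (1 - 1/e) fraction of the optimal value.

```python
def greedy_maximize(candidates, objective, budget):
    """Pick up to `budget` items, each time adding the largest marginal gain."""
    chosen = []
    for _ in range(budget):
        base = objective(chosen)
        best, best_gain = None, 0.0
        for item in candidates:
            if item in chosen:
                continue
            gain = objective(chosen + [item]) - base
            if gain > best_gain:
                best, best_gain = item, gain
        if best is None:
            break  # no remaining item improves the objective
        chosen.append(best)
    return chosen


# Toy stand-in objective: number of elements covered by the chosen sets.
# Coverage functions are monotone and submodular.
coverage = {
    "a": {1, 2, 3},
    "b": {3, 4},
    "c": {4, 5, 6, 7},
    "d": {1, 7},
}


def covered_count(selection):
    return len(set().union(*(coverage[name] for name in selection)))


picked = greedy_maximize(list(coverage), covered_count, budget=2)
```

In the setting of the abstract, the candidates would be regular agents to link to and the objective would be the influencer's social power, which the paper evaluates through the augmented Markov chain rather than a coverage count.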

arXiv:2506.12006

crossMoDA Challenge: Evolution of Cross-Modality Domain Adaptation Techniques for Vestibular Schwannoma and Cochlea Segmentation from 2021 to 2023

Authors: Navodini Wijethilake, Reuben Dorent, Marina Ivory, Aaron Kujawa, Stefan Cornelissen, Patrick Langenhuizen, Mohamed Okasha, Anna Oviedova, Hexin Dong, Bogyeong Kang, Guillaume Sallé, Luyi Han, Ziyuan Zhao, Han Liu, Tao Yang, Shahad Hardan, Hussain Alasmawi, Santosh Sanjeev, Yuzhou Zhuang, Satoshi Kondo, Maria Baldeon Calisto, Shaikh Muhammad Uzair Noman, Cancan Chen, Ipek Oguz, Rongguo Zhang, et al. (14 additional authors not shown)

Abstract: The cross-Modality Domain Adaptation (crossMoDA) challenge series, initiated in 2021 in conjunction with the International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), focuses on unsupervised cross-modality segmentation, learning from contrast-enhanced T1 (ceT1) and transferring to T2 MRI. The task is an extreme example of domain shift chosen to serve as a meaningful and illustrative benchmark. From a clinical application perspective, it aims to automate Vestibular Schwannoma (VS) and cochlea segmentation on T2 scans for more cost-effective VS management. Over time, the challenge objectives have evolved to enhance its clinical relevance. The challenge evolved from using single-institutional data and basic segmentation in 2021 to incorporating multi-institutional data and Koos grading in 2022, and by 2023, it included heterogeneous routine data and sub-segmentation of intra- and extra-meatal tumour components. In this work, we report the findings of the 2022 and 2023 editions and perform a retrospective analysis of the challenge progression over the years. The observations from the successive challenge contributions indicate that the number of outliers decreases with an expanding dataset. This is notable since the diversity of scanning protocols of the datasets concurrently increased. The winning approach of the 2023 edition reduced the number of outliers on the 2021 and 2022 testing data, demonstrating how increased data heterogeneity can enhance segmentation performance even on homogeneous data. However, the cochlea Dice score declined in 2023, likely due to the added complexity from tumour sub-annotations affecting overall segmentation performance. While progress is still needed for clinically acceptable VS segmentation, the plateauing performance suggests that a more challenging cross-modal task may better serve future benchmarking.

Submitted 24 June, 2025; v1 submitted 13 June, 2025; originally announced June 2025.

arXiv:2506.11514

Efficient Speech Enhancement via Embeddings from Pre-trained Generative Audioencoders

Authors: Xingwei Sun, Heinrich Dinkel, Yadong Niu, Linzhang Wang, Junbo Zhang, Jian Luan

Abstract: Recent research has delved into speech enhancement (SE) approaches that leverage audio embeddings from pre-trained models, diverging from time-frequency masking or signal prediction techniques. This paper introduces an efficient and extensible SE method. Our approach involves initially extracting audio embeddings from noisy speech using a pre-trained audioencoder, which are then denoised by a compact encoder network. Subsequently, a vocoder synthesizes the clean speech from denoised embeddings. An ablation study substantiates the parameter efficiency of the denoise encoder with a pre-trained audioencoder and vocoder. Experimental results on both speech enhancement and speaker fidelity demonstrate that our generative audioencoder-based SE system outperforms models utilizing discriminative audioencoders. Furthermore, subjective listening tests validate that our proposed system surpasses an existing state-of-the-art SE model in terms of perceptual quality.

Submitted 13 June, 2025; originally announced June 2025.

Comments: Accepted by Interspeech 2025

arXiv:2506.09344 [pdf, ps, other]

Ming-Omni: A Unified Multimodal Model for Perception and Generation

Authors: Inclusion AI, Biao Gong, Cheng Zou, Chuanyang Zheng, Chunluan Zhou, Canxiang Yan, Chunxiang Jin, Chunjie Shen, Dandan Zheng, Fudong Wang, Furong Xu, GuangMing Yao, Jun Zhou, Jingdong Chen, Jianxin Sun, Jiajia Liu, Jianjiang Zhu, Jun Peng, Kaixiang Ji, Kaiyou Song, Kaimeng Ren, Libin Wang, Lixiang Ru, Lele Xie, Longhua Tan, et al. (33 additional authors not shown)

Abstract: We propose Ming-Omni, a unified multimodal model capable of processing images, text, audio, and video, while demonstrating strong proficiency in both speech and image generation. Ming-Omni employs dedicated encoders to extract tokens from different modalities, which are then processed by Ling, an MoE architecture equipped with newly proposed modality-specific routers. This design enables a single model to efficiently process and fuse multimodal inputs within a unified framework, thereby facilitating diverse tasks without requiring separate models, task-specific fine-tuning, or structural redesign. Importantly, Ming-Omni extends beyond conventional multimodal models by supporting audio and image generation. This is achieved through the integration of an advanced audio decoder for natural-sounding speech and Ming-Lite-Uni for high-quality image generation, which also allow the model to engage in context-aware chatting, perform text-to-speech conversion, and conduct versatile image editing. Our experimental results showcase Ming-Omni offers a powerful solution for unified perception and generation across all modalities. Notably, our proposed Ming-Omni is the first open-source model we are aware of to match GPT-4o in modality support, and we release all code and model weights to encourage further research and development in the community.

Submitted 10 June, 2025; originally announced June 2025.

Comments: 18 pages, 8 figures

arXiv:2506.05984 [pdf, ps, other]

Audio-Aware Large Language Models as Judges for Speaking Styles

Authors: Cheng-Han Chiang, Xiaofei Wang, Chung-Ching Lin, Kevin Lin, Linjie Li, Radu Kopetz, Yao Qian, Zhendong Wang, Zhengyuan Yang, Hung-yi Lee, Lijuan Wang

Abstract: Audio-aware large language models (ALLMs) can understand the textual and non-textual information in the audio input. In this paper, we explore using ALLMs as an automatic judge to assess the speaking styles of speeches. We use ALLM judges to evaluate the speeches generated by SLMs on two tasks: voice style instruction following and role-playing. The speaking style we consider includes emotion, volume, speaking pace, word emphasis, pitch control, and non-verbal elements. We use four spoken language models (SLMs) to complete the two tasks and use humans and ALLMs to judge the SLMs' responses. We compare two ALLM judges, GPT-4o-audio and Gemini-2.5-pro, with human evaluation results and show that the agreement between Gemini and human judges is comparable to the agreement between human evaluators. These promising results show that ALLMs can be used as a judge to evaluate SLMs. Our results also reveal that current SLMs, even GPT-4o-audio, still have room for improvement in controlling the speaking style and generating natural dialogues.

Submitted 6 June, 2025; originally announced June 2025.

arXiv:2506.05171 [pdf, other]

Towards provable probabilistic safety for scalable embodied AI systems

Authors: Linxuan He, Qing-Shan Jia, Ang Li, Hongyan Sang, Ling Wang, Jiwen Lu, Tao Zhang, Jie Zhou, Yi Zhang, Yisen Wang, Peng Wei, Zhongyuan Wang, Henry X. Liu, Shuo Feng

Abstract: Embodied AI systems, comprising AI models and physical plants, are increasingly prevalent across various applications. Due to the rarity of system failures, ensuring their safety in complex operating environments remains a major challenge, which severely hinders their large-scale deployment in safety-critical domains, such as autonomous vehicles, medical devices, and robotics. While achieving provable deterministic safety--verifying system safety across all possible scenarios--remains theoretically ideal, the rarity and complexity of corner cases make this approach impractical for scalable embodied AI systems. To address this challenge, we introduce provable probabilistic safety, which aims to ensure that the residual risk of large-scale deployment remains below a predefined threshold. Instead of attempting exhaustive safety proof across all corner cases, this paradigm establishes a probabilistic safety boundary on overall system performance, leveraging statistical methods to enhance feasibility and scalability. A well-defined probabilistic safety boundary enables embodied AI systems to be deployed at scale while allowing for continuous refinement of safety guarantees. Our work focuses on three core questions: what is provable probabilistic safety, how to prove the probabilistic safety, and how to achieve the provable probabilistic safety. By bridging the gap between theoretical safety assurance and practical deployment, our work offers a pathway toward safer, large-scale adoption of embodied AI systems in safety-critical applications.

Submitted 5 June, 2025; originally announced June 2025.

arXiv:2506.04682

MARS: Radio Map Super-resolution and Reconstruction Method under Sparse Channel Measurements

Authors: Chuyun Deng, Na Liu, Wei Xie, Lianming Xu, Li Wang

Abstract: Radio maps reflect the spatial distribution of signal strength and are essential for applications like smart cities, IoT, and wireless network planning. However, reconstructing accurate radio maps from sparse measurements remains challenging. Traditional interpolation and inpainting methods lack environmental awareness, while many deep learning approaches depend on detailed scene data, limiting generalization. To address this, we propose MARS, a Multi-scale Aware Radiomap Super-resolution method that combines CNNs and Transformers with multi-scale feature fusion and residual connections. MARS focuses on both global and local feature extraction, enhancing feature representation across different receptive fields and improving reconstruction accuracy. Experiments across different scenes and antenna locations show that MARS outperforms baseline models in both MSE and SSIM, while maintaining low computational cost, demonstrating strong practical potential.

Submitted 8 July, 2025; v1 submitted 5 June, 2025; originally announced June 2025.

Comments: The authors withdraw this submission to substantially revise the introduction and experimental sections and incorporate new content. The manuscript has not been submitted or published elsewhere. A revised version may be submitted in the future

arXiv:2506.04518

Towards Efficient Speech-Text Jointly Decoding within One Speech Language Model

Authors: Haibin Wu, Yuxuan Hu, Ruchao Fan, Xiaofei Wang, Kenichi Kumatani, Bo Ren, Jianwei Yu, Heng Lu, Lijuan Wang, Yao Qian, Jinyu Li

Abstract: Speech language models (Speech LMs) enable end-to-end speech-text modelling within a single model, offering a promising direction for spoken dialogue systems. The choice of speech-text joint decoding paradigm plays a critical role in performance, efficiency, and alignment quality. In this work, we systematically compare representative joint speech-text decoding strategies, including the interleaved and parallel generation paradigms, under a controlled experimental setup using the same base language model, speech tokenizer and training data. Our results show that the interleaved approach achieves the best alignment. However, it suffers from slow inference due to long token sequence length. To address this, we propose a novel early-stop interleaved (ESI) pattern that not only significantly accelerates decoding but also yields slightly better performance. Additionally, we curate high-quality question answering (QA) datasets to further improve speech QA performance.

Submitted 12 June, 2025; v1 submitted 4 June, 2025; originally announced June 2025.

Comments: Our company needs to do an internal review

arXiv:2506.03645 [pdf, other]

YOND: Practical Blind Raw Image Denoising Free from Camera-Specific Data Dependency

Authors: Hansen Feng, Lizhi Wang, Yiqi Huang, Tong Li, Lin Zhu, Hua Huang

Abstract: The rapid advancement of photography has created a growing demand for a practical blind raw image denoising method. Recently, learning-based methods have become mainstream due to their excellent performance. However, most existing learning-based methods suffer from camera-specific data dependency, resulting in performance drops when applied to data from unknown cameras. To address this challenge, we introduce a novel blind raw image denoising method named YOND, which represents You Only Need a Denoiser. Trained solely on synthetic data, YOND can generalize robustly to noisy raw images captured by diverse unknown cameras. Specifically, we propose three key modules to guarantee the practicality of YOND: coarse-to-fine noise estimation (CNE), expectation-matched variance-stabilizing transform (EM-VST), and SNR-guided denoiser (SNR-Net). Firstly, we propose CNE to identify the camera noise characteristic, refining the estimated noise parameters based on the coarse denoised image. Secondly, we propose EM-VST to eliminate camera-specific data dependency, correcting the bias expectation of VST according to the noisy image. Finally, we propose SNR-Net to offer controllable raw image denoising, supporting adaptive adjustments and manual fine-tuning. Extensive experiments on unknown cameras, along with flexible solutions for challenging cases, demonstrate the superior practicality of our method. The source code will be publicly available at the project homepage: https://fenghansen.github.io/publication/YOND.

Submitted 4 June, 2025; originally announced June 2025.

Comments: 17 pages, 19 figures, TPAMI under review

arXiv:2506.03181 [pdf, ps, other]

Dc-EEMF: Pushing depth-of-field limit of photoacoustic microscopy via decision-level constrained learning

Authors: Wangting Zhou, Jiangshan He, Tong Cai, Lin Wang, Zhen Yuan, Xunbin Wei, Xueli Chen

Abstract: Photoacoustic microscopy holds the potential to measure biomarkers' structural and functional status without labels, which significantly aids in comprehending pathophysiological conditions in biomedical research. However, conventional optical-resolution photoacoustic microscopy (OR-PAM) is hindered by a limited depth-of-field (DoF) due to the narrow depth range focused on a Gaussian beam. Consequently, it fails to resolve sufficient details in the depth direction. Herein, we propose a decision-level constrained end-to-end multi-focus image fusion (Dc-EEMF) to push the DoF limit of PAM. The Dc-EEMF method is a lightweight siamese network that incorporates an artifact-resistant channel-wise spatial frequency as its feature fusion rule. The meticulously crafted U-Net-based perceptual loss function for decision-level focus properties in end-to-end fusion seamlessly integrates the complementary advantages of spatial domain and transform domain methods within Dc-EEMF. This approach can be trained end-to-end without necessitating post-processing procedures. Experimental results and numerical analyses collectively demonstrate our method's robust performance, achieving an impressive fusion result for PAM images without a substantial sacrifice in lateral resolution. The utilization of Dc-EEMF-powered PAM has the potential to serve as a practical tool in preclinical and clinical studies requiring extended DoF for various applications.

Submitted 29 May, 2025; originally announced June 2025.

arXiv:2506.00626 [pdf]

Helmet ultrasound for brain imaging in post-hemicraniectomy patients

Authors: Yang Zhang, Karteekeya Sastry, Iyla Rossi, Joshua Olick-Gibson, Jonathan J. Russin, Charles Y. Liu, Lihong V. Wang

Abstract: Noninvasive imaging deep into the adult brain at submillimeter and millisecond scales remains a challenge in medical imaging. Here, we report a helmet-based ultrasound brain imager built from a customized helmet, a scanned ultrasound array, and three-dimensional printing for real-time imaging of human brain anatomical and functional information. Through its application to post-hemicraniectomy patients in a sitting position, we achieved volumetric brain tissue structural, vascular, and blood flow images at centimeter-scale depths with submillimeter and millisecond spatiotemporal resolutions. We also demonstrated the system capability to track cerebral blood flow over repeated imaging sessions, including during motion-prone conditions. Our brain imager circumvents the skull and bridges the gap between high-resolution human brain imaging and wearable convenience. This imager may serve as a platform for further investigations into human brain dynamics in post-hemicraniectomy patients and offer insights into the brain that could surpass those obtained from non-human primate studies.

Submitted 31 May, 2025; originally announced June 2025.

arXiv:2505.23180 [pdf, ps, other]

Proximal Algorithm Unrolling: Flexible and Efficient Reconstruction Networks for Single-Pixel Imaging

Authors: Ping Wang, Lishun Wang, Gang Qu, Xiaodong Wang, Yulun Zhang, Xin Yuan

Abstract: Deep-unrolling and plug-and-play (PnP) approaches have become the de-facto standard solvers for the single-pixel imaging (SPI) inverse problem. PnP approaches, a class of iterative algorithms where regularization is implicitly performed by an off-the-shelf deep denoiser, are flexible for varying compression ratios (CRs) but are limited in reconstruction accuracy and speed. Conversely, unrolling approaches, a class of multi-stage neural networks where a truncated iterative optimization process is transformed into an end-to-end trainable network, typically achieve better accuracy with faster inference but require fine-tuning or even retraining when the CR changes. In this paper, we address the challenge of integrating the strengths of both classes of solvers. To this end, we design an efficient deep image restorer (DIR) for the unrolling of HQS (half quadratic splitting) and ADMM (alternating direction method of multipliers). More importantly, a general proximal trajectory (PT) loss function is proposed to train HQS/ADMM-unrolling networks such that the learned DIR approximates the proximal operator of an ideal explicit restoration regularizer. Extensive experiments demonstrate that the resulting proximal unrolling networks can not only flexibly handle varying CRs with a single model like PnP algorithms, but also outperform previous CR-specific unrolling networks in both reconstruction accuracy and speed. Source codes and models are available at https://github.com/pwangcs/ProxUnroll.

Submitted 29 May, 2025; originally announced May 2025.

Comments: Accepted by CVPR 2025

arXiv:2505.17970 [pdf, ps, other]

Faulty RIS-aided Integrated Sensing and Communication: Modeling and Optimization

Authors: Lu Wang, Gui Zhou, Changheng Li, Luis F. Abanto-Leon, Nairy Moghadas Gholian, Matthias Hollick, Arash Asadi

Abstract: This work investigates a practical reconfigurable intelligent surface (RIS)-aided integrated sensing and communication (ISAC) system, where a subset of RIS elements fail to function properly and reflect incident signals randomly towards unintended directions, thereby degrading system performance. To date, no study has addressed such impairments caused by faulty RIS elements in ISAC systems. This work aims to fill the gap. First, to quantify the impact of faulty elements on ISAC performance, we derive the misspecified Cramér-Rao bound (MCRB) for sensing parameter estimation and signal-to-interference-and-noise ratio (SINR) for communication quality. Then, to mitigate the performance loss caused by faulty elements, we jointly design the remaining functional RIS phase shifts and transmit beamforming to minimize the MCRB, subject to the communication SINR and transmit power constraints. The resulting optimization problem is highly non-convex due to the intricate structure of the MCRB expression and constant-modulus constraint imposed on RIS. To address this, we reformulate it into a more tractable form and propose a block coordinate descent (BCD) algorithm that incorporates majorization-minimization (MM), successive convex approximation (SCA), and penalization techniques. Simulation results demonstrate that our proposed approach reduces the MCRB performance loss by 24.36% on average compared to the case where the presence of faulty elements is ignored. Furthermore, the performance gain becomes more evident as the number of faulty elements increases.

Submitted 23 May, 2025; originally announced May 2025.

Comments: submitted to IEEE journals

arXiv:2505.17847 [pdf, ps, other]

TransDF: Time-Series Forecasting Needs Transformed Label Alignment

Authors: Hao Wang, Licheng Pan, Zhichao Chen, Xu Chen, Qingyang Dai, Lei Wang, Haoxuan Li, Zhouchen Lin

Abstract: Training time-series forecasting models presents unique challenges in designing effective learning objectives. Existing methods predominantly utilize the temporal mean squared error, which faces two critical challenges: (1) label autocorrelation, which leads to bias from the label sequence likelihood; (2) excessive amount of tasks, which increases with the forecast horizon and complicates optimization. To address these challenges, we propose Transform-enhanced Direct Forecast (TransDF), which transforms the label sequence into decorrelated components with discriminated significance. Models are trained to align the most significant components, thereby effectively mitigating label autocorrelation and reducing task amount. Extensive experiments demonstrate that TransDF achieves state-of-the-art performance and is compatible with various forecasting models. Code is available at https://anonymous.4open.science/r/TransDF-88CF.

Submitted 23 May, 2025; originally announced May 2025.

arXiv:2505.17472 [pdf, ps, other]

SUFFICIENT: A scan-specific unsupervised deep learning framework for high-resolution 3D isotropic fetal brain MRI reconstruction

Authors: Jiangjie Wu, Lixuan Chen, Zhenghao Li, Xin Li, Saban Ozturk, Lihui Wang, Rongpin Wang, Hongjiang Wei, Yuyao Zhang

Abstract: High-quality 3D fetal brain MRI reconstruction from motion-corrupted 2D slices is crucial for clinical diagnosis. Reliable slice-to-volume registration (SVR)-based motion correction and super-resolution reconstruction (SRR) methods are essential. Deep learning (DL) has demonstrated potential in enhancing SVR and SRR when compared to conventional methods. However, it requires large-scale external training datasets, which are difficult to obtain for clinical fetal MRI. To address this issue, we propose an unsupervised iterative SVR-SRR framework for isotropic HR volume reconstruction. Specifically, SVR is formulated as a function mapping a 2D slice and a 3D target volume to a rigid transformation matrix, which aligns the slice to the underlying location in the target volume. The function is parameterized by a convolutional neural network, which is trained by minimizing the difference between the volume slicing at the predicted position and the input slice. In SRR, a decoding network embedded within a deep image prior framework is incorporated with a comprehensive image degradation model to produce the high-resolution (HR) volume. The deep image prior framework offers a local consistency prior to guide the reconstruction of HR volumes. By performing a forward degradation model, the HR volume is optimized by minimizing loss between predicted slices and the observed slices. Comprehensive experiments conducted on large-magnitude motion-corrupted simulation data and clinical data demonstrate the superior performance of the proposed framework over state-of-the-art fetal brain reconstruction frameworks.

Submitted 25 May, 2025; v1 submitted 23 May, 2025; originally announced May 2025.

arXiv:2505.12552 [pdf, other]

FreqSelect: Frequency-Aware fMRI-to-Image Reconstruction

Authors: Junliang Ye, Lei Wang, Md Zakir Hossain

Abstract: Reconstructing natural images from functional magnetic resonance imaging (fMRI) data remains a core challenge in neural decoding due to the mismatch between the richness of visual stimuli and the noisy, low-resolution nature of fMRI signals. While recent two-stage models, combining deep variational autoencoders (VAEs) with diffusion models, have advanced this task, they treat all spatial-frequency components of the input equally. This uniform treatment forces the model to extract meaningful features and suppress irrelevant noise simultaneously, limiting its effectiveness. We introduce FreqSelect, a lightweight, adaptive module that selectively filters spatial-frequency bands before encoding. By dynamically emphasizing frequencies that are most predictive of brain activity and suppressing those that are uninformative, FreqSelect acts as a content-aware gate between image features and neural data. It integrates seamlessly into standard very deep VAE-diffusion pipelines and requires no additional supervision. Evaluated on the Natural Scenes dataset, FreqSelect consistently improves reconstruction quality across both low- and high-level metrics. Beyond performance gains, the learned frequency-selection patterns offer interpretable insights into how different visual frequencies are represented in the brain. Our method generalizes across subjects and scenes, and holds promise for extension to other neuroimaging modalities, offering a principled approach to enhancing both decoding accuracy and neuroscientific interpretability.

Submitted 18 May, 2025; originally announced May 2025.

Comments: Research report

arXiv:2505.12226 [pdf, ps, other]

Shallow Flow Matching for Coarse-to-Fine Text-to-Speech Synthesis

Authors: Dong Yang, Yiyi Cai, Yuki Saito, Lixu Wang, Hiroshi Saruwatari

Abstract: We propose a shallow flow matching (SFM) mechanism to enhance flow matching (FM)-based text-to-speech (TTS) models within a coarse-to-fine generation paradigm. SFM constructs intermediate states along the FM paths using coarse output representations. During training, we introduce an orthogonal projection method to adaptively determine the temporal position of these states, and apply a principled construction strategy based on a single-segment piecewise flow. The SFM inference starts from the intermediate state rather than pure noise and focuses computation on the latter stages of the FM paths. We integrate SFM into multiple TTS models with a lightweight SFM head. Experiments show that SFM consistently improves the naturalness of synthesized speech in both objective and subjective evaluations, while significantly reducing inference when using adaptive-step ODE solvers. Demo and codes are available at https://ydqmkkx.github.io/SFMDemo/.

Submitted 18 May, 2025; originally announced May 2025.

arXiv:2505.03380 [pdf, other]

Reinforced Correlation Between Vision and Language for Precise Medical AI Assistant

Authors: Haonan Wang, Jiaji Mao, Lehan Wang, Qixiang Zhang, Marawan Elbatel, Yi Qin, Huijun Hu, Baoxun Li, Wenhui Deng, Weifeng Qin, Hongrui Li, Jialin Liang, Jun Shen, Xiaomeng Li

Abstract: Medical AI assistants support doctors in disease diagnosis, medical image analysis, and report generation. However, they still face significant challenges in clinical use, including limited accuracy with multimodal content and insufficient validation in real-world settings. We propose RCMed, a full-stack AI assistant that improves multimodal alignment in both input and output, enabling precise anatomical delineation, accurate localization, and reliable diagnosis through hierarchical vision-language grounding. A self-reinforcing correlation mechanism allows visual features to inform language context, while language semantics guide pixel-wise attention, forming a closed loop that refines both modalities. This correlation is enhanced by a color region description strategy, translating anatomical structures into semantically rich text to learn shape-location-text relationships across scales. Trained on 20 million image-mask-description triplets, RCMed achieves state-of-the-art precision in contextualizing irregular lesions and subtle anatomical boundaries, excelling in 165 clinical tasks across 9 modalities. It achieved a 23.5% relative improvement in cell segmentation from microscopy images over prior methods. RCMed's strong vision-language alignment enables exceptional generalization, with state-of-the-art performance in external validation across 20 clinically significant cancer types, including novel tasks. This work demonstrates how integrated multimodal models capture fine-grained patterns, enabling human-level interpretation in complex scenarios and advancing human-centric AI healthcare.

Submitted 6 May, 2025; originally announced May 2025.

arXiv:2504.20447 [pdf, other]

APG-MOS: Auditory Perception Guided-MOS Predictor for Synthetic Speech

Authors: Zhicheng Lian, Lizhi Wang, Hua Huang

Abstract: Automatic speech quality assessment aims to quantify subjective human perception of speech through computational models to reduce the need for labor-consuming manual evaluations. While models based on deep learning have achieved progress in predicting mean opinion scores (MOS) to assess synthetic speech, the neglect of fundamental auditory perception mechanisms limits consistency with human judgments. To address this issue, we propose an auditory perception guided-MOS prediction model (APG-MOS) that synergistically integrates auditory modeling with semantic analysis to enhance consistency with human judgments. Specifically, we first design a perceptual module, grounded in biological auditory mechanisms, to simulate cochlear functions, which encodes acoustic signals into biologically aligned electrochemical representations. Secondly, we propose a residual vector quantization (RVQ)-based semantic distortion modeling method to quantify the degradation of speech quality at the semantic level. Finally, we design a residual cross-attention architecture, coupled with a progressive learning strategy, to enable multimodal fusion of encoded electrochemical signals and semantic representations. Experiments demonstrate that APG-MOS achieves superior performance on two primary benchmarks. Our code and checkpoint will be available on a public repository upon publication.

Submitted 29 April, 2025; originally announced April 2025.

arXiv:2504.16800 [pdf, other]

Array Partitioning Based Near-Field Attitude and Location Estimation

Authors: Mingchen Zhang, Xiaojun Yuan, Boyu Teng, Li Wang

Abstract: This paper studies a passive source localization system, where a single base station (BS) is employed to estimate the positions and attitudes of multiple mobile stations (MSs). The BS and the MSs are equipped with uniform rectangular arrays, and the MSs are located in the near-field region of the BS array. To avoid the difficulty of tackling the problem directly based on the near-field signal model, we establish a subarray-wise far-field received signal model. In this model, the entire BS array is divided into multiple subarrays to ensure that each MS is in the far-field region of each BS subarray. By exploiting the angles of arrival (AoAs) of an MS antenna at different BS subarrays, we formulate the attitude and location estimation problem under the Bayesian inference framework. Based on the factor graph representation of the probabilistic problem model, a message passing algorithm named array partitioning based pose and location estimation (APPLE) is developed to solve this problem. An estimation-error lower bound is obtained as a performance benchmark of the proposed algorithm. Numerical results demonstrate that the proposed APPLE algorithm outperforms other baseline methods in the accuracy of position and attitude estimation.

Submitted 23 April, 2025; originally announced April 2025.

arXiv:2504.16036 [pdf]

Rotational ultrasound and photoacoustic tomography of the human body

Authors: Yang Zhang, Shuai Na, Jonathan J. Russin, Karteekeya Sastry, Li Lin, Junfu Zheng, Yilin Luo, Xin Tong, Yujin An, Peng Hu, Konstantin Maslov, Tze-Woei Tan, Charles Y. Liu, Lihong V. Wang

Abstract: Imaging the human body's morphological and angiographic information is essential for diagnosing, monitoring, and treating medical conditions. Ultrasonography performs the morphological assessment of the soft tissue based on acoustic impedance variations, whereas photoacoustic tomography (PAT) can visualize blood vessels based on intrinsic hemoglobin absorption. Three-dimensional (3D) panoramic imaging of the vasculature is generally not practical in conventional ultrasonography with limited field-of-view (FOV) probes, and PAT does not provide sufficient scattering-based soft tissue morphological contrast. Complementing each other, fast panoramic rotational ultrasound tomography (RUST) and PAT are integrated for hybrid rotational ultrasound and photoacoustic tomography (RUS-PAT), which obtains 3D ultrasound structural and PAT angiographic images of the human body quasi-simultaneously. The RUST functionality is achieved in a cost-effective manner using a single-element ultrasonic transducer for ultrasound transmission and rotating arc-shaped arrays for 3D panoramic detection. RUST is superior to conventional ultrasonography, which either has a limited FOV with a linear array or is high-cost with a hemispherical array that requires both transmission and receiving. By switching the acoustic source to a light source, the system is conveniently converted to PAT mode to acquire angiographic images in the same region.
Using RUS-PAT, we have successfully imaged the human head, breast, hand, and foot with a 10 cm diameter FOV, submillimeter isotropic resolution, and 10 s imaging time for each modality. The 3D RUS-PAT is a powerful tool for high-speed, 3D, dual-contrast imaging of the human body with potential for rapid clinical translation.

Submitted 22 April, 2025; originally announced April 2025.

arXiv:2504.13190 [pdf, other]

Cellular-X: An LLM-empowered Cellular Agent for Efficient Base Station Operations

Authors: Liujianfu Wang, Xinyi Long, Yuyang Du, Xiaoyan Liu, Kexin Chen, Soung Chang Liew

Abstract: This paper introduces Cellular-X, an LLM-powered agent designed to automate cellular base station (BS) maintenance. Leveraging multimodal LLM and retrieval-augmented generation (RAG) techniques, Cellular-X significantly enhances field engineer efficiency by quickly interpreting user intents, retrieving relevant technical information, and configuring a BS through iterative self-correction. Key features of the demo include automatic customized BS setup, document-based query answering, and voice-controlled configuration reporting and revision. We implemented Cellular-X on a USRP X310 testbed for demonstration. Demo videos and implementation details are available at https://github.com/SeaBreezing/Cellular-X.

Submitted 10 April, 2025; originally announced April 2025.

Comments: MobiSys '25, June 23-27, 2025, Anaheim, CA, USA

arXiv:2504.12703 [pdf, other]

Spike-Kal: A Spiking Neuron Network Assisted Kalman Filter

Authors: Xun Xiao, Junbo Tie, Jinyue Zhao, Ziqi Wang, Yuan Li, Qiang Dou, Lei Wang

Abstract: Kalman filtering can provide an optimal estimation of the system state from noisy observation data. This algorithm's performance depends on the accuracy of system modeling and noise statistical characteristics, which are usually challenging to obtain in practical applications. The powerful nonlinear modeling capabilities of deep learning, combined with its ability to extract features from large amounts of data automatically, offer new opportunities for improving the Kalman filter. This paper proposes a novel method that leverages the Spiking Neural Network to optimize the Kalman filter. Our approach aims to reduce the reliance on prior knowledge of system and observation noises, allowing for adaptation to varying statistical characteristics of time-varying noise. Furthermore, we investigate the potential of SNNs in improving the computational efficiency of the Kalman filter. In our method, we design an integration strategy between the SNN and the Kalman filter. The SNN is trained to directly approximate the optimal gain matrix from observation data, thereby alleviating the computational burden of complex matrix operations inherent in traditional Kalman filtering while maintaining the accuracy and robustness of state estimation. Its average error has been reduced by 18%-65% compared with other methods.

Submitted 17 April, 2025; originally announced April 2025.

arXiv:2504.09225 [pdf, other]

AMNet: An Acoustic Model Network for Enhanced Mandarin Speech Synthesis

Authors: Yubing Cao, Yinfeng Yu, Yongming Li, Liejun Wang

Abstract: This paper presents AMNet, an Acoustic Model Network designed to improve the performance of Mandarin speech synthesis by incorporating phrase structure annotation and local convolution modules. AMNet builds upon the FastSpeech 2 architecture while addressing the challenge of local context modeling, which is crucial for capturing intricate speech features such as pauses, stress, and intonation. By embedding a phrase structure parser into the model and introducing a local convolution module, AMNet enhances the model's sensitivity to local information. Additionally, AMNet decouples tonal characteristics from phonemes, providing explicit guidance for tone modeling, which improves tone accuracy and pronunciation. Experimental results demonstrate that AMNet outperforms baseline models in subjective and objective evaluations. The proposed model achieves superior Mean Opinion Scores (MOS), lower Mel Cepstral Distortion (MCD), and improved fundamental frequency fitting $F0 (R^2)$, confirming its ability to generate high-quality, natural, and expressive Mandarin speech.

Submitted 12 April, 2025; originally announced April 2025.

Comments: Main paper (8 pages). Accepted for publication by IJCNN 2025

arXiv:2504.05158 [pdf, other]

Leveraging Label Potential for Enhanced Multimodal Emotion Recognition

Authors: Xuechun Shao, Yinfeng Yu, Liejun Wang

Abstract: Multimodal emotion recognition (MER) seeks to integrate various modalities to predict emotional states accurately. However, most current research focuses solely on the fusion of audio and text features, overlooking the valuable information in emotion labels. This oversight could potentially hinder the performance of existing methods, as emotion labels harbor rich, insightful information that could significantly aid MER. We introduce a novel model called Label Signal-Guided Multimodal Emotion Recognition (LSGMER) to overcome this limitation. This model aims to fully harness the power of emotion label information to boost the classification accuracy and stability of MER. Specifically, LSGMER employs a Label Signal Enhancement module that optimizes the representation of modality features by interacting with audio and text features through label embeddings, enabling it to capture the nuances of emotions precisely. Furthermore, we propose a Joint Objective Optimization (JOO) approach to enhance classification accuracy by introducing the Attribution-Prediction Consistency Constraint (APC), which strengthens the alignment between fused features and emotion categories. Extensive experiments conducted on the IEMOCAP and MELD datasets have demonstrated the effectiveness of our proposed LSGMER model.

Submitted 7 April, 2025; originally announced April 2025.

Comments: Main paper (8 pages). Accepted for publication by IJCNN 2025

arXiv:2504.04012 [pdf, other]

Detection-Friendly Nonuniformity Correction: A Union Framework for Infrared UAV Target Detection

Authors: Houzhang Fang, Xiaolin Wang, Zengyang Li, Lu Wang, Qingshan Li, Yi Chang, Luxin Yan

Abstract: Infrared unmanned aerial vehicle (UAV) images captured using thermal detectors are often affected by temperature-dependent low-frequency nonuniformity, which significantly reduces the contrast of the images. Detecting UAV targets under nonuniform conditions is crucial in UAV surveillance applications. Existing methods typically treat infrared nonuniformity correction (NUC) as a preprocessing step for detection, which leads to suboptimal performance. Balancing the two tasks while enhancing detection-beneficial information remains challenging. In this paper, we present a detection-friendly union framework, termed UniCD, that simultaneously addresses both infrared NUC and UAV target detection tasks in an end-to-end manner. We first model NUC as a parameter estimation problem with a small number of parameters, jointly driven by priors and data, to generate detection-conducive images. Then, we incorporate a new auxiliary loss with target mask supervision into the backbone of the infrared UAV target detection network to strengthen target features while suppressing the background. To better balance correction and detection, we introduce a detection-guided self-supervised loss to reduce feature discrepancies between the two tasks, thereby enhancing detection robustness to varying nonuniformity levels. Additionally, we construct a new benchmark, called IRBFD, composed of 50,000 infrared images with various nonuniformity types, multi-scale UAV targets, and rich backgrounds with target annotations.
Extensive experiments on IRBFD demonstrate that our UniCD is a robust union framework for NUC and UAV target detection while achieving real-time processing capabilities. The dataset is available at https://github.com/IVPLaboratory/UniCD.

Submitted 4 April, 2025; originally announced April 2025.

Comments: Accepted by CVPR2025

arXiv:2504.02628 [pdf, ps, other]

Towards Computation- and Communication-efficient Computational Pathology

Authors: Chu Han, Bingchao Zhao, Jiatai Lin, Shanshan Lyu, Longfei Wang, Tianpeng Deng, Cheng Lu, Changhong Liang, Hannah Y. Wen, Xiaojing Guo, Zhenwei Shi, Zaiyi Liu

Abstract: Despite the impressive performance across a wide range of applications, current computational pathology models face significant diagnostic efficiency challenges due to their reliance on high-magnification whole-slide image analysis. This limitation severely compromises their clinical utility, especially in time-sensitive diagnostic scenarios and situations requiring efficient data transfer. To address these issues, we present a novel computation- and communication-efficient framework called Magnification-Aligned Global-Local Transformer (MAG-GLTrans). Our approach significantly reduces computational time, file transfer requirements, and storage overhead by enabling effective analysis using low-magnification inputs rather than high-magnification ones. The key innovation lies in our proposed magnification alignment (MAG) mechanism, which employs self-supervised learning to bridge the information gap between low and high magnification levels by effectively aligning their feature representations. Through extensive evaluation across various fundamental CPath tasks, MAG-GLTrans demonstrates state-of-the-art classification performance while achieving remarkable efficiency gains: up to 10.7 times reduction in computational time and over 20 times reduction in file transfer and storage requirements.
Furthermore, we highlight the versatility of our MAG framework through two significant extensions: (1) its applicability as a feature extractor to enhance the efficiency of any CPath architecture, and (2) its compatibility with existing foundation models and histopathology-specific encoders, enabling them to process low-magnification inputs with minimal information loss. These advancements position MAG-GLTrans as a particularly promising solution for time-sensitive applications, especially in the context of intraoperative frozen section diagnosis where both accuracy and efficiency are paramount.

Submitted 3 June, 2025; v1 submitted 3 April, 2025; originally announced April 2025.

arXiv:2503.21571 [pdf, other]

Magnitude-Phase Dual-Path Speech Enhancement Network based on Self-Supervised Embedding and Perceptual Contrast Stretch Boosting

Authors: Alimjan Mattursun, Liejun Wang, Yinfeng Yu, Chunyang Ma

Abstract: Speech self-supervised learning (SSL) has made great progress in various speech processing tasks, but there is still room for improvement in speech enhancement (SE). This paper presents BSP-MPNet, a dual-path framework that combines self-supervised features with magnitude-phase information for SE. The approach starts by applying the perceptual contrast stretching (PCS) algorithm to enhance the magnitude-phase spectrum. A magnitude-phase 2D coarse (MP-2DC) encoder then extracts coarse features from the enhanced spectrum. Next, a feature-separating self-supervised learning (FS-SSL) model generates self-supervised embeddings for the magnitude and phase components separately. These embeddings are fused to create cross-domain feature representations. Finally, two parallel RNN-enhanced multi-attention (REMA) mask decoders refine the features, apply them to the mask, and reconstruct the speech signal. We evaluate BSP-MPNet on the VoiceBank+DEMAND and WHAMR! datasets. Experimental results show that BSP-MPNet outperforms existing methods under various noise conditions, providing new directions for self-supervised speech enhancement research. The implementation of the BSP-MPNet code is available online at https://github.com/AlimMat/BSP-MPNet.

Submitted 27 March, 2025; originally announced March 2025.

Comments: Main paper (6 pages). Accepted for publication by ICME 2025

arXiv:2503.21498 [pdf, other]

Distributed Forgetting-factor Regret-based Online Optimization over Undirected Connected Networks

Authors: Lipo Mo, Jianjun Li, Min Zuo, Lei Wang

Abstract: The evaluation of final-iteration tracking performance is a formidable obstacle in distributed online optimization algorithms. To address this issue, this paper proposes a novel evaluation metric named distributed forgetting-factor regret (DFFR). It incorporates a weight into the loss function at each iteration, which progressively reduces the weights of historical loss functions while enabling dynamic weight allocation across the optimization horizon. Furthermore, we develop two distributed online optimization algorithms based on DFFR over undirected connected networks: the Distributed Online Gradient-free Algorithm for bandit-feedback problems and the Distributed Online Projection-free Algorithm for high-dimensional problems. Through theoretical analysis, we derive the upper bounds of DFFR for both algorithms and further prove that under mild conditions, DFFR either converges to zero or maintains a tight upper bound as iterations approach infinity. Experimental simulation demonstrates the effectiveness of the algorithms and the superior performance of DFFR.

Submitted 27 March, 2025; originally announced March 2025.

Comments: 11 pages, 6 figures

ACM Class: C.2.4

arXiv:2503.20782 [pdf, other]

Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising

Authors: Yan-Bo Lin, Kevin Lin, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Chung-Ching Lin, Xiaofei Wang, Gedas Bertasius, Lijuan Wang

Abstract: In this paper, we introduce zero-shot audio-video editing, a novel task that requires transforming original audio-visual content to align with a specified textual prompt without additional model training. To evaluate this task, we curate a benchmark dataset, AvED-Bench, designed explicitly for zero-shot audio-video editing. AvED-Bench includes 110 videos, each with a 10-second duration, spanning 11 categories from VGGSound. It offers diverse prompts and scenarios that require precise alignment between auditory and visual elements, enabling robust evaluation. We identify limitations in existing zero-shot audio and video editing methods, particularly in synchronization and coherence between modalities, which often result in inconsistent outcomes. To address these challenges, we propose AvED, a zero-shot cross-modal delta denoising framework that leverages audio-video interactions to achieve synchronized and coherent edits. AvED demonstrates superior results on both AvED-Bench and the recent OAVE dataset to validate its generalization capabilities. Results are available at https://genjib.github.io/project_page/AVED/index.html

Submitted 26 March, 2025; originally announced March 2025.

Comments: Project page: https://genjib.github.io/project_page/AVED/index.html

arXiv:2503.18353 [pdf, other]

Contact Plan Design for Cross-Linked GNSSs: An ILP Approach for Extended Applications

Authors: Huan Yan, Juan A. Fraire, Ziqi Yang, Kanglian Zhao, Wenfeng Li, Xiyun Hou, Haohan Li, Yuxuan Miao, Jinjun Zheng, Chengbin Kang, Huichao Zhou, Xinuo Chang, Lu Wang

Abstract: Global Navigation Satellite Systems (GNSS) employ inter-satellite links (ISLs) to reduce dependency on ground stations, enabling precise ranging and communication across satellites. Beyond their traditional role, ISLs can support extended applications, including providing navigation and communication services to external entities. However, designing effective contact plan design (CPD) schemes for these multifaceted ISLs, operating under a polling time-division duplex (PTDD) framework, remains a critical challenge. Existing CPD approaches focus solely on meeting GNSS satellites' internal ranging and communication demands, neglecting their extended applications. This paper introduces the first CPD scheme capable of supporting extended GNSS ISLs. By modeling GNSS requirements and designing a tailored service process, our approach ensures the allocation of essential resources for internal operations while accommodating external user demands. Based on the BeiDou constellation, simulation results demonstrate the proposed scheme's efficacy in maintaining core GNSS functionality while providing extended ISLs on a best-effort basis. Additionally, the results highlight the significant impact of GNSS ISLs in enhancing orbit determination and clock synchronization for the Earth-Moon libration point constellation, underscoring the importance of extended GNSS ISL applications.

Submitted 24 March, 2025; originally announced March 2025.

Comments: 18 pages, 13 figures

arXiv:2503.18340 [pdf, other]

Optimized Contact Plan Design for Reflector and Phased Array Terminals in Cislunar Space Networks

Authors: Huan Yan, Juan A. Fraire, Ziqi Yang, Kanglian Zhao, Wenfeng Li, Yuan Fang, Jinjun Zheng, Chengbin Kang, Huichao Zhou, Xinuo Chang, Lu Wang, Linshan Xue

Abstract: Cislunar space is emerging as a critical domain for human exploration, requiring robust infrastructure to support spatial users - spacecraft with navigation and communication demands. Deploying satellites at Earth-Moon libration points offers an effective solution. This paper introduces a novel Contact Plan Design (CPD) scheme that considers two classes of cislunar transponders: Reflector Links (RL) for high-volume data transfer and Phased Array Links (PL) for fast switching and navigation services. Our approach addresses the needs of both satellites and spatial users within the Earth-Moon Libration Point Communication and Navigation Constellation (EMLP-CNC). Simulations validate the proposed scheme, demonstrating its effectiveness in serving spatial users while meeting satellite ranging and communication requirements. These findings provide essential insights for developing future Cislunar Space Infrastructures.

Submitted 24 March, 2025; originally announced March 2025.

Comments: 16 pages, 14 figures

arXiv:2503.17992 [pdf, other]

Geometric Constrained Non-Line-of-Sight Imaging

Authors: Xueying Liu, Lianfang Wang, Jun Liu, Yong Wang, Yuping Duan

Abstract: Normal reconstruction is crucial in non-line-of-sight (NLOS) imaging, as it provides key geometric and lighting information about hidden objects, which significantly improves reconstruction accuracy and scene understanding. However, jointly estimating normals and albedo expands the problem from matrix-valued functions to tensor-valued functions, substantially increasing complexity and computational difficulty. In this paper, we propose a novel joint albedo-surface reconstruction method, which utilizes the Frobenius norm of the shape operator to control the variation rate of the normal field. It is the first attempt to apply regularization methods to the reconstruction of surface normals for hidden objects. By improving the accuracy of the normal field, it enhances detail representation and achieves high-precision reconstruction of hidden object geometry. The proposed method demonstrates robustness and effectiveness on both synthetic and experimental datasets. On transient data captured within 15 seconds, our surface normal-regularized reconstruction model produces more accurate surfaces than recently proposed methods and is 30 times faster than the existing surface reconstruction approach.

Submitted 23 March, 2025; originally announced March 2025.

arXiv:2503.17551 [pdf, other]

Audio-Enhanced Vision-Language Modeling with Latent Space Broadening for High Quality Data Expansion

Authors: Yu Sun, Yin Li, Ruixiao Sun, Chunhui Liu, Fangming Zhou, Ze Jin, Linjie Wang, Xiang Shen, Zhuolin Hao, Hongyu Xiong

Abstract: Transformer-based multimodal models are widely used in industrial-scale recommendation, search, and advertising systems for content understanding and relevance ranking. Enhancing labeled training data quality and cross-modal fusion significantly improves model performance, influencing key metrics such as quality view rates and ad revenue. High-quality annotations are crucial for advancing content modeling, yet traditional statistical-based active learning (AL) methods face limitations: they struggle to detect overconfident misclassifications and are less effective in distinguishing semantically similar items in deep neural networks. Additionally, audio information plays an increasing role, especially in short-video platforms, yet most pre-trained multimodal architectures primarily focus on text and images. While training from scratch across all three modalities is possible, it sacrifices the benefits of leveraging existing pre-trained visual-language (VL) and audio models. To address these challenges, we propose kNN-based Latent Space Broadening (LSB) to enhance AL efficiency and Vision-Language Modeling with Audio Enhancement (VLMAE), a mid-fusion approach integrating audio into VL models. This system has been deployed in production systems, leading to significant business gains.

Submitted 21 March, 2025; originally announced March 2025.

arXiv:2503.13479 [pdf, other]

EAGLE: Contextual Point Cloud Generation via Adaptive Continuous Normalizing Flow with Self-Attention

Authors: Linhao Wang, Qichang Zhang, Yifan Yang, Hao Wang

Abstract: As 3D point clouds become the prevailing shape representation in computer vision, how to generate high-resolution point clouds has become a pressing issue. Flow-based generative models can effectively perform point cloud generation tasks. However, traditional CNN-based flow architectures rely only on local information to extract features, making it difficult to capture global contextual information. Inspired by the wide adoption of Transformers, we explored the complementary roles of self-attention mechanisms in Transformers, CNN, and continuous normalizing flows. To this end, we propose a probabilistic model via adaptive normalizing flows and self-attention. Our idea leverages self-attention mechanisms to capture global contextual information. We also propose adaptive continuous normalizing flows by introducing an adaptive bias correction mechanism. Combined with normalization, the mechanism dynamically handles different input contexts and mitigates potential bias-shift issues from standard initialization. Experimental results demonstrate that EAGLE achieves competitive performance in point cloud generation.

Submitted 4 March, 2025; originally announced March 2025.

arXiv:2503.13474 [pdf, other]

ISLS: IoT-Based Smart Lighting System for Improving Energy Conservation in Office Buildings

Authors: Peace Obioma, Obinna Agbodike, Jenhui Chen, Lei Wang

Abstract: With the Internet of Things (IoT) fostering seamless device-to-human and device-to-device interactions, the domain of intelligent lighting systems has evolved beyond simple occupancy and daylight sensing towards autonomous monitoring and control of power consumption and illuminance levels. In this regard, this paper proposes a new do-it-yourself (DIY) IoT-based smart lighting system featuring an illuminance control algorithm. The design involves the integration of occupancy and presence sensors alongside a communication module, to enable real-time wireless interaction and remote monitoring of the system parameters from any location through an end-user application. A constrained optimization problem was formulated to determine the optimal dimming vector for achieving target illuminance at minimal power consumption. The simplex algorithm was used to solve this problem, and the system's performance was validated through both MATLAB simulations and real-world prototype testing in an indoor office environment. The obtained experimental results demonstrate substantial power savings across multiple user occupancy scenarios, achieving reductions of approx. 80%, 48%, and 26% for one, two, and four user settings, respectively, in comparison to traditional basic lighting installation systems.

Submitted 18 March, 2025; v1 submitted 4 March, 2025; originally announced March 2025.

arXiv:2503.12419 [pdf, other]

EgoEvGesture: Gesture Recognition Based on Egocentric Event Camera

Authors: Luming Wang, Hao Shi, Xiaoting Yin, Kailun Yang, Kaiwei Wang, Jian Bai

Abstract: Egocentric gesture recognition is a pivotal technology for enhancing natural human-computer interaction, yet traditional RGB-based solutions suffer from motion blur and illumination variations in dynamic scenarios. While event cameras show distinct advantages in handling high dynamic range with ultra-low power consumption, existing RGB-based architectures face inherent limitations in processing asynchronous event streams due to their synchronous frame-based nature. Moreover, from an egocentric perspective, event cameras record data that includes events generated by both head movements and hand gestures, thereby increasing the complexity of gesture recognition. To address this, we propose a novel network architecture specifically designed for event data processing, incorporating (1) a lightweight CNN with asymmetric depthwise convolutions to reduce parameters while preserving spatiotemporal features, (2) a plug-and-play state-space model as context block that decouples head movement noise from gesture dynamics, and (3) a parameter-free Bins-Temporal Shift Module (BSTM) that shifts features along bins and temporal dimensions to fuse sparse events efficiently. We further establish the EgoEvGesture dataset, the first large-scale dataset for egocentric gesture recognition using event cameras. Experimental results demonstrate that our method achieves 62.7% accuracy tested on unseen subjects with only 7M parameters, 3.1% higher than state-of-the-art approaches. Notable misclassifications in freestyle motions stem from high inter-personal variability and unseen test patterns differing from training data. Moreover, our approach achieved a remarkable accuracy of 97.0% on the DVS128 Gesture, demonstrating the effectiveness and generalization capability of our method on public datasets. The dataset and models are made available at https://github.com/3190105222/EgoEv_Gesture.

Submitted 13 April, 2025; v1 submitted 16 March, 2025; originally announced March 2025.

Comments: The dataset and models are made available at https://github.com/3190105222/EgoEv_Gesture

Showing 1–50 of 658 results for author: Wang, L