
Energy Transfer and Data Collection from Batteryless Sensors in Low-altitude Wireless Networks

Authors: Wen Zhang, Aimin Wang, Jiahui Li, Geng Sun, Jiacheng Wang, Weijie Yuan, Dusit Niyato

Abstract: The integration of wireless power transfer (WPT) with Internet of Things (IoT) offers promising solutions for sensing applications, but faces significant challenges when deployed in hard-to-access areas such as high-temperature environments. In such extreme conditions, traditional fixed WPT infrastructure cannot be safely installed, and batteries rapidly degrade due to hardware failures. In this paper, we propose an uncrewed aerial vehicle (UAV)-assisted data collection and WPT framework for batteryless sensor (BLS) networks deployed in these challenging environments. Specifically, we consider a practical scenario where a UAV first transfers energy to BLS nodes via WPT, enabling these nodes to subsequently transmit their collected data to the UAV through orthogonal frequency-division multiple access (OFDMA). Then, we formulate a multi-objective optimization problem that aims to maximize the fair data collection volume while minimizing the UAV energy consumption through joint optimization of transmit power allocation and flight trajectory planning. Due to the non-convex nature and dynamic characteristics of this problem, conventional optimization methods prove inadequate. To address these challenges, we propose an enhanced soft actor-critic algorithm with parameter-free attention, prioritized experience replay, and value-based reward centering (SAC-PPV), thereby improving the exploration efficiency and learning stability of the algorithm in complex WPT scenarios. Simulation results demonstrate that the proposed approach consistently outperforms benchmark algorithms under various network configurations.

Submitted 10 July, 2025; originally announced July 2025.

arXiv:2507.07384

VP-SelDoA: Visual-prompted Selective DoA Estimation of Target Sound via Semantic-Spatial Matching

Authors: Yu Chen, Xinyuan Qian, Hongxu Zhu, Jiadong Wang, Kainan Chen, Haizhou Li

Abstract: Audio-visual sound source localization (AV-SSL) identifies the position of a sound source by exploiting the complementary strengths of auditory and visual signals. However, existing AV-SSL methods encounter three major challenges: 1) inability to selectively isolate the target sound source in multi-source scenarios, 2) misalignment between semantic visual features and spatial acoustic features, and 3) overreliance on paired audio-visual data. To overcome these limitations, we introduce Cross-Instance Audio-Visual Localization (CI-AVL), a novel task that leverages images from different instances of the same sound event category to localize target sound sources, thereby reducing dependence on paired data while enhancing generalization capabilities. Our proposed VP-SelDoA tackles this challenging task through a semantic-level modality fusion and employs a Frequency-Temporal ConMamba architecture to generate target-selective masks for sound isolation. We further develop a Semantic-Spatial Matching mechanism that aligns the heterogeneous semantic and spatial features via integrated cross- and self-attention mechanisms. To facilitate the CI-AVL research, we construct a large-scale dataset named VGG-SSL, comprising 13,981 spatial audio clips across 296 sound event categories. Extensive experiments show that our proposed method outperforms state-of-the-art audio-visual localization methods, achieving a mean absolute error (MAE) of 12.04 and an accuracy (ACC) of 78.23%.

Submitted 9 July, 2025; originally announced July 2025.

Comments: Under Review

arXiv:2507.07105

4KAgent: Agentic Any Image to 4K Super-Resolution

Authors: Yushen Zuo, Qi Zheng, Mingyang Wu, Xinrui Jiang, Renjie Li, Jian Wang, Yide Zhang, Gengchen Mai, Lihong V. Wang, James Zou, Xiaoyu Wang, Ming-Hsuan Yang, Zhengzhong Tu

Abstract: We present 4KAgent, a unified agentic super-resolution generalist system designed to universally upscale any image to 4K resolution (and even higher, if applied iteratively). Our system can transform images from extremely low resolutions with severe degradations, for example, highly distorted inputs at 256x256, into crystal-clear, photorealistic 4K outputs. 4KAgent comprises three core components: (1) Profiling, a module that customizes the 4KAgent pipeline based on bespoke use cases; (2) A Perception Agent, which leverages vision-language models alongside image quality assessment experts to analyze the input image and make a tailored restoration plan; and (3) A Restoration Agent, which executes the plan, following a recursive execution-reflection paradigm, guided by a quality-driven mixture-of-expert policy to select the optimal output for each step. Additionally, 4KAgent embeds a specialized face restoration pipeline, significantly enhancing facial details in portrait and selfie photos. We rigorously evaluate our 4KAgent across 11 distinct task categories encompassing a total of 26 diverse benchmarks, setting new state-of-the-art on a broad spectrum of imaging domains. Our evaluations cover natural images, portrait photos, AI-generated content, satellite imagery, fluorescence microscopy, and medical imaging like fundoscopy, ultrasound, and X-ray, demonstrating superior performance in terms of both perceptual (e.g., NIQE, MUSIQ) and fidelity (e.g., PSNR) metrics. By establishing a novel agentic paradigm for low-level vision tasks, we aim to catalyze broader interest and innovation within vision-centric autonomous agents across diverse research communities. We will release all the code, models, and results at: https://4kagent.github.io.

Submitted 9 July, 2025; originally announced July 2025.

Comments: Project page: https://4kagent.github.io

arXiv:2507.05177

OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model

Authors: Chen Wang, Tianyu Peng, Wen Yang, Yinan Bai, Guangfu Wang, Jun Lin, Lanpeng Jia, Lingxiang Wu, Jinqiao Wang, Chengqing Zong, Jiajun Zhang

Abstract: Empathetic interaction is a cornerstone of human-machine communication, due to the need for understanding speech enriched with paralinguistic cues and generating emotional and expressive responses. However, the most powerful empathetic LSLMs are increasingly closed off, leaving the crucial details about the architecture, data and development opaque to researchers. Given the critical need for transparent research into the LSLMs and empathetic behavior, we present OpenS2S, a fully open-source, transparent and end-to-end LSLM designed to enable empathetic speech interactions. Based on our empathetic speech-to-text model BLSP-Emo, OpenS2S further employs a streaming interleaved decoding architecture to achieve low-latency speech generation. To facilitate end-to-end training, OpenS2S incorporates an automated data construction pipeline that synthesizes diverse, high-quality empathetic speech dialogues at low cost. By leveraging large language models to generate empathetic content and controllable text-to-speech systems to introduce speaker and emotional variation, we construct a scalable training corpus with rich paralinguistic diversity and minimal human supervision. We release the fully open-source OpenS2S model, including the dataset, model weights, pre-training and fine-tuning codes, to empower the broader research community and accelerate innovation in empathetic speech systems. The project webpage can be accessed at https://casia-lm.github.io/OpenS2S

Submitted 8 July, 2025; v1 submitted 7 July, 2025; originally announced July 2025.

Comments: Technical Report

arXiv:2507.04776

Improving BERT for Symbolic Music Understanding Using Token Denoising and Pianoroll Prediction

Authors: Jun-You Wang, Li Su

Abstract: We propose a pre-trained BERT-like model for symbolic music understanding that achieves competitive performance across a wide range of downstream tasks. To achieve this target, we design two novel pre-training objectives, namely token correction and pianoroll prediction. First, we sample a portion of note tokens and corrupt them with a limited amount of noise, and then train the model to denoise the corrupted tokens; second, we also train the model to predict bar-level and local pianoroll-derived representations from the corrupted note tokens. We argue that these objectives guide the model to better learn specific musical knowledge such as pitch intervals. For evaluation, we propose a benchmark that incorporates 12 downstream tasks ranging from chord estimation to symbolic genre classification. Results confirm the effectiveness of the proposed pre-training objectives on downstream tasks.

Submitted 7 July, 2025; originally announced July 2025.

Comments: Accepted at ISMIR 2025

arXiv:2507.02666

ASDA: Audio Spectrogram Differential Attention Mechanism for Self-Supervised Representation Learning

Authors: Junyu Wang, Tianrui Wang, Meng Ge, Longbiao Wang, Jianwu Dang

Abstract: In recent advancements in audio self-supervised representation learning, the standard Transformer architecture has emerged as the predominant approach, yet its attention mechanism often allocates a portion of attention weights to irrelevant information, potentially impairing the model's discriminative ability. To address this, we introduce a differential attention mechanism, which effectively mitigates ineffective attention allocation through the integration of dual-softmax operations and appropriately tuned differential coefficients. Experimental results demonstrate that our ASDA model achieves state-of-the-art (SOTA) performance across multiple benchmarks, including audio classification (49.0% mAP on AS-2M, 41.5% mAP on AS20K), keyword spotting (98.3% accuracy on SPC-2), and environmental sound classification (96.1% accuracy on ESC-50). These results highlight ASDA's effectiveness in audio tasks, paving the way for broader applications.
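Editorial illustration (not taken from the paper): the dual-softmax idea in the abstract above can be pictured as subtracting a scaled second softmax map from a first one. The sketch below assumes that general form; the function names, the toy scores, and the coefficient `lam` are hypothetical, and the actual ASDA formulation may differ.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def differential_weights(scores_1, scores_2, lam):
    """Row by row, subtract lam times the second softmax map from the first.

    Assumed form: softmax(scores_1) - lam * softmax(scores_2).
    The resulting weights may be negative and each row sums to 1 - lam.
    """
    out = []
    for r1, r2 in zip(scores_1, scores_2):
        p1, p2 = softmax(r1), softmax(r2)
        out.append([a - lam * b for a, b in zip(p1, p2)])
    return out


# Toy scores for one query over three keys (hypothetical values).
s1 = [[2.0, 0.5, 0.5]]
s2 = [[0.0, 0.5, 0.5]]
w = differential_weights(s1, s2, lam=0.5)
print(w)
```

In this toy case the weight that both maps place on the second and third keys is largely cancelled, while the first key keeps most of its weight.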

Submitted 3 July, 2025; originally announced July 2025.

Comments: Accepted at Interspeech2025

arXiv:2507.02374

Predictive Control over LAWN: Joint Trajectory Design and Resource Allocation

Authors: Haijia Jin, Jun Wu, Weijie Yuan, Ruizhi Ruan, Jiacheng Wang, Dusit Niyato, Dong In Kim, Abbas Jamalipour

Abstract: Low-altitude wireless networks (LAWNs) have been envisioned as flexible and transformative platforms for enabling delay-sensitive control applications in Internet of Things (IoT) systems. In this work, we investigate the real-time wireless control over a LAWN system, where an aerial drone is employed to serve multiple mobile automated guided vehicles (AGVs) via finite blocklength (FBL) transmission. Toward this end, we adopt the model predictive control (MPC) to ensure accurate trajectory tracking, while we analyze the communication reliability using the outage probability. Subsequently, we formulate an optimization problem to jointly determine control policy, transmit power allocation, and drone trajectory by accounting for the maximum travel distance and control input constraints. To address the resultant non-convex optimization problem, we first derive the closed-form expression of the outage probability under FBL transmission. Based on this, we reformulate the original problem as a quadratic programming (QP) problem, followed by developing an alternating optimization (AO) framework. Specifically, we employ the projected gradient descent (PGD) method and the successive convex approximation (SCA) technique to achieve computationally efficient sub-optimal solutions. Furthermore, we thoroughly analyze the convergence and computational complexity of the proposed algorithm. Extensive simulations and AirSim-based experiments are conducted to validate the superiority of our proposed approach compared to the baseline schemes in terms of control performance.

Submitted 3 July, 2025; originally announced July 2025.

arXiv:2507.01360

MmBack: Clock-free Multi-Sensor Backscatter with Synchronous Acquisition and Multiplexing

Authors: Yijie Li, Weichong Ling, Taiting Lu, Yi-Chao Chen, Vaishnavi Ranganathan, Lili Qiu, Jingxian Wang

Abstract: Backscatter tags provide a low-power solution for sensor applications, yet many real-world scenarios require multiple sensors, often of different types, for complex sensing tasks. However, existing designs support only a single sensor per tag, increasing spatial overhead. State-of-the-art approaches to multiplexing multiple sensor streams on a single tag rely on onboard clocks or multiple modulation chains, which add cost, enlarge form factor, and remain prone to timing drift, disrupting synchronization across sensors. We present mmBack, a low-power, clock-free backscatter tag that enables synchronous multi-sensor data acquisition and multiplexing over a single modulation chain. mmBack synchronizes sensor inputs in parallel using a shared reference signal extracted from ambient RF excitation, eliminating the need for an onboard timing source. To efficiently multiplex sensor data, mmBack designs a voltage-division scheme to multiplex multiple sensor inputs as backscatter frequency shifts through a single oscillator and RF switch. At the receiver, mmBack develops a frequency tracking algorithm and a finite-state machine for accurate demultiplexing. mmBack's ASIC design consumes 25.56uW, while its prototype supports 5 concurrent sensor streams with bandwidths of up to 5kHz and 3 concurrent sensor streams with bandwidth of up to 18kHz. Evaluation shows that mmBack achieves an average SNR surpassing 15dB in signal reconstruction.

Submitted 2 July, 2025; originally announced July 2025.

Comments: 16 pages, 14 figures

arXiv:2507.00398

Accurate and Efficient Fetal Birth Weight Estimation from 3D Ultrasound

Authors: Jian Wang, Qiongying Ni, Hongkui Yu, Ruixuan Yao, Jinqiao Ying, Bin Zhang, Xingyi Yang, Jin Peng, Jiongquan Chen, Junxuan Yu, Wenlong Shi, Chaoyu Chen, Zhongnuo Yan, Mingyuan Luo, Gaocheng Cai, Dong Ni, Jing Lu, Xin Yang

Abstract: Accurate fetal birth weight (FBW) estimation is essential for optimizing delivery decisions and reducing perinatal mortality. However, clinical methods for FBW estimation are inefficient, operator-dependent, and challenging to apply in cases of complex fetal anatomy. Existing deep learning methods are based on 2D standard ultrasound (US) images or videos that lack spatial information, limiting their prediction accuracy. In this study, we propose the first method for directly estimating FBW from 3D fetal US volumes. Our approach integrates a multi-scale feature fusion network (MFFN) and a synthetic sample-based learning framework (SSLF). The MFFN effectively extracts and fuses multi-scale features under sparse supervision by incorporating channel attention, spatial attention, and a ranking-based loss function. SSLF generates synthetic samples by simply combining fetal head and abdomen data from different fetuses, utilizing semi-supervised learning to improve prediction performance. Experimental results demonstrate that our method achieves superior performance, with a mean absolute error of $166.4\pm155.9$ $g$ and a mean absolute percentage error of $5.1\pm4.6$%, outperforming existing methods and approaching the accuracy of a senior doctor. Code is available at: https://github.com/Qioy-i/EFW.

Submitted 30 June, 2025; originally announced July 2025.

Comments: Accepted by MICCAI 2025

arXiv:2507.00366

Wireless AI Evolution: From Statistical Learners to Electromagnetic-Guided Foundation Models

Authors: Jian Xiao, Ji Wang, Kunrui Cao, Xingwang Li, Zhao Chen, Chau Yuen

Abstract: While initial applications of artificial intelligence (AI) in wireless communications over the past decade have demonstrated considerable potential using specialized models for targeted communication tasks, the revolutionary demands of sixth-generation (6G) networks for holographic communications, ubiquitous sensing, and native intelligence are propelling a necessary evolution towards AI-native wireless networks. The arrival of large AI models paves the way for the next phase of Wireless AI, driven by wireless foundation models (WFMs). In particular, pre-training on universal electromagnetic (EM) principles equips WFMs with the essential adaptability for a multitude of demanding 6G applications. However, existing large AI models face critical limitations, including pre-training strategies disconnected from EM-compliant constraints leading to physically inconsistent predictions, a lack of embedded understanding of wave propagation physics, and the inaccessibility of massive labeled datasets for comprehensive EM-aware training. To address these challenges, this article presents an electromagnetic information theory-guided self-supervised pre-training (EIT-SPT) framework designed to systematically inject EM physics into WFMs. The EIT-SPT framework aims to infuse WFMs with intrinsic EM knowledge, thereby enhancing their physical consistency, generalization capabilities across varied EM landscapes, and overall data efficiency. Building upon the proposed EIT-SPT framework, this article first elaborates on diverse potential applications in 6G scenarios of WFMs, then validates the efficacy of the proposed framework through illustrative case studies, and finally summarizes critical open research challenges and future directions for WFMs.

Submitted 30 June, 2025; originally announced July 2025.

arXiv:2506.23874

URGENT-PK: Perceptually-Aligned Ranking Model Designed for Speech Enhancement Competition

Authors: Jiahe Wang, Chenda Li, Wei Wang, Wangyou Zhang, Samuele Cornell, Marvin Sach, Robin Scheibler, Kohei Saijo, Yihui Fu, Zhaoheng Ni, Anurag Kumar, Tim Fingscheidt, Shinji Watanabe, Yanmin Qian

Abstract: The Mean Opinion Score (MOS) is fundamental to speech quality assessment. However, its acquisition requires significant human annotation. Although deep neural network approaches, such as DNSMOS and UTMOS, have been developed to predict MOS to avoid this issue, they often suffer from insufficient training data. Recognizing that the comparison of speech enhancement (SE) systems prioritizes a reliable system comparison over absolute scores, we propose URGENT-PK, a novel ranking approach leveraging pairwise comparisons. URGENT-PK takes homologous enhanced speech pairs as input to predict relative quality rankings. This pairwise paradigm efficiently utilizes limited training data, as all pairwise permutations of multiple systems constitute a training instance. Experiments across multiple open test sets demonstrate URGENT-PK's superior system-level ranking performance over state-of-the-art baselines, despite its simple network architecture and limited training data.
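Editorial illustration (not taken from the paper): one way to read the pairwise paradigm described above is that the enhanced outputs of N systems for the same utterance yield N(N-1) ordered pairs, each usable as a training example. The sketch below assumes that reading; the system names, scores, and the 1.0 / 0.0 / 0.5 labelling scheme are hypothetical.

```python
from itertools import permutations


def pairwise_instances(scores):
    """Build ordered comparison pairs from per-system quality scores.

    `scores` maps a (hypothetical) system name to its quality score for one
    utterance. Each ordered pair (a, b) is labelled 1.0 if a scored higher
    than b, 0.0 if lower, and 0.5 for a tie (an assumed labelling scheme).
    """
    pairs = []
    for a, b in permutations(scores, 2):
        if scores[a] > scores[b]:
            label = 1.0
        elif scores[a] < scores[b]:
            label = 0.0
        else:
            label = 0.5
        pairs.append((a, b, label))
    return pairs


example = {"sys_A": 3.8, "sys_B": 3.1, "sys_C": 4.2}
pairs = pairwise_instances(example)
print(len(pairs))  # 3 systems give 3 * 2 = 6 ordered pairs
```

Because the pair count grows quadratically with the number of systems, a small set of scored utterances already produces many comparisons.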

Submitted 30 June, 2025; originally announced June 2025.

Comments: Submitted to ASRU2025

arXiv:2506.23557

Data-Driven Modulation Optimization with LMMSE Equalization for Reliability Enhancement in Underwater Acoustic Communications

Authors: Xuehan Wang, Hengyu Zhang, Jintao Wang, Zhi Sun, Bo Ai

Abstract: Ultra-reliable underwater acoustic (UWA) communications serve as one of the key enabling technologies for future space-air-ground-underwater integrated networks. However, the reliability of current UWA transmission is still insufficient since severe performance degradation occurs for conventional multicarrier systems in UWA channels with severe delay-scale spread. To solve this problem, we exploit learning-inspired approaches to optimize the modulation scheme under the assumption of linear minimum mean square error (LMMSE) equalization, where the discrete representation of waveforms is adopted by utilizing Nyquist filters. The optimization problem is first transferred into maximizing the fairness of estimation mean square error (MSE) for each data symbol since the total MSE is invariant considering the property of orthogonal modulation. The Siamese architecture is then adopted to obtain consistent optimization results across various channel conditions, which avoids the overhead of online feedback, cooperation, and deployment of neural networks and guarantees generalization. The overall scheme including the loss function, neural network structure, and training process is also investigated in depth in this paper. The excellent performance and robustness of the proposed modulation scheme are verified by carrying out the bit error rate test over various UWA channels with severe delay-scale spread.

Submitted 30 June, 2025; originally announced June 2025.

Comments: 6 pages, 3 figures. This paper has been accepted for presentation in IEEE/CIC ICCC 2025

arXiv:2506.23493

Securing the Sky: Integrated Satellite-UAV Physical Layer Security for Low-Altitude Wireless Networks

Authors: Jiahui Li, Geng Sun, Xiaoyu Sun, Fang Mei, Jingjing Wang, Xiangwang Hou, Daxin Tian, Victor C. M. Leung

Abstract: Low-altitude wireless networks (LAWNs) have garnered significant attention in the forthcoming 6G networks. In LAWNs, satellites with wide coverage and unmanned aerial vehicles (UAVs) with flexible mobility can complement each other to form integrated satellite-UAV networks, providing ubiquitous and high-speed connectivity for low-altitude operations. However, the higher line-of-sight probability in low-altitude airspace increases transmission security concerns. In this work, we present a collaborative beamforming-based physical layer security scheme for LAWNs. We introduce the fundamental aspects of integrated satellite-UAV networks, physical layer security, UAV swarms, and collaborative beamforming for LAWN applications. Following this, we highlight several opportunities for collaborative UAV swarm secure applications enabled by satellite networks, including achieving physical layer security in scenarios involving data dissemination, data relay, eavesdropper collusion, and imperfect eavesdropper information. Next, we detail two case studies: a secure relay system and a two-way aerial secure communication framework specifically designed for LAWN environments. Simulation results demonstrate that these physical layer security schemes are effective and beneficial for secure low-altitude wireless communications. A short practicality analysis shows that the proposed method is applicable to LAWN scenarios. Finally, we discuss current challenges and future research directions for enhancing security in LAWNs.

Submitted 29 June, 2025; originally announced June 2025.

Comments: This paper has been submitted to IEEE Wireless Communications

arXiv:2506.23203

Multi-Branch DNN and CRLB-Ratio-Weight Fusion for Enhanced DOA Sensing via a Massive H$^2$AD MIMO Receiver

Authors: Feng Shu, Jiatong Bai, Di Wu, Wei Zhu, Bin Deng, Fuhui Zhou, Jiangzhou Wang

Abstract: As a green MIMO structure, massive H$^2$AD is viewed as a potential technology for the future 6G wireless network. For such a structure, it is a challenging task to design a low-complexity and high-performance fusion of target direction values sensed by different sub-array groups with fewer use of prior knowledge. To address this issue, a lightweight Cramer-Rao lower bound (CRLB)-ratio-weight fusion (WF) method is proposed, which approximates inverse CRLB of each subarray using antenna number reciprocals to eliminate real-time CRLB computation. This reduces complexity and prior knowledge dependence while preserving fusion performance. Moreover, a multi-branch deep neural network (MBDNN) is constructed to further enhance direction-of-arrival (DOA) sensing by leveraging candidate angles from multiple subarrays. The subarray-specific branch networks are integrated with a shared regression module to effectively eliminate pseudo-solutions and fuse true angles. Simulation results show that the proposed CRLB-ratio-WF method achieves DOA sensing performance comparable to CRLB-based methods, while significantly reducing the reliance on prior knowledge. More notably, the proposed MBDNN has superior performance in low-SNR ranges. At SNR $= -15$ dB, it achieves an order-of-magnitude improvement in estimation accuracy compared to CRLB-ratio-WF method.

Submitted 29 June, 2025; originally announced June 2025.

arXiv:2506.22646

Speaker Targeting via Self-Speaker Adaptation for Multi-talker ASR

Authors: Weiqing Wang, Taejin Park, Ivan Medennikov, Jinhan Wang, Kunal Dhawan, He Huang, Nithin Rao Koluguri, Jagadeesh Balam, Boris Ginsburg

Abstract: We propose a self-speaker adaptation method for streaming multi-talker automatic speech recognition (ASR) that eliminates the need for explicit speaker queries. Unlike conventional approaches requiring target speaker embeddings or enrollment audio, our technique dynamically adapts individual ASR instances through speaker-wise speech activity prediction. The key innovation involves injecting speaker-specific kernels generated via speaker supervision activations into selected ASR encoder layers. This enables instantaneous speaker adaptation to target speakers while handling fully overlapped speech even in a streaming scenario. Experiments show state-of-the-art performance in both offline and streaming scenarios, demonstrating that our self-adaptive method effectively addresses severe speech overlap through streamlined speaker-focused recognition. The results validate the proposed self-speaker adaptation approach as a robust solution for multi-talker ASR under severe overlapping speech conditions.

Submitted 27 June, 2025; originally announced June 2025.

Comments: Accepted by INTERSPEECH 2025

arXiv:2506.22277 [pdf, ps, other]

A Self-scaled Approximate $\ell_0$ Regularization Robust Model for Outlier Detection

Authors: Pengyang Song, Jue Wang

Abstract: Robust regression models in the presence of outliers have significant practical relevance in areas such as signal processing, financial econometrics, and energy management. Many existing robust regression methods, either grounded in statistical theory or sparse signal recovery, typically rely on the explicit or implicit assumption of outlier sparsity to filter anomalies and recover the underlying signal or data. However, these methods often suffer from limited robustness or high computational complexity, rendering them inefficient for large-scale problems. In this work, we propose a novel robust regression model based on a Self-scaled Approximate $\ell_0$ Regularization Model (SARM) scheme. By introducing a self-scaling mechanism into the regularization term, the proposed model mitigates the negative impact of uneven or excessively large outlier magnitudes on robustness. We also develop an alternating minimization algorithm grounded in Proximal Operators and Block Coordinate Descent. We rigorously prove the algorithm's convergence. Empirical comparisons with several state-of-the-art robust regression methods demonstrate that SARM not only achieves superior robustness but also significantly improves computational efficiency. Motivated by both the theoretical error bound and empirical observations, we further design a Two-Stage SARM (TSSARM) framework, which better utilizes sample information when the singular values of the design matrix are widely spread, thereby enhancing robustness under certain conditions. Finally, we validate our approach on a real-world load forecasting task. The experimental results show that our method substantially enhances the robustness of load forecasting against adversarial data attacks, which is increasingly critical in the era of heightened data security concerns.

Submitted 27 June, 2025; originally announced June 2025.

arXiv:2506.21796 [pdf, ps, other]

Demonstrating Interoperable Channel State Feedback Compression with Machine Learning

Authors: Dani Korpi, Rachel Wang, Jerry Wang, Abdelrahman Ibrahim, Carl Nuzman, Runxin Wang, Kursat Rasim Mestav, Dustin Zhang, Iraj Saniee, Shawn Winston, Gordana Pavlovic, Wei Ding, William J. Hillery, Chenxi Hao, Ram Thirunagari, Jung Chang, Jeehyun Kim, Bartek Kozicki, Dragan Samardzija, Taesang Yoo, Andreas Maeder, Tingfang Ji, Harish Viswanathan

Abstract: Neural network-based compression and decompression of channel state feedback has been one of the most widely studied applications of machine learning (ML) in wireless networks. Various simulation-based studies have shown that ML-based feedback compression can result in reduced overhead and more accurate channel information. However, to the best of our knowledge, there are no real-life proofs of concept demonstrating the benefits of ML-based channel feedback compression in a practical setting, where the user equipment (UE) and base station have no access to each other's ML models. In this paper, we present a novel approach for training interoperable compression and decompression ML models in a confidential manner, and demonstrate the accuracy of the ensuing models using prototype UEs and base stations. The performance of the ML-based channel feedback is measured both in terms of the accuracy of the reconstructed channel information and achieved downlink throughput gains when using the channel information for beamforming. The reported measurement results demonstrate that it is possible to develop an accurate ML-based channel feedback link without having to share ML models between device and network vendors. These results pave the way for a practical implementation of ML-based channel feedback in commercial 6G networks.

Submitted 26 June, 2025; originally announced June 2025.

Comments: This work has been submitted to the IEEE for possible publication

arXiv:2506.21619 [pdf, other]

IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech

Authors: Siyi Zhou, Yiquan Zhou, Yi He, Xun Zhou, Jinchao Wang, Wei Deng, Jingchen Shu

Abstract: Large-scale text-to-speech (TTS) models are typically categorized into autoregressive and non-autoregressive systems. Although autoregressive systems exhibit certain advantages in speech naturalness, their token-by-token generation mechanism makes it difficult to precisely control the duration of synthesized speech. This is a key limitation in applications such as video dubbing that require strict audio-visual synchronization. This paper introduces IndexTTS2, which proposes a novel and autoregressive-model-friendly method for speech duration control. The method supports two generation modes: one allows explicit specification of the number of generated tokens for precise duration control; the other does not require manual input and lets the model freely generate speech while preserving prosodic characteristics from the input prompt. Furthermore, IndexTTS2 achieves disentanglement between emotional expression and speaker identity, enabling independent control of timbre and emotion. In the zero-shot setting, the model can perfectly reproduce the emotional characteristics of the input prompt. Users may also provide a separate emotion prompt, even from a different speaker, allowing the model to reconstruct the target timbre while conveying the desired emotion. To enhance clarity during strong emotional expressions, we incorporate GPT latent representations to improve speech stability. Meanwhile, to lower the barrier for emotion control, we design a soft instruction mechanism based on textual descriptions by fine-tuning Qwen3. This enables effective guidance of speech generation with desired emotional tendencies using natural language input. Experimental results demonstrate that IndexTTS2 outperforms existing state-of-the-art zero-shot TTS models in word error rate, speaker similarity, and emotional fidelity.

Submitted 23 June, 2025; originally announced June 2025.

arXiv:2506.21448 [pdf, ps, other]

ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing

Authors: Huadai Liu, Jialei Wang, Kaicheng Luo, Wen Wang, Qian Chen, Zhou Zhao, Wei Xue

Abstract: While end-to-end video-to-audio generation has greatly improved, producing high-fidelity audio that authentically captures the nuances of visual content remains challenging. As with the work of professionals in the creative industries, such generation requires sophisticated reasoning about items such as visual dynamics, acoustic environments, and temporal relationships. We present ThinkSound, a novel framework that leverages Chain-of-Thought (CoT) reasoning to enable stepwise, interactive audio generation and editing for videos. Our approach decomposes the process into three complementary stages: foundational foley generation that creates semantically coherent soundscapes, interactive object-centric refinement through precise user interactions, and targeted editing guided by natural language instructions. At each stage, a multimodal large language model generates contextually aligned CoT reasoning that guides a unified audio foundation model. Furthermore, we introduce AudioCoT, a comprehensive dataset with structured reasoning annotations that establishes connections between visual content, textual descriptions, and sound synthesis. Experiments demonstrate that ThinkSound achieves state-of-the-art performance in video-to-audio generation across both audio metrics and CoT metrics and excels on the out-of-distribution Movie Gen Audio benchmark. The demo page is available at https://ThinkSound-Project.github.io.

Submitted 28 June, 2025; v1 submitted 26 June, 2025; originally announced June 2025.

arXiv:2506.20244 [pdf, ps, other]

Cooperative Sensing and Communication Beamforming Design for Low-Altitude Economy

Authors: Fangzhi Li, Zhichu Ren, Cunhua Pan, Hong Ren, Jing Jin, Qixing Wang, Jiangzhou Wang

Abstract: To empower the low-altitude economy with high-accuracy sensing and high-rate communication, this paper proposes a cooperative integrated sensing and communication (ISAC) framework for aerial-ground networks. In the proposed system, the ground base stations (BSs) cooperatively serve the unmanned aerial vehicles (UAVs), which are equipped for either joint communication and sensing or sensing-only operations. The BSs employ coordinated beamforming to simultaneously transmit communication and sensing signals, while the UAVs execute their missions. To maximize the weighted sum rate under the sensing signal-to-interference-plus-noise ratio (SINR) constraints, we jointly optimize the transmit beamforming, receive filtering, and UAV trajectory. The resulting non-convex problem is solved using an alternating optimization framework incorporating semidefinite relaxation (SDR) and successive convex approximation (SCA). Simulation results demonstrate that the proposed joint design achieves higher communication throughput while ensuring required sensing robustness. Additionally, the sensing SINR threshold and the UAV altitude have a significant impact on the trajectory design, highlighting the necessity of adaptive deployment strategies in practical applications.

Submitted 25 June, 2025; originally announced June 2025.

arXiv:2506.19975 [pdf, ps, other]

VoxelOpt: Voxel-Adaptive Message Passing for Discrete Optimization in Deformable Abdominal CT Registration

Authors: Hang Zhang, Yuxi Zhang, Jiazheng Wang, Xiang Chen, Renjiu Hu, Xin Tian, Gaolei Li, Min Liu

Abstract: Recent developments in neural networks have improved deformable image registration (DIR) by amortizing iterative optimization, enabling fast and accurate DIR results. However, learning-based methods often face challenges with limited training data and large deformations, and tend to underperform compared to iterative approaches when label supervision is unavailable. While iterative methods can achieve higher accuracy in such scenarios, they are considerably slower than learning-based methods. To address these limitations, we propose VoxelOpt, a discrete optimization-based DIR framework that combines the strengths of learning-based and iterative methods to achieve a better balance between registration accuracy and runtime. VoxelOpt uses displacement entropy from local cost volumes to measure displacement signal strength at each voxel, which differs from earlier approaches in three key aspects. First, it introduces voxel-wise adaptive message passing, where voxels with lower entropy receive less influence from their neighbors. Second, it employs a multi-level image pyramid with 27-neighbor cost volumes at each level, avoiding exponential complexity growth. Third, it replaces hand-crafted features or contrastive learning with a pretrained foundational segmentation model for feature extraction. In abdominal CT registration, these changes allow VoxelOpt to outperform leading iterative methods in both efficiency and accuracy, while matching state-of-the-art learning-based methods trained with label supervision. The source code will be available at https://github.com/tinymilky/VoxelOpt
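
Illustrative sketch (not taken from the VoxelOpt paper or its repository): the abstract describes "displacement entropy from local cost volumes" as a per-voxel confidence signal. One plausible reading, assumed here, is the Shannon entropy of a softmax over the negated matching costs of the 27 candidate displacements. The function name `displacement_entropy` and the `temperature` parameter are hypothetical, and the authors' actual formulation may differ.

```python
import math


def displacement_entropy(costs, temperature=1.0):
    """Shannon entropy of softmax(-cost / temperature) over candidate displacements.

    Assumption: lower cost means a better match, so costs are negated before
    the softmax. This is a generic construction, not the paper's definition.
    """
    scaled = [-c / temperature for c in costs]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# 27 candidates, matching the 27-neighbor cost volume mentioned in the abstract.
flat = [1.0] * 27      # no candidate stands out: entropy is maximal
peaked = [10.0] * 27
peaked[13] = 0.0       # one clearly best candidate: entropy is near zero

print(displacement_entropy(flat), displacement_entropy(peaked))
```

Under this reading, a voxel with a sharply peaked cost volume has low entropy, which is consistent with the abstract's statement that low-entropy voxels receive less influence from their neighbors.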

Submitted 24 June, 2025; originally announced June 2025.

Comments: Accepted for publication at MICCAI 2025

arXiv:2506.19774 [pdf, ps, other]

Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation

Authors: Jun Wang, Xijuan Zeng, Chunyu Qiang, Ruilong Chen, Shiyao Wang, Le Wang, Wangjing Zhou, Pengfei Cai, Jiahui Zhao, Nan Li, Zihan Li, Yuzhe Liang, Xiaopeng Wang, Haorui Zheng, Ming Wen, Kang Yin, Yiran Wang, Nan Li, Feng Deng, Liang Dong, Chen Zhang, Di Zhang, Kun Gai

Abstract: We propose Kling-Foley, a large-scale multimodal Video-to-Audio generation model that synthesizes high-quality audio synchronized with video content. In Kling-Foley, we introduce multimodal diffusion transformers to model the interactions between video, audio, and text modalities, and combine them with a visual semantic representation module and an audio-visual synchronization module to enhance alignment capabilities. Specifically, these modules align video conditions with latent audio elements at the frame level, thereby improving semantic alignment and audio-visual synchronization. Together with text conditions, this integrated approach enables precise generation of video-matching sound effects. In addition, we propose a universal latent audio codec that can achieve high-quality modeling in various scenarios such as sound effects, speech, singing, and music. We employ a stereo rendering method that imbues synthesized audio with a spatial presence. At the same time, to compensate for the incomplete categories and annotations of existing open-source benchmarks, we also open-source an industrial-level benchmark, Kling-Audio-Eval. Our experiments show that Kling-Foley trained with the flow matching objective achieves new audio-visual SOTA performance among public models in terms of distribution matching, semantic alignment, temporal alignment and audio quality.

Submitted 24 June, 2025; originally announced June 2025.

arXiv:2506.19376 [pdf, ps, other]

Holographic Communication via Recordable and Reconfigurable Metasurface

Authors: Jinzhe Wang, Qinghua Guo, Xiaojun Yuan

Abstract: Holographic surface based communication technologies are anticipated to play a significant role in the next generation of wireless networks. The existing reconfigurable holographic surface (RHS)-based scheme only utilizes the reconstruction process of the holographic principle for beamforming, where the channel state information (CSI) is needed. However, channel estimation for CSI acquisition is a challenging task in metasurface based communications. In this study, inspired by both the recording and reconstruction processes of holography, we develop a novel holographic communication scheme by introducing recordable and reconfigurable metasurfaces (RRMs), where channel estimation is not needed thanks to the recording process. Then we analyze the input-output mutual information of the RRM-based communication system and compare it with the existing RHS based system. Our results show that, without channel estimation, the proposed scheme achieves performance comparable to that of the RHS scheme with perfect CSI, suggesting a promising alternative for future wireless communication networks.

Submitted 24 June, 2025; originally announced June 2025.

arXiv:2506.18680 [pdf, ps, other]

DuetGen: Music Driven Two-Person Dance Generation via Hierarchical Masked Modeling

Authors: Anindita Ghosh, Bing Zhou, Rishabh Dabral, Jian Wang, Vladislav Golyanik, Christian Theobalt, Philipp Slusallek, Chuan Guo

Abstract: We present DuetGen, a novel framework for generating interactive two-person dances from music. The key challenge of this task lies in the inherent complexities of two-person dance interactions, where the partners need to synchronize both with each other and with the music. Inspired by the recent advances in motion synthesis, we propose a two-stage solution: encoding two-person motions into discrete tokens and then generating these tokens from music. To effectively capture intricate interactions, we represent both dancers' motions as a unified whole to learn the necessary motion tokens, and adopt a coarse-to-fine learning strategy in both stages. Our first stage utilizes a VQ-VAE that hierarchically separates high-level semantic features at a coarse temporal resolution from low-level details at a finer resolution, producing two discrete token sequences at different abstraction levels. Subsequently, in the second stage, two generative masked transformers learn to map music signals to these dance tokens: the first producing high-level semantic tokens, and the second, conditioned on music and these semantic tokens, producing the low-level tokens. We train both transformers to predict randomly masked tokens within the sequence, enabling them to iteratively generate motion tokens by filling an empty token sequence during inference. Through the hierarchical masked modeling and dedicated interaction representation, DuetGen achieves the generation of synchronized and interactive two-person dances across various genres. Extensive experiments and user studies on a benchmark duet dance dataset demonstrate state-of-the-art performance of DuetGen in motion realism, music-dance alignment, and partner coordination.

Submitted 23 June, 2025; originally announced June 2025.

Comments: 11 pages, 7 figures, 2 tables, accepted in ACM Siggraph 2025 conference track

arXiv:2506.18067 [pdf, ps, other]

Cooperative Bistatic ISAC Systems for Low-Altitude Economy

Authors: Zhenkun Zhang, Yining Xu, Cunhua Pan, Hong Ren, Yiming Yu, Jiangzhou Wang

Abstract: The burgeoning low-altitude economy (LAE) necessitates integrated sensing and communication (ISAC) systems capable of high-accuracy multi-target localization and velocity estimation under hardware and coverage constraints inherent in conventional ISAC architectures. This paper addresses these challenges by proposing a cooperative bistatic ISAC framework within MIMO-OFDM cellular networks, enabling robust sensing services for LAE applications through standardized 5G New Radio (NR) infrastructure. We first develop a low-complexity parameter extraction algorithm employing CANDECOMP/PARAFAC (CP) tensor decomposition, which exploits the inherent Vandermonde structure in delay-related factor matrices to efficiently recover bistatic ranges, Doppler velocities, and angles-of-arrival (AoA) from multi-dimensional received signal tensors. To resolve data association ambiguity across distributed transmitter-receiver pairs and mitigate erroneous estimates, we further design a robust fusion scheme based on the minimum spanning tree (MST) method, enabling joint 3D position and velocity reconstruction. Comprehensive simulation results validate the framework's superiority in computational efficiency and sensing performance for low-altitude scenarios.

Submitted 22 June, 2025; originally announced June 2025.

arXiv:2506.17184 [pdf, ps, other]

Judo: A User-Friendly Open-Source Package for Sampling-Based Model Predictive Control

Authors: Albert H. Li, Brandon Hung, Aaron D. Ames, Jiuguang Wang, Simon Le Cleac'h, Preston Culbertson

Abstract: Recent advancements in parallel simulation and successful robotic applications are spurring a resurgence in sampling-based model predictive control. To build on this progress, however, the robotics community needs common tooling for prototyping, evaluating, and deploying sampling-based controllers. We introduce Judo, a software package designed to address this need. To facilitate rapid prototyping and evaluation, Judo provides robust implementations of common sampling-based MPC algorithms and standardized benchmark tasks. It further emphasizes usability with simple but extensible interfaces for controller and task definitions, asynchronous execution for straightforward simulation-to-hardware transfer, and a highly customizable GUI for tuning controllers interactively. While written in Python, the software leverages MuJoCo as its physics backend to achieve real-time performance, which we validate across both consumer and server-grade hardware. Code at https://github.com/bdaiinstitute/judo.

Submitted 20 June, 2025; originally announced June 2025.

Comments: Accepted at the 2025 RSS Workshop on Fast Motion Planning and Control in the Era of Parallelism. 5 Pages

arXiv:2506.16173 [pdf, ps, other]

Single-Microphone-Based Sound Source Localization for Mobile Robots in Reverberant Environments

Authors: Jiang Wang, Runwu Shi, Benjamin Yen, He Kong, Kazuhiro Nakadai

Abstract: Accurately estimating sound source positions is crucial for robot audition. However, existing sound source localization methods typically rely on a microphone array with at least two spatially preconfigured microphones. This requirement hinders the applicability of microphone-based robot audition systems and technologies. To alleviate these challenges, we propose an online sound source localization method that uses a single microphone mounted on a mobile robot in reverberant environments. Specifically, we develop a lightweight neural network model with only 43k parameters to perform real-time distance estimation by extracting temporal information from reverberant signals. The estimated distances are then processed using an extended Kalman filter to achieve online sound source localization. To the best of our knowledge, this is the first work to achieve online sound source localization using a single microphone on a moving robot. Extensive experiments demonstrate the effectiveness and merits of our approach. To benefit the broader research community, we have open-sourced our code at https://github.com/JiangWAV/single-mic-SSL.

Submitted 19 June, 2025; originally announced June 2025.

Comments: This paper has been accepted and will appear in the 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

arXiv:2506.15835 [pdf, ps, other]

doi 10.1109/TNNLS.2025.3573210

MoNetV2: Enhanced Motion Network for Freehand 3D Ultrasound Reconstruction

Authors: Mingyuan Luo, Xin Yang, Zhongnuo Yan, Yan Cao, Yuanji Zhang, Xindi Hu, Jin Wang, Haoxuan Ding, Wei Han, Litao Sun, Dong Ni

Abstract: Three-dimensional (3D) ultrasound (US) aims to provide sonographers with the spatial relationships of anatomical structures, playing a crucial role in clinical diagnosis. Recently, deep-learning-based freehand 3D US has made significant advancements. It reconstructs volumes by estimating transformations between images without external tracking. However, image-only reconstruction poses difficulties in reducing cumulative drift and further improving reconstruction accuracy, particularly in scenarios involving complex motion trajectories. In this context, we propose an enhanced motion network (MoNetV2) to enhance the accuracy and generalizability of reconstruction under diverse scanning velocities and tactics. First, we propose a sensor-based temporal and multi-branch structure that fuses image and motion information from a velocity perspective to improve image-only reconstruction accuracy. Second, we devise an online multi-level consistency constraint that exploits the inherent consistency of scans to handle various scanning velocities and tactics. This constraint exploits scan-level velocity consistency, path-level appearance consistency, and patch-level motion consistency to supervise inter-frame transformation estimation. Third, we distill an online multi-modal self-supervised strategy that leverages the correlation between network estimation and motion information to further reduce cumulative errors. Extensive experiments clearly demonstrate that MoNetV2 surpasses existing methods in both reconstruction quality and generalizability performance across three large datasets.

Submitted 16 June, 2025; originally announced June 2025.

arXiv:2506.12308 [pdf, ps, other]

From Ground to Sky: Architectures, Applications, and Challenges Shaping Low-Altitude Wireless Networks

Authors: Weijie Yuan, Yuanhao Cui, Jiacheng Wang, Fan Liu, Geng Sun, Tao Xiang, Jie Xu, Shi Jin, Dusit Niyato, Sinem Coleri, Sumei Sun, Shiwen Mao, Abbas Jamalipour, Dong In Kim, Mohamed-Slim Alouini, Xuemin Shen

Abstract: In this article, we introduce a novel low-altitude wireless network (LAWN), which is a reconfigurable, three-dimensional (3D) layered architecture. In particular, the LAWN integrates connectivity, sensing, control, and computing across aerial and terrestrial nodes that enable seamless operation in complex, dynamic, and mission-critical environments. Different from the conventional aerial communication systems, LAWN's distinctive feature is its tight integration of functional planes in which multiple functionalities continually reshape themselves to operate safely and efficiently in the low-altitude sky. With the LAWN, we discuss several enabling technologies, such as integrated sensing and communication (ISAC), semantic communication, and fully-actuated control systems. Finally, we identify potential applications and key cross-layer challenges. This article offers a comprehensive roadmap for future research and development in the low-altitude airspace.

Submitted 16 June, 2025; v1 submitted 13 June, 2025; originally announced June 2025.

Comments: 10 pages, 5 figures

arXiv:2506.11443 [pdf, ps, other]

Hadamard Encoded Row Column Ultrasonic Expansive Scanning (HERCULES) with Bias-Switchable Row-Column Arrays

Authors: Darren Olufemi Dahunsi, Randy Palmar, Tyler Henry, Mohammad Rahim Sobhani, Negar Majidi, Joy Wang, Afshin Kashani Ilkhechi, Jeremy Brown, Roger Zemp

Abstract: Top-Orthogonal-to-Bottom-Electrode (TOBE) arrays, also known as bias-switchable row-column arrays (RCAs), allow for imaging techniques otherwise impossible for non-bias-switchable RCAs. Hadamard Encoded Row Column Ultrasonic Expansive Scanning (HERCULES) is a novel imaging technique that allows for expansive 3D scanning by transmitting plane or cylindrical wavefronts and receiving using Hadamard-Encoded-Read-Out (HERO) to perform beamforming on what is effectively a full 2D synthetic receive aperture. This allows imaging beyond the shadow of the aperture of the RCA array, potentially allowing for whole organ imaging and 3D visualization of tissue morphology. It additionally enables viewing large volumes through limited windows. In this work we demonstrated with simulation that we are able to image at comparable resolution to existing RCA imaging methods at hundreds of frames per second. We validated these simulations by demonstrating an experimental implementation of HERCULES using a custom fabricated TOBE array, custom biasing electronics, and a research ultrasound system. Furthermore, we assessed our imaging capabilities by imaging a commercial phantom and comparing our results to those taken with traditional RCA imaging methods. Finally, we verified our ability to image real tissue by imaging a xenograft mouse model.
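
Illustrative sketch (not the authors' implementation): the abstract does not describe how Hadamard-Encoded-Read-Out is realized in hardware, so the example below shows only the generic linear algebra behind Hadamard encoding. It assumes each acquisition records a +/-1 weighted sum of per-element signals, with the weights taken from one row of a Hadamard matrix; because H^T H = n I, the individual signals are recovered by multiplying with H^T and dividing by n. The helper names `hadamard`, `encode`, and `decode` are hypothetical.

```python
def hadamard(n):
    """Return an n x n Hadamard matrix (Sylvester construction, n a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1]]
    while len(h) < n:
        # Doubling step: [[H, H], [H, -H]]
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def encode(h, x):
    """Each measurement is a +/-1 weighted sum of all element signals."""
    return [sum(hij * xj for hij, xj in zip(row, x)) for row in h]


def decode(h, y):
    """Invert the encoding using H^T H = n I, so x = H^T y / n."""
    n = len(h)
    return [sum(h[i][j] * y[i] for i in range(n)) / n for j in range(n)]


signals = [0.5, -1.0, 2.0, 0.25]   # hypothetical per-element samples
H = hadamard(4)
measurements = encode(H, signals)  # what the encoded acquisitions would record
recovered = decode(H, measurements)
print(recovered)
```

Every encoded measurement combines all elements, while the decoding step still isolates each element's contribution, which is the general property that lets an encoded readout stand in for element-by-element acquisition.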

Submitted 12 June, 2025; originally announced June 2025.

Comments: 10 pages, 10 figures, 6 supplementary videos

arXiv:2506.10958 [pdf, ps, other]

Bias-Switchable Row-Column Array Imaging using Fast Orthogonal Row-Column Electronic Scanning (FORCES) Compared with Conventional Row-Column Array Imaging

Authors: Randy Palamar, Mohammad Rahim Sobhani, Darren Dahunsi, Negar Majidi, Afshin Kashani Ilkhechi, Joy Wang, Jeremy Brown, Roger Zemp

Abstract: Row-Column Arrays (RCAs) offer an attractive alternative to fully wired 2D-arrays for 3D-ultrasound, due to their greatly simplified wiring. However, conventional RCAs face challenges related to their long elements. These include an inability to image beyond the shadow of the aperture and an inability to focus in both transmit and receive for desired scan planes. To address these limitations, we recently developed bias-switchable RCAs, also known as Top Orthogonal to Bottom Electrode (TOBE) arrays. These arrays provide novel opportunities to read out from every element of the array and achieve high-quality images. While TOBE arrays and their associated imaging schemes have shown promise, they have not yet been directly compared experimentally to conventional RCA imaging techniques. This study aims to provide such a comparison, demonstrating superior B-scan and volumetric images from two electrostrictive relaxor TOBE arrays, using a method called Fast Orthogonal Row-Column Electronic scanning (FORCES), compared to conventional RCA imaging schemes, including Tilted Plane Wave (TPW) compounding and Virtual Line Source (VLS) imaging. The study quantifies resolution and Generalized Contrast to Noise Ratio (gCNR) in phantoms, and also demonstrates volumetric acquisitions in phantom and animal models.

Submitted 12 June, 2025; originally announced June 2025.

arXiv:2506.10459 [pdf, ps, other]

Boosting Adversarial Transferability for Hyperspectral Image Classification Using 3D Structure-invariant Transformation and Intermediate Feature Distance

Authors: Chun Liu, Bingqian Zhu, Tao Xu, Zheng Zheng, Zheng Li, Wei Yang, Zhigang Han, Jiayao Wang

Abstract: Deep Neural Networks (DNNs) are vulnerable to adversarial attacks, which pose security challenges to hyperspectral image (HSI) classification technologies based on DNNs. In the domain of natural images, numerous transfer-based adversarial attack methods have been studied. However, HSIs differ from natural images due to their high-dimensional and rich spectral information. Current research on HSI adversarial examples remains limited and faces challenges in fully utilizing the structural and feature information of images. To address these issues, this paper proposes a novel method to enhance the transferability of the adversarial examples for HSI classification models. First, while keeping the image structure unchanged, the proposed method randomly divides the image into blocks in both spatial and spectral dimensions. Then, various transformations are applied on a block-by-block basis to increase input diversity and mitigate overfitting. Second, a feature distancing loss targeting intermediate layers is designed, which measures the distance between the amplified features of the original examples and the features of the adversarial examples as the primary loss, while the output layer prediction serves as the auxiliary loss. This guides the perturbation to disrupt the features of the true class in adversarial examples, effectively enhancing transferability. Extensive experiments demonstrate that the adversarial examples generated by the proposed method achieve effective transferability to black-box models on two public HSI datasets. Furthermore, the method maintains robust attack performance even under defense strategies.

Submitted 12 June, 2025; originally announced June 2025.

arXiv:2506.08967 [pdf, ps, other]

Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model

Authors: Ailin Huang, Bingxin Li, Bruce Wang, Boyong Wu, Chao Yan, Chengli Feng, Heng Wang, Hongyu Zhou, Hongyuan Wang, Jingbei Li, Jianjian Sun, Joanna Wang, Mingrui Chen, Peng Liu, Ruihang Miao, Shilei Jiang, Tian Fei, Wang You, Xi Chen, Xuerui Yang, Yechang Huang, Yuxiang Zhang, Zheng Ge, Zheng Gong, Zhewei Huang, et al. (51 additional authors not shown)

Abstract: Large Audio-Language Models (LALMs) have significantly advanced intelligent human-computer interaction, yet their reliance on text-based outputs limits their ability to generate natural speech responses directly, hindering seamless audio interactions. To address this, we introduce Step-Audio-AQAA, a fully end-to-end LALM designed for Audio Query-Audio Answer (AQAA) tasks. The model integrates a dual-codebook audio tokenizer for linguistic and semantic feature extraction, a 130-billion-parameter backbone LLM, and a neural vocoder for high-fidelity speech synthesis. Our post-training approach employs interleaved token-output of text and audio to enhance semantic coherence and combines Direct Preference Optimization (DPO) with model merge to improve performance. Evaluations on the StepEval-Audio-360 benchmark demonstrate that Step-Audio-AQAA excels especially in speech control, outperforming the state-of-the-art LALMs in key areas. This work contributes a promising solution for end-to-end LALMs and highlights the critical role of the token-based vocoder in enhancing overall performance for AQAA tasks.

Submitted 13 June, 2025; v1 submitted 10 June, 2025; originally announced June 2025.

Comments: 12 pages, 3 figures

arXiv:2506.07876 [pdf, ps, other]

Versatile Loco-Manipulation through Flexible Interlimb Coordination

Authors: Xinghao Zhu, Yuxin Chen, Lingfeng Sun, Farzad Niroui, Simon Le Cleac'h, Jiuguang Wang, Kuan Fang

Abstract: The ability to flexibly leverage limbs for loco-manipulation is essential for enabling autonomous robots to operate in unstructured environments. Yet, prior work on loco-manipulation is often constrained to specific tasks or predetermined limb configurations. In this work, we present Reinforcement Learning for Interlimb Coordination (ReLIC), an approach that enables versatile loco-manipulation through flexible interlimb coordination. The key to our approach is an adaptive controller that seamlessly bridges the execution of manipulation motions and the generation of stable gaits based on task demands. Through the interplay between two controller modules, ReLIC dynamically assigns each limb for manipulation or locomotion and robustly coordinates them to achieve task success. Using efficient reinforcement learning in simulation, ReLIC learns to perform stable gaits in accordance with the manipulation goals in the real world. To solve diverse and complex tasks, we further propose to interface the learned controller with different types of task specifications, including target trajectories, contact points, and natural language instructions. Evaluated on 12 real-world tasks that require diverse and complex coordination patterns, ReLIC demonstrates its versatility and robustness by achieving a success rate of 78.9% on average. Videos and code can be found at https://relic-locoman.rai-inst.com.

Submitted 10 June, 2025; v1 submitted 9 June, 2025; originally announced June 2025.

arXiv:2506.06360 [pdf]

Towards Generalizable Drowsiness Monitoring with Physiological Sensors: A Preliminary Study

Authors: Jiyao Wang, Suzan Ayas, Jiahao Zhang, Xiao Wen, Dengbo He, Birsen Donmez

Abstract: Accurately detecting drowsiness is vital to driving safety. Among all measures, physiological-signal-based drowsiness monitoring can be more privacy-preserving than a camera-based approach. However, conflicts exist regarding how physiological metrics are associated with different drowsiness labels across datasets. Thus, we analyzed key features from electrocardiograms (ECG), electrodermal activity (EDA), and respiratory (RESP) signals across four datasets, where different drowsiness inducers (such as fatigue and low arousal) and assessment methods (subjective vs. objective) were used. Binary logistic regression models were built to identify the physiological metrics that are associated with drowsiness. Findings indicate that different drowsiness inducers can lead to different physiological responses, and objective assessments were more sensitive than subjective ones in detecting drowsiness. Further, increased heart rate stability, reduced respiratory amplitude, and decreased tonic EDA are robustly associated with increased drowsiness. The results enhance understanding of drowsiness detection and can inform future generalizable monitoring designs.

Submitted 3 June, 2025; originally announced June 2025.

Comments: Accepted by HFES2025

arXiv:2506.06156 [pdf, ps, other]

Resource Allocation for Pinching-Antenna Systems: State-of-the-Art, Key Techniques and Open Issues

Authors: Ming Zeng, Ji Wang, Octavia A. Dobre, Zhiguo Ding, George K. Karagiannidis, Robert Schober, H. Vincent Poor

Abstract: Pinching antennas have emerged as a promising technology for reconfiguring wireless propagation environments, particularly in high-frequency communication systems operating in the millimeter-wave and terahertz bands. By enabling dynamic activation at arbitrary positions along a dielectric waveguide, pinching antennas offer unprecedented channel reconfigurability and the ability to provide line-of-sight (LoS) links in scenarios with severe LoS blockages. The performance of pinching-antenna systems is highly dependent on the optimized placement of the pinching antennas, which must be jointly considered with traditional resource allocation (RA) variables -- including transmission power, time slots, and subcarriers. The resulting joint RA problems are typically non-convex with complex variable coupling, necessitating sophisticated optimization techniques. This article provides a comprehensive survey of existing RA algorithms designed for pinching-antenna systems, supported by numerical case studies that demonstrate their potential performance gains. Key challenges and open research problems are also identified to guide future developments in this emerging field.

Submitted 6 June, 2025; originally announced June 2025.

Comments: submitted to IEEE WCM, 8 pages, 5 figures

arXiv:2506.04134 [pdf, other]

UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation

Authors: Jinting Wang, Shan Yang, Li Liu

Abstract: Cued Speech (CS) enhances lipreading through hand coding, providing precise speech perception support for the hearing-impaired. The CS Video-to-Speech generation (CSV2S) task aims to convert the CS visual expressions (CS videos) of hearing-impaired individuals into comprehensible speech signals. Direct generation of speech from CS video (called single CSV2S) yields poor performance due to insufficient CS data. Current research mostly focuses on CS Recognition (CSR), which converts video content into linguistic text. Based on this, one straightforward way of CSV2S is to combine CSR with a Text-to-Speech system. This combined architecture relies on text as an intermediate medium for stepwise cross-modal alignment, which may lead to error propagation and temporal misalignment between speech and video dynamics. To address these challenges, we propose a novel approach that directly generates speech from CS videos without relying on intermediate text. Building upon this, we propose UniCUE, the first unified framework for CSV2S, whose core innovation lies in the integration of the CSR task that provides fine-grained visual-semantic information to facilitate speech generation from CS videos. More precisely, we introduce (1) a novel fine-grained semantic alignment pool to ensure precise mapping between visual features and speech contents; (2) a VisioPhonetic adapter to bridge cross-task representations, ensuring seamless compatibility between two distinct tasks (i.e., CSV2S and CSR); and (3) a pose-aware visual processor to enhance fine-grained spatiotemporal correlations between lip and hand movements in CS video. Experiments on our newly established Chinese CS dataset (14 cuers: 8 hearing-impaired and 6 normal-hearing) show that our UniCUE significantly reduces Word Error Rate by 78.3% and improves lip-speech synchronization by 32% compared to the single CSV2S.

Submitted 4 June, 2025; originally announced June 2025.

Comments: 10 pages, 10 figures

arXiv:2506.03976 [pdf, ps, other]

Large Deviations for Sequential Tests of Statistical Sequence Matching

Authors: Lin Zhou, Qianyun Wang, Yun Wei, Jingjing Wang

Abstract: We revisit the problem of statistical sequence matching initiated by Unnikrishnan (TIT 2015) and derive theoretical performance guarantees for sequential tests that have bounded expected stopping times. Specifically, in this problem, one is given two databases of sequences and the task is to identify all matched pairs of sequences. In each database, each sequence is generated i.i.d. from a distinct distribution and a pair of sequences is said to be matched if they are generated from the same distribution. The generating distribution of each sequence is unknown. We first consider the case where the number of matches is known and derive the exact exponential decay rate of the mismatch (error) probability, a.k.a. the mismatch exponent, under each hypothesis for optimal sequential tests. Our results reveal the benefit of sequentiality by showing that optimal sequential tests have a larger mismatch exponent than the fixed-length tests by Zhou et al. (TIT 2024). Subsequently, we generalize our achievability result to the case of an unknown number of matches. In this case, two additional error probabilities arise: false alarm and false reject probabilities. We propose a corresponding sequential test, show that the test has bounded expected stopping time under certain conditions, and characterize the tradeoff among the exponential decay rates of the three error probabilities. Furthermore, we reveal the benefit of sequentiality over the two-step fixed-length test by Zhou et al. (TIT 2024) and propose a one-step fixed-length test that has no worse performance than the fixed-length test by Zhou et al. (TIT 2024). When specialized to the case where either database contains a single sequence, our results specialize to large deviations of sequential tests for statistical classification, the binary case of which was recently studied by Hsu, Li and Wang (ITW 2022).

Submitted 4 June, 2025; originally announced June 2025.

arXiv:2506.02610 [pdf, ps, other]

Speaker Diarization with Overlapping Community Detection Using Graph Attention Networks and Label Propagation Algorithm

Authors: Zhaoyang Li, Jie Wang, XiaoXiao Li, Wangjie Li, Longjie Luo, Lin Li, Qingyang Hong

Abstract: In speaker diarization, traditional clustering-based methods remain widely used in real-world applications. However, these methods struggle with the complex distribution of speaker embeddings and overlapping speech segments. To address these limitations, we propose an Overlapping Community Detection method based on Graph Attention networks and the Label Propagation Algorithm (OCDGALP). The proposed framework comprises two key components: (1) a graph attention network that refines speaker embeddings and node connections by aggregating information from neighboring nodes, and (2) a label propagation algorithm that assigns multiple community labels to each node, enabling simultaneous clustering and overlapping community detection. Experimental results show that the proposed method significantly reduces the Diarization Error Rate (DER), achieving a state-of-the-art 15.94% DER on the DIHARD-III dataset without oracle Voice Activity Detection (VAD), and an impressive 11.07% with oracle VAD.

Submitted 3 June, 2025; originally announced June 2025.

arXiv:2506.02574 [pdf, other]

Dynamic mapping from static labels: remote sensing dynamic sample generation with temporal-spectral embedding

Authors: Shuai Yuan, Shuang Chen, Tianwu Lin, Jie Wang, Peng Gong

Abstract: Accurate remote sensing geographic mapping depends heavily on representative and timely sample data. However, rapid changes in land surface dynamics necessitate frequent updates, quickly rendering previously collected samples obsolete and imposing significant labor demands for continuous manual updates. In this study, we aim to address this problem by dynamic sample generation using existing single-date static labeled samples. We introduce TasGen, a two-stage automated framework that generates dynamic samples. It is designed to simultaneously model spectral and temporal dependencies in time-series remote sensing imagery via temporal-spectral embedding, capturing land surface changes without additional manual annotations.

Submitted 3 June, 2025; originally announced June 2025.

arXiv:2506.02414 [pdf, ps, other]

StarVC: A Unified Auto-Regressive Framework for Joint Text and Speech Generation in Voice Conversion

Authors: Fengjin Li, Jie Wang, Yadong Niu, Yongqing Wang, Meng Meng, Jian Luan, Zhiyong Wu

Abstract: Voice Conversion (VC) modifies speech to match a target speaker while preserving linguistic content. Traditional methods usually extract speaker information directly from speech while neglecting the explicit utilization of linguistic content. Since VC fundamentally involves disentangling speaker identity from linguistic content, leveraging structured semantic features could enhance conversion performance. However, previous attempts to incorporate semantic features into VC have shown limited effectiveness, motivating the integration of explicit text modeling. We propose StarVC, a unified autoregressive VC framework that first predicts text tokens before synthesizing acoustic features. The experiments demonstrate that StarVC outperforms conventional VC methods in preserving both linguistic content (i.e., WER and CER) and speaker characteristics (i.e., SECS and MOS). Audio demo can be found at: https://thuhcsi.github.io/StarVC/.

Submitted 3 June, 2025; originally announced June 2025.

Comments: 5 pages, 2 figures, Accepted by Interspeech 2025, Demo: https://thuhcsi.github.io/StarVC/

arXiv:2506.02197 [pdf, ps, other]

NTIRE 2025 Challenge on RAW Image Restoration and Super-Resolution

Authors: Marcos V. Conde, Radu Timofte, Zihao Lu, Xiangyu Kong, Xiaoxia Xing, Fan Wang, Suejin Han, MinKyu Park, Tianyu Zhang, Xin Luo, Yeda Chen, Dong Liu, Li Pang, Yuhang Yang, Hongzhong Wang, Xiangyong Cao, Ruixuan Jiang, Senyan Xu, Siyuan Jiang, Xueyang Fu, Zheng-Jun Zha, Tianyu Hao, Yuhong He, Ruoqi Li, Yueqi Yang, et al. (14 additional authors not shown)

Abstract: This paper reviews the NTIRE 2025 RAW Image Restoration and Super-Resolution Challenge, highlighting the proposed solutions and results. New methods for RAW Restoration and Super-Resolution could be essential in modern Image Signal Processing (ISP) pipelines; however, this problem is not as explored as in the RGB domain. The goal of this challenge is twofold: (i) restore RAW images with blur and noise degradations, and (ii) upscale RAW Bayer images by 2x, considering unknown noise and blur. In the challenge, a total of 230 participants registered, and 45 submitted results during the challenge period. This report presents the current state-of-the-art in RAW Restoration.

Submitted 4 June, 2025; v1 submitted 2 June, 2025; originally announced June 2025.

Comments: CVPR 2025 - New Trends in Image Restoration and Enhancement (NTIRE)

arXiv:2506.01038 [pdf, ps, other]

Self-Supervised-ISAR-Net Enables Fast Sparse ISAR Imaging

Authors: Ziwen Wang, Jianping Wang, Pucheng Li, Yifan Wu, Zegang Ding

Abstract: Numerous sparse inverse synthetic aperture radar (ISAR) imaging methods based on unfolded neural networks have been developed for high-quality image reconstruction with sparse measurements. However, their training typically requires paired ISAR images and echoes, which are often difficult to obtain. Meanwhile, the following property can be observed: for a given sparse measurement configuration of ISAR, when a target is rotated around its center of mass, only the image of the target undergoes the corresponding rotation after ISAR imaging, while the grating lobes do not follow this rotation and are solely determined by the sparse-sampling pattern. This property is mathematically termed the equivariant property. Taking advantage of this property, an unfolded neural network for sparse ISAR imaging with self-supervised learning, named SS-ISAR-Net, is proposed. It effectively mitigates grating lobes caused by sparse radar echo, allowing high-quality training to be achieved using only sparse radar echo data. The superiority of the proposed SS-ISAR-Net, compared to existing methods, is verified through experiments with both synthetic and real-world measurement data.

Submitted 1 June, 2025; originally announced June 2025.

arXiv:2505.24493 [pdf, ps, other]

MELT: Towards Automated Multimodal Emotion Data Annotation by Leveraging LLM Embedded Knowledge

Authors: Xin Jing, Jiadong Wang, Iosif Tsangko, Andreas Triantafyllopoulos, Björn W. Schuller

Abstract: Although speech emotion recognition (SER) has advanced significantly with deep learning, annotation remains a major hurdle. Human annotation is not only costly but also subject to inconsistencies: annotators often have different preferences and may lack the necessary contextual knowledge, which can lead to varied and inaccurate labels. Meanwhile, Large Language Models (LLMs) have emerged as a scalable alternative for annotating text data. However, the potential of LLMs to perform emotional speech data annotation without human supervision has yet to be thoroughly investigated. To address these problems, we apply GPT-4o to annotate a multimodal dataset collected from the sitcom Friends, using only textual cues as inputs. By crafting structured text prompts, our methodology capitalizes on the knowledge GPT-4o has accumulated during its training, showcasing that it can generate accurate and contextually relevant annotations without direct access to multimodal inputs. Therefore, we propose MELT, a multimodal emotion dataset fully annotated by GPT-4o. We demonstrate the effectiveness of MELT by fine-tuning four self-supervised learning (SSL) backbones and assessing speech emotion recognition performance across emotion datasets. Additionally, our subjective experiments' results demonstrate a consistent performance improvement on SER.

Submitted 30 May, 2025; originally announced May 2025.

arXiv:2505.24437 [pdf, ps, other]

SwitchCodec: A High-Fidelity Neural Audio Codec With Sparse Quantization

Authors: Jin Wang, Wenbin Jiang, Xiangbo Wang

Abstract: Neural audio compression has emerged as a promising technology for efficiently representing speech, music, and general audio. However, existing methods suffer from significant performance degradation at limited bitrates, where the available embedding space is sharply constrained. To address this, we propose a universal high-fidelity neural audio compression algorithm featuring Residual Experts Vector Quantization (REVQ), which substantially expands the embedding space with minimal impact on bandwidth. A gentle load-balancing strategy is introduced to ensure the full utilization of this expanded space. Furthermore, we develop a novel multi-tiered discriminator that periodically stratifies STFT spectra, guiding the generator to focus on critical spectral regions. To support multiple bitrates without quality loss at the lower end, we adopt an efficient post-training strategy. Our proposed model achieves impressive performance, with PESQ and ViSQOL scores of 2.87 and 4.27, respectively, at 2.67 kbps bandwidth. The approach effectively reduces spectral blur, decreasing the distance to the original mel-spectrogram by 13%. Notably, our post-training strategy achieves performance comparable to dedicated fixed-bitrate models while reducing the required training time by half. Extensive ablation studies confirm the superiority of our method over baselines.

Submitted 4 July, 2025; v1 submitted 30 May, 2025; originally announced May 2025.

Comments: 11 pages, 7 figures

arXiv:2505.24356 [pdf, ps, other]

Joint Transmit and Receive Beamforming for Tri-directional Coil-Based Magnetic Induction Communications

Authors: Jinyang Li, Jianyu Wang, Wenchi Cheng, Yudong Fang, Wei Guo

Abstract: In this paper, we enhance the omnidirectional coverage performance of tri-directional coil-based magnetic induction communication (TC-MIC) and reduce the pathloss with a joint transmit and receive magnetic beamforming method. An iterative optimization algorithm incorporating the transmit current vector and receive weight matrix is developed to minimize the pathloss under constant transmit power constraints. We formulate the mathematical models for the mutual inductance of tri-directional coils, receive power, and pathloss. The optimization problem is decomposed into Rayleigh quotient extremum optimization for transmit currents and Cauchy-Schwarz inequality-constrained optimization for receive weights, with an alternating iterative algorithm to approach the global optimum. Numerical results demonstrate that the proposed algorithm converges within an average of 13.6 iterations, achieving up to 54% pathloss reduction compared with equal power allocation schemes. The joint optimization approach exhibits superior angular robustness, maintaining pathloss fluctuation smaller than 2 dB, and reducing fluctuation of pathloss by approximately 45% compared with single-parameter optimization methods.

Submitted 30 May, 2025; originally announced May 2025.

arXiv:2505.23980 [pdf, other]

DeepTopoNet: A Framework for Subglacial Topography Estimation on the Greenland Ice Sheets

Authors: Bayu Adhi Tama, Mansa Krishna, Homayra Alam, Mostafa Cham, Omar Faruque, Gong Cheng, Jianwu Wang, Mathieu Morlighem, Vandana Janeja

Abstract: Understanding Greenland's subglacial topography is critical for projecting the future mass loss of the ice sheet and its contribution to global sea-level rise. However, the complex and sparse nature of observational data, particularly information about the bed topography under the ice sheet, significantly increases the uncertainty in model projections. Bed topography is traditionally measured by airborne ice-penetrating radar that measures the ice thickness directly underneath the aircraft, leaving data gaps of tens of kilometers between flight lines. This study introduces a deep learning framework, which we call DeepTopoNet, that integrates radar-derived ice thickness observations and BedMachine Greenland data through a novel dynamic loss-balancing mechanism. Among all efforts to reconstruct bed topography, BedMachine has emerged as one of the most widely used datasets, combining mass conservation principles and ice thickness measurements to generate high-resolution bed elevation estimates. The proposed loss function adaptively adjusts the weighting between radar and BedMachine data, ensuring robustness in areas with limited radar coverage while leveraging the high spatial resolution of BedMachine predictions, i.e., bed estimates. Our approach incorporates gradient-based and trend surface features to enhance model performance and utilizes a CNN architecture designed for subgrid-scale predictions. By systematically testing on the Upernavik Isstrøm region, the model achieves high accuracy, outperforming baseline methods in reconstructing subglacial terrain. This work demonstrates the potential of deep learning in bridging observational gaps, providing a scalable and efficient solution to inferring subglacial topography.

Submitted 29 May, 2025; originally announced May 2025.

Comments: Submitted to SIGSPATIAL 2025

arXiv:2505.23249 [pdf, ps, other]

Context-Aware Semantic Communication for the Wireless Networks

Authors: Guangyuan Liu, Yinqiu Liu, Jiacheng Wang, Hongyang Du, Dusit Niyato, Jiawen Kang, Zehui Xiong, Abbas Jamalipour

Abstract: In next-generation wireless networks, supporting real-time applications such as augmented reality, autonomous driving, and immersive Metaverse services demands stringent constraints on bandwidth, latency, and reliability. Existing semantic communication (SemCom) approaches typically rely on static models, overlooking dynamic conditions and contextual cues vital for efficient transmission. To address these challenges, we propose CaSemCom, a context-aware SemCom framework that leverages a Large Language Model (LLM)-based gating mechanism and a Mixture of Experts (MoE) architecture to adaptively select and encode only high-impact semantic features across multiple data modalities. Our multimodal, multi-user case study demonstrates that CaSemCom significantly improves reconstructed image fidelity while reducing bandwidth usage, outperforming single-agent deep reinforcement learning (DRL) methods and traditional baselines in convergence speed, semantic accuracy, and retransmission overhead.

Submitted 29 May, 2025; originally announced May 2025.

arXiv:2505.22053 [pdf, other]

AudioGenie: A Training-Free Multi-Agent Framework for Diverse Multimodality-to-Multiaudio Generation

Authors: Yan Rong, Jinting Wang, Shan Yang, Guangzhi Lei, Li Liu

Abstract: Multimodality-to-Multiaudio (MM2MA) generation faces significant challenges in synthesizing diverse and contextually aligned audio types (e.g., sound effects, speech, music, and songs) from multimodal inputs (e.g., video, text, images), owing to the scarcity of high-quality paired datasets and the lack of robust multi-task learning frameworks. Recently, multi-agent systems have shown great potential in tackling the above issues. However, directly applying them to the MM2MA task presents three critical challenges: (1) inadequate fine-grained understanding of multimodal inputs (especially for video), (2) the inability of single models to handle diverse audio events, and (3) the absence of self-correction mechanisms for reliable outputs. To this end, we propose AudioGenie, a novel training-free multi-agent system featuring a dual-layer architecture with a generation team and a supervisor team. For the generation team, a fine-grained task decomposition and an adaptive Mixture-of-Experts (MoE) collaborative entity are designed for dynamic model selection, and a trial-and-error iterative refinement module is designed for self-correction. The supervisor team ensures temporal-spatial consistency and verifies outputs through feedback loops. Moreover, we build MA-Bench, the first benchmark for MM2MA tasks, comprising 198 annotated videos with multi-type audios. Experiments demonstrate that our AudioGenie outperforms state-of-the-art (SOTA) methods across 9 metrics in 8 tasks. A user study further validates the effectiveness of the proposed method in terms of quality, accuracy, alignment, and aesthetics. The anonymous project website with samples can be found at https://audiogenie.github.io/.

Submitted 28 May, 2025; originally announced May 2025.

arXiv:2505.20149 [pdf, ps, other]

Improvement Strategies for Few-Shot Learning in OCT Image Classification of Rare Retinal Diseases

Authors: Cheng-Yu Tai, Ching-Wen Chen, Chi-Chin Wu, Bo-Chen Chiu, Cheng-Hung Lin, Cheng-Kai Lu, Jia-Kang Wang, Tzu-Lun Huang

Abstract: This paper focuses on using few-shot learning to improve the accuracy of classifying OCT diagnosis images with major and rare classes. We used the GAN-based augmentation strategy as a baseline and introduced several novel methods to further enhance our model. The proposed strategy uses U-GAT-IT to improve the generative part and a data balance technique to narrow the accuracy gap between categories. The best model obtained was built with the CBAM attention mechanism and a fine-tuned InceptionV3, and achieved an overall accuracy of 97.85%, representing a significant improvement over the original baseline.

Submitted 26 May, 2025; originally announced May 2025.

Showing 1–50 of 1,438 results for author: Wang, J