-
SonicMotion: Dynamic Spatial Audio Soundscapes with Latent Diffusion Models
Authors:
Christian Templin,
Yanda Zhu,
Hao Wang
Abstract:
Spatial audio is an integral part of immersive entertainment, such as VR/AR, and has seen increasing popularity in cinema and music as well. The most common format of spatial audio is first-order Ambisonics (FOA). We seek to extend recent advancements in FOA generative AI models to enable the generation of 3D scenes with dynamic sound sources. Our proposed end-to-end model, SonicMotion, comes in two variants that differ in their user input and level of precision in sound source localization. In addition to our model, we also present a new dataset of simulated spatial audio-caption pairs. Evaluation of our models demonstrates that they are capable of matching the semantic alignment and audio quality of state-of-the-art models while capturing the desired spatial attributes.
Submitted 9 July, 2025;
originally announced July 2025.
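As an illustration of the FOA format named in the SonicMotion abstract above, the following minimal sketch encodes a mono signal whose azimuth sweeps linearly over time into four FOA channels. It assumes the AmbiX convention (ACN channel order W, Y, Z, X with SN3D normalization) and a counterclockwise-positive azimuth; the abstract does not say which convention the authors use, and the function names `foa_gains` and `encode_moving_source` are hypothetical. It is not the paper's method, only the standard panning equations applied per sample.

```python
import math


def foa_gains(azimuth_rad, elevation_rad):
    """Return first-order gains in ACN order (W, Y, Z, X), SN3D-normalized.

    Assumption: azimuth is measured counterclockwise from the front,
    elevation upward from the horizontal plane.
    """
    cos_el = math.cos(elevation_rad)
    return (
        1.0,                              # W: omnidirectional component
        math.sin(azimuth_rad) * cos_el,   # Y: left-right axis
        math.sin(elevation_rad),          # Z: up-down axis
        math.cos(azimuth_rad) * cos_el,   # X: front-back axis
    )


def encode_moving_source(samples, start_az_deg, end_az_deg, elevation_deg=0.0):
    """Encode a mono signal whose azimuth moves linearly from start to end.

    Returns a list of four channels (W, Y, Z, X), each as long as `samples`.
    """
    n = len(samples)
    elevation = math.radians(elevation_deg)
    channels = [[], [], [], []]
    for i, sample in enumerate(samples):
        # Fraction of the trajectory completed at this sample.
        frac = i / (n - 1) if n > 1 else 0.0
        azimuth = math.radians(start_az_deg + frac * (end_az_deg - start_az_deg))
        for channel, gain in zip(channels, foa_gains(azimuth, elevation)):
            channel.append(gain * sample)
    return channels
```

For example, `encode_moving_source(signal, 0.0, 90.0)` moves a source from straight ahead to the listener's left: energy shifts from the X channel to the Y channel while W stays constant.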
-
On-Device Training of PV Power Forecasting Models in a Smart Meter for Grid Edge Intelligence
Authors:
Jian Huang,
Yongli Zhu,
Linna Xu,
Zhe Zheng,
Wenpeng Cui,
Mingyang Sun
Abstract:
In this paper, an edge-side model training study is conducted on a resource-limited smart meter. The motivation of grid-edge intelligence and the concept of on-device training are introduced. Then, the technical preparation steps for on-device training are described. A case study on the task of photovoltaic power forecasting is presented, where two representative machine learning models are investigated: a gradient boosting tree model and a recurrent neural network model. To adapt to the resource-limited situation in the smart meter, "mixed"- and "reduced"-precision training schemes are also devised. Experimental results demonstrate the feasibility of economically achieving grid-edge intelligence via existing advanced metering infrastructure.
Submitted 9 July, 2025;
originally announced July 2025.
-
Intelligent Reflecting Surfaces for THz Communications: Fundamentals, Key Solutions, and System Prototyping
Authors:
Qingqing Wu,
Yanze Zhu,
Qiaoyan Peng,
Wanming Hao,
Yanzhao Hou,
Fengyuan Yang,
Wencai Yan,
Guoning Wang,
Wen Chen,
Chi Qiu
Abstract:
Intelligent reflecting surfaces (IRSs) have emerged as a cost-effective technology for terahertz (THz) communications by enabling programmable control of the wireless environment. This paper provides a comprehensive overview of IRS-aided THz communications, covering hardware designs, advanced signal processing techniques, and practical deployment strategies. It first examines key THz reconfigurable metasurface architectures, including electronic, optical, phase-change material, and micro-electromechanical systems (MEMS)-based implementations, highlighting their reconfiguration mechanisms and challenges. Then, fundamental effects including near field and beam squint in wideband THz systems are analyzed, along with their impacts on system performance. The paper further explores conventional and beam-squint-assisted channel estimation methods, innovative beam management strategies, and deployment considerations across large- and small-scale scenarios. Practical experiments at 220 gigahertz (GHz) validate the effectiveness of IRSs in improving signal strength and communication reliability for both single-user and multi-user setups.
Submitted 20 June, 2025;
originally announced June 2025.
-
MMWiLoc: A Multi-Sensor Dataset and Robust Device-Free Localization Method Using Commercial Off-The-Shelf Millimeter Wave Wi-Fi Devices
Authors:
Wenbo Ding,
Yang Li,
Dongsheng Wang,
Bin Zhao,
Yunrong Zhu,
Yibo Zhang,
Yumeng Miao
Abstract:
Device-free Wi-Fi sensing has numerous benefits in practical settings, as it eliminates the requirement for dedicated sensing devices and can be accomplished using current low-cost Wi-Fi devices. With the development of Wi-Fi standards, millimeter wave Wi-Fi devices with a 60 GHz operating frequency and up to 4 GHz of bandwidth have become commercially available. Although millimeter wave Wi-Fi presents great promise for device-free Wi-Fi sensing with increased bandwidth and beam-forming ability, a method for localization using millimeter wave Wi-Fi is still lacking. Here, we present two major contributions. First, we provide a comprehensive multi-sensor dataset that synchronously captures human movement data from millimeter wave Wi-Fi, 2.4 GHz Wi-Fi, and millimeter wave radar sensors. This dataset enables direct performance comparisons across different sensing modalities and facilitates reproducible research in indoor localization. Second, we introduce MMWiLoc, a novel localization method that achieves centimeter-level precision with low computational cost. MMWiLoc incorporates two components: beam pattern calibration using Expectation Maximization and target localization through Multi-Scale Compression Sensing. The system processes beam Signal-to-Noise Ratio (beamSNR) information from the beam-forming process to determine target Angle of Arrival (AoA), which is then fused across devices for localization. Our extensive evaluation demonstrates that MMWiLoc achieves centimeter-level precision, outperforming 2.4 GHz Wi-Fi systems while maintaining competitive performance with high-precision radar systems. The dataset and example processing code will be released after this paper is accepted at https://github.com/wowoyoho/MMWiLoc.
Submitted 13 June, 2025;
originally announced June 2025.
-
BNMusic: Blending Environmental Noises into Personalized Music
Authors:
Chi Zuo,
Martin B. Møller,
Pablo Martínez-Nuevo,
Huayang Huang,
Yu Wu,
Ye Zhu
Abstract:
Acoustic masking is a conventional audio-engineering technique for reducing the annoyance of environmental noises by covering them up with other dominant yet less intrusive sounds. However, misalignment between the dominant sound and the noise, such as mismatched downbeats, often requires an excessive volume increase to achieve effective masking. Motivated by recent advances in cross-modal generation, in this work, we introduce an alternative method to acoustic masking, aiming to reduce the noticeability of environmental noises by blending them into personalized music generated based on user-provided text prompts. Following the paradigm of music generation using mel-spectrogram representations, we propose a Blending Noises into Personalized Music (BNMusic) framework with two key stages. The first stage synthesizes a complete piece of music in a mel-spectrogram representation that encapsulates the musical essence of the noise. In the second stage, we adaptively amplify the generated music segment to further reduce noise perception and enhance the blending effectiveness, while preserving auditory quality. Our experiments with comprehensive evaluations on MusicBench, EPIC-SOUNDS, and ESC-50 demonstrate the effectiveness of our framework, highlighting the ability to blend environmental noise with rhythmically aligned, adaptively amplified, and enjoyable music segments, minimizing the noticeability of the noise and thereby improving the overall acoustic experience.
Submitted 12 June, 2025;
originally announced June 2025.
-
Heterogeneous-IRS-Assisted MIMO Systems: Channel Estimation and Beamforming
Authors:
Weibiao Zhao,
Qiucen Wu,
Yuanqi Tang,
Yu Zhu
Abstract:
The intelligent reflecting surface (IRS) has gained great attention for its ability to create favorable propagation environments. However, the power consumption of conventional IRSs cannot be ignored due to the large number of reflecting elements and control circuits. To balance performance and power consumption, we previously proposed a heterogeneous-IRS (HE-IRS), a green IRS structure integrating dynamically tunable elements (DTEs) and statically tunable elements (STEs). Compared to conventional IRSs with only DTEs, the unique DTE-STE integrated structure introduces new challenges in both channel estimation and beamforming. In this paper, we investigate the channel estimation and beamforming problems in HE-IRS-assisted multi-user multiple-input multiple-output systems. Unlike the overall cascaded channel estimated in conventional IRSs, we show that the HE-IRS channel to be estimated is decomposed into a DTE-based cascaded channel and an STE-based equivalent channel. Leveraging this decomposition along with the inherent sparsity of DTE- and STE-based channels and manifold optimization, we propose an efficient channel estimation scheme. To address the rank mismatch problem caused by imperfect channel sparsity information, a robust rank selection rule is developed. For beamforming, we propose an offline algorithm to optimize the STE phase shifts for wide beam coverage, and an online algorithm to optimize the BS precoder and the DTE phase shifts using the estimated HE-IRS channel. Simulation results show that the HE-IRS requires less pilot overhead than conventional IRSs with the same number of elements. With the proposed channel estimation and beamforming schemes, the green HE-IRS achieves competitive sum rate performance with significantly reduced power consumption.
Submitted 12 June, 2025;
originally announced June 2025.
-
DUN-SRE: Deep Unrolling Network with Spatiotemporal Rotation Equivariance for Dynamic MRI Reconstruction
Authors:
Yuliang Zhu,
Jing Cheng,
Qi Xie,
Zhuo-Xu Cui,
Qingyong Zhu,
Yuanyuan Liu,
Xin Liu,
Jianfeng Ren,
Chengbo Wang,
Dong Liang
Abstract:
Dynamic Magnetic Resonance Imaging (MRI) exhibits transformation symmetries, including spatial rotation symmetry within individual frames and temporal symmetry along the time dimension. Explicit incorporation of these symmetry priors in the reconstruction model can significantly improve image quality, especially under aggressive undersampling scenarios. Recently, equivariant convolutional neural networks (ECNNs) have shown great promise in exploiting spatial symmetry priors. However, existing ECNNs critically fail to model temporal symmetry, arguably the most universal and informative structural prior in dynamic MRI reconstruction. To tackle this issue, we propose a novel Deep Unrolling Network with Spatiotemporal Rotation Equivariance (DUN-SRE) for Dynamic MRI Reconstruction. The DUN-SRE establishes spatiotemporal equivariance through a (2+1)D equivariant convolutional architecture. In particular, it integrates both the data consistency and proximal mapping modules into a unified deep unrolling framework. This architecture ensures rigorous propagation of spatiotemporal rotation symmetry constraints throughout the reconstruction process, enabling more physically accurate modeling of cardiac motion dynamics in cine MRI. In addition, a high-fidelity group filter parameterization mechanism is developed to maintain representation precision while enforcing symmetry constraints. Comprehensive experiments on Cardiac CINE MRI datasets demonstrate that DUN-SRE achieves state-of-the-art performance, particularly in preserving rotation-symmetric structures, offering strong generalization capability to a broad range of dynamic MRI reconstruction tasks.
Submitted 11 June, 2025;
originally announced June 2025.
-
Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model
Authors:
Ailin Huang,
Bingxin Li,
Bruce Wang,
Boyong Wu,
Chao Yan,
Chengli Feng,
Heng Wang,
Hongyu Zhou,
Hongyuan Wang,
Jingbei Li,
Jianjian Sun,
Joanna Wang,
Mingrui Chen,
Peng Liu,
Ruihang Miao,
Shilei Jiang,
Tian Fei,
Wang You,
Xi Chen,
Xuerui Yang,
Yechang Huang,
Yuxiang Zhang,
Zheng Ge,
Zheng Gong,
Zhewei Huang
, et al. (51 additional authors not shown)
Abstract:
Large Audio-Language Models (LALMs) have significantly advanced intelligent human-computer interaction, yet their reliance on text-based outputs limits their ability to generate natural speech responses directly, hindering seamless audio interactions. To address this, we introduce Step-Audio-AQAA, a fully end-to-end LALM designed for Audio Query-Audio Answer (AQAA) tasks. The model integrates a dual-codebook audio tokenizer for linguistic and semantic feature extraction, a 130-billion-parameter backbone LLM, and a neural vocoder for high-fidelity speech synthesis. Our post-training approach employs interleaved token-output of text and audio to enhance semantic coherence and combines Direct Preference Optimization (DPO) with model merging to improve performance. Evaluations on the StepEval-Audio-360 benchmark demonstrate that Step-Audio-AQAA excels especially in speech control, outperforming state-of-the-art LALMs in key areas. This work contributes a promising solution for end-to-end LALMs and highlights the critical role of the token-based vocoder in enhancing overall performance for AQAA tasks.
Submitted 13 June, 2025; v1 submitted 10 June, 2025;
originally announced June 2025.
-
Delay Optimization in Remote ID-Based UAV Communication via BLE and Wi-Fi Switching
Authors:
Yian Zhu,
Ziye Jia,
Lei Zhang,
Yao Wu,
Qiuming Zhu,
Qihui Wu
Abstract:
The remote identification (Remote ID) broadcast capability allows unmanned aerial vehicles (UAVs) to exchange messages, which is a pivotal technology for inter-UAV communications. Although this capability enhances the operational visibility, low delay in Remote ID-based communications is critical for ensuring the efficiency and timeliness of multi-UAV operations in dynamic environments. To address this challenge, we first establish delay models for Remote ID communications by considering packet reception and collisions across both BLE 4 and Wi-Fi protocols. Building upon these models, we formulate an optimization problem to minimize the long-term communication delay through adaptive protocol selection. Since the delay performance varies with the UAV density, we propose an adaptive BLE/Wi-Fi switching algorithm based on the multi-agent deep Q-network approach. Experimental results demonstrate that in dynamic-density scenarios, our strategy achieves 32.1% and 37.7% lower latency compared to static BLE 4 and Wi-Fi modes respectively.
Submitted 9 June, 2025;
originally announced June 2025.
-
DS-TTS: Zero-Shot Speaker Style Adaptation from Voice Clips via Dynamic Dual-Style Feature Modulation
Authors:
Ming Meng,
Ziyi Yang,
Jian Yang,
Zhenjie Su,
Yonggui Zhu,
Zhaoxin Fan
Abstract:
Recent advancements in text-to-speech (TTS) technology have increased demand for personalized audio synthesis. Zero-shot voice cloning, a specialized TTS task, aims to synthesize a target speaker's voice using only a single audio sample and arbitrary text, without prior exposure to the speaker during training. This process employs pattern recognition techniques to analyze and replicate the speaker's unique vocal features. Despite progress, challenges remain in adapting to the vocal style of unseen speakers, highlighting difficulties in generalizing TTS systems to handle diverse voices while maintaining naturalness, expressiveness, and speaker fidelity. To address the challenges of unseen speaker style adaptation, we propose DS-TTS, a novel approach aimed at enhancing the synthesis of diverse, previously unheard voices. Central to our method is a Dual-Style Encoding Network (DuSEN), where two distinct style encoders capture complementary aspects of a speaker's vocal identity. These speaker-specific style vectors are seamlessly integrated into the Dynamic Generator Network (DyGN) via a Style Gating-Film (SGF) mechanism, enabling more accurate and expressive reproduction of unseen speakers' unique vocal characteristics. In addition, we introduce a Dynamic Generator Network to tackle synthesis issues that arise with varying sentence lengths. By dynamically adapting to the length of the input, this component ensures robust performance across diverse text inputs and speaker styles, significantly improving the model's ability to generalize to unseen speakers in a more natural and expressive manner. Experimental evaluations on the VCTK dataset suggest that DS-TTS demonstrates superior overall performance in voice cloning tasks compared to existing state-of-the-art models, showing notable improvements in both word error rate and speaker similarity.
Submitted 1 June, 2025;
originally announced June 2025.
-
B2LoRa: Boosting LoRa Transmission for Satellite-IoT Systems with Blind Coherent Combining
Authors:
Yimin Zhao,
Weibo Wang,
Xiong Wang,
Linghe Kong,
Jiadi Yu,
Yifei Zhu,
Shiyuan Li,
Chong He,
Guihai Chen
Abstract:
With the rapid growth of Low Earth Orbit (LEO) satellite networks, satellite-IoT systems using the LoRa technique have been increasingly deployed to provide widespread Internet services to low-power and low-cost ground devices. However, the long transmission distance and adverse environments from IoT satellites to ground devices pose a huge challenge to link reliability, as evidenced by the measurement results based on our real-world setup. In this paper, we propose a blind coherent combining design named B2LoRa to boost LoRa transmission performance. The intuition behind B2LoRa is to leverage the repeated broadcasting mechanism inherent in satellite-IoT systems to achieve coherent combining under the low-power and low-cost constraints, where each re-transmission at a different time is regarded as the same packet transmitted from a different antenna element within an antenna array. Then, the problem is translated into aligning these packets at a fine granularity despite the time, frequency, and phase offsets between packets in the case of frequent packet loss. To overcome this challenge, we present three designs (joint packet sniffing, frequency shift alignment, and phase drift mitigation) to deal with the ultra-low SNRs and Doppler shifts featured in satellite-IoT systems. Finally, experimental results based on our real-world deployments demonstrate the high efficiency of B2LoRa.
Submitted 29 May, 2025;
originally announced May 2025.
-
Patch-based Reconstruction for Unsupervised Dynamic MRI using Learnable Tensor Function with Implicit Neural Representation
Authors:
Yuanyuan Liu,
Yuanbiao Yang,
Zhuo-Xu Cui,
Qingyong Zhu,
Jing Cheng,
Congcong Liu,
Jinwen Xie,
Jingran Xu,
Hairong Zheng,
Dong Liang,
Yanjie Zhu
Abstract:
Dynamic MRI plays a vital role in clinical practice by capturing both spatial details and dynamic motion, but its high spatiotemporal resolution is often limited by long scan times. Deep learning (DL)-based methods have shown promising performance in accelerating dynamic MRI. However, most existing algorithms rely on large fully-sampled datasets for training, which are difficult to acquire. Recently, implicit neural representation (INR) has emerged as a powerful scan-specific paradigm for accelerated MRI, which models signals as a continuous function over spatiotemporal coordinates. Although this approach achieves efficient continuous modeling of dynamic images and robust reconstruction, it faces challenges in recovering fine details and increasing computational demands for high dimensional data representation. To enhance both efficiency and reconstruction quality, we propose TenF-INR, a novel patch-based unsupervised framework that employs INR to model bases of tensor decomposition, enabling efficient and accurate modeling of dynamic MR images with learnable tensor functions. By exploiting strong correlations in similar spatial image patches and in the temporal direction, TenF-INR enforces multidimensional low-rankness and implements patch-based reconstruction with the benefits of continuous modeling. We compare TenF-INR with state-of-the-art methods, including supervised DL methods and unsupervised approaches. Experimental results demonstrate that TenF-INR achieves high acceleration factors up to 21, outperforming all comparison methods in image quality, temporal fidelity, and quantitative metrics, even surpassing the supervised methods.
Submitted 27 May, 2025;
originally announced May 2025.
-
FlowSE: Efficient and High-Quality Speech Enhancement via Flow Matching
Authors:
Ziqian Wang,
Zikai Liu,
Xinfa Zhu,
Yike Zhu,
Mingshuai Liu,
Jun Chen,
Longshuai Xiao,
Chao Weng,
Lei Xie
Abstract:
Generative models have excelled in audio tasks using approaches such as language models, diffusion, and flow matching. However, existing generative approaches for speech enhancement (SE) face notable challenges: language model-based methods suffer from quantization loss, leading to compromised speaker similarity and intelligibility, while diffusion models require complex training and high inference latency. To address these challenges, we propose FlowSE, a flow-matching-based model for SE. Flow matching learns a continuous transformation between noisy and clean speech distributions in a single pass, significantly reducing inference latency while maintaining high-quality reconstruction. Specifically, FlowSE trains on noisy mel spectrograms and optional character sequences, optimizing a conditional flow matching loss with ground-truth mel spectrograms as supervision. It implicitly learns speech's temporal-spectral structure and text-speech alignment. During inference, FlowSE can operate with or without textual information, achieving impressive results in both scenarios, with further improvements when transcripts are available. Extensive experiments demonstrate that FlowSE significantly outperforms state-of-the-art generative methods, establishing a new paradigm for generative-based SE and demonstrating the potential of flow matching to advance the field. Our code, pre-trained checkpoints, and audio samples are available.
Submitted 27 May, 2025; v1 submitted 25 May, 2025;
originally announced May 2025.
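For readers unfamiliar with the conditional flow matching loss mentioned in the FlowSE abstract above, one commonly used form is sketched below. It assumes a linear interpolation path between a starting sample $x_0$ and the clean target $x_1$; the abstract does not state which path or conditioning scheme FlowSE uses, so the symbols $x_0$, $x_1$, $c$ (conditioning, e.g. the noisy mel spectrogram and optional text), and $v_\theta$ (the learned velocity field) are illustrative assumptions rather than the authors' notation.

```latex
x_t = (1 - t)\,x_0 + t\,x_1, \qquad t \sim \mathcal{U}[0, 1],
\qquad
\mathcal{L}_{\mathrm{CFM}}(\theta)
  = \mathbb{E}_{t,\,x_0,\,x_1}
    \left\| v_\theta(x_t, t, c) - (x_1 - x_0) \right\|_2^2 .
```

Under this assumed path, the regression target $x_1 - x_0$ is the constant velocity that carries $x_0$ to $x_1$ over $t \in [0, 1]$, and inference integrates $v_\theta$ from $t = 0$ to $t = 1$.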
-
MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix
Authors:
Ziyang Ma,
Yinghao Ma,
Yanqiao Zhu,
Chen Yang,
Yi-Wen Chao,
Ruiyang Xu,
Wenxi Chen,
Yuanzhe Chen,
Zhuo Chen,
Jian Cong,
Kai Li,
Keliang Li,
Siyou Li,
Xinfeng Li,
Xiquan Li,
Zheng Lian,
Yuzhe Liang,
Minghao Liu,
Zhikang Niu,
Tianrui Wang,
Yuping Wang,
Yuxuan Wang,
Yihao Wu,
Guanrou Yang,
Jianwei Yu
, et al. (9 additional authors not shown)
Abstract:
We introduce MMAR, a new benchmark designed to evaluate the deep reasoning capabilities of Audio-Language Models (ALMs) across massive multi-disciplinary tasks. MMAR comprises 1,000 meticulously curated audio-question-answer triplets, collected from real-world internet videos and refined through iterative error corrections and quality checks to ensure high quality. Unlike existing benchmarks that are limited to specific domains of sound, music, or speech, MMAR extends them to a broad spectrum of real-world audio scenarios, including mixed-modality combinations of sound, music, and speech. Each question in MMAR is hierarchically categorized across four reasoning layers: Signal, Perception, Semantic, and Cultural, with additional sub-categories within each layer to reflect task diversity and complexity. To further foster research in this area, we annotate every question with a Chain-of-Thought (CoT) rationale to promote future advancements in audio reasoning. Each item in the benchmark demands multi-step deep reasoning beyond surface-level understanding. Moreover, some of the questions require graduate-level perceptual and domain-specific knowledge, elevating the benchmark's difficulty and depth. We evaluate MMAR using a broad set of models, including Large Audio-Language Models (LALMs), Large Audio Reasoning Models (LARMs), Omni Language Models (OLMs), Large Language Models (LLMs), and Large Reasoning Models (LRMs), with audio caption inputs. The performance of these models on MMAR highlights the benchmark's challenging nature, and our analysis further reveals critical limitations in the understanding and reasoning capabilities of current models. We hope MMAR will serve as a catalyst for future advances in this important but little-explored area.
Submitted 19 May, 2025;
originally announced May 2025.
-
Fast Heuristic Scheduling and Trajectory Planning for Robotic Fruit Harvesters with Multiple Cartesian Arms
Authors:
Yuankai Zhu,
Stavros Vougioukas
Abstract:
This work proposes a fast heuristic algorithm for the coupled scheduling and trajectory planning of multiple Cartesian robotic arms harvesting fruits. Our method partitions the workspace, assigns fruit-picking sequences to arms, determines tight and feasible fruit-picking schedules and vehicle travel speed, and generates smooth, collision-free arm trajectories. The fruit-picking throughput achieved by the algorithm was assessed using synthetically generated fruit coordinates and a harvester design featuring up to 12 arms. The throughput increased monotonically as more arms were added. Adding more arms when fruit densities were low resulted in diminishing gains because it took longer to travel from one fruit to another. However, when there were enough fruits, the proposed algorithm achieved a linear speedup as the number of arms increased.
Submitted 15 May, 2025;
originally announced May 2025.
-
Offline Reinforcement Learning for Microgrid Voltage Regulation
Authors:
Shan Yang,
Yongli Zhu
Abstract:
This paper presents a study on using different offline reinforcement learning algorithms for microgrid voltage regulation with solar power penetration. When environment interaction is infeasible for technical or safety reasons, the proposed approach can still obtain an applicable model through offline-style training on a previously collected dataset, lowering the negative impact of lacking online environment interactions. Experimental results on the IEEE 33-bus system demonstrate the feasibility and effectiveness of the proposed approach on different offline datasets, including one containing only low-quality experience.
Submitted 14 May, 2025;
originally announced May 2025.
-
Deep Reinforcement Learning for Power Grid Multi-Stage Cascading Failure Mitigation
Authors:
Bo Meng,
Chenghao Xu,
Yongli Zhu
Abstract:
Cascading failures in power grids can lead to grid collapse, causing severe disruptions to social operations and economic activities. In certain cases, multi-stage cascading failures can occur. However, existing cascading-failure-mitigation strategies are usually single-stage-based, overlooking the complexity of the multi-stage scenario. This paper treats the multi-stage cascading failure problem as a reinforcement learning task and develops a simulation environment. The reinforcement learning agent is then trained via the deterministic policy gradient algorithm to achieve continuous actions. Finally, the effectiveness of the proposed approach is validated on the IEEE 14-bus and IEEE 118-bus systems.
Submitted 13 May, 2025;
originally announced May 2025.
-
Joint Communication Scheduling and Resource Allocation for Distributed Edge Learning: Seamless Integration in Next-Generation Wireless Networks
Authors:
Paul Zheng,
Navid Keshtiarast,
Pradyumna Kumar Bishoyi,
Yao Zhu,
Yulin Hu,
Marina Petrova,
Anke Schmeink
Abstract:
Distributed edge learning (DL) is considered a cornerstone of intelligence enablers, since it allows for collaborative training without the necessity for local clients to share raw data with other parties, thereby preserving privacy and security. Integrating DL into the 6G networks requires a coexistence design with existing services such as high-bandwidth (HB) traffic like eMBB. Current designs in the literature mainly focus on communication round-wise designs that assume a rigid resource allocation throughout each communication round (CR). However, rigid resource allocation within a CR is a highly inefficient and inaccurate representation of the system's realistic behavior. This is due to the heterogeneous nature of the system, as clients inherently may need to access the network at different times. This work zooms into one arbitrary CR, and demonstrates the importance of considering a time-dependent resource sharing design with HB traffic. We first formulate a time-step-wise optimization problem to minimize the consumed time by DL within the CR while constrained by a DL energy budget. Due to its intractability, a session-based optimization problem is formulated assuming a CR lasts less than a large-scale coherence time. Some scheduling properties of such multi-server joint communication scheduling and resource allocation framework have been established. An iterative algorithm has been designed to solve such non-convex and non-block-separable-constrained problems. Simulation results confirm the importance of the efficient and accurate integration design proposed in this work.
Submitted 13 May, 2025;
originally announced May 2025.
-
Diffusion-assisted Model Predictive Control Optimization for Power System Real-Time Operation
Authors:
Linna Xu,
Yongli Zhu
Abstract:
This paper presents a modified model predictive control (MPC) framework for real-time power system operation. The framework incorporates a diffusion model tailored for time series generation to enhance the accuracy of the load forecasting module used in the system operation. In the absence of explicit state transition law, a model-identification procedure is leveraged to derive the system dynamics, thereby eliminating a barrier when applying MPC to a renewables-dominated power system. Case study results on an industry park system and the IEEE 30-bus system demonstrate that using the diffusion model to augment the training dataset significantly improves load-forecasting accuracy, and the inferred system dynamics are applicable to the real-time grid operation with solar and wind.
Submitted 14 May, 2025; v1 submitted 13 May, 2025;
originally announced May 2025.
-
Sub-diffraction terahertz backpropagation compressive imaging
Authors:
Yongsheng Zhu,
Shaojing Liu,
Ximiao Wang,
Runli Li,
Haili Yang,
Jiali Wang,
Hongjia Zhu,
Yanlin Ke,
Ningsheng Xu,
Huanjun Chen,
Shaozhi Deng
Abstract:
Terahertz single-pixel imaging (TSPI) has garnered significant attention due to its simplicity and cost-effectiveness. However, the relatively long wavelength of THz waves limits sub-diffraction-scale imaging resolution. Although TSPI technique can achieve sub-wavelength resolution, it requires harsh experimental conditions and time-consuming processes. Here, we propose a sub-diffraction THz backpropagation compressive imaging technique. We illuminate the object with monochromatic continuous-wave THz radiation. The transmitted THz wave is modulated by prearranged patterns generated on the back surface of a 500-μm-thick silicon wafer, realized through photoexcited carriers using a 532-nm laser. The modulated THz wave is then recorded by a single-element detector. An untrained neural network is employed to iteratively reconstruct the object image with an ultralow compression ratio of 1.5625% under a physical model constraint, thus reducing the long sampling times. To further suppress the diffraction-field effects, embedded with the angular spectrum propagation (ASP) theory to model the diffraction of THz waves during propagation, the network retrieves near-field information from the object, enabling sub-diffraction imaging with a spatial resolution of ~λ0/7 (λ0 = 833.3 μm at 0.36 THz) and eliminating the need for ultrathin photomodulators. This approach provides an efficient solution for advancing THz microscopic imaging and addressing other inverse imaging challenges.
Submitted 5 May, 2025;
originally announced May 2025.
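The figures quoted in the terahertz imaging abstract above can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes the rounded value c ≈ 3×10^8 m/s (which reproduces the quoted 833.3 μm), and the 64×64 image size is a hypothetical choice made here to show what a 1.5625% compression ratio means, not a value taken from the paper.

```python
# Rounded speed of light in m/s (assumption: the abstract's 833.3 um matches this value).
C_ROUNDED = 3.0e8


def wavelength_um(freq_hz: float) -> float:
    """Free-space wavelength in micrometres for a given frequency in Hz."""
    return C_ROUNDED / freq_hz * 1e6


# Wavelength at 0.36 THz, as quoted in the abstract.
lambda0_um = wavelength_um(0.36e12)

# Reported spatial resolution of roughly lambda0 / 7.
resolution_um = lambda0_um / 7

# A compression ratio of 1.5625% equals 1/64 measurements per pixel.
compression_ratio = 1.5625 / 100

# Hypothetical 64 x 64 pixel image (not stated in the abstract).
hypothetical_pixels = 64 * 64
measurements = round(compression_ratio * hypothetical_pixels)

print(f"lambda0 = {lambda0_um:.1f} um")
print(f"resolution = {resolution_um:.1f} um")
print(f"measurements for {hypothetical_pixels} pixels: {measurements}")
```

Under these assumptions, the resolution works out to roughly 119 μm, and a 4,096-pixel image would need only 64 single-pixel measurements.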
-
DeltaDPD: Exploiting Dynamic Temporal Sparsity in Recurrent Neural Networks for Energy-Efficient Wideband Digital Predistortion
Authors:
Yizhuo Wu,
Yi Zhu,
Kun Qian,
Qinyu Chen,
Anding Zhu,
John Gajadharsing,
Leo C. N. de Vreede,
Chang Gao
Abstract:
Digital Predistortion (DPD) is a popular technique to enhance signal quality in wideband RF power amplifiers (PAs). With increasing bandwidth and data rates, DPD faces significant energy consumption challenges during deployment, contrasting with its efficiency goals. State-of-the-art DPD models rely on recurrent neural networks (RNN), whose computational complexity hinders system efficiency. This paper introduces DeltaDPD, exploring the dynamic temporal sparsity of input signals and neuronal hidden states in RNNs for energy-efficient DPD, reducing arithmetic operations and memory accesses while preserving satisfactory linearization performance. Applying a TM3.1a 200MHz-BW 256-QAM OFDM signal to a 3.5 GHz GaN Doherty RF PA, DeltaDPD achieves -50.03 dBc in Adjacent Channel Power Ratio (ACPR), -37.22 dB in Normalized Mean Square Error (NMSE) and -38.52 dBc in Error Vector Magnitude (EVM) with 52% temporal sparsity, leading to a 1.8X reduction in estimated inference power. The DeltaDPD code will be released after formal publication at https://www.opendpd.com.
Submitted 29 April, 2025;
originally announced May 2025.
-
STG: Spatiotemporal Graph Neural Network with Fusion and Spatiotemporal Decoupling Learning for Prognostic Prediction of Colorectal Cancer Liver Metastasis
Authors:
Yiran Zhu,
Wei Yang,
Yan Su,
Zesheng Li,
Chengchang Pan,
Honggang Qi
Abstract:
We propose a multimodal spatiotemporal graph neural network (STG) framework to predict colorectal cancer liver metastasis (CRLM) progression. Current clinical models do not effectively integrate the tumor's spatial heterogeneity, dynamic evolution, and complex multimodal data relationships, limiting their predictive accuracy. Our STG framework combines preoperative CT imaging and clinical data into a heterogeneous graph structure, enabling joint modeling of tumor distribution and temporal evolution through spatial topology and cross-modal edges. The framework uses GraphSAGE to aggregate spatiotemporal neighborhood information and leverages supervised and contrastive learning strategies to enhance the model's ability to capture temporal features and improve robustness. A lightweight version of the model reduces parameter count by 78.55%, maintaining near-state-of-the-art performance. The model jointly optimizes recurrence risk regression and survival analysis tasks, with contrastive loss improving feature representational discriminability and cross-modal consistency. Experimental results on the MSKCC CRLM dataset show a time-adjacent accuracy of 85% and a mean absolute error of 1.1005, significantly outperforming existing methods. The innovative heterogeneous graph construction and spatiotemporal decoupling mechanism effectively uncover the associations between dynamic tumor microenvironment changes and prognosis, providing reliable quantitative support for personalized treatment decisions.
Submitted 5 May, 2025;
originally announced May 2025.
-
Selective Variable Convolution Meets Dynamic Content Guided Attention for Infrared Small Target Detection
Authors:
Yirui Chen,
Yiming Zhu,
Yuxin Jing,
Tianpei Zhang,
Yuchen Zheng
Abstract:
Infrared Small Target Detection (IRSTD) aims to identify small targets in complex backgrounds. Due to the convolution operation in Convolutional Neural Networks (CNNs), applying traditional CNNs to IRSTD presents challenges, since the feature extraction of small targets is often insufficient, resulting in the loss of critical features. To address these issues, we propose a dynamic content guided attention multiscale feature aggregation network (DCGANet), which adheres to the attention principle of 'coarse-to-fine' and achieves high detection accuracy. First, we propose a selective variable convolution (SVC) module that integrates the benefits of standard convolution, irregular deformable convolution, and multi-rate dilated convolution. This module is designed to expand the receptive field and enhance non-local features, thereby effectively improving the discrimination of targets from backgrounds. Second, the core component of DCGANet is a two-stage content guided attention module. This module employs a two-stage attention mechanism to initially direct the network's focus to salient regions within the feature maps and subsequently determine whether these regions correspond to targets or background interference. By retaining the most significant responses, this mechanism effectively suppresses false alarms. Additionally, we propose an adaptive dynamic feature fusion (ADFF) module to substitute for static feature cascading. This dynamic feature fusion strategy enables DCGANet to adaptively integrate contextual features, thereby enhancing its ability to discriminate true targets from false alarms. DCGANet has achieved new benchmarks across multiple datasets.
Submitted 30 April, 2025;
originally announced April 2025.
-
Cell-free Fluid Antenna Multiple Access Networks
Authors:
Tianyu Han,
Yongxu Zhu,
Kai-Kit Wong,
Gan Zheng,
Hyundong Shin
Abstract:
Fluid antenna enables position reconfigurability that gives transceiver access to a high-resolution spatial signal and the ability to avoid interference through the ups and downs of fading channels. Previous studies investigated this fluid antenna multiple access (FAMA) approach in a single-cell setup only. In this paper, we consider a cell-free network architecture in which users are associated with the nearest base stations (BSs) and all users share the same physical channel. Each BS has multiple fixed antennas that employ maximum ratio transmission (MRT) to beam to its associated users while each user relies on its fluid antenna system (FAS) on one radio frequency (RF) chain to overcome the inter-user interference. Our aim is to analyze the outage probability performance of such cell-free FAMA network when both large- and small-scale fading effects are considered. To do so, we derive the distribution of the received magnitude for a typical user and then the interference distribution under both fast and slow port switching techniques. The outage probability is finally obtained in integral form in each case. Numerical results demonstrate that in an interference-limited situation, although fast port switching is typically understood as the superior method for FAMA, slow port switching emerges as a more effective solution when there is a large antenna array at the BS. Moreover, it is revealed that FAS at each user can serve to greatly reduce the burden of BS in terms of both antenna costs and CSI estimation overhead, thereby enhancing the scalability of cell-free networks.
Submitted 29 April, 2025;
originally announced April 2025.
-
NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement: Methods and Results
Authors:
Xin Li,
Kun Yuan,
Bingchen Li,
Fengbin Guan,
Yizhen Shao,
Zihao Yu,
Xijun Wang,
Yiting Lu,
Wei Luo,
Suhang Yao,
Ming Sun,
Chao Zhou,
Zhibo Chen,
Radu Timofte,
Yabin Zhang,
Ao-Xiang Zhang,
Tianwu Zhi,
Jianzhao Liu,
Yang Li,
Jingwen Xu,
Yiting Liao,
Yushen Zuo,
Mingyang Wu,
Renjie Li,
Shengyun Zhong
, et al. (88 additional authors not shown)
Abstract:
This paper presents a review for the NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement. The challenge comprises two tracks: (i) Efficient Video Quality Assessment (KVQ), and (ii) Diffusion-based Image Super-Resolution (KwaiSR). Track 1 aims to advance the development of lightweight and efficient video quality assessment (VQA) models, with an emphasis on eliminating reliance on model ensembles, redundant weights, and other computationally expensive components in the previous IQA/VQA competitions. Track 2 introduces a new short-form UGC dataset tailored for single image super-resolution, i.e., the KwaiSR dataset. It consists of 1,800 synthetically generated S-UGC image pairs and 1,900 real-world S-UGC images, which are split into training, validation, and test sets using a ratio of 8:1:1. The primary objective of the challenge is to drive research that benefits the user experience of short-form UGC platforms such as Kwai and TikTok. This challenge attracted 266 participants and received 18 valid final submissions with corresponding fact sheets, significantly contributing to the progress of short-form UGC VQA and image super-resolution. The project is publicly available at https://github.com/lixinustc/KVQE-ChallengeCVPR-NTIRE2025.
Submitted 17 April, 2025;
originally announced April 2025.
-
RIS-Assisted Beamfocusing in Near-Field IoT Communication Systems: A Transformer-Based Approach
Authors:
Quan Zhou,
Jingjing Zhao,
Kaiquan Cai,
Yanbo Zhu
Abstract:
The massive number of antennas in extremely large aperture array (ELAA) systems shifts the propagation regime of signals in internet of things (IoT) communication systems towards near-field spherical wave propagation. We propose a reconfigurable intelligent surfaces (RIS)-assisted beamfocusing mechanism, where the design of the two-dimensional beam codebook that contains both the angular and distance domains is challenging. To address this issue, we introduce a novel Transformer-based two-stage beam training algorithm, which includes the coarse and fine search phases. The proposed mechanism provides a fine-grained codebook with enhanced spatial resolution, enabling precise beamfocusing. Specifically, in the first stage, the beam training is performed to estimate the approximate location of the device by using a simple codebook, determining whether it is within the beamfocusing range (BFR) or the none-beamfocusing range (NBFR). In the second stage, by using a more precise codebook, a fine-grained beam search strategy is conducted. Experimental results unveil that the precision of the RIS-assisted beamfocusing is greatly improved. The proposed method achieves beam selection accuracy up to 97% at signal-to-noise ratio (SNR) of 20 dB, and improves 10% to 50% over the baseline method at different SNRs.
Submitted 17 April, 2025;
originally announced April 2025.
-
Continuous Aperture Array (CAPA)-Based Secure Wireless Communications
Authors:
Jingjing Zhao,
Haowen Song,
Xidong Mu,
Kaiquan Cai,
Yanbo Zhu,
Yuanwei Liu
Abstract:
A continuous aperture array (CAPA)-based secure communication system is investigated, where a base station equipped with a CAPA transmits signals to a legitimate user under the existence of an eavesdropper. For improving the secrecy performance, the artificial noise (AN) is employed at the BS for the jamming purpose. We aim at maximizing the secrecy rate by jointly optimizing the information-bearing and AN source current patterns, subject to the maximum transmit power constraint. To solve the resultant non-convex integral-based functional programming problem, a channel subspace-based approach is first proposed via exploiting the result that the optimal current patterns always lie within the subspace spanned by all users' channel responses. Then, the intractable CAPA continuous source current pattern design problem with an infinite number of optimization variables is equivalently transformed into the channel-subspace weighting factor optimization problem with a finite number of optimization variables. A penalty-based successive convex approximation method is developed for iteratively optimizing the finite-size weighting vectors. To further reduce the computational complexity, we propose a two-stage source current patterns design scheme. Specifically, the information-bearing and AN patterns are first designed using the maximal ratio transmission and zero-forcing transmission, respectively. Then, the remaining power allocation is addressed via the one-dimensional search method. Numerical results unveil that 1) the CAPA brings in significant secrecy rate gain compared to the conventional discrete multiple-input multiple-output; 2) the proposed channel subspace-based algorithm outperforms the conventional Fourier-based approach, while sustaining much lower computational complexity; and 3) the two-stage ZF-MRT approach has negligible performance loss for the large transmit power regime.
Submitted 15 April, 2025;
originally announced April 2025.
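For readers unfamiliar with the objective in the CAPA abstract above, the secrecy rate is commonly defined as the non-negative gap between the legitimate user's rate and the eavesdropper's rate. The expression below is the generic textbook form, not the paper's CAPA-specific formulation; the symbols $\gamma_{\mathrm{B}}$ and $\gamma_{\mathrm{E}}$ (the signal-to-interference-plus-noise ratios at the legitimate user and the eavesdropper) are notation assumed here for illustration.

```latex
R_{\mathrm{s}} = \Big[ \log_2\!\big(1 + \gamma_{\mathrm{B}}\big) - \log_2\!\big(1 + \gamma_{\mathrm{E}}\big) \Big]^{+},
\qquad [x]^{+} = \max(x, 0).
```

In this form, artificial noise helps when it lowers $\gamma_{\mathrm{E}}$ more than it lowers $\gamma_{\mathrm{B}}$.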
-
Exploration of Approaches for Robustness and Safety in a Low Code Open Environment for Factory Automation
Authors:
Gustavo Quiros A.,
Yi Peng Zhu,
Tao Cui,
Shaokai Lin,
Marten Lohstroh,
Edward A. Lee
Abstract:
This report is a compilation of technical knowledge and concepts that were produced by the authors and additional contributors in the context of the collaboration projects "Abstraction Requirements for Language of Choice in Industrial Automation" (FY21-22) and "Approaches for Robust and Safe Low-Code" (FY23-24) from Siemens Technology and the University of California, Berkeley. The primary objective of these projects was to assess Siemens Open Industrial Edge (OIE) engineering capabilities by defining a concept that ensures the satisfaction of coordination and safety requirements when using disparate OIE modules. The objective was to use the Lingua Franca (LF) coordination language to demonstrate how to address challenges in: 1. engineering modular, distributed, and flexible automation solutions that ensure, by design, robust and safe operation; 2. the use of IEC 61499, the event driven execution model for specifying the execution order of OIE modules (defined as function blocks); 3. support large-scale distributed OIE automation solutions, and eventually 4. define optimal solutions with synchronization and time-optimal mechanisms.
Submitted 5 April, 2025;
originally announced April 2025.
-
PupiNet: Seamless OCT-OCTA Interconversion Through Wavelet-Driven and Multi-Scale Attention Mechanisms
Authors:
Renzhi Tian,
Jinjie Wang,
Wei Yang,
Weizhen Li,
Haoran Chen,
Yiran Zhu,
Chengchang Pan,
Honggang Qi
Abstract:
Optical Coherence Tomography (OCT) and Optical Coherence Tomography Angiography (OCTA) are key diagnostic tools for clinical evaluation and management of retinal diseases. Compared to traditional OCT, OCTA provides richer microvascular information, but its acquisition requires specialized sensors and high-cost equipment, creating significant challenges for the clinical deployment of hardware-dependent OCTA imaging methods. Given the technical complexity of OCTA image acquisition and potential mechanical artifacts, this study proposes a bidirectional image conversion framework called PupiNet, which accurately achieves bidirectional transformation between 3D OCT and 3D OCTA. The generator module of this framework innovatively integrates wavelet transformation and multi-scale attention mechanisms, significantly enhancing image conversion quality. Meanwhile, an Adaptive Discriminator Augmentation (ADA) module has been incorporated into the discriminator to optimize model training stability and convergence efficiency. To ensure clinical accuracy of vascular structures in the converted images, we designed a Vessel Structure Matcher (VSM) supervision module, achieving precise matching of vascular morphology between generated images and target images. Additionally, the Hierarchical Feature Calibration (HFC) module further guarantees high consistency of texture details between generated images and target images across different depth levels. To rigorously validate the clinical effectiveness of the proposed method, we conducted a comprehensive evaluation on a paired OCT-OCTA image dataset containing 300 eyes with various retinal pathologies. Experimental results demonstrate that PupiNet not only reliably achieves high-quality bidirectional transformation between the two modalities but also shows significant advantages in image fidelity, vessel structure preservation, and clinical usability.
Submitted 31 March, 2025;
originally announced March 2025.
-
Learning-based Estimation of Forward Kinematics for an Orthotic Parallel Robotic Mechanism
Authors:
Jingzong Zhou,
Yuhan Zhu,
Xiaobin Zhang,
Sunil Agrawal,
Konstantinos Karydis
Abstract:
This paper introduces a 3D parallel robot with three identical five-degree-of-freedom chains connected to a circular brace end-effector, aimed to serve as an assistive device for patients with cervical spondylosis. The inverse kinematics of the system is solved analytically, whereas learning-based methods are deployed to solve the forward kinematics. The methods considered herein include a Koopman operator-based approach as well as a neural network-based approach. The task is to predict the position and orientation of end-effector trajectories. The dataset used to train these methods is based on the analytical solutions derived via inverse kinematics. The methods are tested both in simulation and via physical hardware experiments with the developed robot. Results validate the suitability of deploying learning-based methods for studying parallel mechanism forward kinematics that are generally hard to resolve analytically.
Submitted 14 March, 2025;
originally announced March 2025.
-
MACS: Multi-source Audio-to-image Generation with Contextual Significance and Semantic Alignment
Authors:
Hao Zhou,
Xiaobao Guo,
Yuzhe Zhu,
Adams Wai-Kin Kong
Abstract:
Propelled by the breakthrough in deep generative models, audio-to-image generation has emerged as a pivotal cross-modal task that converts complex auditory signals into rich visual representations. However, previous works only focus on single-source audio inputs for image generation, ignoring the multi-source characteristic in natural auditory scenes, thus limiting the performance in generating comprehensive visual content. To bridge this gap, a method called MACS is proposed to conduct multi-source audio-to-image generation. This is the first work that explicitly separates multi-source audio to capture the rich audio components before image generation. MACS is a two-stage method. In the first stage, multi-source audio inputs are separated by a weakly supervised method, where the audio and text labels are semantically aligned by casting into a common space using the large pre-trained CLAP model. We introduce a ranking loss to consider the contextual significance of the separated audio signals. In the second stage, efficient image generation is achieved by mapping the separated audio signals to the generation condition using only a trainable adapter and an MLP layer. We preprocess the LLP dataset as the first full multi-source audio-to-image generation benchmark. The experiments are conducted on multi-source, mixed-source, and single-source audio-to-image generation tasks. The proposed MACS outperforms the current state-of-the-art methods in 17 of the 21 evaluation indexes on all tasks and delivers superior visual quality. The code will be publicly available.
Submitted 13 March, 2025;
originally announced March 2025.
-
4D-ACFNet: A 4D Attention Mechanism-Based Prognostic Framework for Colorectal Cancer Liver Metastasis Integrating Multimodal Spatiotemporal Features
Authors:
Zesheng Li,
Wei Yang,
Yan Su,
Yiran Zhu,
Yuhan Tang,
Haoran Chen,
Chengchang Pan,
Honggang Qi
Abstract:
Postoperative prognostic prediction for colorectal cancer liver metastasis (CRLM) remains challenging due to tumor heterogeneity, dynamic evolution of the hepatic microenvironment, and insufficient multimodal data fusion. To address these issues, we propose 4D-ACFNet, the first framework that synergistically integrates lightweight spatiotemporal modeling, cross-modal dynamic calibration, and personalized temporal prediction within a unified architecture. Specifically, it incorporates a novel 4D spatiotemporal attention mechanism, which employs spatiotemporal separable convolution (reducing parameter count by 41%) and virtual timestamp encoding to model the interannual evolution patterns of postoperative dynamic processes, such as liver regeneration and steatosis. For cross-modal feature alignment, Transformer layers are integrated to jointly optimize modality alignment loss and disentanglement loss, effectively suppressing scale mismatch and redundant interference in clinical-imaging data. Additionally, we design a dynamic prognostic decision module that generates personalized interannual recurrence risk heatmaps through temporal upsampling and a gated classification head, overcoming the limitations of traditional methods in temporal dynamic modeling and cross-modal alignment. Experiments on 197 CRLM patients demonstrate that the model achieves 100% temporal adjacency accuracy (TAA), with performance significantly surpassing existing approaches. This study establishes the first spatiotemporal modeling paradigm for postoperative dynamic monitoring of CRLM. The proposed framework can be extended to prognostic analysis of multi-cancer metastases, advancing precision surgery from "spatial resection" to "spatiotemporal cure."
Submitted 12 March, 2025;
originally announced March 2025.
-
FilmComposer: LLM-Driven Music Production for Silent Film Clips
Authors:
Zhifeng Xie,
Qile He,
Youjia Zhu,
Qiwei He,
Mengtian Li
Abstract:
In this work, we implement music production for silent film clips using an LLM-driven method. Given the strong professional demands of film music production, we propose FilmComposer, which simulates the actual workflows of professional musicians. FilmComposer is the first to combine large generative models with a multi-agent approach, leveraging the advantages of both waveform music and symbolic music generation. Additionally, FilmComposer is the first to focus on the three core elements of music production for film (audio quality, musicality, and musical development) and introduces various controls, such as rhythm, semantics, and visuals, to enhance these key aspects. Specifically, FilmComposer consists of a visual processing module, a rhythm-controllable MusicGen, and multi-agent assessment, arrangement, and mixing. In addition, our framework can seamlessly integrate into the actual music production pipeline and allows user intervention in every step, providing strong interactivity and a high degree of creative freedom. Furthermore, considering the lack of a professional and high-quality film music dataset, we propose MusicPro-7k, which includes 7,418 film clips, music, description, rhythm spots and main melody. Finally, both the standard metrics and the new specialized metrics we propose demonstrate that the music generated by our model achieves state-of-the-art performance in terms of quality, consistency with video, diversity, musicality, and musical development. Project page: https://apple-jun.github.io/FilmComposer.github.io/
Submitted 11 March, 2025;
originally announced March 2025.
-
Beamforming Design for Beyond Diagonal RIS-Aided Cell-Free Massive MIMO Systems
Authors:
Yizhuo Li,
Jiakang Zheng,
Bokai Xu,
Yiyang Zhu,
Jiayi Zhang,
Bo Ai
Abstract:
Reconfigurable intelligent surface (RIS)-aided cell-free (CF) massive multiple-input multiple-output (mMIMO) is a promising architecture for further improving spectral efficiency (SE) with low cost and power consumption. However, conventional RIS has inevitable limitations due to its capability of only reflecting signals. In contrast, beyond-diagonal RIS (BD-RIS), with its ability to both reflect and transmit signals, has gained great attention. This correspondence focuses on using BD-RIS to improve the sum SE of CF mMIMO systems. This requires completing the beamforming design under the transmit power constraints and unitary constraints of the BD-RIS, by optimizing the active and passive beamformers simultaneously. To tackle this issue, we introduce an alternating optimization algorithm that decomposes it using fractional programming and solves the subproblems alternately. Moreover, to address the challenge introduced by the unitary constraint on the beamforming matrix of the BD-RIS, a manifold optimization algorithm is proposed to solve the problem optimally. Simulation results show that BD-RISs outperform RISs comprehensively, especially in the case of the fully connected architecture, which achieves the best performance, enhancing the sum SE by around 40% compared to ideal RISs.
Submitted 10 March, 2025;
originally announced March 2025.
-
Efficient Integration of Distributed Learning Services in Next-Generation Wireless Networks
Authors:
Paul Zheng,
Navid Keshtiarast,
Pradyumna Kumar Bishoyi,
Yao Zhu,
Yulin Hu,
Marina Petrova,
Anke Schmeink
Abstract:
Distributed learning (DL) is considered a cornerstone enabler of intelligence, since it allows for collaborative training without the necessity for local clients to share raw data with other parties, thereby preserving privacy and security. Integrating DL into 6G networks requires coexistence design with existing services such as high-bandwidth (HB) traffic like eMBB. Current designs in the literature mainly focus on communication round (CR)-wise designs that assume a fixed resource allocation during each CR. However, fixed resource allocation within a CR is a highly inefficient and inaccurate representation of the system's realistic behavior. This is due to the heterogeneous nature of the system, where clients inherently need to access the network at different times. This work zooms into one arbitrary communication round and demonstrates the importance of considering a time-dependent resource-sharing design with HB traffic. We propose a time-dependent optimization problem for minimizing the time and energy consumed by DL within the CR. Due to its intractability, a session-based optimization problem is proposed, assuming a large-scale coherence time. An iterative algorithm is designed to solve such problems, and simulation results confirm the importance of such efficient and accurate integration design.
Submitted 10 March, 2025;
originally announced March 2025.
-
LLaSE-G1: Incentivizing Generalization Capability for LLaMA-based Speech Enhancement
Authors:
Boyi Kang,
Xinfa Zhu,
Zihan Zhang,
Zhen Ye,
Mingshuai Liu,
Ziqian Wang,
Yike Zhu,
Guobin Ma,
Jun Chen,
Longshuai Xiao,
Chao Weng,
Wei Xue,
Lei Xie
Abstract:
Recent advancements in language models (LMs) have demonstrated strong capabilities in semantic understanding and contextual modeling, which have flourished in generative speech enhancement (SE). However, many LM-based SE approaches primarily focus on semantic information, often neglecting the critical role of acoustic information, which leads to acoustic inconsistency after enhancement and limited generalization across diverse SE tasks. In this paper, we introduce LLaSE-G1, a LLaMA-based language model that incentivizes generalization capabilities for speech enhancement. LLaSE-G1 offers the following key contributions: First, to mitigate acoustic inconsistency, LLaSE-G1 employs continuous representations from WavLM as input and predicts speech tokens from X-Codec2, maximizing acoustic preservation. Second, to promote generalization capability, LLaSE-G1 introduces dual-channel inputs and outputs, unifying multiple SE tasks without requiring task-specific IDs. Third, LLaSE-G1 outperforms prior task-specific discriminative and generative SE models, demonstrating scaling effects at test time and emerging capabilities for unseen SE tasks. Additionally, we release our code and models to support further research in this area.
Submitted 10 June, 2025; v1 submitted 1 March, 2025;
originally announced March 2025.
-
Sim-to-Real Reinforcement Learning for Vision-Based Dexterous Manipulation on Humanoids
Authors:
Toru Lin,
Kartik Sachdev,
Linxi Fan,
Jitendra Malik,
Yuke Zhu
Abstract:
Reinforcement learning has delivered promising results in achieving human- or even superhuman-level capabilities across diverse problem domains, but success in dexterous robot manipulation remains limited. This work investigates the key challenges in applying reinforcement learning to solve a collection of contact-rich manipulation tasks on a humanoid embodiment. We introduce novel techniques to overcome the identified challenges with empirical validation. Our main contributions include an automated real-to-sim tuning module that brings the simulated environment closer to the real world, a generalized reward design scheme that simplifies reward engineering for long-horizon contact-rich manipulation tasks, a divide-and-conquer distillation process that improves the sample efficiency of hard-exploration problems while maintaining sim-to-real performance, and a mixture of sparse and dense object representations to bridge the sim-to-real perception gap. We show promising results on three humanoid dexterous manipulation tasks, with ablation studies on each technique. Our work presents a successful approach to learning humanoid dexterous manipulation using sim-to-real reinforcement learning, achieving robust generalization and high performance without the need for human demonstration.
Submitted 27 February, 2025;
originally announced February 2025.
-
RURANET++: An Unsupervised Learning Method for Diabetic Macular Edema Based on SCSE Attention Mechanisms and Dynamic Multi-Projection Head Clustering
Authors:
Wei Yang,
Yiran Zhu,
Jiayu Shen,
Yuhan Tang,
Chengchang Pan,
Hui He,
Yan Su,
Honggang Qi
Abstract:
Diabetic Macular Edema (DME), a prevalent complication among diabetic patients, constitutes a major cause of visual impairment and blindness. Although deep learning has achieved remarkable progress in medical image analysis, traditional DME diagnosis still relies on extensive annotated data and subjective ophthalmologist assessments, limiting practical applications. To address this, we present RURANET++, an unsupervised learning-based automated DME diagnostic system. This framework incorporates an optimized U-Net architecture with embedded Spatial and Channel Squeeze & Excitation (SCSE) attention mechanisms to enhance lesion feature extraction. During feature processing, a pre-trained GoogLeNet model extracts deep features from retinal images, followed by PCA-based dimensionality reduction to 50 dimensions for computational efficiency. Notably, we introduce a novel clustering algorithm employing multi-projection heads to explicitly control cluster diversity while dynamically adjusting similarity thresholds, thereby optimizing intra-class consistency and inter-class discrimination. Experimental results demonstrate superior performance across multiple metrics, achieving maximum accuracy (0.8411), precision (0.8593), recall (0.8411), and F1-score (0.8390), with exceptional clustering quality. This work provides an efficient unsupervised solution for DME diagnosis with significant clinical implications.
Submitted 7 March, 2025; v1 submitted 27 February, 2025;
originally announced February 2025.
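The PCA step mentioned in the abstract above (reducing deep features to 50 dimensions) can be sketched as follows. This is a minimal illustration, not the authors' code: the feature matrix is random stand-in data, the 1024-dimensional input width is an assumption, and `pca_reduce` is a hypothetical helper name.

```python
import numpy as np

def pca_reduce(features: np.ndarray, n_components: int = 50):
    """Project row-wise feature vectors onto their leading principal components."""
    # Center each feature column so the SVD captures variance, not the mean.
    centered = features - features.mean(axis=0, keepdims=True)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    components = Vt[:n_components]
    return centered @ components.T, components

# Stand-in for deep features of 200 retinal images (assumed shape, random data).
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 1024))
reduced, components = pca_reduce(features, n_components=50)
```

The reduced matrix has one 50-dimensional row per image and would then be passed to whatever clustering stage follows.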
-
Transient Stability Analysis and Fault Clearing Angle Estimation of VSG Based on Domain of Attraction Estimated by Trajectory Reversing Method
Authors:
Jiayue Lyu,
Tianzhi Fang,
Zhiheng Lin,
Jingxue Han,
Yantao Zhu
Abstract:
The virtual synchronous generator (VSG), whose nonlinear power-angle relationship is analogous to that of the synchronous generator (SG), has attracted much attention as a promising solution for converter-based power systems. In this paper, a large-signal model of the grid-connected VSG is first established. The trajectory reversing method (TRM) is then introduced to estimate the domain of attraction (DOA) of the VSG. Subsequently, the transient instability mechanism is revealed in detail based on the estimated DOA boundary. The impacts of system parameters on the DOA range are further investigated. It is found that loss of synchronization (LOS) occurs if the system trajectory lies outside the post-fault DOA range. In scenarios where no equilibrium points exist after a grid fault, system stability can be reestablished only when the fault clearing angle (FCA) does not exceed the critical clearing angle (CCA). Finally, the CCA derived from the DOA and that from the conventional equal area criterion (EAC) are compared. The results show that the CCA obtained by our solution has higher accuracy. Time-domain simulations are performed to verify the effectiveness of the proposed transient stability analysis method for the grid-connected VSG.
Submitted 26 February, 2025;
originally announced February 2025.
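For orientation, the conventional equal area criterion that the abstract above uses as its baseline can be sketched for the simplest textbook case. The assumptions here are not taken from the paper: a lossless single-machine infinite-bus model with power-angle curve P = P_max sin(δ), zero power transfer during the fault, and a post-fault curve identical to the pre-fault one. Under those assumptions the critical clearing angle satisfies cos(δ_c) = (π − 2δ_0) sin(δ_0) − cos(δ_0).

```python
import math

def critical_clearing_angle(delta0: float) -> float:
    """Textbook EAC critical clearing angle (radians) for a zero-transfer fault."""
    cos_dc = (math.pi - 2.0 * delta0) * math.sin(delta0) - math.cos(delta0)
    return math.acos(cos_dc)

delta0 = math.radians(30.0)          # assumed pre-fault operating angle
delta_c = critical_clearing_angle(delta0)

# Check the criterion itself, with powers normalized by P_max.
p_m = math.sin(delta0)               # mechanical input equals pre-fault output
accel_area = p_m * (delta_c - delta0)
decel_area = (math.cos(delta_c) + math.cos(delta0)) - p_m * (math.pi - delta0 - delta_c)
```

At the critical angle the accelerating and decelerating areas are equal, which is what the last two lines verify numerically. A VSG with damping and control dynamics departs from this idealization, which is the gap the DOA-based estimate in the abstract is meant to address.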
-
Joint Power Allocation and Phase Shift Design for Stacked Intelligent Metasurfaces-aided Cell-Free Massive MIMO Systems with MARL
Authors:
Yiyang Zhu,
Jiayi Zhang,
Enyu Shi,
Ziheng Liu,
Chau Yuen,
Bo Ai
Abstract:
Cell-free (CF) massive multiple-input multiple-output (mMIMO) systems offer high spectral efficiency (SE) through multiple distributed access points (APs). However, the large number of antennas increases power consumption. We propose incorporating stacked intelligent metasurfaces (SIM) into CF mMIMO systems as a cost-effective, energy-efficient solution. This paper focuses on optimizing the joint power allocation of APs and the phase shift of SIMs to maximize the sum SE. To address this complex problem, we introduce a fully distributed multi-agent reinforcement learning (MARL) algorithm. Our novel algorithm, the noisy value method with a recurrent policy in multi-agent policy optimization (NVR-MAPPO), enhances performance by encouraging diverse exploration under centralized training and decentralized execution. Simulations demonstrate that NVR-MAPPO significantly improves sum SE and robustness across various scenarios.
Submitted 26 February, 2025;
originally announced February 2025.
-
Waveguide Division Multiple Access for Pinching-Antenna Systems (PASS)
Authors:
Jingjing Zhao,
Xidong Mu,
Kaiquan Cai,
Yanbo Zhu,
Yuanwei Liu
Abstract:
A novel concept of waveguide division multiple access (WDMA) is proposed for multi-user pinching-antenna systems (PASS). The key principle of WDMA is to allocate each user with a dedicated waveguide, which is regarded as a new type of radio resources, so as to facilitate multi-user communications. By adjusting the activation positions of pinching antennas (PAs) over each waveguide, the pinching beamforming can be exploited for intended user signal enhancement and inter-user interference mitigation. Considering both ideal continuous and practical discrete PA position activation schemes, a joint power allocation and pinching beamforming optimization problem is formulated for the maximization of the sum rate. An alternating optimization-based algorithm is developed to address the formulated nonconvex problem. For solving the power allocation subproblem, the successive convex approximation method is invoked. For the pinching beamforming design subproblem, a penalty-based gradient ascent algorithm is first developed for the continuous PA activation case. Then, for the discrete PA activation case, a matching theory-based algorithm is proposed to achieve the near-optimal performance but with a low complexity. Numerical results unveil that: 1) For both continuous and discrete activation cases, PASS can achieve a significant performance gain over conventional fixed-position antenna systems; 2) the proposed WDMA can effectively underpin multi-user communications with the near orthogonality in free space achieved by the pinching beamforming; and 3) the performance gap between the discrete and continuous activation cases can be significantly alleviated with practically feasible numbers of PA candidate positions.
Submitted 24 February, 2025;
originally announced February 2025.
-
SpikACom: A Neuromorphic Computing Framework for Green Communications
Authors:
Yanzhen Liu,
Zhijin Qin,
Yongxu Zhu,
Geoffrey Ye Li
Abstract:
The ever-growing power consumption of wireless communication systems necessitates more energy-efficient algorithms. This paper introduces SpikACom (Spiking Adaptive Communication), a neuromorphic computing-based framework for power-intensive wireless communication tasks. SpikACom leverages brain-inspired spiking neural networks (SNNs) for efficient signal processing. It is designed for dynamic wireless environments, helping to mitigate catastrophic forgetting and facilitate adaptation to new circumstances. Moreover, SpikACom is customizable, allowing flexible integration of domain knowledge to enhance its interpretability and efficacy. We validate its performance on fundamental wireless communication tasks, including task-oriented semantic communication, multiple-input multiple-output (MIMO) beamforming, and orthogonal frequency-division multiplexing (OFDM) channel estimation. The simulation results show that SpikACom significantly reduces power consumption while matching or exceeding the performance of conventional algorithms. This study highlights the potential of SNNs for enabling greener and smarter wireless communication systems.
Submitted 24 February, 2025;
originally announced February 2025.
-
Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction
Authors:
Ailin Huang,
Boyong Wu,
Bruce Wang,
Chao Yan,
Chen Hu,
Chengli Feng,
Fei Tian,
Feiyu Shen,
Jingbei Li,
Mingrui Chen,
Peng Liu,
Ruihang Miao,
Wang You,
Xi Chen,
Xuerui Yang,
Yechang Huang,
Yuxiang Zhang,
Zheng Gong,
Zixin Zhang,
Hongyu Zhou,
Jianjian Sun,
Brian Li,
Chengting Feng,
Changyi Wan,
Hanpeng Hu
, et al. (120 additional authors not shown)
Abstract:
Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as high costs in voice data collection, weakness in dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contributions include: 1) a 130B-parameter unified speech-text multi-modal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice cloning framework and produces the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine control system enabling dynamic adjustments across dialects, emotions, singing, and RAP; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to manage complex tasks effectively. Based on our new StepEval-Audio-360 evaluation benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in terms of instruction following. On open-source benchmarks like LLaMA Question, it shows a 9.3% average performance improvement, demonstrating our commitment to advancing the development of open-source multi-modal language technologies. Our code and models are available at https://github.com/stepfun-ai/Step-Audio.
Submitted 18 February, 2025; v1 submitted 17 February, 2025;
originally announced February 2025.
-
Robust Multidimensional Graph Neural Networks for Signal Processing in Wireless Communications with Edge-Graph Information Bottleneck
Authors:
Ziheng Liu,
Jiayi Zhang,
Yiyang Zhu,
Enyu Shi,
Bo Ai
Abstract:
Signal processing is crucial for satisfying the high data rate requirements of future sixth-generation (6G) wireless networks. However, the rapid growth of wireless networks has brought about massive data traffic, which hinders the application of traditional optimization theory-based algorithms. Meanwhile, traditional graph neural networks (GNNs) focus on compressing inputs onto vertices to update representations, which often leads to their inability to effectively distinguish input features and severely weakens performance. In this context, designing efficient signal processing frameworks becomes imperative. Moreover, actual scenarios are susceptible to multipath interference and noise, resulting in specific differences between the received and actual information. To address these challenges, this paper incorporates multidimensional graph neural networks (MDGNNs) with edge-graph information bottleneck (EGIB) to design a robust framework for signal processing. Specifically, MDGNNs utilize hyper-edges instead of vertices to update representations to avoid indistinguishable features and reduce information loss, while EGIB encourages providing minimal sufficient information about outputs to avoid aggregation of irrelevant information. We numerically demonstrate that compared with existing frameworks, the proposed frameworks achieve excellent performance in terms of spectrum efficiency (SE) and network overhead under multiple signal processing tasks. Remarkably, as the interference noise increases, the SE performance of the proposed frameworks gradually stabilizes. This reveals the proposed frameworks have excellent robustness in interference prone environments, especially in wireless policies related to channel matrices.
Submitted 15 February, 2025;
originally announced February 2025.
-
Inverse Design with Dynamic Mode Decomposition
Authors:
Yunpeng Zhu,
Liangliang Cheng,
Anping Jing,
Hanyu Huo,
Ziqiang Lang,
Bo Zhang,
J. Nathan Kutz
Abstract:
We introduce a computationally efficient method for the automation of inverse design in science and engineering. Based on simple least-squares regression, the underlying dynamic mode decomposition algorithm can be used to construct a low-rank subspace spanning multiple experiments in parameter space. The proposed inverse design dynamic mode composition (ID-DMD) algorithm leverages the computed low-dimensional subspace to enable fast digital design and optimization on laptop-level computing, including the potential to prescribe the dynamics themselves. Moreover, the method is robust to noise, physically interpretable, and can provide uncertainty quantification metrics. The architecture can also efficiently scale to large-scale design problems using randomized algorithms in the ID-DMD. The simplicity of the method and its implementation are highly attractive in practice, and the ID-DMD has been demonstrated to be an order of magnitude more accurate than competing methods while simultaneously being 3-5 orders of magnitude faster on challenging engineering design problems ranging from structural vibrations to fluid dynamics. Due to its speed, robustness, interpretability, and ease of use, ID-DMD, in comparison with other leading machine learning methods, represents a significant advancement in data-driven methods for inverse design and optimization, promising a paradigm shift in how to approach inverse design in practice.
Submitted 13 February, 2025;
originally announced February 2025.
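The least-squares regression underlying dynamic mode decomposition, mentioned in the abstract above, can be illustrated with a small sketch. This shows only standard exact DMD on a single toy trajectory, not the ID-DMD algorithm itself; `fit_dmd`, the 2-state test system, and its parameters are all assumptions made for the example.

```python
import numpy as np

def fit_dmd(snapshots: np.ndarray, rank: int):
    """Exact DMD: least-squares fit of x[k+1] ~ A x[k], truncated to `rank` modes."""
    X, Y = snapshots[:, :-1], snapshots[:, 1:]
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    U, s, Vh = U[:, :rank], s[:rank], Vh[:rank]
    # Least-squares operator projected onto the leading left singular vectors.
    A_tilde = U.conj().T @ Y @ Vh.conj().T @ np.diag(1.0 / s)
    eigvals, W = np.linalg.eig(A_tilde)
    # Lift reduced eigenvectors back to the full state space.
    modes = Y @ Vh.conj().T @ np.diag(1.0 / s) @ W
    return eigvals, modes

# Toy data: a damped rotation with known eigenvalues 0.95 * exp(+/- 0.3j).
theta, decay = 0.3, 0.95
A_true = decay * np.array([[np.cos(theta), -np.sin(theta)],
                           [np.sin(theta),  np.cos(theta)]])
states = [np.array([1.0, 0.0])]
for _ in range(30):
    states.append(A_true @ states[-1])
snapshots = np.stack(states, axis=1)   # shape (2, 31), one column per time step

eigvals, modes = fit_dmd(snapshots, rank=2)
```

On noise-free data from a linear system the recovered eigenvalues match those of the generating matrix; with noisy or nonlinear data the rank truncation acts as the low-rank approximation the abstract refers to.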
-
Multi-Agent Reinforcement Learning in Wireless Distributed Networks for 6G
Authors:
Jiayi Zhang,
Ziheng Liu,
Yiyang Zhu,
Enyu Shi,
Bokai Xu,
Chau Yuen,
Dusit Niyato,
Mérouane Debbah,
Shi Jin,
Bo Ai,
Xuemin Shen
Abstract:
The introduction of intelligent interconnectivity between the physical and human worlds has attracted great attention for future sixth-generation (6G) networks, emphasizing massive capacity, ultra-low latency, and unparalleled reliability. Wireless distributed networks and multi-agent reinforcement learning (MARL), both of which have evolved from centralized paradigms, are two promising solutions in this context. Given their distinct capabilities, such as decentralization and collaborative mechanisms, integrating these two paradigms holds great promise for unleashing the full power of 6G, attracting significant research and development attention. This paper provides a comprehensive study on MARL-assisted wireless distributed networks for 6G. In particular, we introduce the basic mathematical background and evolution of wireless distributed networks and MARL, as well as demonstrate their interrelationships. Subsequently, we analyze different structures of wireless distributed networks from homogeneous and heterogeneous perspectives. Furthermore, we introduce the basic concepts of MARL and discuss two typical categories: model-based and model-free. We then present critical challenges faced by MARL-assisted wireless distributed networks, providing important guidance and insights for actual implementation. We also explore the interplay between MARL-assisted wireless distributed networks and emerging techniques, such as information bottleneck and mirror learning, delivering in-depth analyses and application scenarios. Finally, we outline several compelling research directions for future MARL-assisted wireless distributed networks.
Submitted 9 February, 2025;
originally announced February 2025.
-
MedMimic: Physician-Inspired Multimodal Fusion for Early Diagnosis of Fever of Unknown Origin
Authors:
Minrui Chen,
Yi Zhou,
Huidong Jiang,
Yuhan Zhu,
Guanjie Zou,
Minqi Chen,
Rong Tian,
Hiroto Saigo
Abstract:
Fever of unknown origin (FUO) remains a diagnostic challenge. MedMimic is introduced as a multimodal framework inspired by real-world diagnostic processes. It uses pretrained models such as DINOv2, Vision Transformer, and ResNet-18 to convert high-dimensional 18F-FDG PET/CT imaging into low-dimensional, semantically meaningful features. A learnable self-attention-based fusion network then integrates these imaging features with clinical data for classification. Using 416 FUO patient cases from Sichuan University West China Hospital from 2017 to 2023, the multimodal fusion classification network (MFCN) achieved macro-AUROC scores ranging from 0.8654 to 0.9291 across seven tasks, outperforming conventional machine learning and single-modality deep learning methods. Ablation studies and five-fold cross-validation further validated its effectiveness. By combining the strengths of pretrained large models and deep learning, MedMimic offers a promising solution for disease classification.
Submitted 13 February, 2025; v1 submitted 7 February, 2025;
originally announced February 2025.
-
ASAP: Aligning Simulation and Real-World Physics for Learning Agile Humanoid Whole-Body Skills
Authors:
Tairan He,
Jiawei Gao,
Wenli Xiao,
Yuanhang Zhang,
Zi Wang,
Jiashun Wang,
Zhengyi Luo,
Guanqi He,
Nikhil Sobanbab,
Chaoyi Pan,
Zeji Yi,
Guannan Qu,
Kris Kitani,
Jessica Hodgins,
Linxi "Jim" Fan,
Yuke Zhu,
Changliu Liu,
Guanya Shi
Abstract:
Humanoid robots hold the potential for unparalleled versatility in performing human-like, whole-body skills. However, achieving agile and coordinated whole-body motions remains a significant challenge due to the dynamics mismatch between simulation and the real world. Existing approaches, such as system identification (SysID) and domain randomization (DR) methods, often rely on labor-intensive parameter tuning or result in overly conservative policies that sacrifice agility. In this paper, we present ASAP (Aligning Simulation and Real-World Physics), a two-stage framework designed to tackle the dynamics mismatch and enable agile humanoid whole-body skills. In the first stage, we pre-train motion tracking policies in simulation using retargeted human motion data. In the second stage, we deploy the policies in the real world and collect real-world data to train a delta (residual) action model that compensates for the dynamics mismatch. Then, ASAP fine-tunes pre-trained policies with the delta action model integrated into the simulator to align effectively with real-world dynamics. We evaluate ASAP across three transfer scenarios: IsaacGym to IsaacSim, IsaacGym to Genesis, and IsaacGym to the real-world Unitree G1 humanoid robot. Our approach significantly improves agility and whole-body coordination across various dynamic motions, reducing tracking error compared to SysID, DR, and delta dynamics learning baselines. ASAP enables highly agile motions that were previously difficult to achieve, demonstrating the potential of delta action learning in bridging simulation and real-world dynamics. These results suggest a promising sim-to-real direction for developing more expressive and agile humanoids.
Submitted 25 April, 2025; v1 submitted 3 February, 2025;
originally announced February 2025.
-
Near-Field Integrated Sensing and Communications for Secure UAV Networks
Authors:
Jingjing Zhao,
Songtao Xue,
Kaiquan Cai,
Xidong Mu,
Yuanwei Liu,
Yanbo Zhu
Abstract:
A novel near-field integrated sensing and communications framework for secure unmanned aerial vehicle (UAV) networks with high time efficiency is proposed. A ground base station (GBS) with a large aperture size communicates with one communication UAV (C-UAV) in the presence of one eavesdropping UAV (E-UAV), where artificial noise (AN) is employed for both jamming and sensing purposes. Given that the E-UAV's motion model is unknown at the GBS, we first propose a near-field localization and trajectory tracking scheme. Specifically, exploiting the variant Doppler shift observations over the spatial domain in the near field, the E-UAV's three-dimensional (3D) velocities are estimated from echo signals. To provide timely correction of location prediction errors, the extended Kalman filter (EKF) is adopted to fuse the predicted states and the measured ones. Subsequently, based on the real-time predicted location of the E-UAV, we further propose a joint GBS beamforming and C-UAV trajectory design scheme for maximizing the instantaneous secrecy rate while guaranteeing the sensing accuracy constraint. To solve the resultant non-convex problem, an alternating optimization approach is developed, where the near-field GBS beamforming and the C-UAV trajectory design subproblems are iteratively solved by exploiting the successive convex approximation method. Finally, our numerical results show that: 1) the E-UAV's 3D velocities and location can be accurately estimated in real time with our proposed framework by exploiting the near-field spherical wave propagation; and 2) the proposed framework achieves a superior secrecy rate compared to benchmark schemes and closely approaches the performance achieved when the E-UAV trajectory is perfectly known.
Submitted 2 February, 2025;
originally announced February 2025.
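The abstract above mentions fusing predicted and measured states with an extended Kalman filter. As a rough illustration of that generic predict/update cycle only, here is a minimal sketch in plain Python. It is not the authors' model: the one-dimensional constant-velocity state, the range-only measurement, and every name and parameter (`ekf_step`, `run_demo`, `q`, `r`, `h`) are assumptions made for the example.

```python
import math


def ekf_step(x, P, z, dt, q, r, h):
    """One EKF predict/update cycle (illustrative, hypothetical model).

    x  : [position, velocity] estimate
    P  : 2x2 covariance as nested lists
    z  : measured range to a sensor offset by height h
    q,r: assumed process and measurement noise variances
    """
    p, v = x
    # Predict with a constant-velocity model, F = [[1, dt], [0, 1]].
    p_pred, v_pred = p + dt * v, v
    (a, b), (c, d) = P
    fp00, fp01, fp10, fp11 = a + dt * c, b + dt * d, c, d
    # P_pred = F P F^T + q * I
    pp00 = fp00 + fp01 * dt + q
    pp01 = fp01
    pp10 = fp10 + fp11 * dt
    pp11 = fp11 + q

    # Nonlinear measurement: range = sqrt(position^2 + h^2).
    z_pred = math.sqrt(p_pred ** 2 + h ** 2)
    h0 = p_pred / z_pred  # Jacobian H = [h0, 0]
    s = h0 * pp00 * h0 + r  # innovation variance
    k0, k1 = pp00 * h0 / s, pp10 * h0 / s  # Kalman gain

    # Update: correct the prediction with the innovation.
    y = z - z_pred
    x_new = [p_pred + k0 * y, v_pred + k1 * y]
    # P_new = (I - K H) P_pred
    P_new = [
        [(1 - k0 * h0) * pp00, (1 - k0 * h0) * pp01],
        [pp10 - k1 * h0 * pp00, pp11 - k1 * h0 * pp01],
    ]
    return x_new, P_new


def run_demo(steps=200, dt=0.1, h=5.0):
    """Track a target moving at constant velocity from a poor initial guess."""
    true_p, true_v = 10.0, 2.0
    x, P = [8.0, 0.0], [[4.0, 0.0], [0.0, 4.0]]
    for _ in range(steps):
        true_p += true_v * dt
        z = math.sqrt(true_p ** 2 + h ** 2)  # noise-free range for clarity
        x, P = ekf_step(x, P, z, dt, q=1e-4, r=0.01, h=h)
    return x, [true_p, true_v]
```

Calling `run_demo()` returns the final estimate alongside the true state, so the two can be compared directly. The measurement here is noise-free to keep the example deterministic; a realistic test would add noise consistent with `r`.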
-
Gland Segmentation Using SAM With Cancer Grade as a Prompt
Authors:
Yijie Zhu,
Shan E Ahmed Raza
Abstract:
Cancer grade is a critical clinical criterion that can be used to determine the degree of cancer malignancy. By revealing the condition of the glands, precise gland segmentation can assist in more effective cancer grade classification. In machine learning, binary classification information about glands (i.e., benign or malignant) can be utilized as a prompt for gland segmentation and cancer grade classification. By incorporating prior knowledge of the benign or malignant classification of the gland, the model can anticipate the likely appearance of the target, leading to better segmentation performance. We utilize the Segment Anything Model to solve the segmentation task, taking advantage of its prompt function and applying appropriate modifications to the model structure and training strategies. We improve on the results of the fine-tuned Segment Anything Model and produce SOTA results using this approach.
Submitted 27 January, 2025; v1 submitted 24 January, 2025;
originally announced January 2025.