Showing 1–50 of 351 results for author: Zhu, Y

Searching in archive eess.
  1. arXiv:2507.07318  [pdf, ps, other]

    cs.SD cs.AI eess.AS

    SonicMotion: Dynamic Spatial Audio Soundscapes with Latent Diffusion Models

    Authors: Christian Templin, Yanda Zhu, Hao Wang

    Abstract: Spatial audio is an integral part of immersive entertainment, such as VR/AR, and has seen increasing popularity in cinema and music as well. The most common format of spatial audio is described as first-order Ambisonics (FOA). We seek to extend recent advancements in FOA generative AI models to enable the generation of 3D scenes with dynamic sound sources. Our proposed end-to-end model, SonicMotio…

    Submitted 9 July, 2025; originally announced July 2025.

  2. arXiv:2507.07016  [pdf, ps, other]

    cs.LG eess.SP

    On-Device Training of PV Power Forecasting Models in a Smart Meter for Grid Edge Intelligence

    Authors: Jian Huang, Yongli Zhu, Linna Xu, Zhe Zheng, Wenpeng Cui, Mingyang Sun

    Abstract: In this paper, an edge-side model training study is conducted on a resource-limited smart meter. The motivation of grid-edge intelligence and the concept of on-device training are introduced. Then, the technical preparation steps for on-device training are described. A case study on the task of photovoltaic power forecasting is presented, where two representative machine learning models are invest…

    Submitted 9 July, 2025; originally announced July 2025.

    Comments: This paper is currently under review by an IEEE publication; it may be subject to minor changes later due to review comments

  3. arXiv:2506.17200  [pdf, ps, other]

    eess.SP

    Intelligent Reflecting Surfaces for THz Communications: Fundamentals, Key Solutions, and System Prototyping

    Authors: Qingqing Wu, Yanze Zhu, Qiaoyan Peng, Wanming Hao, Yanzhao Hou, Fengyuan Yang, Wencai Yan, Guoning Wang, Wen Chen, Chi Qiu

    Abstract: Intelligent reflecting surfaces (IRSs) have emerged as a cost-effective technology for terahertz (THz) communications by enabling programmable control of the wireless environment. This paper provides a comprehensive overview of IRSs-aided THz communications, covering hardware designs, advanced signal processing techniques, and practical deployment strategies. It first examines key THz reconfigurab…

    Submitted 20 June, 2025; originally announced June 2025.

  4. arXiv:2506.11540  [pdf, ps, other]

    eess.SP

    MMWiLoc: A Multi-Sensor Dataset and Robust Device-Free Localization Method Using Commercial Off-The-Shelf Millimeter Wave Wi-Fi Devices

    Authors: Wenbo Ding, Yang Li, Dongsheng Wang, Bin Zhao, Yunrong Zhu, Yibo Zhang, Yumeng Miao

    Abstract: Device-free Wi-Fi sensing has numerous benefits in practical settings, as it eliminates the requirement for dedicated sensing devices and can be accomplished using current low-cost Wi-Fi devices. With the development of Wi-Fi standards, millimeter wave Wi-Fi devices with 60GHz operating frequency and up to 4GHz bandwidth have become commercially available. Although millimeter wave Wi-Fi presents g…

    Submitted 13 June, 2025; originally announced June 2025.

    Comments: 8 pages, 8 figures

  5. arXiv:2506.10754  [pdf, ps, other]

    cs.SD cs.AI eess.AS

    BNMusic: Blending Environmental Noises into Personalized Music

    Authors: Chi Zuo, Martin B. Møller, Pablo Martínez-Nuevo, Huayang Huang, Yu Wu, Ye Zhu

    Abstract: While being disturbed by environmental noises, the acoustic masking technique is a conventional way to reduce the annoyance in audio engineering that seeks to cover up the noises with other dominant yet less intrusive sounds. However, misalignment between the dominant sound and the noise, such as mismatched downbeats, often requires an excessive volume increase to achieve effective masking. Motivate…

    Submitted 12 June, 2025; originally announced June 2025.

  6. arXiv:2506.10350  [pdf, ps, other]

    eess.SP

    Heterogeneous-IRS-Assisted MIMO Systems: Channel Estimation and Beamforming

    Authors: Weibiao Zhao, Qiucen Wu, Yuanqi Tang, Yu Zhu

    Abstract: Intelligent reflecting surface (IRS) has gained great attention for its ability to create favorable propagation environments. However, the power consumption of conventional IRSs cannot be ignored due to the large number of reflecting elements and control circuits. To balance performance and power consumption, we previously proposed a heterogeneous-IRS (HE-IRS), a green IRS structure integrating dy…

    Submitted 12 June, 2025; originally announced June 2025.

    Comments: 30 pages, 8 figures

  7. arXiv:2506.10309  [pdf, ps, other]

    eess.IV cs.AI cs.CV

    DUN-SRE: Deep Unrolling Network with Spatiotemporal Rotation Equivariance for Dynamic MRI Reconstruction

    Authors: Yuliang Zhu, Jing Cheng, Qi Xie, Zhuo-Xu Cui, Qingyong Zhu, Yuanyuan Liu, Xin Liu, Jianfeng Ren, Chengbo Wang, Dong Liang

    Abstract: Dynamic Magnetic Resonance Imaging (MRI) exhibits transformation symmetries, including spatial rotation symmetry within individual frames and temporal symmetry along the time dimension. Explicit incorporation of these symmetry priors in the reconstruction model can significantly improve image quality, especially under aggressive undersampling scenarios. Recently, Equivariant convolutional neural n…

    Submitted 11 June, 2025; originally announced June 2025.

  8. arXiv:2506.08967  [pdf, ps, other]

    cs.SD cs.CL eess.AS

    Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model

    Authors: Ailin Huang, Bingxin Li, Bruce Wang, Boyong Wu, Chao Yan, Chengli Feng, Heng Wang, Hongyu Zhou, Hongyuan Wang, Jingbei Li, Jianjian Sun, Joanna Wang, Mingrui Chen, Peng Liu, Ruihang Miao, Shilei Jiang, Tian Fei, Wang You, Xi Chen, Xuerui Yang, Yechang Huang, Yuxiang Zhang, Zheng Ge, Zheng Gong, Zhewei Huang, et al. (51 additional authors not shown)

    Abstract: Large Audio-Language Models (LALMs) have significantly advanced intelligent human-computer interaction, yet their reliance on text-based outputs limits their ability to generate natural speech responses directly, hindering seamless audio interactions. To address this, we introduce Step-Audio-AQAA, a fully end-to-end LALM designed for Audio Query-Audio Answer (AQAA) tasks. The model integrates a du…

    Submitted 13 June, 2025; v1 submitted 10 June, 2025; originally announced June 2025.

    Comments: 12 pages, 3 figures

  9. arXiv:2506.07715  [pdf, ps, other]

    cs.NI eess.SY

    Delay Optimization in Remote ID-Based UAV Communication via BLE and Wi-Fi Switching

    Authors: Yian Zhu, Ziye Jia, Lei Zhang, Yao Wu, Qiuming Zhu, Qihui Wu

    Abstract: The remote identification (Remote ID) broadcast capability allows unmanned aerial vehicles (UAVs) to exchange messages, which is a pivotal technology for inter-UAV communications. Although this capability enhances the operational visibility, low delay in Remote ID-based communications is critical for ensuring the efficiency and timeliness of multi-UAV operations in dynamic environments. To address…

    Submitted 9 June, 2025; originally announced June 2025.

  10. arXiv:2506.01020  [pdf, other]

    cs.SD eess.AS

    DS-TTS: Zero-Shot Speaker Style Adaptation from Voice Clips via Dynamic Dual-Style Feature Modulation

    Authors: Ming Meng, Ziyi Yang, Jian Yang, Zhenjie Su, Yonggui Zhu, Zhaoxin Fan

    Abstract: Recent advancements in text-to-speech (TTS) technology have increased demand for personalized audio synthesis. Zero-shot voice cloning, a specialized TTS task, aims to synthesize a target speaker's voice using only a single audio sample and arbitrary text, without prior exposure to the speaker during training. This process employs pattern recognition techniques to analyze and replicate the speaker…

    Submitted 1 June, 2025; originally announced June 2025.

  11. arXiv:2505.24140  [pdf, ps, other]

    cs.NI eess.SP

    B2LoRa: Boosting LoRa Transmission for Satellite-IoT Systems with Blind Coherent Combining

    Authors: Yimin Zhao, Weibo Wang, Xiong Wang, Linghe Kong, Jiadi Yu, Yifei Zhu, Shiyuan Li, Chong He, Guihai Chen

    Abstract: With the rapid growth of Low Earth Orbit (LEO) satellite networks, satellite-IoT systems using the LoRa technique have been increasingly deployed to provide widespread Internet services to low-power and low-cost ground devices. However, the long transmission distance and adverse environments from IoT satellites to ground devices pose a huge challenge to link reliability, as evidenced by the measur…

    Submitted 29 May, 2025; originally announced May 2025.

    Comments: Accepted by ACM MOBICOM'25

  12. arXiv:2505.21894  [pdf, ps, other]

    eess.IV

    Patch-based Reconstruction for Unsupervised Dynamic MRI using Learnable Tensor Function with Implicit Neural Representation

    Authors: Yuanyuan Liu, Yuanbiao Yang, Zhuo-Xu Cui, Qingyong Zhu, Jing Cheng, Congcong Liu, Jinwen Xie, Jingran Xu, Hairong Zheng, Dong Liang, Yanjie Zhu

    Abstract: Dynamic MRI plays a vital role in clinical practice by capturing both spatial details and dynamic motion, but its high spatiotemporal resolution is often limited by long scan times. Deep learning (DL)-based methods have shown promising performance in accelerating dynamic MRI. However, most existing algorithms rely on large fully-sampled datasets for training, which are difficult to acquire. Recent…

    Submitted 27 May, 2025; originally announced May 2025.

  13. arXiv:2505.19476  [pdf, ps, other]

    eess.AS eess.SP

    FlowSE: Efficient and High-Quality Speech Enhancement via Flow Matching

    Authors: Ziqian Wang, Zikai Liu, Xinfa Zhu, Yike Zhu, Mingshuai Liu, Jun Chen, Longshuai Xiao, Chao Weng, Lei Xie

    Abstract: Generative models have excelled in audio tasks using approaches such as language models, diffusion, and flow matching. However, existing generative approaches for speech enhancement (SE) face notable challenges: language model-based methods suffer from quantization loss, leading to compromised speaker similarity and intelligibility, while diffusion models require complex training and high inferenc…

    Submitted 27 May, 2025; v1 submitted 25 May, 2025; originally announced May 2025.

    Comments: Accepted to InterSpeech 2025

  14. arXiv:2505.13032  [pdf, other]

    cs.SD cs.CL cs.MM eess.AS

    MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix

    Authors: Ziyang Ma, Yinghao Ma, Yanqiao Zhu, Chen Yang, Yi-Wen Chao, Ruiyang Xu, Wenxi Chen, Yuanzhe Chen, Zhuo Chen, Jian Cong, Kai Li, Keliang Li, Siyou Li, Xinfeng Li, Xiquan Li, Zheng Lian, Yuzhe Liang, Minghao Liu, Zhikang Niu, Tianrui Wang, Yuping Wang, Yuxuan Wang, Yihao Wu, Guanrou Yang, Jianwei Yu, et al. (9 additional authors not shown)

    Abstract: We introduce MMAR, a new benchmark designed to evaluate the deep reasoning capabilities of Audio-Language Models (ALMs) across massive multi-disciplinary tasks. MMAR comprises 1,000 meticulously curated audio-question-answer triplets, collected from real-world internet videos and refined through iterative error corrections and quality checks to ensure high quality. Unlike existing benchmarks that…

    Submitted 19 May, 2025; originally announced May 2025.

    Comments: Open-source at https://github.com/ddlBoJack/MMAR

  15. arXiv:2505.10028  [pdf, ps, other]

    cs.RO eess.SY

    Fast Heuristic Scheduling and Trajectory Planning for Robotic Fruit Harvesters with Multiple Cartesian Arms

    Authors: Yuankai Zhu, Stavros Vougioukas

    Abstract: This work proposes a fast heuristic algorithm for the coupled scheduling and trajectory planning of multiple Cartesian robotic arms harvesting fruits. Our method partitions the workspace, assigns fruit-picking sequences to arms, determines tight and feasible fruit-picking schedules and vehicle travel speed, and generates smooth, collision-free arm trajectories. The fruit-picking throughput achieve…

    Submitted 15 May, 2025; originally announced May 2025.

    Comments: This work will be submitted to the IEEE for possible publication

  16. arXiv:2505.09920  [pdf, ps, other]

    cs.AI eess.SY

    Offline Reinforcement Learning for Microgrid Voltage Regulation

    Authors: Shan Yang, Yongli Zhu

    Abstract: This paper presents a study on using different offline reinforcement learning algorithms for microgrid voltage regulation with solar power penetration. When environment interaction is unviable due to technical or safety reasons, the proposed approach can still obtain an applicable model through offline-style training on a previously collected dataset, lowering the negative impact of lacking online…

    Submitted 14 May, 2025; originally announced May 2025.

    Comments: This paper has been accepted and presented at ICLR 2025 in Singapore, Apr. 28, 2025

  17. arXiv:2505.09012  [pdf, ps, other]

    cs.AI eess.SY

    Deep Reinforcement Learning for Power Grid Multi-Stage Cascading Failure Mitigation

    Authors: Bo Meng, Chenghao Xu, Yongli Zhu

    Abstract: Cascading failures in power grids can lead to grid collapse, causing severe disruptions to social operations and economic activities. In certain cases, multi-stage cascading failures can occur. However, existing cascading-failure-mitigation strategies are usually single-stage-based, overlooking the complexity of the multi-stage scenario. This paper treats the multi-stage cascading failure problem…

    Submitted 13 May, 2025; originally announced May 2025.

    Comments: This paper has been accepted and presented at ICLR 2025 in Singapore, Apr. 28, 2025

  18. arXiv:2505.08682  [pdf, ps, other]

    eess.SY

    Joint Communication Scheduling and Resource Allocation for Distributed Edge Learning: Seamless Integration in Next-Generation Wireless Networks

    Authors: Paul Zheng, Navid Keshtiarast, Pradyumna Kumar Bishoyi, Yao Zhu, Yulin Hu, Marina Petrova, Anke Schmeink

    Abstract: Distributed edge learning (DL) is considered a cornerstone of intelligence enablers, since it allows for collaborative training without the necessity for local clients to share raw data with other parties, thereby preserving privacy and security. Integrating DL into the 6G networks requires a coexistence design with existing services such as high-bandwidth (HB) traffic like eMBB. Current designs i…

    Submitted 13 May, 2025; originally announced May 2025.

    Comments: This work has been submitted to the IEEE for possible publication

  19. arXiv:2505.08535  [pdf]

    eess.SY cs.LG

    Diffusion-assisted Model Predictive Control Optimization for Power System Real-Time Operation

    Authors: Linna Xu, Yongli Zhu

    Abstract: This paper presents a modified model predictive control (MPC) framework for real-time power system operation. The framework incorporates a diffusion model tailored for time series generation to enhance the accuracy of the load forecasting module used in the system operation. In the absence of explicit state transition law, a model-identification procedure is leveraged to derive the system dynamics…

    Submitted 14 May, 2025; v1 submitted 13 May, 2025; originally announced May 2025.

    Comments: This paper has been accepted by the 2025 IEEE PES General Meeting (PESGM), which will be held in Austin, TX, July 27-31, 2025

  20. arXiv:2505.07839  [pdf]

    eess.IV cs.AI

    Sub-diffraction terahertz backpropagation compressive imaging

    Authors: Yongsheng Zhu, Shaojing Liu, Ximiao Wang, Runli Li, Haili Yang, Jiali Wang, Hongjia Zhu, Yanlin Ke, Ningsheng Xu, Huanjun Chen, Shaozhi Deng

    Abstract: Terahertz single-pixel imaging (TSPI) has garnered significant attention due to its simplicity and cost-effectiveness. However, the relatively long wavelength of THz waves limits sub-diffraction-scale imaging resolution. Although TSPI technique can achieve sub-wavelength resolution, it requires harsh experimental conditions and time-consuming processes. Here, we propose a sub-diffraction THz backp…

    Submitted 5 May, 2025; originally announced May 2025.

  21. arXiv:2505.06250  [pdf, other]

    eess.SP cs.AI cs.CV cs.LG

    DeltaDPD: Exploiting Dynamic Temporal Sparsity in Recurrent Neural Networks for Energy-Efficient Wideband Digital Predistortion

    Authors: Yizhuo Wu, Yi Zhu, Kun Qian, Qinyu Chen, Anding Zhu, John Gajadharsing, Leo C. N. de Vreede, Chang Gao

    Abstract: Digital Predistortion (DPD) is a popular technique to enhance signal quality in wideband RF power amplifiers (PAs). With increasing bandwidth and data rates, DPD faces significant energy consumption challenges during deployment, contrasting with its efficiency goals. State-of-the-art DPD models rely on recurrent neural networks (RNN), whose computational complexity hinders system efficiency. This…

    Submitted 29 April, 2025; originally announced May 2025.

    Comments: Accepted to IEEE Microwave and Wireless Technology Letters (MWTL)

  22. arXiv:2505.03123  [pdf]

    eess.IV cs.CV cs.MM

    STG: Spatiotemporal Graph Neural Network with Fusion and Spatiotemporal Decoupling Learning for Prognostic Prediction of Colorectal Cancer Liver Metastasis

    Authors: Yiran Zhu, Wei Yang, Yan Su, Zesheng Li, Chengchang Pan, Honggang Qi

    Abstract: We propose a multimodal spatiotemporal graph neural network (STG) framework to predict colorectal cancer liver metastasis (CRLM) progression. Current clinical models do not effectively integrate the tumor's spatial heterogeneity, dynamic evolution, and complex multimodal data relationships, limiting their predictive accuracy. Our STG framework combines preoperative CT imaging and clinical data int…

    Submitted 5 May, 2025; originally announced May 2025.

    Comments: 9 pages, 4 figures, 5 tables

  23. arXiv:2504.21612  [pdf, other]

    eess.IV

    Selective Variable Convolution Meets Dynamic Content Guided Attention for Infrared Small Target Detection

    Authors: Yirui Chen, Yiming Zhu, Yuxin Jing, Tianpei Zhang, Yuchen Zheng

    Abstract: Infrared Small Target Detection (IRSTD) system aims to identify small targets in complex backgrounds. Due to the convolution operation in Convolutional Neural Networks (CNNs), applying traditional CNNs to IRSTD presents challenges, since the feature extraction of small targets is often insufficient, resulting in the loss of critical features. To address these issues, we propose a dynamic content g…

    Submitted 30 April, 2025; originally announced April 2025.

  24. arXiv:2504.20623  [pdf, other]

    eess.SP

    Cell-free Fluid Antenna Multiple Access Networks

    Authors: Tianyu Han, Yongxu Zhu, Kai-Kit Wong, Gan Zheng, Hyundong Shin

    Abstract: Fluid antenna enables position reconfigurability that gives transceiver access to a high-resolution spatial signal and the ability to avoid interference through the ups and downs of fading channels. Previous studies investigated this fluid antenna multiple access (FAMA) approach in a single-cell setup only. In this paper, we consider a cell-free network architecture in which users are associated w…

    Submitted 29 April, 2025; originally announced April 2025.

  25. arXiv:2504.13131  [pdf, other]

    eess.IV cs.AI cs.CV

    NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement: Methods and Results

    Authors: Xin Li, Kun Yuan, Bingchen Li, Fengbin Guan, Yizhen Shao, Zihao Yu, Xijun Wang, Yiting Lu, Wei Luo, Suhang Yao, Ming Sun, Chao Zhou, Zhibo Chen, Radu Timofte, Yabin Zhang, Ao-Xiang Zhang, Tianwu Zhi, Jianzhao Liu, Yang Li, Jingwen Xu, Yiting Liao, Yushen Zuo, Mingyang Wu, Renjie Li, Shengyun Zhong, et al. (88 additional authors not shown)

    Abstract: This paper presents a review for the NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement. The challenge comprises two tracks: (i) Efficient Video Quality Assessment (KVQ), and (ii) Diffusion-based Image Super-Resolution (KwaiSR). Track 1 aims to advance the development of lightweight and efficient video quality assessment (VQA) models, with an emphasis on eliminating re…

    Submitted 17 April, 2025; originally announced April 2025.

    Comments: Challenge Report of NTIRE 2025; Methods from 18 Teams; Accepted by CVPR Workshop; 21 pages

  26. arXiv:2504.12889  [pdf, ps, other]

    eess.SP eess.SY

    RIS-Assisted Beamfocusing in Near-Field IoT Communication Systems: A Transformer-Based Approach

    Authors: Quan Zhou, Jingjing Zhao, Kaiquan Cai, Yanbo Zhu

    Abstract: The massive number of antennas in extremely large aperture array (ELAA) systems shifts the propagation regime of signals in internet of things (IoT) communication systems towards near-field spherical wave propagation. We propose a reconfigurable intelligent surfaces (RIS)-assisted beamfocusing mechanism, where the design of the two-dimensional beam codebook that contains both the angular and dista…

    Submitted 17 April, 2025; originally announced April 2025.

  27. arXiv:2504.11114  [pdf, ps, other]

    eess.SP

    Continuous Aperture Array (CAPA)-Based Secure Wireless Communications

    Authors: Jingjing Zhao, Haowen Song, Xidong Mu, Kaiquan Cai, Yanbo Zhu, Yuanwei Liu

    Abstract: A continuous aperture array (CAPA)-based secure communication system is investigated, where a base station equipped with a CAPA transmits signals to a legitimate user under the existence of an eavesdropper. For improving the secrecy performance, the artificial noise (AN) is employed at the BS for the jamming purpose. We aim at maximizing the secrecy rate by jointly optimizing the information-beari…

    Submitted 15 April, 2025; originally announced April 2025.

  28. arXiv:2504.04224  [pdf, other]

    cs.SE eess.SY

    Exploration of Approaches for Robustness and Safety in a Low Code Open Environment for Factory Automation

    Authors: Gustavo Quiros A., Yi Peng Zhu, Tao Cui, Shaokai Lin, Marten Lohstroh, Edward A. Lee

    Abstract: This report is a compilation of technical knowledge and concepts that were produced by the authors and additional contributors in the context of the collaboration projects "Abstraction Requirements for Language of Choice in Industrial Automation" (FY21-22) and "Approaches for Robust and Safe Low-Code" (FY23-24) from Siemens Technology and the University of California, Berkeley. The primary objecti…

    Submitted 5 April, 2025; originally announced April 2025.

    Comments: 15 pages, 4 figures, technical report

  29. arXiv:2503.23933  [pdf]

    eess.IV

    PupiNet: Seamless OCT-OCTA Interconversion Through Wavelet-Driven and Multi-Scale Attention Mechanisms

    Authors: Renzhi Tian, Jinjie Wang, Wei Yang, Weizhen Li, Haoran Chen, Yiran Zhu, Chengchang Pan, Honggang Qi

    Abstract: Optical Coherence Tomography (OCT) and Optical Coherence Tomography Angiography (OCTA) are key diagnostic tools for clinical evaluation and management of retinal diseases. Compared to traditional OCT, OCTA provides richer microvascular information, but its acquisition requires specialized sensors and high-cost equipment, creating significant challenges for the clinical deployment of hardware-depen…

    Submitted 31 March, 2025; originally announced March 2025.

    Comments: 8 pages, 4 figures, 5 tables, submitted to the 33rd ACM International Conference on Multimedia (ACM MM 2025)

  30. arXiv:2503.11855  [pdf, other]

    cs.RO eess.SY

    Learning-based Estimation of Forward Kinematics for an Orthotic Parallel Robotic Mechanism

    Authors: Jingzong Zhou, Yuhan Zhu, Xiaobin Zhang, Sunil Agrawal, Konstantinos Karydis

    Abstract: This paper introduces a 3D parallel robot with three identical five-degree-of-freedom chains connected to a circular brace end-effector, aimed to serve as an assistive device for patients with cervical spondylosis. The inverse kinematics of the system is solved analytically, whereas learning-based methods are deployed to solve the forward kinematics. The methods considered herein include a Koopman…

    Submitted 14 March, 2025; originally announced March 2025.

  31. arXiv:2503.10287  [pdf, other]

    cs.SD cs.CV cs.GR eess.AS

    MACS: Multi-source Audio-to-image Generation with Contextual Significance and Semantic Alignment

    Authors: Hao Zhou, Xiaobao Guo, Yuzhe Zhu, Adams Wai-Kin Kong

    Abstract: Propelled by the breakthrough in deep generative models, audio-to-image generation has emerged as a pivotal cross-model task that converts complex auditory signals into rich visual representations. However, previous works only focus on single-source audio inputs for image generation, ignoring the multi-source characteristic in natural auditory scenes, thus limiting the performance in generating co…

    Submitted 13 March, 2025; originally announced March 2025.

  32. arXiv:2503.09652  [pdf]

    eess.IV

    4D-ACFNet: A 4D Attention Mechanism-Based Prognostic Framework for Colorectal Cancer Liver Metastasis Integrating Multimodal Spatiotemporal Features

    Authors: Zesheng Li, Wei Yang, Yan Su, Yiran Zhu, Yuhan Tang, Haoran Chen, Chengchang Pan, Honggang Qi

    Abstract: Postoperative prognostic prediction for colorectal cancer liver metastasis (CRLM) remains challenging due to tumor heterogeneity, dynamic evolution of the hepatic microenvironment, and insufficient multimodal data fusion. To address these issues, we propose 4D-ACFNet, the first framework that synergistically integrates lightweight spatiotemporal modeling, cross-modal dynamic calibration, and perso…

    Submitted 12 March, 2025; originally announced March 2025.

    Comments: 8 pages, 6 figures, 2 tables, submitted to the 33rd ACM International Conference on Multimedia (ACM MM 2025)

  33. arXiv:2503.08147  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    FilmComposer: LLM-Driven Music Production for Silent Film Clips

    Authors: Zhifeng Xie, Qile He, Youjia Zhu, Qiwei He, Mengtian Li

    Abstract: In this work, we implement music production for silent film clips using LLM-driven method. Given the strong professional demands of film music production, we propose the FilmComposer, simulating the actual workflows of professional musicians. FilmComposer is the first to combine large generative models with a multi-agent approach, leveraging the advantages of both waveform music and symbolic music…

    Submitted 11 March, 2025; originally announced March 2025.

    Comments: Project page: https://apple-jun.github.io/FilmComposer.github.io/

  34. arXiv:2503.07189  [pdf, ps, other]

    cs.IT eess.SP

    Beamforming Design for Beyond Diagonal RIS-Aided Cell-Free Massive MIMO Systems

    Authors: Yizhuo Li, Jiakang Zheng, Bokai Xu, Yiyang Zhu, Jiayi Zhang, Bo Ai

    Abstract: Reconfigurable intelligent surface (RIS)-aided cell-free (CF) massive multiple-input multiple-output (mMIMO) is a promising architecture for further improving spectral efficiency (SE) with low cost and power consumption. However, conventional RIS has inevitable limitations due to its capability of only reflecting signals. In contrast, beyond-diagonal RIS (BD-RIS), with its ability to both reflect…

    Submitted 10 March, 2025; originally announced March 2025.

  35. arXiv:2503.07116  [pdf, other]

    eess.SY

    Efficient Integration of Distributed Learning Services in Next-Generation Wireless Networks

    Authors: Paul Zheng, Navid Keshtiarast, Pradyumna Kumar Bishoyi, Yao Zhu, Yulin Hu, Marina Petrova, Anke Schmeink

    Abstract: Distributed learning (DL) is considered a cornerstone of intelligence enabler, since it allows for collaborative training without the necessity for local clients to share raw data with other parties, thereby preserving privacy and security. Integrating DL into the 6G networks requires coexistence design with existing services such as high-bandwidth (HB) traffic like eMBB. Current designs in the li…

    Submitted 10 March, 2025; originally announced March 2025.

  36. arXiv:2503.00493  [pdf, ps, other]

    eess.AS cs.AI cs.CL cs.SD

    LLaSE-G1: Incentivizing Generalization Capability for LLaMA-based Speech Enhancement

    Authors: Boyi Kang, Xinfa Zhu, Zihan Zhang, Zhen Ye, Mingshuai Liu, Ziqian Wang, Yike Zhu, Guobin Ma, Jun Chen, Longshuai Xiao, Chao Weng, Wei Xue, Lei Xie

    Abstract: Recent advancements in language models (LMs) have demonstrated strong capabilities in semantic understanding and contextual modeling, which have flourished in generative speech enhancement (SE). However, many LM-based SE approaches primarily focus on semantic information, often neglecting the critical role of acoustic information, which leads to acoustic inconsistency after enhancement and limited…

    Submitted 10 June, 2025; v1 submitted 1 March, 2025; originally announced March 2025.

    Comments: ACL2025 main, Codes available at https://github.com/Kevin-naticl/LLaSE-G1

  37. arXiv:2502.20396  [pdf, other]

    cs.RO cs.AI cs.CV cs.LG eess.SY

    Sim-to-Real Reinforcement Learning for Vision-Based Dexterous Manipulation on Humanoids

    Authors: Toru Lin, Kartik Sachdev, Linxi Fan, Jitendra Malik, Yuke Zhu

    Abstract: Reinforcement learning has delivered promising results in achieving human- or even superhuman-level capabilities across diverse problem domains, but success in dexterous robot manipulation remains limited. This work investigates the key challenges in applying reinforcement learning to solve a collection of contact-rich manipulation tasks on a humanoid embodiment. We introduce novel techniques to o…

    Submitted 27 February, 2025; originally announced February 2025.

    Comments: Project page can be found at https://toruowo.github.io/recipe/

  38. arXiv:2502.20224  [pdf]

    eess.IV cs.AI cs.CV

    RURANET++: An Unsupervised Learning Method for Diabetic Macular Edema Based on SCSE Attention Mechanisms and Dynamic Multi-Projection Head Clustering

    Authors: Wei Yang, Yiran Zhu, Jiayu Shen, Yuhan Tang, Chengchang Pan, Hui He, Yan Su, Honggang Qi

    Abstract: Diabetic Macular Edema (DME), a prevalent complication among diabetic patients, constitutes a major cause of visual impairment and blindness. Although deep learning has achieved remarkable progress in medical image analysis, traditional DME diagnosis still relies on extensive annotated data and subjective ophthalmologist assessments, limiting practical applications. To address this, we present RUR…

    Submitted 7 March, 2025; v1 submitted 27 February, 2025; originally announced February 2025.

    Comments: 10 pages, 2 figures, 5 tables, submitted to The 28th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2025)

  39. arXiv:2502.19728  [pdf]

    eess.SY

    Transient Stability Analysis and Fault Clearing Angle Estimation of VSG Based on Domain of Attraction Estimated by Trajectory Reversing Method

    Authors: Jiayue Lyu, Tianzhi Fang, Zhiheng Lin, Jingxue Han, Yantao Zhu

    Abstract: The virtual synchronous generator (VSG), with the analogous nonlinear power-angle relationship to the synchronous generator (SG), has attracted much attention as a promising solution for converter-based power systems. In this paper, a large signal model of the grid-connected VSG is first established. The trajectory reversing method (TRM) is then introduced to estimate the domain of attraction (DOA…

    Submitted 26 February, 2025; originally announced February 2025.

    Comments: 9 pages, 11 figures, references added

  40. arXiv:2502.19675  [pdf, other]

    cs.IT eess.SP

    Joint Power Allocation and Phase Shift Design for Stacked Intelligent Metasurfaces-aided Cell-Free Massive MIMO Systems with MARL

    Authors: Yiyang Zhu, Jiayi Zhang, Enyu Shi, Ziheng Liu, Chau Yuen, Bo Ai

    Abstract: Cell-free (CF) massive multiple-input multiple-output (mMIMO) systems offer high spectral efficiency (SE) through multiple distributed access points (APs). However, the large number of antennas increases power consumption. We propose incorporating stacked intelligent metasurfaces (SIM) into CF mMIMO systems as a cost-effective, energy-efficient solution. This paper focuses on optimizing the joint…

    Submitted 26 February, 2025; originally announced February 2025.

  41. arXiv:2502.17781  [pdf, ps, other]

    eess.SP

    Waveguide Division Multiple Access for Pinching-Antenna Systems (PASS)

    Authors: Jingjing Zhao, Xidong Mu, Kaiquan Cai, Yanbo Zhu, Yuanwei Liu

    Abstract: A novel concept of waveguide division multiple access (WDMA) is proposed for multi-user pinching-antenna systems (PASS). The key principle of WDMA is to allocate each user with a dedicated waveguide, which is regarded as a new type of radio resources, so as to facilitate multi-user communications. By adjusting the activation positions of pinching antennas (PAs) over each waveguide, the pinching be…

    Submitted 24 February, 2025; originally announced February 2025.

  42. arXiv:2502.17168  [pdf, other]

    eess.SP

    SpikACom: A Neuromorphic Computing Framework for Green Communications

    Authors: Yanzhen Liu, Zhijin Qin, Yongxu Zhu, Geoffrey Ye Li

    Abstract: The ever-growing power consumption of wireless communication systems necessitates more energy-efficient algorithms. This paper introduces SpikACom (Spiking Adaptive Communication), a neuromorphic computing-based framework for power-intensive wireless communication tasks. SpikACom leverages brain-inspired spiking neural networks (SNNs) for efficient signal processing. It is designed for dynam…

    Submitted 24 February, 2025; originally announced February 2025.

  43. arXiv:2502.11946  [pdf, other]

    cs.CL cs.AI cs.HC cs.SD eess.AS

    Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction

    Authors: Ailin Huang, Boyong Wu, Bruce Wang, Chao Yan, Chen Hu, Chengli Feng, Fei Tian, Feiyu Shen, Jingbei Li, Mingrui Chen, Peng Liu, Ruihang Miao, Wang You, Xi Chen, Xuerui Yang, Yechang Huang, Yuxiang Zhang, Zheng Gong, Zixin Zhang, Hongyu Zhou, Jianjian Sun, Brian Li, Chengting Feng, Changyi Wan, Hanpeng Hu, et al. (120 additional authors not shown)

    Abstract: Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as high costs in voice data collection, weakness in dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contribu…

    Submitted 18 February, 2025; v1 submitted 17 February, 2025; originally announced February 2025.

  44. arXiv:2502.10869  [pdf, other]

    eess.SP

    Robust Multidimensional Graph Neural Networks for Signal Processing in Wireless Communications with Edge-Graph Information Bottleneck

    Authors: Ziheng Liu, Jiayi Zhang, Yiyang Zhu, Enyu Shi, Bo Ai

    Abstract: Signal processing is crucial for satisfying the high data rate requirements of future sixth-generation (6G) wireless networks. However, the rapid growth of wireless networks has brought about massive data traffic, which hinders the application of traditional optimization theory-based algorithms. Meanwhile, traditional graph neural networks (GNNs) focus on compressing inputs onto vertices to update…

    Submitted 15 February, 2025; originally announced February 2025.

  45. arXiv:2502.09490  [pdf, other]

    cs.LG eess.SY math.DS math.OC physics.flu-dyn

    Inverse Design with Dynamic Mode Decomposition

    Authors: Yunpeng Zhu, Liangliang Cheng, Anping Jing, Hanyu Huo, Ziqiang Lang, Bo Zhang, J. Nathan Kutz

    Abstract: We introduce a computationally efficient method for the automation of inverse design in science and engineering. Based on simple least-square regression, the underlying dynamic mode decomposition algorithm can be used to construct a low-rank subspace spanning multiple experiments in parameter space. The proposed inverse design dynamic mode decomposition (ID-DMD) algorithm leverages the computed low-…

    Submitted 13 February, 2025; originally announced February 2025.

    Comments: 29 pages, 19 figures

    MSC Class: 37M05; 37M10; 37M21 ACM Class: I.2.6; G.1.6; G.1.10

  46. arXiv:2502.05812  [pdf, other]

    cs.IT eess.SY

    Multi-Agent Reinforcement Learning in Wireless Distributed Networks for 6G

    Authors: Jiayi Zhang, Ziheng Liu, Yiyang Zhu, Enyu Shi, Bokai Xu, Chau Yuen, Dusit Niyato, Mérouane Debbah, Shi Jin, Bo Ai, Xuemin Shen

    Abstract: The introduction of intelligent interconnectivity between the physical and human worlds has attracted great attention for future sixth-generation (6G) networks, emphasizing massive capacity, ultra-low latency, and unparalleled reliability. Wireless distributed networks and multi-agent reinforcement learning (MARL), both of which have evolved from centralized paradigms, are two promising solutions…

    Submitted 9 February, 2025; originally announced February 2025.

  47. arXiv:2502.04794  [pdf, other]

    eess.IV cs.AI cs.CV

    MedMimic: Physician-Inspired Multimodal Fusion for Early Diagnosis of Fever of Unknown Origin

    Authors: Minrui Chen, Yi Zhou, Huidong Jiang, Yuhan Zhu, Guanjie Zou, Minqi Chen, Rong Tian, Hiroto Saigo

    Abstract: Fever of unknown origin (FUO) remains a diagnostic challenge. MedMimic is introduced as a multimodal framework inspired by real-world diagnostic processes. It uses pretrained models such as DINOv2, Vision Transformer, and ResNet-18 to convert high-dimensional 18F-FDG PET/CT imaging into low-dimensional, semantically meaningful features. A learnable self-attention-based fusion network then integrates…

    Submitted 13 February, 2025; v1 submitted 7 February, 2025; originally announced February 2025.

  48. arXiv:2502.01143  [pdf, other]

    cs.RO cs.AI cs.LG eess.SY

    ASAP: Aligning Simulation and Real-World Physics for Learning Agile Humanoid Whole-Body Skills

    Authors: Tairan He, Jiawei Gao, Wenli Xiao, Yuanhang Zhang, Zi Wang, Jiashun Wang, Zhengyi Luo, Guanqi He, Nikhil Sobanbab, Chaoyi Pan, Zeji Yi, Guannan Qu, Kris Kitani, Jessica Hodgins, Linxi "Jim" Fan, Yuke Zhu, Changliu Liu, Guanya Shi

    Abstract: Humanoid robots hold the potential for unparalleled versatility in performing human-like, whole-body skills. However, achieving agile and coordinated whole-body motions remains a significant challenge due to the dynamics mismatch between simulation and the real world. Existing approaches, such as system identification (SysID) and domain randomization (DR) methods, often rely on labor-intensive par…

    Submitted 25 April, 2025; v1 submitted 3 February, 2025; originally announced February 2025.

    Comments: RSS 2025. Project website: https://agile.human2humanoid.com/

  49. arXiv:2502.01003  [pdf, other]

    eess.SP

    Near-Field Integrated Sensing and Communications for Secure UAV Networks

    Authors: Jingjing Zhao, Songtao Xue, Kaiquan Cai, Xidong Mu, Yuanwei Liu, Yanbo Zhu

    Abstract: A novel near-field integrated sensing and communications framework for secure unmanned aerial vehicle (UAV) networks with high time efficiency is proposed. A ground base station (GBS) with large aperture size communicates with one communication UAV (C-UAV) in the presence of one eavesdropping UAV (E-UAV), where the artificial noise (AN) is employed for both jamming and sensing purposes. Given t…

    Submitted 2 February, 2025; originally announced February 2025.

  50. arXiv:2501.14718  [pdf, other]

    eess.IV

    Gland Segmentation Using SAM With Cancer Grade as a Prompt

    Authors: Yijie Zhu, Shan E Ahmed Raza

    Abstract: Cancer grade is a critical clinical criterion that can be used to determine the degree of cancer malignancy. Revealing the condition of the glands, a precise gland segmentation can assist in a more effective cancer grade classification. In machine learning, binary classification information about glands (i.e., benign and malignant) can be utilized as a prompt for gland segmentation and cancer grad…

    Submitted 27 January, 2025; v1 submitted 24 January, 2025; originally announced January 2025.

    Comments: Accepted by ISBI 2025