-
Subpixel correction of diffraction pattern shifts in ptychography via automatic differentiation
Authors:
Zhengkang Xu,
Yanqi Chen,
Hao Xu,
Qingxin Wang,
Jin Niu,
Lei Huang,
Jiyue Tang,
Yongjun Ma,
Yutong Wang,
Yishi Shi,
Changjun Ke,
Jie Li,
Zhongwei Fan
Abstract:
Ptychography, a coherent diffraction imaging technique, has become an indispensable tool in materials characterization, biological imaging, and nanostructure analysis due to its capability for high-resolution, lensless reconstruction of complex-valued images. In typical workflows, raw diffraction patterns are commonly cropped to isolate the valid central region before reconstruction. However, if the crop is misaligned from the diffraction pattern's zero-order, reconstruction may suffer from slower convergence, phase wrapping, and reduced image fidelity. These issues are further exacerbated in experimental configurations involving reflective geometries or broadband illumination, where incorrect cropping introduces systematic preprocessing errors that compromise the entire ptychographic inversion. To address this challenge, we present an approach based on automatic differentiation (AD), where the cropping shift is treated as an optimizable parameter within the reconstruction framework. By integrating shift correction into the backpropagation loop, our method simultaneously refines the object, probe, and shift positions without requiring manual tuning. Simulation results demonstrate that, even with initial offsets ranging up to 5 pixels, the proposed method achieves subpixel correction, with an average deviation below 0.5 pixels. Experiments in the extreme ultraviolet (EUV) regime further validate the method's robustness and effectiveness. This AD-based strategy enhances the automation and robustness of ptychographic reconstructions, and is adaptable to diverse experimental conditions.
Submitted 4 July, 2025;
originally announced July 2025.
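As a rough illustration of how a crop offset can become a continuous parameter, the sketch below applies a subpixel shift to a diffraction pattern using the Fourier shift theorem. This is not the authors' code: the function name `fourier_shift` and the random test pattern are hypothetical, and an AD-based pipeline would presumably express the same operation in a framework with automatic differentiation so that gradients with respect to `dy` and `dx` are available.

```python
import numpy as np

def fourier_shift(image, dy, dx):
    """Shift a real 2D array by (dy, dx) pixels; fractional values are allowed."""
    ny, nx = image.shape
    # Frequencies in cycles per pixel along each axis.
    fy = np.fft.fftfreq(ny)[:, None]
    fx = np.fft.fftfreq(nx)[None, :]
    # Shift theorem: a spatial shift is a linear phase ramp in Fourier space.
    phase = np.exp(-2j * np.pi * (fy * dy + fx * dx))
    return np.fft.ifft2(np.fft.fft2(image) * phase).real

rng = np.random.default_rng(0)
pattern = rng.random((8, 8))                 # stand-in for a cropped diffraction pattern
shifted = fourier_shift(pattern, 1, 2)       # integer shift, matches a circular roll
half = fourier_shift(pattern, 0.5, 0.0)      # subpixel shift along the first axis
```

Because the shift enters only through a smooth phase factor, the output varies continuously with `dy` and `dx`, which is what makes the offset usable as an optimizable parameter.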
-
Joint Power Control and Precoding for Cell-Free Massive MIMO Systems With Sparse Multi-Dimensional Graph Neural Networks
Authors:
Yukun Ma,
Jiayi Zhang,
Ziheng Liu,
Guowei Shi,
Bo Ai
Abstract:
Cell-free massive multiple-input multiple-output (CF mMIMO) has emerged as a prominent candidate for future networks due to its ability to significantly enhance spectral efficiency by eliminating inter-cell interference. However, its practical deployment faces considerable challenges, such as high computational complexity and the optimization of its complex processing. To address these challenges, this correspondence proposes a framework based on a sparse multi-dimensional graph neural network (SP-MDGNN), which sparsifies the connections between access points (APs) and user equipments (UEs) to significantly reduce computational complexity while maintaining high performance. In addition, the weighted minimum mean square error (WMMSE) algorithm is introduced as a comparative method to further analyze the trade-off between performance and complexity. Simulation results demonstrate that the sparse method achieves an optimal balance between performance and complexity, significantly reducing the computational complexity of the original MDGNN method while incurring only a slight performance degradation, providing insights for the practical deployment of CF mMIMO systems in large-scale networks.
Submitted 2 July, 2025;
originally announced July 2025.
-
Unsupervised Learning-Based Joint Resource Allocation and Beamforming Design for RIS-Assisted MISO-OFDMA Systems
Authors:
Yu Ma,
Xingyu Zhou,
Xiao Li,
Le Liang,
Shi Jin
Abstract:
Reconfigurable intelligent surfaces (RIS) are key enablers for 6G wireless systems. This paper studies downlink transmission in an RIS-assisted MISO-OFDMA system, addressing resource allocation challenges. A two-stage unsupervised learning-based framework is proposed to jointly design RIS phase shifts, BS beamforming, and resource block (RB) allocation. The framework includes BeamNet, which predicts RIS phase shifts from CSI, and AllocationNet, which allocates RBs using equivalent CSI derived from BeamNet outputs. Active beamforming is implemented via maximum ratio transmission and water-filling. To handle discrete constraints while ensuring differentiability, quantization and the Gumbel-softmax trick are adopted. A customized loss and phased training enhance performance under QoS constraints. Simulations show the method achieves 99.93% of the sum rate of the SCA baseline with only 0.036% of its runtime, and it remains robust across varying channel and user conditions.
Submitted 12 June, 2025;
originally announced June 2025.
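The Gumbel-softmax trick mentioned in the abstract above can be sketched in a few lines. This is a generic NumPy illustration, not the paper's AllocationNet: the function name `gumbel_softmax`, the temperature `tau`, and the toy logits (4 resource blocks, 3 users) are assumptions made only for the example.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Draw a relaxed (soft) one-hot sample over the last axis of `logits`."""
    rng = np.random.default_rng() if rng is None else rng
    # Gumbel(0, 1) noise via inverse transform sampling; u is kept away from 0.
    u = rng.uniform(low=1e-12, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    # A temperature-scaled softmax replaces the non-differentiable argmax.
    z = (logits + gumbel) / tau
    z = z - z.max(axis=-1, keepdims=True)    # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.standard_normal((4, 3))         # hypothetical scores: 4 RBs x 3 users
soft_alloc = gumbel_softmax(logits, tau=0.5, rng=rng)
hard_alloc = soft_alloc.argmax(axis=-1)      # discrete user index per RB
```

Lower values of `tau` push each row toward a one-hot vector, while larger values give smoother rows; in a differentiable framework the soft rows carry the gradient during training.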
-
Hybrid Constellation Modulation for Symbol-Level Precoding in RIS-Enhanced MU-MISO Systems
Authors:
Yupeng Zheng,
Yi Ma,
Rahim Tafazolli
Abstract:
The application of symbol-level precoding (SLP) in reconfigurable intelligent surfaces (RIS) enhanced multi-user multiple-input single-output (MU-MISO) systems faces two main challenges. First, the state-of-the-art joint reflecting and SLP optimization approach requires exhaustive enumeration of all possible transmit symbol combinations, resulting in scalability issues as the modulation order and number of users increase. Second, conventional quadrature amplitude modulation (QAM) exhibits strict constructive interference (CI) regions, limiting its effectiveness for CI exploitation in SLP. To address these challenges, this paper proposes a novel modulation scheme, termed hybrid-constellation modulation (HCM), which has a structure of superposed QAM and ASK sub-constellations (SCs). HCM extends the CI regions compared to QAM. Additionally, a two-stage reflecting and SLP optimization method is developed to support HCM. The proposed methods are designed for practical RIS with discrete phase shifts and have good scalability. Simulation results show that HCM achieves up to 1.5 dB and 1 dB SER gains over QAM with modulation orders 16 and 64, respectively.
Submitted 27 June, 2025;
originally announced June 2025.
-
Cluster-Aware Two-Stage Method for Fast Iterative MIMO Detection in LEO Satellite Communications
Authors:
Jiuyu Liu,
Yi Ma,
Qihao Peng,
Rahim Tafazolli
Abstract:
In this paper, a cluster-aware two-stage multiple-input multiple-output (MIMO) detection method is proposed for direct-to-cell satellite communications. The method achieves computational efficiency by exploiting a distinctive property of satellite MIMO channels: users within the same geographical cluster exhibit highly correlated channel characteristics due to their physical proximity, which typically impedes convergence in conventional iterative MIMO detectors. The proposed method implements a two-stage strategy that first eliminates intra-cluster interference using computationally efficient small matrix inversions, then utilizes these pre-computed matrices to accelerate standard iterative MIMO detectors such as Gauss-Seidel (GS) and symmetric successive over-relaxation (SSOR) for effective inter-cluster interference cancellation. Computer simulations demonstrate that the proposed method achieves more than 12 times faster convergence under perfect channel state information. Even when accounting for channel estimation errors, the method maintains 9 times faster convergence, demonstrating its robustness and effectiveness for next-generation satellite MIMO communications.
Submitted 26 June, 2025;
originally announced June 2025.
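For readers unfamiliar with the Gauss-Seidel (GS) detector named in the abstract above, the following is a minimal sketch of plain GS applied to the regularized normal equations of a linear MIMO detector. It does not implement the paper's cluster-aware first stage; the channel size, noise variance `sigma2`, and the helper name `gauss_seidel` are assumptions chosen only for illustration.

```python
import numpy as np

def gauss_seidel(A, b, num_iters=100):
    """Approximately solve A x = b by Gauss-Seidel sweeps (A Hermitian positive definite)."""
    n = A.shape[0]
    x = np.zeros(n, dtype=np.result_type(A, b))
    for _ in range(num_iters):
        for i in range(n):
            # Entries x[:i] are already updated in this sweep; x[i+1:] are from the last one.
            residual = b[i] - A[i, :i] @ x[:i] - A[i, i + 1:] @ x[i + 1:]
            x[i] = residual / A[i, i]
    return x

rng = np.random.default_rng(0)
H = rng.standard_normal((16, 4)) + 1j * rng.standard_normal((16, 4))  # toy channel
s = np.array([1 + 1j, -1 + 1j, 1 - 1j, -1 - 1j]) / np.sqrt(2)          # QPSK symbols
y = H @ s                                                              # noiseless receive vector
sigma2 = 0.01                                                          # assumed noise variance
A = H.conj().T @ H + sigma2 * np.eye(4)   # regularized Gram matrix
b = H.conj().T @ y                        # matched-filter output
x_gs = gauss_seidel(A, b)
x_direct = np.linalg.solve(A, b)          # reference solution by direct solve
```

The iteration avoids an explicit matrix inverse, which is the appeal for large systems; its convergence slows when columns of `H` are highly correlated, the situation the abstract describes for users in the same cluster.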
-
CMI-Bench: A Comprehensive Benchmark for Evaluating Music Instruction Following
Authors:
Yinghao Ma,
Siyou Li,
Juntao Yu,
Emmanouil Benetos,
Akira Maezawa
Abstract:
Recent advances in audio-text large language models (LLMs) have opened new possibilities for music understanding and generation. However, existing benchmarks are limited in scope, often relying on simplified tasks or multi-choice evaluations that fail to reflect the complexity of real-world music analysis. We reinterpret a broad range of traditional MIR annotations as instruction-following formats and introduce CMI-Bench, a comprehensive music instruction following benchmark designed to evaluate audio-text LLMs on a diverse set of music information retrieval (MIR) tasks. These include genre classification, emotion regression, emotion tagging, instrument classification, pitch estimation, key detection, lyrics transcription, melody extraction, vocal technique recognition, instrument performance technique detection, music tagging, music captioning, and (down)beat tracking, reflecting core challenges in MIR research. Unlike previous benchmarks, CMI-Bench adopts standardized evaluation metrics consistent with previous state-of-the-art MIR models, ensuring direct comparability with supervised approaches. We provide an evaluation toolkit supporting all open-source audio-textual LLMs, including LTU, Qwen-audio, SALMONN, MusiLingo, etc. Experimental results reveal significant performance gaps between LLMs and supervised models, along with their cultural, chronological, and gender biases, highlighting the potential and limitations of current models in addressing MIR tasks. CMI-Bench establishes a unified foundation for evaluating music instruction following, driving progress in music-aware LLMs.
Submitted 27 June, 2025; v1 submitted 13 June, 2025;
originally announced June 2025.
-
Inverse Design in Distributed Circuits Using Single-Step Reinforcement Learning
Authors:
Jiayu Li,
Masood Mortazavi,
Ning Yan,
Yihong Ma,
Reza Zafarani
Abstract:
The goal of inverse design in distributed circuits is to generate near-optimal designs that meet a desirable transfer function specification. Existing design exploration methods use some combination of strategies involving artificial grids, differentiable evaluation procedures, and specific template topologies. However, real-world design practices often require non-differentiable evaluation procedures, varying topologies, and near-continuous placement spaces. In this paper, we propose DCIDA, a design exploration framework that learns a near-optimal design sampling policy for a target transfer function. DCIDA decides all design factors in a compound single-step action by sampling from a set of jointly-trained conditional distributions generated by the policy. Utilizing an injective interdependent "map", DCIDA transforms raw sampled design "actions" into uniquely equivalent physical representations, enabling the framework to learn the conditional dependencies among joint "raw" design decisions. Our experiments demonstrate DCIDA's Transformer-based policy network achieves significant reductions in design error compared to state-of-the-art approaches, with significantly better fit in cases involving more complex transfer functions.
Submitted 1 June, 2025;
originally announced June 2025.
-
Online Audio-Visual Autoregressive Speaker Extraction
Authors:
Zexu Pan,
Wupeng Wang,
Shengkui Zhao,
Chong Zhang,
Kun Zhou,
Yukun Ma,
Bin Ma
Abstract:
This paper proposes a novel online audio-visual speaker extraction model. In the streaming regime, most studies optimize the audio network only, leaving the visual frontend less explored. We first propose a lightweight visual frontend based on depth-wise separable convolution. Then, we propose a lightweight autoregressive acoustic encoder to serve as the second cue, to actively explore the information in the separated speech signal from past steps. Scenario-wise, for the first time, we study how the algorithm performs when there is a change in focus of attention, i.e., the target speaker. Experimental results on LRS3 datasets show that our visual frontend performs comparably to the previous state-of-the-art on both SkiM and ConvTasNet audio backbones with only 0.1 million network parameters and 2.1 MACs per second of processing. The autoregressive acoustic encoder provides an additional 0.9 dB gain in terms of SI-SNRi, and its momentum is robust against the change in attention.
Submitted 1 June, 2025;
originally announced June 2025.
-
anyECG-chat: A Generalist ECG-MLLM for Flexible ECG Input and Multi-Task Understanding
Authors:
Haitao Li,
Ziyu Li,
Yiheng Mao,
Ziyi Liu,
Zhoujian Sun,
Zhengxing Huang
Abstract:
The advent of multimodal large language models (MLLMs) has sparked interest in their application to electrocardiogram (ECG) analysis. However, existing ECG-focused MLLMs concentrate primarily on report generation tasks and are often limited to single 12-lead, short-duration (10 s) ECG inputs, thereby underutilizing the potential of MLLMs. To this end, we aim to develop an MLLM for ECG analysis that supports a broader range of tasks and more flexible ECG inputs. However, existing ECG-QA datasets are often monotonous. To address this gap, we first constructed the anyECG dataset, which encompasses a wide variety of tasks, including report generation, abnormal waveform localization, and open-ended question answering. In addition to standard hospital ECGs, we introduced long-duration reduced-lead ECGs for home environments and multiple ECG comparison scenarios commonly encountered in clinical practice. Furthermore, we propose the anyECG-chat model, which supports dynamic-length ECG inputs and multiple ECG inputs. We trained the model using a three-stage curriculum training recipe with the anyECG dataset. A comprehensive evaluation was conducted, demonstrating that anyECG-chat is capable of supporting various practical application scenarios, including not only common report generation tasks but also abnormal waveform localization for long-duration reduced-lead ECGs in home environments and comprehensive comparative analysis of multiple ECGs.
Submitted 1 June, 2025;
originally announced June 2025.
-
ZeroSep: Separate Anything in Audio with Zero Training
Authors:
Chao Huang,
Yuesheng Ma,
Junxuan Huang,
Susan Liang,
Yunlong Tang,
Jing Bi,
Wenqiang Liu,
Nima Mesgarani,
Chenliang Xu
Abstract:
Audio source separation is fundamental for machines to understand complex acoustic environments and underpins numerous audio applications. Current supervised deep learning approaches, while powerful, are limited by the need for extensive, task-specific labeled data and struggle to generalize to the immense variability and open-set nature of real-world acoustic scenes. Inspired by the success of generative foundation models, we investigate whether pre-trained text-guided audio diffusion models can overcome these limitations. We make a surprising discovery: zero-shot source separation can be achieved purely through a pre-trained text-guided audio diffusion model under the right configuration. Our method, named ZeroSep, works by inverting the mixed audio into the diffusion model's latent space and then using text conditioning to guide the denoising process to recover individual sources. Without any task-specific training or fine-tuning, ZeroSep repurposes the generative diffusion model for a discriminative separation task and inherently supports open-set scenarios through its rich textual priors. ZeroSep is compatible with a variety of pre-trained text-guided audio diffusion backbones and delivers strong separation performance on multiple separation benchmarks, surpassing even supervised methods.
Submitted 29 May, 2025;
originally announced May 2025.
-
Dynamic Resource Allocation in Distributed MIMO-LEO Satellite Networks
Authors:
Qihao Peng,
Qu Luo,
Yi Ma,
Chuan Heng Foh,
Pei Xiao,
Maged Elkashlan,
Rahim Tafazolli,
George K. Karagiannidis
Abstract:
This paper characterizes the impacts of channel estimation errors and Rician factors on the achievable data rate and investigates the user scheduling strategy, combining scheme, power control, and dynamic bandwidth allocation to maximize the sum data rate in distributed multiple-input-multiple-output (MIMO)-enabled low earth orbit (LEO) satellite networks. However, due to the resource-assignment problem, it is challenging to find the optimal solution for maximizing the sum data rate. To transform this problem into a more tractable form, we first quantify the channel estimation errors based on the minimum mean square error (MMSE) estimator and rigorously derive a closed-form lower bound of the achievable data rate, offering an explicit formulation for resource allocation. Then, to solve the NP-hard problem, we decompose it into three sub-problems, namely, user scheduling strategy, joint combination and power control, and dynamic bandwidth allocation, by using alternating optimization (AO). Specifically, the user scheduling is formulated as a graph coloring problem by iteratively updating an undirected graph based on user requirements, which is then solved using the DSatur algorithm. For the combining weights and power control, successive convex approximation (SCA) and geometric programming (GP) are adopted to obtain a sub-optimal solution with lower complexity. Finally, the optimal bandwidth allocation can be achieved by solving the concave problem. Numerical results validate the analytical tightness of the derived bound, especially for large Rician factors, and demonstrate significant performance gains over other benchmarks.
Submitted 27 May, 2025;
originally announced May 2025.
-
Plug-and-Play Co-Occurring Face Attention for Robust Audio-Visual Speaker Extraction
Authors:
Zexu Pan,
Shengkui Zhao,
Tingting Wang,
Kun Zhou,
Yukun Ma,
Chong Zhang,
Bin Ma
Abstract:
Audio-visual speaker extraction isolates a target speaker's speech from a mixture speech signal conditioned on a visual cue, typically using the target speaker's face recording. However, in real-world scenarios, other co-occurring faces are often present on-screen, providing valuable speaker activity cues in the scene. In this work, we introduce a plug-and-play inter-speaker attention module to process these flexible numbers of co-occurring faces, allowing for more accurate speaker extraction in complex multi-person environments. We integrate our module into two prominent models: the AV-DPRNN and the state-of-the-art AV-TFGridNet. Extensive experiments on diverse datasets, including the highly overlapped VoxCeleb2 and sparsely overlapped MISP, demonstrate that our approach consistently outperforms baselines. Furthermore, cross-dataset evaluations on LRS2 and LRS3 confirm the robustness and generalizability of our method.
Submitted 26 May, 2025;
originally announced May 2025.
-
PhySense: Sensor Placement Optimization for Accurate Physics Sensing
Authors:
Yuezhou Ma,
Haixu Wu,
Hang Zhou,
Huikun Weng,
Jianmin Wang,
Mingsheng Long
Abstract:
Physics sensing plays a central role in many scientific and engineering domains, which inherently involves two coupled tasks: reconstructing dense physical fields from sparse observations and optimizing scattered sensor placements to observe maximum information. While deep learning has made rapid advances in sparse-data reconstruction, existing methods generally omit optimization of sensor placements, leaving the mutual enhancement between reconstruction and placement on the shelf. To change this suboptimal practice, we propose PhySense, a synergistic two-stage framework that learns to jointly reconstruct physical fields and to optimize sensor placements, both aiming for accurate physics sensing. The first stage involves a flow-based generative model enhanced by cross-attention to adaptively fuse sparse observations. Leveraging the reconstruction feedback, the second stage performs sensor placement via projected gradient descent to satisfy spatial constraints. We further prove that the learning objectives of the two stages are consistent with classical variance-minimization principles, providing theoretical guarantees. Extensive experiments across three challenging benchmarks, especially a 3D geometry dataset, indicate PhySense achieves state-of-the-art physics sensing accuracy and discovers informative sensor placements previously unconsidered.
Submitted 26 May, 2025; v1 submitted 19 May, 2025;
originally announced May 2025.
-
MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix
Authors:
Ziyang Ma,
Yinghao Ma,
Yanqiao Zhu,
Chen Yang,
Yi-Wen Chao,
Ruiyang Xu,
Wenxi Chen,
Yuanzhe Chen,
Zhuo Chen,
Jian Cong,
Kai Li,
Keliang Li,
Siyou Li,
Xinfeng Li,
Xiquan Li,
Zheng Lian,
Yuzhe Liang,
Minghao Liu,
Zhikang Niu,
Tianrui Wang,
Yuping Wang,
Yuxuan Wang,
Yihao Wu,
Guanrou Yang,
Jianwei Yu
, et al. (9 additional authors not shown)
Abstract:
We introduce MMAR, a new benchmark designed to evaluate the deep reasoning capabilities of Audio-Language Models (ALMs) across massive multi-disciplinary tasks. MMAR comprises 1,000 meticulously curated audio-question-answer triplets, collected from real-world internet videos and refined through iterative error corrections and quality checks to ensure high quality. Unlike existing benchmarks that are limited to specific domains of sound, music, or speech, MMAR extends them to a broad spectrum of real-world audio scenarios, including mixed-modality combinations of sound, music, and speech. Each question in MMAR is hierarchically categorized across four reasoning layers: Signal, Perception, Semantic, and Cultural, with additional sub-categories within each layer to reflect task diversity and complexity. To further foster research in this area, we annotate every question with a Chain-of-Thought (CoT) rationale to promote future advancements in audio reasoning. Each item in the benchmark demands multi-step deep reasoning beyond surface-level understanding. Moreover, a part of the questions requires graduate-level perceptual and domain-specific knowledge, elevating the benchmark's difficulty and depth. We evaluate MMAR using a broad set of models, including Large Audio-Language Models (LALMs), Large Audio Reasoning Models (LARMs), Omni Language Models (OLMs), Large Language Models (LLMs), and Large Reasoning Models (LRMs), with audio caption inputs. The performance of these models on MMAR highlights the benchmark's challenging nature, and our analysis further reveals critical limitations of understanding and reasoning capabilities among current models. We hope MMAR will serve as a catalyst for future advances in this important but little-explored area.
Submitted 19 May, 2025;
originally announced May 2025.
-
Accelerating Diffusion Transformer via Increment-Calibrated Caching with Channel-Aware Singular Value Decomposition
Authors:
Zhiyuan Chen,
Keyi Li,
Yifan Jia,
Le Ye,
Yufei Ma
Abstract:
Diffusion transformer (DiT) models have achieved remarkable success in image generation, thanks to their exceptional generative capabilities and scalability. Nonetheless, the iterative nature of diffusion models (DMs) results in high computational complexity, posing challenges for deployment. Although existing cache-based acceleration methods try to utilize the inherent temporal similarity to skip redundant computations of DiT, the lack of correction may induce potential quality degradation. In this paper, we propose increment-calibrated caching, a training-free method for DiT acceleration, where the calibration parameters are generated from the pre-trained model itself with low-rank approximation. To deal with the possible correction failure arising from outlier activations, we introduce channel-aware Singular Value Decomposition (SVD), which further strengthens the calibration effect. Experimental results show that our method always achieves better performance than existing naive caching methods with a similar computation resource budget. When compared with 35-step DDIM, our method eliminates more than 45% of the computation and improves IS by 12 at the cost of less than 0.06 FID increase. Code is available at https://github.com/ccccczzy/icc.
Submitted 9 May, 2025;
originally announced May 2025.
-
Enhanced Robust Tracking Control: An Online Learning Approach
Authors:
Ao Jin,
Weijian Zhao,
Yifeng Ma,
Panfeng Huang,
Fan Zhang
Abstract:
This work focuses on the tracking control problem for nonlinear systems subjected to unknown external disturbances. Inspired by contraction theory, a neural network-driven CCM synthesis is adopted to obtain a feedback controller that can track any feasible trajectory. Based on the observation that the system states under continuous control input inherently contain embedded information about unknown external disturbances, we propose an online learning scheme that captures the disturbance dynamics from online historical data and embeds the compensation within the CCM controller. The proposed scheme operates as a plug-and-play module that intrinsically enhances the tracking performance of CCM synthesis. Numerical simulations on a tethered space robot and a PVTOL demonstrate the effectiveness of the proposed scheme. The source code of the proposed online learning scheme can be found at https://github.com/NPU-RCIR/Online_CCM.git.
Submitted 8 May, 2025;
originally announced May 2025.
-
ResiTok: A Resilient Tokenization-Enabled Framework for Ultra-Low-Rate and Robust Image Transmission
Authors:
Zhenyu Liu,
Yi Ma,
Rahim Tafazolli
Abstract:
Real-time transmission of visual data over wireless networks remains highly challenging, even when leveraging advanced deep neural networks, particularly under severe channel conditions such as limited bandwidth and weak connectivity. In this paper, we propose a novel Resilient Tokenization-Enabled (ResiTok) framework designed for ultra-low-rate image transmission that achieves exceptional robustness while maintaining high reconstruction quality. By reorganizing visual information into hierarchical token groups consisting of essential key tokens and supplementary detail tokens, ResiTok enables progressive encoding and graceful degradation of visual quality under constrained channel conditions. A key contribution is our resilient 1D tokenization method integrated with a specialized zero-out training strategy, which systematically simulates token loss during training, empowering the neural network to effectively compress and reconstruct images from incomplete token sets. Furthermore, the channel-adaptive coding and modulation design dynamically allocates coding resources according to prevailing channel conditions, yielding superior semantic fidelity and structural consistency even at extremely low channel bandwidth ratios. Evaluation results demonstrate that ResiTok outperforms state-of-the-art methods in both semantic similarity and visual quality, with significant advantages under challenging channel conditions.
Submitted 3 May, 2025;
originally announced May 2025.
-
Easz: An Agile Transformer-based Image Compression Framework for Resource-constrained IoTs
Authors:
Yu Mao,
Jingzong Li,
Jun Wang,
Hong Xu,
Tei-Wei Kuo,
Nan Guan,
Chun Jason Xue
Abstract:
Neural image compression, necessary in various machine-to-machine communication scenarios, suffers from its heavy encode-decode structures and inflexibility in switching between different compression levels. Consequently, it is challenging to apply neural image compression, which is developed for powerful servers with high computational and storage capacities, to edge devices. We take a step toward solving these challenges by proposing a new transformer-based edge-compute-free image coding framework called Easz. Easz shifts the computational overhead to the server, and hence avoids the heavy encoding and model switching overhead on the edge. Easz utilizes a patch-erase algorithm to selectively remove image contents using a conditional uniform-based sampler. The erased pixels are reconstructed on the receiver side through a transformer-based framework. To further reduce the computational overhead on the receiver, we introduce a lightweight transformer-based reconstruction structure. Extensive evaluations conducted on a real-world testbed demonstrate multiple advantages of Easz over existing compression approaches, in terms of adaptability to different compression levels, computational efficiency, and image reconstruction quality.
Submitted 14 May, 2025; v1 submitted 3 May, 2025;
originally announced May 2025.
-
Evaluation of Switching Technologies for Reflective and Transmissive RISs at Sub-THz Frequencies
Authors:
Sofia I. Inácio,
Yihan Ma,
Qi Luo,
Luca Lucci,
Awanish Kumar,
José Luis Gonzalez Jimenez,
Bruno Reig,
Alexandre Siligaris,
Denis Mercier,
Jonas Deuermeier,
Asal Kiazadeh,
Verónica Lain-Rubio,
Oleg Cojocari,
Tung D. Phan,
Ping Jack Soh,
Sérgio Matos,
George C. Alexandropoulos,
Luís M. Pessoa,
Antonio Clemente
Abstract:
For the upcoming 6G wireless networks, reconfigurable intelligent surfaces (RISs) are an essential technology, enabling dynamic beamforming and signal manipulation in both reflective and transmissive modes. These networks are expected to utilize frequency bands in the millimeter-wave and THz ranges, which presents unique opportunities but also significant challenges. The selection of switching technologies that can support high-frequency operation with minimal loss and high efficiency is particularly complex. In this work, we demonstrate the potential of advanced components such as Schottky diodes, memristor switches, liquid metal-based switches, phase change materials, and RF-SOI technology in RIS designs as an alternative to overcome limitations inherent in traditional technologies in the D-band (110-170 GHz).
Submitted 28 April, 2025;
originally announced April 2025.
-
SA-MIMO: Scalable Quantum-Based Wireless Communications
Authors:
Jiuyu Liu,
Yi Ma,
Rahim Tafazolli
Abstract:
Rydberg atomic receivers offer a quantum-native alternative to conventional RF front-ends by directly detecting electromagnetic fields via highly excited atomic states. While their quantum-limited sensitivity and hardware simplicity make them promising for future wireless systems, extending their use to scalable multi-antenna and multi-carrier configurations, termed Scalable Atomic-MIMO (SA-MIMO), remains largely unexplored. This paper introduces a novel RF transmitter-atomic receiver architecture that addresses this gap. The core idea lies in a novel modulation technique called Phase-Rotated Symbol Spreading (PRSS), which transforms the nonlinear phase retrieval problem inherent to atomic detection into a tractable linear demultiplexing task. PRSS enables efficient signal processing and supports scalable MUX/DeMUX operations in both atomic MIMO and atomic OFDM systems. Simulation results show that the proposed system achieves up to 2.5 dB gain under optimal maximum-likelihood detection and over 10 dB under suboptimal detection in MIMO settings. These results establish PRSS assisted SA-MIMO as a promising architecture for realizing high-sensitivity, interference-resilient atomic wireless communication.
Submitted 5 May, 2025; v1 submitted 27 April, 2025;
originally announced April 2025.
-
Block-Weighted Lasso for Joint Optimization of Memory Depth and Kernels in Wideband DPD
Authors:
Jinfei Wang,
Yi Ma,
Fei Tong,
Ziming He
Abstract:
The optimization of both memory depth and kernel functions is critical for wideband digital pre-distortion (DPD). However, the memory depth is usually determined via exhaustive search over a wide range for the sake of linearization optimality, followed by the kernel selection for each memory depth, yielding excessive computational cost. In this letter, we aim to provide an efficient solution that jointly optimizes the memory depth and kernels while preserving reasonable linearization performance. Specifically, we propose to formulate this optimization as a block-weighted least absolute shrinkage and selection operator (Lasso) problem, where kernels are assigned regularization weights based on their polynomial orders. Then, a block coordinate descent algorithm is introduced to solve the block-weighted Lasso problem. Measurement results on a generalized memory polynomial (GMP) model demonstrate that our proposed solution reduces memory depth by 31.6% and kernel count by 85% compared to the full GMP, while achieving -46.4 dB error vector magnitude (EVM) for signals of 80 MHz bandwidth. In addition, the proposed solution outperforms both the full GMP and the GMP pruned by standard Lasso by at least 0.7 dB in EVM.
Submitted 18 April, 2025;
originally announced April 2025.
-
Data-Importance-Aware Power Allocation for Adaptive Real-Time Communication in Computer Vision Applications
Authors:
Chunmei Xu,
Yi Ma,
Rahim Tafazolli,
Jiangzhou Wang
Abstract:
Life-transformative applications such as immersive extended reality are revolutionizing wireless communications and computer vision (CV). This paper presents a novel framework for importance-aware adaptive data transmissions, designed specifically for real-time CV applications where task-specific fidelity is critical. A novel importance-weighted mean square error (IMSE) metric is introduced as a task-oriented measure of reconstruction quality, considering sub-pixel-level importance (SP-I) and semantic segment-level importance (SS-I) models. To minimize IMSE under total power constraints, data-importance-aware waterfilling approaches are proposed to optimally allocate transmission power according to data importance and channel conditions, prioritizing sub-streams with high importance. Simulation results demonstrate that the proposed approaches significantly outperform margin-adaptive waterfilling and equal power allocation strategies. The data partitioning that combines both SP-I and SS-I models is shown to achieve the most significant improvements, with normalized IMSE gains exceeding $7\,$dB and $10\,$dB over the baselines at high SNRs ($>10\,$dB). These substantial gains highlight the potential of the proposed framework to enhance data efficiency and robustness in real-time CV applications, especially in bandwidth-limited and resource-constrained environments.
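For context, the classic waterfilling allocation that importance-aware variants start from can be sketched as below. This is a minimal sketch of the textbook sum-rate algorithm, not the method proposed in the paper: the function name `waterfill`, the objective (maximizing the sum of log(1 + p_i g_i) under a total power budget), and the example gains are all assumptions made for illustration.

```python
def waterfill(gains, total_power):
    """Classic waterfilling: p_i = max(0, level - 1/g_i), with sum(p_i) = total_power.

    Assumes every gain is positive and total_power is positive.
    """
    # Inverse gains act as the "floor heights" of the vessel being filled.
    inv = sorted(1.0 / g for g in gains)
    level = 0.0
    # Try using the k best channels, from all of them down to one, and stop
    # at the first k for which the water level covers the k-th floor.
    for k in range(len(inv), 0, -1):
        level = (total_power + sum(inv[:k])) / k
        if level > inv[k - 1]:
            break
    # Channels whose floor lies above the water level receive no power.
    return [max(0.0, level - 1.0 / g) for g in gains]


powers = waterfill([1.0, 0.5, 0.1], total_power=3.0)
print(powers)
```

In this example the weakest channel is left unpowered and the whole budget is split between the two stronger ones. An importance-aware scheme would additionally weight each sub-stream by its data importance, which this sketch does not attempt.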
Submitted 11 April, 2025;
originally announced April 2025.
-
Distributed Fault-Tolerant Control for Heterogeneous MAS with Prescribed Performance under Communication Failures
Authors:
Yongkang Zhang,
Bin Jiang,
Yajie Ma
Abstract:
This paper presents a novel approach employing prescribed performance control to address the distributed fault-tolerant formation control problem in a heterogeneous UAV-UGV cooperative system under a directed interaction topology and communication link failures. The proposed distributed fault-tolerant control scheme enables UAVs to accurately track a virtual leader's trajectory and achieve the desired formation, while ensuring UGVs converge within the convex hull formed by leader UAVs. By accounting for differences in system parameters and state dimensions between UAVs and UGVs, the method leverages performance functions to guarantee predefined transient and steady-state behavior. Additionally, a variable prescribed performance boundary control strategy with an adaptive learning rate is introduced to tackle actuator saturation, ensuring reliable formation tracking in real-world scenarios. Simulation results demonstrate the effectiveness and robustness of the proposed approach.
Submitted 10 April, 2025;
originally announced April 2025.
-
Optimal Sensor Placement Using Combinations of Hybrid Measurements for Source Localization
Authors:
Kang Tang,
Sheng Xu,
Yuqi Yang,
He Kong,
Yongsheng Ma
Abstract:
This paper focuses on static source localization employing different combinations of measurements, including time-difference-of-arrival (TDOA), received-signal-strength (RSS), angle-of-arrival (AOA), and time-of-arrival (TOA) measurements. Since sensor-source geometry significantly impacts localization accuracy, the strategies of optimal sensor placement are proposed systematically using combinations of hybrid measurements. Firstly, the relationship between sensor placement and source estimation accuracy is formulated by a derived Cramér-Rao bound (CRB). Secondly, the A-optimality criterion, i.e., minimizing the trace of the CRB, is selected to calculate the smallest reachable estimation mean-squared-error (MSE) in a unified manner. Thirdly, the optimal sensor placement strategies are developed to achieve the optimal estimation bound. Specifically, the specific constraints of the optimal geometries deduced by specific measurement, i.e., TDOA, AOA, RSS, and TOA, are found and discussed theoretically. Finally, the new findings are verified by simulation studies.
Submitted 9 April, 2025; v1 submitted 2 April, 2025;
originally announced April 2025.
-
Benchmark of Segmentation Techniques for Pelvic Fracture in CT and X-ray: Summary of the PENGWIN 2024 Challenge
Authors:
Yudi Sang,
Yanzhen Liu,
Sutuke Yibulayimu,
Yunning Wang,
Benjamin D. Killeen,
Mingxu Liu,
Ping-Cheng Ku,
Ole Johannsen,
Karol Gotkowski,
Maximilian Zenk,
Klaus Maier-Hein,
Fabian Isensee,
Peiyan Yue,
Yi Wang,
Haidong Yu,
Zhaohong Pan,
Yutong He,
Xiaokun Liang,
Daiqi Liu,
Fuxin Fan,
Artur Jurgas,
Andrzej Skalski,
Yuxi Ma,
Jing Yang,
Szymon Płotka
, et al. (11 additional authors not shown)
Abstract:
The segmentation of pelvic fracture fragments in CT and X-ray images is crucial for trauma diagnosis, surgical planning, and intraoperative guidance. However, accurately and efficiently delineating the bone fragments remains a significant challenge due to complex anatomy and imaging limitations. The PENGWIN challenge, organized as a MICCAI 2024 satellite event, aimed to advance automated fracture segmentation by benchmarking state-of-the-art algorithms on these complex tasks. A diverse dataset of 150 CT scans was collected from multiple clinical centers, and a large set of simulated X-ray images was generated using the DeepDRR method. Final submissions from 16 teams worldwide were evaluated under a rigorous multi-metric testing scheme. The top-performing CT algorithm achieved an average fragment-wise intersection over union (IoU) of 0.930, demonstrating satisfactory accuracy. However, in the X-ray task, the best algorithm attained an IoU of 0.774, highlighting the greater challenges posed by overlapping anatomical structures. Beyond the quantitative evaluation, the challenge revealed methodological diversity in algorithm design. Variations in instance representation, such as primary-secondary classification versus boundary-core separation, led to differing segmentation strategies. Despite promising results, the challenge also exposed inherent uncertainties in fragment definition, particularly in cases of incomplete fractures. These findings suggest that interactive segmentation approaches, integrating human decision-making with task-relevant information, may be essential for improving model reliability and clinical applicability.
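As a reminder of what the reported metric measures, intersection over union for a single pair of binary masks can be computed as in the sketch below. The function name `iou`, the nested-list mask representation, and the convention of returning 1.0 when both masks are empty are assumptions for illustration; the challenge's fragment-wise matching and averaging procedure is not reproduced here.

```python
def iou(mask_a, mask_b):
    """Intersection over union of two equally sized binary masks (nested lists)."""
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            if a and b:
                intersection += 1
            if a or b:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


pred = [[1, 1],
        [0, 0]]
truth = [[1, 0],
         [1, 0]]
print(iou(pred, truth))  # one shared pixel out of three covered pixels
```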
Submitted 3 April, 2025;
originally announced April 2025.
-
Structure Identification of NDS with Descriptor Subsystems under Asynchronous, Non-Uniform, and Slow-Rate Sampling
Authors:
Yunxiang Ma,
Tong Zhou
Abstract:
Networked dynamic systems (NDS) exhibit collective behavior shaped by subsystem dynamics and complex interconnections, yet identifying these interconnections remains challenging due to irregularities in sampled data, including asynchronous, non-uniform, and low-rate sampling. This paper proposes a novel two-stage structure identification algorithm that leverages system zero-order moments, a concept traditionally used in model order reduction, to bridge system identification and model reduction. First, zero-order moments are estimated from steady-state time-domain outputs; second, subsystem interconnections are explicitly reconstructed from these moments. The method generalizes existing approaches by handling asynchronous, non-uniform, and slow sampling simultaneously, eliminating constraints on input signal periodicity and extending applicability to multi-input multi-output NDS with arbitrary interconnections. Unlike black-box identification techniques, our approach explicitly recovers subsystem interconnection structures. Validation on the IEEE 14-bus system demonstrates the algorithm's effectiveness in recovering subsystem interconnections from irregular sampling data.
Submitted 27 May, 2025; v1 submitted 26 March, 2025;
originally announced March 2025.
-
RIS-Assisted Passive Localization (RAPL): An Efficient Zero-Overhead Framework Using Conditional Sample Mean
Authors:
Jiawei Yao,
Yijie Mao,
Mingzhe Chen,
Ye Hu
Abstract:
Reconfigurable Intelligent Surface (RIS) has been recognized as a promising solution for enhancing localization accuracy. Traditional RIS-based localization methods typically rely on prior channel knowledge, beam scanning, and pilot-based assistance. These approaches often result in substantial energy and computational overhead, and require real-time coordination between the base station (BS) and the RIS. To address these challenges, in this work, we move beyond conventional methods and introduce a novel data-driven, multiple RISs-assisted passive localization approach (RAPL). The proposed method includes two stages: in the first stage, the angle-of-directions (AoDs) between the RISs and the user are estimated using the conditional sample mean; in the second stage, the user's position is determined from the estimated multiple AoD pairs. This approach only utilizes the existing communication signals between the user and the BS, relying solely on the measurement of received signal power at each BS antenna for a set of randomly generated phase shifts across all RISs. Moreover, by obviating the need for real-time RIS phase shift optimization or user-to-BS pilot transmissions, the method introduces no additional communication overhead, making it highly suitable for deployment in real-world networks. The proposed scheme is then extended to multi-RIS scenarios considering both parallel and cascaded RIS topologies. Numerical results show that the proposed RAPL improves localization accuracy while significantly reducing energy and signaling overhead compared to conventional methods.
Submitted 25 March, 2025;
originally announced March 2025.
-
RIS-Assisted Localization: A Novel Conditional Sample Mean Approach without CSI
Authors:
Jiawei Yao,
Yijie Mao,
Mingzhe Chen
Abstract:
Reconfigurable intelligent surface (RIS) has been recognized as a promising solution for enhancing localization accuracy. Traditional RIS-based localization methods typically rely on prior channel knowledge, beam scanning, and pilot-based assistance. These approaches often result in substantial energy and computational overhead, and require real-time coordination between the base station (BS) and the RIS. In this work, we propose a novel multiple RISs aided localization approach to address these challenges. The proposed method first estimates the angle-of-directions (AoDs) between the RISs and the user using the conditional sample mean approach, and then uses the estimated multiple AoD pairs to determine the user's position. This approach only requires measuring the received signal strength at the BS for a set of randomly generated phase shifts across all RISs, thereby eliminating the need for real-time RIS phase shift design or user-to-BS pilot transmissions. Numerical results show that the proposed localization approach improves localization accuracy while significantly reducing energy and signaling overhead compared to conventional methods.
Submitted 24 March, 2025;
originally announced March 2025.
-
WISE: A Framework for Gigapixel Whole-Slide-Image Lossless Compression
Authors:
Yu Mao,
Jun Wang,
Nan Guan,
Chun Jason Xue
Abstract:
Whole-Slide Images (WSIs) have revolutionized medical analysis by presenting high-resolution images of the whole tissue slide. Despite avoiding the physical storage of the slides, WSIs require considerable data volume, which makes the storage and maintenance of WSI records costly and unsustainable. To this end, this work presents the first investigation of lossless compression of WSI images. Interestingly, we find that most existing compression methods fail to compress the WSI images effectively. Furthermore, our analysis reveals that the failure of existing compressors is mainly due to information irregularity in WSI images. To resolve this issue, we developed a simple yet effective lossless compressor called WISE, specifically designed for WSI images. WISE employs a hierarchical encoding strategy to extract effective bits, reducing the entropy of the image, and then adopts a dictionary-based method to handle the irregular frequency patterns. Through extensive experiments, we show that WISE can effectively compress gigapixel WSI images by a factor of 36 on average and up to 136.
Submitted 23 March, 2025;
originally announced March 2025.
-
RAISE: Optimizing RIS Placement to Maximize Task Throughput in Multi-Server Vehicular Edge Computing
Authors:
Yanan Ma,
Zhengru Fang,
Longzhi Yuan,
Yiqin Deng,
Xianhao Chen,
Yuguang Fang
Abstract:
Given the limited computing capabilities on autonomous vehicles, onboard processing of large volumes of latency-sensitive tasks presents significant challenges. While vehicular edge computing (VEC) has emerged as a solution, offloading data-intensive tasks to roadside servers or other vehicles is hindered by large obstacles like trucks/buses and the surge in service demands during rush hours. To address these challenges, Reconfigurable Intelligent Surface (RIS) can be leveraged to mitigate interference from ground signals and reach more edge servers by elevating RIS adaptively. To this end, we propose RAISE, an optimization framework for RIS placement in multi-server VEC systems. Specifically, RAISE optimizes RIS altitude and tilt angle together with the optimal task assignment to maximize task throughput under deadline constraints. To find a solution, a two-layer optimization approach is proposed, where the inner layer exploits the unimodularity of the task assignment problem to derive the efficient optimal strategy while the outer layer develops a near-optimal hill climbing (HC) algorithm for RIS placement with low complexity. Extensive experiments demonstrate that the proposed RAISE framework consistently outperforms existing benchmarks.
Submitted 22 March, 2025;
originally announced March 2025.
-
YuE: Scaling Open Foundation Models for Long-Form Music Generation
Authors:
Ruibin Yuan,
Hanfeng Lin,
Shuyue Guo,
Ge Zhang,
Jiahao Pan,
Yongyi Zang,
Haohe Liu,
Yiming Liang,
Wenye Ma,
Xingjian Du,
Xinrun Du,
Zhen Ye,
Tianyu Zheng,
Yinghao Ma,
Minghao Liu,
Zeyue Tian,
Ziya Zhou,
Liumeng Xue,
Xingwei Qu,
Yizhi Li,
Shangda Wu,
Tianhao Shen,
Ziyang Ma,
Jun Zhan,
Chunhui Wang
, et al. (32 additional authors not shown)
Abstract:
We tackle the task of long-form music generation, particularly the challenging lyrics-to-song problem, by introducing YuE, a family of open foundation models based on the LLaMA2 architecture. Specifically, YuE scales to trillions of tokens and generates up to five minutes of music while maintaining lyrical alignment, coherent musical structure, and engaging vocal melodies with appropriate accompaniment. It achieves this through (1) track-decoupled next-token prediction to overcome dense mixture signals, (2) structural progressive conditioning for long-context lyrical alignment, and (3) a multitask, multiphase pre-training recipe to converge and generalize. In addition, we redesign the in-context learning technique for music generation, enabling versatile style transfer (e.g., converting Japanese city pop into an English rap while preserving the original accompaniment) and bidirectional generation. Through extensive evaluation, we demonstrate that YuE matches or even surpasses some of the proprietary systems in musicality and vocal agility. In addition, fine-tuning YuE enables additional controls and enhanced support for tail languages. Furthermore, beyond generation, we show that YuE's learned representations can perform well on music understanding tasks, where the results of YuE match or exceed state-of-the-art methods on the MARBLE benchmark. Keywords: lyrics2song, song generation, long-form, foundation model, music generation
Submitted 11 March, 2025;
originally announced March 2025.
-
Distributed Resource Block Allocation for Wideband Cell-free System
Authors:
Yang Ma,
Shengqian Han,
Chenyang Yang
Abstract:
This paper studies distributed resource block (RB) allocation in wideband orthogonal frequency-division multiplexing (OFDM) cell-free systems. We propose a novel distributed sequential algorithm and its two variants, which optimize RB allocation based on the information obtained through over-the-air (OTA) transmissions between access points (APs) and user equipments, enabling local decision updates at each AP. To reduce the overhead of OTA transmission, we further develop a distributed deep learning (DL)-based method to learn the RB allocation policy. Simulation results demonstrate that the proposed distributed algorithms perform close to the centralized algorithm, while the DL-based method outperforms existing baseline methods.
Submitted 9 March, 2025;
originally announced March 2025.
-
Semi-Supervised Medical Image Segmentation via Knowledge Mining from Large Models
Authors:
Yuchen Mao,
Hongwei Li,
Yinyi Lai,
Giorgos Papanastasiou,
Peng Qi,
Yunjie Yang,
Chengjia Wang
Abstract:
Large-scale vision models like SAM have extensive visual knowledge, yet their general nature and computational demands limit their use in specialized tasks like medical image segmentation. In contrast, task-specific models such as U-Net++ often underperform due to sparse labeled data. This study introduces a strategic knowledge mining method that leverages SAM's broad understanding to boost the performance of small, locally hosted deep learning models.
In our approach, we train a U-Net++ model on a limited labeled dataset and extend its capabilities by converting SAM's output inferred on unlabeled images into prompts. This process not only harnesses SAM's generalized visual knowledge but also iteratively improves SAM's prediction to cater to specialized medical segmentation tasks via U-Net++. The mined knowledge, serving as "pseudo labels", enriches the training dataset, enabling the fine-tuning of the local network.
Applied to the Kvasir SEG and COVID-QU-Ex datasets which consist of gastrointestinal polyp and lung X-ray images respectively, our proposed method consistently enhanced the segmentation performance on Dice by 3% and 1% respectively over the baseline U-Net++ model, when the same amount of labelled data were used during training (75% and 50% of labelled data). Remarkably, our proposed method surpassed the baseline U-Net++ model even when the latter was trained exclusively on labeled data (100% of labelled data). These results underscore the potential of knowledge mining to overcome data limitations in specialized models by leveraging the broad, albeit general, knowledge of large-scale models like SAM, all while maintaining operational efficiency essential for clinical applications.
Submitted 9 March, 2025;
originally announced March 2025.
-
Deep Reinforcement Learning-Based Semi-Autonomous Control for Magnetic Micro-robot Navigation with Immersive Manipulation
Authors:
Yudong Mao,
Dandan Zhang
Abstract:
Magnetic micro-robots have demonstrated immense potential in biomedical applications, such as in vivo drug delivery, non-invasive diagnostics, and cell-based therapies, owing to their precise maneuverability and small size. However, current micromanipulation techniques often rely solely on a two-dimensional (2D) microscopic view as sensory feedback, while traditional control interfaces do not provide an intuitive manner for operators to manipulate micro-robots. These limitations increase the cognitive load on operators, who must interpret limited feedback and translate it into effective control actions. To address these challenges, we propose a Deep Reinforcement Learning-Based Semi-Autonomous Control (DRL-SC) framework for magnetic micro-robot navigation in a simulated microvascular system. Our framework integrates Mixed Reality (MR) to facilitate immersive manipulation of micro-robots, thereby enhancing situational awareness and control precision. Simulation and experimental results demonstrate that our approach significantly improves navigation efficiency, reduces control errors, and enhances the overall robustness of the system in simulated microvascular environments.
Submitted 8 March, 2025;
originally announced March 2025.
-
Attention on the Wires (AttWire): A Foundation Model for Detecting Devices and Catheters in X-ray Fluoroscopic Images
Authors:
YingLiang Ma,
Sandra Howell,
Aldo Rinaldi,
Tarv Dhanjal,
Kawal S. Rhode
Abstract:
Objective: Interventional devices, catheters and insertable imaging devices such as transesophageal echo (TOE) probes are routinely used in minimally invasive cardiovascular procedures. Detecting their positions and orientations in X-ray fluoroscopic images is important for many clinical applications. Method: In this paper, a novel attention mechanism was designed to guide a convolution neural network (CNN) model to the areas of wires in X-ray images, as nearly all interventional devices and catheters used in cardiovascular procedures contain wires. The attention mechanism includes multi-scale Gaussian derivative filters and a dot-product-based attention layer. By utilizing the proposed attention mechanism, a lightweight foundation model can be created to detect multiple objects simultaneously with higher precision and real-time speed. Results: The proposed model was trained and tested on a total of 12,438 X-ray images. An accuracy of 0.88 was achieved for detecting an echo probe and 0.87 for detecting an artificial valve at 58 FPS. The accuracy was measured by intersection-over-union (IoU). We also achieved a 99.8% success rate in detecting a 10-electrode catheter and a 97.8% success rate in detecting an ablation catheter. Conclusion: Our detection foundation model can simultaneously detect and identify both interventional devices and flexible catheters in real-time X-ray fluoroscopic images. Significance: The proposed model employs a novel attention mechanism to achieve high-performance object detection, making it suitable for various clinical applications and robotic-assisted surgeries. Codes are available at https://github.com/YingLiangMa/AttWire.
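The abstract reports accuracy as intersection-over-union (IoU). As a rough illustration only, the sketch below computes IoU for two axis-aligned bounding boxes; whether the authors evaluate boxes or masks is not stated in the abstract, so the box format `(x1, y1, x2, y2)` and the function name `box_iou` are assumptions.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2) with x1 < x2, y1 < y2."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap rectangle, clamped at zero when disjoint.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical detection vs. ground truth: two 10x10 boxes overlapping by half.
iou = box_iou((0, 0, 10, 10), (5, 0, 15, 10))
```

Here the overlap is 50 and the union is 150, giving an IoU of 1/3; a reported value of 0.88 therefore indicates a much tighter match than this toy case.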
Submitted 8 March, 2025;
originally announced March 2025.
-
Volume Tells: Dual Cycle-Consistent Diffusion for 3D Fluorescence Microscopy De-noising and Super-Resolution
Authors:
Zelin Li,
Chenwei Wang,
Zhaoke Huang,
Yiming MA,
Cunmin Zhao,
Zhongying Zhao,
Hong Yan
Abstract:
3D fluorescence microscopy is essential for understanding fundamental life processes through long-term live-cell imaging. However, due to inherent issues in imaging principles, it faces significant challenges including spatially varying noise and anisotropic resolution, where the axial resolution lags behind the lateral resolution by up to 4.5 times. Meanwhile, laser power is kept low to maintain cell viability, leading to inaccessible low-noise and high-resolution paired ground truth (GT). To tackle these limitations, a dual Cycle-consistent Diffusion is proposed to effectively mine intra-volume imaging priors within 3D cell volumes in an unsupervised manner, i.e., Volume Tells (VTCD), achieving de-noising and super-resolution (SR) simultaneously. Specifically, a spatially iso-distributed denoiser is designed to exploit the noise distribution consistency between adjacent low-noise and high-noise regions within the 3D cell volume, suppressing the spatially varying noise. Then, in light of the structural consistency of the cell volume, a cross-plane global-propagation SR module propagates high-resolution details from the XY plane into adjacent regions in the XZ and YZ planes, progressively enhancing resolution across the entire 3D cell volume. Experimental results on 10 in vivo cellular datasets demonstrate substantial improvements in both denoising and super-resolution, with axial resolution enhanced from ~430 nm to ~90 nm.
Submitted 3 March, 2025;
originally announced March 2025.
-
Rate Splitting Multiple Access for Simultaneous Lightwave Information and Power Transfer
Authors:
Zhengqing Qiu,
Yijie Mao
Abstract:
This paper initiates the application of rate splitting multiple access (RSMA) for simultaneous lightwave information and power transfer (SLIPT), where users need to decode information and harvest energy. We focus on a time-splitting (TS) mode where information decoding and energy harvesting are separated in two different phases. Based on the proposed system model, we design a constrained-concave-convex programming (CCCP) algorithm to solve the optimization problem of maximizing the worst-case rate among users subject to the harvested energy constraint at each user. Specifically, the proposed algorithm exploits transformation of the bilinear function, semidefinite relaxation (SDR), CCCP, and a penalty method to effectively deal with the non-convex constraints and objective function. Numerical results show that our proposed RSMA-aided SLIPT outperforms the existing baselines based on space-division multiple access (SDMA) and non-orthogonal multiple access (NOMA).
Submitted 1 March, 2025;
originally announced March 2025.
-
InspireMusic: Integrating Super Resolution and Large Language Model for High-Fidelity Long-Form Music Generation
Authors:
Chong Zhang,
Yukun Ma,
Qian Chen,
Wen Wang,
Shengkui Zhao,
Zexu Pan,
Hao Wang,
Chongjia Ni,
Trung Hieu Nguyen,
Kun Zhou,
Yidi Jiang,
Chaohong Tan,
Zhifu Gao,
Zhihao Du,
Bin Ma
Abstract:
We introduce InspireMusic, a framework integrating super resolution and a large language model for high-fidelity long-form music generation. A unified framework generates high-fidelity music, songs, and audio, which incorporates an autoregressive transformer with a super-resolution flow-matching model. This framework enables the controllable generation of high-fidelity long-form music at a higher sampling rate from both text and audio prompts. Our model differs from previous approaches, as we utilize an audio tokenizer with one codebook that contains richer semantic information, thereby reducing training costs and enhancing efficiency. This combination enables us to achieve high-quality audio generation with long-form coherence of up to $8$ minutes. Then, an autoregressive transformer model based on Qwen 2.5 predicts audio tokens. Next, we employ a super-resolution flow-matching model to generate high-sampling rate audio with fine-grained details learned from an acoustic codec model. Comprehensive experiments show that the InspireMusic-1.5B-Long model has a comparable performance to recent top-tier open-source systems, including MusicGen and Stable Audio 2.0, on subjective and objective evaluations. The code and pre-trained models are released at https://github.com/FunAudioLLM/InspireMusic.
Submitted 28 February, 2025;
originally announced March 2025.
-
Data-Importance-Aware Waterfilling for Adaptive Real-Time Communication in Computer Vision Applications
Authors:
Chunmei Xu,
Yi Ma,
Rahim Tafazolli
Abstract:
This paper presents a novel framework for importance-aware adaptive data transmission, designed specifically for real-time computer vision (CV) applications where task-specific fidelity is critical. An importance-weighted mean square error (IMSE) metric is introduced, assigning data importance based on bit positions within pixels and semantic relevance within visual segments, thus providing a task-oriented measure of reconstruction quality. To minimize IMSE under the total power constraint, a data-importance-aware waterfilling approach is proposed to optimally allocate transmission power according to data importance and channel conditions. Simulation results demonstrate that the proposed approach significantly outperforms margin-adaptive waterfilling and equal power allocation strategies, achieving more than $7$ dB and $10$ dB gains in normalized IMSE at high SNRs ($> 10$ dB), respectively. These results highlight the potential of the proposed framework to enhance data efficiency and robustness in real-time CV applications, especially in bandwidth-limited and resource-constrained environments.
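For readers unfamiliar with waterfilling, the sketch below shows the classic textbook rule that maximizes the sum of log2(1 + g_i p_i) under a total power budget. It is not the data-importance-aware variant proposed in the paper, whose weighting is not given in the abstract; the function name `waterfill` and the example gains are hypothetical, and the code assumes strictly positive gains and power.

```python
def waterfill(gains, total_power):
    """Classic water-filling: p_i = max(0, mu - 1/g_i) with sum(p_i) = total_power."""
    # Inverse gains act as the "floor" heights; sort them from lowest to highest.
    floors = sorted(1.0 / g for g in gains)
    # Try using the k best channels, dropping the worst until all k get positive power.
    for k in range(len(floors), 0, -1):
        mu = (total_power + sum(floors[:k])) / k  # candidate water level
        if mu > floors[k - 1]:
            break
    return [max(0.0, mu - 1.0 / g) for g in gains]


# Hypothetical channel power gains and a unit power budget.
powers = waterfill([2.0, 1.0, 0.1], total_power=1.0)
```

In this example the weakest channel sits above the water level and receives no power, while the two stronger channels share the budget as 0.75 and 0.25. An importance-aware scheme would additionally bias this split toward the data that matters most for the task.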
Submitted 28 February, 2025;
originally announced February 2025.
-
Transfer Learning Assisted Fast Design Migration Over Technology Nodes: A Study on Transformer Matching Network
Authors:
Chenhao Chu,
Yuhao Mao,
Hua Wang
Abstract:
In this study, we introduce an innovative methodology for the design of mm-Wave passive networks that leverages knowledge transfer from a pre-trained synthesis neural network (NN) model in one technology node and achieves swift and reliable design adaptation across different integrated circuit (IC) technologies, operating frequencies, and metal options. We prove this concept through simulation-based demonstrations focusing on the training and comparison of the coefficient of determination (R2) of synthesis NNs for 1:1 on-chip transformers in GlobalFoundries (GF) 22nm FDX+ (target domain), with and without transfer learning from a model trained in GF 45nm SOI (source domain). In the experiments, we explore varying target data densities of 0.5%, 1%, 5%, and 100% with a complete dataset of 0.33 million in GF 22FDX+, and for comparative analysis, apply source data densities of 25%, 50%, 75%, and 100% with a complete dataset of 2.5 million in GF 45SOI. With the source data only at 30 GHz, the experiments span target data from two metal options in GF 22FDX+ at frequencies of 30 and 39 GHz. The results prove that transfer learning with the source domain knowledge (GF 45SOI) can both accelerate the training process in the target domain (GF 22FDX+) and improve the R2 values compared to models without knowledge transfer. Furthermore, it is observed that a model trained with just 5% of target data and augmented by transfer learning achieves R2 values superior to a model trained with 20% of the data without transfer, validating the advantage seen from 1% to 5% data density. This demonstrates a notable reduction of 4X in the necessary dataset size, highlighting the efficacy of applying transfer learning to mm-Wave passive network design. The PyTorch learning and testing code is publicly available at https://github.com/ChenhaoChu/RFIC-TL.
Submitted 11 March, 2025; v1 submitted 25 February, 2025;
originally announced February 2025.
-
Audio-FLAN: A Preliminary Release
Authors:
Liumeng Xue,
Ziya Zhou,
Jiahao Pan,
Zixuan Li,
Shuai Fan,
Yinghao Ma,
Sitong Cheng,
Dongchao Yang,
Haohan Guo,
Yujia Xiao,
Xinsheng Wang,
Zixuan Shen,
Chuanbo Zhu,
Xinshen Zhang,
Tianchi Liu,
Ruibin Yuan,
Zeyue Tian,
Haohe Liu,
Emmanouil Benetos,
Ge Zhang,
Yike Guo,
Wei Xue
Abstract:
Recent advancements in audio tokenization have significantly enhanced the integration of audio capabilities into large language models (LLMs). However, audio understanding and generation are often treated as distinct tasks, hindering the development of truly unified audio-language models. While instruction tuning has demonstrated remarkable success in improving generalization and zero-shot learning across text and vision, its application to audio remains largely unexplored. A major obstacle is the lack of comprehensive datasets that unify audio understanding and generation. To address this, we introduce Audio-FLAN, a large-scale instruction-tuning dataset covering 80 diverse tasks across speech, music, and sound domains, with over 100 million instances. Audio-FLAN lays the foundation for unified audio-language models that can seamlessly handle both understanding (e.g., transcription, comprehension) and generation (e.g., speech, music, sound) tasks across a wide range of audio domains in a zero-shot manner. The Audio-FLAN dataset is available on HuggingFace and GitHub and will be continuously updated.
Submitted 23 February, 2025;
originally announced February 2025.
-
Importance-Aware Source-Channel Coding for Multi-Modal Task-Oriented Semantic Communication
Authors:
Yi Ma,
Chunmei Xu,
Zhenyu Liu,
Siqi Zhang,
Rahim Tafazolli
Abstract:
This paper explores the concept of information importance in multi-modal task-oriented semantic communication systems, emphasizing the need for high accuracy and efficiency to fulfill task-specific objectives. At the transmitter, generative AI (GenAI) is employed to partition visual data objects into semantic segments, each representing distinct, task-relevant information. These segments are subsequently encoded into tokens, enabling precise and adaptive transmission control. Building on this framework, we present importance-aware source and channel coding strategies that dynamically adjust to varying levels of significance at the segment, token, and bit levels. The proposed strategies prioritize high fidelity for essential information while permitting controlled distortion for less critical elements, optimizing overall resource utilization. Furthermore, we address the source-channel coding challenge in semantic multiuser systems, particularly in multicast scenarios, where segment importance varies among receivers. To tackle these challenges, we propose solutions such as rate-splitting coded progressive transmission, ensuring flexibility and robustness in task-specific semantic communication.
Submitted 22 February, 2025;
originally announced February 2025.
-
Advancing User-Voice Interaction: Exploring Emotion-Aware Voice Assistants Through a Role-Swapping Approach
Authors:
Yong Ma,
Yuchong Zhang,
Di Fu,
Stephanie Zubicueta Portales,
Danica Kragic,
Morten Fjeld
Abstract:
As voice assistants (VAs) become increasingly integrated into daily life, the need for emotion-aware systems that can recognize and respond appropriately to user emotions has grown. While significant progress has been made in speech emotion recognition (SER) and sentiment analysis, effectively addressing user emotions, particularly negative ones, remains a challenge. This study explores human emotional response strategies in VA interactions using a role-swapping approach, where participants regulate AI emotions rather than receiving pre-programmed responses. Through speech feature analysis and natural language processing (NLP), we examined acoustic and linguistic patterns across various emotional scenarios. Results show that participants favor neutral or positive emotional responses when engaging with negative emotional cues, highlighting a natural tendency toward emotional regulation and de-escalation. Key acoustic indicators such as root mean square (RMS), zero-crossing rate (ZCR), and jitter were identified as sensitive to emotional states, while sentiment polarity and lexical diversity (TTR) distinguished between positive and negative responses. These findings provide valuable insights for developing adaptive, context-aware VAs capable of delivering empathetic, culturally sensitive, and user-aligned responses. By understanding how humans naturally regulate emotions in AI interactions, this research contributes to the design of more intuitive and emotionally intelligent voice assistants, enhancing user trust and engagement in human-AI interactions.
Submitted 21 February, 2025;
originally announced February 2025.
-
Generative Video Semantic Communication via Multimodal Semantic Fusion with Large Model
Authors:
Hang Yin,
Li Qiao,
Yu Ma,
Shuo Sun,
Kan Li,
Zhen Gao,
Dusit Niyato
Abstract:
Despite significant advancements in traditional syntactic communications based on Shannon's theory, these methods struggle to meet the requirements of 6G immersive communications, especially under challenging transmission conditions. With the development of generative artificial intelligence (GenAI), progress has been made in reconstructing videos using high-level semantic information. In this paper, we propose a scalable generative video semantic communication framework that extracts and transmits semantic information to achieve high-quality video reconstruction. Specifically, at the transmitter, description and other condition signals (e.g., first frame, sketches, etc.) are extracted from the source video, functioning as text and structural semantics, respectively. At the receiver, the diffusion-based GenAI large models are utilized to fuse the semantics of the multiple modalities for reconstructing the video. Simulation results demonstrate that, at an ultra-low channel bandwidth ratio (CBR), our scheme effectively captures semantic information to reconstruct videos aligned with human perception under different signal-to-noise ratios. Notably, the proposed "First Frame+Desc." scheme consistently achieves a CLIP score exceeding 0.92 at CBR = 0.0057 for SNR > 0 dB. This demonstrates its robust performance even under low SNR conditions.
Submitted 19 February, 2025;
originally announced February 2025.
-
Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction
Authors:
Ailin Huang,
Boyong Wu,
Bruce Wang,
Chao Yan,
Chen Hu,
Chengli Feng,
Fei Tian,
Feiyu Shen,
Jingbei Li,
Mingrui Chen,
Peng Liu,
Ruihang Miao,
Wang You,
Xi Chen,
Xuerui Yang,
Yechang Huang,
Yuxiang Zhang,
Zheng Gong,
Zixin Zhang,
Hongyu Zhou,
Jianjian Sun,
Brian Li,
Chengting Feng,
Changyi Wan,
Hanpeng Hu
, et al. (120 additional authors not shown)
Abstract:
Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as high costs in voice data collection, weakness in dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contributions include: 1) a 130B-parameter unified speech-text multi-modal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice cloning framework and produces the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine control system enabling dynamic adjustments across dialects, emotions, singing, and RAP; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to manage complex tasks effectively. Based on our new StepEval-Audio-360 evaluation benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in terms of instruction following. On open-source benchmarks like LLaMA Question, it shows a 9.3% average performance improvement, demonstrating our commitment to advancing the development of open-source multi-modal language technologies. Our code and models are available at https://github.com/stepfun-ai/Step-Audio.
Submitted 18 February, 2025; v1 submitted 17 February, 2025;
originally announced February 2025.
-
ELAA-ISAC: Environmental Mapping Utilizing the LoS State of Communication Channel
Authors:
Jiuyu Liu,
Chunmei Xu,
Yi Ma,
Rahim Tafazolli,
Ahmed Elzanaty
Abstract:
In this paper, a novel environmental mapping method is proposed to outline the indoor layout utilizing the line-of-sight (LoS) state information of extremely large aperture array (ELAA) channels. It leverages the spatial resolution provided by ELAA and the mobile terminal (MT)'s mobility to infer the presence and location of obstacles in the environment. The LoS state estimation is formulated as a binary hypothesis testing problem, and the optimal decision rule is derived based on the likelihood ratio test. Subsequently, the theoretical error probability of LoS estimation is derived, showing close alignment with simulation results. Then, an environmental mapping method is proposed, which progressively outlines the layout by combining LoS state information from multiple MT locations. It is demonstrated that the proposed method can accurately outline the environment layout, with the mapping accuracy improving as the number of service-antennas and MT locations increases. This paper also investigates the impact of channel estimation error and non-LoS (NLoS) components on the quality of environmental mapping. The proposed method exhibits particularly promising performance in LoS dominated wireless environments characterized by high Rician K-factor. Specifically, it achieves an average intersection over union (IoU) exceeding 80% when utilizing 256 service antennas and 18 MT locations.
Submitted 14 February, 2025;
originally announced February 2025.
-
Exploiting Non-uniform Quantization for Enhanced ILC in Wideband Digital Pre-distortion
Authors:
Jinfei Wang,
Yi Ma,
Fei Tong,
Ziming He
Abstract:
In this paper, it is identified that lowering the reference level at the vector signal analyzer can significantly improve the performance of iterative learning control (ILC). We present a mathematical explanation for this phenomenon, where the signals experience logarithmic transform prior to analogue-to-digital conversion, resulting in non-uniform quantization. This process reduces the quantization noise of low-amplitude signals that constitute a substantial portion of orthogonal frequency division multiplexing (OFDM) signals, thereby improving ILC performance. Measurement results show that compared to setting the reference level to the peak amplitude, lowering the reference level achieves 3 dB improvement on error vector magnitude (EVM) and 15 dB improvement on normalized mean square error (NMSE) for 320 MHz WiFi OFDM signals.
Submitted 28 February, 2025; v1 submitted 12 February, 2025;
originally announced February 2025.
-
UniDB: A Unified Diffusion Bridge Framework via Stochastic Optimal Control
Authors:
Kaizhen Zhu,
Mokai Pan,
Yuexin Ma,
Yanwei Fu,
Jingyi Yu,
Jingya Wang,
Ye Shi
Abstract:
Recent advances in diffusion bridge models leverage Doob's $h$-transform to establish fixed endpoints between distributions, demonstrating promising results in image translation and restoration tasks. However, these approaches frequently produce blurred or excessively smoothed image details and lack a comprehensive theoretical foundation to explain these shortcomings. To address these limitations, we propose UniDB, a unified framework for diffusion bridges based on Stochastic Optimal Control (SOC). UniDB formulates the problem through an SOC-based optimization and derives a closed-form solution for the optimal controller, thereby unifying and generalizing existing diffusion bridge models. We demonstrate that existing diffusion bridges employing Doob's $h$-transform constitute a special case of our framework, emerging when the terminal penalty coefficient in the SOC cost function tends to infinity. By incorporating a tunable terminal penalty coefficient, UniDB achieves an optimal balance between control costs and terminal penalties, substantially improving detail preservation and output quality. Notably, UniDB seamlessly integrates with existing diffusion bridge models, requiring only minimal code modifications. Extensive experiments across diverse image restoration tasks validate the superiority and adaptability of the proposed framework. Our code is available at https://github.com/UniDB-SOC/UniDB/.
Submitted 6 June, 2025; v1 submitted 8 February, 2025;
originally announced February 2025.
-
Direct Uplink Connectivity in Space MIMO Systems with THz and FSO Inter-Satellite Links
Authors:
Zohre Mashayekh Bakhsh,
Yasaman Omid,
Gaojie Chen,
Farbod Kayhan,
Yi Ma,
Rahim Tafazolli
Abstract:
This paper investigates uplink transmission from a single-antenna mobile phone to a cluster of satellites, emphasizing the role of inter-satellite links (ISLs) in facilitating cooperative signal detection. The study focuses on non-ideal ISLs, examining both terahertz (THz) and free-space optical (FSO) ISLs concerning their ergodic capacity. We present a practical scenario derived from the recent 3GPP standard, specifying the frequency band, bandwidth, user and satellite antenna gains, power levels, and channel characteristics in alignment with the latest 3GPP for non-terrestrial networks (NTN). Additionally, we propose a satellite selection method to identify the optimal satellite as the master node (MN), responsible for signal processing. This method takes into account both the user-satellite link and ISL channels. For the THz ISL analysis, we derive a closed-form approximation for ergodic capacity under two scenarios: one with instantaneous channel state information (CSI) and another with only statistical CSI shared between satellites. For the FSO ISL analysis, we present a closed-form approximate upper bound for ergodic capacity, accounting for the impact of pointing error loss. Furthermore, we evaluate the effects of different ISL frequencies and pointing errors on spectral efficiency. Simulation results demonstrate that multi-satellite multiple-input multiple-output (MIMO) satellite communication (SatCom) significantly outperforms single-satellite SatCom in terms of spectral efficiency. Additionally, our approximated upper bound for ergodic capacity closely aligns with results obtained from Monte Carlo simulations.
Submitted 2 February, 2025;
originally announced February 2025.
-
Joint Active and Passive Beamforming Optimization for Beyond Diagonal RIS-aided Multi-User Communications
Authors:
Xiaohua Zhou,
Tianyu Fang,
Yijie Mao
Abstract:
Benefiting from its capability to generalize existing reconfigurable intelligent surface (RIS) architectures and provide additional design flexibility via interactions between RIS elements, beyond-diagonal RIS (BD-RIS) has attracted considerable research interest recently. However, due to the symmetric and unitary passive beamforming constraint imposed on BD-RIS, existing joint active and passive beamforming optimization algorithms for BD-RIS either exhibit high computational complexity to achieve near-optimal solutions or rely on heuristic algorithms with substantial performance loss. In this paper, we address this issue by proposing an efficient optimization framework for BD-RIS assisted multi-user multi-antenna communication networks. Specifically, we solve the weighted sum rate maximization problem by introducing a novel beamforming optimization algorithm that alternately optimizes active and passive beamforming matrices using iterative closed-form solutions. Numerical results demonstrate that our algorithm significantly reduces computational complexity while ensuring a sub-optimal solution.
Submitted 17 January, 2025;
originally announced January 2025.