Electrical Engineering and Systems Science

Showing new listings for Friday, 5 September 2025

Total of 97 entries

New submissions (showing 35 of 35 entries)

[1] arXiv:2509.03543 [pdf, html, other]
Title: Latent Space Single-Pixel Imaging Under Low-Sampling Conditions
Chenyu Yuan
Subjects: Image and Video Processing (eess.IV); Optics (physics.optics)

In recent years, the introduction of deep learning into the field of single-pixel imaging has garnered significant attention. However, traditional networks often operate within the pixel space. To address this, we innovatively migrate single-pixel imaging to the latent space, naming this framework LSSPI (Latent Space Single-Pixel Imaging). Within the latent space, we conduct in-depth explorations into both reconstruction and generation tasks for single-pixel imaging. Notably, this approach significantly enhances imaging capabilities even under low sampling rate conditions. Compared to conventional deep learning networks, LSSPI not only reconstructs images with higher signal-to-noise ratios (SNR) and richer details under equivalent sampling rates but also enables blind denoising and effective recovery of high-frequency information. Furthermore, by migrating single-pixel imaging to the latent space, LSSPI achieves superior advantages in terms of model parameter efficiency and reconstruction speed. Its excellent computational efficiency further positions it as an ideal solution for low-sampling single-pixel imaging applications, effectively driving the practical implementation of single-pixel imaging technology.

[2] arXiv:2509.03685 [pdf, html, other]
Title: Data-Driven Smart Maintenance of Historic Buildings
Zhongjun Ni (Department of Science and Technology, Linköping University, Campus Norrköping, Norrköping, Sweden)
Comments: Doctoral thesis, Linköping Studies in Science and Technology. Dissertations, ISSN 0345-7524 ; 2444
Journal-ref: Linköping: Linköping University Electronic Press, 2025, p. 88
Subjects: Systems and Control (eess.SY)

Digital transformation in the built environment offers new opportunities to improve building maintenance through data-driven approaches. Smart monitoring, predictive modeling, and artificial intelligence can enhance decision-making and enable proactive strategies. The preservation of historic buildings is an important scenario where preventive maintenance is essential to ensure long-term sustainability while protecting heritage values. This thesis presents a comprehensive solution for data-driven smart maintenance of historic buildings, integrating Internet of Things (IoT), cloud computing, edge computing, ontology-based data modeling, and machine learning to improve indoor climate management, energy efficiency, and conservation practices.
This thesis advances data-driven conservation of historic buildings by combining smart monitoring, digital twins, and artificial intelligence. The proposed methods enable preventive maintenance and pave the way for the next generation of heritage conservation strategies.

[3] arXiv:2509.03686 [pdf, html, other]
Title: Multi-Sensor Fusion for Extended Object Tracking Exploiting Active and Passive Radio Signals
Hong Zhu, Alexander Venus, Erik Leitinger, Klaus Witrisal
Subjects: Signal Processing (eess.SP)

Reliable and robust positioning of radio devices remains a challenging task due to multipath propagation, hardware impairments, and interference from other radio transmitters. A frequently overlooked but critical factor is the agent itself, e.g., the user carrying the device, which potentially obstructs line-of-sight (LOS) links to the base stations (anchors). This paper addresses the problem of accurate positioning in scenarios where LOS links are partially blocked by the agent. The agent is modeled as an extended object (EO) that scatters, attenuates, and blocks radio signals. We propose a Bayesian method that fuses "active" measurements (between device and anchors) with "passive" multistatic radar-type measurements (between anchors, reflected by the EO). To handle measurement origin uncertainty, we introduce a multi-sensor and multiple-measurement probabilistic data association (PDA) algorithm that jointly fuses all EO-related measurements. Furthermore, we develop an EO model tailored to agents such as human users, accounting for multiple reflections scattered off the body surface, and propose a simplified variant for low-complexity implementation. Evaluation on both synthetic and real radio measurements demonstrates that the proposed algorithm outperforms conventional PDA methods based on point target assumptions, particularly during and after obstructed line-of-sight (OLOS) conditions.

[4] arXiv:2509.03694 [pdf, html, other]
Title: Parameter Tuning Under Uncertain Road Perception in Driver Assistance Systems
Leon Greiser, Christian Rathgeber, Vladislav Nenchev, Sören Hohmann
Subjects: Systems and Control (eess.SY)

Advanced driver assistance systems have improved comfort, safety, and efficiency of modern vehicles. However, sensor limitations lead to noisy lane estimates that pose a significant challenge in developing performant control architectures. Lateral trajectory planning often employs an optimal control formulation to maintain lane position and minimize steering effort. The parameters are often tuned manually, which is a time-intensive procedure. This paper presents an automatic parameter tuning method for lateral planning in lane-keeping scenarios based on recorded data, while taking into account noisy road estimates. By simulating the lateral vehicle behavior along a reference curve, our approach efficiently optimizes planner parameters for automated driving and demonstrates improved performance on previously unseen test data.

[5] arXiv:2509.03721 [pdf, html, other]
Title: Avoidance of an unexpected obstacle without reinforcement learning: Why not using advanced control-theoretic tools?
Cédric Join, Michel Fliess
Comments: IEEE 2025 - 13th International Conference on Systems and Control (ICSC) - October 22-24, 2025 - Marrakesh, Morocco
Subjects: Systems and Control (eess.SY); Robotics (cs.RO); Optimization and Control (math.OC)

This communication on collision avoidance with unexpected obstacles is motivated by some critical appraisals of reinforcement learning (RL), which "requires ridiculously large numbers of trials to learn any new task" (Yann LeCun). We use the classic Dubins' car in order to replace RL with flatness-based control, combined with the HEOL feedback setting, and the latest model-free predictive control approach. The two approaches lead to convincing computer experiments where the results with the model-based one are only slightly better. They exhibit a satisfactory robustness with respect to randomly generated mismatches/disturbances, which becomes excellent in the model-free case. Those properties would perhaps have been difficult to obtain with today's popular machine learning techniques in AI. Finally, we should emphasize that our two methods require a low computational burden.

[6] arXiv:2509.03789 [pdf, html, other]
Title: Decentralized Safety-Critical Control of Resilient DC Microgrids with Large-Signal Stability Guarantees
Muratkhan Abdirash, Xiaofan Cui
Subjects: Systems and Control (eess.SY)

The increasing penetration of distributed energy resources and power-electronics interfaces in DC microgrids, coupled with rising cyber threats, necessitates primary controllers that are provably safe, cyber-resilient, and practical. Conventional droop-based methods remain prevalent due to their simplicity, yet their design is largely empirical and conservative, lacking rigorous guarantees. Advanced strategies improve certain aspects, but often sacrifice scalability, robustness, or formal safety. In this work, we propose a Distributed Safety-Critical Controller (DSCC) that systematically integrates global stabilization with formal safety guarantees in a fully decentralized manner. Leveraging control barrier functions and the port-Hamiltonian system theory, the DSCC achieves scalable safe stabilization while preserving real-time implementability. High-fidelity switched-circuit simulations validate the controller's advantages under various contingencies. This framework paves the way for resilient, safety-critical, and scalable control in next-generation DC microgrids.

[7] arXiv:2509.03825 [pdf, html, other]
Title: Sensor placement for sparse force reconstruction
Jeunghoon Lee
Journal-ref: Mechanical Systems and Signal Processing 2025
Subjects: Signal Processing (eess.SP)

The present study proposes a Gram-matrix-based sensor placement strategy for sparse force reconstruction in the frequency domain. A modal decomposition of the Gram matrix reveals that its structure is dominated by a few modes near the target frequency, and that each modal contribution reflects the spatial correlation of the corresponding mode shape. This suggests that placing sensors near nodal regions where spatial correlation is low can reduce coherence in the frequency response function (FRF) matrix and improve force reconstruction accuracy. To translate the physical insight into a practical design framework, a greedy algorithm is proposed to select sensor locations that minimize the off-diagonal energy of the Gram matrix. Numerical simulations and experimental validations demonstrate that the proposed method yields robust and accurate force estimation, outperforming heuristic sensor layouts.
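
The greedy selection described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the abstract does not state the exact objective, so the column normalization of the Gram matrix (which turns the off-diagonal energy into a sum of squared coherences), the tie-breaking rule, and the names `gram`, `offdiag_energy`, and `greedy_placement` are all assumptions. `H` is a hypothetical FRF matrix with one row per candidate sensor location and one column per candidate force location.

```python
def gram(rows):
    """Gram matrix G = H^H H built from the selected sensor rows of H."""
    n = len(rows[0])
    return [[sum(r[i].conjugate() * r[j] for r in rows) for j in range(n)]
            for i in range(n)]


def offdiag_energy(rows):
    """Sum of |G_ij|^2 / (G_ii * G_jj) over i != j (assumed normalization)."""
    g = gram(rows)
    n = len(g)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            denom = abs(g[i][i]) * abs(g[j][j])
            if denom == 0.0:
                # Some force location is invisible to the selected sensors.
                return float("inf")
            total += abs(g[i][j]) ** 2 / denom
    return total


def greedy_placement(H, k):
    """Pick k rows of H, one at a time, each minimizing the off-diagonal energy."""
    chosen = []
    remaining = list(range(len(H)))
    for _ in range(k):
        best = min(remaining,
                   key=lambda c: offdiag_energy([H[i] for i in chosen + [c]]))
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Toy FRF matrix: 4 candidate sensor locations, 2 candidate force locations.
H = [[1.0, 1.0], [1.0, -1.0], [1.0, 0.5], [2.0, 2.0]]
selected = greedy_placement(H, 2)
```

With a single sensor the Gram matrix has rank one, so every candidate ties in the first step and the first row is taken; the second step then prefers the row that makes the two columns orthogonal.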

[8] arXiv:2509.03836 [pdf, html, other]
Title: On the Performance Analysis of Pinching-Antenna-Enabled SWIPT Systems
Bingxin Zhang, Han Zhang, Kun Yang, Yizhe Zhao, Kezhi Wang
Subjects: Systems and Control (eess.SY)

In this paper, we study the performance of a novel simultaneous wireless information and power transfer (SWIPT) system enabled by a flexible pinching antenna. To support flexible deployment and optimize energy-rate performance, we propose three practical pinching-antenna placement schemes: the edge deployment scheme (EDS), the center deployment scheme (CDS), and the diagonal deployment scheme (DDS). Moreover, a hybrid time-switching (TS) and power-splitting (PS) protocol is introduced, allowing dynamic adjustment between energy harvesting and information decoding. Under each deployment strategy and the transmission protocol, closed-form expressions for the average harvested energy and average achievable rate of a randomly located user equipment (UE) are derived based on the optimal positioning of the pinching antenna. Numerical simulations confirm the accuracy of the theoretical analysis and illustrate the trade-off between rate and energy harvesting under different schemes.

[9] arXiv:2509.03839 [pdf, html, other]
Title: Reservoir Predictive Path Integral Control for Unknown Nonlinear Dynamics
Daisuke Inoue, Tadayoshi Matsumori, Gouhei Tanaka, Yuji Ito
Comments: Submitted to IEEE for possible publication, 13 pages, 7 figures
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC); Chaotic Dynamics (nlin.CD)

Neural networks capable of approximating complex nonlinearities have found extensive application in data-driven control of nonlinear dynamical systems. However, fast online identification and control of unknown dynamics remain central challenges. This paper integrates echo-state networks (ESNs) -- reservoir computing models implemented with recurrent neural networks -- and model predictive path integral (MPPI) control -- sampling-based variants of model predictive control -- to meet these challenges. The proposed reservoir predictive path integral (RPPI) enables fast learning of nonlinear dynamics with ESN and exploits the learned nonlinearities directly in parallelized MPPI control computation without linearization approximations. The framework is further extended to uncertainty-aware RPPI (URPPI), which leverages ESN uncertainty to balance exploration and exploitation: exploratory inputs dominate during early learning, while exploitative inputs prevail as model confidence grows. Experiments on controlling the Duffing oscillator and four-tank systems demonstrate that URPPI improves control performance, reducing control costs by up to 60% compared to traditional quadratic programming-based model predictive control methods.
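
To make the sampling-based MPPI step concrete, here is a heavily simplified sketch. It is not the RPPI or URPPI algorithm: the learned ESN is replaced by a known scalar stand-in model `x <- x + dt*u`, only a state cost is used (the control-cost term of full MPPI is omitted), and the function names, sample count, noise level `sigma`, and temperature `lam` are illustrative assumptions.

```python
import math
import random


def rollout_cost(x0, controls, dt=0.1):
    """Accumulated squared state along a rollout of the stand-in model."""
    x, cost = x0, 0.0
    for u in controls:
        x = x + dt * u  # stand-in dynamics; the paper uses a learned ESN here
        cost += x * x
    return cost


def mppi_update(x0, nominal, n_samples=2000, sigma=1.0, lam=10.0, seed=0):
    """One simplified MPPI update of the nominal control sequence."""
    rng = random.Random(seed)
    horizon = len(nominal)
    noises, costs = [], []
    for _ in range(n_samples):
        eps = [rng.gauss(0.0, sigma) for _ in range(horizon)]
        noises.append(eps)
        costs.append(rollout_cost(x0, [u + e for u, e in zip(nominal, eps)]))
    base = min(costs)  # subtract the minimum cost for numerical stability
    weights = [math.exp(-(c - base) / lam) for c in costs]
    total = sum(weights)
    # Exponentially weighted average of the sampled perturbations.
    return [u + sum(w * eps[t] for w, eps in zip(weights, noises)) / total
            for t, u in enumerate(nominal)]


zero_controls = [0.0] * 10
updated = mppi_update(1.0, zero_controls)
```

Because each rollout is independent, the sampling loop is the part that parallelizes naturally, which is the property the abstract refers to.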

[10] arXiv:2509.03899 [pdf, html, other]
Title: Sample Efficient Certification of Discrete-Time Control Barrier Functions
Sampath Kumar Mulagaleti, Andrea Del Prete
Comments: 8 pages, accepted for publication in proceedings of IEEE CDC 2025
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)

Control Invariant (CI) sets are instrumental in certifying the safety of dynamical systems. Control Barrier Functions (CBFs) are effective tools to compute such sets, since the zero sublevel sets of CBFs are CI sets. However, computing CBFs generally involves addressing a complex robust optimization problem, which can be intractable. Scenario-based methods have been proposed to simplify this computation. One then needs to verify whether the CBF actually satisfies the robust constraints. We present an approach to perform this verification that relies on Lipschitz arguments, and forms the basis of a certification algorithm designed for sample efficiency. Through a numerical example, we validate the efficiency of the proposed procedure.
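
The general idea behind a Lipschitz-based check can be illustrated in one dimension. The sketch below is a generic textbook-style argument, not the paper's certification algorithm: if `g` has Lipschitz constant `L` and every point of the domain lies within `radius` of some sample, then `g(x_i) + L * radius <= 0` at all samples implies `g(x) <= 0` everywhere. The function name, the midpoint grid, and the test function are assumptions.

```python
def certify_nonpositive(g, lo, hi, lipschitz, n_cells):
    """Try to certify g(x) <= 0 on [lo, hi] from n_cells midpoint samples.

    Returns True if the Lipschitz bound proves the inequality on the whole
    interval, and False if the check is inconclusive at this resolution.
    """
    width = (hi - lo) / n_cells
    radius = width / 2.0  # every x in [lo, hi] is within radius of a midpoint
    for k in range(n_cells):
        x = lo + (k + 0.5) * width
        if g(x) + lipschitz * radius > 0.0:
            return False  # inconclusive here, not a proof of violation
    return True


def g(x):
    # Example constraint; |g'(x)| = |2x| <= 2 on [-1, 1], so L = 2 is valid.
    return x * x - 1.05


coarse = certify_nonpositive(g, -1.0, 1.0, lipschitz=2.0, n_cells=4)
fine = certify_nonpositive(g, -1.0, 1.0, lipschitz=2.0, n_cells=10)
```

The coarse grid is inconclusive while the finer one succeeds, which is why the number of samples needed for a certificate is the quantity of interest.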

[11] arXiv:2509.03902 [pdf, html, other]
Title: Hierarchical Sparse Sound Field Reconstruction with Spherical and Linear Microphone Arrays
Shunxi Xu, Craig T. Jin
Comments: Accepted by APSIPA ASC 2025
Subjects: Audio and Speech Processing (eess.AS)

Spherical microphone arrays (SMAs) are widely used for sound field analysis, and sparse recovery (SR) techniques can significantly enhance their spatial resolution by modeling the sound field as a sparse superposition of dominant plane waves. However, the spatial resolution of SMAs is fundamentally limited by their spherical harmonic order, and their performance often degrades in reverberant environments. This paper proposes a two-stage SR framework with residue refinement that integrates observations from a central SMA and four surrounding linear microphone arrays (LMAs). The core idea is to exploit complementary spatial characteristics by treating the SMA as a primary estimator and the LMAs as a spatially complementary refiner. Simulation results demonstrate that the proposed SMA-LMA method significantly enhances spatial energy map reconstruction under varying reverberation conditions, compared to both SMA-only and direct one-step joint processing. These results demonstrate the effectiveness of the proposed framework in enhancing spatial fidelity and robustness in complex acoustic environments.

[12] arXiv:2509.03979 [pdf, html, other]
Title: A Low-Cost Open-Source BLE-Based Asian Hornet Tracking System
Gilles Callebaut, Jan Van Moer
Subjects: Signal Processing (eess.SP)

The Asian hornet (Vespa velutina) poses a serious threat to ecosystems and beekeeping. Locating nests is essential, but usually involves time-consuming manual triangulation. We present a low-cost, open-source tracking system based on Bluetooth Low Energy (BLE). The system consists of a lightweight BLE tag and a software-defined radio (SDR) receiver implemented in GNU Radio. By bypassing the BLE stack, we embed a custom pseudo-noise (PN) sequence in the uncoded PHY for correlation-based detection. Using a Yagi antenna and PlutoSDR, the receiver performs digital beam sweeping to determine the tag's direction. Field tests show reliable angular resolution at 50 m and a communication range up to 360 m. While our modulation increases receiver complexity, it enables future improvements such as multichannel spreading and tag identification. The design is fully open-source and provides a scalable framework for hornet tracking and related applications in environmental monitoring.
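
Correlation-based detection of a PN sequence can be sketched at baseband as below. This is an illustration only: the abstract does not give the sequence, its length, or the receiver chain, so the PRBS7 generator (polynomial x^7 + x^6 + 1), the ±1 chip mapping, the noise level, and the names `prbs7` and `detect` are assumptions, and the GFSK modulation of the BLE PHY is ignored.

```python
import random


def prbs7(n, seed=0x7F):
    """First n bits of a PRBS7 sequence (period 127)."""
    state = seed & 0x7F
    bits = []
    for _ in range(n):
        newbit = ((state >> 6) ^ (state >> 5)) & 1
        state = ((state << 1) | newbit) & 0x7F
        bits.append(newbit)
    return bits


def detect(rx, chips):
    """Slide the chip template over rx; return (best offset, correlation)."""
    best_off, best_val = 0, float("-inf")
    for off in range(len(rx) - len(chips) + 1):
        val = sum(r * c for r, c in zip(rx[off:], chips))
        if val > best_val:
            best_off, best_val = off, val
    return best_off, best_val


# Map bits to +/-1 chips and bury one copy of the sequence in Gaussian noise.
chips = [1.0 if b else -1.0 for b in prbs7(127)]
rng = random.Random(0)
rx = [rng.gauss(0.0, 0.5) for _ in range(400)]
true_offset = 40  # hypothetical arrival position of the tag's sequence
for i, c in enumerate(chips):
    rx[true_offset + i] += c

offset, peak = detect(rx, chips)
```

The sharp autocorrelation peak of a maximal-length sequence is what lets the receiver pick out the tag even when individual chips are noisy.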

[13] arXiv:2509.03980 [pdf, html, other]
Title: Approximate Message Passing for Multi-Preamble Detection in OTFS Random Access
Alessandro Mirri, Vishnu Teja Kunde, Enrico Paolini, Jean-Francois Chamberland
Subjects: Signal Processing (eess.SP)

This article addresses the problem of multiple preamble detection in random access systems based on orthogonal time frequency space (OTFS) signaling. This challenge is formulated as a structured sparse recovery problem in the complex domain. To tackle it, the authors propose a new approximate message passing (AMP) algorithm that enforces double sparsity: the sparse selection of preambles and the inherent sparsity of OTFS signals in the delay-Doppler domain. From an algorithmic standpoint, the non-separable complex sparsity constraint necessitates a careful derivation and leads to the design of a novel AMP denoiser. Simulation results demonstrate that the proposed method achieves robust detection performance and delivers significant gains over state-of-the-art techniques.

[14] arXiv:2509.03983 [pdf, html, other]
Title: Joint Frequency-Space Sparse Reconstruction for DOA Estimation under Coherent Sources and Amplitude-Phase Errors
Yutong Chen, Cong Zhou, Changsheng You, Shuo Shi
Subjects: Signal Processing (eess.SP)

In this letter, we propose a joint frequency-space sparse reconstruction method for direction-of-arrival (DOA) estimation, which effectively addresses the issues arising from the existence of coherent sources and array amplitude-phase errors. Specifically, by using an auxiliary source with known angles, we first construct the real steering vectors (RSVs) based on the spectral peaks of received signals in the frequency domain, which serve as a complete basis matrix for compensation for amplitude-phase errors. Then, we leverage the spectral sparsity of snapshot data in the frequency domain and the spatial sparsity of incident directions to perform the DOA estimation according to the sparse reconstruction method. The proposed method does not require iterative optimization, hence exhibiting low computational complexity. Numerical results demonstrate that the proposed DOA estimation method achieves higher estimation accuracy for coherent sources as compared to various benchmark schemes.

[15] arXiv:2509.04005 [pdf, html, other]
Title: Robust MIMO Semantic Communication with Imperfect CSI via Knowledge Distillation
Mingze Gong, Shuoyao Wang, Shijian Gao, Jia Yan, Suzhi Bi
Subjects: Signal Processing (eess.SP)

Semantic communication (SemComm) has emerged as a new communication paradigm. To enhance efficiency, multiple-input-multiple-output (MIMO) technology has been further integrated into SemComm systems. However, existing MIMO SemComm systems assume perfect channel matrix estimation for channel-adaptive joint source-channel coding, which is impractical due to hardware and pilot overhead constraints. In this paper, we propose a semantic image transmission system with channel matrix and channel noise adaptation, named HANA-JSCC, to cope with channel estimation errors in MIMO systems. We propose a channel matrix adaptor that collaborates with the channel codec to adapt to misaligned channel state information, thereby mitigating the impact of estimation errors. Since the relationship between the estimated channel matrix and true channel matrix is ill-posed (one-to-many), we further introduce a two-stage training strategy with knowledge distillation to overcome the convergence difficulties caused by the ill-posed problem. Compared with state-of-the-art benchmarks, HANA-JSCC achieves 0.40 to 0.54 dB higher average performance across various noise and estimation error levels on various datasets.

[16] arXiv:2509.04051 [pdf, html, other]
Title: Neural Video Compression with In-Loop Contextual Filtering and Out-of-Loop Reconstruction Enhancement
Yaojun Wu, Chaoyi Lin, Yiming Wang, Semih Esenlik, Zhaobin Zhang, Kai Zhang, Li Zhang
Comments: 9 pages, 8 figures, Accepted to ACMMM 2025
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)

This paper explores the application of enhancement filtering techniques in neural video compression. Specifically, we categorize these techniques into in-loop contextual filtering and out-of-loop reconstruction enhancement based on whether the enhanced representation affects the subsequent coding loop. In-loop contextual filtering refines the temporal context by mitigating error propagation during frame-by-frame encoding. However, its influence on both the current and subsequent frames poses challenges in adaptively applying filtering throughout the sequence. To address this, we introduce an adaptive coding decision strategy that dynamically determines filtering application during encoding. Additionally, out-of-loop reconstruction enhancement is employed to refine the quality of reconstructed frames, providing a simple yet effective improvement in coding efficiency. To the best of our knowledge, this work presents the first systematic study of enhancement filtering in the context of conditional-based neural video compression. Extensive experiments demonstrate a 7.71% reduction in bit rate compared to state-of-the-art neural video codecs, validating the effectiveness of the proposed approach.

[17] arXiv:2509.04055 [pdf, html, other]
Title: Constellation Shaping for OFDM-ISAC Systems: From Theoretical Bounds to Practical Implementation
Benedikt Geiger, Fan Liu, Shihang Lu, Andrej Rode, Daniel Gil Gaviria, Charlotte Muth, Laurent Schmalen
Comments: 13 pages, 14 figures, Submitted to IEEE Transactions on Communications (TCOM) for peer review
Subjects: Signal Processing (eess.SP)

Integrated sensing and communications (ISAC) promises new use cases for mobile communication systems by reusing the communication signal for radar-like sensing. However, sensing and communications (S&C) impose conflicting requirements on the modulation format, resulting in a tradeoff between their corresponding performance. This paper investigates constellation shaping as a means to simultaneously improve S&C performance in orthogonal frequency division multiplexing (OFDM)-based ISAC systems. We begin by deriving how the transmit symbols affect detection performance and derive theoretical lower and upper bounds on the maximum achievable information rate under a given sensing constraint. Using an autoencoder-based optimization, we investigate geometric, probabilistic, and joint constellation shaping, where joint shaping combines both approaches, employing both optimal maximum a-posteriori decoding and practical bit-metric decoding. Our results show that constellation shaping enables a flexible trade-off between S&C, can approach the derived upper bound, and significantly outperforms conventional modulation formats. Motivated by its practical implementation feasibility, we review probabilistic amplitude shaping (PAS) and propose a generalization tailored to ISAC. For this generalization, we propose a low-complexity log-likelihood ratio computation with negligible rate loss. We demonstrate that combining conventional and generalized PAS enables a flexible and low-complexity tradeoff between S&C, closely approaching the performance of joint constellation shaping.

[18] arXiv:2509.04060 [pdf, html, other]
Title: Physics-Informed Detection of Friction Anomalies in Satellite Reaction Wheels
Alejandro Penacho Riveiros, Nicola Bastianello, Karl H. Johansson, Matthieu Barreau
Subjects: Systems and Control (eess.SY)

As the number of satellites in orbit has increased exponentially in recent years, ensuring their correct functionality has started to require automated methods to decrease human workload. In this work, we present an algorithm that analyzes the on-board data related to friction from the Reaction Wheel Assemblies (RWA) of a satellite and determines their operating status, distinguishing between nominal status and several possible anomalies that require preventive measures to be taken. The algorithm first uses a model based on hybrid systems theory to extract the information relevant to the problem. The extraction process combines techniques in changepoint detection, dynamic programming, and maximum likelihood in a structured way. A classifier then uses the extracted information to determine the status of the RWA. This last classifier has been previously trained with a labelled dataset produced by a high-fidelity simulator, comprised for the most part of nominal data. The final algorithm combines model-based and data-based approaches to obtain satisfactory results with an accuracy around 95%.

[19] arXiv:2509.04072 [pdf, html, other]
Title: LibriQuote: A Speech Dataset of Fictional Character Utterances for Expressive Zero-Shot Speech Synthesis
Gaspard Michel, Elena V. Epure, Christophe Cerisara
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)

Text-to-speech (TTS) systems have recently achieved more expressive and natural speech synthesis by scaling to large speech datasets. However, the proportion of expressive speech in such large-scale corpora is often unclear. Besides, existing expressive speech corpora are typically smaller in scale and primarily used for benchmarking TTS systems. In this paper, we introduce the LibriQuote dataset, an English corpus derived from read audiobooks, designed for both fine-tuning and benchmarking expressive zero-shot TTS systems. The training dataset includes 12.7K hours of read, non-expressive speech and 5.3K hours of mostly expressive speech drawn from character quotations. Each utterance in the expressive subset is supplemented with the context in which it was written, along with pseudo-labels of speech verbs and adverbs used to describe the quotation (e.g., "he whispered softly"). Additionally, we provide a challenging 7.5-hour test set intended for benchmarking TTS systems: given a neutral reference speech as input, we evaluate a system's ability to synthesize an expressive utterance while preserving reference timbre. We qualitatively validate the test set by showing that it covers a wide range of emotions compared to non-expressive speech, along with various accents. Extensive subjective and objective evaluations show that fine-tuning a baseline TTS system on LibriQuote significantly improves its synthesized speech intelligibility, and that recent systems fail to synthesize speech as expressive and natural as the ground-truth utterances. The dataset and evaluation code are freely available. Audio samples can be found at this https URL.

[20] arXiv:2509.04096 [pdf, html, other]
Title: Low-Power Impact Detection and Localization on Forklifts Using Wireless IMU Sensors
Lyssa Ramaut, Chesney Buyle, Jona Cappelle, Liesbet Van der Perre
Comments: This paper is accepted in IEEE Sensors 2025
Subjects: Systems and Control (eess.SY)

Forklifts are essential for transporting goods in industrial environments. These machines face wear and tear during field operations, along with rough terrain, tight spaces and complex handling scenarios. This increases the likelihood of unintended impacts, such as collisions with goods, infrastructure, or other machinery. In addition, deliberate misuse has been reported, compromising safety and equipment integrity. This paper presents a low-cost and low-power impact detection system based on multiple wireless sensor nodes measuring 3D accelerations. These were deployed in a measurement campaign covering real-world operational scenarios. An algorithm was developed, based on the collected data, to differentiate high-impact events from normal usage and to localize detected collisions on the forklift. The solution successfully detects and localizes impacts, while maintaining low power consumption, enabling reliable forklift monitoring with multi-year sensor autonomy.

[21] arXiv:2509.04116 [pdf, html, other]
Title: Remote Estimation for Markov Jump Linear Systems: A Distributionally Robust Approach
Ioannis Tzortzis, Themistoklis Charalambous, Charalambos D. Charalambous
Subjects: Systems and Control (eess.SY)

This paper considers the problem of remote state estimation for Markov jump linear systems in the presence of uncertainty in the posterior mode probabilities. Such uncertainty may arise when the estimator receives noisy or incomplete measurements over an unreliable communication network. To address this challenge, the estimation problem is formulated within a distributionally robust framework, where the true posterior is assumed to lie within a total variation distance ball centered at the nominal posterior. The resulting minimax formulation yields an estimator that extends the classical MMSE solution with additional terms that account for mode uncertainty. A tractable implementation is developed using a distributionally robust variant of the first-order generalized pseudo-Bayesian algorithm. A numerical example is provided to illustrate the applicability and effectiveness of the approach.
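
The inner maximization over a total variation ball has a simple structure on a finite set, sketched below. This is only an illustration of that one step, not the paper's estimator: it assumes the convention TV(q, p) = 0.5 * sum|q_i - p_i|, a fixed per-mode loss vector, and the hypothetical function name `worst_case_expectation`. The worst-case distribution moves probability mass from the lowest-loss modes onto the highest-loss mode.

```python
def worst_case_expectation(p, loss, radius):
    """Maximize sum(q[i] * loss[i]) over q with 0.5 * sum|q - p| <= radius."""
    q = list(p)
    top = max(range(len(p)), key=lambda i: loss[i])
    # Mass that can be shifted onto the worst outcome without leaving the simplex.
    budget = min(radius, 1.0 - q[top])
    q[top] += budget
    # Remove the same amount of mass, starting from the lowest-loss outcomes.
    for i in sorted(range(len(p)), key=lambda i: loss[i]):
        if i == top or budget <= 0.0:
            continue
        take = min(q[i], budget)
        q[i] -= take
        budget -= take
    return sum(qi * li for qi, li in zip(q, loss)), q


nominal = [0.5, 0.3, 0.2]      # nominal posterior mode probabilities (made up)
mode_loss = [1.0, 2.0, 3.0]    # loss incurred under each mode (made up)
value, q_star = worst_case_expectation(nominal, mode_loss, radius=0.1)
```

With radius 0 the nominal expectation is recovered, and as the radius grows the worst case approaches the largest single-mode loss.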

[22] arXiv:2509.04118 [pdf, html, other]
Title: EHVC: Efficient Hierarchical Reference and Quality Structure for Neural Video Coding
Junqi Liao, Yaojun Wu, Chaoyi Lin, Zhipin Deng, Li Li, Dong Liu, Xiaoyan Sun
Comments: 9 pages, 8 figures, Accepted to ACMMM 2025
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)

Neural video codecs (NVCs), leveraging the power of end-to-end learning, have demonstrated remarkable coding efficiency improvements over traditional video codecs. Recent research has begun to pay attention to the quality structures in NVCs, optimizing them by introducing explicit hierarchical designs. However, less attention has been paid to the reference structure design, which fundamentally should be aligned with the hierarchical quality structure. In addition, there is still significant room for further optimization of the hierarchical quality structure. To address these challenges in NVCs, we propose EHVC, an efficient hierarchical neural video codec featuring three key innovations: (1) a hierarchical multi-reference scheme that draws on traditional video codec design to align reference and quality structures, thereby addressing the reference-quality mismatch; (2) a lookahead strategy to utilize an encoder-side context from future frames to enhance the quality structure; (3) a layer-wise quality scale with random quality training strategy to stabilize quality structures during inference. With these improvements, EHVC achieves significantly superior performance to the state-of-the-art NVCs. Code will be released at this https URL.

[23] arXiv:2509.04196 [pdf, html, other]
Title: Laplacian Flows in Complex-valued Directed Networks: Analysis, Design, and Consensus
Aditi Saxena, Twinkle Tripathy, Rajasekhar Anguluri
Subjects: Systems and Control (eess.SY)

In the interdisciplinary field of network science, a complex-valued network, with edges assigned complex weights, provides a more nuanced representation of relationships by capturing both the magnitude and phase of interactions. Additionally, an important application of this setting arises in distribution power grids. Motivated by the richer framework, we study the necessary and sufficient conditions for achieving consensus in both strongly and weakly connected digraphs. The paper establishes that complex-valued Laplacian flows converge to consensus subject to an additional constraint termed real dominance, which relies on the phase angles of the edge weights. Our approach builds on the complex Perron-Frobenius properties to study the spectral properties of the Laplacian and its relation to graphical conditions. Finally, we propose modified flows that guarantee consensus even if the original network does not converge to consensus. Additionally, we explore diffusion in complex-valued networks as a dual process of consensus and simulate our results on synthetic and real-world networks.

[24] arXiv:2509.04199 [pdf, html, other]
Title: On the Effect of Sampling-Time Jitter
Dieter Schwarzmann, Simon Käser
Comments: Submitted for review as letter in IEEE Journal for Transactions on Control Systems Technology
Subjects: Systems and Control (eess.SY); Signal Processing (eess.SP)

This brief, aimed at practitioners, offers an analysis of the effect of sampling-time jitter, i.e., the error produced by execution-time inaccuracies. We propose reinterpreting jitter-afflicted linear time-invariant systems through equivalent jitter-free analogs. By constructing a perceived system that absorbs the effects of timing perturbations into its dynamics, we find an affine scaling of jitter. We examine both measurement and implementation scenarios, demonstrating that the presence of jitter effectively scales the system matrices. Moreover, we observe that, in the Laplace domain, jitter can be interpreted as a frequency scaling.
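
The idea of a perceived system with scaled dynamics can be illustrated on a scalar example. The sketch below assumes a constant relative timing error delta (the abstract does not state the jitter model, so this is an assumption), samples x' = a*x at the actual period T(1 + delta), and recovers the pole that an observer who believes the period is T would identify. The function name and the numbers are hypothetical.

```python
import math


def perceived_pole(a, t_nominal, delta):
    """Pole identified by an observer who assumes the nominal sampling period.

    The true system is x' = a * x. Samples are actually taken every
    t_nominal * (1 + delta) seconds (assumed constant relative timing error).
    """
    t_actual = t_nominal * (1.0 + delta)
    ratio = math.exp(a * t_actual)        # true x[k+1] / x[k]
    return math.log(ratio) / t_nominal    # observer divides by the nominal period


a, t_nominal, delta = -2.0, 0.1, 0.05
print("true pole:     ", a)
print("perceived pole:", perceived_pole(a, t_nominal, delta))
```

Under this assumption the perceived pole equals (1 + delta) * a, so the timing error appears as a scaling of the system parameter rather than as an additive disturbance.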

[25] arXiv:2509.04213 [pdf, html, other]
Title: Sailing Towards Zero-Shot State Estimation using Foundation Models Combined with a UKF
Tobin Holtmann, David Stenger, Andres Posada-Moreno, Friedrich Solowjow, Sebastian Trimpe
Comments: Accepted for publication at CDC2025
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)

State estimation in control and systems engineering traditionally requires extensive manual system identification or data-collection effort. However, transformer-based foundation models in other domains have reduced data requirements by leveraging pre-trained generalist models. Ultimately, developing zero-shot foundation models of system dynamics could drastically reduce manual deployment effort. While recent work shows that transformer-based end-to-end approaches can achieve zero-shot performance on unseen systems, they are limited to sensor models seen during training. We introduce the foundation model unscented Kalman filter (FM-UKF), which combines a transformer-based model of system dynamics with analytically known sensor models via a UKF, enabling generalization across varying dynamics without retraining for new sensor configurations. We evaluate FM-UKF on a new benchmark of container ship models with complex dynamics, demonstrating a competitive accuracy, effort, and robustness trade-off compared to classical methods with approximate system knowledge and to an end-to-end approach. The benchmark and dataset are open sourced to further support future research in zero-shot state estimation via foundation models.

[26] arXiv:2509.04220 [pdf, html, other]
Title: Compatibility of Multiple Control Barrier Functions for Constrained Nonlinear Systems
Max H. Cohen, Eugene Lavretsky, Aaron D. Ames
Comments: To appear at IEEE CDC 2025
Subjects: Systems and Control (eess.SY); Robotics (cs.RO); Optimization and Control (math.OC)

Control barrier functions (CBFs) are a powerful tool for the constrained control of nonlinear systems; however, the majority of results in the literature focus on systems subject to a single CBF constraint, making it challenging to synthesize provably safe controllers that handle multiple state constraints. This paper presents a framework for constrained control of nonlinear systems subject to box constraints on the systems' vector-valued outputs using multiple CBFs. Our results illustrate that when the output has a vector relative degree, the CBF constraints encoding these box constraints are compatible, and the resulting optimization-based controller is locally Lipschitz continuous and admits a closed-form expression. Additional results are presented to characterize the degradation of nominal tracking objectives in the presence of safety constraints. Simulations of a planar quadrotor are presented to demonstrate the efficacy of the proposed framework.
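
To make the notion of compatible CBF constraints with a closed-form controller concrete, the sketch below treats the simplest possible case: a single integrator y' = u with a box constraint on y. This is an illustrative assumption, not the construction from the paper, and the function names and the gain alpha are hypothetical. The two CBFs h1 = y - y_min and h2 = y_max - y give a lower and an upper bound on u, and the minimiser of (u - u_nominal)^2 under both bounds is a clamp.

```python
def safe_input(y, u_nominal, y_min, y_max, alpha):
    """Closed-form safety filter for the single integrator y' = u.

    CBF constraints (illustrative, relative degree one):
        h1 = y - y_min  ->  u >= -alpha * (y - y_min)
        h2 = y_max - y  ->  u <=  alpha * (y_max - y)
    Inside the box, lower <= 0 <= upper, so the two constraints are compatible.
    """
    lower = -alpha * (y - y_min)
    upper = alpha * (y_max - y)
    return min(max(u_nominal, lower), upper)


def simulate(y0, u_nominal, y_min=-1.0, y_max=1.0, alpha=2.0, dt=0.01, steps=1000):
    """Forward-Euler rollout with a constant nominal input."""
    y, trajectory = y0, [y0]
    for _ in range(steps):
        y += dt * safe_input(y, u_nominal, y_min, y_max, alpha)
        trajectory.append(y)
    return trajectory


traj = simulate(y0=0.0, u_nominal=5.0)
print("largest output reached:", max(traj))
```

The nominal input pushes the output toward the upper bound, and the filter slows it down so the bound is approached but not crossed. The gap between the nominal and filtered inputs is a simple instance of the degradation of nominal objectives mentioned in the abstract.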

[27] arXiv:2509.04280 [pdf, other]
Title: Test-Time Adaptation for Speech Enhancement via Domain Invariant Embedding Transformation
Tobias Raichle, Niels Edinger, Bin Yang
Comments: This work has been submitted to the IEEE for possible publication
Subjects: Audio and Speech Processing (eess.AS)

Deep learning-based speech enhancement models achieve remarkable performance when test distributions match training conditions, but often degrade when deployed in unpredictable real-world environments with domain shifts. To address this challenge, we present LaDen (latent denoising), the first test-time adaptation method specifically designed for speech enhancement. Our approach leverages powerful pre-trained speech representations to perform latent denoising, approximating clean speech representations through a linear transformation of noisy embeddings. We show that this transformation generalizes well across domains, enabling effective pseudo-labeling for target domains without labeled target data. The resulting pseudo-labels enable effective test-time adaptation of speech enhancement models across diverse acoustic environments. We propose a comprehensive benchmark spanning multiple datasets with various domain shifts, including changes in noise types, speaker characteristics, and languages. Our extensive experiments demonstrate that LaDen consistently outperforms baseline methods across perceptual metrics, particularly for speaker and language domain shifts.

[28] arXiv:2509.04288 [pdf, html, other]
Title: Reinforcement Learning for Robust Ageing-Aware Control of Li-ion Battery Systems with Data-Driven Formal Verification
Rudi Coppola, Hovsep Touloujian, Pierfrancesco Ombrini, Manuel Mazo Jr
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)

Rechargeable lithium-ion (Li-ion) batteries are a ubiquitous element of modern technology. In the last decades, the production and design of such batteries and their adjacent embedded charging and safety protocols, denoted by Battery Management Systems (BMS), have taken centre stage. A fundamental challenge to be addressed is the trade-off between the speed of charging and the ageing behavior, resulting in the loss of capacity in the battery cell. We rely on a high-fidelity physics-based battery model and propose an approach to data-driven charging and safety protocol design. Following a Counterexample-Guided Inductive Synthesis scheme, we combine Reinforcement Learning (RL) with recent developments in data-driven formal methods to obtain a hybrid control strategy: RL is used to synthesise the individual controllers, and a data-driven abstraction guides their partitioning into a switched structure, depending on the initial output measurements of the battery. The resulting discrete selection among RL-based controllers, coupled with the continuous battery dynamics, realises a hybrid system. When a design meets the desired criteria, the abstraction provides probabilistic guarantees on the closed-loop performance of the cell.

[29] arXiv:2509.04308 [pdf, html, other]
Title: Learning Optimal Crew Dispatch for Grid Restoration Following an Earthquake
Farshad Amani, Faezeh Ardali, Amin Kargarian
Subjects: Systems and Control (eess.SY)

Post-disaster crew dispatch is a critical but computationally intensive task. Traditional mixed-integer linear programming methods often require minutes to several hours to compute solutions, leading to delays that hinder timely decision-making in highly dynamic restoration environments. To address this challenge, we propose a novel learning-based framework that integrates transformer architectures with deep reinforcement learning (DRL) to deliver near real-time decision support without compromising solution quality. Crew dispatch is formulated as a sequential decision-making problem under uncertainty, where transformers capture high-dimensional system states and temporal dependencies, while DRL enables adaptive and scalable decision-making. Earthquake-induced distribution network damage is first characterized using established seismic standards, followed by a scenario generation and reduction pipeline that aggregates probable outcomes into a single geospatial impact map. Conditioned on this map, the proposed framework generates second-level dispatch strategies, trained offline on simulated and historical events and deployed online for rapid response. In addition to substantial runtime improvements, the proposed method enhances system resilience by enabling faster and more effective recovery and restoration. Case studies, particularly on the 2869-bus European gas and power network, demonstrate that the method substantially accelerates restoration while maintaining high-quality solutions, underscoring its potential for practical deployment in large-scale disaster response.

[30] arXiv:2509.04309 [pdf, html, other]
Title: Reliable Clutter Suppression for Slow-Moving Weak Target Radar Detection
R. Zhang, J. Xue, T. Zhang
Comments: 25 pages, 20 figures, journal extended by an IEEE ICC conference article
Subjects: Signal Processing (eess.SP)

Reliable slow-moving weak target detection in complicated environments is challenging due to the masking effects from the surrounding strong reflectors. The traditional Moving Target Indication (MTI) may suppress the echoes from not only the static interference objects (IOs), but also the desired slow-moving weak target. According to the low-rank and sparse properties of the range-velocity maps across different radar scans, a novel clutter suppression scheme based on the Go decomposition (Godec) framework is proposed in this paper. The simulation results show that with the existence of masking effects, the target detection scheme based on Godec clutter suppression can reliably detect the slow-moving weak target, compared to the traditional MTI-based scheme. Besides, the time consumption comparison is conducted, demonstrating that the proposed solution is one that sacrifices time complexity in exchange for enhanced reliability. Additionally, the tradeoffs among the number of false alarm cells, the detection probability and the iteration times for convergence have been revealed, guiding parameter settings of the proposed solution in practical applications. Experiment validation is also conducted to verify the proposed solution, providing further insight into the scenarios where the solution is most applicable.

[31] arXiv:2509.04388 [pdf, html, other]
Title: Impact on transient stability of self-synchronisation control strategies in grid-forming VSC-based generators
Regulo E. Avila-Martinez, Xavier Guillaud, Javier Renedo, Luis Rouco, Aurelio Garcia-Cerrada, Lukas Sigrist
Comments: 36 pages, 18 figures, 6 tables
Subjects: Systems and Control (eess.SY)

Grid-forming voltage source converters (GFM-VSCs) are emerging as a solution for integrating renewable energy resources (RERs) into power systems. GFM-VSCs need a self-synchronisation strategy to ensure that all converters and generators in the power system are in synchronism and they reach the same frequency in steady state. The self-synchronisation strategy in GFM-VSCs that has received most attention in previous research is virtual synchronous machine (VSM) control. However, no systematic study of the effects on transient stability of different variants of this strategy has been carried out in previous work. This paper analyses and compares transient stability of four self-synchronisation strategies for GFM-VSCs: VSM without phase-locked loop (PLL), VSM with PLL, VSM without PLL using wash-out filter, and integral-proportional (IP) controller. The paper also analyses two different methods that can be applied to GFM-VSC self-synchronisation strategies to improve transient stability: the concept of virtual unsaturated active-power controller (VAPC), proposed in previous work, and an algorithm for frequency limitation in the GFM-VSC (FLC), which is proposed in this paper.

[32] arXiv:2509.04390 [pdf, html, other]
Title: Accelerated Interactive Auralization of Highly Reverberant Spaces using Graphics Hardware
Hannes Rosseel, Toon van Waterschoot
Comments: 8 pages, 6 figures, submitted to Journal of the Audio Engineering Society
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)

Interactive acoustic auralization allows users to explore virtual acoustic environments in real-time, enabling the acoustic recreation of concert halls or Historical Worship Spaces (HWS) that are either no longer accessible, acoustically altered, or impractical to visit. Interactive acoustic synthesis requires real-time convolution of input signals with a set of synthesis filters that model the space-time acoustic response of the space. The acoustics in concert halls and HWS are both characterized by a long reverberation time, resulting in synthesis filters containing many filter taps. As a result, the convolution process can be computationally demanding, introducing significant latency that limits the real-time interactivity of the auralization system. In this paper, the implementation of a real-time multichannel loudspeaker-based auralization system is presented. This system is capable of synthesizing the acoustics of highly reverberant spaces in real-time using GPU-acceleration. A comparison between traditional CPU-based convolution and GPU-accelerated convolution is presented, showing that the latter can achieve real-time performance with significantly lower latency. Additionally, the system integrates acoustic synthesis with acoustic feedback cancellation on the GPU, creating a unified loudspeaker-based auralization framework that minimizes processing latency.

[33] arXiv:2509.04399 [pdf, html, other]
Title: Leveraging Equivariances and Symmetries in the Control Barrier Function Synthesis
Adrian Wiltz, Dimos V. Dimarogonas
Comments: 15 pages
Subjects: Systems and Control (eess.SY); Robotics (cs.RO)

The synthesis of Control Barrier Functions (CBFs) often involves demanding computations or a meticulous construction. However, structural properties of the system dynamics and constraints have the potential to mitigate these challenges. In this paper, we explore how equivariances in the dynamics, loosely speaking a form of symmetry, can be leveraged in the CBF synthesis. Although CBFs are generally not inherently symmetric, we show how equivariances in the dynamics and symmetries in the constraints induce symmetries in CBFs derived through reachability analysis. This insight allows us to infer their CBF values across the entire domain from their values on a subset, leading to significant computational savings. Interestingly, equivariances can even be leveraged in the CBF synthesis for non-symmetric constraints. Specifically, we show how a partially known CBF can be leveraged together with equivariances to construct a CBF for various new constraints. Throughout the paper, we provide examples illustrating the theoretical findings. Furthermore, a numerical study investigates the computational gains from exploiting equivariances in the CBF synthesis.

[34] arXiv:2509.04412 [pdf, html, other]
Title: Relative Localization of UAV Swarms in GNSS-Denied Conditions
Guangyu Lei, Yuqi Ping, Tianhao Liang, Huahao Ding, Tingting Zhang
Comments: Manuscript submitted to IEEE Globecom 2025
Subjects: Signal Processing (eess.SP); Systems and Control (eess.SY)

Relative localization of unmanned aerial vehicle (UAV) swarms in global navigation satellite system (GNSS) denied environments is essential for emergency rescue and battlefield reconnaissance. Existing methods suffer from significant localization errors among UAVs due to packet loss and high computational complexity in large swarms. This paper proposes a clustering-based framework where the UAVs simultaneously use communication signals for channel estimation and ranging. First, spectral clustering is used to divide the UAV swarm into different sub-clusters, where matrix completion and multidimensional scaling yield high-precision relative coordinates. Subsequently, a global map is created by inter-cluster anchor fusion. A case study of a UAV integrated communication and sensing (ISAC) system is presented, where Orthogonal Time Frequency Space (OTFS) modulation is adopted for ranging and communication. Experimental results show that the proposed method reduces localization errors in large swarms and loss of range information. The paper also explores the impact of signal parameters on communication and localization, highlighting the interplay between communication and localization performance.

[35] arXiv:2509.04413 [pdf, html, other]
Title: SAFE--MA--RRT: Multi-Agent Motion Planning with Data-Driven Safety Certificates
Babak Esmaeili, Hamidreza Modares
Comments: Submitted to IEEE Transactions on Automation Science and Engineering
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Robotics (cs.RO); Optimization and Control (math.OC)

This paper proposes a fully data-driven motion-planning framework for homogeneous linear multi-agent systems that operate in shared, obstacle-filled workspaces without access to explicit system models. Each agent independently learns its closed-loop behavior from experimental data by solving convex semidefinite programs that generate locally invariant ellipsoids and corresponding state-feedback gains. These ellipsoids, centered along grid-based waypoints, certify the dynamic feasibility of short-range transitions and define safe regions of operation. A sampling-based planner constructs a tree of such waypoints, where transitions are allowed only when adjacent ellipsoids overlap, ensuring invariant-to-invariant transitions and continuous safety. All agents expand their trees simultaneously and are coordinated through a space-time reservation table that guarantees inter-agent safety by preventing simultaneous occupancy and head-on collisions. Each successful edge in the tree is equipped with its own local controller, enabling execution without re-solving optimization problems at runtime. The resulting trajectories are not only dynamically feasible but also provably safe with respect to both environmental constraints and inter-agent collisions. Simulation results demonstrate the effectiveness of the approach in synthesizing synchronized, safe trajectories for multiple agents under shared dynamics and constraints, using only data and convex optimization tools.
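
The space-time reservation table mentioned in the abstract can be sketched as a dictionary keyed by (cell, time step). The version below is a minimal illustration under assumed conventions: grid cells are tuples, every move takes exactly one time step, and the class and method names are hypothetical. It rejects a move when another agent already holds the destination at the arrival time (simultaneous occupancy) or when two agents would swap cells over the same step (head-on collision).

```python
class ReservationTable:
    """Space-time reservations keyed by (cell, time step). Illustrative only."""

    def __init__(self):
        self._occupied = {}  # (cell, t) -> agent id

    def _held_by_other(self, cell, t, agent):
        holder = self._occupied.get((cell, t))
        return holder is not None and holder != agent

    def is_free(self, agent, src, dst, t):
        """Can `agent` move from `src` at time t to `dst` at time t + 1?"""
        # Simultaneous occupancy: someone else holds src now or dst on arrival.
        if self._held_by_other(src, t, agent) or self._held_by_other(dst, t + 1, agent):
            return False
        # Head-on collision: another agent moves dst -> src over the same step.
        other = self._occupied.get((dst, t))
        if other is not None and other != agent:
            if self._occupied.get((src, t + 1)) == other:
                return False
        return True

    def reserve(self, agent, src, dst, t):
        """Record the move if it is conflict-free; return whether it succeeded."""
        if not self.is_free(agent, src, dst, t):
            return False
        self._occupied[(src, t)] = agent
        self._occupied[(dst, t + 1)] = agent
        return True


table = ReservationTable()
print(table.reserve("A", (0, 0), (0, 1), 0))  # A moves first
print(table.reserve("B", (0, 1), (0, 0), 0))  # B tries to swap cells with A
print(table.reserve("B", (1, 1), (1, 0), 0))  # B takes an unrelated move
```

In a planner, a tree edge would be added only if `reserve` succeeds. The ellipsoid-overlap check described in the abstract is a separate condition and is not modelled here.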

Cross submissions (showing 15 of 15 entries)

[36] arXiv:2509.03521 (cross-list from q-bio.NC) [pdf, html, other]
Title: BiND: A Neural Discriminator-Decoder for Accurate Bimanual Trajectory Prediction in Brain-Computer Interfaces
Timothee Robert, MohammadAli Shaeri, Mahsa Shoaran
Comments: Accepted for publication in IEEE Neural Engineering (NER) Conference'25
Subjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)

Decoding bimanual hand movements from intracortical recordings remains a critical challenge for brain-computer interfaces (BCIs), due to overlapping neural representations and nonlinear interlimb interactions. We introduce BiND (Bimanual Neural Discriminator-Decoder), a two-stage model that first classifies motion type (unimanual left, unimanual right, or bimanual) and then uses specialized GRU-based decoders, augmented with a trial-relative time index, to predict continuous 2D hand velocities. We benchmark BiND against six state-of-the-art models (SVR, XGBoost, FNN, CNN, Transformer, GRU) on a publicly available 13-session intracortical dataset from a tetraplegic patient. BiND achieves a mean $R^2$ of 0.76 ($\pm$0.01) for unimanual and 0.69 ($\pm$0.03) for bimanual trajectory prediction, surpassing the next-best model (GRU) by 2% in both tasks. It also demonstrates greater robustness to session variability than all other benchmarked models, with accuracy improvements of up to 4% compared to GRU in cross-session analyses. This highlights the effectiveness of task-aware discrimination and temporal modeling in enhancing bimanual decoding.

[37] arXiv:2509.03525 (cross-list from cs.CL) [pdf, other]
Title: Speech-Based Cognitive Screening: A Systematic Evaluation of LLM Adaptation Strategies
Fatemeh Taherinezhad, Mohamad Javad Momeni Nezhad, Sepehr Karimi, Sina Rashidi, Ali Zolnour, Maryam Dadkhah, Yasaman Haghbin, Hossein AzadMaleki, Maryam Zolnoori
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)

Over half of US adults with Alzheimer disease and related dementias remain undiagnosed, and speech-based screening offers a scalable detection approach. We compared large language model adaptation strategies for dementia detection, evaluating nine text-only models and three multimodal audio-text models on recordings from the DementiaBank speech corpus. Adaptations included in-context learning with different demonstration selection policies, reasoning-augmented prompting, parameter-efficient fine-tuning, and multimodal integration. Results showed that class-centroid demonstrations achieved the highest in-context learning performance, reasoning improved smaller models, and token-level fine-tuning generally produced the best scores. Adding a classification head substantially improved underperforming models. Among multimodal models, fine-tuned audio-text systems performed well but did not surpass the top text-only models. These findings highlight that model adaptation strategies, including demonstration selection, reasoning design, and tuning method, critically influence speech-based dementia detection, and that properly adapted open-weight models can match or exceed commercial systems.

[38] arXiv:2509.03526 (cross-list from cs.CL) [pdf, html, other]
Title: Enhancing Speech Large Language Models through Reinforced Behavior Alignment
Yansong Liu, Jiateng Li, Yuan Liu
Subjects: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

The recent advancements of Large Language Models (LLMs) have spurred considerable research interest in extending their linguistic capabilities beyond text to other modalities, which has led to the emergence of speech-based LLMs (SpeechLMs) capable of processing user requests in either speech or textual formats. However, owing to inter-modal discrepancies, these SpeechLMs still exhibit a significant performance gap compared to their text-based LLM counterparts in instruction-following, particularly when confronted with the dynamic and variable nature of user speech. To address this challenge, this paper introduces a framework termed Reinforced Behavior Alignment (RBA), designed to bolster the language generation proficiency of SpeechLMs. Instead of relying on supervised fine-tuning from human annotations, RBA employs a self-synthesis methodology to generate extensive, high-fidelity alignment data using a powerful teacher LLM. The SpeechLM's behavior is then aligned with that of the teacher using a reinforcement learning-based approach. Experimental results demonstrate that this method effectively enhances the instruction-following capabilities of SpeechLMs, which outperform conventional distillation baselines. Crucially, we demonstrate that RBA can be seamlessly extended to tasks including spoken question answering and speech-to-text translation, attaining state-of-the-art performance on open benchmarks with only self-generated data.

[39] arXiv:2509.03529 (cross-list from cs.CL) [pdf, html, other]
Title: Multimodal Proposal for an AI-Based Tool to Increase Cross-Assessment of Messages
Alejandro Álvarez Castro, Joaquín Ordieres-Meré
Comments: Presented at NLMLT2025 (this https URL), 15 pages, 5 figures
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)

Earnings calls represent a uniquely rich and semi-structured source of financial communication, blending scripted managerial commentary with unscripted analyst dialogue. Although recent advances in financial sentiment analysis have integrated multi-modal signals, such as textual content and vocal tone, most systems rely on flat document-level or sentence-level models, failing to capture the layered discourse structure of these interactions. This paper introduces a novel multi-modal framework designed to generate semantically rich and structurally aware embeddings of earnings calls, by encoding them as hierarchical discourse trees. Each node, comprising either a monologue or a question-answer pair, is enriched with emotional signals derived from text, audio, and video, as well as structured metadata including coherence scores, topic labels, and answer coverage assessments. A two-stage transformer architecture is proposed: the first encodes multi-modal content and discourse metadata at the node level using contrastive learning, while the second synthesizes a global embedding for the entire conference. Experimental results reveal that the resulting embeddings form stable, semantically meaningful representations that reflect affective tone, structural logic, and thematic alignment. Beyond financial reporting, the proposed system generalizes to other high-stakes unscripted communicative domains such as tele-medicine, education, and political discourse, offering a robust and explainable approach to multi-modal discourse representation. This approach offers practical utility for downstream tasks such as financial forecasting and discourse evaluation, while also providing a generalizable method applicable to other domains involving high-stakes communication.

[40] arXiv:2509.03738 (cross-list from cs.LG) [pdf, html, other]
Title: Sparse Autoencoder Neural Operators: Model Recovery in Function Spaces
Bahareh Tolooshams, Ailsa Shen, Anima Anandkumar
Comments: Tolooshams and Shen has equal contribution. preprint
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP); Machine Learning (stat.ML)

We frame the problem of unifying representations in neural models as one of sparse model recovery and introduce a framework that extends sparse autoencoders (SAEs) to lifted spaces and infinite-dimensional function spaces, enabling mechanistic interpretability of large neural operators (NO). While the Platonic Representation Hypothesis suggests that neural networks converge to similar representations across architectures, the representational properties of neural operators remain underexplored despite their growing importance in scientific computing. We compare the inference and training dynamics of SAEs, lifted-SAE, and SAE neural operators. We highlight how lifting and operator modules introduce beneficial inductive biases, enabling faster recovery, improved recovery of smooth concepts, and robust inference across varying resolutions, a property unique to neural operators.

[41] arXiv:2509.03762 (cross-list from cs.NI) [pdf, html, other]
Title: Drift Plus Optimistic Penalty -- A Learning Framework for Stochastic Network Optimization
Sathwik Chadaga, Eytan Modiano
Subjects: Networking and Internet Architecture (cs.NI); Systems and Control (eess.SY)

We consider the problem of joint routing and scheduling in queueing networks, where the edge transmission costs are unknown. At each time-slot, the network controller receives noisy observations of transmission costs only for those edges it selects for transmission. The network controller's objective is to make routing and scheduling decisions so that the total expected cost is minimized. This problem exhibits an exploration-exploitation trade-off; however, previous bandit-style solutions cannot be directly applied to this problem due to the queueing dynamics. In order to ensure network stability, the network controller needs to optimize throughput and cost simultaneously. We show that the best achievable cost is lower bounded by the solution to a static optimization problem, and develop a network control policy using techniques from Lyapunov drift-plus-penalty optimization and multi-armed bandits. We show that the policy achieves a sub-linear regret of order $O(\sqrt{T}\log T)$, as compared to the best policy that has complete knowledge of arrivals and costs. Finally, we evaluate the proposed policy using simulations and show that its regret is indeed sub-linear.

[42] arXiv:2509.03804 (cross-list from cs.RO) [pdf, html, other]
Title: Real-Time Buoyancy Estimation for AUV Simulations Using Convex Hull-Based Submerged Volume Calculation
Ad-Deen Mahbub, Md Ragib Shaharear
Comments: 7 pages, 10 figures
Journal-ref: AQ2UASIM: ADVANCING QUANTITATIVE AND QUALITATIVE SIMULATORS FOR MARINE APPLICATIONS, ICRA(2025)
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)

Accurate real-time buoyancy modeling is essential for high-fidelity Autonomous Underwater Vehicle (AUV) simulations, yet NVIDIA Isaac Sim lacks a native buoyancy system, requiring external solutions for precise underwater physics. This paper presents a novel convex hull-based approach to dynamically compute the submerged volume of an AUV in real time. By extracting mesh geometry from the simulation environment and calculating the hull portion intersecting the water level along the z-axis, our method enhances accuracy over traditional geometric approximations. A cross-sectional area extension reduces computational overhead, enabling efficient buoyant force updates that adapt to orientation, depth, and sinusoidal wave fluctuations (±0.3 m). Tested on a custom AUV design for SAUVC 2025, this approach delivers real-time performance and scalability, improving simulation fidelity for underwater robotics research without precomputed hydrodynamic models.

[43] arXiv:2509.03818 (cross-list from cs.NI) [pdf, html, other]
Title: A Versatile and Programmable UAV Platform for Radio Access Network and End-to-End Cellular Measurements
Sherwan Jalal Abdullah, Sravan Reddy Chintareddy, Victor S. Frost, Shawn Keshmiri, Morteza Hashemi
Subjects: Networking and Internet Architecture (cs.NI); Systems and Control (eess.SY)

In this work, we develop a measurement platform to capture mobile network performance metrics including coverage and quality of service in regions where conventional coverage testing approaches are frequently time-intensive, labor-demanding, and occasionally hazardous. Traditionally, crowd-sourcing methods are used to collect cellular network performance metrics. However, these approaches are inadequate in rural areas due to low population density and difficult terrain. The platform described here is UAV-based and is designed to investigate mobile network performance through aerial operations and gather Radio Access Network (RAN) signal metrics alongside end-to-end network performance metrics. Our platform gathers metrics through the integration of an onboard computation unit and a commercial off-the-shelf cellular modem. The gathered data are subsequently analyzed and displayed using geospatial mapping utilities and statistical techniques to deliver key observations on cellular network performance. Experimental results showed that, as expected, the received signal power improves at higher altitudes due to enhanced line-of-sight (LoS) conditions. However, the signal quality degrades as a result of increased interference from neighboring cells. The analysis reveals that for most of the geographic area covered in the initial experiments the system maintained acceptable signal quality, with adequate throughput performance for both uplink and downlink communications, while maintaining satisfactory round-trip time characteristics. Notably, the experiment showed that a strong radio signal metric for a given cell does not necessarily translate to consistent spatial coverage across the tested region.

[44] arXiv:2509.03879 (cross-list from cs.CR) [pdf, html, other]
Title: ShieldMMU: Detecting and Defending against Controlled-Channel Attacks in Shielding Memory System
Gang Liu, Ningjie Li, Cen Chen
Subjects: Cryptography and Security (cs.CR); Systems and Control (eess.SY)

Intel SGX and hypervisors isolate non-privileged programs from other software, ensuring confidentiality and integrity. However, side-channel attacks continue to threaten Intel SGX's security, enabling a malicious OS to manipulate PTE present bits, induce page faults, and steal memory access traces. Despite extensive research, existing defenses focus on detection or rely on impractical solutions. This paper presents ShieldMMU, a comprehensive solution for mitigating controlled-channel attacks, balancing compatibility, performance, and usability. Leveraging a Merkle Tree-inspired Defense Tree (DD-Tree), ShieldMMU protects PTE integrity by detecting, locating, and restoring attacked PTEs. It identifies MMU page table lookup events and side-channel attacks, promptly restoring PTE parameters to prevent page fault traps and ensure secure non-privileged application operation within SGX. Our experiments confirm ShieldMMU's enhanced security and acceptable latency performance.

[45] arXiv:2509.03913 (cross-list from cs.SD) [pdf, html, other]
Title: SwinSRGAN: Swin Transformer-based Generative Adversarial Network for High-Fidelity Speech Super-Resolution
Jiajun Yuan, Xiaochen Wang, Yuhang Xiao, Yulin Wu, Chenhao Hu, Xueyang Lv
Comments: 5 pages
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)

Speech super-resolution (SR) reconstructs high-frequency content from low-resolution speech signals. Existing systems often suffer from representation mismatch in two-stage mel-vocoder pipelines and from over-smoothing of hallucinated high-band content by CNN-only generators. Diffusion and flow models are computationally expensive, and their robustness across domains and sampling rates remains limited. We propose SwinSRGAN, an end-to-end framework operating on Modified Discrete Cosine Transform (MDCT) magnitudes. It is a Swin Transformer-based U-Net that captures long-range spectro-temporal dependencies. A hybrid adversarial scheme combines time-domain MPD/MSD discriminators with a multi-band MDCT discriminator specialized for the high-frequency band. We employ a sparse-aware regularizer on arcsinh-compressed MDCT to better preserve transient components. The system upsamples inputs at various sampling rates to 48 kHz in a single pass and operates in real time. On standard benchmarks, SwinSRGAN reduces objective error and improves ABX preference scores. In zero-shot tests on HiFi-TTS without fine-tuning, it outperforms NVSR and mdctGAN, demonstrating strong generalization across datasets.

[46] arXiv:2509.03953 (cross-list from cs.AI) [pdf, html, other]
Title: Handling Infinite Domain Parameters in Planning Through Best-First Search with Delayed Partial Expansions
Ángel Aso-Mollar, Diego Aineto, Enrico Scala, Eva Onaindia
Comments: To appear in the Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2025)
Subjects: Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC); Systems and Control (eess.SY)

In automated planning, control parameters extend standard action representations through the introduction of continuous numeric decision variables. Existing state-of-the-art approaches have primarily handled control parameters as embedded constraints alongside other temporal and numeric restrictions, and thus have implicitly treated them as additional constraints rather than as decision points in the search space. In this paper, we propose an efficient alternative that explicitly handles control parameters as true decision points within a systematic search scheme. We develop a best-first, heuristic search algorithm that operates over infinite decision spaces defined by control parameters and prove a notion of completeness in the limit under certain conditions. Our algorithm leverages the concept of delayed partial expansion, where a state is not fully expanded at once; instead, a subset of its successors is expanded incrementally. Our results demonstrate that this novel search algorithm is a competitive alternative to existing approaches for solving planning problems involving control parameters.

[47] arXiv:2509.04014 (cross-list from math.OC) [pdf, html, other]
Title: Distance Between Stochastic Linear Systems
Venkatraman Renganathan, Sei Zhen Khong
Comments: Submitted to SIAM Journal on Control and Optimization. 27 Pages in total
Subjects: Optimization and Control (math.OC); Systems and Control (eess.SY)

This manuscript proposes a distance measure between stochastic linear dynamical systems. While the existing stochastic control theory is well equipped to handle dynamical systems with stochastic uncertainties, a paradigm shift using distance-measure-based decision making is required for the effective further exploration of the field. As a first step, a distance measure between two linear time-invariant stochastic dynamical systems is proposed here, extending the existing distance metrics between deterministic linear dynamical systems. For the frequency domain setting, the proposed distance is the worst-case point-wise in frequency Wasserstein distance between distributions characterising the uncertainties using inverse stereographic projection on the Riemann sphere. For the time domain setting, the proposed distance corresponds to the gap metric induced type-$q$ Wasserstein distance between the push-forward measures under both systems' corresponding measurable maps from the parameter space to their respective space of system plants. It is proved, and demonstrated using numerical simulation, that the proposed frequency domain distance measure never exceeds its time domain counterpart. Lower and upper bounds are provided for the proposed distance measures in both frequency and time domain settings. The proposed distance measures induce a topology in the corresponding (frequency/time) domain space of stochastic dynamical systems and will facilitate the provision of probabilistic guarantees on system robustness and controller performances.

[48] arXiv:2509.04088 (cross-list from cs.HC) [pdf, html, other]
Title: Spiking Neural Network Decoders of Finger Forces from High-Density Intramuscular Microelectrode Arrays
Farah Baracat, Agnese Grison, Dario Farina, Giacomo Indiveri, Elisa Donati
Subjects: Human-Computer Interaction (cs.HC); Signal Processing (eess.SP)

Restoring naturalistic finger control in assistive technologies requires the continuous decoding of motor intent with high accuracy, efficiency, and robustness. Here, we present a spike-based decoding framework that integrates spiking neural networks (SNNs) with motor unit activity extracted from high-density intramuscular microelectrode arrays. We demonstrate simultaneous and proportional decoding of individual finger forces from motor unit spike trains during isometric contractions at 15% of maximum voluntary contraction using SNNs. We systematically evaluated alternative SNN decoder configurations and compared two possible input modalities: physiologically grounded motor unit spike trains and spike-encoded intramuscular EMG signals. Through this comparison, we quantified trade-offs between decoding accuracy, memory footprint, and robustness to input errors. The results showed that shallow SNNs can reliably decode finger-level motor intent with competitive accuracy and minimal latency, while operating with reduced memory requirements and without the need for external preprocessing buffers. This work provides a practical blueprint for integrating SNNs into finger-level force decoding systems, demonstrating how the choice of input representation can be strategically tailored to meet application-specific requirements for accuracy, robustness, and memory efficiency.

[49] arXiv:2509.04090 (cross-list from math.OC) [pdf, html, other]
Title: Optimal Control for Minimizing Inescapable Ellipsoids in Linear Periodically Time-Varying Systems Under Bounded Disturbances
Egor Dogadin, Alexey Peregudin
Journal-ref: E. Dogadin and A. Peregudin, "Optimal Control for Minimizing Inescapable Ellipsoids in Linear Periodically Time-Varying Systems Under Bounded Disturbances," in IEEE Control Systems Letters, vol. 9, pp. 228-233, 2025
Subjects: Optimization and Control (math.OC); Systems and Control (eess.SY)

This letter addresses optimal controller design for periodic linear time-varying systems under unknown-but-bounded disturbances. We introduce differential Lyapunov-type equations to describe time-varying inescapable ellipsoids and define an integral-based measure of their size. To minimize this measure, we develop a differential Riccati equation-based approach that provides exact solutions for state-feedback, observer synthesis, and output-feedback control. A key component is a systematic procedure for determining the optimal time-varying parameter, reducing an infinite-dimensional optimization to a simple iterative process. A numerical example validates the method's effectiveness.

[50] arXiv:2509.04137 (cross-list from physics.app-ph) [pdf, other]
Title: Active Dual-Gated Graphene Transistors for Low-Noise, Drift-Stable, and Tunable Chemical Sensing
Vinay Kammarchedu, Heshmat Asgharian, Hossein Chenani, Aida Ebrahimi
Subjects: Applied Physics (physics.app-ph); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Systems and Control (eess.SY)

Graphene field-effect transistors (GFETs) are among the most promising platforms for ultrasensitive chemical and biological sensing due to their high carrier mobility, large surface area, and low intrinsic noise. However, conventional single-gate GFET sensors in liquid environments suffer from severe limitations, including signal drift, charge trapping, and insufficient signal amplification. Here, we introduce a dual-gate GFET architecture that integrates a high-k hafnium dioxide local back gate with an electrolyte top gate, coupled with real-time feedback biasing. This design enables capacitive signal amplification while simultaneously suppressing gate leakage and low-frequency noise. By systematically evaluating seven distinct operational modes, we identify the Dual Mode Fixed configuration as optimal, achieving up to 20x signal gain, > 15x lower drift compared with gate-swept methods, and up to 7x higher signal-to-noise ratio across a diverse range of analytes, including neurotransmitters, volatile organic compounds, environmental contaminants, and proteins. We further demonstrate robust, multiplexed detection using a PCB-integrated GFET sensor array, underscoring the scalability and practicality of the platform for portable, high-throughput sensing in complex environments. Together, these advances establish a versatile and stable sensing technology capable of real-time, label-free detection of molecular targets under ambient and physiological conditions, with broad applicability in health monitoring, food safety, agriculture, and environmental screening.

Replacement submissions (showing 47 of 47 entries)

[51] arXiv:2403.05702 (replaced) [pdf, html, other]
Title: Spatial-aware Transformer-GRU Framework for Enhanced Glaucoma Diagnosis from 3D OCT Imaging
Mona Ashtari-Majlan, David Masip
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Glaucoma, a leading cause of irreversible blindness, necessitates early detection for accurate and timely intervention to prevent irreversible vision loss. In this study, we present a novel deep learning framework that leverages the diagnostic value of 3D Optical Coherence Tomography (OCT) imaging for automated glaucoma detection. In this framework, we integrate a pre-trained Vision Transformer on retinal data for rich slice-wise feature extraction and a bidirectional Gated Recurrent Unit for capturing inter-slice spatial dependencies. This dual-component approach enables comprehensive analysis of local nuances and global structural integrity, crucial for accurate glaucoma diagnosis. Experimental results on a large dataset demonstrate the superior performance of the proposed method over state-of-the-art ones, achieving an F1-score of 93.01%, Matthews Correlation Coefficient (MCC) of 69.33%, and AUC of 94.20%. The framework's ability to leverage the valuable information in 3D OCT data holds significant potential for enhancing clinical decision support systems and improving patient outcomes in glaucoma management.

[52] arXiv:2404.02574 (replaced) [pdf, html, other]
Title: A Learning With Errors based encryption scheme for dynamic controllers that discloses residue signal for anomaly detection
Yeongjun Jang, Joowon Lee, Junsoo Kim, Takashi Tanaka, Hyungbo Shim
Comments: 11 pages, 4 figures
Subjects: Systems and Control (eess.SY)

Although encrypted control systems ensure confidentiality of private data, it is challenging to detect anomalies without the secret key as all signals remain encrypted. To address this issue, we propose a homomorphic encryption scheme for dynamic controllers that automatically discloses the residue signal for anomaly detection, while keeping all other signals private. To this end, we characterize the zero-dynamics of an encrypted dynamic system over a finite field of integers and incorporate it into a Learning With Errors (LWE) based scheme. We then present a method to further utilize the disclosed residue signal for implementing dynamic controllers over encrypted data, which does not involve re-encryption even when they have non-integer state matrices.

[53] arXiv:2410.02807 (replaced) [pdf, html, other]
Title: AutoPETIII: The Tracer Frontier. What Frontier?
Zacharia Mesbah, Léo Mottay, Romain Modzelewski, Pierre Decazes, Sébastien Hapdey, Su Ruan, Sébastien Thureau
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

For the last three years, the AutoPET competition has gathered the medical imaging community around a hot topic: lesion segmentation on Positron Emission Tomography (PET) scans. Each year a different aspect of the problem is presented; in 2024 the multiplicity of existing and used tracers was at the core of the challenge. Specifically, this year's edition aims to develop a fully automatic algorithm capable of performing lesion segmentation on a PET/CT scan without knowing the tracer, which can be either an FDG or a PSMA-based tracer. In this paper we describe how we used the nnUNetv2 framework to train two sets of 6-fold ensembles of models to perform fully automatic PET/CT lesion segmentation, as well as a MIP-CNN to choose which set of models to use for segmentation.

[54] arXiv:2410.07982 (replaced) [pdf, html, other]
Title: Window Function-less DFT with Reduced Noise and Latency for Real-Time Music Analysis
Cai Biesinger, Hiromitsu Awano, Masanori Hashimoto
Comments: 5 pages, 4 figures, Final version accepted to EUSIPCO 2025. TeX-generated PDF exemption due to formatting problems on arXiv. This version: clarified text throughout, replaced final figure with better representative one, added bin index k to several equations
Subjects: Audio and Speech Processing (eess.AS)

Music analysis applications demand algorithms that can provide both high time and frequency resolution while minimizing noise in an already-noisy signal. Real-time analysis additionally demands low latency and low computational requirements. We propose a DFT-based algorithm that accomplishes all these requirements by extending a method that post-processes DFT output without the use of window functions. Our approach yields greatly reduced sidelobes and noise, and improves time resolution without sacrificing frequency resolution. We use exponentially spaced output bins which directly map to notes in music. The resulting improved performance, compared to existing FFT and DFT-based approaches, creates possibilities for improved real-time visualizations, and contributes to improved analysis quality in other applications such as automatic transcription.
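As a rough illustration of what "exponentially spaced output bins which directly map to notes" can mean, the sketch below places one bin per equal-tempered semitone and evaluates an unwindowed DFT at each of those frequencies. It is not the authors' algorithm (their post-processing step is not described in the abstract); the 440 Hz reference, the one-octave range, and the names `note_frequencies` and `single_bin_dft` are assumptions made for this example.

```python
import cmath
import math


def note_frequencies(f_ref=440.0, semitones_below=12, semitones_above=12):
    """Exponentially spaced bin centres, one per equal-tempered semitone (assumed layout)."""
    return [f_ref * 2.0 ** (k / 12.0)
            for k in range(-semitones_below, semitones_above + 1)]


def single_bin_dft(samples, freq_hz, sample_rate):
    """Normalised DFT magnitude at one arbitrary frequency, with no window function."""
    n = len(samples)
    acc = sum(x * cmath.exp(-2j * math.pi * freq_hz * i / sample_rate)
              for i, x in enumerate(samples))
    return abs(acc) / n


# Hypothetical test signal: half a second of a pure 440 Hz sine at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 440.0 * i / sample_rate) for i in range(4000)]

bins = note_frequencies()
magnitudes = [single_bin_dft(tone, f, sample_rate) for f in bins]
loudest = bins[max(range(len(bins)), key=magnitudes.__getitem__)]
```

Because each bin sits on a note frequency, the index of the strongest bin can be read directly as a pitch, which a linearly spaced FFT grid does not offer without interpolation.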

[55] arXiv:2411.14013 (replaced) [pdf, html, other]
Title: Exposing Synthetic Speech: Model Attribution and Detection of AI-generated Speech via Audio Fingerprints
Matías Pizarro, Mike Laszkiewicz, Shawkat Hesso, Dorothea Kolossa, Asja Fischer
Subjects: Audio and Speech Processing (eess.AS); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

As speech generation technologies continue to advance in quality and accessibility, the risk of malicious use cases, including impersonation, misinformation, and spoofing, increases rapidly. This work addresses this threat by introducing a simple, training-free, yet effective approach for detecting AI-generated speech and attributing it to its source model. Specifically, we tackle three key tasks: (1) single-model attribution in an open-world setting, where the goal is to determine whether a given audio sample was generated by a specific target neural speech synthesis system (with access only to data from that system); (2) multi-model attribution in a closed-world setting, where the objective is to identify the generating system from a known pool of candidates; and (3) detection of synthetic versus real speech. Our approach leverages standardized average residuals, i.e., the difference between an input audio signal and its filtered version, obtained using either a low-pass filter or the EnCodec audio autoencoder. We demonstrate that these residuals consistently capture artifacts introduced by diverse speech synthesis systems, serving as distinctive, model-agnostic fingerprints for attribution. Across extensive experiments, our approach achieves AUROC scores exceeding 99% in most scenarios, evaluated on augmented benchmark datasets that pair real speech with synthetic audio generated by multiple synthesis systems. In addition, our robustness analysis underscores the method's ability to maintain high performance even in the presence of moderate additive noise. Due to its simplicity, efficiency, and strong generalization across speech synthesis systems and languages, this technique offers a practical tool for digital forensics and security applications.
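The residual idea in this abstract (input signal minus a filtered copy of itself) can be sketched in a few lines. This is only a minimal illustration under stated assumptions: a moving-average filter stands in for the low-pass filter, z-scoring stands in for the standardization, and the names `moving_average_lowpass`, `residual`, and `standardize` are hypothetical. The abstract does not specify the authors' filter, averaging, or normalization.

```python
import statistics


def moving_average_lowpass(signal, width=5):
    """Crude low-pass filter (assumed stand-in): centred moving average."""
    half = width // 2
    out = []
    for i in range(len(signal)):
        window = signal[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out


def residual(signal, width=5):
    """Difference between the input signal and its low-pass filtered version."""
    smooth = moving_average_lowpass(signal, width)
    return [x - s for x, s in zip(signal, smooth)]


def standardize(values):
    """Z-score the values (assumed form of standardization); zeros if constant."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0] * len(values)
    return [(v - mean) / std for v in values]
```

For example, `standardize(residual(samples))` would give one such fingerprint-like vector for a list of audio samples; how vectors are averaged and compared across recordings is left open here.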

[56] arXiv:2412.03901 (replaced) [pdf, other]
Title: Certified Learning of Incremental ISS Controllers for Unknown Nonlinear Polynomial Dynamics
Mahdieh Zaker, David Angeli, Abolfazl Lavaei
Subjects: Systems and Control (eess.SY)

Incremental input-to-state stability (delta-ISS) offers a robust framework to ensure that small input variations result in proportionally minor deviations in the state of a nonlinear system. This property is essential in practical applications where input precision cannot be guaranteed. However, analyzing delta-ISS demands precise knowledge of system dynamics to assess the state's incremental response to input changes, posing a challenge in real-world scenarios where mathematical models are unknown. In this work, we develop a data-driven approach to design delta-ISS Lyapunov functions together with their corresponding delta-ISS controllers for continuous-time input-affine nonlinear systems with polynomial dynamics, ensuring the delta-ISS property is achieved without requiring knowledge of the system dynamics. In our data-driven scheme, we collect only two sets of input-state trajectories from sufficiently excited dynamics. By fulfilling a specific rank condition, we design delta-ISS controllers using the collected samples through formulating a sum-of-squares optimization program. The effectiveness of our data-driven approach is evidenced by its application to a physical case study.

[57] arXiv:2502.06289 (replaced) [pdf, other]
Title: Is an Ultra Large Natural Image-Based Foundation Model Superior to a Retina-Specific Model for Detecting Ocular and Systemic Diseases?
Qingshan Hou, Yukun Zhou, Jocelyn Hui Lin Goh, Ke Zou, Samantha Min Er Yew, Sahana Srinivasan, Meng Wang, Thaddaeus Lo, Xiaofeng Lei, Siegfried K. Wagner, Mark A. Chia, Dawei Yang, Hongyang Jiang, An Ran Ran, Rui Santos, Gabor Mark Somfai, Juan Helen Zhou, Haoyu Chen, Qingyu Chen, Carol Y. Cheung, Pearse A. Keane, Yih Chung Tham
Comments: Accepted by Ophthalmology Science and is currently in press
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

The advent of foundation models (FMs) is transforming the medical domain. In ophthalmology, RETFound, a retina-specific FM pre-trained sequentially on 1.4 million natural images and 1.6 million retinal images, has demonstrated high adaptability across clinical applications. Conversely, DINOv2, a general-purpose vision FM pre-trained on 142 million natural images, has shown promise in non-medical domains. However, its applicability to clinical tasks remains underexplored. To address this, we conducted head-to-head evaluations by fine-tuning RETFound and three DINOv2 models (large, base, small) for ocular disease detection and systemic disease prediction tasks, across eight standardized open-source ocular datasets, as well as the Moorfields AlzEye and the UK Biobank datasets. DINOv2-large model outperformed RETFound in detecting diabetic retinopathy (AUROC=0.850-0.952 vs 0.823-0.944, across three datasets, all P<=0.007) and multi-class eye diseases (AUROC=0.892 vs. 0.846, P<0.001). In glaucoma, DINOv2-base model outperformed RETFound (AUROC=0.958 vs 0.940, P<0.001). Conversely, RETFound achieved superior performance over all DINOv2 models in predicting heart failure, myocardial infarction, and ischaemic stroke (AUROC=0.732-0.796 vs 0.663-0.771, all P<0.001). These trends persisted even with 10% of the fine-tuning data. These findings showcase the distinct scenarios where general-purpose and domain-specific FMs excel, highlighting the importance of aligning FM selection with task-specific requirements to optimise clinical performance.

[58] arXiv:2503.11993 (replaced) [pdf, html, other]
Title: Impact of Frequency on Diffraction-Aided Wireless Positioning
Gaurav Duggal, Anand M. Kumar, R. Michael Buehrer, Harpreet S. Dhillon, Nishith Tripathi, Jeffrey H. Reed
Comments: Accepted for publication in ICC 2025
Subjects: Signal Processing (eess.SP)

This paper tackles the challenge of accurate positioning in Non-Line-of-Sight (NLoS) environments, with a focus on indoor public safety scenarios where NLoS bias severely impacts localization performance. We explore Diffraction MultiPath Components (MPC) as a critical mechanism for Outdoor-to-Indoor (O2I) signal propagation and its role in positioning. The proposed system comprises outdoor Uncrewed Aerial Vehicle (UAV) transmitters and indoor receivers that require localization. To facilitate diffraction-based positioning, we develop a method to isolate diffraction MPCs at indoor receivers and validate its effectiveness using a ray-tracing-generated dataset, which we have made publicly available. Our evaluation across the FR1, FR2, and FR3 frequency bands within the 5G/6G spectrum confirms the viability of diffraction-based positioning techniques for next-generation wireless networks.

[59] arXiv:2504.00276 (replaced) [pdf, html, other]
Title: On-the-fly Surrogation for Complex Nonlinear Dynamics
E. Javier Olucha, Rajiv Singh, Amritam Das, Roland Tóth
Comments: 64th IEEE Conference on Decision and Control, 2025 [Accepted] this https URL
Subjects: Systems and Control (eess.SY)

High-fidelity models are essential for accurately capturing nonlinear system dynamics. However, simulation of these models is often computationally too expensive and, due to their complexity, they are not directly suitable for analysis, control design or real-time applications. Surrogate modelling techniques seek to construct simplified representations of these systems with minimal complexity, but with adequate information on the dynamics for the simulation, analysis or synthesis objective at hand. Despite the widespread availability of system linearizations and the growing computational potential of autograd methods, there is no established approach that systematically exploits them to capture the underlying global nonlinear dynamics. This work proposes a novel surrogate modelling approach that can efficiently build a global representation of the dynamics on-the-fly from local system linearizations without ever explicitly computing a model. Using radial basis function interpolation and the second fundamental theorem of calculus, the surrogate model is only computed at its evaluation, enabling rapid computation for simulation and analysis and seamless incorporation of new linearization data. The efficiency and modelling capabilities of the method are demonstrated on simulation examples.

[60] arXiv:2504.03157 (replaced) [pdf, html, other]
Title: Taming High-Dimensional Dynamics: Learning Optimal Projections onto Spectral Submanifolds
Hugo Buurmeijer, Luis A. Pabon, John Irvin Alora, Roshan S. Kaundinya, George Haller, Marco Pavone
Subjects: Systems and Control (eess.SY); Robotics (cs.RO)

High-dimensional nonlinear systems pose considerable challenges for modeling and control across many domains, from fluid mechanics to advanced robotics. Such systems are typically approximated with reduced-order models, which often rely on orthogonal projections, a simplification that may lead to large prediction errors. In this work, we derive optimality of fiber-aligned projections onto spectral submanifolds, preserving the nonlinear geometric structure and minimizing long-term prediction error. We propose a data-driven procedure to learn these projections from trajectories and demonstrate its effectiveness through a 180-dimensional robotic system. Our reduced-order models achieve up to fivefold improvement in trajectory tracking accuracy under model predictive control compared to the state of the art.

[61] arXiv:2504.16874 (replaced) [pdf, html, other]
Title: Adaptive RIS Control for Mobile mmWave NLoS Communication Using Single-Bit Feedback
Hamed Radpour, Markus Hofer, Thomas Zemen
Comments: 6 pages, submitted to IEEE Global Communications (GLOBECOM25) Conference
Subjects: Systems and Control (eess.SY)

Reconfigurable intelligent surfaces (RISs) are emerging as key enablers of reliable industrial automation in the millimeter-wave (mmWave) band, particularly in environments with frequent line-of-sight (LoS) blockage. While prior works have largely focused on theoretical aspects, real-time validation under user mobility remains underexplored. In this work, we propose and experimentally evaluate an adaptive beamforming algorithm that enables RIS reconfiguration via a low-rate feedback link from the mobile user equipment (UE) to the RIS controller, operating without requiring UE position knowledge. The algorithm maintains the received signal power above a predefined threshold using only a single-bit comparison of received power levels. To analyze the algorithm's performance, we establish a simulation-based Monte Carlo (MC) optimization benchmark that assumes full UE position knowledge, accounts for practical hardware constraints, and serves as an upper bound for performance evaluation. Using a hexagonal RIS with 127 elements and 1-bit phase quantization at 23.8 GHz, we validate the proposed approach in a semi-anechoic environment over a 60 cm by 92 cm area. The results demonstrate that the single-bit feedback-driven algorithm closes much of the performance gap to the MC upper bound while achieving up to 24 dB gain in received power compared to an inactive RIS baseline. These findings highlight the practical potential of feedback-based adaptive RIS control for robust mmWave non-line-of-sight (NLoS) communication with mobile users.
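To make the single-bit feedback idea concrete, the following toy sketch flips one 1-bit RIS element at a time and keeps the flip only when a one-bit report says the received power went up. It is a generic greedy search written under strong assumptions, not the algorithm evaluated in the paper: the unit-magnitude random channel, the element-by-element sweep order, and the names `received_power` and `adapt` are all hypothetical.

```python
import cmath
import random


def received_power(channel, states):
    """Toy model: power of the coherent sum over elements with +1/-1 (1-bit) phases."""
    return abs(sum(h * s for h, s in zip(channel, states))) ** 2


def adapt(channel, states, threshold, max_sweeps=20):
    """Greedy reconfiguration driven by one feedback bit per trial configuration."""
    states = list(states)
    best = received_power(channel, states)
    for _ in range(max_sweeps):
        if best >= threshold:          # power target met: stop reconfiguring
            break
        for i in range(len(states)):
            states[i] = -states[i]     # try flipping one element
            trial = received_power(channel, states)
            improved = trial > best    # the single bit the UE would report
            if improved:
                best = trial
            else:
                states[i] = -states[i]  # revert the flip
    return states, best


rng = random.Random(0)
# 127 elements as in the abstract; the random unit-magnitude channel is an assumption.
channel = [cmath.exp(2j * cmath.pi * rng.random()) for _ in range(127)]
initial = [1] * 127
p0 = received_power(channel, initial)
final_states, p1 = adapt(channel, initial, threshold=float("inf"))
```

The controller never needs the UE position or the measured power value itself, only whether the latest configuration was better than the previous one.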

[62] arXiv:2504.18175 (replaced) [pdf, html, other]
Title: Generative AI for Physical-Layer Authentication
Rui Meng, Xiqi Cheng, Song Gao, Xiaodong Xu, Chen Dong, Guoshun Nan, Xiaofeng Tao, Ping Zhang, Tony Q. S. Quek
Comments: 10 pages, 3 figures
Subjects: Signal Processing (eess.SP)

In recent years, Artificial Intelligence (AI)-driven Physical-Layer Authentication (PLA), which focuses on achieving endogenous security and intelligent identity authentication, has attracted considerable interest. When compared with Discriminative AI (DAI), Generative AI (GAI) offers several advantages, such as fingerprint data augmentation, fingerprint denoising and reconstruction, and protection against adversarial attacks. Inspired by these innovations, this paper provides a systematic exploration of GAI's integration into PLA frameworks. We commence with a review of representative authentication techniques, emphasizing PLA's inherent strengths. Following this, we revisit four typical GAI models and contrast the limitations of DAI with the potential of GAI in addressing PLA challenges, including insufficient fingerprint data, environmental noise and interference, perturbations in fingerprint data, and complex tasks. Specifically, we detail GAI-enhanced methods for PLA across the fingerprint collection, model training, and performance optimization phases. Moreover, we present a case study that combines fingerprint extrapolation and a Generative Diffusion Model (GDM) to illustrate the superiority of GAI in bolstering the reliability of PLA. Additionally, we outline potential future research directions for GAI-based PLA.

[63] arXiv:2506.07634 (replaced) [pdf, html, other]
Title: SongBloom: Coherent Song Generation via Interleaved Autoregressive Sketching and Diffusion Refinement
Chenyu Yang, Shuai Wang, Hangting Chen, Wei Tan, Jianwei Yu, Haizhou Li
Comments: Submitted to NeurIPS2025
Subjects: Audio and Speech Processing (eess.AS); Multimedia (cs.MM)

Generating music with coherent structure and harmonious instrumental and vocal elements remains a significant challenge in song generation. Existing language models and diffusion-based methods often struggle to balance global coherence with local fidelity, resulting in outputs that lack musicality or suffer from incoherent progression and mismatched lyrics. This paper introduces $\textbf{SongBloom}$, a novel framework for full-length song generation that leverages an interleaved paradigm of autoregressive sketching and diffusion-based refinement. SongBloom employs an autoregressive diffusion model that combines the high fidelity of diffusion models with the scalability of language models. Specifically, it gradually extends a musical sketch from short to long and refines the details from coarse to fine-grained. The interleaved generation paradigm effectively integrates prior semantic and acoustic context to guide the generation process. Experimental results demonstrate that SongBloom outperforms existing methods across both subjective and objective metrics and achieves performance comparable to the state-of-the-art commercial music generation platforms. Audio samples are available on our demo page: this https URL. The code and model weights have been released on this https URL.

[64] arXiv:2506.10221 (replaced) [pdf, html, other]
Title: Model Predictive Control-Based Optimal Energy Management of Autonomous Electric Vehicles Under Cold Temperatures
Shanthan Kumar Padisala, Satadru Dey
Subjects: Systems and Control (eess.SY)

In autonomous electric vehicles (AEVs), battery energy must be judiciously allocated to satisfy primary propulsion demands and secondary auxiliary demands, particularly the Heating, Ventilation, and Air Conditioning (HVAC) system. This becomes especially critical when the battery is in a low state of charge under cold ambient conditions, where cabin heating and battery preconditioning (prior to actual charging) can consume a significant percentage of available energy, directly impacting the driving range. In such cases, one usually prioritizes propulsion or applies heuristic rules for thermal management, often resulting in suboptimal energy utilization. There is a pressing need for a principled approach that can dynamically allocate battery power in a way that balances thermal comfort, battery health and preconditioning, along with range preservation. This paper attempts to address this issue using real-time Model Predictive Control to optimize the power split between propulsion, HVAC, and battery temperature preparation, so that the battery can be charged immediately once the destination is reached.

[65] arXiv:2507.01204 (replaced) [pdf, html, other]
Title: LotteryCodec: Searching the Implicit Representation in a Random Network for Low-Complexity Image Compression
Haotian Wu, Gongpu Chen, Pier Luigi Dragotti, Deniz Gündüz
Journal-ref: International Conference on Machine Learning (2025)
Subjects: Image and Video Processing (eess.IV); Information Theory (cs.IT)

We introduce and validate the lottery codec hypothesis, which states that untrained subnetworks within randomly initialized networks can serve as synthesis networks for overfitted image compression, achieving rate-distortion (RD) performance comparable to trained networks. This hypothesis leads to a new paradigm for image compression by encoding image statistics into the network substructure. Building on this hypothesis, we propose LotteryCodec, which overfits a binary mask to an individual image, leveraging an over-parameterized and randomly initialized network shared by the encoder and the decoder. To address over-parameterization challenges and streamline subnetwork search, we develop a rewind modulation mechanism that improves the RD performance. LotteryCodec outperforms VTM and sets a new state-of-the-art in single-image compression. LotteryCodec also enables adaptive decoding complexity through adjustable mask ratios, offering flexible compression solutions for diverse device constraints and application requirements.

[66] arXiv:2507.05785 (replaced) [pdf, html, other]
Title: Robust Bandwidth Estimation for Real-Time Communication with Offline Reinforcement Learning
Jian Kai, Tianwei Zhang, Zihan Ling, Yang Cao, Can Shen
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)

Accurate bandwidth estimation (BWE) is critical for real-time communication (RTC) systems. Traditional heuristic approaches offer limited adaptability under dynamic networks, while online reinforcement learning (RL) suffers from high exploration costs and potential service disruptions. Offline RL, which leverages high-quality data collected from real-world environments, offers a promising alternative. However, challenges such as out-of-distribution (OOD) actions, policy extraction from behaviorally diverse datasets, and reliable deployment in production systems remain unsolved. We propose RBWE, a robust bandwidth estimation framework based on offline RL that integrates Q-ensemble (an ensemble of Q-functions) with a Gaussian mixture policy to mitigate OOD risks and enhance policy learning. A fallback mechanism ensures deployment stability by switching to heuristic methods under high uncertainty. Experimental results show that RBWE reduces overestimation errors by 18% and improves the 10th percentile Quality of Experience (QoE) by 18.6%, demonstrating its practical effectiveness in real-world RTC applications. The implementation is publicly available at this https URL.
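
The fallback idea in this abstract can be illustrated with a minimal sketch. The abstract does not say how uncertainty is measured, so the use of the standard deviation across the Q-ensemble, the threshold `max_std`, and the function name `select_bandwidth` are all assumptions made for illustration, not the authors' implementation.

```python
import statistics


def select_bandwidth(q_estimates, policy_bwe, heuristic_bwe, max_std):
    """Pick the learned estimate unless the Q-ensemble disagrees too much.

    q_estimates   -- Q-values from each ensemble member for the chosen action
    policy_bwe    -- bandwidth proposed by the learned policy
    heuristic_bwe -- bandwidth proposed by a conventional heuristic estimator
    max_std       -- hypothetical disagreement threshold (an assumption)
    """
    # Assumption: ensemble disagreement (population std) stands in for uncertainty.
    uncertainty = statistics.pstdev(q_estimates)
    if uncertainty > max_std:
        return heuristic_bwe  # fall back under high uncertainty
    return policy_bwe
```

With a tight ensemble the learned estimate is used; with a widely spread ensemble the heuristic value is returned instead.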

[67] arXiv:2507.23266 (replaced) [pdf, html, other]
Title: CUHK-EE Systems for the vTAD Challenge at NCMMSC 2025
Aemon Yat Fei Chiu, Jingyu Li, Yusheng Tian, Guangyan Zhang, Tan Lee
Comments: Accepted at China's 20th National Conference on Man-Machine Speech Communication (NCMMSC 2025)
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)

This paper presents the Voice Timbre Attribute Detection (vTAD) systems developed by the Digital Signal Processing & Speech Technology Laboratory (DSP&STL) of the Department of Electronic Engineering (EE) at The Chinese University of Hong Kong (CUHK) for the 20th National Conference on Man-Machine Speech Communication (NCMMSC 2025) vTAD Challenge. The proposed systems leverage WavLM-Large embeddings with attentive statistical pooling (ASTP) to extract robust speaker representations, followed by two variants of Diff-Net, i.e., Feed-Forward Neural Network (FFN) and Squeeze-and-Excitation-enhanced Residual FFN (SE-ResFFN), to compare timbre attribute intensities between utterance pairs. Experimental results demonstrate that the WavLM-Large+FFN system generalises better to unseen speakers, achieving 77.96% accuracy and 21.79% equal error rate (EER), while the WavLM-Large+SE-ResFFN model excels in the 'Seen' setting with 94.42% accuracy and 5.49% EER. These findings highlight a trade-off between model complexity and generalisation, and underscore the importance of architectural choices in fine-grained speaker modelling. Our analysis also reveals the impact of speaker identity, annotation subjectivity, and data imbalance on system performance, pointing to future directions for improving robustness and fairness in timbre attribute detection.

[68] arXiv:2508.08715 (replaced) [pdf, html, other]
Title: MultiGen: Child-Friendly Multilingual Speech Generator with LLMs
Xiaoxue Gao, Huayun Zhang, Nancy F. Chen
Comments: 5 pages
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Signal Processing (eess.SP)

Generative speech models have demonstrated significant potential in improving human-machine interactions, offering valuable real-world applications such as language learning for children. However, achieving high-quality, child-friendly speech generation remains challenging, particularly for low-resource languages across diverse linguistic and cultural contexts. In this paper, we propose MultiGen, a multilingual speech generation model with child-friendly interaction that leverages an LLM architecture for speech generation tailored to low-resource languages. We integrate age-appropriate multilingual speech generation using LLM architectures, which can facilitate young children's communication with AI systems through culturally relevant context in three low-resource languages: Singaporean-accented Mandarin, Malay, and Tamil. Experimental results from both objective metrics and subjective evaluations demonstrate the superior performance of the proposed MultiGen compared to baseline methods.

[69] arXiv:2508.12059 (replaced) [pdf, html, other]
Title: Co-Investment with Payoff-Sharing Mechanism for Cooperative Decision-Making in Network Design Games
Mingjia He, Andrea Censi, Emilio Frazzoli, Gioele Zardini
Subjects: Systems and Control (eess.SY); Computer Science and Game Theory (cs.GT)

Network-based systems are inherently interconnected, with the design and performance of subnetworks being interdependent. However, the decisions of self-interested operators may lead to suboptimal outcomes for users and the overall system. This paper explores cooperative mechanisms that can simultaneously benefit both operators and users. We address this challenge using a game-theoretical framework that integrates both non-cooperative and cooperative game theory. In the non-cooperative stage, we propose a network design game in which subnetwork decision-makers strategically design local infrastructures. In the cooperative stage, a co-investment mechanism with payoff sharing is developed to enlarge collective benefits and distribute them fairly. To demonstrate the effectiveness of our framework, we conduct case studies on the Sioux Falls network and real-world public transport networks in Zurich and Winterthur, Switzerland. Our evaluation considers impacts on environmental sustainability, social welfare, and economic efficiency. The proposed framework provides a foundation for improving interdependent networked systems by enabling strategic cooperation among self-interested operators.

[70] arXiv:2508.18915 (replaced) [pdf, html, other]
Title: Performance Analysis of Underwater Optical Wireless Communication Using O-RIS and Fiber Optic Backhaul (Extended version)
Aboozar Heydaribeni, Hamzeh Beyranvand
Comments: This is version 2 (v2) of the manuscript with further improvements and refinements
Subjects: Systems and Control (eess.SY)

This Letter presents a novel hybrid underwater wireless optical communication (UWOC) system that integrates underwater optical access points (UOAPs) with a passive optical network (PON)-based fiber-optic backhaul to provide a resilient backbone. A hard switching mechanism is employed between direct and optical reconfigurable intelligent surface (O-RIS)-assisted links to ensure reliable connectivity. Unlike previous studies, the proposed system is evaluated under both active and multiple passive O-RIS configurations. To enhance reliability, the Selection Combining (SC) and Maximal Ratio Combining (MRC) schemes are applied. Analytical and simulation results demonstrate that optimal O-RIS placement significantly enhances system performance. However, in the linear regime, placing it too close to the receiver causes degradation due to increased path loss and beam jitter in an identical water type. Moreover, increasing the number of O-RIS elements within practical limits further improves overall system performance and enhances adaptability to variations in the underwater channel.

[71] arXiv:2509.00964 (replaced) [pdf, html, other]
Title: Doubly-Dispersive Continuous MIMO Systems: Channel Modeling and Beamforming Design
Kuranage Roche Rayan Ranasinghe, Zhaolin Wang, Hyeon Seok Rou, Giuseppe Thadeu Freitas de Abreu, Emil Björnson
Comments: Submitted to IEEE Transactions on Wireless Communications
Subjects: Signal Processing (eess.SP)

We address the modeling and optimal beamforming (BF) design for multiple-input multiple-output (MIMO) continuous aperture array (CAPA) systems operating over doubly-dispersive (DD) channels. First, a comprehensive DD continuous MIMO (DDC MIMO) channel model that incorporates CAPAs at both the transmitter (TX) and receiver (RX) is derived, which is used to obtain explicit input-output (I/O) relations for various waveforms well suited to integrated sensing and communications (ISAC) and robust to DD channels, namely orthogonal frequency division multiplexing (OFDM), orthogonal time frequency space (OTFS), and affine frequency division multiplexing (AFDM). Then, functional optimization problems are formulated for the design of TX and RX BF matrices that maximize received power, in which novel low-complexity, closed-form solutions are obtained via the calculus of variations (CoV) method, yielding expressions closely related to the classical matched filter commonly used in conventional MIMO systems. Simulation results confirm that the proposed TX/RX BF designs with CAPAs provide significant performance and computational complexity gains over conventional MIMO systems in DD channels.

[72] arXiv:2509.01856 (replaced) [pdf, html, other]
Title: Safety-Critical Multi-Agent MCTS for Mixed Traffic Coordination at Unsignalized Roundabout
Zhihao Lin, Shuo Liu, Zhen Tian, Dezong Zhao, Jianglin Lan
Comments: 12 pages, 10 figures
Subjects: Systems and Control (eess.SY)

Decision-making at unsignalized roundabouts poses substantial challenges for autonomous vehicles (AVs), particularly in mixed traffic environments where AVs must coordinate safely with human-driven vehicles (HDVs). This paper presents a safety-critical multi-agent Monte Carlo Tree Search (MCTS) framework that integrates both deterministic and probabilistic prediction models to facilitate cooperative decision-making in complex roundabout scenarios. The proposed framework introduces three key innovations: (1) a hierarchical safety assessment module that systematically addresses AV-to-AV (A2A), AV-to-HDV (A2H), and AV-to-Road (A2R) interactions through dynamic safety thresholds and spatiotemporal risk evaluation; (2) an adaptive HDV behavior prediction scheme that combines the Intelligent Driver Model (IDM) with probabilistic uncertainty modeling; and (3) a multi-objective reward optimization strategy that jointly considers safety, efficiency, and cooperative intent. Extensive simulation results validate the effectiveness of the proposed approach under both fully autonomous (100% AVs) and mixed traffic (50% AVs + 50% HDVs) conditions. Compared to benchmark methods, our framework consistently reduces trajectory deviations across all AVs and significantly lowers the rate of Post-Encroachment Time (PET) violations, achieving only 1.0% in the fully autonomous scenario and 3.2% in the mixed traffic setting.

[73] arXiv:2509.01875 (replaced) [pdf, html, other]
Title: RadioDiff-Loc: Diffusion Model Enhanced Scattering Cognition for NLoS Localization with Sparse Radio Map Estimation
Xiucheng Wang, Qiming Zhang, Nan Cheng
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)

Accurate localization of non-cooperative signal sources in non-line-of-sight (NLoS) environments remains a critical challenge with a wide range of applications, including autonomous navigation, industrial automation, and emergency response. In such settings, traditional positioning techniques relying on line-of-sight (LoS) or cooperative signaling fail due to severe multipath propagation and unknown transmit power. This paper proposes a novel generative inference framework for NLoS localization based on conditional diffusion models. By leveraging the physical insight that diffracted electromagnetic energy concentrates near building edges, we develop a sampling strategy that collects sparse received signal strength (RSS) measurements at the geometric vertices of obstacles--locations that maximize Fisher information and mutual information with respect to the unknown source. To overcome the lack of known transmission power, we normalize all sampled RSS values relative to the maximum observed intensity, enabling the construction of a power-invariant radio map (RM). A conditional diffusion model is trained to reconstruct the full RM based on environmental layout and sparse RSS observations. Localization is then achieved by identifying the brightest point on the generated RM. Moreover, the proposed framework is compatible with existing RSS-based localization algorithms, enabling a dual-driven paradigm that fuses physical knowledge and data-driven inference for improved accuracy. Extensive theoretical analysis and empirical validation demonstrate that our approach achieves high localization accuracy with significantly reduced sampling cost, offering a scalable and physically grounded solution for non-cooperative NLoS emitter localization.
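
The normalization and peak-picking steps described in this abstract can be sketched as follows. This is a minimal illustration, assuming RSS values on a linear (not dB) scale and a radio map stored as a list of rows; the function names `normalize_rss` and `brightest_point` are hypothetical, and the diffusion-based map reconstruction itself is not shown.

```python
def normalize_rss(samples):
    """Scale linear-scale RSS samples by the maximum observed value.

    Dividing by the peak removes the unknown transmit power: scaling every
    sample by the same factor leaves the normalized values unchanged.
    """
    peak = max(samples)
    if peak <= 0:
        raise ValueError("expected at least one positive linear-scale RSS value")
    return [s / peak for s in samples]


def brightest_point(radio_map):
    """Return (row, col) of the largest entry in a 2D radio map."""
    best_value, best_pos = None, None
    for r, row in enumerate(radio_map):
        for c, value in enumerate(row):
            if best_value is None or value > best_value:
                best_value, best_pos = value, (r, c)
    return best_pos
```

Two sample sets that differ only by a constant power factor normalize to the same values, which is the power-invariance property the abstract relies on.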

[74] arXiv:2509.01889 (replaced) [pdf, html, other]
Title: From Evaluation to Optimization: Neural Speech Assessment for Downstream Applications
Yu Tsao
Comments: 5 pages, 1 figure
Subjects: Audio and Speech Processing (eess.AS)

The evaluation of synthetic and processed speech has long been a cornerstone of audio engineering and speech science. Although subjective listening tests remain the gold standard for assessing perceptual quality and intelligibility, their high cost, time requirements, and limited scalability present significant challenges in the rapid development cycles of modern speech technologies. Traditional objective metrics, while computationally efficient, often rely on a clean reference signal, making them intrusive approaches. This presents a major limitation, as clean signals are often unavailable in real-world applications. In recent years, numerous neural network-based speech assessment models have been developed to predict quality and intelligibility, achieving promising results. Beyond their role in evaluation, these models are increasingly integrated into downstream speech processing tasks. This review focuses on their role in two main areas: (1) serving as differentiable perceptual proxies that not only assess but also guide the optimization of speech enhancement and synthesis models; and (2) enabling the detection of salient speech characteristics to support more precise and efficient downstream processing. Finally, we discuss current limitations and outline future research directions to further advance the integration of speech assessment into speech processing pipelines.

[75] arXiv:2509.02116 (replaced) [pdf, html, other]
Title: Affine-Doppler Division Multiplexing for High-Mobility Wireless Communications Systems
Yuanfang Ma, Zulin Wang, Peng Yuan, Qin Huang, Yuanhan Ni
Comments: 7 pages, 4 figures, 1 table
Subjects: Signal Processing (eess.SP)

Affine Frequency Division Multiplexing (AFDM) has been regarded as a candidate integrated sensing and communications (ISAC) waveform owing to its superior communication performance, outperforming Orthogonal Time-Frequency Space (OTFS), which has been studied for longer. However, since the two waveforms are incompatible with each other, state-of-the-art methods designed for OTFS may not be directly applicable to AFDM. This paper introduces a new orthogonal multicarrier waveform, namely Affine-Doppler Division Multiplexing (ADDM), which provides a generic framework and subsumes the existing OTFS and AFDM as particular cases. By modulating information symbols in the Affine-Doppler (A-D) domain based on a two-dimensional (2D) transform, ADDM enjoys both excellent unambiguous Doppler and Doppler resolution, matching AFDM and outperforming OTFS. Moreover, thanks to the 2D transform, the symbol block of ADDM in the A-D domain undergoes a 2D cyclic shift produced by the delay and the Doppler of the channel, similar to the 2D cyclic shift in the delay-Doppler domain of cyclic prefix (CP)-OTFS. This offers the potential to directly apply state-of-the-art methods designed for OTFS and AFDM to ADDM. Numerical results show that ADDM achieves BER performance comparable to AFDM and outperforms OTFS in high-mobility scenarios.

[76] arXiv:2509.02238 (replaced) [pdf, other]
Title: On the Effect of Tap Changers and Nonlinear Loads on Voltage Stability
Andrea Zanelli, Dirk Schmidt, Matthias Resch, Marco Giovanelli, Martin Geidl, Walter Sattinger
Subjects: Systems and Control (eess.SY)

On 21 June 2024, a severe incident happened in the South-Eastern part of the Continental European power system. After a voltage collapse, large parts of Albania, Montenegro, Bosnia and Herzegovina as well as Croatia suffered from a blackout [1]. The initial tripping of two transmission lines resulted in a voltage collapse in these countries. Investigations have shown that a) transformers with on-load tap changers (OLTC) and b) nonlinear loads, in particular air conditioning systems, played a significant role in this event. Motivated by this, we carry out an assessment of the effect of OLTC on voltage stability in the presence of nonlinear loads. By doing this we hope to further shed some light on the potential instability mechanisms that can be triggered in scenarios like the above-mentioned blackout.

[77] arXiv:2509.02591 (replaced) [pdf, html, other]
Title: Ensemble of Pathology Foundation Models for MIDOG 2025 Track 2: Atypical Mitosis Classification
Mieko Ochi, Bae Yuan
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Mitotic figures are classified into typical and atypical variants, with atypical counts correlating strongly with tumor aggressiveness. Accurate differentiation is therefore essential for patient prognostication and resource allocation, yet remains challenging even for expert pathologists. Here, we leveraged Pathology Foundation Models (PFMs) pre-trained on large histopathology datasets and applied parameter-efficient fine-tuning via low-rank adaptation. In addition, we incorporated ConvNeXt V2, a state-of-the-art convolutional neural network architecture, to complement PFMs. During training, we employed a fisheye transform to emphasize mitoses and Fourier Domain Adaptation using ImageNet target images. Finally, we ensembled multiple PFMs to integrate complementary morphological insights, achieving competitive balanced accuracy on the Preliminary Evaluation Phase dataset.

[78] arXiv:2509.02724 (replaced) [pdf, other]
Title: Recall Gabor Communication Theory and Joint Time-Frequency Analysis
Xiang-Gen Xia
Subjects: Signal Processing (eess.SP); Information Theory (cs.IT)

In this article, we first briefly recall Gabor's communication theory, then the Gabor transform and expansion, and finally their connection with joint time-frequency analysis.

[79] arXiv:2509.03013 (replaced) [pdf, html, other]
Title: Speech Intelligibility Assessment with Uncertainty-Aware Whisper Embeddings and sLSTM
Ryandhimas E. Zezario, Dyah A.M.G. Wisnu, Hsin-Min Wang, Yu Tsao
Comments: Accepted to APSIPA ASC 2025
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)

Non-intrusive speech intelligibility prediction remains challenging due to variability in speakers, noise conditions, and subjective perception. We propose an uncertainty-aware approach that leverages Whisper embeddings in combination with statistical features, specifically the mean, standard deviation, and entropy computed across the embedding dimensions. The entropy, computed via a softmax over the feature dimension, serves as a proxy for uncertainty, complementing global information captured by the mean and standard deviation. To model the sequential structure of speech, we adopt a scalar long short-term memory (sLSTM) network, which efficiently captures long-range dependencies. Building on this foundation, we propose iMTI-Net, an improved multi-target intelligibility prediction network that integrates convolutional neural network (CNN) and sLSTM components within a multitask learning framework. It jointly predicts human intelligibility scores and machine-based word error rates (WER) from Google ASR and Whisper. Experimental results show that iMTI-Net outperforms the original MTI-Net across multiple evaluation metrics, demonstrating the effectiveness of incorporating uncertainty-aware features and the CNN-sLSTM architecture.
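
The three statistical features named in this abstract (mean, standard deviation, and softmax-based entropy across the embedding dimension) can be sketched for a single frame as below. This is an illustrative sketch only: the function name `embedding_stats`, the use of the population standard deviation, and the natural-log entropy are assumptions, and the Whisper embedding extraction is not shown.

```python
import math


def embedding_stats(frame):
    """Return (mean, std, entropy) computed across one embedding vector."""
    n = len(frame)
    mean = sum(frame) / n
    # Population standard deviation (an assumption; the abstract does not specify).
    std = math.sqrt(sum((x - mean) ** 2 for x in frame) / n)
    # Softmax over the feature dimension, shifted by the max for stability.
    shift = max(frame)
    exps = [math.exp(x - shift) for x in frame]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Shannon entropy in nats; terms with zero probability contribute nothing.
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return mean, std, entropy
```

A flat vector gives the maximum entropy `log(n)`, while a vector dominated by one large component gives an entropy near zero, which is why the entropy can act as an uncertainty proxy.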

[80] arXiv:2306.17477 (replaced) [pdf, html, other]
Title: Beyond-Voice: Towards Continuous 3D Hand Pose Tracking on Commercial Home Assistant Devices
Yin Li, Rohan Reddy, Cheng Zhang, Rajalakshmi Nandakumar
Comments: Accepted by IPSN 2024
Subjects: Sound (cs.SD); Human-Computer Interaction (cs.HC); Audio and Speech Processing (eess.AS)

The surging popularity of home assistants and their voice user interface (VUI) has made them an ideal central control hub for smart home devices. However, current form factors rely heavily on VUI, which poses accessibility and usability issues; some of the latest models are equipped with additional cameras and displays, which are costly and raise privacy concerns. These concerns jointly motivate Beyond-Voice, a novel high-fidelity acoustic sensing system that allows commodity home assistant devices to track and reconstruct hand poses continuously. It transforms the home assistant into an active sonar system using its existing onboard microphones and speakers. We feed a high-resolution range profile to a deep learning model that can analyze the motions of multiple body parts and predict the 3D positions of 21 finger joints, bringing the granularity of acoustic hand tracking to the next level. It operates across different environments and users without the need for personalized training data. A user study with 11 participants in 3 different environments shows that Beyond-Voice can track joints with an average mean absolute error of 16.47 mm without any training data provided by the testing subject.

[81] arXiv:2401.13980 (replaced) [pdf, html, other]
Title: A Superposition Code-Based Semantic Communication Approach with Quantifiable and Controllable Security
Weixuan Chen, Shuo Shao, Qianqian Yang, Zhaoyang Zhang, Ping Zhang
Subjects: Information Theory (cs.IT); Image and Video Processing (eess.IV)

This paper addresses the challenge of achieving security in semantic communication (SemCom) over a wiretap channel, where a legitimate receiver coexists with an eavesdropper experiencing a poorer channel condition. Despite previous efforts to secure SemCom against eavesdroppers, guarantee of approximately zero information leakage remains an open issue. In this work, we propose a secure SemCom approach based on superposition code, aiming to provide quantifiable and controllable security for digital SemCom systems. The proposed method employs a double-layered constellation map, where semantic information is associated with satellite constellation points and cloud center constellation points are randomly selected. By carefully allocating power between these two layers of constellation, we ensure that the symbol error probability (SEP) of the eavesdropper when decoding satellite constellation points is nearly equivalent to random guessing, while maintaining a low SEP for the legitimate receiver to successfully decode the semantic information. Simulation results demonstrate that the peak signal-to-noise ratio (PSNR) and mean squared error (MSE) of the eavesdropper's reconstructed data, under the proposed method, can range from decoding Gaussian-distributed random noise to approaching the variance of the data. This validates the effectiveness of our method in nearly achieving the experimental upper bound of security for digital SemCom systems when both eavesdroppers and legitimate users utilize identical decoding schemes. Furthermore, the proposed method consistently outperforms benchmark techniques, showcasing superior data security and robustness against eavesdropping. The implementation code is publicly available at: this https URL A-Superposition-Code-Based-Semantic-Communication.

[82] arXiv:2403.10796 (replaced) [pdf, html, other]
Title: CoPlay: Audio-agnostic Cognitive Scaling for Acoustic Sensing
Yin Li, Bo Liu, Rajalakshmi Nandakumar
Comments: ICCCN'25
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)

Acoustic sensing shows great potential in applications such as health monitoring, gesture interfaces, and imaging by leveraging the speakers and microphones on smart devices. However, ongoing research and development in acoustic sensing often overlooks one problem: the same speaker, when used concurrently for sensing and for traditional applications (like playing music), can cause interference in both, making it impractical to use in the real world. The strong ultrasonic sensing signals mixed with music would overload the speaker's mixer. To handle the overloaded signals, current solutions resort to clipping or down-scaling, both of which affect the music playback quality as well as the sensing range and accuracy. To address this challenge, we propose CoPlay, a deep learning based optimization algorithm to cognitively adapt the sensing signal. It can 1) maximize the sensing signal magnitude within the available bandwidth left by the concurrent music to optimize sensing range and accuracy and 2) minimize any consequential frequency distortion that can affect music playback. In this work, we design a deep learning model and test it on common types of sensing signals (sine wave or Frequency Modulated Continuous Wave, FMCW) as inputs with various agnostic concurrent music and speech. First, we evaluated the model performance to show the quality of the generated signals. Then we conducted field studies of downstream acoustic sensing tasks in the real world. A study with 12 users showed that respiration monitoring and gesture recognition using our adapted signal achieve accuracy similar to no-concurrent-music scenarios, while clipping or down-scaling yields worse accuracy. A qualitative study also shows that the music playback quality is not degraded, unlike with traditional clipping or down-scaling methods.
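
The two baseline remedies this abstract contrasts with CoPlay, clipping and down-scaling, can be sketched on a mixed signal as follows. This is a minimal illustration, assuming samples normalized so that the speaker's full-scale range is [-1, 1]; the function names are hypothetical, and CoPlay's learned adaptation is not shown.

```python
def mix(music, sensing):
    """Sum the music and sensing signals sample by sample."""
    return [m + s for m, s in zip(music, sensing)]


def clip(signal, limit=1.0):
    """Hard-limit each sample to [-limit, limit]; distorts the peaks."""
    return [max(-limit, min(limit, x)) for x in signal]


def down_scale(signal, limit=1.0):
    """Scale the whole signal so its peak fits within the limit.

    This keeps the waveform shape but lowers both music and sensing amplitude.
    """
    if not signal:
        return []
    peak = max(abs(x) for x in signal)
    if peak <= limit:
        return list(signal)  # already within range, nothing to do
    gain = limit / peak
    return [x * gain for x in signal]
```

Clipping changes only the samples that overflow, which introduces distortion, whereas down-scaling attenuates every sample, which reduces the sensing signal's magnitude and hence its range.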

[83] arXiv:2411.12736 (replaced) [pdf, html, other]
Title: ACING: Actor-Critic for Instruction Learning in Black-Box LLMs
Salma Kharrat, Fares Fourati, Marco Canini
Comments: Accepted at EMNLP 2025
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)

The effectiveness of Large Language Models (LLMs) in solving tasks depends significantly on the quality of their instructions, which often require substantial human effort to craft. This underscores the need for automated instruction optimization. However, optimizing instructions is particularly challenging when working with black-box LLMs, where model parameters and gradients are inaccessible. We introduce ACING, an actor-critic reinforcement learning framework that formulates instruction optimization as a stateless, continuous-action problem, enabling exploration of infinite instruction spaces using only black-box feedback. ACING automatically discovers prompts that outperform human-written prompts in 76% of instruction-induction tasks, with gains of up to 33 points and a 10-point median improvement over the best automatic baseline in 33 tasks spanning instruction-induction, summarization, and chain-of-thought reasoning. Extensive ablations highlight its robustness and efficiency. An implementation of ACING is available at this https URL.

[84] arXiv:2412.06662 (replaced) [pdf, html, other]
Title: Stochastic LQR Design With Disturbance Preview
Jietian Liu, Laurent Lessard, Peter Seiler
Subjects: Optimization and Control (math.OC); Systems and Control (eess.SY)

This paper considers the discrete-time, stochastic LQR problem with $p$ steps of disturbance preview information where $p$ is finite. We first derive the solution for this problem on a finite horizon with linear, time-varying dynamics and time-varying costs. Next, we derive the solution on the infinite horizon with linear, time-invariant dynamics and time-invariant costs. Our proofs rely on the well-known principle of optimality. We provide an independent proof for the principle of optimality that relies only on nested information structure. Finally, we show that the finite preview controller converges to the optimal noncausal controller as the preview horizon $p$ tends to infinity. We also provide a simple example to illustrate both the finite and infinite horizon results.

[85] arXiv:2504.00244 (replaced) [pdf, html, other]
Title: System Identification from Partial Observations under Adversarial Attacks
Jihun Kim, Javad Lavaei
Comments: 8 pages, 3 figures
Subjects: Optimization and Control (math.OC); Systems and Control (eess.SY)

This paper is concerned with the partially observed linear system identification, where the goal is to obtain reasonably accurate estimation of the balanced truncation of the true system up to order $k$ from output measurements. We consider the challenging case of system identification under adversarial attacks, where the probability of having an attack at each time is $\Theta(1/k)$ while the value of the attack is arbitrary. We first show that the $\ell_1$-norm estimator exactly identifies the true Markov parameter matrix for nilpotent systems under any type of attack. We then build on this result to extend it to general systems and show that the estimation error exponentially decays as $k$ grows. The estimated balanced truncation model accordingly shows an exponentially decaying error for the identification of the true system up to a similarity transformation. This work is the first to provide the input-output analysis of the system with partial observations under arbitrary attacks.

[86] arXiv:2504.05001 (replaced) [pdf, html, other]
Title: SILVIA: Ultra-precision formation flying demonstration for space-based interferometry
Takahiro Ito, Kiwamu Izumi, Isao Kawano, Ikkoh Funaki, Shuichi Sato, Tomotada Akutsu, Kentaro Komori, Mitsuru Musha, Yuta Michimura, Satoshi Satoh, Takuya Iwaki, Kentaro Yokota, Kenta Goto, Katsumi Furukawa, Taro Matsuo, Toshihiro Tsuzuki, Katsuhiko Yamada, Takahiro Sasaki, Taisei Nishishita, Yuki Matsumoto, Chikako Hirose, Wataru Torii, Satoshi Ikari, Koji Nagano, Masaki Ando, Seiji Kawamura, Hidehiro Kaneda, Shinsuke Takeuchi, Shinichiro Sakai
Comments: 10 pages, 6 figures, accepted for publication in Publications of the Astronomical Society of Japan
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Systems and Control (eess.SY); General Relativity and Quantum Cosmology (gr-qc); Instrumentation and Detectors (physics.ins-det)

We propose SILVIA (Space Interferometer Laboratory Voyaging towards Innovative Applications), a mission concept designed to demonstrate ultra-precision formation flying between three spacecraft separated by 100 m. SILVIA aims to achieve sub-micrometer precision in relative distance control by integrating spacecraft sensors, laser interferometry, low-thrust and low-noise micro-propulsion for real-time measurement and control of distances and relative orientations between spacecraft. A 100-meter-scale mission in a near-circular low Earth orbit has been identified as an ideal, cost-effective setting for demonstrating SILVIA, as this configuration maintains a good balance between small relative perturbations and low risk for collision. This mission will fill the current technology gap towards future missions, including gravitational wave observatories such as DECIGO (DECihertz Interferometer Gravitational wave Observatory), designed to detect the primordial gravitational wave background, and high-contrast nulling infrared interferometers like LIFE (Large Interferometer for Exoplanets), designed for direct imaging of thermal emissions from nearby terrestrial planet candidates. The mission concept and its key technologies are outlined, paving the way for the next generation of high-precision space-based observatories.

[87] arXiv:2504.09885 (replaced) [pdf, html, other]
Title: Separate to Collaborate: Dual-Stream Diffusion Model for Coordinated Piano Hand Motion Synthesis
Zihao Liu, Mingwen Ou, Zunnan Xu, Jiaqi Huang, Haonan Han, Ronghui Li, Xiu Li
Comments: 15 pages, 7 figures, Accepted to ACMMM 2025
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)

Automating the synthesis of coordinated bimanual piano performances poses significant challenges, particularly in capturing the intricate choreography between the hands while preserving their distinct kinematic signatures. In this paper, we propose a dual-stream neural framework designed to generate synchronized hand gestures for piano playing from audio input, addressing the critical challenge of modeling both hand independence and coordination. Our framework introduces two key innovations: (i) a decoupled diffusion-based generation framework that independently models each hand's motion via dual-noise initialization, sampling distinct latent noise for each while leveraging a shared positional condition, and (ii) a Hand-Coordinated Asymmetric Attention (HCAA) mechanism that suppresses symmetric (common-mode) noise to highlight asymmetric hand-specific features, while adaptively enhancing inter-hand coordination during denoising. Comprehensive evaluations demonstrate that our framework outperforms existing state-of-the-art methods across multiple metrics. Our project is available at this https URL.

[88] arXiv:2504.10268 (replaced) [pdf, other]
Title: Theoretical Model of Microparticle-Assisted Super-Resolution Microscopy
A. R Bekirov
Subjects: Optics (physics.optics); Image and Video Processing (eess.IV)

We present the first three-dimensional theoretical model of microparticle-assisted super-resolution imaging, enabling accurate simulation of virtual image formation. The model reveals that accounting for partial spatial coherence of illumination is a fundamental prerequisite for achieving super-resolution. We also propose a novel illumination strategy based on suppressing the normal component of incident light, which enhances image contrast and resolution. The results establish a consistent wave-optical framework that reproduces experimentally observed subwavelength imaging and clarifies the underlying physical mechanisms.

[89] arXiv:2504.11411 (replaced) [pdf, html, other]
Title: Breaking the TDD Flow for Over-the-Air Phase Synchronization in Distributed Antenna Systems
Khac-Hoang Ngo, Erik G. Larsson
Comments: accepted to IEEE GLOBECOM 2025
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)

Phase synchronization between distributed antenna arrays requires measurements that break the standard time-division duplex (TDD) operation. We present a feasibility study on implementing such synchronization and analyze its impact on the quality of service. Considering two antenna arrays with independent local oscillators (LOs), we propose a modified TDD flow to accommodate the transmission of phase synchronization signals, formulate the phase estimation and compensation problem, and derive the achievable downlink spectral efficiency (SE). Numerical results show that frequent re-estimation of the inter-array phase disparity is essential for maximizing SE in systems with low-quality LOs. Furthermore, applying a Kalman filter for phase tracking substantially improves the SE, especially if phase estimation errors are large compared to the LO phase drifts.
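
To illustrate the phase-tracking idea mentioned in this abstract, the following is a minimal sketch of a scalar Kalman filter that tracks a phase offset from noisy estimates. The random-walk drift model, the noise variances, the absence of phase wrapping, and the names `track_phase`, `simulate`, and `mse` are all assumptions made for illustration; they are not taken from the paper.

```python
import random

def track_phase(measurements, drift_var, meas_var, init_phase=0.0, init_var=1.0):
    """Scalar Kalman filter for a phase modeled as a random walk.

    Assumed model (not the paper's): phase[k] = phase[k-1] + drift noise,
    measurement[k] = phase[k] + estimation noise. Phase wrapping is ignored.
    """
    phase, var = init_phase, init_var
    estimates = []
    for z in measurements:
        # Predict: a random walk keeps the mean and inflates the variance.
        var += drift_var
        # Update: weight the new measurement by the Kalman gain.
        gain = var / (var + meas_var)
        phase += gain * (z - phase)
        var *= 1.0 - gain
        estimates.append(phase)
    return estimates

def simulate(n=2000, drift_var=1e-4, meas_var=1e-2, seed=0):
    """Generate a synthetic drifting phase and noisy estimates of it."""
    rng = random.Random(seed)
    true_phase, truth, noisy = 0.0, [], []
    for _ in range(n):
        true_phase += rng.gauss(0.0, drift_var ** 0.5)
        truth.append(true_phase)
        noisy.append(true_phase + rng.gauss(0.0, meas_var ** 0.5))
    return truth, noisy

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

truth, noisy = simulate()
filtered = track_phase(noisy, drift_var=1e-4, meas_var=1e-2)
print("raw MSE:", mse(noisy, truth))
print("filtered MSE:", mse(filtered, truth))
```

In this toy setting the estimation noise is much larger than the per-step drift, which is the regime where the abstract reports the largest benefit from tracking.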

[90] arXiv:2504.16960 (replaced) [pdf, html, other]
Title: Can Knowledge Improve Security? A Coding-Enhanced Jamming Approach for Semantic Communication
Weixuan Chen, Qianqian Yang, Shuo Shao, Zhiguo Shi, Jiming Chen, Xuemin (Sherman) Shen
Subjects: Information Theory (cs.IT); Image and Video Processing (eess.IV)

As semantic communication (SemCom) attracts growing attention as a novel communication paradigm, ensuring the security of transmitted semantic information over open wireless channels has become a critical issue. However, traditional encryption methods often introduce significant additional communication overhead to maintain stability, and conventional learning-based secure SemCom methods typically rely on a channel capacity advantage for the legitimate receiver, which is challenging to guarantee in real-world scenarios. In this paper, we propose a coding-enhanced jamming method that eliminates the need to transmit a secret key by utilizing knowledge shared between the legitimate receiver and the transmitter, potentially part of the training set of the SemCom system. Specifically, we leverage the shared private knowledge base to generate a set of private digital codebooks in advance using neural network (NN)-based encoders. For each transmission, we encode the transmitted data into a digital sequence Y1 and associate Y1 with a sequence randomly picked from the private codebook, denoted as Y2, through superposition coding. Here, Y1 serves as the outer code and Y2 as the inner code. By optimizing the power allocation between the inner and outer codes, the legitimate receiver can reconstruct the transmitted data using successive decoding with the index of Y2 shared, while the eavesdropper's decoding performance is severely degraded, potentially to the point of random guessing. Experimental results demonstrate that our method achieves comparable security to state-of-the-art approaches while significantly improving the reconstruction performance of the legitimate receiver by more than 1 dB across varying channel signal-to-noise ratios (SNRs) and compression ratios.
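
The superposition step described in this abstract can be pictured with a toy example. The sketch below replaces the paper's NN-based encoders with random ±1 symbol sequences and uses a fixed power split, so it only illustrates why a receiver that knows Y2 can cancel it while a receiver that does not sees it as jamming. The function names, the power fraction `alpha`, and the noise level are hypothetical.

```python
import math
import random

def superpose(y1, y2, alpha):
    """Combine two +/-1 sequences; alpha is the power fraction given to y1."""
    a, b = math.sqrt(alpha), math.sqrt(1.0 - alpha)
    return [a * s1 + b * s2 for s1, s2 in zip(y1, y2)]

def add_noise(signal, noise_std, rng):
    """Add independent Gaussian noise to every sample."""
    return [s + rng.gauss(0.0, noise_std) for s in signal]

def decode_with_key(received, y2, alpha):
    """Receiver that knows y2: cancel its contribution, then take the sign."""
    b = math.sqrt(1.0 - alpha)
    return [1 if r - b * s2 >= 0 else -1 for r, s2 in zip(received, y2)]

def decode_blind(received):
    """Receiver without y2: take the sign of the raw sample."""
    return [1 if r >= 0 else -1 for r in received]

def error_rate(decoded, truth):
    """Fraction of symbols decoded incorrectly."""
    return sum(d != t for d, t in zip(decoded, truth)) / len(truth)

rng = random.Random(1)
n, alpha, noise_std = 10000, 0.3, 0.1  # illustrative values only
y1 = [rng.choice((-1, 1)) for _ in range(n)]  # stands in for the data sequence
y2 = [rng.choice((-1, 1)) for _ in range(n)]  # stands in for the private sequence
rx = add_noise(superpose(y1, y2, alpha), noise_std, rng)
legit_err = error_rate(decode_with_key(rx, y2, alpha), y1)
eve_err = error_rate(decode_blind(rx), y1)
print("with Y2 known:", legit_err)
print("without Y2:", eve_err)
```

Because Y2 carries more power than Y1 in this toy split, the blind receiver mostly recovers Y2 rather than Y1, so its error rate on Y1 sits near one half.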

[91] arXiv:2504.19149 (replaced) [pdf, other]
Title: A Kinematic and Kinetic Dataset of Lower Limb Joints During Obstacle Crossing in Healthy Young Adults
Jingwen Huang, Shucong Yin, Zhaokai Chen, Hanyang Xu, Chenglong Fu
Subjects: Medical Physics (physics.med-ph); Systems and Control (eess.SY)

Obstacle crossing is an essential component of human locomotion, particularly for individuals with lower limb amputations who face elevated risks of imbalance and falls. While prior studies have explored this task, they often lack a comprehensive examination of kinematic and kinetic changes throughout the entire gait cycle across varying obstacle heights. This study creates a novel dataset collected from ten healthy adults performing obstacle crossing at four different heights (7.5 cm, 15 cm, 22.5 cm, and 30 cm). Kinematic and kinetic data (angles and torques of hip, knee, and ankle) were recorded and analyzed. Results indicate that increased obstacle height leads to a longer swing phase and significant increases in both hip and knee joint angles (1.5* and 1.0*, respectively) and torques. In contrast, ankle joint angles and moments exhibited minimal variation across obstacle heights, indicating a relatively consistent movement strategy at the ankle. Furthermore, significant asymmetries were observed between the dominant and non-dominant foot: the dominant foot demonstrated larger hip and knee joint angles and more consistent ankle behavior, reflecting greater coordination. These findings offer valuable biomechanical insights for improving fall prevention strategies and informing the design of assistive devices such as prostheses and exoskeletons.

[92] arXiv:2505.10823 (replaced) [pdf, html, other]
Title: From Embeddings to Accuracy: Comparing Foundation Models for Radiographic Classification
Xue Li, Jameson Merkow, Noel C. F. Codella, Alberto Santamaria-Pang, Naiteek Sangani, Alexander Ersoy, Christopher Burt, John W. Garrett, Richard J. Bruce, Joshua D. Warner, Tyler Bradshaw, Ivan Tarapov, Matthew P. Lungren, Alan B. McMillan
Comments: 12 pages, 5 figures, 4 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Foundation models provide robust embeddings for diverse tasks, including medical imaging. We evaluate embeddings from seven general and medical-specific foundation models (e.g., DenseNet121, BiomedCLIP, MedImageInsight, Rad-DINO, CXR-Foundation) for training lightweight adapters in multi-class radiography classification. Using a dataset of 8,842 radiographs across seven classes, we trained adapters with algorithms like K-Nearest Neighbors, logistic regression, SVM, random forest, and MLP. The combination of MedImageInsight embeddings with an SVM or MLP adapter achieved the highest mean area under the curve (mAUC) of 93.1%. This performance was statistically superior to other models, including MedSigLIP with an MLP (91.0%), Rad-DINO with an SVM (90.7%), and CXR-Foundation with logistic regression (88.6%). In contrast, models like BiomedCLIP (82.8%) and Med-Flamingo (78.5%) showed lower performance. Crucially, these lightweight adapters are computationally efficient, training in minutes and performing inference in seconds on a CPU, making them practical for clinical use. A fairness analysis of the top-performing MedImageInsight adapter revealed minimal performance disparities across patient gender (within 1.8%) and age groups (std. dev < 1.4%), with no significant statistical differences. These findings confirm that embeddings from specialized foundation models, particularly MedImageInsight, can power accurate, efficient, and equitable diagnostic tools using simple, lightweight adapters.

[93] arXiv:2506.08570 (replaced) [pdf, html, other]
Title: Auto-Regressive vs Flow-Matching: a Comparative Study of Modeling Paradigms for Text-to-Music Generation
Or Tal, Felix Kreuk, Yossi Adi
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Recent progress in text-to-music generation has enabled models to synthesize high-quality musical segments, full compositions, and even respond to fine-grained control signals, e.g. chord progressions. State-of-the-art (SOTA) systems differ significantly in many dimensions, such as training datasets, modeling paradigms, and architectural choices. This diversity complicates efforts to evaluate models fairly and identify which design choices influence performance the most. While factors like data and architecture are important, in this study we focus exclusively on the modeling paradigm. We conduct a systematic empirical analysis to isolate its effects, offering insights into associated trade-offs and emergent behaviors that can guide future text-to-music generation systems. Specifically, we compare the two arguably most common modeling paradigms: auto-regressive decoding and conditional flow-matching. We conduct a controlled comparison by training all models from scratch using identical datasets, training configurations, and similar backbone architectures. Performance is evaluated across multiple axes, including generation quality, robustness to inference configurations, scalability, adherence to both textual and temporally aligned conditioning, and editing capabilities in the form of audio inpainting. This comparative study sheds light on distinct strengths and limitations of each paradigm, providing actionable insights that can inform future architectural and training decisions in the evolving landscape of text-to-music generation. Audio sampled examples are available at: this https URL

[94] arXiv:2507.01587 (replaced) [pdf, html, other]
Title: Towards Controllable Real Image Denoising with Camera Parameters
Youngjin Oh, Junhyeong Kwon, Keuntek Lee, Nam Ik Cho
Comments: Published in 2025 IEEE International Conference on Image Processing (ICIP)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Recent deep learning-based image denoising methods have shown impressive performance; however, many lack the flexibility to adjust the denoising strength based on the noise levels, camera settings, and user preferences. In this paper, we introduce a new controllable denoising framework that adaptively removes noise from images by utilizing information from camera parameters. Specifically, we focus on ISO, shutter speed, and F-number, which are closely related to noise levels. We convert these selected parameters into a vector to control and enhance the performance of the denoising network. Experimental results show that our method seamlessly adds controllability to standard denoising neural networks and improves their performance. Code is available at this https URL.
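
As a rough illustration of turning camera parameters into a conditioning vector, the sketch below maps ISO, shutter speed, and F-number onto a log scale normalized to [0, 1]. The abstract does not say how the vector is built, so the log-scale choice, the parameter ranges in `RANGES`, and the function names are hypothetical.

```python
import math

# Hypothetical parameter ranges, used only to normalize values in this sketch.
RANGES = {
    "iso": (50.0, 102400.0),
    "shutter_s": (1.0 / 8000.0, 30.0),
    "f_number": (1.0, 32.0),
}

def log_normalize(value, lo, hi):
    """Clamp value to [lo, hi] and map it to [0, 1] on a log2 scale."""
    v = min(max(value, lo), hi)
    return (math.log2(v) - math.log2(lo)) / (math.log2(hi) - math.log2(lo))

def camera_vector(iso, shutter_s, f_number):
    """Build a 3-element conditioning vector from the camera settings."""
    return [
        log_normalize(iso, *RANGES["iso"]),
        log_normalize(shutter_s, *RANGES["shutter_s"]),
        log_normalize(f_number, *RANGES["f_number"]),
    ]

vec = camera_vector(iso=3200, shutter_s=1.0 / 60.0, f_number=2.8)
print(vec)
```

A log scale is used here because these settings are conventionally adjusted in stops, where each step doubles or halves the exposure; a denoising network could consume such a vector as an extra input, though how the authors do this is not stated in the abstract.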

[95] arXiv:2509.00813 (replaced) [pdf, html, other]
Title: AImoclips: A Benchmark for Evaluating Emotion Conveyance in Text-to-Music Generation
Gyehun Go, Satbyul Han, Ahyeon Choi, Eunjin Choi, Juhan Nam, Jeong Mi Park
Comments: to be published in HCMIR25: 3rd Workshop on Human-Centric Music Information Research
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)

Recent advances in text-to-music (TTM) generation have enabled controllable and expressive music creation using natural language prompts. However, the emotional fidelity of TTM systems remains largely underexplored compared to human preference or text alignment. In this study, we introduce AImoclips, a benchmark for evaluating how well TTM systems convey intended emotions to human listeners, covering both open-source and commercial models. We selected 12 emotion intents spanning four quadrants of the valence-arousal space, and used six state-of-the-art TTM systems to generate over 1,000 music clips. A total of 111 participants rated the perceived valence and arousal of each clip on a 9-point Likert scale. Our results show that commercial systems tend to produce music perceived as more pleasant than intended, while open-source systems tend to perform the opposite. Emotions are more accurately conveyed under high-arousal conditions across all models. Additionally, all systems exhibit a bias toward emotional neutrality, highlighting a key limitation in affective controllability. This benchmark offers valuable insights into model-specific emotion rendering characteristics and supports future development of emotionally aligned TTM systems.

[96] arXiv:2509.01153 (replaced) [pdf, html, other]
Title: EZhouNet: A framework based on graph neural network and anchor interval for the respiratory sound event detection
Yun Chu, Qiuhao Wang, Enze Zhou, Qian Liu, Gang Zheng
Journal-ref: Biomedical Signal Processing and Control 2026-02 | Journal article
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)

Auscultation is a key method for early diagnosis of respiratory and pulmonary diseases, relying on skilled healthcare professionals. However, the process is often subjective, with variability between experts. As a result, numerous deep learning-based automatic classification methods have emerged, most of which focus on respiratory sound classification. In contrast, research on respiratory sound event detection remains limited. Existing sound event detection methods typically rely on frame-level predictions followed by post-processing to generate event-level outputs, making interval boundaries challenging to learn directly. Furthermore, many approaches can only handle fixed-length audio, limiting their applicability to variable-length respiratory sounds. Additionally, the impact of respiratory sound location information on detection performance has not been extensively explored. To address these issues, we propose a graph neural network-based framework with anchor intervals, capable of handling variable-length audio and providing more precise temporal localization for abnormal respiratory sound events. Our method improves both the flexibility and applicability of respiratory sound detection. Experiments on the SPRSound 2024 and HF Lung V1 datasets demonstrate the effectiveness of the proposed approach, and incorporating respiratory position information enhances the discrimination between abnormal sounds. The reference implementation is available at this https URL.

[97] arXiv:2509.02020 (replaced) [pdf, html, other]
Title: FireRedTTS-2: Towards Long Conversational Speech Generation for Podcast and Chatbot
Kun Xie, Feiyu Shen, Junjie Li, Fenglong Xie, Xu Tang, Yao Hu
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)

Current dialogue generation approaches typically require the complete dialogue text before synthesis and produce a single, inseparable speech containing all voices, making them unsuitable for interactive chat; moreover, they suffer from unstable synthesis, inaccurate speaker transitions, and incoherent prosody. In this work, we present FireRedTTS-2, a long-form streaming TTS system for multi-speaker dialogue generation, delivering stable, natural speech with reliable speaker switching and context-aware prosody. A new 12.5Hz streaming speech tokenizer accelerates training and inference, extends maximum dialogue length, encodes richer semantics to stabilize text-to-token modeling and supports high-fidelity streaming generation for real-time applications. We adopt a text-speech interleaved format, concatenating speaker-labeled text with aligned speech tokens in chronological order, and model it with a dual-transformer: a large decoder-only transformer predicts tokens at the first layer, and a smaller one completes subsequent layers. Experimental results show that FireRedTTS-2 integrates seamlessly with chat frameworks and, with minimal fine-tuning, produces emotionally expressive speech guided by implicit contextual cues. In podcast generation, it surpasses existing systems including MoonCast, Zipvoice-Dialogue, and MOSS-TTSD in objective intelligibility, speaker-turn reliability, and perceived naturalness with context-consistent prosody. Our demos are available at this https URL.
