Search | arXiv e-print repository

arXiv:2507.01766 [pdf, ps, other]

Reconfigurable Intelligent Surface aided Integrated-Navigation-and-Communication in Urban Canyons: A Satellite Selection Approach

Authors: Tianwei Hou, Da Guan, Xin Sun, Anna Li, Wenqiang Yi, Yuanwei Liu, Arumugam Nallanathan

Abstract: This study investigates the application of a simultaneous transmitting and reflecting reconfigurable intelligent surface (STAR-RIS)-aided medium-Earth-orbit (MEO) satellite network for providing both global positioning services and communication services in the urban canyons, where the direct satellite-user links are obstructed. Superposition coding (SC) and successive interference cancellation (S… ▽ More This study investigates the application of a simultaneous transmitting and reflecting reconfigurable intelligent surface (STAR-RIS)-aided medium-Earth-orbit (MEO) satellite network for providing both global positioning services and communication services in the urban canyons, where the direct satellite-user links are obstructed. Superposition coding (SC) and successive interference cancellation (SIC) techniques are utilized for the integrated navigation and communication (INAC) networks, and the composed navigation and communication signals are reflected or transmitted to ground users or indoor users located in urban canyons. To meet diverse application needs, navigation-oriented (NO)-INAC and communication-oriented (CO)-INAC have been developed, each tailored according to distinct power allocation factors. We then proposed two algorithms, namely navigation-prioritized-algorithm (NPA) and communication-prioritized-algorithm (CPA), to improve the navigation or communication performance by selecting the satellite with the optimized position dilution of precision (PDoP) or with the best channel gain. The effectiveness of the proposed STAR-RIS-aided INAC network is quantified by analyzing the positioning error for navigation services and by evaluating communication performance through achievable ergodic rate metrics. Our satellite selection approach indicates that: the positioning services at the urban canyon users can be completed with the aid of STAR-RIS. 2) Additionally, it is observed that while a single STAR-RIS array can extend the navigational link, it fails to serve users in indoor scenarios, highlighting a limitation in the current system design. △ Less

Submitted 2 July, 2025; originally announced July 2025.

arXiv:2506.19266 [pdf]

Convergent and divergent connectivity patterns of the arcuate fasciculus in macaques and humans

Authors: Jiahao Huang, Ruifeng Li, Wenwen Yu, Anan Li, Xiangning Li, Mingchao Yan, Lei Xie, Qingrun Zeng, Xueyan Jia, Shuxin Wang, Ronghui Ju, Feng Chen, Qingming Luo, Hui Gong, Andrew Zalesky, Xiaoquan Yang, Yuanjing Feng, Zheng Wang

Abstract: The organization and connectivity of the arcuate fasciculus (AF) in nonhuman primates remain contentious, especially concerning how its anatomy diverges from that of humans. Here, we combined cross-scale single-neuron tracing - using viral-based genetic labeling and fluorescence micro-optical sectioning tomography in macaques (n = 4; age 3 - 11 years) - with whole-brain tractography from 11.7T dif… ▽ More The organization and connectivity of the arcuate fasciculus (AF) in nonhuman primates remain contentious, especially concerning how its anatomy diverges from that of humans. Here, we combined cross-scale single-neuron tracing - using viral-based genetic labeling and fluorescence micro-optical sectioning tomography in macaques (n = 4; age 3 - 11 years) - with whole-brain tractography from 11.7T diffusion MRI. Complemented by spectral embedding analysis of 7.0T MRI in humans, we performed a comparative connectomic analysis of the AF across species. We demonstrate that the macaque AF originates in the temporal-parietal cortex, traverses the auditory cortex and parietal operculum, and projects into prefrontal regions. In contrast, the human AF exhibits greater expansion into the middle temporal gyrus and stronger prefrontal and parietal operculum connectivity - divergences quantified by Kullback-Leibler analysis that likely underpin the evolutionary specialization of human language networks. These interspecies differences - particularly the human AF's broader temporal integration and strengthened frontoparietal linkages - suggest a connectivity-based substrate for the emergence of advanced language processing unique to humans. Furthermore, our findings offer a neuroanatomical framework for understanding AF-related disorders such as aphasia and dyslexia, where aberrant connectivity disrupts language function. △ Less

Submitted 2 July, 2025; v1 submitted 23 June, 2025; originally announced June 2025.

Comments: 34 pages, 6 figures

arXiv:2506.17184 [pdf, ps, other]

Judo: A User-Friendly Open-Source Package for Sampling-Based Model Predictive Control

Authors: Albert H. Li, Brandon Hung, Aaron D. Ames, Jiuguang Wang, Simon Le Cleac'h, Preston Culbertson

Abstract: Recent advancements in parallel simulation and successful robotic applications are spurring a resurgence in sampling-based model predictive control. To build on this progress, however, the robotics community needs common tooling for prototyping, evaluating, and deploying sampling-based controllers. We introduce Judo, a software package designed to address this need. To facilitate rapid prototyping… ▽ More Recent advancements in parallel simulation and successful robotic applications are spurring a resurgence in sampling-based model predictive control. To build on this progress, however, the robotics community needs common tooling for prototyping, evaluating, and deploying sampling-based controllers. We introduce Judo, a software package designed to address this need. To facilitate rapid prototyping and evaluation, Judo provides robust implementations of common sampling-based MPC algorithms and standardized benchmark tasks. It further emphasizes usability with simple but extensible interfaces for controller and task definitions, asynchronous execution for straightforward simulation-to-hardware transfer, and a highly customizable interactive GUI for tuning controllers interactively. While written in Python, the software leverages MuJoCo as its physics backend to achieve real-time performance, which we validate across both consumer and server-grade hardware. Code at https://github.com/bdaiinstitute/judo. △ Less

Submitted 20 June, 2025; originally announced June 2025.

Comments: Accepted at the 2025 RSS Workshop on Fast Motion Planning and Control in the Era of Parallelism. 5 Pages

arXiv:2506.16733 [pdf]

A Prior-Guided Joint Diffusion Model in Projection Domain for PET Tracer Conversion

Authors: Fang Chen, Weifeng Zhang, Xingyu Ai, BingXuan Li, An Li, Qiegen Liu

Abstract: Positron emission tomography (PET) is widely used to assess metabolic activity, but its application is limited by the availability of radiotracers. 18F-labeled fluorodeoxyglucose (18F-FDG) is the most commonly used tracer but shows limited effectiveness for certain tumors. In contrast, 6-18F-fluoro-3,4-dihydroxy-L-phenylalanine (18F-DOPA) offers higher specificity for neuroendocrine tumors and neu… ▽ More Positron emission tomography (PET) is widely used to assess metabolic activity, but its application is limited by the availability of radiotracers. 18F-labeled fluorodeoxyglucose (18F-FDG) is the most commonly used tracer but shows limited effectiveness for certain tumors. In contrast, 6-18F-fluoro-3,4-dihydroxy-L-phenylalanine (18F-DOPA) offers higher specificity for neuroendocrine tumors and neurological disorders. However, the complexity of its synthesis process and constraints on transportation time have limited its clinical application. Among different forms of raw data acquired by the scanner, sinogram is a commonly used representation in PET imaging. Therefore, modeling in projection domain enables more direct utilization of the original information, potentially reducing the accumulation errors during the image reconstruction process. Inspired by these factors, this study proposes a prior-guided joint diffusion model (PJDM) for transforming 18F-FDG PET sinograms into 18F-DOPA PET sinograms. During inference, an initial synthetic 18F-DOPA PET sinogram is first generated using a higher-order hybrid sampler. This sinogram is then degraded and serves as an additional condition to guide the iterative refinement process. Experimental results demonstrated that PJDM effectively improved both sinogram quality and the final synthetic outcomes. The code is available at: https://github.com/yqx7150/PJDM. △ Less

Submitted 22 June, 2025; v1 submitted 20 June, 2025; originally announced June 2025.

arXiv:2506.16537 [pdf]

Agile, Autonomous Spacecraft Constellations with Disruption Tolerant Networking to Monitor Precipitation and Urban Floods

Authors: Sreeja Roy-Singh, Alan P. Li, Vinay Ravindra, Roderick Lammers, Marc Sanchez Net

Abstract: Fully re-orientable small spacecraft are now supported by commercial technologies, allowing them to point their instruments in any direction and capture images, with short notice. When combined with improved onboard processing, and implemented on a constellation of inter-communicable satellites, this intelligent agility can significantly increase responsiveness to transient or evolving phenomena.… ▽ More Fully re-orientable small spacecraft are now supported by commercial technologies, allowing them to point their instruments in any direction and capture images, with short notice. When combined with improved onboard processing, and implemented on a constellation of inter-communicable satellites, this intelligent agility can significantly increase responsiveness to transient or evolving phenomena. We demonstrate a ground-based and onboard algorithmic framework that combines orbital mechanics, attitude control, inter-satellite communication, intelligent prediction and planning to schedule the time-varying, re-orientation of agile, small satellites in a constellation. Planner intelligence is improved by updating the predictive value of future space-time observations based on shared observations of evolving episodic precipitation and urban flood forecasts. Reliable inter-satellite communication within a fast, dynamic constellation topology is modeled in the physical, access control and network layer. We apply the framework on a representative 24-satellite constellation observing 5 global regions. Results show appropriately low latency in information exchange (average within 1/3rd available time for implicit consensus), enabling the onboard scheduler to observe ~7% more flood magnitude than a ground-based implementation. Both onboard and offline versions performed ~98% better than constellations without agility. △ Less

Submitted 23 June, 2025; v1 submitted 19 June, 2025; originally announced June 2025.

Journal ref: Robotics Science and Systems (RSS 2025) - Space Robotics Workshop

arXiv:2506.05171 [pdf, other]

Towards provable probabilistic safety for scalable embodied AI systems

Authors: Linxuan He, Qing-Shan Jia, Ang Li, Hongyan Sang, Ling Wang, Jiwen Lu, Tao Zhang, Jie Zhou, Yi Zhang, Yisen Wang, Peng Wei, Zhongyuan Wang, Henry X. Liu, Shuo Feng

Abstract: Embodied AI systems, comprising AI models and physical plants, are increasingly prevalent across various applications. Due to the rarity of system failures, ensuring their safety in complex operating environments remains a major challenge, which severely hinders their large-scale deployment in safety-critical domains, such as autonomous vehicles, medical devices, and robotics. While achieving prov… ▽ More Embodied AI systems, comprising AI models and physical plants, are increasingly prevalent across various applications. Due to the rarity of system failures, ensuring their safety in complex operating environments remains a major challenge, which severely hinders their large-scale deployment in safety-critical domains, such as autonomous vehicles, medical devices, and robotics. While achieving provable deterministic safety--verifying system safety across all possible scenarios--remains theoretically ideal, the rarity and complexity of corner cases make this approach impractical for scalable embodied AI systems. To address this challenge, we introduce provable probabilistic safety, which aims to ensure that the residual risk of large-scale deployment remains below a predefined threshold. Instead of attempting exhaustive safety proof across all corner cases, this paradigm establishes a probabilistic safety boundary on overall system performance, leveraging statistical methods to enhance feasibility and scalability. A well-defined probabilistic safety boundary enables embodied AI systems to be deployed at scale while allowing for continuous refinement of safety guarantees. Our work focuses on three core questions: what is provable probabilistic safety, how to prove the probabilistic safety, and how to achieve the provable probabilistic safety. By bridging the gap between theoretical safety assurance and practical deployment, our work offers a pathway toward safer, large-scale adoption of embodied AI systems in safety-critical applications. △ Less

Submitted 5 June, 2025; originally announced June 2025.

arXiv:2505.22923 [pdf, ps, other]

Plug-and-Play Posterior Sampling for Blind Inverse Problems

Authors: Anqi Li, Weijie Gan, Ulugbek S. Kamilov

Abstract: We introduce Blind Plug-and-Play Diffusion Models (Blind-PnPDM) as a novel framework for solving blind inverse problems where both the target image and the measurement operator are unknown. Unlike conventional methods that rely on explicit priors or separate parameter estimation, our approach performs posterior sampling by recasting the problem into an alternating Gaussian denoising scheme. We lev… ▽ More We introduce Blind Plug-and-Play Diffusion Models (Blind-PnPDM) as a novel framework for solving blind inverse problems where both the target image and the measurement operator are unknown. Unlike conventional methods that rely on explicit priors or separate parameter estimation, our approach performs posterior sampling by recasting the problem into an alternating Gaussian denoising scheme. We leverage two diffusion models as learned priors: one to capture the distribution of the target image and another to characterize the parameters of the measurement operator. This PnP integration of diffusion models ensures flexibility and ease of adaptation. Our experiments on blind image deblurring show that Blind-PnPDM outperforms state-of-the-art methods in terms of both quantitative metrics and visual fidelity. Our results highlight the effectiveness of treating blind inverse problems as a sequence of denoising subproblems while harnessing the expressive power of diffusion-based priors. △ Less

Submitted 28 May, 2025; originally announced May 2025.

Comments: arXiv admin note: text overlap with arXiv:2305.12672

arXiv:2505.15200 [pdf, ps, other]

Performance Analysis of Fluid Antenna System under Spatially-Correlated Rician Fading Channels

Authors: Jiangsheng Huangfu, Zhengyu Song, Tianwei Hou, Anna Li, Yuanwei Liu, Arumugam Nallanathan, Kai-Kit Wong

Abstract: Fluid antenna systems (FAS) are among the most promising technologies for the sixth generation (6G) mobile communication networks. Unlike traditional fixed-position multiple-input multiple-output (MIMO) systems, a FAS possesses position reconfigurability to switch on-demand among $N$ predefined ports over a prescribed space. This paper explores the performance of a single-input single-output (SISO… ▽ More Fluid antenna systems (FAS) are among the most promising technologies for the sixth generation (6G) mobile communication networks. Unlike traditional fixed-position multiple-input multiple-output (MIMO) systems, a FAS possesses position reconfigurability to switch on-demand among $N$ predefined ports over a prescribed space. This paper explores the performance of a single-input single-output (SISO) model with a fixed-position antenna transmitter and a single-antenna FAS receiver, referred to as the Rx-SISO-FAS model, under spatially-correlated Rician fading channels. Our contributions include exact expressions and closed-form bounds for the outage probability of the Rx-SISO-FAS model, as well as exact and closed-form lower bounds for the ergodic rate. Importantly, we also analyze the performance considering both uniform linear array (ULA) and uniform planar array (UPA) configurations for the ports of the FAS. To gain insights, we evaluate the diversity order of the proposed model and our analytical results indicate that with a fixed overall system size, increasing the number of ports, $N$, significantly decreases the outage performance of FAS under different Rician fading factors. Our numerical results further demonstrate that: $i)$ the Rx-SISO-FAS model can enhance performance under spatially-correlated Rician fading channels over the fixed-position antenna counterpart; $ii)$ the Rician factor negatively impacts performance in the low signal-to-noise ratio (SNR) regime; $iii$) FAS can outperform an $L$ branches maximum ratio combining (MRC) system under Rician fading channels; and $iv)$ when the number of ports is identical, UPA outperforms ULA. △ Less

Submitted 21 May, 2025; originally announced May 2025.

arXiv:2505.10786 [pdf, ps, other]

Bridging BCI and Communications: A MIMO Framework for EEG-to-ECoG Wireless Channel Modeling

Authors: Jiaheng Wang, Zhenyu Wang, Tianheng Xu, Yuan Si, Ang Li, Ting Zhou, Xi Zhao, Honglin Hu

Abstract: As a method to connect human brain and external devices, Brain-computer interfaces (BCIs) are receiving extensive research attention. Recently, the integration of communication theory with BCI has emerged as a popular trend, offering potential to enhance system performance and shape next-generation communications. A key challenge in this field is modeling the brain wireless communication channel… ▽ More As a method to connect human brain and external devices, Brain-computer interfaces (BCIs) are receiving extensive research attention. Recently, the integration of communication theory with BCI has emerged as a popular trend, offering potential to enhance system performance and shape next-generation communications. A key challenge in this field is modeling the brain wireless communication channel between intracranial electrocorticography (ECoG) emitting neurons and extracranial electroencephalography (EEG) receiving electrodes. However, the complex physiology of brain challenges the application of traditional channel modeling methods, leaving relevant research in its infancy. To address this gap, we propose a frequency-division multiple-input multiple-output (MIMO) estimation framework leveraging simultaneous macaque EEG and ECoG recordings, while employing neurophysiology-informed regularization to suppress noise interference. This approach reveals profound similarities between neural signal propagation and multi-antenna communication systems. Experimental results show improved estimation accuracy over conventional methods while highlighting a trade-off between frequency resolution and temporal stability determined by signal duration. This work establish a conceptual bridge between neural interfacing and communication theory, accelerating synergistic developments in both fields. △ Less

Submitted 15 May, 2025; originally announced May 2025.

arXiv:2503.22837 [pdf, other]

A Cooperative Compliance Control Framework for Socially Optimal Mixed Traffic Routing

Authors: Anni Li, Ting Bai, Yingqing Chen, Christos G. Cassandras, Andreas A. Malikopoulos

Abstract: In mixed traffic environments, where Connected and Autonomed Vehicles (CAVs) coexist with potentially non-cooperative Human-Driven Vehicles (HDVs), the self-centered behavior of human drivers may compromise the efficiency, optimality, and safety of the overall traffic network. In this paper, we propose a Cooperative Compliance Control (CCC) framework for mixed traffic routing, where a Social Plann… ▽ More In mixed traffic environments, where Connected and Autonomed Vehicles (CAVs) coexist with potentially non-cooperative Human-Driven Vehicles (HDVs), the self-centered behavior of human drivers may compromise the efficiency, optimality, and safety of the overall traffic network. In this paper, we propose a Cooperative Compliance Control (CCC) framework for mixed traffic routing, where a Social Planner (SP) optimizes vehicle routes for system-wide optimality while a compliance controller incentivizes human drivers to align their behavior with route guidance from the SP through a "refundable toll" scheme. A key challenge arises from the heterogeneous and unknown response models of different human driver types to these tolls, making it difficult to design a proper controller and achieve desired compliance probabilities over the traffic network. To address this challenge, we employ Control Lyapunov Functions (CLFs) to adaptively correct (learn) crucial components of our compliance probability model online, construct data-driven feedback controllers, and demonstrate that we can achieve the desired compliance probability for HDVs, thereby contributing to the social optimality of the traffic network. △ Less

Submitted 28 March, 2025; originally announced March 2025.

arXiv:2503.12936 [pdf, other]

FNSE-SBGAN: Far-field Speech Enhancement with Schrodinger Bridge and Generative Adversarial Networks

Authors: Tong Lei, Qinwen Hu, Ziyao Lin, Andong Li, Rilin Chen, Meng Yu, Dong Yu, Jing Lu

Abstract: The prevailing method for neural speech enhancement predominantly utilizes fully-supervised deep learning with simulated pairs of far-field noisy-reverberant speech and clean speech. Nonetheless, these models frequently demonstrate restricted generalizability to mixtures recorded in real-world conditions. To address this issue, this study investigates training enhancement models directly on real m… ▽ More The prevailing method for neural speech enhancement predominantly utilizes fully-supervised deep learning with simulated pairs of far-field noisy-reverberant speech and clean speech. Nonetheless, these models frequently demonstrate restricted generalizability to mixtures recorded in real-world conditions. To address this issue, this study investigates training enhancement models directly on real mixtures. Specifically, we revisit the single-channel far-field to near-field speech enhancement (FNSE) task, focusing on real-world data characterized by low signal-to-noise ratio (SNR), high reverberation, and mid-to-high frequency attenuation. We propose FNSE-SBGAN, a framework that integrates a Schrodinger Bridge (SB)-based diffusion model with generative adversarial networks (GANs). Our approach achieves state-of-the-art performance across various metrics and subjective evaluations, significantly reducing the character error rate (CER) by up to 14.58% compared to far-field signals. Experimental results demonstrate that FNSE-SBGAN preserves superior subjective quality and establishes a new benchmark for real-world far-field speech enhancement. Additionally, we introduce an evaluation framework leveraging matrix rank analysis in the time-frequency domain, providing systematic insights into model performance and revealing the strengths and weaknesses of different generative methods. △ Less

Submitted 15 April, 2025; v1 submitted 17 March, 2025; originally announced March 2025.

Comments: 13 pages, 6 figures

arXiv:2503.12601 [pdf, other]

Routing Guidance for Emerging Transportation Systems with Improved Dynamic Trip Equity

Authors: Ting Bai, Anni Li, Gehui Xu, Christos G. Cassandras, Andreas A. Malikopoulos

Abstract: In this paper, we present a dynamic routing guidance system that optimizes route recommendations for individual vehicles within an emerging transportation system while enhancing travelers' trip equity. We develop a framework to quantify trip quality and equity in a dynamic travel environment, providing new insights into how routing guidance influences equity in road transportation. Our approach en… ▽ More In this paper, we present a dynamic routing guidance system that optimizes route recommendations for individual vehicles within an emerging transportation system while enhancing travelers' trip equity. We develop a framework to quantify trip quality and equity in a dynamic travel environment, providing new insights into how routing guidance influences equity in road transportation. Our approach enables real-time routing by incorporating both monitored and anticipated traffic congestion. We provide conditions that ensure achieving perfect trip equity for all travelers in a free-flow network. Finally, simulation studies on 1,000 vehicles traversing an urban road network in Boston demonstrate that our proposed method improves trip equity by approximately 11.4\% compared to the shortest-route strategy. In addition, the results reveal that our approach redistributes travel costs across vehicle types through route optimization, contributing to a more equitable transportation system. △ Less

Submitted 1 April, 2025; v1 submitted 16 March, 2025; originally announced March 2025.

arXiv:2503.12233 [pdf, ps, other]

Robust Full-Space Physical Layer Security for STAR-RIS-Aided Wireless Networks: Eavesdropper with Uncertain Location and Channel

Authors: Han Xiao, Xiaoyan Hu, Ang Li, Wenjie Wang, Kun Yang

Abstract: A robust full-space physical layer security (PLS) transmission scheme is proposed in this paper considering the full-space wiretapping challenge of wireless networks supported by simultaneous transmitting and reflecting reconfigurable intelligent surface (STAR-RIS). Different from the existing schemes, the proposed PLS scheme takes account of the uncertainty on the eavesdropper's position within t… ▽ More A robust full-space physical layer security (PLS) transmission scheme is proposed in this paper considering the full-space wiretapping challenge of wireless networks supported by simultaneous transmitting and reflecting reconfigurable intelligent surface (STAR-RIS). Different from the existing schemes, the proposed PLS scheme takes account of the uncertainty on the eavesdropper's position within the 360$^\circ$ service area offered by the STAR-RIS. Specifically, the large system analytical method is utilized to derive the asymptotic expression of the average security rate achieved by the security user, considering that the base station (BS) only has the statistical information of the eavesdropper's channel state information (CSI) and the uncertainty of its location. To evaluate the effectiveness of the proposed PLS scheme, we first formulate an optimization problem aimed at maximizing the weighted sum rate of the security user and the public user. This optimization is conducted under the power allocation constraint, and some practical limitations for STAR-RIS implementation, through jointly designing the active and passive beamforming variables. A novel iterative algorithm based on the minimum mean-square error (MMSE) and cross-entropy optimization (CEO) methods is proposed to effectively address the established non-convex optimization problem with discrete variables. Simulation results indicate that the proposed robust PLS scheme can effectively mitigate the information leakage across the entire coverage area of the STAR-RIS-assisted system, leading to superior performance gain when compared to benchmark schemes encompassing traditional RIS-aided scheme. △ Less

Submitted 15 March, 2025; originally announced March 2025.

arXiv:2503.00510 [pdf, other]

NeuroSymAD: A Neuro-Symbolic Framework for Interpretable Alzheimer's Disease Diagnosis

Authors: Yexiao He, Ziyao Wang, Yuning Zhang, Tingting Dan, Tianlong Chen, Guorong Wu, Ang Li

Abstract: Alzheimer's disease (AD) diagnosis is complex, requiring the integration of imaging and clinical data for accurate assessment. While deep learning has shown promise in brain MRI analysis, it often functions as a black box, limiting interpretability and lacking mechanisms to effectively integrate critical clinical data such as biomarkers, medical history, and demographic information. To bridge this… ▽ More Alzheimer's disease (AD) diagnosis is complex, requiring the integration of imaging and clinical data for accurate assessment. While deep learning has shown promise in brain MRI analysis, it often functions as a black box, limiting interpretability and lacking mechanisms to effectively integrate critical clinical data such as biomarkers, medical history, and demographic information. To bridge this gap, we propose NeuroSymAD, a neuro-symbolic framework that synergizes neural networks with symbolic reasoning. A neural network percepts brain MRI scans, while a large language model (LLM) distills medical rules to guide a symbolic system in reasoning over biomarkers and medical history. This structured integration enhances both diagnostic accuracy and explainability. Experiments on the ADNI dataset demonstrate that NeuroSymAD outperforms state-of-the-art methods by up to 2.91% in accuracy and 3.43% in F1-score while providing transparent and interpretable diagnosis. △ Less

Submitted 1 March, 2025; originally announced March 2025.

arXiv:2502.13395 [pdf]

Unsupervised CP-UNet Framework for Denoising DAS Data with Decay Noise

Authors: Tianye Huang, Aopeng Li, Xiang Li, Jing Zhang, Sijing Xian, Qi Zhang, Mingkong Lu, Guodong Chen, Liangming Xiong, Xiangyun Hu

Abstract: Distributed acoustic sensor (DAS) technology leverages optical fiber cables to detect acoustic signals, providing cost-effective and dense monitoring capabilities. It offers several advantages including resistance to extreme conditions, immunity to electromagnetic interference, and accurate detection. However, DAS typically exhibits a lower signal-to-noise ratio (S/N) compared to geophones and is… ▽ More Distributed acoustic sensor (DAS) technology leverages optical fiber cables to detect acoustic signals, providing cost-effective and dense monitoring capabilities. It offers several advantages including resistance to extreme conditions, immunity to electromagnetic interference, and accurate detection. However, DAS typically exhibits a lower signal-to-noise ratio (S/N) compared to geophones and is susceptible to various noise types, such as random noise, erratic noise, level noise, and long-period noise. This reduced S/N can negatively impact data analyses containing inversion and interpretation. While artificial intelligence has demonstrated excellent denoising capabilities, most existing methods rely on supervised learning with labeled data, which imposes stringent requirements on the quality of the labels. To address this issue, we develop a label-free unsupervised learning (UL) network model based on Context-Pyramid-UNet (CP-UNet) to suppress erratic and random noises in DAS data. The CP-UNet utilizes the Context Pyramid Module in the encoding and decoding process to extract features and reconstruct the DAS data. To enhance the connectivity between shallow and deep features, we add a Connected Module (CM) to both encoding and decoding section. Layer Normalization (LN) is utilized to replace the commonly employed Batch Normalization (BN), accelerating the convergence of the model and preventing gradient explosion during training. Huber-loss is adopted as our loss function whose parameters are experimentally determined. We apply the network to both the 2-D synthetic and filed data. Comparing to traditional denoising methods and the latest UL framework, our proposed method demonstrates superior noise reduction performance. △ Less

Submitted 18 February, 2025; originally announced February 2025.

Comments: 13 pages, 8 figures

arXiv:2502.12412 [pdf, other]

Incomplete Graph Learning: A Comprehensive Survey

Authors: Riting Xia, Huibo Liu, Anchen Li, Xueyan Liu, Yan Zhang, Chunxu Zhang, Bo Yang

Abstract: Graph learning is a prevalent field that operates on ubiquitous graph data. Effective graph learning methods can extract valuable information from graphs. However, these methods are non-robust and affected by missing attributes in graphs, resulting in sub-optimal outcomes. This has led to the emergence of incomplete graph learning, which aims to process and learn from incomplete graphs to achieve… ▽ More Graph learning is a prevalent field that operates on ubiquitous graph data. Effective graph learning methods can extract valuable information from graphs. However, these methods are non-robust and affected by missing attributes in graphs, resulting in sub-optimal outcomes. This has led to the emergence of incomplete graph learning, which aims to process and learn from incomplete graphs to achieve more accurate and representative results. In this paper, we conducted a comprehensive review of the literature on incomplete graph learning. Initially, we categorize incomplete graphs and provide precise definitions of relevant concepts, terminologies, and techniques, thereby establishing a solid understanding for readers. Subsequently, we classify incomplete graph learning methods according to the types of incompleteness: (1) attribute-incomplete graph learning methods, (2) attribute-missing graph learning methods, and (3) hybrid-absent graph learning methods. By systematically classifying and summarizing incomplete graph learning methods, we highlight the commonalities and differences among existing approaches, aiding readers in selecting methods and laying the groundwork for further advancements. In addition, we summarize the datasets, incomplete processing modes, evaluation metrics, and application domains used by the current methods. Lastly, we discuss the current challenges and propose future directions for incomplete graph learning, with the aim of stimulating further innovations in this crucial field. To our knowledge, this is the first review dedicated to incomplete graph learning, aiming to offer valuable insights for researchers in related fields.We developed an online resource to follow relevant research based on this review, available at https://github.com/cherry-a11y/Incomplete-graph-learning.git △ Less

Submitted 17 February, 2025; originally announced February 2025.

arXiv:2502.12002 [pdf, other]

NaturalL2S: End-to-End High-quality Multispeaker Lip-to-Speech Synthesis with Differential Digital Signal Processing

Authors: Yifan Liang, Fangkun Liu, Andong Li, Xiaodong Li, Chengshi Zheng

Abstract: Recent advancements in visual speech recognition (VSR) have promoted progress in lip-to-speech synthesis, where pre-trained VSR models enhance the intelligibility of synthesized speech by providing valuable semantic information. The success achieved by cascade frameworks, which combine pseudo-VSR with pseudo-text-to-speech (TTS) or implicitly utilize the transcribed text, highlights the benefits o… ▽ More Recent advancements in visual speech recognition (VSR) have promoted progress in lip-to-speech synthesis, where pre-trained VSR models enhance the intelligibility of synthesized speech by providing valuable semantic information. The success achieved by cascade frameworks, which combine pseudo-VSR with pseudo-text-to-speech (TTS) or implicitly utilize the transcribed text, highlights the benefits of leveraging VSR models. However, these methods typically rely on mel-spectrograms as an intermediate representation, which may introduce a key bottleneck: the domain gap between synthetic mel-spectrograms, generated from inherently error-prone lip-to-speech mappings, and real mel-spectrograms used to train vocoders. This mismatch inevitably degrades synthesis quality. To bridge this gap, we propose Natural Lip-to-Speech (NaturalL2S), an end-to-end framework integrating acoustic inductive biases with differentiable speech generation components. Specifically, we introduce a fundamental frequency (F0) predictor to capture prosodic variations in synthesized speech. The predicted F0 then drives a Differentiable Digital Signal Processing (DDSP) synthesizer to generate a coarse signal which serves as prior information for subsequent speech synthesis. Additionally, instead of relying on a reference speaker embedding as an auxiliary input, our approach achieves satisfactory performance on speaker similarity without explicitly modelling speaker characteristics. Both objective and subjective evaluation results demonstrate that NaturalL2S can effectively enhance the quality of the synthesized speech when compared to state-of-the-art methods. Our demonstration page is accessible at https://yifan-liang.github.io/NaturalL2S/. △ Less

Submitted 17 February, 2025; originally announced February 2025.

arXiv:2502.00248 [pdf, other]

doi 10.1016/j.engappai.2024.109252

Provably-Stable Neural Network-Based Control of Nonlinear Systems

Authors: Anran Li, John P. Swensen, Mehdi Hosseinzadeh

Abstract: In recent years, Neural Networks (NNs) have been employed to control nonlinear systems due to their potential capability in dealing with situations that might be difficult for conventional nonlinear control schemes. However, to the best of our knowledge, the current literature on NN-based control lacks theoretical guarantees for stability and tracking performance. This precludes the application of… ▽ More In recent years, Neural Networks (NNs) have been employed to control nonlinear systems due to their potential capability in dealing with situations that might be difficult for conventional nonlinear control schemes. However, to the best of our knowledge, the current literature on NN-based control lacks theoretical guarantees for stability and tracking performance. This precludes the application of NN-based control schemes to systems where stringent stability and performance guarantees are required. To address this gap, this paper proposes a systematic and comprehensive methodology to design provably-stable NN-based control schemes for affine nonlinear systems. Rigorous analysis is provided to show that the proposed approach guarantees stability of the closed-loop system with the NN in the loop. Also, it is shown that the resulting NN-based control scheme ensures that system states asymptotically converge to a neighborhood around the desired equilibrium point, with a tunable proximity threshold. The proposed methodology is validated and evaluated via simulation studies on an inverted pendulum and experimental studies on a Parrot Bebop 2 drone. △ Less

Submitted 31 January, 2025; originally announced February 2025.

Journal ref: Engineering Applications of Artificial Intelligence, volume 138, pages 109252, year 2024

arXiv:2501.16215 [pdf, other]

Enhancing Visual Inspection Capability of Multi-Modal Large Language Models on Medical Time Series with Supportive Conformalized and Interpretable Small Specialized Models

Authors: Huayu Li, Xiwen Chen, Ci Zhang, Stuart F. Quan, William D. S. Killgore, Shu-Fen Wung, Chen X. Chen, Geng Yuan, Jin Lu, Ao Li

Abstract: Large language models (LLMs) exhibit remarkable capabilities in visual inspection of medical time-series data, achieving proficiency comparable to human clinicians. However, their broad scope limits domain-specific precision, and proprietary weights hinder fine-tuning for specialized datasets. In contrast, small specialized models (SSMs) excel in targeted tasks but lack the contextual reasoning re… ▽ More Large language models (LLMs) exhibit remarkable capabilities in visual inspection of medical time-series data, achieving proficiency comparable to human clinicians. However, their broad scope limits domain-specific precision, and proprietary weights hinder fine-tuning for specialized datasets. In contrast, small specialized models (SSMs) excel in targeted tasks but lack the contextual reasoning required for complex clinical decision-making. To address these challenges, we propose ConMIL (Conformalized Multiple Instance Learning), a decision-support SSM that integrates seamlessly with LLMs. By using Multiple Instance Learning (MIL) to identify clinically significant signal segments and conformal prediction for calibrated set-valued outputs, ConMIL enhances LLMs' interpretative capabilities for medical time-series analysis. Experimental results demonstrate that ConMIL significantly improves the performance of state-of-the-art LLMs, such as ChatGPT4.0 and Qwen2-VL-7B. Specifically, \ConMIL{}-supported Qwen2-VL-7B achieves 94.92% and 96.82% precision for confident samples in arrhythmia detection and sleep staging, compared to standalone LLM accuracy of 46.13% and 13.16%. These findings highlight the potential of ConMIL to bridge task-specific precision and broader contextual reasoning, enabling more reliable and interpretable AI-driven clinical decision support. △ Less

Submitted 27 January, 2025; originally announced January 2025.

arXiv:2501.13465 [pdf, other]

Neural Vocoders as Speech Enhancers

Authors: Andong Li, Zhihang Sun, Fengyuan Hao, Xiaodong Li, Chengshi Zheng

Abstract: Speech enhancement (SE) and neural vocoding are traditionally viewed as separate tasks. In this work, we observe them under a common thread: the rank behavior of these processes. This observation prompts two key questions: \textit{Can a model designed for one task's rank degradation be adapted for the other?} and \textit{Is it possible to address both tasks using a unified model?} Our empirical fi… ▽ More Speech enhancement (SE) and neural vocoding are traditionally viewed as separate tasks. In this work, we observe them under a common thread: the rank behavior of these processes. This observation prompts two key questions: \textit{Can a model designed for one task's rank degradation be adapted for the other?} and \textit{Is it possible to address both tasks using a unified model?} Our empirical findings demonstrate that existing speech enhancement models can be successfully trained to perform vocoding tasks, and a single model, when jointly trained, can effectively handle both tasks with performance comparable to separately trained models. These results suggest that speech enhancement and neural vocoding can be unified under a broader framework of speech restoration. Code: https://github.com/Andong-Li-speech/Neural-Vocoders-as-Speech-Enhancers. △ Less

Submitted 23 January, 2025; originally announced January 2025.

Comments: 6 pages, 3 figures

arXiv:2501.09799 [pdf, ps, other]

Scan-Adaptive MRI Undersampling Using Neighbor-based Optimization (SUNO)

Authors: Siddhant Gautam, Angqi Li, Nicole Seiberlich, Jeffrey A. Fessler, Saiprasad Ravishankar

Abstract: Accelerated MRI involves collecting partial k-space measurements to reduce acquisition time, patient discomfort, and motion artifacts, and typically uses regular undersampling patterns or hand-designed schemes. Recent works have studied population-adaptive sampling patterns that are learned from a group of patients (or scans) based on population-specific metrics. However, such a general sampling p… ▽ More Accelerated MRI involves collecting partial k-space measurements to reduce acquisition time, patient discomfort, and motion artifacts, and typically uses regular undersampling patterns or hand-designed schemes. Recent works have studied population-adaptive sampling patterns that are learned from a group of patients (or scans) based on population-specific metrics. However, such a general sampling pattern can be sub-optimal for any specific scan since it may lack scan or slice adaptive details. To overcome this issue, we propose a framework for jointly learning scan-adaptive Cartesian undersampling patterns and a corresponding reconstruction model from a training set. We use an alternating algorithm for learning the sampling patterns and reconstruction model where we use an iterative coordinate descent (ICD) based offline optimization of scan-adaptive k-space sampling patterns for each example in the training set. A nearest neighbor search is then used to select the scan-adaptive sampling pattern at test time from initially acquired low-frequency k-space information. We applied the proposed framework (dubbed SUNO) to the fastMRI multi-coil knee and brain datasets, demonstrating improved performance over currently used undersampling patterns at both 4x and 8x acceleration factors in terms of both visual quality and quantitative metrics. The code for the proposed framework is available at https://github.com/sidgautam95/adaptive-sampling-mri-suno. △ Less

Submitted 9 June, 2025; v1 submitted 16 January, 2025; originally announced January 2025.

arXiv:2412.20023 [pdf]

Global Search of Optimal Spacecraft Trajectories using Amortization and Deep Generative Models

Authors: Ryne Beeson, Anjian Li, Amlan Sinha

Abstract: Preliminary spacecraft trajectory optimization is a parameter dependent global search problem that aims to provide a set of solutions that are of high quality and diverse. In the case of numerical solution, it is dependent on the original optimal control problem, the choice of a control transcription, and the behavior of a gradient based numerical solver. In this paper we formulate the parameteriz… ▽ More Preliminary spacecraft trajectory optimization is a parameter dependent global search problem that aims to provide a set of solutions that are of high quality and diverse. In the case of numerical solution, it is dependent on the original optimal control problem, the choice of a control transcription, and the behavior of a gradient based numerical solver. In this paper we formulate the parameterized global search problem as the task of sampling a conditional probability distribution with support on the neighborhoods of local basins of attraction to the high quality solutions. The conditional distribution is learned and represented using deep generative models that allow for prediction of how the local basins change as parameters vary. The approach is benchmarked on a low thrust spacecraft trajectory optimization problem in the circular restricted three-body problem, showing significant speed-up over a simple multi-start method and vanilla machine learning approaches. The paper also provides an in-depth analysis of the multi-modal funnel structure of a low-thrust spacecraft trajectory optimization problem. △ Less

Submitted 27 December, 2024; originally announced December 2024.

Comments: 47 pages, 23 figures, initial content of this paper appears in Paper 23-352 at the AAS/AIAA Astrodynamics Specialist Conference, Big Sky, MT, August 13-17 2023

arXiv:2412.19099 [pdf, other]

BSDB-Net: Band-Split Dual-Branch Network with Selective State Spaces Mechanism for Monaural Speech Enhancement

Authors: Cunhang Fan, Enrui Liu, Andong Li, Jianhua Tao, Jian Zhou, Jiahao Li, Chengshi Zheng, Zhao Lv

Abstract: Although the complex spectrum-based speech enhancement(SE) methods have achieved significant performance, coupling amplitude and phase can lead to a compensation effect, where amplitude information is sacrificed to compensate for the phase that is harmful to SE. In addition, to further improve the performance of SE, many modules are stacked onto SE, resulting in increased model complexity that lim… ▽ More Although the complex spectrum-based speech enhancement(SE) methods have achieved significant performance, coupling amplitude and phase can lead to a compensation effect, where amplitude information is sacrificed to compensate for the phase that is harmful to SE. In addition, to further improve the performance of SE, many modules are stacked onto SE, resulting in increased model complexity that limits the application of SE. To address these problems, we proposed a dual-path network based on compressed frequency using Mamba. First, we extract amplitude and phase information through parallel dual branches. This approach leverages structured complex spectra to implicitly capture phase information and solves the compensation effect by decoupling amplitude and phase, and the network incorporates an interaction module to suppress unnecessary parts and recover missing components from the other branch. Second, to reduce network complexity, the network introduces a band-split strategy to compress the frequency dimension. To further reduce complexity while maintaining good performance, we designed a Mamba-based module that models the time and frequency dimensions under linear complexity. Finally, compared to baselines, our model achieves an average 8.3 times reduction in computational complexity while maintaining superior performance. Furthermore, it achieves a 25 times reduction in complexity compared to transformer-based models. △ Less

Submitted 26 December, 2024; originally announced December 2024.

Comments: Accepted by AAAI 2025

arXiv:2411.07833 [pdf, other]

Robust Adaptive Safe Robotic Grasping with Tactile Sensing

Authors: Yitaek Kim, Jeeseop Kim, Albert H. Li, Aaron D. Ames, Christoffer Sloth

Abstract: Robotic grasping requires safe force interaction to prevent a grasped object from being damaged or slipping out of the hand. In this vein, this paper proposes an integrated framework for grasping with formal safety guarantees based on Control Barrier Functions. We first design contact force and force closure constraints, which are enforced by a safety filter to accomplish safe grasping with finger… ▽ More Robotic grasping requires safe force interaction to prevent a grasped object from being damaged or slipping out of the hand. In this vein, this paper proposes an integrated framework for grasping with formal safety guarantees based on Control Barrier Functions. We first design contact force and force closure constraints, which are enforced by a safety filter to accomplish safe grasping with finger force control. For sensory feedback, we develop a technique to estimate contact point, force, and torque from tactile sensors at each finger. We verify the framework with various safety filters in a numerical simulation under a two-finger grasping scenario. We then experimentally validate the framework by grasping multiple objects, including fragile lab glassware, in a real robotic setup, showing that safe grasping can be successfully achieved in the real world. We evaluate the performance of each safety filter in the context of safety violation and conservatism, and find that disturbance observer-based control barrier functions provide superior performance for safety guarantees with minimum conservatism. The demonstration video is available at https://youtu.be/Cuj47mkXRdg. △ Less

Submitted 12 November, 2024; originally announced November 2024.

arXiv:2410.22028 [pdf, other]

MU-MIMO Symbol-Level Precoding for QAM Constellations with Maximum Likelihood Receivers

Authors: X. Tong, A. Li, L. Lei, X. Hu, F. Dong, S. Chatzinotas, C. Masouros

Abstract: In this paper, we investigate symbol-level precoding (SLP) and efficient decoding techniques for downlink transmission, where we focus on scenarios where the base station (BS) transmits multiple QAM constellation streams to users equipped with multiple receive antennas. We begin by formulating a joint symbol-level transmit precoding and receive combining optimization problem. This coupled problem… ▽ More In this paper, we investigate symbol-level precoding (SLP) and efficient decoding techniques for downlink transmission, where we focus on scenarios where the base station (BS) transmits multiple QAM constellation streams to users equipped with multiple receive antennas. We begin by formulating a joint symbol-level transmit precoding and receive combining optimization problem. This coupled problem is addressed by employing the alternating optimization (AO) method, and closed-form solutions are derived by analyzing the obtained two subproblems. Furthermore, to address the dependence of the receive combining matrix on the transmit signals, we switch to maximum likelihood detection (MLD) method for decoding. Notably, we have demonstrated that the smallest singular value of the precoding matrix significantly impacts the performance of MLD method. Specifically, a lower value of the smallest singular value results in degraded detection performance. Additionally, we show that the traditional SLP matrix is rank-one, making it infeasible to directly apply MLD at the receiver end. To circumvent this limitation, we propose a novel symbol-level smallest singular value maximization problem, termed SSVMP, to enable SLP in systems where users employ the MLD decoding approach. Moreover, to reduce the number of variables to be optimized, we further derive a more generic semidefinite programming (SDP)-based optimization problem. Numerical results validate the effectiveness of our proposed schemes and demonstrate that they significantly outperform the traditional block diagonalization (BD)-based method. △ Less

Submitted 29 October, 2024; originally announced October 2024.

Comments: 13 pages,8 figures

arXiv:2410.17574 [pdf, other]

Adversarial Domain Adaptation for Metal Cutting Sound Detection: Leveraging Abundant Lab Data for Scarce Industry Data

Authors: Mir Imtiaz Mostafiz, Eunseob Kim, Adrian Shuai Li, Elisa Bertino, Martin Byung-Guk Jun, Ali Shakouri

Abstract: Cutting state monitoring in the milling process is crucial for improving manufacturing efficiency and tool life. Cutting sound detection using machine learning (ML) models, inspired by experienced machinists, can be employed as a cost-effective and non-intrusive monitoring method in a complex manufacturing environment. However, labeling industry data for training is costly and time-consuming. More… ▽ More Cutting state monitoring in the milling process is crucial for improving manufacturing efficiency and tool life. Cutting sound detection using machine learning (ML) models, inspired by experienced machinists, can be employed as a cost-effective and non-intrusive monitoring method in a complex manufacturing environment. However, labeling industry data for training is costly and time-consuming. Moreover, industry data is often scarce. In this study, we propose a novel adversarial domain adaptation (DA) approach to leverage abundant lab data to learn from scarce industry data, both labeled, for training a cutting-sound detection model. Rather than adapting the features from separate domains directly, we project them first into two separate latent spaces that jointly work as the feature space for learning domain-independent representations. We also analyze two different mechanisms for adversarial learning where the discriminator works as an adversary and a critic in separate settings, enabling our model to learn expressive domain-invariant and domain-ingrained features, respectively. We collected cutting sound data from multiple sensors in different locations, prepared datasets from lab and industry domain, and evaluated our learning models on them. Experiments showed that our models outperformed the multi-layer perceptron based vanilla domain adaptation models in labeling tasks on the curated datasets, achieving near 92%, 82% and 85% accuracy respectively for three different sensors installed in industry settings. △ Less

Submitted 23 October, 2024; originally announced October 2024.

Comments: 8 pages, 3 figures, 3 tables, First two named Authors have equal contribution (Co-first author)

arXiv:2410.11270 [pdf, other]

Energy Efficient Transmission Parameters Selection Method Using Reinforcement Learning in Distributed LoRa Networks

Authors: Ryotai Airiyoshi, Mikio Hasegawa, Tomoaki Ohtsuki, Aohan Li

Abstract: With the increase in demand for Internet of Things (IoT) applications, the number of IoT devices has drastically grown, making spectrum resources seriously insufficient. Transmission collisions and retransmissions increase power consumption. Therefore, even in long-range (LoRa) networks, selecting appropriate transmission parameters, such as channel and transmission power, is essential to improve… ▽ More With the increase in demand for Internet of Things (IoT) applications, the number of IoT devices has drastically grown, making spectrum resources seriously insufficient. Transmission collisions and retransmissions increase power consumption. Therefore, even in long-range (LoRa) networks, selecting appropriate transmission parameters, such as channel and transmission power, is essential to improve energy efficiency. However, due to the limited computational ability and memory, traditional transmission parameter selection methods for LoRa networks are challenging to implement on LoRa devices. To solve this problem, a distributed reinforcement learning-based channel and transmission power selection method is proposed, which can be implemented on the LoRa devices to improve energy efficiency in this paper. Specifically, the channel and transmission power selection problem in LoRa networks is first mapped to the multi-armed-bandit (MAB) problem. Then, an MAB-based method is introduced to solve the formulated transmission parameter selection problem based on the acknowledgment (ACK) packet and the power consumption for data transmission of the LoRa device. The performance of the proposed method is evaluated by the constructed actual LoRa network. Experimental results show that the proposed method performs better than fixed assignment, adaptive data rate low-complexity (ADR-Lite), and $ε$-greedy-based methods in terms of both transmission success rate and energy efficiency. △ Less

Submitted 21 January, 2025; v1 submitted 15 October, 2024; originally announced October 2024.

Comments: 6 pages, 5 figures, conference

arXiv:2410.11097 [pdf, other]

DMOSpeech: Direct Metric Optimization via Distilled Diffusion Model in Zero-Shot Speech Synthesis

Authors: Yingahao Aaron Li, Rithesh Kumar, Zeyu Jin

Abstract: Diffusion models have demonstrated significant potential in speech synthesis tasks, including text-to-speech (TTS) and voice cloning. However, their iterative denoising processes are computationally intensive, and previous distillation attempts have shown consistent quality degradation. Moreover, existing TTS approaches are limited by non-differentiable components or iterative sampling that preven… ▽ More Diffusion models have demonstrated significant potential in speech synthesis tasks, including text-to-speech (TTS) and voice cloning. However, their iterative denoising processes are computationally intensive, and previous distillation attempts have shown consistent quality degradation. Moreover, existing TTS approaches are limited by non-differentiable components or iterative sampling that prevent true end-to-end optimization with perceptual metrics. We introduce DMOSpeech, a distilled diffusion-based TTS model that uniquely achieves both faster inference and superior performance compared to its teacher model. By enabling direct gradient pathways to all model components, we demonstrate the first successful end-to-end optimization of differentiable metrics in TTS, incorporating Connectionist Temporal Classification (CTC) loss and Speaker Verification (SV) loss. Our comprehensive experiments, validated through extensive human evaluation, show significant improvements in naturalness, intelligibility, and speaker similarity while reducing inference time by orders of magnitude. This work establishes a new framework for aligning speech synthesis with human auditory preferences through direct metric optimization. The audio samples are available at https://dmospeech.github.io/. △ Less

Submitted 19 February, 2025; v1 submitted 14 October, 2024; originally announced October 2024.

arXiv:2410.06170 [pdf, other]

QGym: Scalable Simulation and Benchmarking of Queuing Network Controllers

Authors: Haozhe Chen, Ang Li, Ethan Che, Tianyi Peng, Jing Dong, Hongseok Namkoong

Abstract: Queuing network control determines the allocation of scarce resources to manage congestion, a fundamental problem in manufacturing, communications, and healthcare. Compared to standard RL problems, queueing problems are distinguished by unique challenges: i) a system operating in continuous time, ii) high stochasticity, and iii) long horizons over which the system can become unstable (exploding de… ▽ More Queuing network control determines the allocation of scarce resources to manage congestion, a fundamental problem in manufacturing, communications, and healthcare. Compared to standard RL problems, queueing problems are distinguished by unique challenges: i) a system operating in continuous time, ii) high stochasticity, and iii) long horizons over which the system can become unstable (exploding delays). To spur methodological progress tackling these challenges, we present an open-sourced queueing simulation framework, QGym, that benchmark queueing policies across realistic problem instances. Our modular framework allows the researchers to build on our initial instances, which provide a wide range of environments including parallel servers, criss-cross, tandem, and re-entrant networks, as well as a realistically calibrated hospital queuing system. QGym makes it easy to compare multiple policies, including both model-free RL methods and classical queuing policies. Our testbed complements the traditional focus on evaluating algorithms based on mathematical guarantees in idealized settings, and significantly expands the scope of empirical benchmarking in prior work. QGym code is open-sourced at https://github.com/namkoong-lab/QGym. △ Less

Submitted 8 October, 2024; originally announced October 2024.

arXiv:2410.05739 [pdf, other]

Array2BR: An End-to-End Noise-immune Binaural Audio Synthesis from Microphone-array Signals

Authors: Cheng Chi, Xiaoyu Li, Andong Li, Yuxuan Ke, Xiaodong Li, Chengshi Zheng

Abstract: Telepresence technology aims to provide an immersive virtual presence for remote conference applications, and it is extremely important to synthesize high-quality binaural audio signals for this aim. Because the ambient noise is often inevitable in practical application scenarios, it is highly desired that binaural audio signals without noise can be obtained from microphone-array signals directly.… ▽ More Telepresence technology aims to provide an immersive virtual presence for remote conference applications, and it is extremely important to synthesize high-quality binaural audio signals for this aim. Because the ambient noise is often inevitable in practical application scenarios, it is highly desired that binaural audio signals without noise can be obtained from microphone-array signals directly. For this purpose, this paper proposes a new end-to-end noise-immune binaural audio synthesis framework from microphone-array signals, abbreviated as Array2BR, and experimental results show that binaural cues can be correctly mapped and noise can be well suppressed simultaneously using the proposed framework. Compared with existing methods, the proposed method achieved better performance in terms of both objective and subjective metric scores. △ Less

Submitted 8 October, 2024; originally announced October 2024.

arXiv:2410.02976 [pdf, other]

Learning Optimal Control and Dynamical Structure of Global Trajectory Search Problems with Diffusion Models

Authors: Jannik Graebner, Anjian Li, Amlan Sinha, Ryne Beeson

Abstract: Spacecraft trajectory design is a global search problem, where previous work has revealed specific solution structures that can be captured with data-driven methods. This paper explores two global search problems in the circular restricted three-body problem: hybrid cost function of minimum fuel/time-of-flight and transfers to energy-dependent invariant manifolds. These problems display a fundamen… ▽ More Spacecraft trajectory design is a global search problem, where previous work has revealed specific solution structures that can be captured with data-driven methods. This paper explores two global search problems in the circular restricted three-body problem: hybrid cost function of minimum fuel/time-of-flight and transfers to energy-dependent invariant manifolds. These problems display a fundamental structure either in the optimal control profile or the use of dynamical structures. We build on our prior generative machine learning framework to apply diffusion models to learn the conditional probability distribution of the search problem and analyze the model's capability to capture these structures. △ Less

Submitted 29 December, 2024; v1 submitted 3 October, 2024; originally announced October 2024.

Comments: This paper was presented at the AAS/AIAA Astrodynamics Specialist Conference

arXiv:2410.00392 [pdf, other]

MERIT: Multimodal Wearable Vital Sign Waveform Monitoring

Authors: Yongyang Tang, Zhe Chen, Ang Li, Tianyue Zheng, Zheng Lin, Jia Xu, Pin Lv, Zhe Sun, Yue Gao

Abstract: Cardiovascular disease (CVD) is the leading cause of death and premature mortality worldwide, with occupational environments significantly influencing CVD risk, underscoring the need for effective cardiac monitoring and early warning systems. Existing methods of monitoring vital signs require subjects to remain stationary, which is impractical for daily monitoring as individuals are often in motio… ▽ More Cardiovascular disease (CVD) is the leading cause of death and premature mortality worldwide, with occupational environments significantly influencing CVD risk, underscoring the need for effective cardiac monitoring and early warning systems. Existing methods of monitoring vital signs require subjects to remain stationary, which is impractical for daily monitoring as individuals are often in motion. To address this limitation, we propose MERIT, a multimodality-based wearable system designed for precise ECG waveform monitoring without movement restrictions. Daily activities, involving frequent arm movements, can significantly affect sensor data and complicate the reconstruction of accurate ECG signals. To mitigate motion impact and enhance ECG signal reconstruction, we introduce a deep independent component analysis (Deep-ICA) module and a multimodal fusion module. We conducted experiments with 15 subjects. Our results, compared with commercial wearable devices and existing methods, demonstrate that MERIT accurately reconstructs ECG waveforms during various office activities, offering a reliable solution for fine-grained cardiac monitoring in dynamic environments. △ Less

Submitted 21 November, 2024; v1 submitted 1 October, 2024; originally announced October 2024.

Comments: 8 pages, 10 figures

arXiv:2409.10058 [pdf, other]

StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion

Authors: Yinghao Aaron Li, Xilin Jiang, Cong Han, Nima Mesgarani

Abstract: The rapid development of large-scale text-to-speech (TTS) models has led to significant advancements in modeling diverse speaker prosody and voices. However, these models often face issues such as slow inference speeds, reliance on complex pre-trained neural codec representations, and difficulties in achieving naturalness and high similarity to reference speakers. To address these challenges, this… ▽ More The rapid development of large-scale text-to-speech (TTS) models has led to significant advancements in modeling diverse speaker prosody and voices. However, these models often face issues such as slow inference speeds, reliance on complex pre-trained neural codec representations, and difficulties in achieving naturalness and high similarity to reference speakers. To address these challenges, this work introduces StyleTTS-ZS, an efficient zero-shot TTS model that leverages distilled time-varying style diffusion to capture diverse speaker identities and prosodies. We propose a novel approach that represents human speech using input text and fixed-length time-varying discrete style codes to capture diverse prosodic variations, trained adversarially with multi-modal discriminators. A diffusion model is then built to sample this time-varying style code for efficient latent diffusion. Using classifier-free guidance, StyleTTS-ZS achieves high similarity to the reference speaker in the style diffusion process. Furthermore, to expedite sampling, the style diffusion model is distilled with perceptual loss using only 10k samples, maintaining speech quality and similarity while reducing inference speed by 90%. Our model surpasses previous state-of-the-art large-scale zero-shot TTS models in both naturalness and similarity, offering a 10-20 faster sampling speed, making it an attractive alternative for efficient large-scale zero-shot TTS systems. The audio demo, code and models are available at https://styletts-zs.github.io/. △ Less

Submitted 16 September, 2024; originally announced September 2024.

arXiv:2408.11849 [pdf, other]

Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation

Authors: Yinghao Aaron Li, Xilin Jiang, Jordan Darefsky, Ge Zhu, Nima Mesgarani

Abstract: The rapid advancement of large language models (LLMs) has significantly propelled the development of text-based chatbots, demonstrating their capability to engage in coherent and contextually relevant dialogues. However, extending these advancements to enable end-to-end speech-to-speech conversation bots remains a formidable challenge, primarily due to the extensive dataset and computational resou… ▽ More The rapid advancement of large language models (LLMs) has significantly propelled the development of text-based chatbots, demonstrating their capability to engage in coherent and contextually relevant dialogues. However, extending these advancements to enable end-to-end speech-to-speech conversation bots remains a formidable challenge, primarily due to the extensive dataset and computational resources required. The conventional approach of cascading automatic speech recognition (ASR), LLM, and text-to-speech (TTS) models in a pipeline, while effective, suffers from unnatural prosody because it lacks direct interactions between the input audio and its transcribed text and the output audio. These systems are also limited by their inherent latency from the ASR process for real-time applications. This paper introduces Style-Talker, an innovative framework that fine-tunes an audio LLM alongside a style-based TTS model for fast spoken dialog generation. Style-Talker takes user input audio and uses transcribed chat history and speech styles to generate both the speaking style and text for the response. Subsequently, the TTS model synthesizes the speech, which is then played back to the user. While the response speech is being played, the input speech undergoes ASR processing to extract the transcription and speaking style, serving as the context for the ensuing dialogue turn. This novel pipeline accelerates the traditional cascade ASR-LLM-TTS systems while integrating rich paralinguistic information from input speech. Our experimental results show that Style-Talker significantly outperforms the conventional cascade and speech-to-speech baselines in terms of both dialogue naturalness and coherence while being more than 50% faster. △ Less

Submitted 13 August, 2024; originally announced August 2024.

Comments: CoLM 2024

arXiv:2407.09732 [pdf, other]

Speech Slytherin: Examining the Performance and Efficiency of Mamba for Speech Separation, Recognition, and Synthesis

Authors: Xilin Jiang, Yinghao Aaron Li, Adrian Nicolas Florea, Cong Han, Nima Mesgarani

Abstract: It is too early to conclude that Mamba is a better alternative to transformers for speech before comparing Mamba with transformers in terms of both performance and efficiency in multiple speech-related tasks. To reach this conclusion, we propose and evaluate three models for three tasks: Mamba-TasNet for speech separation, ConMamba for speech recognition, and VALL-M for speech synthesis. We compar… ▽ More It is too early to conclude that Mamba is a better alternative to transformers for speech before comparing Mamba with transformers in terms of both performance and efficiency in multiple speech-related tasks. To reach this conclusion, we propose and evaluate three models for three tasks: Mamba-TasNet for speech separation, ConMamba for speech recognition, and VALL-M for speech synthesis. We compare them with transformers of similar sizes in performance, memory, and speed. Our Mamba or Mamba-transformer hybrid models show comparable or higher performance than their transformer counterparts: Sepformer, Conformer, and VALL-E. They are more efficient than transformers in memory and speed for speech longer than a threshold duration, inversely related to the resolution of a speech token. Mamba for separation is the most efficient, and Mamba for recognition is the least. Further, we show that Mamba is not more efficient than transformer for speech shorter than the threshold duration and performs worse in models that require joint modeling of text and speech, such as cross or masked attention of two inputs. Therefore, we argue that the superiority of Mamba or transformer depends on particular problems and models. Code available at https://github.com/xi-j/Mamba-TasNet and https://github.com/xi-j/Mamba-ASR. △ Less

Submitted 12 July, 2024; originally announced July 2024.

arXiv:2406.17804 [pdf, other]

doi 10.1109/ICMLCA63499.2024.10753737

A Review of Electromagnetic Elimination Methods for low-field portable MRI scanner

Authors: Wanyu Bian, Panfeng Li, Mengyao Zheng, Chihang Wang, Anying Li, Ying Li, Haowei Ni, Zixuan Zeng

Abstract: This paper analyzes conventional and deep learning methods for eliminating electromagnetic interference (EMI) in MRI systems. We compare traditional analytical and adaptive techniques with advanced deep learning approaches. Key strengths and limitations of each method are highlighted. Recent advancements in active EMI elimination, such as external EMI receiver coils, are discussed alongside deep l… ▽ More This paper analyzes conventional and deep learning methods for eliminating electromagnetic interference (EMI) in MRI systems. We compare traditional analytical and adaptive techniques with advanced deep learning approaches. Key strengths and limitations of each method are highlighted. Recent advancements in active EMI elimination, such as external EMI receiver coils, are discussed alongside deep learning methods, which show superior EMI suppression by leveraging neural networks trained on MRI data. While deep learning improves EMI elimination and diagnostic capabilities, it introduces security and safety concerns, particularly in commercial applications. A balanced approach, integrating conventional reliability with deep learning's advanced capabilities, is proposed for more effective EMI suppression in MRI systems. △ Less

Submitted 13 November, 2024; v1 submitted 22 June, 2024; originally announced June 2024.

Comments: Accepted by 2024 5th International Conference on Machine Learning and Computer Application

Journal ref: Proceedings of the 2024 5th International Conference on Machine Learning and Computer Application (ICMLCA), 2024, pp. 614-618

arXiv:2406.16870 [pdf, other]

Robust Optimal Lane-changing Control for Connected Autonomous Vehicles in Mixed Traffic

Authors: Anni Li, Andres S. Chavez Armijos, Christos G. Cassandras

Abstract: We derive time and energy-optimal policies for a Connected Autonomous Vehicle (CAV) to execute lane change maneuvers in mixed traffic, i.e., in the presence of both CAVs and Human Driven Vehicles (HDVs). These policies are also shown to be robust with respect to the unpredictable behavior of HDVs by exploiting CAV cooperation which can eliminate or greatly reduce the interaction between CAVs and H… ▽ More We derive time and energy-optimal policies for a Connected Autonomous Vehicle (CAV) to execute lane change maneuvers in mixed traffic, i.e., in the presence of both CAVs and Human Driven Vehicles (HDVs). These policies are also shown to be robust with respect to the unpredictable behavior of HDVs by exploiting CAV cooperation which can eliminate or greatly reduce the interaction between CAVs and HDVs. We derive a simple threshold-based criterion on the initial relative distance between two cooperating CAVs based on which an optimal policy is selected such that the lane-changing CAV merges ahead of a cooperating CAV in the target lane; in this case, the lane-changing CAV's trajectory becomes independent of HDV behavior. Otherwise, the interaction between CAVs and neighboring HDVs is formulated as a bilevel optimization problem with an appropriate behavioral model for an HDV, and an iterated best response (IBR) method is used to determine an equilibrium. We demonstrate the convergence of the IBR process under certain conditions. Furthermore, Control Barrier Functions (CBFs) are implemented to ensure the robustness of lane-changing behaviors by guaranteeing safety in both longitudinal and lateral directions despite HDV disturbances. Simulation results validate the effectiveness of our CAV controllers in terms of cost, safety guarantees, and limited disruption to traffic flow. Additionally, we demonstrate the robustness of the lane-changing behaviors in the presence of uncontrollable HDVs. △ Less

Submitted 15 March, 2024; originally announced June 2024.

Comments: arXiv admin note: text overlap with arXiv:2303.16948

arXiv:2406.11175 [pdf, other]

doi 10.1109/SLT61566.2024.10832279

SMRU: Split-and-Merge Recurrent-based UNet for Acoustic Echo Cancellation and Noise Suppression

Authors: Zhihang Sun, Andong Li, Rilin Chen, Hao Zhang, Meng Yu, Yi Zhou, Dong Yu

Abstract: The proliferation of deep neural networks has spawned the rapid development of acoustic echo cancellation and noise suppression, and plenty of prior arts have been proposed, which yield promising performance. Nevertheless, they rarely consider the deployment generality in different processing scenarios, such as edge devices, and cloud processing. To this end, this paper proposes a general model, t… ▽ More The proliferation of deep neural networks has spawned the rapid development of acoustic echo cancellation and noise suppression, and plenty of prior arts have been proposed, which yield promising performance. Nevertheless, they rarely consider the deployment generality in different processing scenarios, such as edge devices, and cloud processing. To this end, this paper proposes a general model, termed SMRU, to cover different application scenarios. The novelty lies in two-fold. First, a multi-scale band split layer and band merge layer are proposed to effectively fuse local frequency bands for lower complexity modeling. Besides, by simulating the multi-resolution feature modeling characteristic of the classical UNet structure, a novel recurrent-dominated UNet is devised. It consists of multiple variable frame rate blocks, each of which involves the causal time down-/up-sampling layer with varying compression ratios and the dual-path structure for inter- and intra-band modeling. The model is configured from 50 M/s to 6.8 G/s in terms of MACs, and the experimental results show that the proposed approach yields competitive or even better performance over existing baselines, and has the full potential to adapt to more general scenarios with varying complexity requirements. △ Less

Submitted 24 January, 2025; v1 submitted 16 June, 2024; originally announced June 2024.

Comments: 8 pages, Accepted to SLT 2024

Journal ref: 2024 IEEE Spoken Language Technology Workshop (SLT), pp. 317-324, 2024

arXiv:2406.00758 [pdf, other]

Once-for-All: Controllable Generative Image Compression with Dynamic Granularity Adaption

Authors: Anqi Li, Feng Li, Yuxi Liu, Runmin Cong, Yao Zhao, Huihui Bai

Abstract: Although recent generative image compression methods have demonstrated impressive potential in optimizing the rate-distortion-perception trade-off, they still face the critical challenge of flexible rate adaption to diverse compression necessities and scenarios. To overcome this challenge, this paper proposes a Controllable Generative Image Compression framework, termed Control-GIC, the first capa… ▽ More Although recent generative image compression methods have demonstrated impressive potential in optimizing the rate-distortion-perception trade-off, they still face the critical challenge of flexible rate adaption to diverse compression necessities and scenarios. To overcome this challenge, this paper proposes a Controllable Generative Image Compression framework, termed Control-GIC, the first capable of fine-grained bitrate adaption across a broad spectrum while ensuring high-fidelity and generality compression. Control-GIC is grounded in a VQGAN framework that encodes an image as a sequence of variable-length codes (i.e. VQ-indices), which can be losslessly compressed and exhibits a direct positive correlation with the bitrates. Drawing inspiration from the classical coding principle, we correlate the information density of local image patches with their granular representations. Hence, we can flexibly determine a proper allocation of granularity for the patches to achieve dynamic adjustment for VQ-indices, resulting in desirable compression rates. We further develop a probabilistic conditional decoder capable of retrieving historic encoded multi-granularity representations according to transmitted codes, and then reconstruct hierarchical granular features in the formalization of conditional probability, enabling more informative aggregation to improve reconstruction realism. Our experiments show that Control-GIC allows highly flexible and controllable bitrate adaption where the results demonstrate its superior performance over recent state-of-the-art methods. △ Less

Submitted 4 December, 2024; v1 submitted 2 June, 2024; originally announced June 2024.

arXiv:2405.17594 [pdf, other]

Towards Achieving Cooperation Compliance of Human Drivers in Mixed Traffic

Authors: Anni Li, Christos G. Cassandras

Abstract: We consider a mixed-traffic environment in transportation systems, where Connected and Automated Vehicles (CAVs) coexist with potentially non-cooperative Human-Driven Vehicles (HDVs). We develop a cooperation compliance control framework to incentivize HDVs to align their behavior with socially optimal objectives using a ``refundable toll'' scheme so as to achieve a desired compliance probability… ▽ More We consider a mixed-traffic environment in transportation systems, where Connected and Automated Vehicles (CAVs) coexist with potentially non-cooperative Human-Driven Vehicles (HDVs). We develop a cooperation compliance control framework to incentivize HDVs to align their behavior with socially optimal objectives using a ``refundable toll'' scheme so as to achieve a desired compliance probability for all non-compliant HDVs through a feedback control mechanism combining global with local (individual) components. We apply this scheme to the lane-changing problem, where a ``Social Planner'' provides references to the HDVs, measures their state errors, and induces cooperation compliance for safe lane-changing through a refundable toll approach. Simulation results are included to show the effectiveness of our cooperation compliance controller in terms of improved compliance and lane-changing maneuver safety and efficiency when non-cooperative HDVs are present. △ Less

Submitted 27 May, 2024; originally announced May 2024.

arXiv:2405.16446 [pdf, ps, other]

A New Solution for MU-MISO Symbol-Level Precoding: Extrapolation and Deep Unfolding

Authors: Mu Liang, Ang Li, Xiaoyan Hu, Christos Masouros

Abstract: Constructive interference (CI) precoding, which converts the harmful multi-user interference into beneficial signals, is a promising and efficient interference management scheme in multi-antenna communication systems. However, CI-based symbol-level precoding (SLP) experiences high computational complexity as the number of symbol slots increases within a transmission block, rendering it unaffordabl… ▽ More Constructive interference (CI) precoding, which converts the harmful multi-user interference into beneficial signals, is a promising and efficient interference management scheme in multi-antenna communication systems. However, CI-based symbol-level precoding (SLP) experiences high computational complexity as the number of symbol slots increases within a transmission block, rendering it unaffordable in practical communication systems. In this paper, we propose a symbol-level extrapolation (SLE) strategy to extrapolate the precoding matrix by leveraging the relationship between different symbol slots within in a transmission block, during which the channel state information (CSI) remains constant, where we design a closed-form iterative algorithm based on SLE for both PSK and QAM modulation. In order to further reduce the computational complexity, a sub-optimal closed-form solution based on SLE is further developed for PSK and QAM, respectively. Moreover, we design an unsupervised SLE-based neural network (SLE-Net) to unfold the proposed iterative algorithm, which helps enhance the interpretability of the neural network. By carefully designing the loss function of the SLE-Net, the time-complexity of the network can be reduced effectively. Extensive simulation results illustrate that the proposed algorithms can dramatically reduce the computational complexity and time complexity with only marginal performance loss, compared with the conventional SLP design methods. △ Less

Submitted 26 May, 2024; originally announced May 2024.

arXiv:2405.04167 [pdf, other]

Bridging the Synthetic-to-Authentic Gap: Distortion-Guided Unsupervised Domain Adaptation for Blind Image Quality Assessment

Authors: Aobo Li, Jinjian Wu, Yongxu Liu, Leida Li

Abstract: The annotation of blind image quality assessment (BIQA) is labor-intensive and time-consuming, especially for authentic images. Training on synthetic data is expected to be beneficial, but synthetically trained models often suffer from poor generalization in real domains due to domain gaps. In this work, we make a key observation that introducing more distortion types in the synthetic dataset may… ▽ More The annotation of blind image quality assessment (BIQA) is labor-intensive and time-consuming, especially for authentic images. Training on synthetic data is expected to be beneficial, but synthetically trained models often suffer from poor generalization in real domains due to domain gaps. In this work, we make a key observation that introducing more distortion types in the synthetic dataset may not improve or even be harmful to generalizing authentic image quality assessment. To solve this challenge, we propose distortion-guided unsupervised domain adaptation for BIQA (DGQA), a novel framework that leverages adaptive multi-domain selection via prior knowledge from distortion to match the data distribution between the source domains and the target domain, thereby reducing negative transfer from the outlier source domains. Extensive experiments on two cross-domain settings (synthetic distortion to authentic distortion and synthetic distortion to algorithmic distortion) have demonstrated the effectiveness of our proposed DGQA. Besides, DGQA is orthogonal to existing model-based BIQA methods, and can be used in combination with such models to improve performance with less training data. △ Less

Submitted 7 May, 2024; originally announced May 2024.

Comments: Accepted by CVPR2024

arXiv:2404.15364 [pdf, other]

doi 10.1109/LMWT.2024.3386330

MP-DPD: Low-Complexity Mixed-Precision Neural Networks for Energy-Efficient Digital Predistortion of Wideband Power Amplifiers

Authors: Yizhuo Wu, Ang Li, Mohammadreza Beikmirza, Gagan Deep Singh, Qinyu Chen, Leo C. N. de Vreede, Morteza Alavi, Chang Gao

Abstract: Digital Pre-Distortion (DPD) enhances signal quality in wideband RF power amplifiers (PAs). As signal bandwidths expand in modern radio systems, DPD's energy consumption increasingly impacts overall system efficiency. Deep Neural Networks (DNNs) offer promising advancements in DPD, yet their high complexity hinders their practical deployment. This paper introduces open-source mixed-precision (MP)… ▽ More Digital Pre-Distortion (DPD) enhances signal quality in wideband RF power amplifiers (PAs). As signal bandwidths expand in modern radio systems, DPD's energy consumption increasingly impacts overall system efficiency. Deep Neural Networks (DNNs) offer promising advancements in DPD, yet their high complexity hinders their practical deployment. This paper introduces open-source mixed-precision (MP) neural networks that employ quantized low-precision fixed-point parameters for energy-efficient DPD. This approach reduces computational complexity and memory footprint, thereby lowering power consumption without compromising linearization efficacy. Applied to a 160MHz-BW 1024-QAM OFDM signal from a digital RF PA, MP-DPD gives no performance loss against 32-bit floating-point precision DPDs, while achieving -43.75 (L)/-45.27 (R) dBc in Adjacent Channel Power Ratio (ACPR) and -38.72 dB in Error Vector Magnitude (EVM). A 16-bit fixed-point-precision MP-DPD enables a 2.8X reduction in estimated inference power. The PyTorch learning and testing code is publicly available at \url{https://github.com/lab-emi/OpenDPD}. △ Less

Submitted 18 April, 2024; originally announced April 2024.

Comments: Accepted to IEEE Microwave and Wireless Technology Letters (MWTL)

arXiv:2403.09096 [pdf, other]

Deep unfolding Network for Hyperspectral Image Super-Resolution with Automatic Exposure Correction

Authors: Yuan Fang, Yipeng Liu, Jie Chen, Zhen Long, Ao Li, Chong-Yung Chi, Ce Zhu

Abstract: In recent years, the fusion of high spatial resolution multispectral image (HR-MSI) and low spatial resolution hyperspectral image (LR-HSI) has been recognized as an effective method for HSI super-resolution (HSI-SR). However, both HSI and MSI may be acquired under extreme conditions such as night or poorly illuminating scenarios, which may cause different exposure levels, thereby seriously downgr… ▽ More In recent years, the fusion of high spatial resolution multispectral image (HR-MSI) and low spatial resolution hyperspectral image (LR-HSI) has been recognized as an effective method for HSI super-resolution (HSI-SR). However, both HSI and MSI may be acquired under extreme conditions such as night or poorly illuminating scenarios, which may cause different exposure levels, thereby seriously downgrading the yielded HSISR. In contrast to most existing methods based on respective low-light enhancements (LLIE) of MSI and HSI followed by their fusion, a deep Unfolding HSI Super-Resolution with Automatic Exposure Correction (UHSR-AEC) is proposed, that can effectively generate a high-quality fused HSI-SR (in texture and features) even under very imbalanced exposures, thanks to the correlation between LLIE and HSI-SR taken into account. Extensive experiments are provided to demonstrate the state-of-the-art overall performance of the proposed UHSR-AEC, including comparison with some benchmark peer methods. △ Less

Submitted 14 March, 2024; originally announced March 2024.

arXiv:2402.15944 [pdf, other]

On A Class of Greedy Sparse Recovery Algorithms

Authors: Gang Li, Qiuwei Li, Shuang Li, Wu Angela Li

Abstract: Sparse signal recovery deals with finding the sparest solution of an under-determined linear system $x = Q s$. In this paper, we propose a novel greedy approach to addressing the challenges from such a problem. Such an approach is based on a characterization of solutions to the system, which allows us to work on the sparse recovery in the $s$-space directly with a given measure. With $l_2$-based m… ▽ More Sparse signal recovery deals with finding the sparest solution of an under-determined linear system $x = Q s$. In this paper, we propose a novel greedy approach to addressing the challenges from such a problem. Such an approach is based on a characterization of solutions to the system, which allows us to work on the sparse recovery in the $s$-space directly with a given measure. With $l_2$-based measure, an orthogonal matching pursuit (OMP)-type algorithm is proposed, which significantly outperforms the classical OMP algorithm in terms of recovery accuracy while maintaining comparable computational complexity. An $l_1$-based algorithm, denoted as $\text{Alg}_{GL1}$, is derived. Such an algorithm significantly outperforms the classical basis pursuit (BP) algorithm. Combining with the CoSaMP-strategy for selecting atoms, a class of high performance greedy algorithms is also derived. Extensive numerical simulations on both synthetic and image data are carried out, with which the superior performance of our proposed algorithms is demonstrated in terms of sparse recovery accuracy and robustness against numerical instability of the system matrix $Q$ and disturbance in the measurement $x$. △ Less

Submitted 2 March, 2025; v1 submitted 24 February, 2024; originally announced February 2024.

arXiv:2402.04882 [pdf, other]

LMUFormer: Low Complexity Yet Powerful Spiking Model With Legendre Memory Units

Authors: Zeyu Liu, Gourav Datta, Anni Li, Peter Anthony Beerel

Abstract: Transformer models have demonstrated high accuracy in numerous applications but have high complexity and lack sequential processing capability making them ill-suited for many streaming applications at the edge where devices are heavily resource-constrained. Thus motivated, many researchers have proposed reformulating the transformer models as RNN modules which modify the self-attention computation… ▽ More Transformer models have demonstrated high accuracy in numerous applications but have high complexity and lack sequential processing capability making them ill-suited for many streaming applications at the edge where devices are heavily resource-constrained. Thus motivated, many researchers have proposed reformulating the transformer models as RNN modules which modify the self-attention computation with explicit states. However, these approaches often incur significant performance degradation. The ultimate goal is to develop a model that has the following properties: parallel training, streaming and low-cost inference, and SOTA performance. In this paper, we propose a new direction to achieve this goal. We show how architectural modifications to a recurrent model can help push its performance toward Transformer models while retaining its sequential processing capability. Specifically, inspired by the recent success of Legendre Memory Units (LMU) in sequence learning tasks, we propose LMUFormer, which augments the LMU with convolutional patch embedding and convolutional channel mixer. Moreover, we present a spiking version of this architecture, which introduces the benefit of states within the patch embedding and channel mixer modules while simultaneously reducing the computing complexity. We evaluated our architectures on multiple sequence datasets. In comparison to SOTA transformer-based models within the ANN domain on the SCv2 dataset, our LMUFormer demonstrates comparable performance while necessitating a remarkable 53 times reduction in parameters and a substantial 65 times decrement in FLOPs. Additionally, owing to our model's proficiency in real-time data processing, we can achieve a 32.03% reduction in sequence length, all while incurring an inconsequential decline in performance. Our code is publicly available at https://github.com/zeyuliu1037/LMUFormer.git. △ Less

Submitted 19 January, 2024; originally announced February 2024.

Comments: The 12th International Conference on Learning Representations (ICLR 2024)

arXiv:2402.03710 [pdf, ps, other]

Listen, Chat, and Remix: Text-Guided Soundscape Remixing for Enhanced Auditory Experience

Authors: Xilin Jiang, Cong Han, Yinghao Aaron Li, Nima Mesgarani

Abstract: In daily life, we encounter a variety of sounds, both desirable and undesirable, with limited control over their presence and volume. Our work introduces "Listen, Chat, and Remix" (LCR), a novel multimodal sound remixer that controls each sound source in a mixture based on user-provided text instructions. LCR distinguishes itself with a user-friendly text interface and its unique ability to remix… ▽ More In daily life, we encounter a variety of sounds, both desirable and undesirable, with limited control over their presence and volume. Our work introduces "Listen, Chat, and Remix" (LCR), a novel multimodal sound remixer that controls each sound source in a mixture based on user-provided text instructions. LCR distinguishes itself with a user-friendly text interface and its unique ability to remix multiple sound sources simultaneously within a mixture, without needing to separate them. Users input open-vocabulary text prompts, which are interpreted by a large language model to create a semantic filter for remixing the sound mixture. The system then decomposes the mixture into its components, applies the semantic filter, and reassembles filtered components back to the desired output. We developed a 160-hour dataset with over 100k mixtures, including speech and various audio sources, along with text prompts for diverse remixing tasks including extraction, removal, and volume control of single or multiple sources. Our experiments demonstrate significant improvements in signal quality across all remixing tasks and robust performance in zero-shot scenarios with varying numbers and types of sound sources. An audio demo is available at: https://listenchatremix.github.io/demo. △ Less

Submitted 10 June, 2025; v1 submitted 6 February, 2024; originally announced February 2024.

Comments: Accepted by IEEE Journal of Selected Topics in Signal Processing (JSTSP)

arXiv:2401.00166 [pdf, ps, other]

Block-Level MU-MISO Interference Exploitation Precoding: Optimal Structure and Explicit Duality

Authors: Junwen Yang, Ang Li, Xuewen Liao, Christos Masouros, A. L. Swindlehurst

Abstract: This paper investigates block-level interference exploitation (IE) precoding for multi-user multiple-input single-output (MU-MISO) downlink systems. To overcome the need for symbol-level IE precoding to frequently update the precoding matrix, we propose to jointly optimize all the precoders or transmit signals within a transmission block. The resultant precoders only need to be updated once per bl… ▽ More This paper investigates block-level interference exploitation (IE) precoding for multi-user multiple-input single-output (MU-MISO) downlink systems. To overcome the need for symbol-level IE precoding to frequently update the precoding matrix, we propose to jointly optimize all the precoders or transmit signals within a transmission block. The resultant precoders only need to be updated once per block, and while not necessarily constant over all the symbol slots, we refer to the technique as block-level slot-variant IE precoding. Through a careful examination of the optimal structure and the explicit duality inherent in block-level power minimization (PM) and signal-to-interference-plus-noise ratio (SINR) balancing (SB) problems, we discover that the joint optimization can be decomposed into subproblems with smaller variable sizes. As a step further, we propose block-level slot-invariant IE precoding by adding a structural constraint on the slot-variant IE precoding to maintain a constant precoder throughout the block. A novel linear precoder for IE is further presented, and we prove that the proposed slot-variant and slot-invariant IE precoding share an identical solution when the number of symbol slots does not exceed the number of users. Numerical simulations demonstrate that the proposed precoders achieve a significant complexity reduction compared against benchmark schemes, without sacrificing performance. △ Less

Submitted 30 December, 2023; originally announced January 2024.

Comments: Submitted to IEEE

arXiv:2312.08507 [pdf, other]

doi 10.1109/ICASSP48485.2024.10446528

Patient-Adaptive and Learned MRI Data Undersampling Using Neighborhood Clustering

Authors: Siddhant Gautam, Angqi Li, Saiprasad Ravishankar

Abstract: There has been much recent interest in adapting undersampled trajectories in MRI based on training data. In this work, we propose a novel patient-adaptive MRI sampling algorithm based on grouping scans within a training set. Scan-adaptive sampling patterns are optimized together with an image reconstruction network for the training scans. The training optimization alternates between determining th… ▽ More There has been much recent interest in adapting undersampled trajectories in MRI based on training data. In this work, we propose a novel patient-adaptive MRI sampling algorithm based on grouping scans within a training set. Scan-adaptive sampling patterns are optimized together with an image reconstruction network for the training scans. The training optimization alternates between determining the best sampling pattern for each scan (based on a greedy search or iterative coordinate descent (ICD)) and training a reconstructor across the dataset. The eventual scan-adaptive sampling patterns on the training set are used as labels to predict sampling design using nearest neighbor search at test time. The proposed algorithm is applied to the fastMRI knee multicoil dataset and demonstrates improved performance over several baselines. △ Less

Submitted 31 March, 2024; v1 submitted 13 December, 2023; originally announced December 2023.

arXiv:2311.16456 [pdf, other]

Spiking Neural Networks with Dynamic Time Steps for Vision Transformers

Authors: Gourav Datta, Zeyu Liu, Anni Li, Peter A. Beerel

Abstract: Spiking Neural Networks (SNNs) have emerged as a popular spatio-temporal computing paradigm for complex vision tasks. Recently proposed SNN training algorithms have significantly reduced the number of time steps (down to 1) for improved latency and energy efficiency, however, they target only convolutional neural networks (CNN). These algorithms, when applied on the recently spotlighted vision tra… ▽ More Spiking Neural Networks (SNNs) have emerged as a popular spatio-temporal computing paradigm for complex vision tasks. Recently proposed SNN training algorithms have significantly reduced the number of time steps (down to 1) for improved latency and energy efficiency, however, they target only convolutional neural networks (CNN). These algorithms, when applied on the recently spotlighted vision transformers (ViT), either require a large number of time steps or fail to converge. Based on analysis of the histograms of the ANN and SNN activation maps, we hypothesize that each ViT block has a different sensitivity to the number of time steps. We propose a novel training framework that dynamically allocates the number of time steps to each ViT module depending on a trainable score assigned to each timestep. In particular, we generate a scalar binary time step mask that filters spikes emitted by each neuron in a leaky-integrate-and-fire (LIF) layer. The resulting SNNs have high activation sparsity and require only accumulate operations (AC), except for the input embedding layer, in contrast to expensive multiply-and-accumulates (MAC) needed in traditional ViTs. This yields significant improvements in energy efficiency. We evaluate our training framework and resulting SNNs on image recognition tasks including CIFAR10, CIFAR100, and ImageNet with different ViT architectures. We obtain a test accuracy of 95.97% with 4.97 time steps with direct encoding on CIFAR10. △ Less

Submitted 27 November, 2023; originally announced November 2023.

Comments: Under review

Showing 1–50 of 168 results for author: Li, A