Search | arXiv e-print repository

UltraTwin: Towards Cardiac Anatomical Twin Generation from Multi-view 2D Ultrasound

Authors: Junxuan Yu, Yaofei Duan, Yuhao Huang, Yu Wang, Rongbo Ling, Weihao Luo, Ang Zhang, Jingxian Xu, Qiongying Ni, Yongsong Zhou, Binghan Li, Haoran Dou, Liping Liu, Yanfen Chu, Feng Geng, Zhe Sheng, Zhifeng Ding, Dingxin Zhang, Rui Huang, Yuhang Zhang, Xiaowei Xu, Tao Tan, Dong Ni, Zhongshan Gou, Xin Yang

Abstract: Echocardiography is routine for cardiac examination. However, 2D ultrasound (US) struggles with accurate metric calculation and direct observation of 3D cardiac structures. Moreover, 3D US is limited by low resolution, small field of view and scarce availability in practice. Constructing the cardiac anatomical twin from 2D images is promising to provide precise treatment planning and clinical quan… ▽ More Echocardiography is routine for cardiac examination. However, 2D ultrasound (US) struggles with accurate metric calculation and direct observation of 3D cardiac structures. Moreover, 3D US is limited by low resolution, small field of view and scarce availability in practice. Constructing the cardiac anatomical twin from 2D images is promising to provide precise treatment planning and clinical quantification. However, it remains challenging due to the rare paired data, complex structures, and US noises. In this study, we introduce a novel generative framework UltraTwin, to obtain cardiac anatomical twin from sparse multi-view 2D US. Our contribution is three-fold. First, pioneered the construction of a real-world and high-quality dataset containing strictly paired multi-view 2D US and CT, and pseudo-paired data. Second, we propose a coarse-to-fine scheme to achieve hierarchical reconstruction optimization. Last, we introduce an implicit autoencoder for topology-aware constraints. Extensive experiments show that UltraTwin reconstructs high-quality anatomical twins versus strong competitors. We believe it advances anatomical twin modeling for potential applications in personalized cardiac care. △ Less

Submitted 29 June, 2025; originally announced June 2025.

Comments: accepted by miccai 2025

arXiv:2506.23472 [pdf, ps, other]

Automatic Phase Calibration for High-resolution mmWave Sensing via Ambient Radio Anchors

Authors: Ruixu Geng, Yadong Li, Dongheng Zhang, Pengcheng Huang, Binquan Wang, Binbin Zhang, Zhi Lu, Yang Hu, Yan Chen

Abstract: Millimeter-wave (mmWave) radar systems with large array have pushed radar sensing into a new era, thanks to their high angular resolution. However, our long-term experiments indicate that array elements exhibit phase drift over time and require periodic phase calibration to maintain high-resolution, creating an obstacle for practical high-resolution mmWave sensing. Unfortunately, existing calibrat… ▽ More Millimeter-wave (mmWave) radar systems with large array have pushed radar sensing into a new era, thanks to their high angular resolution. However, our long-term experiments indicate that array elements exhibit phase drift over time and require periodic phase calibration to maintain high-resolution, creating an obstacle for practical high-resolution mmWave sensing. Unfortunately, existing calibration methods are inadequate for periodic recalibration, either because they rely on artificial references or fail to provide sufficient precision. To address this challenge, we introduce AutoCalib, the first framework designed to automatically and accurately calibrate high-resolution mmWave radars by identifying Ambient Radio Anchors (ARAs)-naturally existing objects in ambient environments that offer stable phase references. AutoCalib achieves calibration by first generating spatial spectrum templates based on theoretical electromagnetic characteristics. It then employs a pattern-matching and scoring mechanism to accurately detect these anchors and select the optimal one for calibration. Extensive experiments across 11 environments demonstrate that AutoCalib capable of identifying ARAs that existing methods miss due to their focus on strong reflectors. AutoCalib's calibration performance approaches corner reflectors (74% phase error reduction) while outperforming existing methods by 83%. Beyond radar calibration, AutoCalib effectively supports other phase-dependent applications like handheld imaging, delivering 96% of corner reflector calibration performance without artificial references. △ Less

Submitted 29 June, 2025; originally announced June 2025.

Comments: 13 pages, 21 figures

arXiv:2506.23325 [pdf, ps, other]

XY-Tokenizer: Mitigating the Semantic-Acoustic Conflict in Low-Bitrate Speech Codecs

Authors: Yitian Gong, Luozhijie Jin, Ruifan Deng, Dong Zhang, Xin Zhang, Qinyuan Cheng, Zhaoye Fei, Shimin Li, Xipeng Qiu

Abstract: Speech codecs serve as bridges between speech signals and large language models. An ideal codec for speech language models should not only preserve acoustic information but also capture rich semantic information. However, existing speech codecs struggle to balance high-quality audio reconstruction with ease of modeling by language models. In this study, we analyze the limitations of previous codec… ▽ More Speech codecs serve as bridges between speech signals and large language models. An ideal codec for speech language models should not only preserve acoustic information but also capture rich semantic information. However, existing speech codecs struggle to balance high-quality audio reconstruction with ease of modeling by language models. In this study, we analyze the limitations of previous codecs in balancing semantic richness and acoustic fidelity. We propose XY-Tokenizer, a novel codec that mitigates the conflict between semantic and acoustic capabilities through multi-stage, multi-task learning. Experimental results demonstrate that XY-Tokenizer achieves performance in both semantic and acoustic tasks comparable to that of state-of-the-art codecs operating at similar bitrates, even though those existing codecs typically excel in only one aspect. Specifically, XY-Tokenizer achieves strong text alignment, surpassing distillation-based semantic modeling methods such as SpeechTokenizer and Mimi, while maintaining a speaker similarity score of 0.83 between reconstructed and original audio. The reconstruction performance of XY-Tokenizer is comparable to that of BigCodec, the current state-of-the-art among acoustic-only codecs, which achieves a speaker similarity score of 0.84 at a similar bitrate. Code and models are available at https://github.com/gyt1145028706/XY-Tokenizer. △ Less

Submitted 9 July, 2025; v1 submitted 29 June, 2025; originally announced June 2025.

arXiv:2506.21796 [pdf, ps, other]

Demonstrating Interoperable Channel State Feedback Compression with Machine Learning

Authors: Dani Korpi, Rachel Wang, Jerry Wang, Abdelrahman Ibrahim, Carl Nuzman, Runxin Wang, Kursat Rasim Mestav, Dustin Zhang, Iraj Saniee, Shawn Winston, Gordana Pavlovic, Wei Ding, William J. Hillery, Chenxi Hao, Ram Thirunagari, Jung Chang, Jeehyun Kim, Bartek Kozicki, Dragan Samardzija, Taesang Yoo, Andreas Maeder, Tingfang Ji, Harish Viswanathan

Abstract: Neural network-based compression and decompression of channel state feedback has been one of the most widely studied applications of machine learning (ML) in wireless networks. Various simulation-based studies have shown that ML-based feedback compression can result in reduced overhead and more accurate channel information. However, to the best of our knowledge, there are no real-life proofs of co… ▽ More Neural network-based compression and decompression of channel state feedback has been one of the most widely studied applications of machine learning (ML) in wireless networks. Various simulation-based studies have shown that ML-based feedback compression can result in reduced overhead and more accurate channel information. However, to the best of our knowledge, there are no real-life proofs of concepts demonstrating the benefits of ML-based channel feedback compression in a practical setting, where the user equipment (UE) and base station have no access to each others' ML models. In this paper, we present a novel approach for training interoperable compression and decompression ML models in a confidential manner, and demonstrate the accuracy of the ensuing models using prototype UEs and base stations. The performance of the ML-based channel feedback is measured both in terms of the accuracy of the reconstructed channel information and achieved downlink throughput gains when using the channel information for beamforming. The reported measurement results demonstrate that it is possible to develop an accurate ML-based channel feedback link without having to share ML models between device and network vendors. These results pave the way for a practical implementation of ML-based channel feedback in commercial 6G networks. △ Less

Submitted 26 June, 2025; originally announced June 2025.

Comments: This work has been submitted to the IEEE for possible publication

arXiv:2506.19774 [pdf, ps, other]

Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation

Authors: Jun Wang, Xijuan Zeng, Chunyu Qiang, Ruilong Chen, Shiyao Wang, Le Wang, Wangjing Zhou, Pengfei Cai, Jiahui Zhao, Nan Li, Zihan Li, Yuzhe Liang, Xiaopeng Wang, Haorui Zheng, Ming Wen, Kang Yin, Yiran Wang, Nan Li, Feng Deng, Liang Dong, Chen Zhang, Di Zhang, Kun Gai

Abstract: We propose Kling-Foley, a large-scale multimodal Video-to-Audio generation model that synthesizes high-quality audio synchronized with video content. In Kling-Foley, we introduce multimodal diffusion transformers to model the interactions between video, audio, and text modalities, and combine it with a visual semantic representation module and an audio-visual synchronization module to enhance alig… ▽ More We propose Kling-Foley, a large-scale multimodal Video-to-Audio generation model that synthesizes high-quality audio synchronized with video content. In Kling-Foley, we introduce multimodal diffusion transformers to model the interactions between video, audio, and text modalities, and combine it with a visual semantic representation module and an audio-visual synchronization module to enhance alignment capabilities. Specifically, these modules align video conditions with latent audio elements at the frame level, thereby improving semantic alignment and audio-visual synchronization. Together with text conditions, this integrated approach enables precise generation of video-matching sound effects. In addition, we propose a universal latent audio codec that can achieve high-quality modeling in various scenarios such as sound effects, speech, singing, and music. We employ a stereo rendering method that imbues synthesized audio with a spatial presence. At the same time, in order to make up for the incomplete types and annotations of the open-source benchmark, we also open-source an industrial-level benchmark Kling-Audio-Eval. Our experiments show that Kling-Foley trained with the flow matching objective achieves new audio-visual SOTA performance among public models in terms of distribution matching, semantic alignment, temporal alignment and audio quality. △ Less

Submitted 24 June, 2025; originally announced June 2025.

arXiv:2506.16381 [pdf, ps, other]

InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems

Authors: Kexin Huang, Qian Tu, Liwei Fan, Chenchen Yang, Dong Zhang, Shimin Li, Zhaoye Fei, Qinyuan Cheng, Xipeng Qiu

Abstract: In modern speech synthesis, paralinguistic information--such as a speaker's vocal timbre, emotional state, and dynamic prosody--plays a critical role in conveying nuance beyond mere semantics. Traditional Text-to-Speech (TTS) systems rely on fixed style labels or inserting a speech prompt to control these cues, which severely limits flexibility. Recent attempts seek to employ natural-language inst… ▽ More In modern speech synthesis, paralinguistic information--such as a speaker's vocal timbre, emotional state, and dynamic prosody--plays a critical role in conveying nuance beyond mere semantics. Traditional Text-to-Speech (TTS) systems rely on fixed style labels or inserting a speech prompt to control these cues, which severely limits flexibility. Recent attempts seek to employ natural-language instructions to modulate paralinguistic features, substantially improving the generalization of instruction-driven TTS models. Although many TTS systems now support customized synthesis via textual description, their actual ability to interpret and execute complex instructions remains largely unexplored. In addition, there is still a shortage of high-quality benchmarks and automated evaluation metrics specifically designed for instruction-based TTS, which hinders accurate assessment and iterative optimization of these models. To address these limitations, we introduce InstructTTSEval, a benchmark for measuring the capability of complex natural-language style control. We introduce three tasks, namely Acoustic-Parameter Specification, Descriptive-Style Directive, and Role-Play, including English and Chinese subsets, each with 1k test cases (6k in total) paired with reference audio. We leverage Gemini as an automatic judge to assess their instruction-following abilities. Our evaluation of accessible instruction-following TTS systems highlights substantial room for further improvement. We anticipate that InstructTTSEval will drive progress toward more powerful, flexible, and accurate instruction-following TTS. △ Less

Submitted 19 June, 2025; originally announced June 2025.

Comments: 19 pages, 9 figures

arXiv:2506.13094 [pdf, ps, other]

MorphSAM: Learning the Morphological Prompts from Atlases for Spine Image Segmentation

Authors: Dingwei Fan, Junyong Zhao, Chunlin Li, Xinlong Wang, Ronghan Zhang, Mingliang Wang, Qi Zhu, Haipeng Si, Daoqiang Zhang, Liang Sun

Abstract: Spine image segmentation is crucial for clinical diagnosis and treatment of spine diseases. The complex structure of the spine and the high morphological similarity between individual vertebrae and adjacent intervertebral discs make accurate spine segmentation a challenging task. Although the Segment Anything Model (SAM) has been developed, it still struggles to effectively capture and utilize mor… ▽ More Spine image segmentation is crucial for clinical diagnosis and treatment of spine diseases. The complex structure of the spine and the high morphological similarity between individual vertebrae and adjacent intervertebral discs make accurate spine segmentation a challenging task. Although the Segment Anything Model (SAM) has been developed, it still struggles to effectively capture and utilize morphological information, limiting its ability to enhance spine image segmentation performance. To address these challenges, in this paper, we propose a MorphSAM that explicitly learns morphological information from atlases, thereby strengthening the spine image segmentation performance of SAM. Specifically, the MorphSAM includes two fully automatic prompt learning networks, 1) an anatomical prompt learning network that directly learns morphological information from anatomical atlases, and 2) a semantic prompt learning network that derives morphological information from text descriptions converted from the atlases. Then, the two learned morphological prompts are fed into the SAM model to boost the segmentation performance. We validate our MorphSAM on two spine image segmentation tasks, including a spine anatomical structure segmentation task with CT images and a lumbosacral plexus segmentation task with MR images. Experimental results demonstrate that our MorphSAM achieves superior segmentation performance when compared to the state-of-the-art methods. △ Less

Submitted 16 June, 2025; originally announced June 2025.

arXiv:2506.10362 [pdf, ps, other]

Relaxation-Free Min-k-Partition for PCI Assignment in 5G Networks

Authors: Yeqing Qiu, Chengpiao Huang, Ye Xue, Zhipeng Jiang, Qingjiang Shi, Dong Zhang, Zhi-Quan Luo

Abstract: Physical Cell Identity (PCI) is a critical parameter in 5G networks. Efficient and accurate PCI assignment is essential for mitigating mod-3 interference, mod-30 interference, collisions, and confusions among cells, which directly affect network reliability and user experience. In this paper, we propose a novel framework for PCI assignment by decomposing the problem into Min-3-Partition, Min-10-Pa… ▽ More Physical Cell Identity (PCI) is a critical parameter in 5G networks. Efficient and accurate PCI assignment is essential for mitigating mod-3 interference, mod-30 interference, collisions, and confusions among cells, which directly affect network reliability and user experience. In this paper, we propose a novel framework for PCI assignment by decomposing the problem into Min-3-Partition, Min-10-Partition, and a graph coloring problem, leveraging the Chinese Remainder Theorem (CRT). Furthermore, we develop a relaxation-free approach to the general Min-k-Partition problem by reformulating it as a quadratic program with a norm-equality constraint and solving it using a penalized mirror descent (PMD) algorithm. The proposed method demonstrates superior computational efficiency and scalability, significantly reducing interference while eliminating collisions and confusions in large-scale 5G networks. Numerical evaluations on real-world datasets show that our approach reduces computational time by up to 20 times compared to state-of-the-art methods, making it highly practical for real-time PCI optimization in large-scale networks. These results highlight the potential of our method to improve network performance and reduce deployment costs in modern 5G systems. △ Less

Submitted 13 June, 2025; v1 submitted 12 June, 2025; originally announced June 2025.

arXiv:2506.05381 [pdf, other]

Heterogeneous Secure Transmissions in IRS-Assisted NOMA Communications: CO-GNN Approach

Authors: Linlin Liang, Zongkai Tian, Haiyan Huang, Xiaoyan Li, Zhisheng Yin, Dehua Zhang, Nina Zhang, Wenchao Zhai

Abstract: Intelligent Reflecting Surfaces (IRS) enhance spectral efficiency by adjusting reflection phase shifts, while Non-Orthogonal Multiple Access (NOMA) increases system capacity. Consequently, IRS-assisted NOMA communications have garnered significant research interest. However, the passive nature of the IRS, lacking authentication and security protocols, makes these systems vulnerable to external eav… ▽ More Intelligent Reflecting Surfaces (IRS) enhance spectral efficiency by adjusting reflection phase shifts, while Non-Orthogonal Multiple Access (NOMA) increases system capacity. Consequently, IRS-assisted NOMA communications have garnered significant research interest. However, the passive nature of the IRS, lacking authentication and security protocols, makes these systems vulnerable to external eavesdropping due to the openness of electromagnetic signal propagation and reflection. NOMA's inherent multi-user signal superposition also introduces internal eavesdropping risks during user pairing. This paper investigates secure transmissions in IRS-assisted NOMA systems with heterogeneous resource configuration in wireless networks to mitigate both external and internal eavesdropping. To maximize the sum secrecy rate of legitimate users, we propose a combinatorial optimization graph neural network (CO-GNN) approach to jointly optimize beamforming at the base station, power allocation of NOMA users, and phase shifts of IRS for dynamic heterogeneous resource allocation, thereby enabling the design of dual-link or multi-link secure transmissions in the presence of eavesdroppers on the same or heterogeneous links. The CO-GNN algorithm simplifies the complex mathematical problem-solving process, eliminates the need for channel estimation, and enhances scalability. Simulation results demonstrate that the proposed algorithm significantly enhances the secure transmission performance of the system. △ Less

Submitted 3 June, 2025; originally announced June 2025.

arXiv:2506.02585 [pdf, ps, other]

A Tree-guided CNN for image super-resolution

Authors: Chunwei Tian, Mingjian Song, Xiaopeng Fan, Xiangtao Zheng, Bob Zhang, David Zhang

Abstract: Deep convolutional neural networks can extract more accurate structural information via deep architectures to obtain good performance in image super-resolution. However, it is not easy to find effect of important layers in a single network architecture to decrease performance of super-resolution. In this paper, we design a tree-guided CNN for image super-resolution (TSRNet). It uses a tree archite… ▽ More Deep convolutional neural networks can extract more accurate structural information via deep architectures to obtain good performance in image super-resolution. However, it is not easy to find effect of important layers in a single network architecture to decrease performance of super-resolution. In this paper, we design a tree-guided CNN for image super-resolution (TSRNet). It uses a tree architecture to guide a deep network to enhance effect of key nodes to amplify the relation of hierarchical information for improving the ability of recovering images. To prevent insufficiency of the obtained structural information, cosine transform techniques in the TSRNet are used to extract cross-domain information to improve the performance of image super-resolution. Adaptive Nesterov momentum optimizer (Adan) is applied to optimize parameters to boost effectiveness of training a super-resolution model. Extended experiments can verify superiority of the proposed TSRNet for restoring high-quality images. Its code can be obtained at https://github.com/hellloxiaotian/TSRNet. △ Less

Submitted 3 June, 2025; originally announced June 2025.

Comments: This paper has been accepted for publication in IEEE Transactions on Consumer Electronics. 10 pages, 6 figures. Its code can be obtained at https://github.com/hellloxiaotian/TSRNet

arXiv:2505.24382 [pdf, ps, other]

MagicGripper: A Multimodal Sensor-Integrated Gripper for Contact-Rich Robotic Manipulation

Authors: Wen Fan, Haoran Li, Dandan Zhang

Abstract: Contact-rich manipulation in unstructured environments demands precise, multimodal perception to enable robust and adaptive control. Vision-based tactile sensors (VBTSs) have emerged as an effective solution; however, conventional VBTSs often face challenges in achieving compact, multi-modal functionality due to hardware constraints and algorithmic complexity. In this work, we present MagicGripper… ▽ More Contact-rich manipulation in unstructured environments demands precise, multimodal perception to enable robust and adaptive control. Vision-based tactile sensors (VBTSs) have emerged as an effective solution; however, conventional VBTSs often face challenges in achieving compact, multi-modal functionality due to hardware constraints and algorithmic complexity. In this work, we present MagicGripper, a multimodal sensor-integrated gripper designed for contact-rich robotic manipulation. Building on our prior design, MagicTac, we develop a compact variant, mini-MagicTac, which features a three-dimensional, multi-layered grid embedded in a soft elastomer. MagicGripper integrates mini-MagicTac, enabling high-resolution tactile feedback alongside proximity and visual sensing within a compact, gripper-compatible form factor. We conduct a thorough evaluation of mini-MagicTac's performance, demonstrating its capabilities in spatial resolution, contact localization, and force regression. We also assess its robustness across manufacturing variability, mechanical deformation, and sensing performance under real-world conditions. Furthermore, we validate the effectiveness of MagicGripper through three representative robotic tasks: a teleoperated assembly task, a contact-based alignment task, and an autonomous robotic grasping task. Across these experiments, MagicGripper exhibits reliable multimodal perception, accurate force estimation, and high adaptability to challenging manipulation scenarios. Our results highlight the potential of MagicGripper as a practical and versatile tool for embodied intelligence in complex, contact-rich environments. △ Less

Submitted 30 May, 2025; originally announced May 2025.

Comments: 19 pages, 24 figures

arXiv:2505.22013 [pdf, other]

Overlap-Adaptive Hybrid Speaker Diarization and ASR-Aware Observation Addition for MISP 2025 Challenge

Authors: Shangkun Huang, Yuxuan Du, Jingwen Yang, Dejun Zhang, Xupeng Jia, Jing Deng, Jintao Kang, Rong Zheng

Abstract: This paper presents the system developed to address the MISP 2025 Challenge. For the diarization system, we proposed a hybrid approach combining a WavLM end-to-end segmentation method with a traditional multi-module clustering technique to adaptively select the appropriate model for handling varying degrees of overlapping speech. For the automatic speech recognition (ASR) system, we proposed an AS… ▽ More This paper presents the system developed to address the MISP 2025 Challenge. For the diarization system, we proposed a hybrid approach combining a WavLM end-to-end segmentation method with a traditional multi-module clustering technique to adaptively select the appropriate model for handling varying degrees of overlapping speech. For the automatic speech recognition (ASR) system, we proposed an ASR-aware observation addition method that compensates for the performance limitations of Guided Source Separation (GSS) under low signal-to-noise ratio conditions. Finally, we integrated the speaker diarization and ASR systems in a cascaded architecture to address Track 3. Our system achieved character error rates (CER) of 9.48% on Track 2 and concatenated minimum permutation character error rate (cpCER) of 11.56% on Track 3, ultimately securing first place in both tracks and thereby demonstrating the effectiveness of the proposed methods in real-world meeting scenarios. △ Less

Submitted 28 May, 2025; originally announced May 2025.

Comments: Accepted to Interspeech 2025

arXiv:2505.13062 [pdf, other]

Hearing from Silence: Reasoning Audio Descriptions from Silent Videos via Vision-Language Model

Authors: Yong Ren, Chenxing Li, Le Xu, Hao Gu, Duzhen Zhang, Yujie Chen, Manjie Xu, Ruibo Fu, Shan Yang, Dong Yu

Abstract: Humans can intuitively infer sounds from silent videos, but whether multimodal large language models can perform modal-mismatch reasoning without accessing target modalities remains relatively unexplored. Current text-assisted-video-to-audio (VT2A) methods excel in video foley tasks but struggle to acquire audio descriptions during inference. We introduce the task of Reasoning Audio Descriptions f… ▽ More Humans can intuitively infer sounds from silent videos, but whether multimodal large language models can perform modal-mismatch reasoning without accessing target modalities remains relatively unexplored. Current text-assisted-video-to-audio (VT2A) methods excel in video foley tasks but struggle to acquire audio descriptions during inference. We introduce the task of Reasoning Audio Descriptions from Silent Videos (SVAD) to address this challenge and investigate vision-language models' (VLMs) capabilities on this task. To further enhance the VLMs' reasoning capacity for the SVAD task, we construct a CoT-AudioCaps dataset and propose a Chain-of-Thought-based supervised fine-tuning strategy. Experiments on SVAD and subsequent VT2A tasks demonstrate our method's effectiveness in two key aspects: significantly improving VLMs' modal-mismatch reasoning for SVAD and effectively addressing the challenge of acquiring audio descriptions during VT2A inference. △ Less

Submitted 27 May, 2025; v1 submitted 19 May, 2025; originally announced May 2025.

Comments: Accepted by Interspeech 2025

arXiv:2505.06495 [pdf, other]

Monopulse Parameter Estimation based on MIMO-STCA Radar in the Presence of Multiple Mainlobe Jammings

Authors: Huake Wang, Dongchang Zhang, Guisheng Liao, Yinghui Quan

Abstract: The monopulse technique is characterized by its high accuracy in angle estimation and simplicity in engineering implementation. However, in the complex electromagnetic environment, the presence of the mainlobe jamming (MLJ) greatly degrades the accuracy of angle estimation. Conventional methods of jamming suppression often lead to significant deviations in monopulse ratio while suppressing MLJ. Ad… ▽ More The monopulse technique is characterized by its high accuracy in angle estimation and simplicity in engineering implementation. However, in the complex electromagnetic environment, the presence of the mainlobe jamming (MLJ) greatly degrades the accuracy of angle estimation. Conventional methods of jamming suppression often lead to significant deviations in monopulse ratio while suppressing MLJ. Additionally, the monopulse technique based on traditional radar cannot jointly estimate the target's range. In this paper, the four-channel adaptive beamforming (ABF) algorithm is proposed, which adds a delta-delta channel based on conventional sum-difference-difference three-channel to suppress a single MLJ. Moreover, considering the suppression of multiple MLJs and sidelobe jammings (SLJs), the row-column ABF algorithm is proposed. This algorithm utilizes more spatial degrees of freedom (DOFs) to suppress multiple jammings by the row-column adaptive beamforming at the subarray level. The key ideal of both algorithms is to suppress MLJ with null along one spatial direction while keeping the sum and difference beampatterns undistorted along another spatial direction. Therefore, the monopulse ratio remains undistorted while suppressing the MLJ, ensuring the accuracy of monopulse parameter estimation. Furthermore, by utilizing the additional degrees of freedom (DOFs) in the range domain provided by the multiple-input multiple-output space-time coding array (MIMO-STCA) radar, joint angle-range estimation can be achieved through the monopulse technique. Simulation results highlight the effectiveness of the proposed methods in suppressing multiple MLJs and enhancing the accuracy of monopulse parameter estimation, as verified by the low root mean square error (RMSE) in the parameter estimation results. △ Less

Submitted 9 May, 2025; originally announced May 2025.

Comments: 10 pages,15 figures

arXiv:2505.04652 [pdf, other]

Rethinking Boundary Detection in Deep Learning-Based Medical Image Segmentation

Authors: Yi Lin, Dong Zhang, Xiao Fang, Yufan Chen, Kwang-Ting Cheng, Hao Chen

Abstract: Medical image segmentation is a pivotal task within the realms of medical image analysis and computer vision. While current methods have shown promise in accurately segmenting major regions of interest, the precise segmentation of boundary areas remains challenging. In this study, we propose a novel network architecture named CTO, which combines Convolutional Neural Networks (CNNs), Vision Transfo… ▽ More Medical image segmentation is a pivotal task within the realms of medical image analysis and computer vision. While current methods have shown promise in accurately segmenting major regions of interest, the precise segmentation of boundary areas remains challenging. In this study, we propose a novel network architecture named CTO, which combines Convolutional Neural Networks (CNNs), Vision Transformer (ViT) models, and explicit edge detection operators to tackle this challenge. CTO surpasses existing methods in terms of segmentation accuracy and strikes a better balance between accuracy and efficiency, without the need for additional data inputs or label injections. Specifically, CTO adheres to the canonical encoder-decoder network paradigm, with a dual-stream encoder network comprising a mainstream CNN stream for capturing local features and an auxiliary StitchViT stream for integrating long-range dependencies. Furthermore, to enhance the model's ability to learn boundary areas, we introduce a boundary-guided decoder network that employs binary boundary masks generated by dedicated edge detection operators to provide explicit guidance during the decoding process. We validate the performance of CTO through extensive experiments conducted on seven challenging medical image segmentation datasets, namely ISIC 2016, PH2, ISIC 2018, CoNIC, LiTS17, and BTCV. Our experimental results unequivocally demonstrate that CTO achieves state-of-the-art accuracy on these datasets while maintaining competitive model complexity. The codes have been released at: https://github.com/xiaofang007/CTO. △ Less

Submitted 6 May, 2025; originally announced May 2025.

Comments: Accepted by Medical Image Analysis

arXiv:2505.01675 [pdf, other]

Enhanced Prediction Model for Time Series Characterized by GARCH via Interval Type-2 Fuzzy Inference System

Authors: Hongpei Shao, Da-Qing Zhang, Feilong Lu

Abstract: GARCH-type time series (characterized by Generalized Autoregressive Conditional Heteroskedasticity) exhibit pronounced volatility, autocorrelation, and heteroskedasticity. To address these challenges and enhance predictive accuracy, this study introduces a hybrid forecasting framework that integrates the Interval Type-2 Fuzzy Inference System (IT2FIS) with the GARCH model. Leveraging the interval-… ▽ More GARCH-type time series (characterized by Generalized Autoregressive Conditional Heteroskedasticity) exhibit pronounced volatility, autocorrelation, and heteroskedasticity. To address these challenges and enhance predictive accuracy, this study introduces a hybrid forecasting framework that integrates the Interval Type-2 Fuzzy Inference System (IT2FIS) with the GARCH model. Leveraging the interval-based uncertainty representation of IT2FIS and the volatility-capturing capability of GARCH, the proposed model effectively mitigates the adverse impact of heteroskedasticity on prediction reliability. Specifically, the GARCH component estimates conditional variance, which is subsequently incorporated into the Gaussian membership functions of IT2FIS. This integration transforms IT2FIS into an adaptive variable-parameter system, dynamically aligning with the time-varying volatility of the target series. Through systematic parameter optimization, the framework not only captures intricate volatility patterns but also accounts for heteroskedasticity and epistemic uncertainties during modeling, thereby improving both prediction precision and model robustness. Experimental validation employs diverse datasets, including air quality concentration, urban traffic flow, and energy consumption. Comparative analyses are conducted against models: the GARCH-Takagi-Sugeno-Kang (GARCH-TSK) model, fixed-variance time series models, the GARCH-Gated Recurrent Unit (GARCH-GRU), and Long Short-Term Memory (LSTM) networks. The results indicate that the proposed model achieves superior predictive performance across the majority of test scenarios in error metrics. These findings underscore the effectiveness of hybrid approaches in forecasting uncertainty for GARCH-type time series, highlighting their practical utility in real-world time series forecasting applications. △ Less

Submitted 27 May, 2025; v1 submitted 2 May, 2025; originally announced May 2025.

Comments: 40 pages, 13 figures, references added

arXiv:2505.00742 [pdf, other]

Zoomer: Adaptive Image Focus Optimization for Black-box MLLM

Authors: Jiaxu Qian, Chendong Wang, Yifan Yang, Chaoyun Zhang, Huiqiang Jiang, Xufang Luo, Yu Kang, Qingwei Lin, Anlan Zhang, Shiqi Jiang, Ting Cao, Tianjun Mao, Suman Banerjee, Guyue Liu, Saravan Rajmohan, Dongmei Zhang, Yuqing Yang, Qi Zhang, Lili Qiu

Abstract: Recent advancements in multimodal large language models (MLLMs) have broadened the scope of vision-language tasks, excelling in applications like image captioning and interactive question-answering. However, these models struggle with accurately processing visual data, particularly in tasks requiring precise object recognition and fine visual details. Stringent token limits often result in the omi… ▽ More Recent advancements in multimodal large language models (MLLMs) have broadened the scope of vision-language tasks, excelling in applications like image captioning and interactive question-answering. However, these models struggle with accurately processing visual data, particularly in tasks requiring precise object recognition and fine visual details. Stringent token limits often result in the omission of critical information, hampering performance. To address these limitations, we introduce \SysName, a novel visual prompting mechanism designed to enhance MLLM performance while preserving essential visual details within token limits. \SysName features three key innovations: a prompt-aware strategy that dynamically highlights relevant image regions, a spatial-preserving orchestration schema that maintains object integrity, and a budget-aware prompting method that balances global context with crucial visual details. Comprehensive evaluations across multiple datasets demonstrate that \SysName consistently outperforms baseline methods, achieving up to a $26.9\%$ improvement in accuracy while significantly reducing token consumption. △ Less

Submitted 29 April, 2025; originally announced May 2025.

arXiv:2504.15649 [pdf, other]

RepNet-VSR: Reparameterizable Architecture for High-Fidelity Video Super-Resolution

Authors: Biao Wu, Diankai Zhang, Shaoli Liu, Si Gao, Chengjian Zheng, Ning Wang

Abstract: As a fundamental challenge in visual computing, video super-resolution (VSR) focuses on reconstructing highdefinition video sequences from their degraded lowresolution counterparts. While deep convolutional neural networks have demonstrated state-of-the-art performance in spatial-temporal super-resolution tasks, their computationally intensive nature poses significant deployment challenges for res… ▽ More As a fundamental challenge in visual computing, video super-resolution (VSR) focuses on reconstructing highdefinition video sequences from their degraded lowresolution counterparts. While deep convolutional neural networks have demonstrated state-of-the-art performance in spatial-temporal super-resolution tasks, their computationally intensive nature poses significant deployment challenges for resource-constrained edge devices, particularly in real-time mobile video processing scenarios where power efficiency and latency constraints coexist. In this work, we propose a Reparameterizable Architecture for High Fidelity Video Super Resolution method, named RepNet-VSR, for real-time 4x video super-resolution. On the REDS validation set, the proposed model achieves 27.79 dB PSNR when processing 180p to 720p frames in 103 ms per 10 frames on a MediaTek Dimensity NPU. The competition results demonstrate an excellent balance between restoration quality and deployment efficiency. The proposed method scores higher than the previous champion algorithm of MAI video super-resolution challenge. △ Less

Submitted 22 April, 2025; originally announced April 2025.

Comments: Champion Solution for CVPR 2025 MAI VSR Track

arXiv:2504.15628 [pdf, ps, other]

Joint Security-Latency Design for Short Packet-Based Low-Altitude Communications

Authors: Zeyin Wang, Di Zhang, Shaobo Jia, Lulu Song, Yanqun Tang

Abstract: In this article, a joint security and latency analysis of short packet-based low-altitude communications when the eavesdropper is close to the receiver is addressed. To reveal the impacts of the signal-to-noise ratio (SNR) and block-length on latency in communications, we propose a new metric named secure latency (SL) and derive the expressions for the effective secure probability (ESP) and the av… ▽ More In this article, a joint security and latency analysis of short packet-based low-altitude communications when the eavesdropper is close to the receiver is addressed. To reveal the impacts of the signal-to-noise ratio (SNR) and block-length on latency in communications, we propose a new metric named secure latency (SL) and derive the expressions for the effective secure probability (ESP) and the average SL. To minimize the average SL, different transmission designs are analyzed, in which the optimal solutions of SNR and block-length are provided. Numerical results validate our analysis and reveal the trade-off between reliability and security and the impacts of the block-length, SNR, and packet-generating rate on average SL, of which SNR and the block-length account for main factors. In addition, we find that the performance of SL can be enhanced by allocating less SNR. △ Less

Submitted 22 April, 2025; originally announced April 2025.

arXiv:2504.14032 [pdf, other]

LoftUp: Learning a Coordinate-Based Feature Upsampler for Vision Foundation Models

Authors: Haiwen Huang, Anpei Chen, Volodymyr Havrylov, Andreas Geiger, Dan Zhang

Abstract: Vision foundation models (VFMs) such as DINOv2 and CLIP have achieved impressive results on various downstream tasks, but their limited feature resolution hampers performance in applications requiring pixel-level understanding. Feature upsampling offers a promising direction to address this challenge. In this work, we identify two critical factors for enhancing feature upsampling: the upsampler ar… ▽ More Vision foundation models (VFMs) such as DINOv2 and CLIP have achieved impressive results on various downstream tasks, but their limited feature resolution hampers performance in applications requiring pixel-level understanding. Feature upsampling offers a promising direction to address this challenge. In this work, we identify two critical factors for enhancing feature upsampling: the upsampler architecture and the training objective. For the upsampler architecture, we introduce a coordinate-based cross-attention transformer that integrates the high-resolution images with coordinates and low-resolution VFM features to generate sharp, high-quality features. For the training objective, we propose constructing high-resolution pseudo-groundtruth features by leveraging class-agnostic masks and self-distillation. Our approach effectively captures fine-grained details and adapts flexibly to various input and feature resolutions. Through experiments, we demonstrate that our approach significantly outperforms existing feature upsampling techniques across various downstream tasks. Our code is released at https://github.com/andrehuang/loftup. △ Less

Submitted 18 April, 2025; originally announced April 2025.

arXiv:2504.06830 [pdf, other]

Integrated Sensing and Communications Over the Years: An Evolution Perspective

Authors: Di Zhang, Yuanhao Cui, Xiaowen Cao, Nanchi Su, Fan Liu, Xiaojun Jing, J. Andrew Zhang, Jie Xu, Christos Masouros, Dusit Niyato, Marco Di Renzo

Abstract: Integrated Sensing and Communications (ISAC) enables efficient spectrum utilization and reduces hardware costs for beyond 5G (B5G) and 6G networks, facilitating intelligent applications that require both high-performance communication and precise sensing capabilities. This survey provides a comprehensive review of the evolution of ISAC over the years. We examine the expansion of the spectrum acros… ▽ More Integrated Sensing and Communications (ISAC) enables efficient spectrum utilization and reduces hardware costs for beyond 5G (B5G) and 6G networks, facilitating intelligent applications that require both high-performance communication and precise sensing capabilities. This survey provides a comprehensive review of the evolution of ISAC over the years. We examine the expansion of the spectrum across RF and optical ISAC, highlighting the role of advanced technologies, along with key challenges and synergies. We further discuss the advancements in network architecture from single-cell to multi-cell systems, emphasizing the integration of collaborative sensing and interference mitigation strategies. Moreover, we analyze the progress from single-modal to multi-modal sensing, with a focus on the integration of edge intelligence to enable real-time data processing, reduce latency, and enhance decision-making. Finally, we extensively review standardization efforts by 3GPP, IEEE, and ITU, examining the transition of ISAC-related technologies and their implications for the deployment of 6G networks. △ Less

Submitted 9 April, 2025; originally announced April 2025.

arXiv:2504.01044 [pdf, other]

Coarse-to-Fine Learning for Multi-Pipette Localisation in Robot-Assisted In Vivo Patch-Clamp

Authors: Lan Wei, Gema Vera Gonzalez, Phatsimo Kgwarae, Alexander Timms, Denis Zahorovsky, Simon Schultz, Dandan Zhang

Abstract: In vivo image-guided multi-pipette patch-clamp is essential for studying cellular interactions and network dynamics in neuroscience. However, current procedures mainly rely on manual expertise, which limits accessibility and scalability. Robotic automation presents a promising solution, but achieving precise real-time detection of multiple pipettes remains a challenge. Existing methods focus on ex… ▽ More In vivo image-guided multi-pipette patch-clamp is essential for studying cellular interactions and network dynamics in neuroscience. However, current procedures mainly rely on manual expertise, which limits accessibility and scalability. Robotic automation presents a promising solution, but achieving precise real-time detection of multiple pipettes remains a challenge. Existing methods focus on ex vivo experiments or single pipette use, making them inadequate for in vivo multi-pipette scenarios. To address these challenges, we propose a heatmap-augmented coarse-to-fine learning technique to facilitate multi-pipette real-time localisation for robot-assisted in vivo patch-clamp. More specifically, we introduce a Generative Adversarial Network (GAN)-based module to remove background noise and enhance pipette visibility. We then introduce a two-stage Transformer model that starts with predicting the coarse heatmap of the pipette tips, followed by the fine-grained coordination regression module for precise tip localisation. To ensure robust training, we use the Hungarian algorithm for optimal matching between the predicted and actual locations of tips. Experimental results demonstrate that our method achieved > 98% accuracy within 10 μm, and > 89% accuracy within 5 μm for the localisation of multi-pipette tips. The average MSE is 2.52 μm. △ Less

Submitted 31 March, 2025; originally announced April 2025.

arXiv:2504.00735 [pdf, ps, other]

Reinforcement learning for robust dynamic metabolic control

Authors: Sebastián Espinel-Ríos, River Walser, Dongda Zhang

Abstract: Dynamic metabolic control enables key metabolic fluxes to be modulated in real time, enhancing bioprocess flexibility and expanding the available optimization degrees of freedom. This can be achieved, e.g., via targeted modulation of metabolic enzyme expression. However, identifying optimal dynamic control policies in metabolic engineering is challenging due to the generally high-dimensional solut… ▽ More Dynamic metabolic control enables key metabolic fluxes to be modulated in real time, enhancing bioprocess flexibility and expanding the available optimization degrees of freedom. This can be achieved, e.g., via targeted modulation of metabolic enzyme expression. However, identifying optimal dynamic control policies in metabolic engineering is challenging due to the generally high-dimensional solution space, and the need to manage potential metabolic burden and cytotoxic effects, arising from inducible enzyme expression. This task is further complicated by the presence of stochastic dynamics, which reduce the reproducibility of bioprocesses. Here, we propose a reinforcement learning framework to derive optimal dynamic metabolic control policies by allowing an agent (i.e., the controller) to interact with a surrogate dynamic model $\textit{in silico}$. To promote robustness in the metabolic control policies, we apply domain randomization, enabling the controller to generalize across system uncertainties. Our framework provides an alternative to conventional model-based control such as model predictive control, which requires model differentiation with respect to decision variables; an often impractical task when dealing with complex stochastic, nonlinear, stiff, and piecewise-defined dynamics. In contrast, our approach only requires forward integration, making the task much simpler. We demonstrate our framework in two $\textit{Escherichia coli}$ bioprocesses, one involving the dynamic control of acetyl-CoA carboxylase in the synthesis of fatty acids, and another one dealing with the dynamic control of adenosine triphosphatase in the synthesis of lactate. △ Less

Submitted 5 July, 2025; v1 submitted 1 April, 2025; originally announced April 2025.

arXiv:2503.24313 [pdf]

1-Tb/s/λ Transmission over Record 10714-km AR-HCF

Authors: Dawei Ge, Siyuan Liu, Qiang Qiu, Peng Li, Qiang Guo, Yiqi Li, Dong Wang, Baoluo Yan, Mingqing Zuo, Lei Zhang, Dechao Zhang, Hu Shi, Jie Luo, Han Li, Zhangyuan Chen

Abstract: We present the first single-channel 1.001-Tb/s DP-36QAM-PCS recirculating transmission over 73 loops of 146.77-km ultra-low-loss & low-IMI DNANF-5 fiber, achieving a record transmission distance of 10,714.28 km. We present the first single-channel 1.001-Tb/s DP-36QAM-PCS recirculating transmission over 73 loops of 146.77-km ultra-low-loss & low-IMI DNANF-5 fiber, achieving a record transmission distance of 10,714.28 km. △ Less

Submitted 2 April, 2025; v1 submitted 31 March, 2025; originally announced March 2025.

arXiv:2503.22409 [pdf, ps, other]

Reinforcement learning for efficient and robust multi-setpoint and multi-trajectory tracking in bioprocesses

Authors: Sebastián Espinel-Ríos, José L. Avalos, Ehecatl Antonio del Rio Chanona, Dongda Zhang

Abstract: Efficient and robust bioprocess control is essential for maximizing performance and adaptability in advanced biotechnological systems. In this work, we present a reinforcement-learning framework for multi-setpoint and multi-trajectory tracking. Tracking multiple setpoints and time-varying trajectories in reinforcement learning is challenging due to the complexity of balancing multiple objectives,… ▽ More Efficient and robust bioprocess control is essential for maximizing performance and adaptability in advanced biotechnological systems. In this work, we present a reinforcement-learning framework for multi-setpoint and multi-trajectory tracking. Tracking multiple setpoints and time-varying trajectories in reinforcement learning is challenging due to the complexity of balancing multiple objectives, a difficulty further exacerbated by system uncertainties such as uncertain initial conditions and stochastic dynamics. This challenge is relevant, e.g., in bioprocesses involving microbial consortia, where precise control over population compositions is required. We introduce a novel return function based on multiplicative reciprocal saturation functions, which explicitly couples reward gains to the simultaneous satisfaction of multiple references. Through a case study involving light-mediated cybergenetic growth control in microbial consortia, we demonstrate via computational experiments that our approach achieves faster convergence, improved stability, and superior control compliance compared to conventional quadratic-cost-based return functions. Moreover, our method enables tuning of the saturation function's parameters, shaping the learning process and policy updates. By incorporating system uncertainties, our framework also demonstrates robustness, a key requirement in industrial bioprocessing. Overall, this work advances reinforcement-learning-based control strategies in bioprocess engineering, with implications in the broader field of process and systems engineering. △ Less

Submitted 24 June, 2025; v1 submitted 28 March, 2025; originally announced March 2025.

arXiv:2503.09491 [pdf, other]

DAMM-Diffusion: Learning Divergence-Aware Multi-Modal Diffusion Model for Nanoparticles Distribution Prediction

Authors: Junjie Zhou, Shouju Wang, Yuxia Tang, Qi Zhu, Daoqiang Zhang, Wei Shao

Abstract: The prediction of nanoparticles (NPs) distribution is crucial for the diagnosis and treatment of tumors. Recent studies indicate that the heterogeneity of tumor microenvironment (TME) highly affects the distribution of NPs across tumors. Hence, it has become a research hotspot to generate the NPs distribution by the aid of multi-modal TME components. However, the distribution divergence among mult… ▽ More The prediction of nanoparticles (NPs) distribution is crucial for the diagnosis and treatment of tumors. Recent studies indicate that the heterogeneity of tumor microenvironment (TME) highly affects the distribution of NPs across tumors. Hence, it has become a research hotspot to generate the NPs distribution by the aid of multi-modal TME components. However, the distribution divergence among multi-modal TME components may cause side effects i.e., the best uni-modal model may outperform the joint generative model. To address the above issues, we propose a \textbf{D}ivergence-\textbf{A}ware \textbf{M}ulti-\textbf{M}odal \textbf{Diffusion} model (i.e., \textbf{DAMM-Diffusion}) to adaptively generate the prediction results from uni-modal and multi-modal branches in a unified network. In detail, the uni-modal branch is composed of the U-Net architecture while the multi-modal branch extends it by introducing two novel fusion modules i.e., Multi-Modal Fusion Module (MMFM) and Uncertainty-Aware Fusion Module (UAFM). Specifically, the MMFM is proposed to fuse features from multiple modalities, while the UAFM module is introduced to learn the uncertainty map for cross-attention computation. Following the individual prediction results from each branch, the Divergence-Aware Multi-Modal Predictor (DAMMP) module is proposed to assess the consistency of multi-modal data with the uncertainty map, which determines whether the final prediction results come from multi-modal or uni-modal predictions. We predict the NPs distribution given the TME components of tumor vessels and cell nuclei, and the experimental results show that DAMM-Diffusion can generate the distribution of NPs with higher accuracy than the comparing methods. Additional results on the multi-modal brain image synthesis task further validate the effectiveness of the proposed method. △ Less

Submitted 12 March, 2025; originally announced March 2025.

Comments: Accepted by CVPR 2025

arXiv:2503.06359 [pdf, other]

Deep Reinforcement Learning-Based Semi-Autonomous Control for Magnetic Micro-robot Navigation with Immersive Manipulation

Authors: Yudong Mao, Dandan Zhang

Abstract: Magnetic micro-robots have demonstrated immense potential in biomedical applications, such as in vivo drug delivery, non-invasive diagnostics, and cell-based therapies, owing to their precise maneuverability and small size. However, current micromanipulation techniques often rely solely on a two-dimensional (2D) microscopic view as sensory feedback, while traditional control interfaces do not prov… ▽ More Magnetic micro-robots have demonstrated immense potential in biomedical applications, such as in vivo drug delivery, non-invasive diagnostics, and cell-based therapies, owing to their precise maneuverability and small size. However, current micromanipulation techniques often rely solely on a two-dimensional (2D) microscopic view as sensory feedback, while traditional control interfaces do not provide an intuitive manner for operators to manipulate micro-robots. These limitations increase the cognitive load on operators, who must interpret limited feedback and translate it into effective control actions. To address these challenges, we propose a Deep Reinforcement Learning-Based Semi-Autonomous Control (DRL-SC) framework for magnetic micro-robot navigation in a simulated microvascular system. Our framework integrates Mixed Reality (MR) to facilitate immersive manipulation of micro-robots, thereby enhancing situational awareness and control precision. Simulation and experimental results demonstrate that our approach significantly improves navigation efficiency, reduces control errors, and enhances the overall robustness of the system in simulated microvascular environments. △ Less

Submitted 8 March, 2025; originally announced March 2025.

Comments: Accepted by ICRA

arXiv:2503.01879 [pdf, ps, other]

Nexus: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision

Authors: Che Liu, Yingji Zhang, Dong Zhang, Weijie Zhang, Chenggong Gong, Haohan Li, Yu Lu, Shilin Zhou, Yue Lu, Ziliang Gan, Ziao Wang, Junwei Liao, Haipang Wu, Ji Liu, André Freitas, Qifan Wang, Zenglin Xu, Rongjuncheng Zhang, Yong Dai

Abstract: This work proposes an industry-level omni-modal large language model (LLM) pipeline that integrates auditory, visual, and linguistic modalities to overcome challenges such as limited tri-modal datasets, high computational costs, and complex feature alignments. Our pipeline consists of three main components: First, a modular framework enabling flexible configuration of various encoder-LLM-decoder a… ▽ More This work proposes an industry-level omni-modal large language model (LLM) pipeline that integrates auditory, visual, and linguistic modalities to overcome challenges such as limited tri-modal datasets, high computational costs, and complex feature alignments. Our pipeline consists of three main components: First, a modular framework enabling flexible configuration of various encoder-LLM-decoder architectures. Second, a lightweight training strategy that pre-trains audio-language alignment on the state-of-the-art vision-language model Qwen2.5-VL, thus avoiding the costly pre-training of vision-specific modalities. Third, an audio synthesis pipeline that generates high-quality audio-text data from diverse real-world scenarios, supporting applications such as Automatic Speech Recognition and Speech-to-Speech chat. To this end, we introduce an industry-level omni-modal LLM, Nexus. Extensive experiments validate the efficacy of our pipeline, yielding the following key findings:(1) In the visual understanding task, Nexus exhibits superior performance compared with its backbone model - Qwen2.5-VL-7B, validating the efficiency of our training strategy. (2) Within the English Spoken Question-Answering task, the model achieves better accuracy than the same-period competitor (i.e, MiniCPM-o2.6-7B) in the LLaMA Q. benchmark. (3) In our real-world ASR testset, Nexus achieves outstanding performance, indicating its robustness in real scenarios. (4) In the Speech-to-Text Translation task, our model outperforms Qwen2-Audio-Instruct-7B. (5) In the Text-to-Speech task, based on pretrained vocoder (e.g., Fishspeech1.4 or CosyVoice2.0), Nexus is comparable to its backbone vocoder on Seed-TTS benchmark. (6) An in-depth analysis of tri-modal alignment reveals that incorporating the audio modality enhances representational alignment between vision and language. △ Less

Submitted 29 May, 2025; v1 submitted 26 February, 2025; originally announced March 2025.

arXiv:2502.17499 [pdf]

Detecting Long QT Syndrome and First-Degree Atrioventricular Block using Single-Lead AI-ECG: A Multi-Center Real-World Study

Authors: Sumei Fan, Deyun Zhang, Yue Wang, Shijia Geng, Kun Lu, Meng Sang, Weilun Xu, Haixue Wang, Qinghao Zhao, Chuandong Cheng, Peng Wang, Shenda Hong

Abstract: Home-based single-lead AI-ECG devices have enabled continuous, real-world cardiac monitoring. However, the accuracy of parameter calculations from single-lead AI-ECG algorithm remains to be fully validated, which is critical for conditions such as Long QT Syndrome (LQTS) and First-Degree Atrioventricular Block (AVBI). In this multicenter study, we assessed FeatureDB, an ECG measurements computatio… ▽ More Home-based single-lead AI-ECG devices have enabled continuous, real-world cardiac monitoring. However, the accuracy of parameter calculations from single-lead AI-ECG algorithm remains to be fully validated, which is critical for conditions such as Long QT Syndrome (LQTS) and First-Degree Atrioventricular Block (AVBI). In this multicenter study, we assessed FeatureDB, an ECG measurements computation algorithm, in the context of single-lead monitoring using three annotated datasets: PTB-XL+ (n=21,354), CSE (n=105), and HeartVoice-ECG-lite (n=369). FeatureDB showed strong correlation with standard ECG machines (12SL and Uni-G) in key measurements (PR, QRS, QT, QTc), and high agreement confirmed by Bland-Altman analysis. In detecting LQTS (AUC=0.786) and AVBI (AUC=0.684), FeatureDB demonstrated diagnostic performance comparable to commercial ECG systems (12SL: 0.859/0.716; Uni-G: 0.817/0.605), significantly outperforming ECGDeli (0.501/0.569). Notably, FeatureDB can operate locally on resource-limited devices, facilitating use in low-connectivity settings. These findings confirm the clinical reliability of FeatureDB for single-lead ECG diagnostics and highlight its potential to bridge traditional ECG diagnostics with wearable technology for scalable cardiovascular monitoring and early intervention. △ Less

Submitted 26 April, 2025; v1 submitted 21 February, 2025; originally announced February 2025.

Comments: 29pages, 11 figures, 8 tables

arXiv:2502.17473 [pdf, other]

Model-Based Learning for DOA Estimation with One-Bit Single-Snapshot Sparse Arrays

Authors: Yunqiao Hu, Shunqiao Sun, Yimin D. Zhang

Abstract: We address the challenging problem of estimating the directions-of-arrival (DOAs) of multiple off-grid signals using a single snapshot of one-bit quantized measurements. Conventional DOA estimation methods face difficulties in tackling this problem effectively. This paper introduces a domain-knowledge-guided learning framework to achieve high-resolution DOA estimation in such a scenario, thus dras… ▽ More We address the challenging problem of estimating the directions-of-arrival (DOAs) of multiple off-grid signals using a single snapshot of one-bit quantized measurements. Conventional DOA estimation methods face difficulties in tackling this problem effectively. This paper introduces a domain-knowledge-guided learning framework to achieve high-resolution DOA estimation in such a scenario, thus drastically reducing hardware complexity without compromising performance. We first reformulate DOA estimation as a maximum a posteriori (MAP) problem, unifying on-grid and off-grid scenarios under a Laplacian-type sparsity prior to effectively enforce sparsity for both uniform and sparse linear arrays. For off-grid signals, a first-order approximation grid model is embedded into the one-bit signal model. We then reinterpret one-bit sensing as a binary classification task, employing a multivariate Bernoulli likelihood with a logistic link function to enhance stability and estimation accuracy. To resolve the non-convexity inherent in the MAP formulation, we develop augmented algorithmic frameworks based on majorization-minimization principles. Further, we design model-based inference neural networks by deep unrolling these frameworks, significantly reducing computational complexity while preserving the estimation precision. Extensive simulations demonstrate the robustness of the proposed framework across a wide range of input signal-to-noise ratio values and off-grid deviations. By integrating the unified model-based priors with data-driven learning, this work bridges the gap between theoretical guarantees and practical feasibility in one-bit single-snapshot DOA estimation, offering a scalable, hardware-efficient solution for next-generation radar and communication systems. △ Less

Submitted 15 February, 2025; originally announced February 2025.

Comments: manuscript submitted to IEEE Journal of Selected Topics in Signal Processing, 13-page, 11 figures

arXiv:2502.17213 [pdf, other]

Deep Learning-Powered Electrical Brain Signals Analysis: Advancing Neurological Diagnostics

Authors: Jiahe Li, Xin Chen, Fanqi Shen, Junru Chen, Yuxin Liu, Daoze Zhang, Zhizhang Yuan, Fang Zhao, Meng Li, Yang Yang

Abstract: Neurological disorders represent significant global health challenges, driving the advancement of brain signal analysis methods. Scalp electroencephalography (EEG) and intracranial electroencephalography (iEEG) are widely used to diagnose and monitor neurological conditions. However, dataset heterogeneity and task variations pose challenges in developing robust deep learning solutions. This review… ▽ More Neurological disorders represent significant global health challenges, driving the advancement of brain signal analysis methods. Scalp electroencephalography (EEG) and intracranial electroencephalography (iEEG) are widely used to diagnose and monitor neurological conditions. However, dataset heterogeneity and task variations pose challenges in developing robust deep learning solutions. This review systematically examines recent advances in deep learning approaches for EEG/iEEG-based neurological diagnostics, focusing on applications across 7 neurological conditions using 46 datasets. We explore trends in data utilization, model design, and task-specific adaptations, highlighting the importance of pre-trained multi-task models for scalable, generalizable solutions. To advance research, we propose a standardized benchmark for evaluating models across diverse datasets to enhance reproducibility. This survey emphasizes how recent innovations can transform neurological diagnostics and enable the development of intelligent, adaptable healthcare solutions. △ Less

Submitted 24 February, 2025; originally announced February 2025.

arXiv:2502.03781 [pdf, ps, other]

Gaze-Assisted Human-Centric Domain Adaptation for Cardiac Ultrasound Image Segmentation

Authors: Ruiyi Li, Yuting He, Rongjun Ge, Chong Wang, Daoqiang Zhang, Yang Chen, Shuo Li

Abstract: Domain adaptation (DA) for cardiac ultrasound image segmentation is clinically significant and valuable. However, previous domain adaptation methods are prone to be affected by the incomplete pseudo-label and low-quality target to source images. Human-centric domain adaptation has great advantages of human cognitive guidance to help model adapt to target domain and reduce reliance on labels. Docto… ▽ More Domain adaptation (DA) for cardiac ultrasound image segmentation is clinically significant and valuable. However, previous domain adaptation methods are prone to be affected by the incomplete pseudo-label and low-quality target to source images. Human-centric domain adaptation has great advantages of human cognitive guidance to help model adapt to target domain and reduce reliance on labels. Doctor gaze trajectories contains a large amount of cross-domain human guidance. To leverage gaze information and human cognition for guiding domain adaptation, we propose gaze-assisted human-centric domain adaptation (GAHCDA), which reliably guides the domain adaptation of cardiac ultrasound images. GAHCDA includes following modules: (1) Gaze Augment Alignment (GAA): GAA enables the model to obtain human cognition general features to recognize segmentation target in different domain of cardiac ultrasound images like humans. (2) Gaze Balance Loss (GBL): GBL fused gaze heatmap with outputs which makes the segmentation result structurally closer to the target domain. The experimental results illustrate that our proposed framework is able to segment cardiac ultrasound images more effectively in the target domain than GAN-based methods and other self-train based methods, showing great potential in clinical application. △ Less

Submitted 6 February, 2025; originally announced February 2025.

arXiv:2502.02984 [pdf, other]

Learning Efficient Flocking Control based on Gibbs Random Fields

Authors: Dengyu Zhang, Chenghao, Feng Xue, Qingrui Zhang

Abstract: Flocking control is essential for multi-robot systems in diverse applications, yet achieving efficient flocking in congested environments poses challenges regarding computation burdens, performance optimality, and motion safety. This paper addresses these challenges through a multi-agent reinforcement learning (MARL) framework built on Gibbs Random Fields (GRFs). With GRFs, a multi-robot system is… ▽ More Flocking control is essential for multi-robot systems in diverse applications, yet achieving efficient flocking in congested environments poses challenges regarding computation burdens, performance optimality, and motion safety. This paper addresses these challenges through a multi-agent reinforcement learning (MARL) framework built on Gibbs Random Fields (GRFs). With GRFs, a multi-robot system is represented by a set of random variables conforming to a joint probability distribution, thus offering a fresh perspective on flocking reward design. A decentralized training and execution mechanism, which enhances the scalability of MARL concerning robot quantity, is realized using a GRF-based credit assignment method. An action attention module is introduced to implicitly anticipate the motion intentions of neighboring robots, consequently mitigating potential non-stationarity issues in MARL. The proposed framework enables learning an efficient distributed control policy for multi-robot systems in challenging environments with success rate around $99\%$, as demonstrated through thorough comparisons with state-of-the-art solutions in simulations and experiments. Ablation studies are also performed to validate the efficiency of different framework modules. △ Less

Submitted 5 February, 2025; originally announced February 2025.

Comments: 9 pages, 10 figures

arXiv:2502.02179 [pdf, other]

Deep Ensemble approach for Enhancing Brain Tumor Segmentation in Resource-Limited Settings

Authors: Jeremiah Fadugba, Isabel Lieberman, Olabode Ajayi, Mansour Osman, Solomon Oluwole Akinola, Tinashe Mustvangwa, Dong Zhang, Udunna C Anazondo, Raymond Confidence

Abstract: Segmentation of brain tumors is a critical step in treatment planning, yet manual segmentation is both time-consuming and subjective, relying heavily on the expertise of radiologists. In Sub-Saharan Africa, this challenge is magnified by overburdened medical systems and limited access to advanced imaging modalities and expert radiologists. Automating brain tumor segmentation using deep learning of… ▽ More Segmentation of brain tumors is a critical step in treatment planning, yet manual segmentation is both time-consuming and subjective, relying heavily on the expertise of radiologists. In Sub-Saharan Africa, this challenge is magnified by overburdened medical systems and limited access to advanced imaging modalities and expert radiologists. Automating brain tumor segmentation using deep learning offers a promising solution. Convolutional Neural Networks (CNNs), especially the U-Net architecture, have shown significant potential. However, a major challenge remains: achieving generalizability across different datasets. This study addresses this gap by developing a deep learning ensemble that integrates UNet3D, V-Net, and MSA-VNet models for the semantic segmentation of gliomas. By initially training on the BraTS-GLI dataset and fine-tuning with the BraTS-SSA dataset, we enhance model performance. Our ensemble approach significantly outperforms individual models, achieving DICE scores of 0.8358 for Tumor Core, 0.8521 for Whole Tumor, and 0.8167 for Enhancing Tumor. These results underscore the potential of ensemble methods in improving the accuracy and reliability of automated brain tumor segmentation, particularly in resource-limited settings. △ Less

Submitted 4 February, 2025; originally announced February 2025.

arXiv:2501.13242 [pdf, other]

Distributed Multiple Testing with False Discovery Rate Control in the Presence of Byzantines

Authors: Daofu Zhang, Mehrdad Pournaderi, Yu Xiang, Pramod Varshney

Abstract: This work studies distributed multiple testing with false discovery rate (FDR) control in the presence of Byzantine attacks, where an adversary captures a fraction of the nodes and corrupts their reported p-values. We focus on two baseline attack models: an oracle model with the full knowledge of which hypotheses are true nulls, and a practical attack model that leverages the Benjamini-Hochberg (B… ▽ More This work studies distributed multiple testing with false discovery rate (FDR) control in the presence of Byzantine attacks, where an adversary captures a fraction of the nodes and corrupts their reported p-values. We focus on two baseline attack models: an oracle model with the full knowledge of which hypotheses are true nulls, and a practical attack model that leverages the Benjamini-Hochberg (BH) procedure locally to classify which p-values follow the true null hypotheses. We provide a thorough characterization of how both attack models affect the global FDR, which in turn motivates counter-attack strategies and stronger attack models. Our extensive simulation studies confirm the theoretical results, highlight key design trade-offs under attacks and countermeasures, and provide insights into more sophisticated attacks. △ Less

Submitted 25 April, 2025; v1 submitted 22 January, 2025; originally announced January 2025.

Comments: Accepted to the 2025 International Symposium on Information Theory (ISIT)

arXiv:2501.07008 [pdf, other]

Advancing Single-Snapshot DOA Estimation with Siamese Neural Networks for Sparse Linear Arrays

Authors: Ruxin Zheng, Shunqiao Sun, Hongshan Liu, Yimin D. Zhang

Abstract: Single-snapshot signal processing in sparse linear arrays has become increasingly vital, particularly in dynamic environments like automotive radar systems, where only limited snapshots are available. These arrays are often utilized either to cut manufacturing costs or result from unintended antenna failures, leading to challenges such as high sidelobe levels and compromised accuracy in direction-… ▽ More Single-snapshot signal processing in sparse linear arrays has become increasingly vital, particularly in dynamic environments like automotive radar systems, where only limited snapshots are available. These arrays are often utilized either to cut manufacturing costs or result from unintended antenna failures, leading to challenges such as high sidelobe levels and compromised accuracy in direction-of-arrival (DOA) estimation. Despite deep learning's success in tasks such as DOA estimation, the need for extensive training data to increase target numbers or improve angular resolution poses significant challenges. In response, this paper presents a novel Siamese neural network (SNN) featuring a sparse augmentation layer, which enhances signal feature embedding and DOA estimation accuracy in sparse arrays. We demonstrate the enhanced DOA estimation performance of our approach through detailed feature analysis and performance evaluation. The code for this study is available at https://github.com/ruxinzh/SNNS_SLA. △ Less

Submitted 12 January, 2025; originally announced January 2025.

Comments: Paper accepted by ICASSP 2025

arXiv:2501.04734 [pdf, other]

Generative Style Transfer for MRI Image Segmentation: A Case of Glioma Segmentation in Sub-Saharan Africa

Authors: Rancy Chepchirchir, Jill Sunday, Raymond Confidence, Dong Zhang, Talha Chaudhry, Udunna C. Anazodo, Kendi Muchungi, Yujing Zou

Abstract: In Sub-Saharan Africa (SSA), the utilization of lower-quality Magnetic Resonance Imaging (MRI) technology raises questions about the applicability of machine learning methods for clinical tasks. This study aims to provide a robust deep learning-based brain tumor segmentation (BraTS) method tailored for the SSA population using a threefold approach. Firstly, the impact of domain shift from the SSA… ▽ More In Sub-Saharan Africa (SSA), the utilization of lower-quality Magnetic Resonance Imaging (MRI) technology raises questions about the applicability of machine learning methods for clinical tasks. This study aims to provide a robust deep learning-based brain tumor segmentation (BraTS) method tailored for the SSA population using a threefold approach. Firstly, the impact of domain shift from the SSA training data on model efficacy was examined, revealing no significant effect. Secondly, a comparative analysis of 3D and 2D full-resolution models using the nnU-Net framework indicates similar performance of both the models trained for 300 epochs achieving a five-fold cross-validation score of 0.93. Lastly, addressing the performance gap observed in SSA validation as opposed to the relatively larger BraTS glioma (GLI) validation set, two strategies are proposed: fine-tuning SSA cases using the GLI+SSA best-pretrained 2D fullres model at 300 epochs, and introducing a novel neural style transfer-based data augmentation technique for the SSA cases. This investigation underscores the potential of enhancing brain tumor prediction within SSA's unique healthcare landscape. △ Less

Submitted 7 January, 2025; originally announced January 2025.

arXiv:2501.02303 [pdf, other]

Design and Benchmarking of A Multi-Modality Sensor for Robotic Manipulation with GAN-Based Cross-Modality Interpretation

Authors: Dandan Zhang, Wen Fan, Jialin Lin, Haoran Li, Qingzheng Cong, Weiru Liu, Nathan F. Lepora, Shan Luo

Abstract: In this paper, we present the design and benchmark of an innovative sensor, ViTacTip, which fulfills the demand for advanced multi-modal sensing in a compact design. A notable feature of ViTacTip is its transparent skin, which incorporates a `see-through-skin' mechanism. This mechanism aims at capturing detailed object features upon contact, significantly improving both vision-based and proximity… ▽ More In this paper, we present the design and benchmark of an innovative sensor, ViTacTip, which fulfills the demand for advanced multi-modal sensing in a compact design. A notable feature of ViTacTip is its transparent skin, which incorporates a `see-through-skin' mechanism. This mechanism aims at capturing detailed object features upon contact, significantly improving both vision-based and proximity perception capabilities. In parallel, the biomimetic tips embedded in the sensor's skin are designed to amplify contact details, thus substantially augmenting tactile and derived force perception abilities. To demonstrate the multi-modal capabilities of ViTacTip, we developed a multi-task learning model that enables simultaneous recognition of hardness, material, and textures. To assess the functionality and validate the versatility of ViTacTip, we conducted extensive benchmarking experiments, including object recognition, contact point detection, pose regression, and grating identification. To facilitate seamless switching between various sensing modalities, we employed a Generative Adversarial Network (GAN)-based approach. This method enhances the applicability of the ViTacTip sensor across diverse environments by enabling cross-modality interpretation. △ Less

Submitted 4 January, 2025; originally announced January 2025.

Comments: Accepted by IEEE Transactions on Robotics

arXiv:2412.19200 [pdf, other]

Personalized Dynamic Music Emotion Recognition with Dual-Scale Attention-Based Meta-Learning

Authors: Dengming Zhang, Weitao You, Ziheng Liu, Lingyun Sun, Pei Chen

Abstract: Dynamic Music Emotion Recognition (DMER) aims to predict the emotion of different moments in music, playing a crucial role in music information retrieval. The existing DMER methods struggle to capture long-term dependencies when dealing with sequence data, which limits their performance. Furthermore, these methods often overlook the influence of individual differences on emotion perception, even t… ▽ More Dynamic Music Emotion Recognition (DMER) aims to predict the emotion of different moments in music, playing a crucial role in music information retrieval. The existing DMER methods struggle to capture long-term dependencies when dealing with sequence data, which limits their performance. Furthermore, these methods often overlook the influence of individual differences on emotion perception, even though everyone has their own personalized emotional perception in the real world. Motivated by these issues, we explore more effective sequence processing methods and introduce the Personalized DMER (PDMER) problem, which requires models to predict emotions that align with personalized perception. Specifically, we propose a Dual-Scale Attention-Based Meta-Learning (DSAML) method. This method fuses features from a dual-scale feature extractor and captures both short and long-term dependencies using a dual-scale attention transformer, improving the performance in traditional DMER. To achieve PDMER, we design a novel task construction strategy that divides tasks by annotators. Samples in a task are annotated by the same annotator, ensuring consistent perception. Leveraging this strategy alongside meta-learning, DSAML can predict personalized perception of emotions with just one personalized annotation sample. Our objective and subjective experiments demonstrate that our method can achieve state-of-the-art performance in both traditional DMER and PDMER. △ Less

Submitted 26 December, 2024; originally announced December 2024.

Comments: Accepted by the 39th AAAI Conference on Artificial Intelligence (AAAI-25)

arXiv:2412.14100 [pdf, other]

Parameter-efficient Fine-tuning for improved Convolutional Baseline for Brain Tumor Segmentation in Sub-Saharan Africa Adult Glioma Dataset

Authors: Bijay Adhikari, Pratibha Kulung, Jakesh Bohaju, Laxmi Kanta Poudel, Confidence Raymond, Dong Zhang, Udunna C Anazodo, Bishesh Khanal, Mahesh Shakya

Abstract: Automating brain tumor segmentation using deep learning methods is an ongoing challenge in medical imaging. Multiple lingering issues exist including domain-shift and applications in low-resource settings which brings a unique set of challenges including scarcity of data. As a step towards solving these specific problems, we propose Convolutional adapter-inspired Parameter-efficient Fine-tuning (P… ▽ More Automating brain tumor segmentation using deep learning methods is an ongoing challenge in medical imaging. Multiple lingering issues exist including domain-shift and applications in low-resource settings which brings a unique set of challenges including scarcity of data. As a step towards solving these specific problems, we propose Convolutional adapter-inspired Parameter-efficient Fine-tuning (PEFT) of MedNeXt architecture. To validate our idea, we show our method performs comparable to full fine-tuning with the added benefit of reduced training compute using BraTS-2021 as pre-training dataset and BraTS-Africa as the fine-tuning dataset. BraTS-Africa consists of a small dataset (60 train / 35 validation) from the Sub-Saharan African population with marked shift in the MRI quality compared to BraTS-2021 (1251 train samples). We first show that models trained on BraTS-2021 dataset do not generalize well to BraTS-Africa as shown by 20% reduction in mean dice on BraTS-Africa validation samples. Then, we show that PEFT can leverage both the BraTS-2021 and BraTS-Africa dataset to obtain mean dice of 0.8 compared to 0.72 when trained only on BraTS-Africa. Finally, We show that PEFT (0.80 mean dice) results in comparable performance to full fine-tuning (0.77 mean dice) which may show PEFT to be better on average but the boxplots show that full finetuning results is much lesser variance in performance. Nevertheless, on disaggregation of the dice metrics, we find that the model has tendency to oversegment as shown by high specificity (0.99) compared to relatively low sensitivity(0.75). The source code is available at https://github.com/CAMERA-MRI/SPARK2024/tree/main/PEFT_MedNeXt △ Less

Submitted 18 December, 2024; originally announced December 2024.

Comments: Accepted to "The International Brain Tumor Segmentation (BraTS) challenge organized at MICCAI 2024 conference"

arXiv:2412.10976 [pdf, other]

Enhancing Off-Grid One-Bit DOA Estimation with Learning-Based Sparse Bayesian Approach for Non-Uniform Sparse Array

Authors: Yunqiao Hu, Shunqiao Sun, Yimin D. Zhang

Abstract: This paper tackles the challenge of one-bit off-grid direction of arrival (DOA) estimation in a single snapshot scenario based on a learning-based Bayesian approach. Firstly, we formulate the off-grid DOA estimation model, utilizing the first-order off-grid approximation, incorporating one-bit data quantization. Subsequently, we address this problem using the Sparse Bayesian based framework and so… ▽ More This paper tackles the challenge of one-bit off-grid direction of arrival (DOA) estimation in a single snapshot scenario based on a learning-based Bayesian approach. Firstly, we formulate the off-grid DOA estimation model, utilizing the first-order off-grid approximation, incorporating one-bit data quantization. Subsequently, we address this problem using the Sparse Bayesian based framework and solve iteratively. However, traditional Sparse Bayesian methods often face challenges such as high computational complexity and the need for extensive hyperparameter tuning. To balance estimation accuracy and computational efficiency, we propose a novel Learning-based Sparse Bayesian framework, which leverages an unrolled neural network architecture. This framework autonomously learns hyperparameters through supervised learning, offering more accurate off-grid DOA estimates and improved computational efficiency compared to some state-of-the-art methods. Furthermore, the proposed approach is applicable to both uniform linear arrays and non-uniform sparse arrays. Simulation results validate the effectiveness of the proposed framework. △ Less

Submitted 14 December, 2024; originally announced December 2024.

Comments: Proc. 58th Annual Asilomar Conference on Signals, Systems, and Computers (Asilomar), Pacific Grove, CA, Oct. 27 - Oct. 30, 2024

arXiv:2412.03749 [pdf]

Electrically functionalized body surface for deep-tissue bioelectrical recording

Authors: Dehui Zhang, Yucheng Zhang, Dong Xu, Shaolei Wang, Kaidong Wang, Boxuan Zhou, Yansong Ling, Yang Liu, Qingyu Cui, Junyi Yin, Enbo Zhu, Xun Zhao, Chengzhang Wan, Jun Chen, Tzung K. Hsiai, Yu Huang, Xiangfeng Duan

Abstract: Directly probing deep tissue activities from body surfaces offers a noninvasive approach to monitoring essential physiological processes1-3. However, this method is technically challenged by rapid signal attenuation toward the body surface and confounding motion artifacts4-6 primarily due to excessive contact impedance and mechanical mismatch with conventional electrodes. Herein, by formulating an… ▽ More Directly probing deep tissue activities from body surfaces offers a noninvasive approach to monitoring essential physiological processes1-3. However, this method is technically challenged by rapid signal attenuation toward the body surface and confounding motion artifacts4-6 primarily due to excessive contact impedance and mechanical mismatch with conventional electrodes. Herein, by formulating and directly spray coating biocompatible two-dimensional nanosheet ink onto the human body under ambient conditions, we create microscopically conformal and adaptive van der Waals thin films (VDWTFs) that seamlessly merge with non-Euclidean, hairy, and dynamically evolving body surfaces. Unlike traditional deposition methods, which often struggle with conformality and adaptability while retaining high electronic performance, this gentle process enables the formation of high-performance VDWTFs directly on the body surface under bio-friendly conditions, making it ideal for biological applications. This results in low-impedance electrically functionalized body surfaces (EFBS), enabling highly robust monitoring of biopotential and bioimpedance modulations associated with deep-tissue activities, such as blood circulation, muscle movements, and brain activities. Compared to commercial solutions, our VDWTF-EFBS exhibits nearly two-orders of magnitude lower contact impedance and substantially reduces the extrinsic motion artifacts, enabling reliable extraction of bioelectrical signals from irregular surfaces, such as unshaved human scalps. This advancement defines a technology for continuous, noninvasive monitoring of deep-tissue activities during routine body movements. △ Less

Submitted 4 December, 2024; originally announced December 2024.

arXiv:2411.09177 [pdf, other]

Enhancing reinforcement learning for population setpoint tracking in co-cultures

Authors: Sebastián Espinel-Ríos, Joyce Qiaoxi Mo, Dongda Zhang, Ehecatl Antonio del Rio-Chanona, José L. Avalos

Abstract: Efficient multiple setpoint tracking can enable advanced biotechnological applications, such as maintaining desired population levels in co-cultures for optimal metabolic division of labor. In this study, we employ reinforcement learning as a control method for population setpoint tracking in co-cultures, focusing on policy-gradient techniques where the control policy is parameterized by neural ne… ▽ More Efficient multiple setpoint tracking can enable advanced biotechnological applications, such as maintaining desired population levels in co-cultures for optimal metabolic division of labor. In this study, we employ reinforcement learning as a control method for population setpoint tracking in co-cultures, focusing on policy-gradient techniques where the control policy is parameterized by neural networks. However, achieving accurate tracking across multiple setpoints is a significant challenge in reinforcement learning, as the agent must effectively balance the contributions of various setpoints to maximize the expected system performance. Traditional return functions, such as those based on a quadratic cost, often yield suboptimal performance due to their inability to efficiently guide the agent toward the simultaneous satisfaction of all setpoints. To overcome this, we propose a novel return function that rewards the simultaneous satisfaction of multiple setpoints and diminishes overall reward gains otherwise, accounting for both stage and terminal system performance. This return function includes parameters to fine-tune the desired smoothness and steepness of the learning process. We demonstrate our approach considering an $\textit{Escherichia coli}$ co-culture in a chemostat with optogenetic control over amino acid synthesis pathways, leveraging auxotrophies to modulate growth. △ Less

Submitted 14 March, 2025; v1 submitted 13 November, 2024; originally announced November 2024.

arXiv:2411.04568 [pdf, other]

Dynamic-Attention-based EEG State Transition Modeling for Emotion Recognition

Authors: Xinke Shen, Runmin Gan, Kaixuan Wang, Shuyi Yang, Qingzhu Zhang, Quanying Liu, Dan Zhang, Sen Song

Abstract: Electroencephalogram (EEG)-based emotion decoding can objectively quantify people's emotional state and has broad application prospects in human-computer interaction and early detection of emotional disorders. Recently emerging deep learning architectures have significantly improved the performance of EEG emotion decoding. However, existing methods still fall short of fully capturing the complex s… ▽ More Electroencephalogram (EEG)-based emotion decoding can objectively quantify people's emotional state and has broad application prospects in human-computer interaction and early detection of emotional disorders. Recently emerging deep learning architectures have significantly improved the performance of EEG emotion decoding. However, existing methods still fall short of fully capturing the complex spatiotemporal dynamics of neural signals, which are crucial for representing emotion processing. This study proposes a Dynamic-Attention-based EEG State Transition (DAEST) modeling method to characterize EEG spatiotemporal dynamics. The model extracts spatiotemporal components of EEG that represent multiple parallel neural processes and estimates dynamic attention weights on these components to capture transitions in brain states. The model is optimized within a contrastive learning framework for cross-subject emotion recognition. The proposed method achieved state-of-the-art performance on three publicly available datasets: FACED, SEED, and SEED-V. It achieved 75.4% accuracy in the binary classification of positive and negative emotions and 59.3% in nine-class discrete emotion classification on the FACED dataset, 88.1% in the three-class classification of positive, negative, and neutral emotions on the SEED dataset, and 73.6% in five-class discrete emotion classification on the SEED-V dataset. The learned EEG spatiotemporal patterns and dynamic transition properties offer valuable insights into neural dynamics underlying emotion processing. △ Less

Submitted 7 November, 2024; originally announced November 2024.

Comments: 14 pages, 6 figures

arXiv:2411.02888 [pdf, other]

A Symmetric Dynamic Learning Framework for Diffeomorphic Medical Image Registration

Authors: Jinqiu Deng, Ke Chen, Mingke Li, Daoping Zhang, Chong Chen, Alejandro F. Frangi, Jianping Zhang

Abstract: Diffeomorphic image registration is crucial for various medical imaging applications because it can preserve the topology of the transformation. This study introduces DCCNN-LSTM-Reg, a learning framework that evolves dynamically and learns a symmetrical registration path by satisfying a specified control increment system. This framework aims to obtain symmetric diffeomorphic deformations between m… ▽ More Diffeomorphic image registration is crucial for various medical imaging applications because it can preserve the topology of the transformation. This study introduces DCCNN-LSTM-Reg, a learning framework that evolves dynamically and learns a symmetrical registration path by satisfying a specified control increment system. This framework aims to obtain symmetric diffeomorphic deformations between moving and fixed images. To achieve this, we combine deep learning networks with diffeomorphic mathematical mechanisms to create a continuous and dynamic registration architecture, which consists of multiple Symmetric Registration (SR) modules cascaded on five different scales. Specifically, our method first uses two U-nets with shared parameters to extract multiscale feature pyramids from the images. We then develop an SR-module comprising a sequential CNN-LSTM architecture to progressively correct the forward and reverse multiscale deformation fields using control increment learning and the homotopy continuation technique. Through extensive experiments on three 3D registration tasks, we demonstrate that our method outperforms existing approaches in both quantitative and qualitative evaluations. △ Less

Submitted 5 November, 2024; originally announced November 2024.

Comments: 12 pages,7 figures

arXiv:2410.17709 [pdf, other]

Deoxys: A Causal Inference Engine for Unhealthy Node Mitigation in Large-scale Cloud Infrastructure

Authors: Chaoyun Zhang, Randolph Yao, Si Qin, Ze Li, Shekhar Agrawal, Binit R. Mishra, Tri Tran, Minghua Ma, Qingwei Lin, Murali Chintalapati, Dongmei Zhang

Abstract: The presence of unhealthy nodes in cloud infrastructure signals the potential failure of machines, which can significantly impact the availability and reliability of cloud services, resulting in negative customer experiences. Effectively addressing unhealthy node mitigation is therefore vital for sustaining cloud system performance. This paper introduces Deoxys, a causal inference engine tailored… ▽ More The presence of unhealthy nodes in cloud infrastructure signals the potential failure of machines, which can significantly impact the availability and reliability of cloud services, resulting in negative customer experiences. Effectively addressing unhealthy node mitigation is therefore vital for sustaining cloud system performance. This paper introduces Deoxys, a causal inference engine tailored to recommending mitigation actions for unhealthy node in cloud systems to minimize virtual machine downtime and interruptions during unhealthy events. It employs double machine learning combined with causal forest to produce precise and reliable mitigation recommendations based solely on limited observational data collected from the historical unhealthy events. To enhance the causal inference model, Deoxys further incorporates a policy fallback mechanism based on model uncertainty and action overriding mechanisms to (i) improve the reliability of the system, and (ii) strike a good tradeoff between downtime reduction and resource utilization, thereby enhancing the overall system performance. After deploying Deoxys in a large-scale cloud infrastructure at Microsoft, our observations demonstrate that Deoxys significantly reduces average VM downtime by 53% compared to a legacy policy, while leading to 49.5% lower VM interruption rate. This substantial improvement enhances the reliability and stability of cloud platforms, resulting in a seamless customer experience. △ Less

Submitted 23 October, 2024; originally announced October 2024.

arXiv:2409.15897 [pdf, ps, other]

ESPnet-Codec: Comprehensive Training and Evaluation of Neural Codecs for Audio, Music, and Speech

Authors: Jiatong Shi, Jinchuan Tian, Yihan Wu, Jee-weon Jung, Jia Qi Yip, Yoshiki Masuyama, William Chen, Yuning Wu, Yuxun Tang, Massa Baali, Dareen Alharhi, Dong Zhang, Ruifan Deng, Tejes Srivastava, Haibin Wu, Alexander H. Liu, Bhiksha Raj, Qin Jin, Ruihua Song, Shinji Watanabe

Abstract: Neural codecs have become crucial to recent speech and audio generation research. In addition to signal compression capabilities, discrete codecs have also been found to enhance downstream training efficiency and compatibility with autoregressive language models. However, as extensive downstream applications are investigated, challenges have arisen in ensuring fair comparisons across diverse appli… ▽ More Neural codecs have become crucial to recent speech and audio generation research. In addition to signal compression capabilities, discrete codecs have also been found to enhance downstream training efficiency and compatibility with autoregressive language models. However, as extensive downstream applications are investigated, challenges have arisen in ensuring fair comparisons across diverse applications. To address these issues, we present a new open-source platform ESPnet-Codec, which is built on ESPnet and focuses on neural codec training and evaluation. ESPnet-Codec offers various recipes in audio, music, and speech for training and evaluation using several widely adopted codec models. Together with ESPnet-Codec, we present VERSA, a standalone evaluation toolkit, which provides a comprehensive evaluation of codec performance over 20 audio evaluation metrics. Notably, we demonstrate that ESPnet-Codec can be integrated into six ESPnet tasks, supporting diverse applications. △ Less

Submitted 24 February, 2025; v1 submitted 24 September, 2024; originally announced September 2024.

Comments: Accepted by SLT

arXiv:2409.10282 [pdf, other]

Matrix Completion and Decomposition in Phase Bounded Cones

Authors: Ding Zhang, Axel Ringh, Li Qiu

Abstract: The problem of matrix completion and decomposition in the cone of positive semidefinite (PSD) matrices is a well-understood problem, with many important applications in areas such as linear algebra, optimization, and control theory. This paper considers the completion and decomposition problems in a broader class of cones, namely phase-bounded cones. We show that most of the main results from the… ▽ More The problem of matrix completion and decomposition in the cone of positive semidefinite (PSD) matrices is a well-understood problem, with many important applications in areas such as linear algebra, optimization, and control theory. This paper considers the completion and decomposition problems in a broader class of cones, namely phase-bounded cones. We show that most of the main results from the PSD case carry over to the phase-bounded case. More precisely, this is done by first unveiling a duality between the completion and decomposition problems, using a dual cone interpretation. Based on this, we then derive necessary and sufficient conditions for the phase-bounded completion and decomposition problems, and also characterize all phase-bounded completions of a completable partial matrix with a banded pattern. △ Less

Submitted 16 September, 2024; originally announced September 2024.

arXiv:2409.07040 [pdf, other]

Retinex-RAWMamba: Bridging Demosaicing and Denoising for Low-Light RAW Image Enhancement

Authors: Xianmin Chen, Peiliang Huang, Xiaoxu Feng, Dingwen Zhang, Longfei Han, Junwei Han

Abstract: Low-light image enhancement, particularly in cross-domain tasks such as mapping from the raw domain to the sRGB domain, remains a significant challenge. Many deep learning-based methods have been developed to address this issue and have shown promising results in recent years. However, single-stage methods, which attempt to unify the complex mapping across both domains, leading to limited denoisin… ▽ More Low-light image enhancement, particularly in cross-domain tasks such as mapping from the raw domain to the sRGB domain, remains a significant challenge. Many deep learning-based methods have been developed to address this issue and have shown promising results in recent years. However, single-stage methods, which attempt to unify the complex mapping across both domains, leading to limited denoising performance. In contrast, two-stage approaches typically decompose a raw image with color filter arrays (CFA) into a four-channel RGGB format before feeding it into a neural network. However, this strategy overlooks the critical role of demosaicing within the Image Signal Processing (ISP) pipeline, leading to color distortions under varying lighting conditions, especially in low-light scenarios. To address these issues, we design a novel Mamba scanning mechanism, called RAWMamba, to effectively handle raw images with different CFAs. Furthermore, we present a Retinex Decomposition Module (RDM) grounded in Retinex prior, which decouples illumination from reflectance to facilitate more effective denoising and automatic non-linear exposure correction. By bridging demosaicing and denoising, better raw image enhancement is achieved. Experimental evaluations conducted on public datasets SID and MCR demonstrate that our proposed RAWMamba achieves state-of-the-art performance on cross-domain mapping. △ Less

Submitted 31 December, 2024; v1 submitted 11 September, 2024; originally announced September 2024.

arXiv:2409.01544 [pdf, other]

Learning Task-Specific Sampling Strategy for Sparse-View CT Reconstruction

Authors: Liutao Yang, Jiahao Huang, Yingying Fang, Angelica I Aviles-Rivero, Carola-Bibiane Schonlieb, Daoqiang Zhang, Guang Yang

Abstract: Sparse-View Computed Tomography (SVCT) offers low-dose and fast imaging but suffers from severe artifacts. Optimizing the sampling strategy is an essential approach to improving the imaging quality of SVCT. However, current methods typically optimize a universal sampling strategy for all types of scans, overlooking the fact that the optimal strategy may vary depending on the specific scanning task… ▽ More Sparse-View Computed Tomography (SVCT) offers low-dose and fast imaging but suffers from severe artifacts. Optimizing the sampling strategy is an essential approach to improving the imaging quality of SVCT. However, current methods typically optimize a universal sampling strategy for all types of scans, overlooking the fact that the optimal strategy may vary depending on the specific scanning task, whether it involves particular body scans (e.g., chest CT scans) or downstream clinical applications (e.g., disease diagnosis). The optimal strategy for one scanning task may not perform as well when applied to other tasks. To address this problem, we propose a deep learning framework that learns task-specific sampling strategies with a multi-task approach to train a unified reconstruction network while tailoring optimal sampling strategies for each individual task. Thus, a task-specific sampling strategy can be applied for each type of scans to improve the quality of SVCT imaging and further assist in performance of downstream clinical usage. Extensive experiments across different scanning types provide validation for the effectiveness of task-specific sampling strategies in enhancing imaging quality. Experiments involving downstream tasks verify the clinical value of learned sampling strategies, as evidenced by notable improvements in downstream task performance. Furthermore, the utilization of a multi-task framework with a shared reconstruction network facilitates deployment on current imaging devices with switchable task-specific modules, and allows for easily integrate new tasks without retraining the entire model. △ Less

Submitted 2 September, 2024; originally announced September 2024.

Showing 1–50 of 250 results for author: Zhang, D