Search | arXiv e-print repository

Towards AI-Native RAN: An Operator's Perspective of 6G Day 1 Standardization

Authors: Nan Li, Qi Sun, Lehan Wang, Xiaofei Xu, Jinri Huang, Chunhui Liu, Jing Gao, Yuhong Huang, Chih-Lin I

Abstract: Artificial Intelligence/Machine Learning (AI/ML) has become the most certain and prominent feature of 6G mobile networks. Unlike 5G, where AI/ML was not natively integrated but rather an add-on feature over existing architecture, 6G shall incorporate AI from the onset to address its complexity and support ubiquitous AI applications. Based on our extensive mobile network operation and standardizati… ▽ More Artificial Intelligence/Machine Learning (AI/ML) has become the most certain and prominent feature of 6G mobile networks. Unlike 5G, where AI/ML was not natively integrated but rather an add-on feature over existing architecture, 6G shall incorporate AI from the onset to address its complexity and support ubiquitous AI applications. Based on our extensive mobile network operation and standardization experience from 2G to 5G, this paper explores the design and standardization principles of AI-Native radio access networks (RAN) for 6G, with a particular focus on its critical Day 1 architecture, functionalities and capabilities. We investigate the framework of AI-Native RAN and present its three essential capabilities to shed some light on the standardization direction; namely, AI-driven RAN processing/optimization/automation, reliable AI lifecycle management (LCM), and AI-as-a-Service (AIaaS) provisioning. The standardization of AI-Native RAN, in particular the Day 1 features, including an AI-Native 6G RAN architecture, were proposed. For validation, a large-scale field trial with over 5000 5G-A base stations have been built and delivered significant improvements in average air interface latency, root cause identification, and network energy consumption with the proposed architecture and the supporting AI functions. This paper aims to provide a Day 1 framework for 6G AI-Native RAN standardization design, balancing technical innovation with practical deployment. △ Less

Submitted 11 July, 2025; originally announced July 2025.

arXiv:2507.04821 [pdf, ps, other]

Force-IMU Fusion-Based Sensing Acupuncture Needle and Quantitative Analysis System for Acupuncture Manipulations

Authors: Peng Tian, Kang Yu, Tianyun Jiang, Yuqi Wang, Haiying Zhang, Hao Yang, Yunfeng Wang, Jun Zhang, Shuo Gao, Junhong Gao

Abstract: Acupuncture, one of the key therapeutic methods in Traditional Chinese Medicine (TCM), has been widely adopted in various clinical fields. Quantitative research on acupuncture manipulation parameters is critical to achieve standardized techniques. However, quantitative mechanical detection of acupuncture parameters remains limited. This study establishes a kinematic and dynamic model of acupunctur… ▽ More Acupuncture, one of the key therapeutic methods in Traditional Chinese Medicine (TCM), has been widely adopted in various clinical fields. Quantitative research on acupuncture manipulation parameters is critical to achieve standardized techniques. However, quantitative mechanical detection of acupuncture parameters remains limited. This study establishes a kinematic and dynamic model of acupuncture, identifying key parameters such as lifting-thrusting force, acceleration, velocity, displacement, as well as twirling-rotating angular velocity and angle. To measure these critical parameters, we propose a quantitative system comprising a sensing needle equipped with a force sensor and an inertial measurement unit (IMU), as well as an external camera module to capture image information. By fusing visual and IMU data, we accurately identify the stationary or motion states of the needle, enabling segmented computation of lifting-thrusting velocity and displacement. The experimental results demonstrate that the sensing needle achieves comprehensive detection with high precision, featuring a nonlinearity error of 0.45% in force measurement and an RMSE of 1.2 mm in displacement. The extracted parameters provide an objective description of the operational characteristics and motion patterns of the four basic acupuncture manipulations. These findings provide valuable tools and methods for research in acupuncture standardization. △ Less

Submitted 7 July, 2025; originally announced July 2025.

Comments: This work has been submitted to the IEEE for possible publication

arXiv:2506.21613 [pdf, ps, other]

ChildGuard: A Specialized Dataset for Combatting Child-Targeted Hate Speech

Authors: Gautam Siddharth Kashyap, Mohammad Anas Azeez, Rafiq Ali, Zohaib Hasan Siddiqui, Jiechao Gao, Usman Naseem

Abstract: The increasing prevalence of child-targeted hate speech online underscores the urgent need for specialized datasets to address this critical issue. Existing hate speech datasets lack agespecific annotations, fail to capture nuanced contexts, and overlook the unique emotional impact on children. To bridge this gap, we introduce ChildGuard1, a curated dataset derived from existing corpora and enrich… ▽ More The increasing prevalence of child-targeted hate speech online underscores the urgent need for specialized datasets to address this critical issue. Existing hate speech datasets lack agespecific annotations, fail to capture nuanced contexts, and overlook the unique emotional impact on children. To bridge this gap, we introduce ChildGuard1, a curated dataset derived from existing corpora and enriched with child-specific annotations. ChildGuard captures diverse contexts of child-targeted hate speech, spanning age groups. We benchmark existing state-of-the-art hate speech detection methods, including Large Language Models (LLMs), and assess their effectiveness in detecting and contextualizing child-targeted hate speech. To foster further research in this area, we publicly release ChildGuard, providing a robust foundation for developing improved methods to detect and mitigate such harm. △ Less

Submitted 21 June, 2025; originally announced June 2025.

arXiv:2506.20762 [pdf, ps, other]

Drift-Adaptive Slicing-Based Resource Management for Cooperative ISAC Networks

Authors: Shisheng Hu, Jie Gao, Xue Qin, Conghao Zhou, Xinyu Huang, Mushu Li, Mingcheng He, Xuemin Shen

Abstract: In this paper, we propose a novel drift-adaptive slicing-based resource management scheme for cooperative integrated sensing and communication (ISAC) networks. Particularly, we establish two network slices to provide sensing and communication services, respectively. In the large-timescale planning for the slices, we partition the sensing region of interest (RoI) of each mobile device and reserve n… ▽ More In this paper, we propose a novel drift-adaptive slicing-based resource management scheme for cooperative integrated sensing and communication (ISAC) networks. Particularly, we establish two network slices to provide sensing and communication services, respectively. In the large-timescale planning for the slices, we partition the sensing region of interest (RoI) of each mobile device and reserve network resources accordingly, facilitating low-complexity distance-based sensing target assignment in small timescales. To cope with the non-stationary spatial distributions of mobile devices and sensing targets, which can result in the drift in modeling the distributions and ineffective planning decisions, we construct digital twins (DTs) of the slices. In each DT, a drift-adaptive statistical model and an emulation function are developed for the spatial distributions in the corresponding slice, which facilitates closed-form decision-making and efficient validation of a planning decision, respectively. Numerical results show that the proposed drift-adaptive slicing-based resource management scheme can increase the service satisfaction ratio by up to 18% and reduce resource consumption by up to 13.1% when compared with benchmark schemes. △ Less

Submitted 25 June, 2025; originally announced June 2025.

Comments: Accepted by IEEE Transactions on Cognitive Communications and Networking

arXiv:2506.12314 [pdf, ps, other]

Explosive Output to Enhance Jumping Ability: A Variable Reduction Ratio Design Paradigm for Humanoid Robots Knee Joint

Authors: Xiaoshuai Ma, Haoxiang Qi, Qingqing Li, Haochen Xu, Xuechao Chen, Junyao Gao, Zhangguo Yu, Qiang Huang

Abstract: Enhancing the explosive power output of the knee joints is critical for improving the agility and obstacle-crossing capabilities of humanoid robots. However, a mismatch between the knee-to-center-of-mass (CoM) transmission ratio and jumping demands, coupled with motor performance degradation at high speeds, restricts the duration of high-power output and limits jump performance. To address these p… ▽ More Enhancing the explosive power output of the knee joints is critical for improving the agility and obstacle-crossing capabilities of humanoid robots. However, a mismatch between the knee-to-center-of-mass (CoM) transmission ratio and jumping demands, coupled with motor performance degradation at high speeds, restricts the duration of high-power output and limits jump performance. To address these problems, this paper introduces a novel knee joint design paradigm employing a dynamically decreasing reduction ratio for explosive output during jump. Analysis of motor output characteristics and knee kinematics during jumping inspired a coupling strategy in which the reduction ratio gradually decreases as the joint extends. A high initial ratio rapidly increases torque at jump initiation, while its gradual reduction minimizes motor speed increments and power losses, thereby maintaining sustained high-power output. A compact and efficient linear actuator-driven guide-rod mechanism realizes this coupling strategy, supported by parameter optimization guided by explosive jump control strategies. Experimental validation demonstrated a 63 cm vertical jump on a single-joint platform (a theoretical improvement of 28.1\% over the optimal fixed-ratio joints). Integrated into a humanoid robot, the proposed design enabled a 1.1 m long jump, a 0.5 m vertical jump, and a 0.5 m box jump. △ Less

Submitted 13 June, 2025; originally announced June 2025.

arXiv:2505.05088 [pdf, other]

SSH-Net: A Self-Supervised and Hybrid Network for Noisy Image Watermark Removal

Authors: Wenyang Liu, Jianjun Gao, Kim-Hui Yap

Abstract: Visible watermark removal is challenging due to its inherent complexities and the noise carried within images. Existing methods primarily rely on supervised learning approaches that require paired datasets of watermarked and watermark-free images, which are often impractical to obtain in real-world scenarios. To address this challenge, we propose SSH-Net, a Self-Supervised and Hybrid Network speci… ▽ More Visible watermark removal is challenging due to its inherent complexities and the noise carried within images. Existing methods primarily rely on supervised learning approaches that require paired datasets of watermarked and watermark-free images, which are often impractical to obtain in real-world scenarios. To address this challenge, we propose SSH-Net, a Self-Supervised and Hybrid Network specifically designed for noisy image watermark removal. SSH-Net synthesizes reference watermark-free images using the watermark distribution in a self-supervised manner and adopts a dual-network design to address the task. The upper network, focused on the simpler task of noise removal, employs a lightweight CNN-based architecture, while the lower network, designed to handle the more complex task of simultaneously removing watermarks and noise, incorporates Transformer blocks to model long-range dependencies and capture intricate image features. To enhance the model's effectiveness, a shared CNN-based feature encoder is introduced before dual networks to extract common features that both networks can leverage. Our code will be available at https://github.com/wenyang001/SSH-Net. △ Less

Submitted 8 May, 2025; originally announced May 2025.

Comments: Under Review in JVCI

arXiv:2504.04829 [pdf, other]

Attentional Graph Meta-Learning for Indoor Localization Using Extremely Sparse Fingerprints

Authors: Wenzhong Yan, Feng Yin, Jun Gao, Ao Wang, Yang Tian, Ruizhi Chen

Abstract: Fingerprint-based indoor localization is often labor-intensive due to the need for dense grids and repeated measurements across time and space. Maintaining high localization accuracy with extremely sparse fingerprints remains a persistent challenge. Existing benchmark methods primarily rely on the measured fingerprints, while neglecting valuable spatial and environmental characteristics. In this p… ▽ More Fingerprint-based indoor localization is often labor-intensive due to the need for dense grids and repeated measurements across time and space. Maintaining high localization accuracy with extremely sparse fingerprints remains a persistent challenge. Existing benchmark methods primarily rely on the measured fingerprints, while neglecting valuable spatial and environmental characteristics. In this paper, we propose a systematic integration of an Attentional Graph Neural Network (AGNN) model, capable of learning spatial adjacency relationships and aggregating information from neighboring fingerprints, and a meta-learning framework that utilizes datasets with similar environmental characteristics to enhance model training. To minimize the labor required for fingerprint collection, we introduce two novel data augmentation strategies: 1) unlabeled fingerprint augmentation using moving platforms, which enables the semi-supervised AGNN model to incorporate information from unlabeled fingerprints, and 2) synthetic labeled fingerprint augmentation through environmental digital twins, which enhances the meta-learning framework through a practical distribution alignment, which can minimize the feature discrepancy between synthetic and real-world fingerprints effectively. By integrating these novel modules, we propose the Attentional Graph Meta-Learning (AGML) model. This novel model combines the strengths of the AGNN model and the meta-learning framework to address the challenges posed by extremely sparse fingerprints. To validate our approach, we collected multiple datasets from both consumer-grade WiFi devices and professional equipment across diverse environments. Extensive experiments conducted on both synthetic and real-world datasets demonstrate that the AGML model-based localization method consistently outperforms all baseline methods using sparse fingerprints across all evaluated metrics. △ Less

Submitted 7 April, 2025; originally announced April 2025.

arXiv:2504.01038 [pdf, other]

An Integrated AI-Enabled System Using One Class Twin Cross Learning (OCT-X) for Early Gastric Cancer Detection

Authors: Xian-Xian Liu, Yuanyuan Wei, Mingkun Xu, Yongze Guo, Hongwei Zhang, Huicong Dong, Qun Song, Qi Zhao, Wei Luo, Feng Tien, Juntao Gao, Simon Fong

Abstract: Early detection of gastric cancer, a leading cause of cancer-related mortality worldwide, remains hampered by the limitations of current diagnostic technologies, leading to high rates of misdiagnosis and missed diagnoses. To address these challenges, we propose an integrated system that synergizes advanced hardware and software technologies to balance speed-accuracy. Our study introduces the One C… ▽ More Early detection of gastric cancer, a leading cause of cancer-related mortality worldwide, remains hampered by the limitations of current diagnostic technologies, leading to high rates of misdiagnosis and missed diagnoses. To address these challenges, we propose an integrated system that synergizes advanced hardware and software technologies to balance speed-accuracy. Our study introduces the One Class Twin Cross Learning (OCT-X) algorithm. Leveraging a novel fast double-threshold grid search strategy (FDT-GS) and a patch-based deep fully convolutional network, OCT-X maximizes diagnostic accuracy through real-time data processing and seamless lesion surveillance. The hardware component includes an all-in-one point-of-care testing (POCT) device with high-resolution imaging sensors, real-time data processing, and wireless connectivity, facilitated by the NI CompactDAQ and LabVIEW software. Our integrated system achieved an unprecedented diagnostic accuracy of 99.70%, significantly outperforming existing models by up to 4.47%, and demonstrated a 10% improvement in multirate adaptability. These findings underscore the potential of OCT-X as well as the integrated system in clinical diagnostics, offering a path toward more accurate, efficient, and less invasive early gastric cancer detection. Future research will explore broader applications, further advancing oncological diagnostics. Code is available at https://github.com/liu37972/Multirate-Location-on-OCT-X-Learning.git. △ Less

Submitted 31 March, 2025; originally announced April 2025.

Comments: 26 pages, 4 figures, 6 tables

arXiv:2504.01010 [pdf, other]

A YOLO-Based Semi-Automated Labeling Approach to Improve Fault Detection Efficiency in Railroad Videos

Authors: Dylan Lester, James Gao, Samuel Sutphin, Pingping Zhu, Husnu Narman, Ammar Alzarrad

Abstract: Manual labeling for large-scale image and video datasets is often time-intensive, error-prone, and costly, posing a significant barrier to efficient machine learning workflows in fault detection from railroad videos. This study introduces a semi-automated labeling method that utilizes a pre-trained You Only Look Once (YOLO) model to streamline the labeling process and enhance fault detection accur… ▽ More Manual labeling for large-scale image and video datasets is often time-intensive, error-prone, and costly, posing a significant barrier to efficient machine learning workflows in fault detection from railroad videos. This study introduces a semi-automated labeling method that utilizes a pre-trained You Only Look Once (YOLO) model to streamline the labeling process and enhance fault detection accuracy in railroad videos. By initiating the process with a small set of manually labeled data, our approach iteratively trains the YOLO model, using each cycle's output to improve model accuracy and progressively reduce the need for human intervention. To facilitate easy correction of model predictions, we developed a system to export YOLO's detection data as an editable text file, enabling rapid adjustments when detections require refinement. This approach decreases labeling time from an average of 2 to 4 minutes per image to 30 seconds to 2 minutes, effectively minimizing labor costs and labeling errors. Unlike costly AI based labeling solutions on paid platforms, our method provides a cost-effective alternative for researchers and practitioners handling large datasets in fault detection and other detection based machine learning applications. △ Less

Submitted 1 April, 2025; originally announced April 2025.

Comments: Published on American Society of Engineering Education (ASEE) North Central Section Conference, 2025

arXiv:2503.18078 [pdf, other]

GenMetaLoc: Learning to Learn Environment-Aware Fingerprint Generation for Sample Efficient Wireless Localization

Authors: Jun Gao, Feng Yin, Wenzhong Yan, Qinglei Kong, Lexi Xu, Shuguang Cui

Abstract: Existing fingerprinting-based localization methods often require extensive data collection and struggle to generalize to new environments. In contrast to previous environment-unknown MetaLoc, we propose GenMetaLoc in this paper, which first introduces meta-learning to enable the generation of dense fingerprint databases from an environment-aware perspective. In the model aspect, the learning-to-le… ▽ More Existing fingerprinting-based localization methods often require extensive data collection and struggle to generalize to new environments. In contrast to previous environment-unknown MetaLoc, we propose GenMetaLoc in this paper, which first introduces meta-learning to enable the generation of dense fingerprint databases from an environment-aware perspective. In the model aspect, the learning-to-learn mechanism accelerates the fingerprint generation process by facilitating rapid adaptation to new environments with minimal data. Additionally, we incorporate 3D point cloud data from the first Fresnel zone between the transmitter and receiver, which describes the obstacles distribution in the environment and serves as a condition to guide the diffusion model in generating more accurate fingerprints. In the data processing aspect, unlike most studies that focus solely on channel state information (CSI) amplitude or phase, we present a comprehensive processing that addresses both, correcting errors from WiFi hardware limitations such as amplitude discrepancies and frequency offsets. For the data collection platform, we develop an uplink wireless localization system that leverages the sensing capabilities of existing commercial WiFi devices and mobile phones, thus reducing the need for additional deployment costs. Experimental results on real datasets show that our framework outperforms baseline methods. △ Less

Submitted 23 March, 2025; originally announced March 2025.

arXiv:2503.03465 [pdf, other]

DTU-Net: A Multi-Scale Dilated Transformer Network for Nonlinear Hyperspectral Unmixing

Authors: ChenTong Wang, Jincheng Gao, Fei Zhu, Abderrahim Halimi, Cédric Richard

Abstract: Transformers have shown significant success in hyperspectral unmixing (HU). However, challenges remain. While multi-scale and long-range spatial correlations are essential in unmixing tasks, current Transformer-based unmixing networks, built on Vision Transformer (ViT) or Swin-Transformer, struggle to capture them effectively. Additionally, current Transformer-based unmixing networks rely on the l… ▽ More Transformers have shown significant success in hyperspectral unmixing (HU). However, challenges remain. While multi-scale and long-range spatial correlations are essential in unmixing tasks, current Transformer-based unmixing networks, built on Vision Transformer (ViT) or Swin-Transformer, struggle to capture them effectively. Additionally, current Transformer-based unmixing networks rely on the linear mixing model, which lacks the flexibility to accommodate scenarios where nonlinear effects are significant. To address these limitations, we propose a multi-scale Dilated Transformer-based unmixing network for nonlinear HU (DTU-Net). The encoder employs two branches. The first one performs multi-scale spatial feature extraction using Multi-Scale Dilated Attention (MSDA) in the Dilated Transformer, which varies dilation rates across attention heads to capture long-range and multi-scale spatial correlations. The second one performs spectral feature extraction utilizing 3D-CNNs with channel attention. The outputs from both branches are then fused to integrate multi-scale spatial and spectral information, which is subsequently transformed to estimate the abundances. The decoder is designed to accommodate both linear and nonlinear mixing scenarios. Its interpretability is enhanced by explicitly modeling the relationships between endmembers, abundances, and nonlinear coefficients in accordance with the polynomial post-nonlinear mixing model (PPNMM). Experiments on synthetic and real datasets validate the effectiveness of the proposed DTU-Net compared to PPNMM-derived methods and several advanced unmixing networks. △ Less

Submitted 5 March, 2025; v1 submitted 5 March, 2025; originally announced March 2025.

arXiv:2502.17759 [pdf]

Label-free Prediction of Vascular Connectivity in Perfused Microvascular Networks in vitro

Authors: Liang Xu, Pengwu Song, Shilu Zhu, Yang Zhang, Ru Zhang, Zhiyuan Zheng, Qingdong Zhang, Jie Gao, Chen Han, Mingzhai Sun, Peng Yao, Min Ye, Ronald X. Xu

Abstract: Continuous monitoring and in-situ assessment of microvascular connectivity have significant implications for culturing vascularized organoids and optimizing the therapeutic strategies. However, commonly used methods for vascular connectivity assessment heavily rely on fluorescent labels that may either raise biocompatibility concerns or interrupt the normal cell growth process. To address this iss… ▽ More Continuous monitoring and in-situ assessment of microvascular connectivity have significant implications for culturing vascularized organoids and optimizing the therapeutic strategies. However, commonly used methods for vascular connectivity assessment heavily rely on fluorescent labels that may either raise biocompatibility concerns or interrupt the normal cell growth process. To address this issue, a Vessel Connectivity Network (VC-Net) was developed for label-free assessment of vascular connectivity. To validate the VC-Net, microvascular networks (MVNs) were cultured in vitro and their microscopic images were acquired at different culturing conditions as a training dataset. The VC-Net employs a Vessel Queue Contrastive Learning (VQCL) method and a class imbalance algorithm to address the issues of limited sample size, indistinctive class features and imbalanced class distribution in the dataset. The VC-Net successfully evaluated the vascular connectivity with no significant deviation from that by fluorescence imaging. In addition, the proposed VC-Net successfully differentiated the connectivity characteristics between normal and tumor-related MVNs. In comparison with those cultured in the regular microenvironment, the averaged connectivity of MVNs cultured in the tumor-related microenvironment decreased by 30.8%, whereas the non-connected area increased by 37.3%. This study provides a new avenue for label-free and continuous assessment of organoid or tumor vascularization in vitro. △ Less

Submitted 24 February, 2025; originally announced February 2025.

arXiv:2502.05766 [pdf, other]

Audio-Visual Representation Learning via Knowledge Distillation from Speech Foundation Models

Authors: Jing-Xuan Zhang, Genshun Wan, Jianqing Gao, Zhen-Hua Ling

Abstract: Audio-visual representation learning is crucial for advancing multimodal speech processing tasks, such as lipreading and audio-visual speech recognition. Recently, speech foundation models (SFMs) have shown remarkable generalization capabilities across various speech-related tasks. Building on this progress, we propose an audio-visual representation learning model that leverages cross-modal knowle… ▽ More Audio-visual representation learning is crucial for advancing multimodal speech processing tasks, such as lipreading and audio-visual speech recognition. Recently, speech foundation models (SFMs) have shown remarkable generalization capabilities across various speech-related tasks. Building on this progress, we propose an audio-visual representation learning model that leverages cross-modal knowledge distillation from SFMs. In our method, SFMs serve as teachers, from which multi-layer hidden representations are extracted using clean audio inputs. We also introduce a multi-teacher ensemble method to distill the student, which receives audio-visual data as inputs. A novel representational knowledge distillation loss is employed to train the student during pretraining, which is also applied during finetuning to further enhance the performance on downstream tasks. Our experiments utilized both a self-supervised SFM, WavLM, and a supervised SFM, iFLYTEK-speech. The results demonstrated that our proposed method achieved superior or at least comparable performance to previous state-of-the-art baselines across automatic speech recognition, visual speech recognition, and audio-visual speech recognition tasks. Additionally, comprehensive ablation studies and the visualization of learned representations were conducted to evaluate the effectiveness of our proposed method. △ Less

Submitted 8 February, 2025; originally announced February 2025.

Comments: accepted to Pattern Recognition

arXiv:2502.05130 [pdf, other]

Latent Swap Joint Diffusion for 2D Long-Form Latent Generation

Authors: Yusheng Dai, Chenxi Wang, Chang Li, Chen Wang, Jun Du, Kewei Li, Ruoyu Wang, Jiefeng Ma, Lei Sun, Jianqing Gao

Abstract: This paper introduces Swap Forward (SaFa), a modality-agnostic and efficient method to generate seamless and coherence long spectrum and panorama through latent swap joint diffusion across multi-views. We first investigate the spectrum aliasing problem in spectrum-based audio generation caused by existing joint diffusion methods. Through a comparative analysis of the VAE latent representation of M… ▽ More This paper introduces Swap Forward (SaFa), a modality-agnostic and efficient method to generate seamless and coherence long spectrum and panorama through latent swap joint diffusion across multi-views. We first investigate the spectrum aliasing problem in spectrum-based audio generation caused by existing joint diffusion methods. Through a comparative analysis of the VAE latent representation of Mel-spectra and RGB images, we identify that the failure arises from excessive suppression of high-frequency components during the spectrum denoising process due to the averaging operator. To address this issue, we propose Self-Loop Latent Swap, a frame-level bidirectional swap applied to the overlapping region of adjacent views. Leveraging stepwise differentiated trajectories of adjacent subviews, this swap operator adaptively enhances high-frequency components and avoid spectrum distortion. Furthermore, to improve global cross-view consistency in non-overlapping regions, we introduce Reference-Guided Latent Swap, a unidirectional latent swap operator that provides a centralized reference trajectory to synchronize subview diffusions. By refining swap timing and intervals, we can achieve a cross-view similarity-diversity balance in a forward-only manner. Quantitative and qualitative experiments demonstrate that SaFa significantly outperforms existing joint diffusion methods and even training-based methods in audio generation using both U-Net and DiT models, along with effective longer length adaptation. It also adapts well to panorama generation, achieving comparable performance with 2 $\sim$ 20 $\times$ faster speed and greater model generalizability. More generation demos are available at https://swapforward.github.io/ △ Less

Submitted 18 March, 2025; v1 submitted 7 February, 2025; originally announced February 2025.

arXiv:2502.01143 [pdf, other]

ASAP: Aligning Simulation and Real-World Physics for Learning Agile Humanoid Whole-Body Skills

Authors: Tairan He, Jiawei Gao, Wenli Xiao, Yuanhang Zhang, Zi Wang, Jiashun Wang, Zhengyi Luo, Guanqi He, Nikhil Sobanbab, Chaoyi Pan, Zeji Yi, Guannan Qu, Kris Kitani, Jessica Hodgins, Linxi "Jim" Fan, Yuke Zhu, Changliu Liu, Guanya Shi

Abstract: Humanoid robots hold the potential for unparalleled versatility in performing human-like, whole-body skills. However, achieving agile and coordinated whole-body motions remains a significant challenge due to the dynamics mismatch between simulation and the real world. Existing approaches, such as system identification (SysID) and domain randomization (DR) methods, often rely on labor-intensive par… ▽ More Humanoid robots hold the potential for unparalleled versatility in performing human-like, whole-body skills. However, achieving agile and coordinated whole-body motions remains a significant challenge due to the dynamics mismatch between simulation and the real world. Existing approaches, such as system identification (SysID) and domain randomization (DR) methods, often rely on labor-intensive parameter tuning or result in overly conservative policies that sacrifice agility. In this paper, we present ASAP (Aligning Simulation and Real-World Physics), a two-stage framework designed to tackle the dynamics mismatch and enable agile humanoid whole-body skills. In the first stage, we pre-train motion tracking policies in simulation using retargeted human motion data. In the second stage, we deploy the policies in the real world and collect real-world data to train a delta (residual) action model that compensates for the dynamics mismatch. Then, ASAP fine-tunes pre-trained policies with the delta action model integrated into the simulator to align effectively with real-world dynamics. We evaluate ASAP across three transfer scenarios: IsaacGym to IsaacSim, IsaacGym to Genesis, and IsaacGym to the real-world Unitree G1 humanoid robot. Our approach significantly improves agility and whole-body coordination across various dynamic motions, reducing tracking error compared to SysID, DR, and delta dynamics learning baselines. ASAP enables highly agile motions that were previously difficult to achieve, demonstrating the potential of delta action learning in bridging simulation and real-world dynamics. These results suggest a promising sim-to-real direction for developing more expressive and agile humanoids. △ Less

Submitted 25 April, 2025; v1 submitted 3 February, 2025; originally announced February 2025.

Comments: RSS 2025. Project website: https://agile.human2humanoid.com/

arXiv:2501.10404 [pdf, other]

Automated Detection of Epileptic Spikes and Seizures Incorporating a Novel Spatial Clustering Prior

Authors: Hanyang Dong, Shurong Sheng, Xiongfei Wang, Jiahong Gao, Yi Sun, Wanli Yang, Kuntao Xiao, Pengfei Teng, Guoming Luan, Zhao Lv

Abstract: A Magnetoencephalography (MEG) time-series recording consists of multi-channel signals collected by superconducting sensors, with each signal's intensity reflecting magnetic field changes over time at the sensor location. Automating epileptic MEG spike detection significantly reduces manual assessment time and effort, yielding substantial clinical benefits. Existing research addresses MEG spike de… ▽ More A Magnetoencephalography (MEG) time-series recording consists of multi-channel signals collected by superconducting sensors, with each signal's intensity reflecting magnetic field changes over time at the sensor location. Automating epileptic MEG spike detection significantly reduces manual assessment time and effort, yielding substantial clinical benefits. Existing research addresses MEG spike detection by encoding neural network inputs with signals from all channel within a time segment, followed by classification. However, these methods overlook simultaneous spiking occurred from nearby sensors. We introduce a simple yet effective paradigm that first clusters MEG channels based on their sensor's spatial position. Next, a novel convolutional input module is designed to integrate the spatial clustering and temporal changes of the signals. This module is fed into a custom MEEG-ResNet3D developed by the authors, which learns to extract relevant features and classify the input as a spike clip or not. Our method achieves an F1 score of 94.73% on a large real-world MEG dataset Sanbo-CMR collected from two centers, outperforming state-of-the-art approaches by 1.85%. Moreover, it demonstrates efficacy and stability in the Electroencephalographic (EEG) seizure detection task, yielding an improved weighted F1 score of 1.4% compared to current state-of-the-art techniques evaluated on TUSZ, whch is the largest EEG seizure dataset. △ Less

Submitted 4 January, 2025; originally announced January 2025.

Comments: 8 pages, 6 figures, accepted by BIBM2024

arXiv:2501.03496 [pdf, other]

A Unified Attack Detection Strategy for Multi-Agent Systems over Transient and Steady Stages

Authors: Jinming Gao, Yijing Wang, Wentao Zhang, Rui Zhao, Yang Shi, Zhiqiang Zuo

Abstract: This paper proposes a unified detection strategy against three kinds of attacks for multi-agent systems (MASs) which is applicable to both transient and steady stages. For attacks on the communication layer, a watermarking-based detection scheme with KullbackLeibler (KL) divergence is designed. Different from traditional communication schemes, each agent transmits a message set containing two stat… ▽ More This paper proposes a unified detection strategy against three kinds of attacks for multi-agent systems (MASs) which is applicable to both transient and steady stages. For attacks on the communication layer, a watermarking-based detection scheme with KullbackLeibler (KL) divergence is designed. Different from traditional communication schemes, each agent transmits a message set containing two state values with different types of watermarking. It is found that the detection performance is determined by the relevant parameters of the watermarking signal. Unlike the existing detection manoeuvres, such a scheme is capable of transient and steady stages. For attacks on the agent layer, a convergence rate related detection approach is put forward. It is shown that the resilience of the considered system is characterized by the coefficient and offset of the envelope. For hybrid attacks, based on the above detection mechanisms, a general framework resorting to trusted agents is presented, which requires weaker graph conditions and less information transmission. Finally, an example associated with the platooning of connected vehicles is given to support the theoretical results. △ Less

Submitted 6 January, 2025; originally announced January 2025.

arXiv:2411.17552 [pdf, other]

Ensuring Safety in Target Pursuit Control: A CBF-Safe Reinforcement Learning Approach

Authors: Yaosheng Deng, Junjie Gao, Jiaping Xiao, Mir Feroskhan

Abstract: This paper addresses the target-pursuit problem, aiming to ensure each pursuer's safety regarding collision avoidance, sensing range, and input saturation. An input-constrained CBF is proposed to dynamically regulate the pursuer's control, ensuring effective target pursuit even when the target performs evasive maneuvers. To further ensure safety, two sets of CBF constraints are designed to regulat… ▽ More This paper addresses the target-pursuit problem, aiming to ensure each pursuer's safety regarding collision avoidance, sensing range, and input saturation. An input-constrained CBF is proposed to dynamically regulate the pursuer's control, ensuring effective target pursuit even when the target performs evasive maneuvers. To further ensure safety, two sets of CBF constraints are designed to regulate the pursuer's position, enabling it to keep the target within the sensing range while avoiding collision in complex environments with external disturbances. These three CBFs collectively form our safety filter, which filters unsafe outputs from RL by solving a Quadratic Program (QP). Finally, the safety filter, combined with a switch strategy that enhances the feasibility of solving its QP, constitutes the Control Barrier Function (CBF)-Safe Reinforcement Learning (CSRL) algorithm, whose solutions are proven to satisfy the Karush-Kuhn-Tucker (KKT) conditions for all safety constraints. Simulation results validate the effectiveness of the CSRL algorithm, demonstrating its ability to handle complex pursuit scenarios while maintaining safety and improving control performance. △ Less

Submitted 10 December, 2024; v1 submitted 26 November, 2024; originally announced November 2024.

Comments: 12 pages

arXiv:2411.14385 [pdf, other]

Enhancing Diagnostic Precision in Gastric Bleeding through Automated Lesion Segmentation: A Deep DuS-KFCM Approach

Authors: Xian-Xian Liu, Mingkun Xu, Yuanyuan Wei, Huafeng Qin, Qun Song, Simon Fong, Feng Tien, Wei Luo, Juntao Gao, Zhihua Zhang, Shirley Siu

Abstract: Timely and precise classification and segmentation of gastric bleeding in endoscopic imagery are pivotal for the rapid diagnosis and intervention of gastric complications, which is critical in life-saving medical procedures. Traditional methods grapple with the challenge posed by the indistinguishable intensity values of bleeding tissues adjacent to other gastric structures. Our study seeks to rev… ▽ More Timely and precise classification and segmentation of gastric bleeding in endoscopic imagery are pivotal for the rapid diagnosis and intervention of gastric complications, which is critical in life-saving medical procedures. Traditional methods grapple with the challenge posed by the indistinguishable intensity values of bleeding tissues adjacent to other gastric structures. Our study seeks to revolutionize this domain by introducing a novel deep learning model, the Dual Spatial Kernelized Constrained Fuzzy C-Means (Deep DuS-KFCM) clustering algorithm. This Hybrid Neuro-Fuzzy system synergizes Neural Networks with Fuzzy Logic to offer a highly precise and efficient identification of bleeding regions. Implementing a two-fold coarse-to-fine strategy for segmentation, this model initially employs the Spatial Kernelized Fuzzy C-Means (SKFCM) algorithm enhanced with spatial intensity profiles and subsequently harnesses the state-of-the-art DeepLabv3+ with ResNet50 architecture to refine the segmentation output. Through extensive experiments across mainstream gastric bleeding and red spots datasets, our Deep DuS-KFCM model demonstrated unprecedented accuracy rates of 87.95%, coupled with a specificity of 96.33%, outperforming contemporary segmentation methods. The findings underscore the model's robustness against noise and its outstanding segmentation capabilities, particularly for identifying subtle bleeding symptoms, thereby presenting a significant leap forward in medical image processing. △ Less

Submitted 25 November, 2024; v1 submitted 21 November, 2024; originally announced November 2024.

arXiv:2411.14250 [pdf, other]

CP-UNet: Contour-based Probabilistic Model for Medical Ultrasound Images Segmentation

Authors: Ruiguo Yu, Yiyang Zhang, Yuan Tian, Zhiqiang Liu, Xuewei Li, Jie Gao

Abstract: Deep learning-based segmentation methods are widely utilized for detecting lesions in ultrasound images. Throughout the imaging procedure, the attenuation and scattering of ultrasound waves cause contour blurring and the formation of artifacts, limiting the clarity of the acquired ultrasound images. To overcome this challenge, we propose a contour-based probabilistic segmentation model CP-UNet, wh… ▽ More Deep learning-based segmentation methods are widely utilized for detecting lesions in ultrasound images. Throughout the imaging procedure, the attenuation and scattering of ultrasound waves cause contour blurring and the formation of artifacts, limiting the clarity of the acquired ultrasound images. To overcome this challenge, we propose a contour-based probabilistic segmentation model CP-UNet, which guides the segmentation network to enhance its focus on contour during decoding. We design a novel down-sampling module to enable the contour probability distribution modeling and encoding stages to acquire global-local features. Furthermore, the Gaussian Mixture Model utilizes optimized features to model the contour distribution, capturing the uncertainty of lesion boundaries. Extensive experiments with several state-of-the-art deep learning segmentation methods on three ultrasound image datasets show that our method performs better on breast and thyroid lesions segmentation. △ Less

Submitted 21 November, 2024; originally announced November 2024.

Comments: 4 pages, 4 figures, 2 tables;For icassp2025

arXiv:2411.13314

I2TTS: Image-indicated Immersive Text-to-speech Synthesis with Spatial Perception

Authors: Jiawei Zhang, Tian-Hao Zhang, Jun Wang, Jiaran Gao, Xinyuan Qian, Xu-Cheng Yin

Abstract: Controlling the style and characteristics of speech synthesis is crucial for adapting the output to specific contexts and user requirements. Previous Text-to-speech (TTS) works have focused primarily on the technical aspects of producing natural-sounding speech, such as intonation, rhythm, and clarity. However, they overlook the fact that there is a growing emphasis on spatial perception of synthe… ▽ More Controlling the style and characteristics of speech synthesis is crucial for adapting the output to specific contexts and user requirements. Previous Text-to-speech (TTS) works have focused primarily on the technical aspects of producing natural-sounding speech, such as intonation, rhythm, and clarity. However, they overlook the fact that there is a growing emphasis on spatial perception of synthesized speech, which may provide immersive experience in gaming and virtual reality. To solve this issue, in this paper, we present a novel multi-modal TTS approach, namely Image-indicated Immersive Text-to-speech Synthesis (I2TTS). Specifically, we introduce a scene prompt encoder that integrates visual scene prompts directly into the synthesis pipeline to control the speech generation process. Additionally, we propose a reverberation classification and refinement technique that adjusts the synthesized mel-spectrogram to enhance the immersive experience, ensuring that the involved reverberation condition matches the scene accurately. Experimental results demonstrate that our model achieves high-quality scene and spatial matching without compromising speech naturalness, marking a significant advancement in the field of context-aware speech synthesis. Project demo page: https://spatialTTS.github.io/ Index Terms-Speech synthesis, scene prompt, spatial perception △ Less

Submitted 2 December, 2024; v1 submitted 20 November, 2024; originally announced November 2024.

Comments: The paper is missing some information

arXiv:2411.08896 [pdf, other]

Demand-Aware Beam Hopping and Power Allocation for Load Balancing in Digital Twin empowered LEO Satellite Networks

Authors: Ruili Zhao, Jun Cai, Jiangtao Luo, Junpeng Gao, Yongyi Ran

Abstract: Low-Earth orbit (LEO) satellites utilizing beam hopping (BH) technology offer extensive coverage, low latency, high bandwidth, and significant flexibility. However, the uneven geographical distribution and temporal variability of ground traffic demands, combined with the high mobility of LEO satellites, present significant challenges for efficient beam resource utilization. Traditional BH methods… ▽ More Low-Earth orbit (LEO) satellites utilizing beam hopping (BH) technology offer extensive coverage, low latency, high bandwidth, and significant flexibility. However, the uneven geographical distribution and temporal variability of ground traffic demands, combined with the high mobility of LEO satellites, present significant challenges for efficient beam resource utilization. Traditional BH methods based on GEO satellites fail to address issues such as satellite interference, overlapping coverage, and mobility. This paper explores a Digital Twin (DT)-based collaborative resource allocation network for multiple LEO satellites with overlapping coverage areas. A two-tier optimization problem, focusing on load balancing and cell service fairness, is proposed to maximize throughput and minimize inter-cell service delay. The DT layer optimizes the allocation of overlapping coverage cells by designing BH patterns for each satellite, while the LEO layer optimizes power allocation for each selected service cell. At the DT layer, an Actor-Critic network is deployed on each agent, with a global critic network in the cloud center. The A3C algorithm is employed to optimize the DT layer. Concurrently, the LEO layer optimization is performed using a Multi-Agent Reinforcement Learning algorithm, where each beam functions as an independent agent. The simulation results show that this method reduces satellite load disparity by about 72.5% and decreases the average delay to 12ms. Additionally, our approach outperforms other benchmarks in terms of throughput, ensuring a better alignment between offered and requested data. △ Less

Submitted 28 October, 2024; originally announced November 2024.

arXiv:2411.07751 [pdf, other]

SAV-SE: Scene-aware Audio-Visual Speech Enhancement with Selective State Space Model

Authors: Xinyuan Qian, Jiaran Gao, Yaodan Zhang, Qiquan Zhang, Hexin Liu, Leibny Paola Garcia, Haizhou Li

Abstract: Speech enhancement plays an essential role in various applications, and the integration of visual information has been demonstrated to bring substantial advantages. However, the majority of current research concentrates on the examination of facial and lip movements, which can be compromised or entirely inaccessible in scenarios where occlusions occur or when the camera view is distant. Whereas co… ▽ More Speech enhancement plays an essential role in various applications, and the integration of visual information has been demonstrated to bring substantial advantages. However, the majority of current research concentrates on the examination of facial and lip movements, which can be compromised or entirely inaccessible in scenarios where occlusions occur or when the camera view is distant. Whereas contextual visual cues from the surrounding environment have been overlooked: for example, when we see a dog bark, our brain has the innate ability to discern and filter out the barking noise. To this end, in this paper, we introduce a novel task, i.e. SAV-SE. To our best knowledge, this is the first proposal to use rich contextual information from synchronized video as auxiliary cues to indicate the type of noise, which eventually improves the speech enhancement performance. Specifically, we propose the VC-S$^2$E method, which incorporates the Conformer and Mamba modules for their complementary strengths. Extensive experiments are conducted on public MUSIC, AVSpeech and AudioSet datasets, where the results demonstrate the superiority of VC-S$^2$E over other competitive methods. We will make the source code publicly available. Project demo page: https://AVSEPage.github.io/ △ Less

Submitted 2 April, 2025; v1 submitted 12 November, 2024; originally announced November 2024.

Comments: accepted by IEEE Journal of Selected Topics in Signal Processing

arXiv:2410.11932 [pdf, other]

doi 10.1109/TIA.2024.3522486

Physical Informed-Inspired Deep Reinforcement Learning Based Bi-Level Programming for Microgrid Scheduling

Authors: Yang Li, Jiankai Gao, Yuanzheng Li, Chen Chen, Sen Li, Mohammad Shahidehpour, Zhe Chen

Abstract: To coordinate the interests of operator and users in a microgrid under complex and changeable operating conditions, this paper proposes a microgrid scheduling model considering the thermal flexibility of thermostatically controlled loads and demand response by leveraging physical informed-inspired deep reinforcement learning (DRL) based bi-level programming. To overcome the non-convex limitations… ▽ More To coordinate the interests of operator and users in a microgrid under complex and changeable operating conditions, this paper proposes a microgrid scheduling model considering the thermal flexibility of thermostatically controlled loads and demand response by leveraging physical informed-inspired deep reinforcement learning (DRL) based bi-level programming. To overcome the non-convex limitations of karush-kuhn-tucker (KKT)-based methods, a novel optimization solution method based on DRL theory is proposed to handle the bi-level programming through alternate iterations between levels. Specifically, by combining a DRL algorithm named asynchronous advantage actor-critic (A3C) and automated machine learning-prioritized experience replay (AutoML-PER) strategy to improve the generalization performance of A3C to address the above problems, an improved A3C algorithm, called AutoML-PER-A3C, is designed to solve the upper-level problem; while the DOCPLEX optimizer is adopted to address the lower-level problem. In this solution process, AutoML is used to automatically optimize hyperparameters and PER improves learning efficiency and quality by extracting the most valuable samples. The test results demonstrate that the presented approach manages to reconcile the interests between multiple stakeholders in MG by fully exploiting various flexibility resources. Furthermore, in terms of economic viability and computational efficiency, the proposal vastly exceeds other advanced reinforcement learning methods. △ Less

Submitted 15 October, 2024; originally announced October 2024.

Comments: Accepted by IEEE Transactions on Industry Applications (Paper Id: 2023-KDSEM-1058)

arXiv:2410.09444 [pdf]

Diabetic retinopathy image classification method based on GreenBen data augmentation

Authors: Yutong Liu, Jie Gao, Haijiang Zhu

Abstract: For the diagnosis of diabetes retinopathy (DR) images, this paper proposes a classification method based on artificial intelligence. The core lies in a new data augmentation method, GreenBen, which first extracts the green channel grayscale image from the retinal image and then performs Ben enhancement. Considering that diabetes macular edema (DME) is a complication closely related to DR, this pap… ▽ More For the diagnosis of diabetes retinopathy (DR) images, this paper proposes a classification method based on artificial intelligence. The core lies in a new data augmentation method, GreenBen, which first extracts the green channel grayscale image from the retinal image and then performs Ben enhancement. Considering that diabetes macular edema (DME) is a complication closely related to DR, this paper constructs a joint classification framework of DR and DME based on multi task learning and attention module, and uses GreenBen to enhance its data to reduce the difference of DR images and improve the accuracy of model classification. We conducted extensive experiments on three publicly available datasets, and our method achieved the best results. For GreenBen, whether based on the ResNet50 network or the Swin Transformer network, whether for individual classification or joint DME classification, compared with other data augmentation methods, GreenBen achieved stable and significant improvements in DR classification results, with an accuracy increase of 10%. △ Less

Submitted 12 October, 2024; originally announced October 2024.

arXiv:2410.02510 [pdf, other]

SwarmCVT: Centroidal Voronoi Tessellation-Based Path Planning for Very-Large-Scale Robotics

Authors: James Gao, Jacob Lee, Yuting Zhou, Yunze Hu, Chang Liu, Pingping Zhu

Abstract: Swarm robotics, or very large-scale robotics (VLSR), has many meaningful applications for complicated tasks. However, the complexity of motion control and energy costs stack up quickly as the number of robots increases. In addressing this problem, our previous studies have formulated various methods employing macroscopic and microscopic approaches. These methods enable microscopic robots to adhere… ▽ More Swarm robotics, or very large-scale robotics (VLSR), has many meaningful applications for complicated tasks. However, the complexity of motion control and energy costs stack up quickly as the number of robots increases. In addressing this problem, our previous studies have formulated various methods employing macroscopic and microscopic approaches. These methods enable microscopic robots to adhere to a reference Gaussian mixture model (GMM) distribution observed at the macroscopic scale. As a result, optimizing the macroscopic level will result in an optimal overall result. However, all these methods require systematic and global generation of Gaussian components (GCs) within obstacle-free areas to construct the GMM trajectories. This work utilizes centroidal Voronoi tessellation to generate GCs methodically. Consequently, it demonstrates performance improvement while also ensuring consistency and reliability. △ Less

Submitted 3 October, 2024; originally announced October 2024.

Comments: Submitted to American Control Conference (ACC) 2025

arXiv:2409.17603 [pdf, other]

Deep CLAS: Deep Contextual Listen, Attend and Spell

Authors: Mengzhi Wang, Shifu Xiong, Genshun Wan, Hang Chen, Jianqing Gao, Lirong Dai

Abstract: Contextual-LAS (CLAS) has been shown effective in improving Automatic Speech Recognition (ASR) of rare words. It relies on phrase-level contextual modeling and attention-based relevance scoring without explicit contextual constraint which lead to insufficient use of contextual information. In this work, we propose deep CLAS to use contextual information better. We introduce bias loss forcing model… ▽ More Contextual-LAS (CLAS) has been shown effective in improving Automatic Speech Recognition (ASR) of rare words. It relies on phrase-level contextual modeling and attention-based relevance scoring without explicit contextual constraint which lead to insufficient use of contextual information. In this work, we propose deep CLAS to use contextual information better. We introduce bias loss forcing model to focus on contextual information. The query of bias attention is also enriched to improve the accuracy of the bias attention score. To get fine-grained contextual information, we replace phrase-level encoding with character-level encoding and encode contextual information with conformer rather than LSTM. Moreover, we directly use the bias attention score to correct the output probability distribution of the model. Experiments using the public AISHELL-1 and AISHELL-NER. On AISHELL-1, compared to CLAS baselines, deep CLAS obtains a 65.78% relative recall and a 53.49% relative F1-score increase in the named entity recognition scene. △ Less

Submitted 19 December, 2024; v1 submitted 26 September, 2024; originally announced September 2024.

Comments: Submitted to JUSTC

arXiv:2409.09289 [pdf, other]

DSCLAP: Domain-Specific Contrastive Language-Audio Pre-Training

Authors: Shengqiang Liu, Da Liu, Anna Wang, Zhiyu Zhang, Jie Gao, Yali Li

Abstract: Analyzing real-world multimodal signals is an essential and challenging task for intelligent voice assistants (IVAs). Mainstream approaches have achieved remarkable performance on various downstream tasks of IVAs with pre-trained audio models and text models. However, these models are pre-trained independently and usually on tasks different from target domains, resulting in sub-optimal modality re… ▽ More Analyzing real-world multimodal signals is an essential and challenging task for intelligent voice assistants (IVAs). Mainstream approaches have achieved remarkable performance on various downstream tasks of IVAs with pre-trained audio models and text models. However, these models are pre-trained independently and usually on tasks different from target domains, resulting in sub-optimal modality representations for downstream tasks. Moreover, in many domains, collecting enough language-audio pairs is extremely hard, and transcribing raw audio also requires high professional skills, making it difficult or even infeasible to joint pre-training. To address these painpoints, we propose DSCLAP, a simple and effective framework that enables language-audio pre-training with only raw audio signal input. Specifically, DSCLAP converts raw audio signals into text via an ASR system and combines a contrastive learning objective and a language-audio matching objective to align the audio and ASR transcriptions. We pre-train DSCLAP on 12,107 hours of in-vehicle domain audio. Empirical results on two downstream tasks show that while conceptually simple, DSCLAP significantly outperforms the baseline models in all metrics, showing great promise for domain-specific IVAs applications. △ Less

Submitted 13 September, 2024; originally announced September 2024.

arXiv:2409.09284 [pdf, other]

M$^{3}$V: A multi-modal multi-view approach for Device-Directed Speech Detection

Authors: Anna Wang, Da Liu, Zhiyu Zhang, Shengqiang Liu, Jie Gao, Yali Li

Abstract: With the goal of more natural and human-like interaction with virtual voice assistants, recent research in the field has focused on full duplex interaction mode without relying on repeated wake-up words. This requires that in scenes with complex sound sources, the voice assistant must classify utterances as device-oriented or non-device-oriented. The dual-encoder structure, which is jointly modele… ▽ More With the goal of more natural and human-like interaction with virtual voice assistants, recent research in the field has focused on full duplex interaction mode without relying on repeated wake-up words. This requires that in scenes with complex sound sources, the voice assistant must classify utterances as device-oriented or non-device-oriented. The dual-encoder structure, which is jointly modeled by text and speech, has become the paradigm of device-directed speech detection. However, in practice, these models often produce incorrect predictions for unaligned input pairs due to the unavoidable errors of automatic speech recognition (ASR).To address this challenge, we propose M$^{3}$V, a multi-modal multi-view approach for device-directed speech detection, which frames we frame the problem as a multi-view learning task that introduces unimodal views and a text-audio alignment view in the network besides the multi-modal. Experimental results show that M$^{3}$V significantly outperforms models trained using only single or multi-modality and surpasses human judgment performance on ASR error data for the first time. △ Less

Submitted 13 September, 2024; originally announced September 2024.

arXiv:2409.02041 [pdf, other]

The USTC-NERCSLIP Systems for the CHiME-8 NOTSOFAR-1 Challenge

Authors: Shutong Niu, Ruoyu Wang, Jun Du, Gaobin Yang, Yanhui Tu, Siyuan Wu, Shuangqing Qian, Huaxin Wu, Haitao Xu, Xueyang Zhang, Guolong Zhong, Xindi Yu, Jieru Chen, Mengzhi Wang, Di Cai, Tian Gao, Genshun Wan, Feng Ma, Jia Pan, Jianqing Gao

Abstract: This technical report outlines our submission system for the CHiME-8 NOTSOFAR-1 Challenge. The primary difficulty of this challenge is the dataset recorded across various conference rooms, which captures real-world complexities such as high overlap rates, background noises, a variable number of speakers, and natural conversation styles. To address these issues, we optimized the system in several a… ▽ More This technical report outlines our submission system for the CHiME-8 NOTSOFAR-1 Challenge. The primary difficulty of this challenge is the dataset recorded across various conference rooms, which captures real-world complexities such as high overlap rates, background noises, a variable number of speakers, and natural conversation styles. To address these issues, we optimized the system in several aspects: For front-end speech signal processing, we introduced a data-driven joint training method for diarization and separation (JDS) to enhance audio quality. Additionally, we also integrated traditional guided source separation (GSS) for multi-channel track to provide complementary information for the JDS. For back-end speech recognition, we enhanced Whisper with WavLM, ConvNeXt, and Transformer innovations, applying multi-task training and Noise KLD augmentation, to significantly advance ASR robustness and accuracy. Our system attained a Time-Constrained minimum Permutation Word Error Rate (tcpWER) of 14.265% and 22.989% on the CHiME-8 NOTSOFAR-1 Dev-set-2 multi-channel and single-channel tracks, respectively. △ Less

Submitted 24 October, 2024; v1 submitted 3 September, 2024; originally announced September 2024.

arXiv:2408.09715 [pdf, other]

HYDEN: Hyperbolic Density Representations for Medical Images and Reports

Authors: Zhi Qiao, Linbin Han, Xiantong Zhen, Jia-Hong Gao, Zhen Qian

Abstract: In light of the inherent entailment relations between images and text, hyperbolic point vector embeddings, leveraging the hierarchical modeling advantages of hyperbolic space, have been utilized for visual semantic representation learning. However, point vector embedding approaches fail to address the issue of semantic uncertainty, where an image may have multiple interpretations, and text may ref… ▽ More In light of the inherent entailment relations between images and text, hyperbolic point vector embeddings, leveraging the hierarchical modeling advantages of hyperbolic space, have been utilized for visual semantic representation learning. However, point vector embedding approaches fail to address the issue of semantic uncertainty, where an image may have multiple interpretations, and text may refer to different images, a phenomenon particularly prevalent in the medical domain. Therefor, we propose \textbf{HYDEN}, a novel hyperbolic density embedding based image-text representation learning approach tailored for specific medical domain data. This method integrates text-aware local features alongside global features from images, mapping image-text features to density features in hyperbolic space via using hyperbolic pseudo-Gaussian distributions. An encapsulation loss function is employed to model the partial order relations between image-text density distributions. Experimental results demonstrate the interpretability of our approach and its superior performance compared to the baseline methods across various zero-shot tasks and different datasets. △ Less

Submitted 19 August, 2024; v1 submitted 19 August, 2024; originally announced August 2024.

arXiv:2407.13227 [pdf, other]

Solving the Model Unavailable MARE using Q-Learning Algorithm

Authors: Fei Yan, Jie Gao, Tao Feng, Jianxing Liu

Abstract: In this paper, the discrete-time modified algebraic Riccati equation (MARE) is solved when the system model is completely unavailable. To achieve this, firstly a brand new iterative method based on the standard discrete-time algebraic Riccati equation (DARE) and its input weighting matrix is proposed to solve the MARE. For the single-input case, the iteration can be initialized by an arbitrary pos… ▽ More In this paper, the discrete-time modified algebraic Riccati equation (MARE) is solved when the system model is completely unavailable. To achieve this, firstly a brand new iterative method based on the standard discrete-time algebraic Riccati equation (DARE) and its input weighting matrix is proposed to solve the MARE. For the single-input case, the iteration can be initialized by an arbitrary positive input weighting if and only if the MARE has a stabilizing solution; nevertheless a pre-given input weighting matrix of a sufficiently large magnitude is used to perform the iteration for the multi-input case when the characteristic parameter belongs to a specified subset. Benefit from the developed specific iteration structure, the Q-learning (QL) algorithm can be employed to subtly solve the MARE where only the system input/output data is used thus the system model is not required. Finally, a numerical simulation example is given to verify the effectiveness of the theoretical results and the algorithm. △ Less

Submitted 18 July, 2024; originally announced July 2024.

arXiv:2407.00218 [pdf, other]

Resilient Estimator-based Control Barrier Functions for Dynamical Systems with Disturbances and Noise

Authors: Chuyuan Tao, Wenbin Wan, Junjie Gao, Bihao Mo, Hunmin Kim, Naira Hovakimyan

Abstract: Control Barrier Function (CBF) is an emerging method that guarantees safety in path planning problems by generating a control command to ensure the forward invariance of a safety set. Most of the developments up to date assume availability of correct state measurements and absence of disturbances on the system. However, if the system incurs disturbances and is subject to noise, the CBF cannot guar… ▽ More Control Barrier Function (CBF) is an emerging method that guarantees safety in path planning problems by generating a control command to ensure the forward invariance of a safety set. Most of the developments up to date assume availability of correct state measurements and absence of disturbances on the system. However, if the system incurs disturbances and is subject to noise, the CBF cannot guarantee safety due to the distorted state estimate. To improve the resilience and adaptability of the CBF, we propose a resilient estimator-based control barrier function (RE-CBF), which is based on a novel stochastic CBF optimization and resilient estimator, to guarantee the safety of systems with disturbances and noise in the path planning problems. The proposed algorithm uses the resilient estimation algorithm to estimate disturbances and counteract their effect using novel stochastic CBF optimization, providing safe control inputs for dynamical systems with disturbances and noise. To demonstrate the effectiveness of our algorithm in handling both noise and disturbances in dynamics and measurement, we design a quadrotor testing pipeline to simulate the proposed algorithm and then implement the algorithm on a real drone in our flying arena. Both simulations and real-world experiments show that the proposed method can guarantee safety for systems with disturbances and noise. △ Less

Submitted 28 June, 2024; originally announced July 2024.

arXiv:2406.16200 [pdf, other]

Towards unlocking the mystery of adversarial fragility of neural networks

Authors: Jingchao Gao, Raghu Mudumbai, Xiaodong Wu, Jirong Yi, Catherine Xu, Hui Xie, Weiyu Xu

Abstract: In this paper, we study the adversarial robustness of deep neural networks for classification tasks. We look at the smallest magnitude of possible additive perturbations that can change the output of a classification algorithm. We provide a matrix-theoretic explanation of the adversarial fragility of deep neural network for classification. In particular, our theoretical results show that neural ne… ▽ More In this paper, we study the adversarial robustness of deep neural networks for classification tasks. We look at the smallest magnitude of possible additive perturbations that can change the output of a classification algorithm. We provide a matrix-theoretic explanation of the adversarial fragility of deep neural network for classification. In particular, our theoretical results show that neural network's adversarial robustness can degrade as the input dimension $d$ increases. Analytically we show that neural networks' adversarial robustness can be only $1/\sqrt{d}$ of the best possible adversarial robustness. Our matrix-theoretic explanation is consistent with an earlier information-theoretic feature-compression-based explanation for the adversarial fragility of neural networks. △ Less

Submitted 23 June, 2024; originally announced June 2024.

Comments: 21 pages

arXiv:2406.14067 [pdf]

A microwave photonic prototype for concurrent radar detection and spectrum sensing over an 8 to 40 GHz bandwidth

Authors: Taixia Shi, Dingding Liang, Lu Wang, Lin Li, Shaogang Guo, Jiawei Gao, Xiaowei Li, Chulun Lin, Lei Shi, Baogang Ding, Shiyang Liu, Fangyi Yang, Chi Jiang, Yang Chen

Abstract: In this work, a microwave photonic prototype for concurrent radar detection and spectrum sensing is proposed, designed, built, and investigated. A direct digital synthesizer and an analog electronic circuit are integrated to generate an intermediate frequency (IF) linearly frequency-modulated (LFM) signal with a tunable center frequency from 2.5 to 9.5 GHz and an instantaneous bandwidth of 1 GHz.… ▽ More In this work, a microwave photonic prototype for concurrent radar detection and spectrum sensing is proposed, designed, built, and investigated. A direct digital synthesizer and an analog electronic circuit are integrated to generate an intermediate frequency (IF) linearly frequency-modulated (LFM) signal with a tunable center frequency from 2.5 to 9.5 GHz and an instantaneous bandwidth of 1 GHz. The IF LFM signal is converted to the optical domain via an intensity modulator and then filtered by a fiber Bragg grating (FBG) to generate only two 2nd-order optical LFM sidebands. In radar detection, the two optical LFM sidebands beat with each other to generate a frequency-and-bandwidth-quadrupled LFM signal, which is used for ranging, radial velocity measurement, and imaging. By changing the center frequency of the IF LFM signal, the radar function can be operated within 8 to 40 GHz. In spectrum sensing, one 2nd-order optical LFM sideband is selected by another FBG, which then works in conjunction with the stimulated Brillouin scattering gain spectrum to map the frequency of the signal under test to time with an instantaneous measurement bandwidth of 2 GHz. By using a frequency shift module to adjust the pump frequency, the frequency measurement range can be adjusted from 0 to 40 GHz. The prototype is comprehensively studied and tested, which is capable of achieving a range resolution of 3.75 cm, a range error of less than $\pm$ 2 cm, a radial velocity error within $\pm$ 1 cm/s, delivering clear imaging of multiple small targets, and maintaining a frequency measurement error of less than $\pm$ 7 MHz and a frequency resolution of better than 20 MHz. △ Less

Submitted 20 June, 2024; originally announced June 2024.

Comments: 18 pages, 12 figures, 1 table

arXiv:2406.13038 [pdf, other]

Traffic Prediction considering Multiple Levels of Spatial-temporal Information: A Multi-scale Graph Wavelet-based Approach

Authors: Zilin Bian, Jingqin Gao, Kaan Ozbay, Zhenning Li

Abstract: Although traffic prediction has been receiving considerable attention with a number of successes in the context of intelligent transportation systems, the prediction of traffic states over a complex transportation network that contains different road types has remained a challenge. This study proposes a multi-scale graph wavelet temporal convolution network (MSGWTCN) to predict the traffic states… ▽ More Although traffic prediction has been receiving considerable attention with a number of successes in the context of intelligent transportation systems, the prediction of traffic states over a complex transportation network that contains different road types has remained a challenge. This study proposes a multi-scale graph wavelet temporal convolution network (MSGWTCN) to predict the traffic states in complex transportation networks. Specifically, a multi-scale spatial block is designed to simultaneously capture the spatial information at different levels, and the gated temporal convolution network is employed to extract the temporal dependencies of the data. The model jointly learns to mount multiple levels of the spatial interactions by stacking graph wavelets with different scales. Two real-world datasets are used in this study to investigate the model performance, including a highway network in Seattle and a dense road network of Manhattan in New York City. Experiment results show that the proposed model outperforms other baseline models. Furthermore, different scales of graph wavelets are found to be effective in extracting local, intermediate and global information at the same time and thus enable the model to learn a complex transportation network topology with various types of road segments. By carefully customizing the scales of wavelets, the model is able to improve the prediction performance and better adapt to different network configurations. △ Less

Submitted 18 June, 2024; originally announced June 2024.

arXiv:2405.19213 [pdf, other]

EdgeSight: Enabling Modeless and Cost-Efficient Inference at the Edge

Authors: ChonLam Lao, Jiaqi Gao, Ganesh Ananthanarayanan, Aditya Akella, Minlan Yu

Abstract: Traditional ML inference is evolving toward modeless inference, which abstracts the complexity of model selection from users, allowing the system to automatically choose the most appropriate model for each request based on accuracy and resource requirements. While prior studies have focused on modeless inference within data centers, this paper tackles the pressing need for cost-efficient modeless… ▽ More Traditional ML inference is evolving toward modeless inference, which abstracts the complexity of model selection from users, allowing the system to automatically choose the most appropriate model for each request based on accuracy and resource requirements. While prior studies have focused on modeless inference within data centers, this paper tackles the pressing need for cost-efficient modeless inference at the edge -- particularly within its unique constraints of limited device memory, volatile network conditions, and restricted power consumption. To overcome these challenges, we propose EdgeSight, a system that provides cost-efficient EdgeSight serving for diverse DNNs at the edge. EdgeSight employs an edge-data center (edge-DC) architecture, utilizing confidence scaling to reduce the number of model options while meeting diverse accuracy requirements. Additionally, it supports lossy inference in volatile network environments. Our experimental results show that EdgeSight outperforms existing systems by up to 1.6x in P99 latency for modeless services. Furthermore, our FPGA prototype demonstrates similar performance at certain accuracy levels, with a power consumption reduction of up to 3.34x. △ Less

Submitted 14 January, 2025; v1 submitted 29 May, 2024; originally announced May 2024.

Comments: 12 pages

arXiv:2405.15863 [pdf, ps, other]

Quality-aware Masked Diffusion Transformer for Enhanced Music Generation

Authors: Chang Li, Ruoyu Wang, Lijuan Liu, Jun Du, Yixuan Sun, Zilu Guo, Zhenrong Zhang, Yuan Jiang, Jianqing Gao, Feng Ma

Abstract: Text-to-music (TTM) generation, which converts textual descriptions into audio, opens up innovative avenues for multimedia creation. Achieving high quality and diversity in this process demands extensive, high-quality data, which are often scarce in available datasets. Most open-source datasets frequently suffer from issues like low-quality waveforms and low text-audio consistency, hindering the a… ▽ More Text-to-music (TTM) generation, which converts textual descriptions into audio, opens up innovative avenues for multimedia creation. Achieving high quality and diversity in this process demands extensive, high-quality data, which are often scarce in available datasets. Most open-source datasets frequently suffer from issues like low-quality waveforms and low text-audio consistency, hindering the advancement of music generation models. To address these challenges, we propose a novel quality-aware training paradigm for generating high-quality, high-musicality music from large-scale, quality-imbalanced datasets. Additionally, by leveraging unique properties in the latent space of musical signals, we adapt and implement a masked diffusion transformer (MDT) model for the TTM task, showcasing its capacity for quality control and enhanced musicality. Furthermore, we introduce a three-stage caption refinement approach to address low-quality captions' issue. Experiments show state-of-the-art (SOTA) performance on benchmark datasets including MusicCaps and the Song-Describer Dataset with both objective and subjective metrics. Demo audio samples are available at https://qa-mdt.github.io/, code and pretrained checkpoints are open-sourced at https://github.com/ivcylc/OpenMusic. △ Less

Submitted 17 June, 2025; v1 submitted 24 May, 2024; originally announced May 2024.

Comments: IJCAI

arXiv:2405.15271 [pdf]

Seamless Integration and Implementation of Distributed Contact and Contactless Vital Sign Monitoring

Authors: Dingding Liang, Yang Chen, Jiawei Gao, Taixia Shi, Jianping Yao

Abstract: Real-time vital sign monitoring is gaining immense significance not only in the medical field but also in personal health management. Facing the needs of different application scenarios of the smart and healthy city in the future, the low-cost, large-scale, scalable, and distributed vital sign monitoring system is of great significance. In this work, a seamlessly integrated contact and contactless… ▽ More Real-time vital sign monitoring is gaining immense significance not only in the medical field but also in personal health management. Facing the needs of different application scenarios of the smart and healthy city in the future, the low-cost, large-scale, scalable, and distributed vital sign monitoring system is of great significance. In this work, a seamlessly integrated contact and contactless vital sign monitoring system, which can simultaneously implement respiration and heartbeat monitoring, is proposed. In contact vital sign monitoring, the chest wall movement due to respiration and heartbeat is translated into changes in the optical output intensity of a fiber Bragg grating (FBG). The FBG is also an important part of radar signal generation for contactless vital sign monitoring, in which the chest wall movement is translated into phase changes of the radar de-chirped signal. By analyzing the intensity of the FBG output and phase of the radar de-chirped signal, real-time respiration and heartbeat monitoring are realized. In addition, due to the distributed structure of the system and its good integration with the wavelength-division multiplexing optical network, it can be massively scaled by employing more wavelengths. A proof-of-concept experiment is carried out. Contact and contactless respiration and heartbeat monitoring of three people are simultaneously realized. During a monitoring time of 60 s, the maximum absolute measurement errors of respiration and heartbeat rates are 1.6 respirations per minute and 2.3 beats per minute, respectively. The measurement error does not have an obvious change even when the monitoring time is decreased to 5 s. △ Less

Submitted 24 May, 2024; originally announced May 2024.

Comments: 14 pages,9 figures

arXiv:2405.11883 [pdf, other]

Asynchronous MIMO-OFDM Massive Unsourced Random Access with Codeword Collisions

Authors: Tianya Li, Yongpeng Wu, Junyuan Gao, Wenjun Zhang, Xiang-Gen Xia, Derrick Wing Kwan Ng, Chengshan Xiao

Abstract: This paper investigates asynchronous multiple-input multiple-output (MIMO) massive unsourced random access (URA) in an orthogonal frequency division multiplexing (OFDM) system over frequency-selective fading channels, with the presence of both timing and carrier frequency offsets (TO and CFO) and non-negligible codeword collisions. The proposed coding framework segregates the data into two compone… ▽ More This paper investigates asynchronous multiple-input multiple-output (MIMO) massive unsourced random access (URA) in an orthogonal frequency division multiplexing (OFDM) system over frequency-selective fading channels, with the presence of both timing and carrier frequency offsets (TO and CFO) and non-negligible codeword collisions. The proposed coding framework segregates the data into two components, namely, preamble and coding parts, with the former being tree-coded and the latter LDPC-coded. By leveraging the dual sparsity of the equivalent channel across both codeword and delay domains (CD and DD), we develop a message-passing-based sparse Bayesian learning algorithm, combined with belief propagation and mean field, to iteratively estimate DD channel responses, TO, and delay profiles. Furthermore, by jointly leveraging the observations among multiple slots, we establish a novel graph-based algorithm to iteratively separate the superimposed channels and compensate for the phase rotations. Additionally, the proposed algorithm is applied to the flat fading scenario to estimate both TO and CFO, where the channel and offset estimation is enhanced by leveraging the geometric characteristics of the signal constellation. Extensive simulations reveal that the proposed algorithm achieves superior performance and substantial complexity reduction in both channel and offset estimation compared to the codebook enlarging-based counterparts, and enhanced data recovery performances compared to state-of-the-art URA schemes. △ Less

Submitted 10 October, 2024; v1 submitted 20 May, 2024; originally announced May 2024.

Comments: This paper has been accepted by the IEEE Transactions on Wireless Communications

arXiv:2405.05565 [pdf, other]

doi 10.1109/TGRS.2024.3406711

Array SAR 3D Sparse Imaging Based on Regularization by Denoising Under Few Observed Data

Authors: Yangyang Wang, Xu Zhan, Jing Gao, Jinjie Yao, Shunjun Wei, JianSheng Bai

Abstract: Array synthetic aperture radar (SAR) three-dimensional (3D) imaging can obtain 3D information of the target region, which is widely used in environmental monitoring and scattering information measurement. In recent years, with the development of compressed sensing (CS) theory, sparse signal processing is used in array SAR 3D imaging. Compared with matched filter (MF), sparse SAR imaging can effect… ▽ More Array synthetic aperture radar (SAR) three-dimensional (3D) imaging can obtain 3D information of the target region, which is widely used in environmental monitoring and scattering information measurement. In recent years, with the development of compressed sensing (CS) theory, sparse signal processing is used in array SAR 3D imaging. Compared with matched filter (MF), sparse SAR imaging can effectively improve image quality. However, sparse imaging based on handcrafted regularization functions suffers from target information loss in few observed SAR data. Therefore, in this article, a general 3D sparse imaging framework based on Regulation by Denoising (RED) and proximal gradient descent type method for array SAR is presented. Firstly, we construct explicit prior terms via state-of-the-art denoising operators instead of regularization functions, which can improve the accuracy of sparse reconstruction and preserve the structure information of the target. Then, different proximal gradient descent type methods are presented, including a generalized alternating projection (GAP) and an alternating direction method of multiplier (ADMM), which is suitable for high-dimensional data processing. Additionally, the proposed method has robust convergence, which can achieve sparse reconstruction of 3D SAR in few observed SAR data. Extensive simulations and real data experiments are conducted to analyze the performance of the proposed method. The experimental results show that the proposed method has superior sparse reconstruction performance. △ Less

Submitted 26 May, 2024; v1 submitted 9 May, 2024; originally announced May 2024.

arXiv:2404.15279 [pdf, other]

Jointly Modeling Spatio-Temporal Features of Tactile Signals for Action Classification

Authors: Jimmy Lin, Junkai Li, Jiasi Gao, Weizhi Ma, Yang Liu

Abstract: Tactile signals collected by wearable electronics are essential in modeling and understanding human behavior. One of the main applications of tactile signals is action classification, especially in healthcare and robotics. However, existing tactile classification methods fail to capture the spatial and temporal features of tactile signals simultaneously, which results in sub-optimal performances.… ▽ More Tactile signals collected by wearable electronics are essential in modeling and understanding human behavior. One of the main applications of tactile signals is action classification, especially in healthcare and robotics. However, existing tactile classification methods fail to capture the spatial and temporal features of tactile signals simultaneously, which results in sub-optimal performances. In this paper, we design Spatio-Temporal Aware tactility Transformer (STAT) to utilize continuous tactile signals for action classification. We propose spatial and temporal embeddings along with a new temporal pretraining task in our model, which aims to enhance the transformer in modeling the spatio-temporal features of tactile signals. Specially, the designed temporal pretraining task is to differentiate the time order of tubelet inputs to model the temporal properties explicitly. Experimental results on a public action classification dataset demonstrate that our model outperforms state-of-the-art methods in all metrics. △ Less

Submitted 20 January, 2024; originally announced April 2024.

Comments: Accepted by AAAI 2024

arXiv:2402.18871 [pdf, other]

LoLiSRFlow: Joint Single Image Low-light Enhancement and Super-resolution via Cross-scale Transformer-based Conditional Flow

Authors: Ziyu Yue, Jiaxin Gao, Sihan Xie, Yang Liu, Zhixun Su

Abstract: The visibility of real-world images is often limited by both low-light and low-resolution, however, these issues are only addressed in the literature through Low-Light Enhancement (LLE) and Super- Resolution (SR) methods. Admittedly, a simple cascade of these approaches cannot work harmoniously to cope well with the highly ill-posed problem for simultaneously enhancing visibility and resolution. I… ▽ More The visibility of real-world images is often limited by both low-light and low-resolution, however, these issues are only addressed in the literature through Low-Light Enhancement (LLE) and Super- Resolution (SR) methods. Admittedly, a simple cascade of these approaches cannot work harmoniously to cope well with the highly ill-posed problem for simultaneously enhancing visibility and resolution. In this paper, we propose a normalizing flow network, dubbed LoLiSRFLow, specifically designed to consider the degradation mechanism inherent in joint LLE and SR. To break the bonds of the one-to-many mapping for low-light low-resolution images to normal-light high-resolution images, LoLiSRFLow directly learns the conditional probability distribution over a variety of feasible solutions for high-resolution well-exposed images. Specifically, a multi-resolution parallel transformer acts as a conditional encoder that extracts the Retinex-induced resolution-and-illumination invariant map as the previous one. And the invertible network maps the distribution of usually exposed high-resolution images to a latent distribution. The backward inference is equivalent to introducing an additional constrained loss for the normal training route, thus enabling the manifold of the natural exposure of the high-resolution image to be immaculately depicted. We also propose a synthetic dataset modeling the realistic low-light low-resolution degradation, named DFSR-LLE, containing 7100 low-resolution dark-light/high-resolution normal sharp pairs. Quantitative and qualitative experimental results demonstrate the effectiveness of our method on both the proposed synthetic and real datasets. △ Less

Submitted 29 February, 2024; originally announced February 2024.

arXiv:2402.00320 [pdf]

DARCS: Memory-Efficient Deep Compressed Sensing Reconstruction for Acceleration of 3D Whole-Heart Coronary MR Angiography

Authors: Zhihao Xue, Fan Yang, Juan Gao, Zhuo Chen, Hao Peng, Chao Zou, Hang Jin, Chenxi Hu

Abstract: Three-dimensional coronary magnetic resonance angiography (CMRA) demands reconstruction algorithms that can significantly suppress the artifacts from a heavily undersampled acquisition. While unrolling-based deep reconstruction methods have achieved state-of-the-art performance on 2D image reconstruction, their application to 3D reconstruction is hindered by the large amount of memory needed to tr… ▽ More Three-dimensional coronary magnetic resonance angiography (CMRA) demands reconstruction algorithms that can significantly suppress the artifacts from a heavily undersampled acquisition. While unrolling-based deep reconstruction methods have achieved state-of-the-art performance on 2D image reconstruction, their application to 3D reconstruction is hindered by the large amount of memory needed to train an unrolled network. In this study, we propose a memory-efficient deep compressed sensing method by employing a sparsifying transform based on a pre-trained artifact estimation network. The motivation is that the artifact image estimated by a well-trained network is sparse when the input image is artifact-free, and less sparse when the input image is artifact-affected. Thus, the artifact-estimation network can be used as an inherent sparsifying transform. The proposed method, named De-Aliasing Regularization based Compressed Sensing (DARCS), was compared with a traditional compressed sensing method, de-aliasing generative adversarial network (DAGAN), model-based deep learning (MoDL), and plug-and-play for accelerations of 3D CMRA. The results demonstrate that the proposed method improved the reconstruction quality relative to the compared methods by a large margin. Furthermore, the proposed method well generalized for different undersampling rates and noise levels. The memory usage of the proposed method was only 63% of that needed by MoDL. In conclusion, the proposed method achieves improved reconstruction quality for 3D CMRA with reduced memory burden. △ Less

Submitted 2 February, 2024; v1 submitted 31 January, 2024; originally announced February 2024.

Comments: 10 pages, 8 figures

arXiv:2401.02099 [pdf]

Oceanship: A Large-Scale Dataset for Underwater Audio Target Recognition

Authors: Zeyu Li, Suncheng Xiang, Tong Yu, Jingsheng Gao, Jiacheng Ruan, Yanping Hu, Ting Liu, Yuzhuo Fu

Abstract: The recognition of underwater audio plays a significant role in identifying a vessel while it is in motion. Underwater target recognition tasks have a wide range of applications in areas such as marine environmental protection, detection of ship radiated noise, underwater noise control, and coastal vessel dispatch. The traditional UATR task involves training a network to extract features from audi… ▽ More The recognition of underwater audio plays a significant role in identifying a vessel while it is in motion. Underwater target recognition tasks have a wide range of applications in areas such as marine environmental protection, detection of ship radiated noise, underwater noise control, and coastal vessel dispatch. The traditional UATR task involves training a network to extract features from audio data and predict the vessel type. The current UATR dataset exhibits shortcomings in both duration and sample quantity. In this paper, we propose Oceanship, a large-scale and diverse underwater audio dataset. This dataset comprises 15 categories, spans a total duration of 121 hours, and includes comprehensive annotation information such as coordinates, velocity, vessel types, and timestamps. We compiled the dataset by crawling and organizing original communication data from the Ocean Communication Network (ONC) database between 2021 and 2022. While audio retrieval tasks are well-established in general audio classification, they have not been explored in the context of underwater audio recognition. Leveraging the Oceanship dataset, we introduce a baseline model named Oceannet for underwater audio retrieval. This model achieves a recall at 1 (R@1) accuracy of 67.11% and a recall at 5 (R@5) accuracy of 99.13% on the Deepship dataset. △ Less

Submitted 10 June, 2024; v1 submitted 4 January, 2024; originally announced January 2024.

Comments: Accepted by ICIC 2024

arXiv:2312.17030 [pdf, other]

Learning Multi-axis Representation in Frequency Domain for Medical Image Segmentation

Authors: Jiacheng Ruan, Jingsheng Gao, Mingye Xie, Suncheng Xiang

Abstract: Recently, Visual Transformer (ViT) has been extensively used in medical image segmentation (MIS) due to applying self-attention mechanism in the spatial domain to modeling global knowledge. However, many studies have focused on improving models in the spatial domain while neglecting the importance of frequency domain information. Therefore, we propose Multi-axis External Weights UNet (MEW-UNet) ba… ▽ More Recently, Visual Transformer (ViT) has been extensively used in medical image segmentation (MIS) due to applying self-attention mechanism in the spatial domain to modeling global knowledge. However, many studies have focused on improving models in the spatial domain while neglecting the importance of frequency domain information. Therefore, we propose Multi-axis External Weights UNet (MEW-UNet) based on the U-shape architecture by replacing self-attention in ViT with our Multi-axis External Weights block. Specifically, our block performs a Fourier transform on the three axes of the input features and assigns the external weight in the frequency domain, which is generated by our External Weights Generator. Then, an inverse Fourier transform is performed to change the features back to the spatial domain. We evaluate our model on four datasets, including Synapse, ACDC, ISIC17 and ISIC18 datasets, and our approach demonstrates competitive performance, owing to its effective utilization of frequency domain information. △ Less

Submitted 24 September, 2024; v1 submitted 28 December, 2023; originally announced December 2023.

Comments: This paper has been accepted by Machine Learning Journal

arXiv:2312.13752 [pdf]

doi 10.1016/j.media.2024.103253

Hunting imaging biomarkers in pulmonary fibrosis: Benchmarks of the AIIB23 challenge

Authors: Yang Nan, Xiaodan Xing, Shiyi Wang, Zeyu Tang, Federico N Felder, Sheng Zhang, Roberta Eufrasia Ledda, Xiaoliu Ding, Ruiqi Yu, Weiping Liu, Feng Shi, Tianyang Sun, Zehong Cao, Minghui Zhang, Yun Gu, Hanxiao Zhang, Jian Gao, Pingyu Wang, Wen Tang, Pengxin Yu, Han Kang, Junqiang Chen, Xing Lu, Boyu Zhang, Michail Mamalakis , et al. (16 additional authors not shown)

Abstract: Airway-related quantitative imaging biomarkers are crucial for examination, diagnosis, and prognosis in pulmonary diseases. However, the manual delineation of airway trees remains prohibitively time-consuming. While significant efforts have been made towards enhancing airway modelling, current public-available datasets concentrate on lung diseases with moderate morphological variations. The intric… ▽ More Airway-related quantitative imaging biomarkers are crucial for examination, diagnosis, and prognosis in pulmonary diseases. However, the manual delineation of airway trees remains prohibitively time-consuming. While significant efforts have been made towards enhancing airway modelling, current public-available datasets concentrate on lung diseases with moderate morphological variations. The intricate honeycombing patterns present in the lung tissues of fibrotic lung disease patients exacerbate the challenges, often leading to various prediction errors. To address this issue, the 'Airway-Informed Quantitative CT Imaging Biomarker for Fibrotic Lung Disease 2023' (AIIB23) competition was organized in conjunction with the official 2023 International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI). The airway structures were meticulously annotated by three experienced radiologists. Competitors were encouraged to develop automatic airway segmentation models with high robustness and generalization abilities, followed by exploring the most correlated QIB of mortality prediction. A training set of 120 high-resolution computerised tomography (HRCT) scans were publicly released with expert annotations and mortality status. The online validation set incorporated 52 HRCT scans from patients with fibrotic lung disease and the offline test set included 140 cases from fibrosis and COVID-19 patients. The results have shown that the capacity of extracting airway trees from patients with fibrotic lung disease could be enhanced by introducing voxel-wise weighted general union loss and continuity loss. In addition to the competitive image biomarkers for prognosis, a strong airway-derived biomarker (Hazard ratio>1.5, p<0.0001) was revealed for survival prognostication compared with existing clinical measurements, clinician assessment and AI-based biomarkers. △ Less

Submitted 16 April, 2024; v1 submitted 21 December, 2023; originally announced December 2023.

Comments: 19 pages

arXiv:2312.11460 [pdf, other]

Hybrid Internal Model: Learning Agile Legged Locomotion with Simulated Robot Response

Authors: Junfeng Long, Zirui Wang, Quanyi Li, Jiawei Gao, Liu Cao, Jiangmiao Pang

Abstract: Robust locomotion control depends on accurate state estimations. However, the sensors of most legged robots can only provide partial and noisy observations, making the estimation particularly challenging, especially for external states like terrain frictions and elevation maps. Inspired by the classical Internal Model Control principle, we consider these external states as disturbances and introdu… ▽ More Robust locomotion control depends on accurate state estimations. However, the sensors of most legged robots can only provide partial and noisy observations, making the estimation particularly challenging, especially for external states like terrain frictions and elevation maps. Inspired by the classical Internal Model Control principle, we consider these external states as disturbances and introduce Hybrid Internal Model (HIM) to estimate them according to the response of the robot. The response, which we refer to as the hybrid internal embedding, contains the robot's explicit velocity and implicit stability representation, corresponding to two primary goals for locomotion tasks: explicitly tracking velocity and implicitly maintaining stability. We use contrastive learning to optimize the embedding to be close to the robot's successor state, in which the response is naturally embedded. HIM has several appealing benefits: It only needs the robot's proprioceptions, i.e., those from joint encoders and IMU as observations. It innovatively maintains consistent observations between simulation reference and reality that avoids information loss in mimicking learning. It exploits batch-level information that is more robust to noises and keeps better sample efficiency. It only requires 1 hour of training on an RTX 4090 to enable a quadruped robot to traverse any terrain under any disturbances. A wealth of real-world experiments demonstrates its agility, even in high-difficulty tasks and cases never occurred during the training process, revealing remarkable open-world generalizability. △ Less

Submitted 1 January, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

Comments: Use 1 hour to train a quadruped robot capable of traversing any terrain under any disturbances in the open world, Project Page: https://github.com/OpenRobotLab/HIMLoco

arXiv:2312.02298 [pdf, other]

MoE-AMC: Enhancing Automatic Modulation Classification Performance Using Mixture-of-Experts

Authors: Jiaxin Gao, Qinglong Cao, Yuntian Chen

Abstract: Automatic Modulation Classification (AMC) plays a vital role in time series analysis, such as signal classification and identification within wireless communications. Deep learning-based AMC models have demonstrated significant potential in this domain. However, current AMC models inadequately consider the disparities in handling signals under conditions of low and high Signal-to-Noise Ratio (SNR)… ▽ More Automatic Modulation Classification (AMC) plays a vital role in time series analysis, such as signal classification and identification within wireless communications. Deep learning-based AMC models have demonstrated significant potential in this domain. However, current AMC models inadequately consider the disparities in handling signals under conditions of low and high Signal-to-Noise Ratio (SNR), resulting in an unevenness in their performance. In this study, we propose MoE-AMC, a novel Mixture-of-Experts (MoE) based model specifically crafted to address AMC in a well-balanced manner across varying SNR conditions. Utilizing the MoE framework, MoE-AMC seamlessly combines the strengths of LSRM (a Transformer-based model) for handling low SNR signals and HSRM (a ResNet-based model) for high SNR signals. This integration empowers MoE-AMC to achieve leading performance in modulation classification, showcasing its efficacy in capturing distinctive signal features under diverse SNR scenarios. We conducted experiments using the RML2018.01a dataset, where MoE-AMC achieved an average classification accuracy of 71.76% across different SNR levels, surpassing the performance of previous SOTA models by nearly 10%. This study represents a pioneering application of MoE techniques in the realm of AMC, offering a promising avenue for elevating signal classification accuracy within wireless communication systems. △ Less

Submitted 4 December, 2023; originally announced December 2023.

arXiv:2311.12223 [pdf, other]

Digital Twin-Based User-Centric Edge Continual Learning in Integrated Sensing and Communication

Authors: Shisheng Hu, Jie Gao, Xinyu Huang, Mushu Li, Kaige Qu, Conghao Zhou, Xuemin, Shen

Abstract: In this paper, we propose a digital twin (DT)-based user-centric approach for processing sensing data in an integrated sensing and communication (ISAC) system with high accuracy and efficient resource utilization. The considered scenario involves an ISAC device with a lightweight deep neural network (DNN) and a mobile edge computing (MEC) server with a large DNN. After collecting sensing data, the… ▽ More In this paper, we propose a digital twin (DT)-based user-centric approach for processing sensing data in an integrated sensing and communication (ISAC) system with high accuracy and efficient resource utilization. The considered scenario involves an ISAC device with a lightweight deep neural network (DNN) and a mobile edge computing (MEC) server with a large DNN. After collecting sensing data, the ISAC device either processes the data locally or uploads them to the server for higher-accuracy data processing. To cope with data drifts, the server updates the lightweight DNN when necessary, referred to as continual learning. Our objective is to minimize the long-term average computation cost of the MEC server by optimizing two decisions, i.e., sensing data offloading and sensing data selection for the DNN update. A DT of the ISAC device is constructed to predict the impact of potential decisions on the long-term computation cost of the server, based on which the decisions are made with closed-form formulas. Experiments on executing DNN-based human motion recognition tasks are conducted to demonstrate the outstanding performance of the proposed DT-based approach in computation cost minimization. △ Less

Submitted 20 November, 2023; originally announced November 2023.

Comments: submitted to IEEE ICC 2024

Showing 1–50 of 142 results for author: Gao, J