-
Self-supervised Deep Learning for Denoising in Ultrasound Microvascular Imaging
Authors:
Lijie Huang,
Jingyi Yin,
Jingke Zhang,
U-Wai Lok,
Ryan M. DeRuiter,
Jieyang Jin,
Kate M. Knoll,
Kendra E. Petersen,
James D. Krier,
Xiang-yang Zhu,
Gina K. Hesley,
Kathryn A. Robinson,
Andrew J. Bentall,
Thomas D. Atwell,
Andrew D. Rule,
Lilach O. Lerman,
Shigao Chen,
Chengwu Huang
Abstract:
Ultrasound microvascular imaging (UMI) is often hindered by low signal-to-noise ratio (SNR), especially in contrast-free or deep tissue scenarios, which impairs subsequent vascular quantification and reliable disease diagnosis. To address this challenge, we propose Half-Angle-to-Half-Angle (HA2HA), a self-supervised denoising framework specifically designed for UMI. HA2HA constructs training pairs from complementary angular subsets of beamformed radio-frequency (RF) blood flow data, across which vascular signals remain consistent while noise varies. HA2HA was trained using in-vivo contrast-free pig kidney data and validated across diverse datasets, including contrast-free and contrast-enhanced data from pig kidneys, as well as human liver and kidney. An improvement exceeding 15 dB in both contrast-to-noise ratio (CNR) and SNR was observed, indicating a substantial enhancement in image quality. In addition to power Doppler imaging, denoising directly in the RF domain is also beneficial for other downstream processing such as color Doppler imaging (CDI). CDI results of human liver derived from the HA2HA-denoised signals exhibited improved microvascular flow visualization, with a suppressed noisy background. HA2HA offers a label-free, generalizable, and clinically applicable solution for robust vascular imaging in both contrast-free and contrast-enhanced UMI.
Submitted 7 July, 2025;
originally announced July 2025.
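As a rough illustration of the pairing idea in the HA2HA abstract above, the sketch below splits per-angle beamformed frames into two complementary halves and compounds each half. The interleaved even/odd split, the `(n_angles, depth, width)` array layout, and the function name `half_angle_pair` are assumptions made for illustration; the abstract only states that the pairs come from complementary angular subsets.

```python
import numpy as np


def half_angle_pair(rf_angles: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Compound two complementary angular subsets into one training pair.

    rf_angles: array of shape (n_angles, depth, width), one beamformed
    frame per steering angle (hypothetical layout).
    """
    if rf_angles.ndim != 3 or rf_angles.shape[0] < 2:
        raise ValueError("expected (n_angles >= 2, depth, width)")
    # Assumed split: even-indexed angles vs. odd-indexed angles.
    half_a = rf_angles[0::2].mean(axis=0)
    half_b = rf_angles[1::2].mean(axis=0)
    return half_a, half_b


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vessel = rng.standard_normal((64, 64))           # shared "signal"
    noise = 0.5 * rng.standard_normal((8, 64, 64))   # independent per angle
    noisy_input, noisy_target = half_angle_pair(vessel + noise)
    print(noisy_input.shape, noisy_target.shape)
```

Both outputs share the same underlying signal while their noise is drawn independently, which is the property a Noise2Noise-style training pair relies on.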
-
DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment
Authors:
Ke-Han Lu,
Zhehuai Chen,
Szu-Wei Fu,
Chao-Han Huck Yang,
Sung-Feng Huang,
Chih-Kai Yang,
Chee-En Yu,
Chun-Wei Chen,
Wei-Chih Chen,
Chien-yu Huang,
Yi-Cheng Lin,
Yu-Xiang Lin,
Chi-An Fu,
Chun-Yi Kuan,
Wenze Ren,
Xuanjun Chen,
Wei-Ping Huang,
En-Pei Hu,
Tzu-Quan Lin,
Yuan-Kuei Wu,
Kuan-Po Huang,
Hsiao-Ying Huang,
Huang-Cheng Chou,
Kai-Wei Chang,
Cheng-Han Chiang
, et al. (3 additional authors not shown)
Abstract:
We introduce DeSTA2.5-Audio, a general-purpose Large Audio Language Model (LALM) designed for robust auditory perception and instruction-following, without requiring task-specific audio instruction-tuning. Recent LALMs typically augment Large Language Models (LLMs) with auditory capabilities by training on large-scale, manually curated or LLM-synthesized audio-instruction datasets. However, these approaches have often suffered from the catastrophic forgetting of the LLM's original language abilities. To address this, we revisit the data construction pipeline and propose DeSTA, a self-generated cross-modal alignment strategy in which the backbone LLM generates its own training targets. This approach preserves the LLM's native language proficiency while establishing effective audio-text alignment, thereby enabling zero-shot generalization without task-specific tuning. Using DeSTA, we construct DeSTA-AQA5M, a large-scale, task-agnostic dataset containing 5 million training samples derived from 7,000 hours of audio spanning 50 diverse datasets, including speech, environmental sounds, and music. DeSTA2.5-Audio achieves state-of-the-art or competitive performance across a wide range of audio-language benchmarks, including Dynamic-SUPERB, MMAU, SAKURA, Speech-IFEval, and VoiceBench. Comprehensive comparative studies demonstrate that our self-generated strategy outperforms widely adopted data construction and training strategies in both auditory perception and instruction-following capabilities. Our findings underscore the importance of carefully designed data construction in LALM development and offer practical insights for building robust, general-purpose LALMs.
Submitted 3 July, 2025;
originally announced July 2025.
-
Dynamical Multimodal Fusion with Mixture-of-Experts for Localizations
Authors:
Bohao Wang,
Zitao Shuai,
Fenghao Zhu,
Chongwen Huang,
Yongliang Shen,
Zhaoyang Zhang,
Qianqian Yang,
Sami Muhaidat,
Merouane Debbah
Abstract:
Multimodal fingerprinting is a crucial technique to sub-meter 6G integrated sensing and communications (ISAC) localization, but two hurdles block deployment: (i) the contribution each modality makes to the target position varies with the operating conditions such as carrier frequency, and (ii) spatial and fingerprint ambiguities markedly undermine localization accuracy, especially in non-line-of-sight (NLOS) scenarios. To solve these problems, we introduce SCADF-MoE, a spatial-context aware dynamic fusion network built on a soft mixture-of-experts backbone. SCADF-MoE first clusters neighboring points into short trajectories to inject explicit spatial context. Then, it adaptively fuses channel state information, angle of arrival profile, distance, and gain through its learnable MoE router, so that the most reliable cues dominate at each carrier band. The fused representation is fed to a modality-task MoE that simultaneously regresses the coordinates of every vertex in the trajectory and its centroid, thereby exploiting inter-point correlations. Finally, an auxiliary maximum-mean-discrepancy loss enforces expert diversity and mitigates gradient interference, stabilizing multi-task training. On three real urban layouts and three carrier bands (2.6, 6, 28 GHz), the model delivers consistent sub-meter MSE and halves unseen-NLOS error versus the best prior work. To our knowledge, this is the first work that leverages large-scale multimodal MoE for frequency-robust ISAC localization.
Submitted 1 July, 2025;
originally announced July 2025.
-
Point Cloud Environment-Based Channel Knowledge Map Construction
Authors:
Yancheng Wang,
Wei Guo,
Chuan Huang,
Guanying Chen,
Ye Zhang,
Shuguang Cui
Abstract:
Channel knowledge map (CKM) provides certain levels of channel state information (CSI) for an area of interest, serving as a critical enabler for environment-aware communications by reducing the overhead of frequent CSI acquisition. However, existing CKM construction schemes adopt over-simplified environment information, which significantly compromises their accuracy. To address this issue, this work proposes a joint model- and data-driven approach to construct CKM by leveraging point cloud environmental data along with a few samples of location-tagged channel information. First, we propose a novel point selector to identify subsets of the point cloud that contain environmental information relevant to multipath channel gains, by constructing a set of co-focal ellipsoids based on different times of arrival (ToAs). Then, we train a neural channel gain estimator to learn the mapping between each selected subset and its corresponding channel gain, using a real-world dataset we collected through field measurements, comprising environmental point clouds and corresponding channel data. Finally, experimental results demonstrate that, for CKM construction of the power delay profile (PDP), the proposed method achieves a root mean squared error (RMSE) of 2.95 dB, significantly lower than the 7.32 dB achieved by the conventional ray-tracing method; for CKM construction of received power values, i.e., the radio map, it achieves an RMSE of 1.04 dB, surpassing the Kriging interpolation method with an RMSE of 1.68 dB.
Submitted 26 June, 2025; v1 submitted 26 June, 2025;
originally announced June 2025.
-
Widely Linear Augmented Extreme Learning Machine Based Impairments Compensation for Satellite Communications
Authors:
Yang Luo,
Arunprakash Jayaprakash,
Gaojie Chen,
Chong Huang,
Qu Luo,
Pei Xiao
Abstract:
Satellite communications are crucial for the evolution beyond fifth-generation networks. However, the dynamic nature of satellite channels and their inherent impairments present significant challenges. In this paper, a novel post-compensation scheme that combines the complex-valued extreme learning machine with augmented hidden layer (CELMAH) architecture and widely linear processing (WLP) is developed to address these issues by exploiting signal impropriety in satellite communications. Although CELMAH shares structural similarities with WLP, it employs a different core algorithm and does not fully exploit the signal impropriety. By incorporating WLP principles, we derive a tailored formulation suited to the network structure and propose the CELM augmented by widely linear least squares (CELM-WLLS) for post-distortion. The proposed approach offers enhanced communication robustness and is highly effective for satellite communication scenarios characterized by dynamic channel conditions and non-linear impairments. CELM-WLLS is designed to improve signal recovery performance and outperform traditional methods such as least squares (LS) and minimum mean square error (MMSE). Compared to CELMAH, CELM-WLLS demonstrates a gain of approximately 0.8 dB in BER performance and also achieves a two-thirds reduction in computational complexity, making it a more efficient solution.
Submitted 19 June, 2025; v1 submitted 17 June, 2025;
originally announced June 2025.
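For readers unfamiliar with widely linear processing, mentioned in the abstract above, the sketch below shows a generic widely linear least-squares fit, in which the estimate depends on both the input and its complex conjugate. It is not the CELM-WLLS formulation from the paper; the function name `widely_linear_lstsq` and the synthetic data are assumptions for illustration only.

```python
import numpy as np


def widely_linear_lstsq(x: np.ndarray, t: np.ndarray):
    """Fit t ≈ x @ h + conj(x) @ g in the least-squares sense.

    x: (n_samples, n_features) complex inputs; t: (n_samples,) complex targets.
    Returns the pair (h, g). A strictly linear fit would force g = 0.
    """
    x = np.asarray(x, dtype=complex)
    t = np.asarray(t, dtype=complex)
    # Augment the regressors with their conjugates: [x, x*].
    augmented = np.hstack([x, np.conj(x)])
    w, *_ = np.linalg.lstsq(augmented, t, rcond=None)
    d = x.shape[1]
    return w[:d], w[d:]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((200, 3)) + 1j * rng.standard_normal((200, 3))
    h_true = np.array([1 + 2j, -0.5j, 0.3])
    g_true = np.array([0.2, 1j, -1 + 1j])
    t = x @ h_true + np.conj(x) @ g_true
    h_hat, g_hat = widely_linear_lstsq(x, t)
    print(np.round(h_hat, 3), np.round(g_hat, 3))
```

The conjugate branch is what lets the estimator exploit improper (non-circular) signals, which a strictly linear least-squares fit cannot do.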
-
Relaxation-Free Min-k-Partition for PCI Assignment in 5G Networks
Authors:
Yeqing Qiu,
Chengpiao Huang,
Ye Xue,
Zhipeng Jiang,
Qingjiang Shi,
Dong Zhang,
Zhi-Quan Luo
Abstract:
Physical Cell Identity (PCI) is a critical parameter in 5G networks. Efficient and accurate PCI assignment is essential for mitigating mod-3 interference, mod-30 interference, collisions, and confusions among cells, which directly affect network reliability and user experience. In this paper, we propose a novel framework for PCI assignment by decomposing the problem into Min-3-Partition, Min-10-Partition, and a graph coloring problem, leveraging the Chinese Remainder Theorem (CRT). Furthermore, we develop a relaxation-free approach to the general Min-k-Partition problem by reformulating it as a quadratic program with a norm-equality constraint and solving it using a penalized mirror descent (PMD) algorithm. The proposed method demonstrates superior computational efficiency and scalability, significantly reducing interference while eliminating collisions and confusions in large-scale 5G networks. Numerical evaluations on real-world datasets show that our approach reduces computational time by up to 20 times compared to state-of-the-art methods, making it highly practical for real-time PCI optimization in large-scale networks. These results highlight the potential of our method to improve network performance and reduce deployment costs in modern 5G systems.
Submitted 13 June, 2025; v1 submitted 12 June, 2025;
originally announced June 2025.
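The Chinese Remainder Theorem step in the abstract above can be illustrated with a small worked example: because 3 and 10 are coprime, a residue modulo 3 and a residue modulo 10 together determine a unique residue modulo 30. The sketch below shows only this arithmetic; how the paper maps partition labels to residues is not stated in the abstract, and the function name `combine_residues` is hypothetical.

```python
def combine_residues(r3: int, r10: int) -> int:
    """Return the unique r in [0, 30) with r % 3 == r3 and r % 10 == r10."""
    if not (0 <= r3 < 3 and 0 <= r10 < 10):
        raise ValueError("need 0 <= r3 < 3 and 0 <= r10 < 10")
    # CRT basis: 10 ≡ 1 (mod 3), 10 ≡ 0 (mod 10);
    #            21 ≡ 0 (mod 3), 21 ≡ 1 (mod 10).
    return (10 * r3 + 21 * r10) % 30


if __name__ == "__main__":
    r = combine_residues(2, 7)
    print(r, r % 3, r % 10)  # 17 2 7
```

Each of the 30 possible `(r3, r10)` pairs maps to a distinct value modulo 30, so two cells that differ in either label cannot share the same PCI residue modulo 30.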
-
Joint Routing and Control Optimization in VANET
Authors:
Chen Huang,
Dingxuan Wang,
Ronghui Hou
Abstract:
In this paper, we introduce DynaRoute, an adaptive joint optimization framework for dynamic vehicular networks that simultaneously addresses platoon control and data transmission through trajectory-aware routing and safety-constrained vehicle coordination. DynaRoute guarantees continuous vehicle movement via platoon safety control while optimizing transmission paths through real-time trajectory prediction and ensuring reliable data transmission. Our solution achieves three key objectives: (1) maintaining platoon stability through accurate data transmission, (2) enabling adaptive routing based on vehicle movement patterns, and (3) enhancing overall intelligent transportation system performance. DynaRoute requires predefined traffic models and adapts to dynamic network conditions using local vehicle state information. We present comprehensive simulation results demonstrating that DynaRoute maintains control and transmission performance in multiple complex scenarios while significantly improving throughput and reliability compared to traditional approaches.
Submitted 5 June, 2025;
originally announced June 2025.
-
Multimodal Spatial Language Maps for Robot Navigation and Manipulation
Authors:
Chenguang Huang,
Oier Mees,
Andy Zeng,
Wolfram Burgard
Abstract:
Grounding language to a navigating agent's observations can leverage pretrained multimodal foundation models to match perceptions to object or event descriptions. However, previous approaches remain disconnected from environment mapping, lack the spatial precision of geometric maps, or neglect additional modality information beyond vision. To address this, we propose multimodal spatial language maps as a spatial map representation that fuses pretrained multimodal features with a 3D reconstruction of the environment. We build these maps autonomously using standard exploration. We present two instances of our maps, which are visual-language maps (VLMaps) and their extension to audio-visual-language maps (AVLMaps) obtained by adding audio information. When combined with large language models (LLMs), VLMaps can (i) translate natural language commands into open-vocabulary spatial goals (e.g., "in between the sofa and TV") directly localized in the map, and (ii) be shared across different robot embodiments to generate tailored obstacle maps on demand. Building upon the capabilities above, AVLMaps extend VLMaps by introducing a unified 3D spatial representation integrating audio, visual, and language cues through the fusion of features from pretrained multimodal foundation models. This enables robots to ground multimodal goal queries (e.g., text, images, or audio snippets) to spatial locations for navigation. Additionally, the incorporation of diverse sensory inputs significantly enhances goal disambiguation in ambiguous environments. Experiments in simulation and real-world settings demonstrate that our multimodal spatial language maps enable zero-shot spatial and multimodal goal navigation and improve recall by 50% in ambiguous scenarios. These capabilities extend to mobile robots and tabletop manipulators, supporting navigation and interaction guided by visual, audio, and spatial cues.
Submitted 7 June, 2025;
originally announced June 2025.
-
Integrated Sensing, Computing and Semantic Communication for Vehicular Networks
Authors:
Yinchao Yang,
Zhaohui Yang,
Chongwen Huang,
Wei Xu,
Zhaoyang Zhang,
Dusit Niyato,
Mohammad Shikh-Bahaei
Abstract:
This paper introduces a novel framework for integrated sensing, computing, and semantic communication (ISCSC) within vehicular networks comprising a roadside unit (RSU) and multiple autonomous vehicles. Both the RSU and the vehicles are equipped with local knowledge bases to facilitate semantic communication. The framework incorporates a secure communication design to ensure that messages intended for specific vehicles are protected against interception. In this model, an extended Kalman filter (EKF) is employed by the RSU to accurately track all vehicles. We formulate a joint optimization problem that balances maximizing the probabilistically constrained semantic secrecy rate for each vehicle while minimizing the sum of the posterior Cramér-Rao bound (PCRB), subject to the RSU's computing capabilities. This non-convex optimization problem is addressed using Bernstein-type inequality (BTI) and alternating optimization (AO) techniques. Simulation results validate the effectiveness of the proposed framework, demonstrating its advantages in reliable sensing, high data throughput, and secure communication.
Submitted 31 May, 2025;
originally announced June 2025.
-
SpeechVerifier: Robust Acoustic Fingerprint against Tampering Attacks via Watermarking
Authors:
Lingfeng Yao,
Chenpei Huang,
Shengyao Wang,
Junpei Xue,
Hanqing Guo,
Jiang Liu,
Xun Chen,
Miao Pan
Abstract:
With the surge of social media, maliciously tampered public speeches, especially those from influential figures, have seriously affected social stability and public trust. Existing speech tampering detection methods remain insufficient: they either rely on external reference data or fail to be both sensitive to attacks and robust to benign operations, such as compression and resampling. To tackle these challenges, we introduce SpeechVerifier to proactively verify speech integrity using only the published speech itself, i.e., without requiring any external references. Inspired by audio fingerprinting and watermarking, SpeechVerifier can (i) effectively detect tampering attacks, (ii) be robust to benign operations, and (iii) verify integrity based only on published speeches. Briefly, SpeechVerifier utilizes multiscale feature extraction to capture speech features across different temporal resolutions. Then, it employs contrastive learning to generate fingerprints that can detect modifications at varying granularities. These fingerprints are designed to be robust to benign operations, but exhibit significant changes when malicious tampering occurs. To enable speech verification in a self-contained manner, the generated fingerprints are then embedded into the speech signal by segment-wise watermarking. Without external references, SpeechVerifier can retrieve the fingerprint from the published audio and check it against the embedded watermark to verify the integrity of the speech. Extensive experimental results demonstrate that the proposed SpeechVerifier is effective in detecting tampering attacks and robust to benign operations.
Submitted 1 June, 2025; v1 submitted 27 May, 2025;
originally announced May 2025.
-
ZeroSep: Separate Anything in Audio with Zero Training
Authors:
Chao Huang,
Yuesheng Ma,
Junxuan Huang,
Susan Liang,
Yunlong Tang,
Jing Bi,
Wenqiang Liu,
Nima Mesgarani,
Chenliang Xu
Abstract:
Audio source separation is fundamental for machines to understand complex acoustic environments and underpins numerous audio applications. Current supervised deep learning approaches, while powerful, are limited by the need for extensive, task-specific labeled data and struggle to generalize to the immense variability and open-set nature of real-world acoustic scenes. Inspired by the success of generative foundation models, we investigate whether pre-trained text-guided audio diffusion models can overcome these limitations. We make a surprising discovery: zero-shot source separation can be achieved purely through a pre-trained text-guided audio diffusion model under the right configuration. Our method, named ZeroSep, works by inverting the mixed audio into the diffusion model's latent space and then using text conditioning to guide the denoising process to recover individual sources. Without any task-specific training or fine-tuning, ZeroSep repurposes the generative diffusion model for a discriminative separation task and inherently supports open-set scenarios through its rich textual priors. ZeroSep is compatible with a variety of pre-trained text-guided audio diffusion backbones and delivers strong separation performance on multiple separation benchmarks, surpassing even supervised methods.
Submitted 29 May, 2025;
originally announced May 2025.
-
OpenNIRScap: An Open-Source, Low-Cost Wearable Near-Infrared Spectroscopy-based Brain Interfacing Cap
Authors:
Tony Kim,
Haotian Liu,
Chiung-Ting Huang,
Ingrid Wu,
Xilin Liu
Abstract:
Functional Near-Infrared Spectroscopy (fNIRS) is a non-invasive, real-time method for monitoring brain activity by measuring hemodynamic responses in the cerebral cortex. However, existing systems are expensive, bulky, and limited to clinical or research environments. This paper introduces OpenNIRScap, an open-source, low-cost, and wearable fNIRS system designed to make real-time brain monitoring more accessible in everyday environments. The device features 24 custom-designed sensor boards with dual-wavelength light emitters and photodiode detectors, a central electrical control unit (ECU) with analog multiplexing, and a real-time data processing pipeline. Bench validation and pilot tests on volunteers have confirmed the ability of the system to capture cognitively evoked hemodynamic responses, supporting its potential as an affordable tool for cognitive monitoring and portable neurotechnology applications. The hardware, software, and graphical user interface have all been open-sourced and made publicly available at the following link: https://github.com/tonykim07/fNIRS.
Submitted 26 May, 2025;
originally announced May 2025.
-
Smart Energy Guardian: A Hybrid Deep Learning Model for Detecting Fraudulent PV Generation
Authors:
Xiaolu Chen,
Chenghao Huang,
Yanru Zhang,
Hao Wang
Abstract:
With the proliferation of smart grids, smart cities face growing challenges due to cyber-attacks and sophisticated electricity theft behaviors, particularly in residential photovoltaic (PV) generation systems. Traditional Electricity Theft Detection (ETD) methods often struggle to capture complex temporal dependencies and integrate multi-source data, limiting their effectiveness. In this work, we propose an efficient ETD method that accurately identifies fraudulent behaviors in residential PV generation, thus ensuring the supply-demand balance in smart cities. Our hybrid deep learning model, combining multi-scale Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and Transformer, excels in capturing both short-term and long-term temporal dependencies. Additionally, we introduce a data embedding technique that seamlessly integrates time-series data with discrete temperature variables, enhancing detection robustness. Extensive simulation experiments using real-world data validate the effectiveness of our approach, demonstrating significant improvements in the accuracy of detecting sophisticated energy theft activities, thereby contributing to the stability and fairness of energy systems in smart cities.
Submitted 24 May, 2025;
originally announced May 2025.
-
Agent-Based Decentralized Energy Management of EV Charging Station with Solar Photovoltaics via Multi-Agent Reinforcement Learning
Authors:
Jiarong Fan,
Chenghao Huang,
Hao Wang
Abstract:
In the pursuit of energy net zero within smart cities, transportation electrification plays a pivotal role. The adoption of Electric Vehicles (EVs) keeps increasing, making energy management of EV charging stations critically important. While previous studies have managed to reduce the energy cost of EV charging while maintaining grid stability, they often overlook the robustness of EV charging management against uncertainties of various forms, such as varying charging behaviors and possible faults in some chargers. To address this gap, a novel Multi-Agent Reinforcement Learning (MARL) approach is proposed that treats each charger as an agent and coordinates all the agents in an EV charging station with solar photovoltaics in a more realistic scenario, where system faults may occur. A Long Short-Term Memory (LSTM) network is incorporated in the MARL algorithm to extract temporal features from time series. Additionally, a dense reward mechanism is designed for training the agents in the MARL algorithm to improve the EV charging experience. Through validation on a real-world dataset, we show that our approach is robust against system uncertainties and faults and also effective in minimizing EV charging costs and maximizing charging service satisfaction.
Submitted 24 May, 2025;
originally announced May 2025.
-
Season-Independent PV Disaggregation Using Multi-Scale Net Load Temporal Feature Extraction and Weather Factor Fusion
Authors:
Xiaolu Chen,
Chenghao Huang,
Yanru Zhang,
Hao Wang
Abstract:
With the advancement of the energy Internet and energy system integration, the increasing adoption of distributed photovoltaic (PV) systems presents new challenges in smart monitoring and measurement for utility companies, particularly in separating PV generation from net electricity load. Existing methods struggle with extracting features from net load and capturing the relevance between weather factors. This paper proposes a PV disaggregation method that integrates Hierarchical Interpolation (HI) and multi-head self-attention mechanisms. By using HI to extract net load features and multi-head self-attention to capture the complex dependencies between weather factors, the method achieves precise PV generation predictions. Simulation experiments demonstrate the effectiveness of the proposed method on real-world data, supporting improved monitoring and management of distributed energy systems.
Submitted 24 May, 2025;
originally announced May 2025.
-
Joint Magnetometer-IMU Calibration via Maximum A Posteriori Estimation
Authors:
Chuan Huang,
Gustaf Hendeby,
Isaac Skog
Abstract:
This paper presents a new approach for jointly calibrating magnetometers and inertial measurement units, focusing on improving calibration accuracy and computational efficiency. The proposed method formulates the calibration problem as a maximum a posteriori estimation problem, treating both the calibration parameters and orientation trajectory of the sensors as unknowns. This formulation enables efficient optimization with closed-form derivatives. The method is compared against two state-of-the-art approaches in terms of computational complexity and estimation accuracy. Simulation results demonstrate that the proposed method achieves lower root mean square error in calibration parameters while maintaining competitive computational efficiency. Further validation through real-world experiments confirms the practical benefits of our approach: it effectively reduces position drift in a magnetic field-aided inertial navigation system by more than a factor of two on most datasets. Moreover, the proposed method calibrated 30 magnetometers in less than 2 minutes. The contributions include a new calibration method, an analysis of existing methods, and a comprehensive empirical evaluation. Datasets and algorithms are made publicly available to promote reproducible research.
Submitted 27 May, 2025; v1 submitted 22 May, 2025;
originally announced May 2025.
-
FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation
Authors:
Yutong Liu,
Ziyue Zhang,
Ban Ma-bao,
Yuqing Cai,
Yongbin Yu,
Renzeng Duojie,
Xiangxiang Wang,
Fan Gao,
Cheng Huang,
Nyima Tashi
Abstract:
Tibetan is a low-resource language with minimal parallel speech corpora spanning its three major dialects (Ü-Tsang, Amdo, and Kham), which limits progress in speech modeling. To address this issue, we propose FMSD-TTS, a few-shot, multi-speaker, multi-dialect text-to-speech framework that synthesizes parallel dialectal speech from limited reference audio and explicit dialect labels. Our method features a novel speaker-dialect fusion module and a Dialect-Specialized Dynamic Routing Network (DSDR-Net) to capture fine-grained acoustic and linguistic variations across dialects while preserving speaker identity. Extensive objective and subjective evaluations demonstrate that FMSD-TTS significantly outperforms baselines in both dialectal expressiveness and speaker similarity. We further validate the quality and utility of the synthesized speech through a challenging speech-to-speech dialect conversion task. Our contributions include: (1) a novel few-shot TTS system tailored for Tibetan multi-dialect speech synthesis, (2) the public release of a large-scale synthetic Tibetan speech corpus generated by FMSD-TTS, and (3) an open-source evaluation toolkit for standardized assessment of speaker similarity, dialect consistency, and audio quality.
Submitted 20 May, 2025;
originally announced May 2025.
-
A Semantic Information-based Hierarchical Speech Enhancement Method Using Factorized Codec and Diffusion Model
Authors:
Yang Xiang,
Canan Huang,
Desheng Hu,
Jingguang Tian,
Xinhui Hu,
Chao Zhang
Abstract:
Most current speech enhancement (SE) methods recover clean speech from noisy inputs by directly estimating time-frequency masks or spectrums. However, these approaches often neglect the distinct attributes, such as semantic content and acoustic details, inherent in speech signals, which can hinder performance in downstream tasks. Moreover, their effectiveness tends to degrade in complex acoustic environments. To overcome these challenges, we propose a novel, semantic information-based, step-by-step factorized SE method using factorized codec and diffusion model. Unlike traditional SE methods, our hierarchical modeling of semantic and acoustic attributes enables more robust clean speech recovery, particularly in challenging acoustic scenarios. Moreover, this method offers further advantages for downstream TTS tasks. Experimental results demonstrate that our algorithm not only outperforms SOTA baselines in terms of speech quality but also enhances TTS performance in noisy environments.
Submitted 19 May, 2025;
originally announced May 2025.
-
Learning to Highlight Audio by Watching Movies
Authors:
Chao Huang,
Ruohan Gao,
J. M. F. Tsang,
Jan Kurcius,
Cagdas Bilen,
Chenliang Xu,
Anurag Kumar,
Sanjeel Parekh
Abstract:
Recent years have seen a significant increase in video content creation and consumption. Crafting engaging content requires the careful curation of both visual and audio elements. While visual cue curation, through techniques like optimal viewpoint selection or post-editing, has been central to media production, its natural counterpart, audio, has not undergone equivalent advancements. This often results in a disconnect between visual and acoustic saliency. To bridge this gap, we introduce a novel task: visually-guided acoustic highlighting, which aims to transform audio to deliver appropriate highlighting effects guided by the accompanying video, ultimately creating a more harmonious audio-visual experience. We propose a flexible, transformer-based multimodal framework to solve this task. To train our model, we also introduce a new dataset -- the muddy mix dataset, leveraging the meticulous audio and video crafting found in movies, which provides a form of free supervision. We develop a pseudo-data generation process to simulate poorly mixed audio, mimicking real-world scenarios through a three-step process -- separation, adjustment, and remixing. Our approach consistently outperforms several baselines in both quantitative and subjective evaluation. We also systematically study the impact of different types of contextual guidance and difficulty levels of the dataset. Our project page is here: https://wikichao.github.io/VisAH/.
Submitted 17 May, 2025;
originally announced May 2025.
-
FLAM: Frame-Wise Language-Audio Modeling
Authors:
Yusong Wu,
Christos Tsirigotis,
Ke Chen,
Cheng-Zhi Anna Huang,
Aaron Courville,
Oriol Nieto,
Prem Seetharaman,
Justin Salamon
Abstract:
Recent multi-modal audio-language models (ALMs) excel at text-audio retrieval but struggle with frame-wise audio understanding. Prior works use temporal-aware labels or unsupervised training to improve frame-wise capabilities, but they still lack fine-grained labeling capability to pinpoint when an event occurs. While traditional sound event detection models can precisely localize events, they are limited to pre-defined categories, making them ineffective for real-world scenarios with out-of-distribution events. In this work, we introduce FLAM, an open-vocabulary contrastive audio-language model capable of localizing specific sound events. FLAM employs a memory-efficient and calibrated frame-wise objective with logit adjustment to address spurious correlations, such as event dependencies and label imbalances during training. To enable frame-wise supervision, we leverage a large-scale dataset with diverse audio events, LLM-generated captions and simulation. Experimental results and case studies demonstrate that FLAM significantly improves the open-vocabulary localization capability while maintaining strong performance in global retrieval and downstream tasks.
Submitted 8 June, 2025; v1 submitted 8 May, 2025;
originally announced May 2025.
-
Over-the-Air ODE-Inspired Neural Network for Dual Task-Oriented Semantic Communications
Authors:
Mengbing Liu,
Jiancheng An,
Chongwen Huang,
Chau Yuen
Abstract:
Analog machine-learning hardware platforms promise greater speed and energy efficiency than their digital counterparts. Specifically, over-the-air analog computation allows offloading computation to the wireless propagation through carefully constructed transmitted signals. In addition, reconfigurable intelligent surface (RIS) is emerging as a promising solution for next-generation wireless networks, offering the ability to tailor the communication environment. Leveraging the advantages of RIS, we design and implement the ordinary differential equation (ODE) neural network using over-the-air computation (AirComp) and demonstrate its effectiveness for dual tasks. We engineer the ambient wireless propagation environment through distributed RISs to create an architecture termed the over-the-air ordinary differential equation (Air-ODE) network. Unlike the conventional digital ODE-inspired neural network, the Air-ODE block utilizes the physics of wave reflection and the reconfigurable phase shifts of RISs to implement an ODE block in the analog domain, enhancing spectrum efficiency. Moreover, the advantages of Air-ODE are demonstrated in a deep learning-based semantic communication (DeepSC) system by extracting effective semantic information to reduce the data transmission load, while achieving the dual functions of image reconstruction and semantic tagging simultaneously at the receiver. Simulation results show that the analog Air-ODE network can achieve similar performance to the digital ODE-inspired network. Specifically, for the image reconstruction and semantic tagging task, compared with the analog network without the Air-ODE block, the Air-ODE block can achieve around 2 times gain in both reconstruction quality and tagging accuracy.
Submitted 8 May, 2025;
originally announced May 2025.
-
Robust Deep Learning-Based Physical Layer Communications: Strategies and Approaches
Authors:
Fenghao Zhu,
Xinquan Wang,
Chen Zhu,
Tierui Gong,
Zhaohui Yang,
Chongwen Huang,
Xiaoming Chen,
Zhaoyang Zhang,
Mérouane Debbah
Abstract:
Deep learning (DL) has emerged as a transformative technology with immense potential to reshape the sixth-generation (6G) wireless communication network. By utilizing advanced algorithms for feature extraction and pattern recognition, DL provides unprecedented capabilities in optimizing network efficiency and performance, particularly in physical layer communications. Although DL technologies present great potential, they also face significant robustness challenges, which are expected to intensify in the complex and demanding 6G environment. Specifically, current DL models typically exhibit substantial performance degradation in dynamic environments with time-varying channels, noise interference, and differing scenarios, which affects their effectiveness in diverse real-world applications. This paper provides a comprehensive overview of strategies and approaches for robust DL-based methods in physical layer communications. First, we introduce the key challenges that current DL models face. Then, we examine in detail DL approaches specifically tailored to enhance robustness in 6G, which are classified into data-driven and model-driven strategies. Finally, we verify the effectiveness of these methods through case studies and outline future research directions.
Submitted 2 May, 2025;
originally announced May 2025.
-
Swapped Logit Distillation via Bi-level Teacher Alignment
Authors:
Stephen Ekaputra Limantoro,
Jhe-Hao Lin,
Chih-Yu Wang,
Yi-Lung Tsai,
Hong-Han Shuai,
Ching-Chun Huang,
Wen-Huang Cheng
Abstract:
Knowledge distillation (KD) compresses the network capacity by transferring knowledge from a large (teacher) network to a smaller one (student). It has been mainstream that the teacher directly transfers knowledge to the student with its original distribution, which can possibly lead to incorrect predictions. In this article, we propose a logit-based distillation via swapped logit processing, namely Swapped Logit Distillation (SLD). SLD is proposed under two assumptions: (1) the wrong prediction occurs when the prediction label confidence is not the maximum; (2) the "natural" limit of probability remains uncertain as the best value addition to the target cannot be determined. To address these issues, we propose a swapped logit processing scheme. Through this approach, we find that the swap method can be effectively extended to teacher and student outputs, transforming into two teachers. We further introduce loss scheduling to boost the performance of two teachers' alignment. Extensive experiments on image classification tasks demonstrate that SLD consistently performs best among previous state-of-the-art methods.
Submitted 27 April, 2025;
originally announced April 2025.
-
Wireless Large AI Model: Shaping the AI-Native Future of 6G and Beyond
Authors:
Fenghao Zhu,
Xinquan Wang,
Xinyi Li,
Maojun Zhang,
Yixuan Chen,
Chongwen Huang,
Zhaohui Yang,
Xiaoming Chen,
Zhaoyang Zhang,
Richeng Jin,
Yongming Huang,
Wei Feng,
Tingting Yang,
Baoming Bai,
Feifei Gao,
Kun Yang,
Yuanwei Liu,
Sami Muhaidat,
Chau Yuen,
Kaibin Huang,
Kai-Kit Wong,
Dusit Niyato,
Mérouane Debbah
Abstract:
The emergence of sixth-generation and beyond communication systems is expected to fundamentally transform digital experiences through introducing unparalleled levels of intelligence, efficiency, and connectivity. A promising technology poised to enable this revolutionary vision is the wireless large AI model (WLAM), characterized by its exceptional capabilities in data processing, inference, and decision-making. In light of these remarkable capabilities, this paper provides a comprehensive survey of WLAM, elucidating its fundamental principles, diverse applications, critical challenges, and future research opportunities. We begin by introducing the background of WLAM and analyzing the key synergies with wireless networks, emphasizing the mutual benefits. Subsequently, we explore the foundational characteristics of WLAM, delving into their unique relevance in wireless environments. Then, the role of WLAM in optimizing wireless communication systems across various use cases and the reciprocal benefits are systematically investigated. Furthermore, we discuss the integration of WLAM with emerging technologies, highlighting their potential to enable transformative capabilities and breakthroughs in wireless communication. Finally, we thoroughly examine the high-level challenges hindering the practical implementation of WLAM and discuss pivotal future research directions.
Submitted 28 April, 2025; v1 submitted 20 April, 2025;
originally announced April 2025.
-
Beamforming Design and Association Scheme for Multi-RIS Multi-User mmWave Systems Through Graph Neural Networks
Authors:
Mengbing Liu,
Chongwen Huang,
Ahmed Alhammadi,
Marco Di Renzo,
Merouane Debbah,
Chau Yuen
Abstract:
Reconfigurable intelligent surface (RIS) is emerging as a promising technology for next-generation wireless communication networks, offering a variety of merits such as the ability to tailor the communication environment. Moreover, deploying multiple RISs helps mitigate severe signal blocking between the base station (BS) and users, providing a practical and efficient solution to enhance the service coverage. However, fully reaping the potential of a multi-RIS aided communication system requires solving a non-convex optimization problem. This challenge motivates the adoption of learning-based methods for determining the optimal policy. In this paper, we introduce a novel heterogeneous graph neural network (GNN) to effectively leverage the graph topology of a wireless communication environment. Specifically, we design an association scheme that selects a suitable RIS for each user. Then, we maximize the weighted sum rate (WSR) of all the users by iteratively optimizing the RIS association scheme and the beamforming designs until the considered heterogeneous GNN converges. Based on the proposed approach, each user is associated with the best RIS, which is shown to significantly improve the system capacity in multi-RIS multi-user millimeter wave (mmWave) communications. Simulation results demonstrate that the proposed heterogeneous GNN closely approaches the performance of the high-complexity alternating optimization (AO) algorithm in the considered multi-RIS aided communication system, and it outperforms other benchmark schemes. Moreover, the performance improvement achieved through the RIS association scheme is shown to be on the order of 30%.
Submitted 19 April, 2025;
originally announced April 2025.
-
The Communication and Computation Trade-off in Wireless Semantic Communications
Authors:
Xuyang Chen,
Chong Huang,
Gaojie Chen,
Daquan Feng,
Pei Xiao
Abstract:
Semantic communications have emerged as a crucial research direction for future wireless communication networks. However, as wireless systems become increasingly complex, the demands for computation and communication resources in semantic communications continue to grow rapidly. This paper investigates the trade-off between computation and communication in wireless semantic communications, taking into consideration transmission task delay and performance constraints within the semantic communication framework. We propose a novel tradeoff metric to analyze the balance between computation and communication in semantic transmissions and employ the deep reinforcement learning (DRL) algorithm to minimize this metric, thereby reducing the cost associated with balancing computation and communication. Through simulations, we analyze the tradeoff between computation and communication and demonstrate the effectiveness of optimizing this trade-off metric.
Submitted 13 May, 2025; v1 submitted 14 April, 2025;
originally announced April 2025.
-
Analogical Learning for Cross-Scenario Generalization: Framework and Application to Intelligent Localization
Authors:
Zirui Chen,
Zhaoyang Zhang,
Ziqing Xing,
Ridong Li,
Zhaohui Yang,
Richeng Jin,
Chongwen Huang,
Yuzhi Yang,
Mérouane Debbah
Abstract:
Existing learning models often exhibit poor generalization when deployed across diverse scenarios. This is primarily because the underlying reference frame of the data varies with the deployment environment and settings. However, although the data of each scenario has a distinct reference frame, its generation generally follows common underlying physical rules. Based on this understanding, this article proposes a deep learning framework named analogical learning (AL), which implicitly retrieves the reference frame information associated with a scenario and then makes accurate predictions by relative analogy with other scenarios. Specifically, we design a bipartite neural network called Mateformer. Its first part captures the relativity within multiple latent feature spaces between the input data and a small amount of embedded data from the studied scenario, while its second part uses this relativity to guide the nonlinear analogy. We apply AL to the typical multi-scenario learning problem of intelligent wireless localization in cellular networks. Extensive experiments validate AL's superiority across three key dimensions. First, it achieves state-of-the-art accuracy in single-scenario benchmarks. Second, it demonstrates stable transferability between different scenarios, avoiding catastrophic forgetting. Finally, and most importantly, it robustly adapts to new, unseen scenarios--including dynamic weather and traffic conditions--without any tuning. All data and code are available at https://github.com/ziruichen-research/ALLoc.
Submitted 30 June, 2025; v1 submitted 8 April, 2025;
originally announced April 2025.
-
TeleMoM: Consensus-Driven Telecom Intelligence via Mixture of Models
Authors:
Xinquan Wang,
Fenghao Zhu,
Chongwen Huang,
Zhaohui Yang,
Zhaoyang Zhang,
Sami Muhaidat,
Chau Yuen,
Mérouane Debbah
Abstract:
Large language models (LLMs) face significant challenges in specialized domains like telecommunication (Telecom) due to technical complexity, specialized terminology, and rapidly evolving knowledge. Traditional methods, such as scaling model parameters or retraining on domain-specific corpora, are computationally expensive and yield diminishing returns, while existing approaches like retrieval-augmented generation, mixture of experts, and fine-tuning struggle with accuracy, efficiency, and coordination. To address this issue, we propose Telecom mixture of models (TeleMoM), a consensus-driven ensemble framework that integrates multiple LLMs for enhanced decision-making in Telecom. TeleMoM employs a two-stage process: proponent models generate justified responses, and an adjudicator finalizes decisions, supported by a quality-checking mechanism. This approach leverages strengths of diverse models to improve accuracy, reduce biases, and handle domain-specific complexities effectively. Evaluation results demonstrate that TeleMoM achieves a 9.7% increase in answer accuracy, highlighting its effectiveness in Telecom applications.
Submitted 1 June, 2025; v1 submitted 3 April, 2025;
originally announced April 2025.
-
Liquid Neural Networks: Next-Generation AI for Telecom from First Principles
Authors:
Fenghao Zhu,
Xinquan Wang,
Chen Zhu,
Chongwen Huang
Abstract:
Artificial intelligence (AI) has emerged as a transformative technology with immense potential to reshape the next-generation of wireless networks. By leveraging advanced algorithms and machine learning techniques, AI offers unprecedented capabilities in optimizing network performance, enhancing data processing efficiency, and enabling smarter decision-making processes. However, existing AI solutions face significant challenges in terms of robustness and interpretability. Specifically, current AI models exhibit substantial performance degradation in dynamic environments with varying data distributions, and the black-box nature of these algorithms raises concerns regarding safety, transparency, and fairness. This presents a major challenge in integrating AI into practical communication systems. Recently, a novel type of neural network, known as the liquid neural networks (LNNs), has been designed from first principles to address these issues. In this paper, we explore the potential of LNNs in telecommunications. First, we illustrate the mechanisms of LNNs and highlight their unique advantages over traditional networks. Then we unveil the opportunities that LNNs bring to future wireless networks. Furthermore, we discuss the challenges and design directions for the implementation of LNNs. Finally, we summarize the performance of LNNs in two case studies.
Submitted 3 April, 2025;
originally announced April 2025.
-
Investigation of intelligent barbell squat coaching system based on computer vision and machine learning
Authors:
Yinq-Rong Chern,
Yuhao Lee,
Hsiao-Ching Lin,
Guan-Ting Chen,
Ying-Hsien Chen,
Fu-Sung Lin,
Chih-Yao Chuang,
Jenn-Jier James Lien,
Chih-Hsien Huang
Abstract:
Purpose: Research has revealed that strength training can reduce the incidence of chronic diseases and physical deterioration at any age. A movement diagnostic system is therefore crucial for people who train alone. Hence, this study developed an artificial intelligence and computer vision-based barbell squat coaching system with a real-time mode that immediately diagnoses the issue and provides feedback after each squat. In addition, a replay mode allows users to examine their previous squats and check their comments. Initially, four primary characteristics of the barbell squat were identified: body joint angles, dorsiflexion, the ratio of knee-to-hip movement, and barbell stability. Methods: We collected 8,151 squats from 77 participants, categorizing them as good squats or as one of six issues. Then, we trained the diagnosis models with three machine-learning architectures. Furthermore, this research applied the SHapley Additive exPlanations (SHAP) method to enhance the accuracy of issue prediction and reduce the computation time by feature selection. Results: The F1 scores of the six issues reached 86.86%, 69.01%, 77.42%, 90.74%, 95.83%, and 100%. Each squat diagnosis took less than 0.5 seconds. Finally, this study examined the efficacy of the proposed system with two groups of participants trained with and without the system. Participants trained with the system exhibited substantial improvements in their squat technique, as assessed both by the system itself and by a professional weightlifting coach. Conclusion: This is a comprehensive study that integrates artificial intelligence, computer vision and multivariable processing technologies, aimed at building a real-time, user-friendly barbell squat feedback and training system.
Submitted 31 March, 2025;
originally announced March 2025.
-
GlaGAN: A Generative Unsupervised Model for High-Precision Segmentation of Retinal Main Vessels toward Early Detection of Glaucoma
Authors:
Cheng Huang,
Weizheng Xie,
Tsengdar J. Lee,
Jui-Kai Wang,
Karanjit Kooner,
Ning Zhang,
Jia Zhang
Abstract:
Structural changes in the main retinal blood vessels are critical biomarkers for glaucoma onset and progression. Identifying these vessels is essential for vascular modeling yet highly challenging. This paper introduces GlaGAN, an unsupervised generative AI model for segmenting main blood vessels in Optical Coherence Tomography Angiography (OCTA) images. The process begins with the Space Colonization Algorithm (SCA) to rapidly generate vessel skeletons, including radius estimations. By synergistically integrating generative adversarial networks (GANs) with biostatistical modeling of vessel radii, GlaGAN efficiently reconstructs 2D and 3D representations, achieving nearly 100% segmentation accuracy without requiring labeled data or high-performance computing resources. To address data scarcity, we also present GSS-RetVein, a high-definition mixed 2D/3D glaucoma retinal dataset featuring clear capillary structures. Designed for robustness testing, GSS-RetVein incorporates controlled noise while maintaining sharp capillary boundaries in 2D and enhancing 3D vascular reconstruction for blood flow prediction and glaucoma progression simulations. Experimental results demonstrate GSS-RetVein outperforms existing datasets in evaluating main vessel segmentation. Code and dataset are available: https://github.com/VikiXie/SatMar8.
Submitted 7 July, 2025; v1 submitted 9 March, 2025;
originally announced March 2025.
-
Seeing Delta Parameters as JPEG Images: Data-Free Delta Compression with Discrete Cosine Transform
Authors:
Chenyu Huang,
Peng Ye,
Xiaohui Wang,
Shenghe Zheng,
Biqing Qi,
Lei Bai,
Wanli Ouyang,
Tao Chen
Abstract:
With transformer-based models and the pretrain-finetune paradigm becoming mainstream, the high storage and deployment costs of individual finetuned models on multiple tasks pose critical challenges. Delta compression attempts to lower the costs by reducing the redundancy of delta parameters (i.e., the difference between the finetuned and pre-trained model weights). However, existing methods usually face problems including data accessibility and training requirements. To tackle this issue, we introduce Delta-DCT, the first data-free delta compression method inspired by classic JPEG image compression, leveraging the Discrete Cosine Transform (DCT). We first (a) group delta parameters within a layer into patches. Then we (b) assess the importance of each patch and allocate them with different quantization bit-widths. Afterwards, we (c) convert these patches to the DCT domain and conduct quantization to each patch based on the allocated bit-width. The proposed Delta-DCT does not require any training or data calibration, while achieving performance comparable to or even surpassing original finetuned models under 1-bit equivalent delta compression ratios on different kinds of models including: (1) recently-released LLMs of different sizes from 7B to 13B, (2) relatively smaller language models including RoBERTa and T5 models, (3) variants of vision transformer models, and (4) multi-modal BEiT-3 models.
Submitted 9 March, 2025;
originally announced March 2025.
-
Omnidirectional Multi-Object Tracking
Authors:
Kai Luo,
Hao Shi,
Sheng Wu,
Fei Teng,
Mengfei Duan,
Chang Huang,
Yuhang Wang,
Kaiwei Wang,
Kailun Yang
Abstract:
Panoramic imagery, with its 360° field of view, offers comprehensive information to support Multi-Object Tracking (MOT) in capturing spatial and temporal relationships of surrounding objects. However, most MOT algorithms are tailored for pinhole images with limited views, impairing their effectiveness in panoramic settings. Additionally, panoramic image distortions, such as resolution loss, geometric deformation, and uneven lighting, hinder direct adaptation of existing MOT methods, leading to significant performance degradation. To address these challenges, we propose OmniTrack, an omnidirectional MOT framework that incorporates Tracklet Management to introduce temporal cues, FlexiTrack Instances for object localization and association, and the CircularStatE Module to alleviate image and geometric distortions. This integration enables tracking in panoramic field-of-view scenarios, even under rapid sensor motion. To mitigate the lack of panoramic MOT datasets, we introduce the QuadTrack dataset--a comprehensive panoramic dataset collected by a quadruped robot, featuring diverse challenges such as panoramic fields of view, intense motion, and complex environments. Extensive experiments on the public JRDB dataset and the newly introduced QuadTrack benchmark demonstrate the state-of-the-art performance of the proposed framework. OmniTrack achieves a HOTA score of 26.92% on JRDB, representing an improvement of 3.43%, and further achieves 23.45% on QuadTrack, surpassing the baseline by 6.81%. The established dataset and source code are available at https://github.com/xifen523/OmniTrack.
Submitted 23 March, 2025; v1 submitted 6 March, 2025;
originally announced March 2025.
-
On the Connection Between Magnetic-Field Odometry Aided Inertial Navigation and Magnetic-Field SLAM
Authors:
Isaac Skog,
Manon Kok,
Gustaf Hendeby,
Chuan Huang,
Thomas Edridge
Abstract:
Magnetic-field simultaneous localization and mapping (SLAM) using consumer-grade inertial and magnetometer sensors offers a scalable, cost-effective solution for indoor localization. However, the rapid error accumulation in the inertial navigation process limits the feasible exploratory phases of these systems. Advances in magnetometer array processing have demonstrated that odometry information, i.e., displacement and rotation information, can be extracted from local magnetic field variations and used to create magnetic-field odometry-aided inertial navigation systems. The error growth rate of these systems is significantly lower than that of standalone inertial navigation systems. This study seeks an answer to whether a magnetic-field SLAM system fed with measurements from a magnetometer array can indirectly extract odometry information -- without requiring algorithmic modifications -- and thus sustain longer exploratory phases. The theoretical analysis and simulation results show that such a system can extract odometry information and indirectly create a magnetic field odometry-aided inertial navigation system during the exploration phases. However, practical challenges related to map resolution and computational complexity remain significant.
Submitted 14 May, 2025; v1 submitted 6 March, 2025;
originally announced March 2025.
-
Joint Size and Placement Optimization for IRS-Aided Communications with Active and Passive Elements
Authors:
Qiaoyan Peng,
Qingqing Wu,
Wen Chen,
Chaoying Huang,
Beixiong Zheng,
Shaodan Ma,
Mengnan Jian,
Yijian Chen,
Jun Yang
Abstract:
Different types of intelligent reflecting surfaces (IRS) are exploited for assisting wireless communications. The joint use of passive IRS (PIRS) and active IRS (AIRS) emerges as a promising solution owing to their complementary advantages. They can be integrated into a single hybrid active-passive IRS (HIRS) or deployed in a distributed manner, which poses challenges in determining the IRS element allocation and placement for rate maximization. In this paper, we investigate the capacity of an IRS-aided wireless communication system with both active and passive elements. Specifically, we consider three deployment schemes: 1) base station (BS)-HIRS-user (BHU); 2) BS-AIRS-PIRS-user (BAPU); 3) BS-PIRS-AIRS-user (BPAU). Under the line-of-sight channel model, we formulate a rate maximization problem via a joint optimization of the IRS element allocation and placement. We first derive the optimized number of active and passive elements for BHU, BAPU, and BPAU schemes, respectively. Then, low-complexity HIRS/AIRS placement strategies are provided. To obtain more insights, we characterize the system capacity scaling orders for the three schemes with respect to the large total number of IRS elements, amplification power budget, and BS transmit power. Finally, simulation results are presented to validate our theoretical findings and show the performance difference among the BHU, BAPU, and BPAU schemes with the proposed joint design under various system setups.
Submitted 24 February, 2025;
originally announced February 2025.
-
A Racing Dataset and Baseline Model for Track Detection in Autonomous Racing
Authors:
Shreya Ghosh,
Yi-Huan Chen,
Ching-Hsiang Huang,
Abu Shafin Mohammad Mahdee Jameel,
Chien Chou Ho,
Aly El Gamal,
Samuel Labi
Abstract:
A significant challenge in racing-related research is the lack of publicly available datasets containing raw images with corresponding annotations for the downstream task. In this paper, we introduce RoRaTrack, a novel dataset that contains annotated multi-camera image data from racing scenarios for track detection. The data is collected on a Dallara AV-21 at a racing circuit in Indiana, in collaboration with the Indy Autonomous Challenge (IAC). RoRaTrack addresses common problems such as blurriness due to high speed, color inversion from the camera, and absence of lane markings on the track. Consequently, we propose RaceGAN, a baseline model based on a Generative Adversarial Network (GAN) that effectively addresses these challenges. The proposed model demonstrates superior performance compared to current state-of-the-art machine learning models in track detection. The dataset and code for this work are available at github.com/RaceGAN.
Submitted 19 February, 2025;
originally announced February 2025.
-
A Preliminary Exploration with GPT-4o Voice Mode
Authors:
Yu-Xiang Lin,
Chih-Kai Yang,
Wei-Chih Chen,
Chen-An Li,
Chien-yu Huang,
Xuanjun Chen,
Hung-yi Lee
Abstract:
With the rise of multimodal large language models, GPT-4o stands out as a pioneering model, driving us to evaluate its capabilities. This report assesses GPT-4o across various tasks to analyze its audio processing and reasoning abilities. We find that GPT-4o exhibits strong knowledge in audio, speech, and music understanding, performing well in tasks like intent classification, spoken command classification, semantic and grammatical reasoning, multilingual speech recognition, and singing analysis. It also shows greater robustness against hallucinations than other large audio-language models (LALMs). However, it struggles with tasks such as audio duration prediction and instrument classification. Additionally, GPT-4o's safety mechanisms cause it to decline tasks like speaker identification, age classification, MOS prediction, and audio deepfake detection. Notably, the model exhibits a significantly different refusal rate when responding to speaker verification tasks on different datasets. This is likely due to variations in the accompanying instructions or the quality of the input audio, suggesting the sensitivity of its built-in safeguards. Finally, we acknowledge that model performance varies with evaluation protocols. This report only serves as a preliminary exploration of the current state of LALMs.
Submitted 14 February, 2025;
originally announced February 2025.
-
Utilizing 3D Fast Spin Echo Anatomical Imaging to Reduce the Number of Contrast Preparations in $T_{1ρ}$ Quantification of Knee Cartilage Using Learning-Based Methods
Authors:
Junru Zhong,
Chaoxing Huang,
Ziqiang Yu,
Fan Xiao,
Siyue Li,
Tim-Yun Michael Ong,
Ki-Wai Kevin Ho,
Queenie Chan,
James F. Griffith,
Weitian Chen
Abstract:
Purpose: To propose and evaluate an accelerated $T_{1ρ}$ quantification method that combines $T_{1ρ}$-weighted fast spin echo (FSE) images and proton density (PD)-weighted anatomical FSE images, leveraging deep learning models for $T_{1ρ}$ mapping. The goal is to reduce scan time and facilitate integration into routine clinical workflows for osteoarthritis (OA) assessment. Methods: This retrospective study utilized MRI data from 40 participants (30 OA patients and 10 healthy volunteers). A volume of PD-weighted anatomical FSE images and a volume of $T_{1ρ}$-weighted images acquired at a non-zero spin-lock time were used as input to train deep learning models, including a 2D U-Net and a multi-layer perceptron (MLP). $T_{1ρ}$ maps generated by these models were compared with ground truth maps derived from a traditional non-linear least squares (NLLS) fitting method using four $T_{1ρ}$-weighted images. Evaluation metrics included mean absolute error (MAE), mean absolute percentage error (MAPE), regional error (RE), and regional percentage error (RPE). Results: Deep learning models achieved RPEs below 5% across all evaluated scenarios, outperforming NLLS methods, especially in low signal-to-noise conditions. The best results were obtained using the 2D U-Net, which effectively leveraged spatial information for accurate $T_{1ρ}$ fitting. The proposed method demonstrated compatibility with shorter TSLs, alleviating RF hardware and specific absorption rate (SAR) limitations. Conclusion: The proposed approach enables efficient $T_{1ρ}$ mapping using PD-weighted anatomical images, reducing scan time while maintaining clinical standards. This method has the potential to facilitate the integration of quantitative MRI techniques into routine clinical practice, benefiting OA diagnosis and monitoring.
Submitted 13 February, 2025;
originally announced February 2025.
-
Electromagnetic Channel Modeling and Capacity Analysis for HMIMO Communications
Authors:
Li Wei,
Shuai S. A. Yuan,
Chongwen Huang,
Jianhua Zhang,
Faouzi Bader,
Zhaoyang Zhang,
Sami Muhaidat,
Merouane Debbah,
Chau Yuen
Abstract:
Advancements in emerging technologies, e.g., reconfigurable intelligent surfaces and holographic MIMO (HMIMO), facilitate unprecedented manipulation of electromagnetic (EM) waves, significantly enhancing the performance of wireless communication systems. To accurately characterize the achievable performance limits of these systems, it is crucial to develop a universal EM-compliant channel model. This paper addresses this necessity by proposing a comprehensive EM channel model tailored for realistic multi-path environments, accounting for the combined effects of antenna array configurations and propagation conditions in HMIMO communications. Both polarization phenomena and spatial correlation are incorporated into this probabilistic channel model. Additionally, physical constraints of antenna configurations, such as mutual coupling effects and energy consumption, are integrated into the channel modeling framework. Simulation results validate the effectiveness of the proposed probabilistic channel model, indicating that traditional Rician and Rayleigh fading models cannot accurately depict the channel characteristics and underestimate the channel capacity. More importantly, the proposed channel model outperforms free-space Green's functions in accurately depicting both near-field gain and multi-path effects in radiative near-field regions. These gains are much more evident in tri-polarized systems, highlighting the necessity of polarization interference elimination techniques. Moreover, the theoretical analysis accurately verifies that capacity decreases with expanding communication regions of two-user communications.
Submitted 6 February, 2025;
originally announced February 2025.
-
Semantic Communication with Entropy-and-Channel-Adaptive Rate Control over Multi-User MIMO Fading Channels
Authors:
Weixuan Chen,
Qianqian Yang,
Yuhao Chen,
Chongwen Huang,
Qian Wang,
Zehui Xiong,
Zhaoyang Zhang
Abstract:
Although significant improvements in transmission efficiency have been achieved, existing semantic communication (SemCom) methods typically use a fixed transmission rate for varying channel conditions and transmission contents, leading to performance degradation under harsh channel conditions. To address these challenges, we propose a novel SemCom method for wireless image transmission that integrates an entropy-and-channel-adaptive rate control mechanism, specifically designed for multi-user multiple-input multiple-output (MU-MIMO) fading channels. Unlike existing methods, our system dynamically adjusts transmission rates by leveraging the entropy of feature maps, channel state information (CSI), and signal-to-noise ratio (SNR), ensuring optimal communication resource usage. It incorporates feature map pruning, channel attention, spatial attention, and multi-head self-attention (MHSA) to effectively prioritize critical semantic features while minimizing unnecessary transmission overhead. Experimental results demonstrate that the proposed system outperforms separated source and channel coding and deep joint source and channel coding (Deep JSCC) in terms of rate-distortion performance, flexibility, and robustness, particularly in challenging scenarios such as low SNR, imperfect CSI, and inter-user interference.
Submitted 23 April, 2025; v1 submitted 26 January, 2025;
originally announced January 2025.
-
Boosting Diffusion Guidance via Learning Degradation-Aware Models for Blind Super Resolution
Authors:
Shao-Hao Lu,
Ren Wang,
Ching-Chun Huang,
Wei-Chen Chiu
Abstract:
Recently, diffusion-based blind super-resolution (SR) methods have shown great ability to generate high-resolution images with abundant high-frequency detail, but the detail is often achieved at the expense of fidelity. Meanwhile, another line of research focusing on rectifying the reverse process of diffusion models (i.e., diffusion guidance), has demonstrated the power to generate high-fidelity results for non-blind SR. However, these methods rely on known degradation kernels, making them difficult to apply to blind SR. To address these issues, we present DADiff in this paper. DADiff incorporates degradation-aware models into the diffusion guidance framework, eliminating the need to know degradation kernels. Additionally, we propose two novel techniques: input perturbation and guidance scalar, to further improve our performance. Extensive experimental results show that our proposed method has superior performance over state-of-the-art methods on blind SR benchmarks.
Submitted 22 January, 2025; v1 submitted 15 January, 2025;
originally announced January 2025.
-
Digital Twin Online Channel Modeling: Challenges, Principles, and Applications
Authors:
Junling Li,
Cheng-Xiang Wang,
Chen Huang,
Tianrun Qi,
Tong Wu
Abstract:
Different from traditional offline channel modeling, digital twin online channel modeling can sense and accurately characterize dynamic wireless channels in real time, and can therefore greatly assist 6G network optimization. This article proposes a novel promising framework and a step-by-step design procedure of digital twin online channel models (DTOCM). By enabling continuous visualization and accurate prediction of dynamic channel variations, DTOCM can synchronize the performance between simulated and real networks. We first explore the evolution and conceptual advancements of DTOCM, highlighting its visions and associated challenges. Then, we explain its operational principles, construction mechanisms, and applications to typical 6G scenarios. Subsequently, the real-time channel information provisioning and visualization capabilities of DTOCM are illustrated through our DTOCM platform based on practical scenarios. Finally, future research directions and open issues are discussed.
Submitted 15 January, 2025;
originally announced January 2025.
-
Movable Antenna-Assisted Integrated Sensing and Communication Systems
Authors:
Chengjun Jiang,
Chensi Zhang,
Chongwen Huang,
Jianhua Ge,
Dusit Niyato,
Chau Yuen
Abstract:
Movable antennas (MAs) enhance flexibility in beamforming gain and interference suppression by adjusting position within certain areas of the transceivers. In this paper, we propose an MA-assisted integrated sensing and communication framework, wherein MAs are deployed for reconfiguring the channel array responses at both the receiver and transmitter of a base station. Then, we develop an optimization framework aimed at maximizing the sensing signal-to-interference-plus-noise-ratio (SINR) by jointly optimizing the receive beamforming vector, the transmit beamforming matrix, and the positions of MAs while meeting the minimum SINR requirement for each user. To address this nonconvex problem involving complex coupled variables, we devise an alternating optimization-based algorithm that incorporates techniques including the Charnes-Cooper transform, second-order Taylor expansion, and successive convex approximation (SCA). Specifically, the closed form of the received vector and the optimal transmit matrix can be first obtained in each iteration. Subsequently, the solutions for the positions of the transmit and receive MAs are obtained using the SCA method based on the second-order Taylor expansion. The simulation results show that the proposed scheme has significant advantages over the other baseline schemes. In particular, the proposed scheme has the ability to match the performance of the fixed position antenna scheme while utilizing fewer resources.
Submitted 2 January, 2025;
originally announced January 2025.
-
Generative AI Empowered Semantic Feature Multiple Access (SFMA) Over Wireless Networks
Authors:
Jiaxiang Wang,
Yinchao Yang,
Zhaohui Yang,
Chongwen Huang,
Mingzhe Chen,
Zhaoyang Zhang,
Mohammad Shikh-Bahaei
Abstract:
This paper investigates a novel generative artificial intelligence (GAI) empowered multi-user semantic communication system called semantic feature multiple access (SFMA) for video transmission, which comprises a base station (BS) and paired users. The BS generates and combines semantic information of several frames simultaneously requested by paired users into a single signal. Users recover their frames from this combined signal and input the recovered frames into a GAI-based video frame interpolation model to generate the intermediate frame. To optimize transmission rates and temporal gaps between simultaneously transmitted frames, we formulate an optimization problem to maximize the system sum rate while minimizing temporal gaps. Since the standard signal-to-interference-plus-noise ratio (SINR) equation does not accurately capture the performance of our semantic communication system, we introduce a weight parameter into the SINR equation to better represent the system's performance. Due to its dependence on transmit power, we propose a three-step solution. First, we develop a user pairing algorithm that pairs two users with the highest preference value, a weighted combination of semantic transmission rate and temporal gap. Second, we optimize inter-group power allocation by formulating an optimization problem that allocates proper transmit power across all user groups to maximize system sum rates while satisfying each user's minimum rate requirement. Third, we address intra-group power allocation to enhance each user's performance. Simulation results demonstrate that our method improves transmission rates by up to 24.8%, 45.8%, and 66.1% compared to fixed-power non-orthogonal multiple access (F-NOMA), orthogonal joint source-channel coding (O-JSCC), and orthogonal frequency division multiple access (OFDMA), respectively.
Submitted 30 December, 2024;
originally announced December 2024.
-
Deep Learning-Based Traffic-Aware Base Station Sleep Mode and Cell Zooming Strategy in RIS-Aided Multi-Cell Networks
Authors:
Shuo Sun,
Chong Huang,
Gaojie Chen,
Pei Xiao,
Rahim Tafazolli
Abstract:
Advances in wireless technology have significantly increased the number of wireless connections, leading to higher energy consumption in networks. Among these, base stations (BSs) in radio access networks (RANs) account for over half of the total energy usage. To address this, we propose a multi-cell sleep strategy combined with adaptive cell zooming, user association, and reconfigurable intelligent surface (RIS) to minimize BS energy consumption. This approach allows BSs to enter sleep mode during low traffic, while adaptive cell zooming and user association dynamically adjust coverage to balance traffic load and enhance data rates through RIS, minimizing the number of active BSs. However, the proposed method may achieve energy savings at the cost of increased delay, requiring a trade-off between these two factors. Moreover, minimizing BS energy consumption under the delay constraint is a complicated non-convex problem. To address this issue, we model the RIS-aided multi-cell network as a Markov decision process (MDP) and use the proximal policy optimization (PPO) algorithm to optimize sleep mode (SM), cell zooming, and user association. In addition, we utilize a double cascade correlation network (DCCN) algorithm to optimize the RIS reflection coefficients. Simulation results demonstrate that PPO balances energy savings and delay, while DCCN-optimized RIS enhances BS energy savings. Compared to systems optimized by the benchmark DQN algorithm, energy consumption is reduced by 49.61%.
Submitted 25 December, 2024;
originally announced December 2024.
-
Optimizing In Vivo Data Acquisition for Robust Clinical Microvascular Imaging Using Ultrasound Localization Microscopy
Authors:
Chengwu Huang,
U-Wai Lok,
Jingke Zhang,
Xiang Yang Zhu,
James D. Krier,
Amy Stern,
Kate M. Knoll,
Kendra E. Petersen,
Kathryn A. Robinson,
Gina K. Hesley,
Andrew J. Bentall,
Thomas D. Atwell,
Andrew D. Rule,
Lilach O. Lerman,
Shigao Chen
Abstract:
Ultrasound localization microscopy (ULM) enables microvascular imaging at spatial resolutions beyond the acoustic diffraction limit, offering significant clinical potential. However, ULM performance relies heavily on microbubble (MB) signal sparsity, the number of detected MBs, and signal-to-noise ratio (SNR), all of which vary in clinical scenarios involving bolus MB injections. These sources of variation underscore the need to optimize MB dosage, data acquisition timing, and imaging settings in order to standardize and optimize ULM of microvasculature. This pilot study investigated temporal changes in MB signals during bolus injections in both pig and human models to optimize data acquisition for clinical ULM. Quantitative indices were developed to evaluate MB signal quality, guiding selection of acquisition timing that balances MB localization quality against adequate MB counts. The effects of transmitted voltage and dosage were also explored. In the pig model, a relatively short window (approximately 10 seconds) for optimal acquisition was identified during the rapid wash-out phase, highlighting the need for real-time MB signal monitoring during data acquisition. The slower wash-out phase in humans allowed for a more flexible imaging window of 1-2 minutes, while trade-offs were observed between localization quality and MB density (or acquisition length) at different wash-out phase timings. Guided by these findings, robust ULM imaging was achieved in both pig and human kidneys using a short period of data acquisition, demonstrating its feasibility in clinical practice. This study provides insights into optimizing data acquisition for consistent and reproducible ULM, paving the way for its standardization and broader clinical applications.
Submitted 23 December, 2024;
originally announced December 2024.
-
Overview of AI and Communication for 6G Network: Fundamentals, Challenges, and Future Research Opportunities
Authors:
Qimei Cui,
Xiaohu You,
Ni Wei,
Guoshun Nan,
Xuefei Zhang,
Jianhua Zhang,
Xinchen Lyu,
Ming Ai,
Xiaofeng Tao,
Zhiyong Feng,
Ping Zhang,
Qingqing Wu,
Meixia Tao,
Yongming Huang,
Chongwen Huang,
Guangyi Liu,
Chenghui Peng,
Zhiwen Pan,
Tao Sun,
Dusit Niyato,
Tao Chen,
Muhammad Khurram Khan,
Abbas Jamalipour,
Mohsen Guizani,
Chau Yuen
Abstract:
With the growing demand for seamless connectivity and intelligent communication, the integration of artificial intelligence (AI) and sixth-generation (6G) communication networks has emerged as a transformative paradigm. By embedding AI capabilities across various network layers, this integration enables optimized resource allocation, improved efficiency, and enhanced system robustness, particularly in intricate and dynamic environments. This paper presents a comprehensive overview of AI and communication for 6G networks, with a focus on their foundational principles, inherent challenges, and future research opportunities. We first review the integration of AI and communications in the context of 6G, exploring the driving factors behind incorporating AI into wireless communications, as well as the vision for the convergence of AI and 6G. The discourse then transitions to a detailed exposition of the envisioned integration of AI within 6G networks, delineated across three progressive developmental stages. The first stage, AI for Network, focuses on employing AI to augment network performance, optimize efficiency, and enhance user service experiences. The second stage, Network for AI, highlights the role of the network in facilitating and buttressing AI operations and presents key enabling technologies, such as digital twins for AI and semantic communication. In the final stage, AI as a Service, it is anticipated that future 6G networks will innately provide AI functions as services, supporting application scenarios like immersive communication and intelligent industrial robots. In addition, we conduct an in-depth analysis of the critical challenges faced by the integration of AI and communications in 6G. Finally, we outline promising future research opportunities that are expected to drive the development and refinement of AI and 6G communications.
Submitted 13 February, 2025; v1 submitted 19 December, 2024;
originally announced December 2024.
-
Deep Reinforcement Learning-Based Resource Allocation for Hybrid Bit and Generative Semantic Communications in Space-Air-Ground Integrated Networks
Authors:
Chong Huang,
Xuyang Chen,
Gaojie Chen,
Pei Xiao,
Geoffrey Ye Li,
Wei Huang
Abstract:
In this paper, we introduce a novel framework consisting of hybrid bit-level and generative semantic communications for efficient downlink image transmission within space-air-ground integrated networks (SAGINs). The proposed model comprises multiple low Earth orbit (LEO) satellites, unmanned aerial vehicles (UAVs), and ground users. Because limitations in signal coverage and receiver antennas make direct communication between satellites and ground users infeasible in many scenarios, UAVs serve as relays and forward images from satellites to the ground users. Our hybrid communication framework effectively combines bit-level transmission with several semantic-level image generation modes, optimizing bandwidth usage to meet stringent satellite link budget constraints and ensure communication reliability and low latency under low signal-to-noise ratio (SNR) conditions. To reduce the transmission delay while ensuring reconstruction quality for the ground user, we propose a novel metric to measure delay and reconstruction quality in the proposed system, and employ a deep reinforcement learning (DRL)-based strategy to optimize resource allocation in the proposed network. Simulation results demonstrate the superiority of the proposed framework in terms of communication resource conservation, latency reduction, and image quality, significantly outperforming traditional solutions. Therefore, the proposed framework can ensure that real-time image transmission requirements in SAGINs are met, even under dynamic network conditions and user demand.
Submitted 26 May, 2025; v1 submitted 7 December, 2024;
originally announced December 2024.
-
Comprehensive Evaluation of Multimodal AI Models in Medical Imaging Diagnosis: From Data Augmentation to Preference-Based Comparison
Authors:
Cailian Ruan,
Chengyue Huang,
Yahe Yang
Abstract:
This study introduces an evaluation framework for multimodal models in medical imaging diagnostics. We developed a pipeline incorporating data preprocessing, model inference, and preference-based evaluation, expanding an initial set of 500 clinical cases to 3,000 through controlled augmentation. Our method combined medical images with clinical observations to generate assessments, using Claude 3.5 Sonnet for independent evaluation against physician-authored diagnoses. The results indicated varying performance across models, with Llama 3.2-90B outperforming human diagnoses in 85.27% of cases. In contrast, specialized vision models like BLIP2 and Llava were preferred in 41.36% and 46.77% of cases, respectively. This framework highlights the potential of large multimodal models to outperform human diagnostics in certain tasks.
Submitted 6 December, 2024;
originally announced December 2024.
-
Communication Compression for Distributed Learning without Control Variates
Authors:
Tomas Ortega,
Chun-Yin Huang,
Xiaoxiao Li,
Hamid Jafarkhani
Abstract:
Distributed learning algorithms, such as the ones employed in Federated Learning (FL), require communication compression to reduce the cost of client uploads. The compression methods used in practice are often biased and therefore require error feedback to achieve convergence when the compression is aggressive. In turn, error feedback requires client-specific control variates, which directly contradicts privacy-preserving principles and requires stateful clients. In this paper, we propose Compressed Aggregate Feedback (CAFe), a novel distributed learning framework that allows highly compressible client updates by exploiting past aggregated updates, and does not require control variates. We consider Distributed Gradient Descent (DGD) as a representative algorithm and provide a theoretical proof of CAFe's superiority to Distributed Compressed Gradient Descent (DCGD) with biased compression in the non-smooth regime with bounded gradient dissimilarity. Experimental results confirm that CAFe consistently outperforms distributed learning with direct compression and highlight the compressibility of the client updates with CAFe.
Submitted 5 December, 2024;
originally announced December 2024.