
SpeechAccentLLM: A Unified Framework for Foreign Accent Conversion and Text to Speech

Authors: Zhuangfei Cheng, Guangyan Zhang, Zehai Tu, Yangyang Song, Shuiyang Mao, Xiaoqi Jiao, Jingyu Li, Yiwen Guo, Jiasong Wu

Abstract: Foreign accent conversion (FAC) in speech processing remains a challenging task. Building on the remarkable success of large language models (LLMs) in Text-to-Speech (TTS) tasks, this study investigates the adaptation of LLM-based techniques for FAC, which we term SpeechAccentLLM. At the core of this framework, we introduce SpeechCodeVAE, the first model to integrate connectionist temporal classification (CTC) directly into codebook discretization for speech content tokenization. This novel architecture generates tokens with a unique "locality" property, as validated by experiments demonstrating optimal trade-offs among content faithfulness, temporal coherence, and structural recoverability. Then, to address data scarcity for the FAC module, we adopted a multitask learning strategy that jointly trains the FAC and TTS modules. Beyond mitigating data limitations, this approach yielded accelerated convergence and superior speech quality compared to standalone FAC training. Moreover, leveraging the salient properties of our discrete speech representations, we introduce SpeechRestorer, a postprocessing architecture designed to refine LLM-generated outputs. This module effectively mitigates stochastic errors prevalent in LLM inference pipelines while enhancing prosodic continuity, as validated by ablation experiments.

Submitted 8 July, 2025; v1 submitted 2 July, 2025; originally announced July 2025.

Comments: 10 pages, includes references, 4 figures, 4 tables

ACM Class: I.2.7

arXiv:2506.21851 [pdf, ps, other]

End-to-End RGB-IR Joint Image Compression With Channel-wise Cross-modality Entropy Model

Authors: Haofeng Wang, Fangtao Zhou, Qi Zhang, Zeyuan Chen, Enci Zhang, Zhao Wang, Xiaofeng Huang, Siwei Ma

Abstract: RGB-IR (RGB-Infrared) image pairs are frequently applied simultaneously in various applications like intelligent surveillance. However, as the number of modalities increases, the required data storage and transmission costs also double. Therefore, efficient RGB-IR data compression is essential. This work proposes a joint compression framework for RGB-IR image pairs. Specifically, to fully utilize cross-modality prior information for accurate context probability modeling within and between modalities, we propose a Channel-wise Cross-modality Entropy Model (CCEM). Within CCEM, a Low-frequency Context Extraction Block (LCEB) and a Low-frequency Context Fusion Block (LCFB) are designed for extracting and aggregating the global low-frequency information from both modalities, which assist the model in predicting entropy parameters more accurately. Experimental results demonstrate that our approach outperforms existing RGB-IR image pair and single-modality compression methods on the LLVIP and KAIST datasets. For instance, the proposed framework achieves a 23.1% bit rate saving on the LLVIP dataset compared to the state-of-the-art RGB-IR image codec presented at CVPR 2022.

Submitted 26 June, 2025; originally announced June 2025.

Comments: IEEE International Conference on Systems, Man, and Cybernetics 2025. (SMC), under review

arXiv:2506.15136 [pdf, ps, other]

Out-of-Band Modality Synergy Based Multi-User Beam Prediction and Proactive BS Selection with Zero Pilot Overhead

Authors: Kehui Li, Binggui Zhou, Jiajia Guo, Feifei Gao, Guanghua Yang, Shaodan Ma

Abstract: Multi-user millimeter-wave communication relies on narrow beams and dense cell deployments to ensure reliable connectivity. However, tracking optimal beams for multiple mobile users across multiple base stations (BSs) results in significant signaling overhead. Recent works have explored the capability of out-of-band (OOB) modalities in obtaining spatial characteristics of wireless channels and reducing pilot overhead in single-BS single-user/multi-user systems. However, applying OOB modalities for multi-BS selection towards dense cell deployments leads to high coordination overhead, i.e., excessive computing overhead and high latency in data exchange. How to leverage OOB modalities to eliminate pilot overhead and achieve efficient multi-BS coordination in multi-BS systems remains largely unexplored. In this paper, we propose a novel OOB modality synergy (OMS) based mobility management scheme to realize multi-user beam prediction and proactive BS selection by synergizing two OOB modalities, i.e., vision and location. Specifically, mobile users are initially identified via spatial alignment of visual sensing and location feedback, and then tracked according to the temporal correlation in the image sequence. Subsequently, a binary encoding map based gain and beam prediction network (BEM-GBPN) is designed to predict beamforming gains and optimal beams for mobile users at each BS, such that a central unit can control the BSs to perform user handoff and beam switching. Simulation results indicate that the proposed OMS-based mobility management scheme enhances beam prediction and BS selection accuracy, enables users to achieve 91% of the optimal transmission rate with zero pilot overhead, and significantly improves multi-BS coordination efficiency compared to existing methods.

Submitted 18 June, 2025; originally announced June 2025.

arXiv:2506.12308 [pdf, ps, other]

From Ground to Sky: Architectures, Applications, and Challenges Shaping Low-Altitude Wireless Networks

Authors: Weijie Yuan, Yuanhao Cui, Jiacheng Wang, Fan Liu, Geng Sun, Tao Xiang, Jie Xu, Shi Jin, Dusit Niyato, Sinem Coleri, Sumei Sun, Shiwen Mao, Abbas Jamalipour, Dong In Kim, Mohamed-Slim Alouini, Xuemin Shen

Abstract: In this article, we introduce a novel low-altitude wireless network (LAWN), which is a reconfigurable, three-dimensional (3D) layered architecture. In particular, the LAWN integrates connectivity, sensing, control, and computing across aerial and terrestrial nodes that enable seamless operation in complex, dynamic, and mission-critical environments. Different from the conventional aerial communication systems, LAWN's distinctive feature is its tight integration of functional planes in which multiple functionalities continually reshape themselves to operate safely and efficiently in the low-altitude sky. With the LAWN, we discuss several enabling technologies, such as integrated sensing and communication (ISAC), semantic communication, and fully-actuated control systems. Finally, we identify potential applications and key cross-layer challenges. This article offers a comprehensive roadmap for future research and development in the low-altitude airspace.

Submitted 16 June, 2025; v1 submitted 13 June, 2025; originally announced June 2025.

Comments: 10 pages, 5 figures

arXiv:2505.22286 [pdf, ps, other]

Wireless Communication for Low-Altitude Economy with UAV Swarm Enabled Two-Level Movable Antenna System

Authors: Haiquan Lu, Yong Zeng, Shaodan Ma, Bin Li, Shi Jin, Rui Zhang

Abstract: Unmanned aerial vehicle (UAV) is regarded as a key enabling platform for low-altitude economy, due to its advantages such as 3D maneuverability, flexible deployment, and LoS air-to-air/ground communication links. In particular, the intrinsic high mobility renders UAV especially suitable for operating as a movable antenna (MA) from the sky. In this paper, by exploiting the flexible mobility of UAV swarm and antenna position adjustment of MA, we propose a novel UAV swarm enabled two-level MA system, where UAVs not only individually deploy a local MA array, but also form a larger-scale MA system with their individual MA arrays via swarm coordination. We formulate a general optimization problem to maximize the minimum achievable rate over all ground UEs, by jointly optimizing the 3D UAV swarm placement positions, their individual MAs' positions, and receive beamforming for different UEs. We first consider the special case where each UAV has only one antenna, under different scenarios of one single UE, two UEs, and arbitrary number of UEs. In particular, for the two-UE case, we derive the optimal UAV swarm placement positions in closed form that achieve IUI-free communication, where the UAV swarm forms a uniform sparse array (USA) satisfying the collision avoidance constraint. For the general case with an arbitrary number of UEs, we propose an efficient alternating optimization algorithm to solve the formulated non-convex optimization problem. Then, we extend the results to the case where each UAV is equipped with multiple antennas. Numerical results verify that the proposed low-altitude UAV swarm enabled MA system significantly outperforms various benchmark schemes, thanks to the exploitation of two-level mobility to create more favorable channel conditions for multi-UE communications.

Submitted 28 May, 2025; originally announced May 2025.

Comments: 13 pages, 10 figures

arXiv:2505.21951 [pdf, ps, other]

When Feedback Empowers the Uplink: Integrating Adaptive Coding with Wireless Power Transfer

Authors: Zijian Yang, Yulin Shao, Shaodan Ma

Abstract: Energy consumption and device lifetime are critical concerns for battery-constrained IoT devices. This paper introduces the Feedback-Aided Coding and Energy Transfer (FACET) framework, which synergistically combines adaptive feedback channel coding with wireless power transfer. FACET leverages the saturation effect of feedback coding, where increasing downlink power yields diminishing returns, to design a dual-purpose feedback mechanism that simultaneously guides uplink coding and replenishes device energy. We characterize the inherent tradeoff between feedback precision and harvested power, and formulate a fairness-constrained min-max optimization problem to minimize worst-case net energy consumption. An efficient algorithm based on alternating optimization and Lagrangian duality is developed, with each subproblem admitting a closed-form solution. Simulations show that FACET nearly triples device lifetime compared to conventional feedback coding architectures, and remains robust across a wide range of power regimes. These results suggest that FACET not only improves communication efficiency but also redefines the role of feedback in energy-constrained IoT systems.

Submitted 28 May, 2025; originally announced May 2025.

arXiv:2505.11978 [pdf, ps, other]

LLM-guided DRL for Multi-tier LEO Satellite Networks with Hybrid FSO/RF Links

Authors: Jiahui Li, Geng Sun, Zemin Sun, Jiacheng Wang, Yinqiu Liu, Ruichen Zhang, Dusit Niyato, Shiwen Mao

Abstract: Despite significant advancements in terrestrial networks, inherent limitations persist in providing reliable coverage to remote areas and maintaining resilience during natural disasters. Multi-tier networks with low Earth orbit (LEO) satellites and high-altitude platforms (HAPs) offer promising solutions, but face challenges from high mobility and dynamic channel conditions that cause unstable connections and frequent handovers. In this paper, we design a three-tier network architecture that integrates LEO satellites, HAPs, and ground terminals with hybrid free-space optical (FSO) and radio frequency (RF) links to maximize coverage while maintaining connectivity reliability. This hybrid approach leverages the high bandwidth of FSO for satellite-to-HAP links and the weather resilience of RF for HAP-to-ground links. We formulate a joint optimization problem to simultaneously balance downlink transmission rate and handover frequency by optimizing network configuration and satellite handover decisions. The problem is highly dynamic and non-convex with time-coupled constraints. To address these challenges, we propose a novel large language model (LLM)-guided truncated quantile critics algorithm with dynamic action masking (LTQC-DAM) that utilizes dynamic action masking to eliminate unnecessary exploration and employs LLMs to adaptively tune hyperparameters. Simulation results demonstrate that the proposed LTQC-DAM algorithm outperforms baseline algorithms in terms of convergence, downlink transmission rate, and handover frequency. We also reveal that compared to other state-of-the-art LLMs, DeepSeek delivers the best performance through gradual, contextually-aware parameter adjustments.

Submitted 17 May, 2025; originally announced May 2025.

Comments: This paper has been submitted to IEEE JSAC

arXiv:2505.03261 [pdf, other]

DiffVQA: Video Quality Assessment Using Diffusion Feature Extractor

Authors: Wei-Ting Chen, Yu-Jiet Vong, Yi-Tsung Lee, Sy-Yen Kuo, Qiang Gao, Sizhuo Ma, Jian Wang

Abstract: Video Quality Assessment (VQA) aims to evaluate video quality based on perceptual distortions and human preferences. Despite the promising performance of existing methods using Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), they often struggle to align closely with human perceptions, particularly in diverse real-world scenarios. This challenge is exacerbated by the limited scale and diversity of available datasets. To address this limitation, we introduce a novel VQA framework, DiffVQA, which harnesses the robust generalization capabilities of diffusion models pre-trained on extensive datasets. Our framework adapts these models to reconstruct identical input frames through a control module. The adapted diffusion model is then used to extract semantic and distortion features from a resizing branch and a cropping branch, respectively. To enhance the model's ability to handle long-term temporal dynamics, a parallel Mamba module is introduced, which extracts temporal coherence augmented features that are merged with the diffusion features to predict the final score. Experiments across multiple datasets demonstrate DiffVQA's superior performance on intra-dataset evaluations and its exceptional generalization across datasets. These results confirm that leveraging a diffusion model as a feature extractor can offer enhanced VQA performance compared to CNN and ViT backbones.

Submitted 6 May, 2025; originally announced May 2025.

arXiv:2505.00687 [pdf, ps, other]

GuideSR: Rethinking Guidance for One-Step High-Fidelity Diffusion-Based Super-Resolution

Authors: Aditya Arora, Zhengzhong Tu, Yufei Wang, Ruizheng Bai, Jian Wang, Sizhuo Ma

Abstract: In this paper, we propose GuideSR, a novel single-step diffusion-based image super-resolution (SR) model specifically designed to enhance image fidelity. Existing diffusion-based SR approaches typically adapt pre-trained generative models to image restoration tasks by adding extra conditioning on a VAE-downsampled representation of the degraded input, which often compromises structural fidelity. GuideSR addresses this limitation by introducing a dual-branch architecture comprising: (1) a Guidance Branch that preserves high-fidelity structures from the original-resolution degraded input, and (2) a Diffusion Branch, which uses a pre-trained latent diffusion model to enhance perceptual quality. Unlike conventional conditioning mechanisms, our Guidance Branch features a tailored structure for image restoration tasks, combining Full Resolution Blocks (FRBs) with channel attention and an Image Guidance Network (IGN) with guided attention. By embedding detailed structural information directly into the restoration pipeline, GuideSR produces sharper and more visually consistent results. Extensive experiments on benchmark datasets demonstrate that GuideSR achieves state-of-the-art performance while maintaining the low computational cost of single-step approaches, with up to 1.39 dB PSNR gain on challenging real-world datasets. Our approach consistently outperforms existing methods across various reference-based metrics including PSNR, SSIM, LPIPS, DISTS and FID, further representing a practical advancement for real-world image restoration.

Submitted 1 May, 2025; originally announced May 2025.

arXiv:2504.21445 [pdf, other]

Emerging Advances in Learned Video Compression: Models, Systems and Beyond

Authors: Chuanmin Jia, Feng Ye, Siwei Ma, Wen Gao, Huifang Sun, Leonardo Chiariglione

Abstract: Video compression is a fundamental topic in visual intelligence, bridging visual signal sensing/capturing and high-level visual analytics. The broad success of artificial intelligence (AI) technology has enriched the horizon of video compression into novel paradigms by leveraging end-to-end optimized neural models. In this survey, we first provide a comprehensive and systematic overview of recent literature on end-to-end optimized learned video coding, covering the spectrum of pioneering efforts in both uni-directional and bi-directional prediction based compression model design. We further delve into the optimization techniques employed in learned video compression (LVC), emphasizing their technical innovations and advantages. Some standardization progress is also reported. Furthermore, we investigate the system design and hardware implementation challenges of LVC. Finally, we present extensive simulation results to demonstrate the superior compression performance of LVC models, addressing the question of why learned codecs and AI-based video technology would have a broad impact on future visual intelligence research.

Submitted 30 April, 2025; originally announced April 2025.

arXiv:2504.20441 [pdf, ps, other]

Task-Oriented Semantic Communication with Importance-Aware Rate Control

Authors: Zhiye Sun, Shuai Ma, Shiyin Li

Abstract: Semantic communication is recognized for its high compression efficiency and robust resistance to noise. However, utilizing a fixed transmission rate in environments with dynamic signal-to-noise ratios (SNR) often results in inefficient use of communication resources. To address this challenge, this letter proposes an importance-aware rate control semantic communication (IRCSC) scheme, which dynamically adjusts transmission rates in response to both channel conditions and semantic importance. The scheme employs a contribution-based importance analyzer to rank semantic importance. Additionally, a novel metric, the semantic transmission integrity index (STII), is proposed to quantify the amount of correctly transmitted information and to correlate it with inference performance. Simulations indicate that, with low computational complexity, IRCSC guarantees a controllable trade-off between performance and rate, delivering higher compression efficiency and improved task performance in high-SNR scenarios.

Submitted 29 April, 2025; originally announced April 2025.

Comments: 5 pages, 4 figures

arXiv:2504.19660 [pdf, other]

Decentralization of Generative AI via Mixture of Experts for Wireless Networks: A Comprehensive Survey

Authors: Yunting Xu, Jiacheng Wang, Ruichen Zhang, Changyuan Zhao, Dusit Niyato, Jiawen Kang, Zehui Xiong, Bo Qian, Haibo Zhou, Shiwen Mao, Abbas Jamalipour, Xuemin Shen, Dong In Kim

Abstract: Mixture of Experts (MoE) has emerged as a promising paradigm for scaling model capacity while preserving computational efficiency, particularly in large-scale machine learning architectures such as large language models (LLMs). Recent advances in MoE have facilitated its adoption in wireless networks to address the increasing complexity and heterogeneity of modern communication systems. This paper presents a comprehensive survey of the MoE framework in wireless networks, highlighting its potential in optimizing resource efficiency, improving scalability, and enhancing adaptability across diverse network tasks. We first introduce the fundamental concepts of MoE, including various gating mechanisms and the integration with generative AI (GenAI) and reinforcement learning (RL). Subsequently, we discuss the extensive applications of MoE across critical wireless communication scenarios, such as vehicular networks, unmanned aerial vehicles (UAVs), satellite communications, heterogeneous networks, integrated sensing and communication (ISAC), and mobile edge networks. Furthermore, key applications in channel prediction, physical layer signal processing, radio resource management, network optimization, and security are thoroughly examined. Additionally, we present a detailed overview of open-source datasets that are widely used in MoE-based models to support diverse machine learning tasks. Finally, this survey identifies crucial future research directions for MoE, emphasizing the importance of advanced training techniques, resource-aware gating strategies, and deeper integration with emerging 6G technologies.

Submitted 28 April, 2025; originally announced April 2025.

Comments: Survey paper, 30 pages, 13 figures

arXiv:2504.16146 [pdf, other]

Aerial Active STAR-RIS-assisted Satellite-Terrestrial Covert Communications

Authors: Chuang Zhang, Geng Sun, Jiahui Li, Jiacheng Wang, Ruichen Zhang, Dusit Niyato, Shiwen Mao, Tony Q. S. Quek

Abstract: An integration of satellites and terrestrial networks is crucial for enhancing performance of next generation communication systems. However, the networks are hindered by the long-distance path loss and security risks in dense urban environments. In this work, we propose a satellite-terrestrial covert communication system assisted by the aerial active simultaneous transmitting and reflecting reconfigurable intelligent surface (AASTAR-RIS) to improve the channel capacity while ensuring the transmission covertness. Specifically, we first derive the minimal detection error probability (DEP) under the worst condition that the Warden has perfect channel state information (CSI). Then, we formulate an AASTAR-RIS-assisted satellite-terrestrial covert communication optimization problem (ASCCOP) to maximize the sum of the fair channel capacity for all ground users while meeting the strict covert constraint, by jointly optimizing the trajectory and active beamforming of the AASTAR-RIS. Due to the challenges posed by the complex and high-dimensional state-action spaces as well as the need for efficient exploration in dynamic environments, we propose a generative deterministic policy gradient (GDPG) algorithm, which is a generative deep reinforcement learning (DRL) method to solve the ASCCOP. Concretely, the generative diffusion model (GDM) is utilized as the policy representation of the algorithm to enhance the exploration process by generating diverse and high-quality samples through a series of denoising steps. Moreover, we incorporate an action gradient mechanism to accomplish the policy improvement of the algorithm, which refines the better state-action pairs through the gradient ascent. Simulation results demonstrate that the proposed approach significantly outperforms important benchmarks.

Submitted 22 April, 2025; originally announced April 2025.

arXiv:2504.16119 [pdf, other]

Micro-Ring Perceptron Sensor for High-Speed, Low-Power Radio-Frequency Signal

Authors: Bo-Han Wu, Shi-Yuan Ma, Sri Krishna Vadlamani, Hyeongrak Choi, Dirk Englund

Abstract: Radio-frequency (RF) sensing enables long-range, high-resolution detection for applications such as radar and wireless communication. RF photonic sensing mitigates the bandwidth limitations and high transmission losses of electronic systems by transducing the detected RF signals into broadband optical carriers. However, these sensing systems remain limited by detector noise and Nyquist rate sampling with analog-to-digital converters, particularly under low-power and high-data rate conditions. To overcome these limitations, we introduce the micro-ring perceptron (MiRP) sensor, a physics-inspired AI framework that integrates the micro-ring (MiR) dynamics-based analog processor with a machine-learning-driven digital backend. By embedding the nonlinear optical dynamics of MiRs into an end-to-end architecture, MiRP sensing maps the input signal into a learned feature space for the subsequent digital neural network. The trick is to encode the entire temporal structure of the incoming signal into each output sample in order to enable effectively sub-Nyquist sampling without loss of task-relevant information. Evaluations of three target classification datasets demonstrate the performance advantages of MiRP sensing. For example, on MNIST, MiRP detection achieves 94 ± 0.1% accuracy at 1/49 the Nyquist rate with an input RF signal of 1 pW, compared to 11 ± 0.4% for the conventional RF detection method. Thus, our sensor framework provides a robust and efficient solution for the detection of low-power and high-speed signals in real-world sensing applications.

Submitted 18 April, 2025; originally announced April 2025.

arXiv:2504.09905 [pdf, other]

Fusing Bluetooth with Pedestrian Dead Reckoning: A Floor Plan-Assisted Positioning Approach

Authors: Wenxuan Pan, Yang Yang, Mingzhe Chen, Dong Wei, Caili Guo, Shiwen Mao

Abstract: Floor plans can provide valuable prior information that helps enhance the accuracy of indoor positioning systems. However, existing research typically faces challenges in efficiently leveraging floor plan information and applying it to complex indoor layouts. To fully exploit information from floor plans for positioning, we propose a floor plan-assisted fusion positioning algorithm (FP-BP) using Bluetooth low energy (BLE) and pedestrian dead reckoning (PDR). In the considered system, a user holding a smartphone walks through a positioning area with BLE beacons installed on the ceiling, and can locate himself in real time. In particular, FP-BP consists of two phases. In the offline phase, FP-BP programmatically extracts map features from a stylized floor plan based on their binary masks, and constructs a mapping function to identify the corresponding map feature of any given position on the map. In the online phase, FP-BP continuously computes BLE positions and PDR results from BLE signals and smartphone sensors, where a novel grid-based maximum likelihood estimation (GML) algorithm is introduced to enhance BLE positioning. Then, a particle filter is used to fuse them and obtain an initial estimate. Finally, FP-BP performs post-position correction to obtain the final position based on its specific map feature. Experimental results show that FP-BP can achieve a real-time mean positioning accuracy of 1.19 m, representing an improvement of over 28% compared to existing floor plan-fused baseline algorithms.

Submitted 19 April, 2025; v1 submitted 14 April, 2025; originally announced April 2025.

arXiv:2504.08520 [pdf, other]

Joint Transmit Waveform and Receive Filter Design for ISAC System with Jamming

Authors: Yuan Shu, Chenhao Qi, Shiwen Mao

Abstract: In this paper, to suppress jamming in the complex electromagnetic environment, we propose a joint transmit waveform and receive filter design framework for integrated sensing and communications (ISAC). By jointly optimizing the transmit waveform and receive filters, we aim at minimizing the multiuser interference (MUI), subject to the constraints of the target mainlobe, jamming mainlobe and peak sidelobe level of the receive filter output as well as the transmit power of the ISAC base station. We propose two schemes to solve the problem, including joint transmit waveform and matched filter design (JTMD) and joint transmit waveform and mismatched filter design (JTMMD) schemes. For both schemes, we adopt the alternating direction method of multipliers to iteratively optimize the transmit waveform and receive filters, where the number of targets as well as the range and angles of each target can also be estimated. Simulation results show that both the JTMD and JTMMD schemes achieve superior performance in terms of communication MUI and radar detection performance.

Submitted 11 April, 2025; originally announced April 2025.

arXiv:2504.02061 [pdf, other]

Aligned Better, Listen Better for Audio-Visual Large Language Models

Authors: Yuxin Guo, Shuailei Ma, Shijie Ma, Xiaoyi Bao, Chen-Wei Xie, Kecheng Zheng, Tingyu Weng, Siyang Sun, Yun Zheng, Wei Zou

Abstract: Audio is essential for multimodal video understanding. On the one hand, video inherently contains audio, which supplies complementary information to vision. Besides, video large language models (Video-LLMs) can encounter many audio-centric settings. However, existing Video-LLMs and Audio-Visual Large Language Models (AV-LLMs) exhibit deficiencies in exploiting audio information, leading to weak understanding and hallucinations. To solve the issues, we delve into the model architecture and dataset. (1) From the architectural perspective, we propose a fine-grained AV-LLM, namely Dolphin. The concurrent alignment of audio and visual modalities in both temporal and spatial dimensions ensures a comprehensive and accurate understanding of videos. Specifically, we devise an audio-visual multi-scale adapter for multi-scale information aggregation, which achieves spatial alignment. For temporal alignment, we propose audio-visual interleaved merging. (2) From the dataset perspective, we curate an audio-visual caption and instruction-tuning dataset, called AVU. It comprises 5.2 million diverse, open-ended data tuples (video, audio, question, answer) and introduces a novel data partitioning strategy. Extensive experiments show our model not only achieves remarkable performance in audio-visual understanding, but also mitigates potential hallucinations.

Submitted 2 April, 2025; originally announced April 2025.

Comments: Accepted to ICLR 2025

arXiv:2504.01333 [pdf, other]

Reconfigurable Codebook-Based Beamforming for RDARS-Aided mmWave MU-MIMO Systems

Authors: Chengwang Ji, Qing Xue, Haiquan Lu, Jintao Wang, Qiaoyan Peng, Shaodan Ma, Wei Zhang

Abstract: Reconfigurable distributed antenna and reflecting surface (RDARS) is a new architecture for the sixth-generation (6G) millimeter wave (mmWave) communications. In RDARS-aided mmWave systems, the active and passive beamforming design and working mode configuration for reconfigurable elements are crucial for system performance. In this paper, we aim to maximize the weighted sum rate (WSR) in the RDARS-aided mmWave system. To take advantage of RDARS, we first design a reconfigurable codebook (RCB) in which the number and dimension of the codeword can be flexibly adjusted. Then, a low overhead beam training scheme based on hierarchical search is proposed. Accordingly, the active and passive beamforming for data transmission is designed to achieve the maximum WSR for both space-division multiple access (SDMA) and time-division multiple access (TDMA) schemes. For the TDMA scheme, the optimal number of RDARS transmit elements and the allocated power budget for WSR maximization are derived in closed form. Besides, the superiority of the RDARS is verified and the conditions under which RDARS outperforms RIS and DAS are given. For the SDMA scheme, we characterize the relationship between the number of RDARS connected elements and the user distribution, followed by the derivation of the optimal placement positions of the RDARS transmit elements. High-quality beamforming design solutions are derived to minimize the inter-user interference (IUI) at the base station and RDARS side respectively, which nearly leads to the maximal WSR. Finally, simulation results confirm our theoretical findings and the superiority of the proposed schemes.

Submitted 1 April, 2025; originally announced April 2025.

arXiv:2503.07139 [pdf, other]

Power Allocation for Coordinated Multi-Point Aided ISAC Systems

Authors: Jianpeng Zou, Zhanfeng Zhong, Jintao Wang, Zheng Shi, Guanghua Yang, Shaodan Ma

Abstract: In this letter, we investigate a coordinated multiple point (CoMP)-aided integrated sensing and communication (ISAC) system that supports multiple users and targets. Multiple base stations (BSs) employ a coordinated power allocation strategy to serve their associated single-antenna communication users (CUs) while utilizing the echo signals for joint radar target (RT) detection. The probability of detection (PoD) of the CoMP-ISAC system is then proposed for assessing the sensing performance. To maximize the sum rate while ensuring the PoD for each RT and adhering to the total transmit power budget across all BSs, we introduce an efficient power allocation strategy. Finally, simulation results are provided to validate the analytical findings, demonstrating that the proposed power allocation scheme effectively enhances the sum rate while satisfying the sensing requirements.

Submitted 10 March, 2025; originally announced March 2025.

Comments: 4 pages, 4 figures

arXiv:2503.06149 [pdf, other]

Wireless Hallucination in Generative AI-enabled Communications: Concepts, Issues, and Solutions

Authors: Xudong Wang, Jiacheng Wang, Lei Feng, Dusit Niyato, Ruichen Zhang, Jiawen Kang, Zehui Xiong, Hongyang Du, Shiwen Mao

Abstract: Generative AI (GenAI) is driving the intelligence of wireless communications. Due to data limitations, random generation, and dynamic environments, GenAI may generate channel information or optimization strategies that violate physical laws or deviate from actual real-world requirements. We refer to this phenomenon as wireless hallucination, which results in invalid channel information, spectrum wastage, and low communication reliability but remains underexplored. To address this gap, this article provides a comprehensive concept of wireless hallucinations in GenAI-driven communications, focusing on hallucination mitigation. Specifically, we first introduce the fundamentals, analyze its causes based on the GenAI workflow, and propose mitigation solutions at the data, model, and post-generation levels. Then, we systematically examine representative hallucination scenarios in GenAI-enabled communications and their corresponding solutions. Finally, we propose a novel integrated mitigation solution for GenAI-based channel estimation. At the data level, we establish a channel estimation hallucination dataset and employ generative adversarial networks (GANs)-based data augmentation. Additionally, we incorporate attention mechanisms and large language models (LLMs) to enhance both training and inference performance. Experimental results demonstrate that the proposed hybrid solutions reduce the normalized mean square error (NMSE) by 0.19, effectively reducing wireless hallucinations.

Submitted 8 March, 2025; originally announced March 2025.

Comments: 7 pages, 4 figures

arXiv:2503.02725 [pdf, other]

A Joint Visual Compression and Perception Framework for Neuralmorphic Spiking Camera

Authors: Kexiang Feng, Chuanmin Jia, Siwei Ma, Wen Gao

Abstract: The advent of neuralmorphic spike cameras has garnered significant attention for their ability to capture continuous motion with unparalleled temporal resolution. However, this imaging attribute necessitates considerable resources for binary spike data storage and transmission. In light of compression and spike-driven intelligent applications, we present the notion of Spike Coding for Intelligence (SCI), wherein spike sequences are compressed and optimized for both bit-rate and task performance. Drawing inspiration from the mammalian vision system, we propose a dual-pathway architecture for separate processing of spatial semantics and motion information, which is then merged to produce features for compression. A refinement scheme is also introduced to ensure consistency between decoded features and motion vectors. We further propose a temporal regression approach that integrates various motion dynamics, capitalizing on the advancements in warping and deformation simultaneously. Comprehensive experiments demonstrate our scheme achieves state-of-the-art (SOTA) performance for spike compression and analysis. We achieve an average 17.25% BD-rate reduction compared to SOTA codecs and a 4.3% accuracy improvement over SpiReco for spike-based classification, with 88.26% complexity reduction and 42.41% inference time saving on the encoding side.

Submitted 4 March, 2025; originally announced March 2025.

arXiv:2502.19315 [pdf, ps, other]

Epitaxial high-K AlBN barrier GaN HEMTs

Authors: Chandrashekhar Savant, Thai-Son Nguyen, Kazuki Nomoto, Saurabh Vishwakarma, Siyuan Ma, Akshey Dhar, Yu-Hsin Chen, Joseph Casamento, David J. Smith, Huili Grace Xing, Debdeep Jena

Abstract: We report a polarization-induced 2D electron gas (2DEG) at an epitaxial AlBN/GaN heterojunction grown on a SiC substrate. Using this 2DEG in a long conducting channel, we realize ultra-thin barrier AlBN/GaN high electron mobility transistors that exhibit current densities of more than 0.25 A/mm, clean current saturation, a low pinch-off voltage of -0.43 V, and a peak transconductance of 0.14 S/mm. Transistor performance in this preliminary realization is limited by the contact resistance. Capacitance-voltage measurements reveal that introducing 7 % B in the epitaxial AlBN barrier on GaN boosts the relative dielectric constant of AlBN to 16, higher than the AlN dielectric constant of 9. Epitaxial high-K barrier AlBN/GaN HEMTs can thus extend performance beyond the capabilities of current GaN transistors.

Submitted 26 February, 2025; originally announced February 2025.

Comments: Manuscript: 7 pages, 5 figures and Supplementary data: 2 pages, 4 figures

arXiv:2502.16864 [pdf, other]

Joint Size and Placement Optimization for IRS-Aided Communications with Active and Passive Elements

Authors: Qiaoyan Peng, Qingqing Wu, Wen Chen, Chaoying Huang, Beixiong Zheng, Shaodan Ma, Mengnan Jian, Yijian Chen, Jun Yang

Abstract: Different types of intelligent reflecting surfaces (IRS) are exploited for assisting wireless communications. The joint use of passive IRS (PIRS) and active IRS (AIRS) emerges as a promising solution owing to their complementary advantages. They can be integrated into a single hybrid active-passive IRS (HIRS) or deployed in a distributed manner, which poses challenges in determining the IRS element allocation and placement for rate maximization. In this paper, we investigate the capacity of an IRS-aided wireless communication system with both active and passive elements. Specifically, we consider three deployment schemes: 1) base station (BS)-HIRS-user (BHU); 2) BS-AIRS-PIRS-user (BAPU); 3) BS-PIRS-AIRS-user (BPAU). Under the line-of-sight channel model, we formulate a rate maximization problem via a joint optimization of the IRS element allocation and placement. We first derive the optimized number of active and passive elements for BHU, BAPU, and BPAU schemes, respectively. Then, low-complexity HIRS/AIRS placement strategies are provided. To obtain more insights, we characterize the system capacity scaling orders for the three schemes with respect to the large total number of IRS elements, amplification power budget, and BS transmit power. Finally, simulation results are presented to validate our theoretical findings and show the performance difference among the BHU, BAPU, and BPAU schemes with the proposed joint design under various system setups.

Submitted 24 February, 2025; originally announced February 2025.

arXiv:2502.12622 [pdf, other]

Generative AI Enabled Robust Data Augmentation for Wireless Sensing in ISAC Networks

Authors: Jiacheng Wang, Changyuan Zhao, Hongyang Du, Geng Sun, Jiawen Kang, Shiwen Mao, Dusit Niyato, Dong In Kim

Abstract: Integrated sensing and communication (ISAC) uses the same software and hardware resources to achieve both communication and sensing functionalities. Thus, it stands as one of the core technologies of 6G and has garnered significant attention in recent years. In ISAC systems, a variety of machine learning models are trained to analyze and identify signal patterns, thereby ensuring reliable sensing and communications. However, considering factors such as communication rates, costs, and privacy, collecting sufficient training data from various ISAC scenarios for these models is impractical. Hence, this paper introduces a generative AI (GenAI) enabled robust data augmentation scheme. The scheme first employs a conditioned diffusion model trained on a limited amount of collected CSI data to generate new samples, thereby expanding the sample quantity. Building on this, the scheme further utilizes another diffusion model to enhance the sample quality, thereby facilitating the data augmentation in scenarios where the original sensing data is insufficient and unevenly distributed. Moreover, we propose a novel algorithm to estimate the acceleration and jerk of signal propagation path length changes from CSI. We then use the proposed scheme to enhance the estimated parameters and detect the number of targets based on the enhanced data. The evaluation reveals that our scheme improves the detection performance by up to 70%, demonstrating reliability and robustness, which supports the deployment and practical use of the ISAC network.

Submitted 18 February, 2025; originally announced February 2025.

Comments: 13 pages, 10 figures

arXiv:2502.03949 [pdf, other]

doi 10.1109/JIOT.2025.3538764

Semantic Feature Division Multiple Access for Digital Semantic Broadcast Channels

Authors: Shuai Ma, Zhiye Sun, Bin Shen, Youlong Wu, Hang Li, Guangming Shi, Shiyin Li, Naofal Al-Dhahir

Abstract: In this paper, we propose a digital semantic feature division multiple access (SFDMA) paradigm in multi-user broadcast (BC) networks for the inference and the image reconstruction tasks. In this SFDMA scheme, the multi-user semantic information is encoded into discrete approximately orthogonal representations, and the encoded semantic features of multiple users can be simultaneously transmitted in the same time-frequency resource. Specifically, for inference tasks, we design an SFDMA digital BC network based on robust information bottleneck (RIB), which can achieve a tradeoff between inference performance, data compression and multi-user interference. Moreover, for image reconstruction tasks, we develop an SFDMA digital BC network by utilizing a Swin Transformer, which significantly reduces multi-user interference. More importantly, SFDMA can protect the privacy of users' semantic information, in which each receiver can only decode its own semantic information. Furthermore, we establish a relationship between performance and signal to interference plus noise ratio (SINR), which is fitted by an Alpha-Beta-Gamma (ABG) function. In addition, an optimal power allocation method is developed for the inference and reconstruction tasks. Extensive simulations verify the effectiveness and superiority of our proposed SFDMA scheme.

Submitted 6 February, 2025; originally announced February 2025.

Comments: 14 pages, 13 figures

arXiv:2501.10705 [pdf, other]

Secure Communication in Dynamic RDARS-Driven Systems

Authors: Ziqian Pei, Jintao Wang, Pingping Zhang, Zheng Shi, Guanghua Yang, Shaodan Ma

Abstract: In this letter, we investigate a dynamic reconfigurable distributed antenna and reflection surface (RDARS)-driven secure communication system, where the working mode of the RDARS can be flexibly configured. We aim to maximize the secrecy rate by jointly designing the active beamforming vectors, reflection coefficients, and the channel-aware mode selection matrix. To address the non-convex binary and cardinality constraints introduced by dynamic mode selection, we propose an efficient alternating optimization (AO) framework that employs penalty-based fractional programming (FP) and successive convex approximation (SCA) transformations. Simulation results demonstrate the potential of RDARS in enhancing the secrecy rate and show its superiority compared to existing reflection surface-based schemes.

Submitted 18 January, 2025; originally announced January 2025.

Comments: 5 pages, 5 figures

arXiv:2501.01773 [pdf, other]

Compressed Domain Prior-Guided Video Super-Resolution for Cloud Gaming Content

Authors: Qizhe Wang, Qian Yin, Zhimeng Huang, Weijia Jiang, Yi Su, Siwei Ma, Jiaqi Zhang

Abstract: Cloud gaming is an advanced form of Internet service that necessitates local terminals to decode within limited resources and time latency. Super-Resolution (SR) techniques are often employed on these terminals as an efficient way to reduce the required bit-rate bandwidth for cloud gaming. However, insufficient attention has been paid to SR of compressed game video content. Most SR networks amplify block artifacts and ringing effects in decoded frames while ignoring edge details of game content, leading to unsatisfactory reconstruction results. In this paper, we propose a novel lightweight network called Coding Prior-Guided Super-Resolution (CPGSR) to address the SR challenges in compressed game video content. First, we design a Compressed Domain Guided Block (CDGB) to extract features of different depths from coding priors, which are subsequently integrated with features from the U-net backbone. Then, a series of re-parameterization blocks are utilized for reconstruction. Ultimately, inspired by the quantization in video coding, we propose a partitioned focal frequency loss to effectively guide the model's focus on preserving high-frequency information. Extensive experiments demonstrate the advancement of our approach.

Submitted 3 January, 2025; originally announced January 2025.

Comments: 10 pages, 4 figures, Data Compression Conference 2025

arXiv:2412.19494 [pdf, other]

Retrieval-augmented Generation for GenAI-enabled Semantic Communications

Authors: Shunpu Tang, Ruichen Zhang, Yuxuan Yan, Qianqian Yang, Dusit Niyato, Xianbin Wang, Shiwen Mao

Abstract: Semantic communication (SemCom) is an emerging paradigm aiming at transmitting only task-relevant semantic information to the receiver, which can significantly improve communication efficiency. Recent advancements in generative artificial intelligence (GenAI) have empowered GenAI-enabled SemCom (GenSemCom) to further expand its potential in various applications. However, current GenSemCom systems still face challenges such as semantic inconsistency, limited adaptability to diverse tasks and dynamic environments, and the inability to leverage insights from past transmission. Motivated by the success of retrieval-augmented generation (RAG) in the domain of GenAI, this paper explores the integration of RAG in GenSemCom systems. Specifically, we first provide a comprehensive review of existing GenSemCom systems and the fundamentals of RAG techniques. We then discuss how RAG can be integrated into GenSemCom. Following this, we conduct a case study on semantic image transmission using an RAG-enabled diffusion-based SemCom system, demonstrating the effectiveness of the proposed integration. Finally, we outline future directions for advancing RAG-enabled GenSemCom systems.

Submitted 27 December, 2024; originally announced December 2024.

arXiv:2412.18817 [pdf, ps, other]

Wireless Communication with Flexible Reflector: Joint Placement and Rotation Optimization for Coverage Enhancement

Authors: Haiquan Lu, Zhi Yu, Yong Zeng, Shaodan Ma, Shi Jin, Rui Zhang

Abstract: Passive metal reflectors for communication enhancement have appealing advantages such as ultra low cost, zero energy expenditure, maintenance-free operation, long life span, and full compatibility with legacy wireless systems. To unleash the full potential of passive reflectors for wireless communications, this paper proposes a new passive reflector architecture, termed flexible reflector (FR), for enabling the flexible adjustment of beamforming direction via the FR placement and rotation optimization. We consider the multi-FR aided area coverage enhancement and aim to maximize the minimum expected receive power over all locations within the target coverage area, by jointly optimizing the placement positions and rotation angles of multiple FRs. To gain useful insights, the special case of movable reflector (MR) with fixed rotation is first studied to maximize the expected receive power at a target location, where the optimal single-MR placement positions for electrically large and small reflectors are derived in closed-form, respectively. It is shown that an electrically large reflector should be placed at the specular reflection point. For area coverage enhancement, the optimal placement is obtained for the single-MR case and a sequential placement algorithm is proposed for the multi-MR case. Moreover, for the general case of FR, joint placement and rotation design is considered for the single-/multi-FR aided coverage enhancement, respectively. Numerical results are presented which demonstrate significant performance gains of FRs over various benchmark schemes under different practical setups in terms of receive power enhancement.

Submitted 4 March, 2025; v1 submitted 25 December, 2024; originally announced December 2024.

Comments: 14 pages, 16 figures

arXiv:2412.11771 [pdf, other]

Point Cloud-Assisted Neural Image Compression

Authors: Ziqun Li, Qi Zhang, Xiaofeng Huang, Zhao Wang, Siwei Ma, Wei Yan

Abstract: Highly efficient image compression is a critical requirement. In several scenarios where multiple modalities of data are captured by different sensors, the auxiliary information from other modalities is not fully leveraged by existing image-only codecs, leading to suboptimal compression efficiency. In this paper, we increase image compression performance with the assistance of point cloud, which is widely adopted in the area of autonomous driving. We first unify the data representation for both modalities to facilitate data processing. Then, we propose the point cloud-assisted neural image codec (PCA-NIC) to enhance the preservation of image texture and structure by utilizing the high-dimensional point cloud information. We further introduce a multi-modal feature fusion transform module (MMFFT) to capture more representative image features and remove redundant information between channels and modalities that is not relevant to the image content. Our work is the first to improve image compression performance using point cloud and achieves state-of-the-art performance.

Submitted 16 December, 2024; originally announced December 2024.

arXiv:2412.05403 [pdf, other]

Knowledge-Based Deep Learning for Time-Efficient Inverse Dynamics

Authors: Shuhao Ma, Yu Cao, Ian D. Robertson, Chaoyang Shi, Jindong Liu, Zhi-Qiang Zhang

Abstract: Accurate understanding of muscle activation and muscle forces plays an essential role in neuro-rehabilitation and musculoskeletal disorder treatments. Computational musculoskeletal modeling has been widely used as a powerful non-invasive tool to estimate them through inverse dynamics using static optimization, but the inherent computational complexity results in time-consuming analysis. In this paper, we propose a knowledge-based deep learning framework for time-efficient inverse dynamic analysis, which can predict muscle activation and muscle forces from joint kinematic data directly while not requiring any label information during model training. The Bidirectional Gated Recurrent Unit (BiGRU) neural network is selected as the backbone of our model due to its proficient handling of time-series data. Prior physical knowledge from forward dynamics and pre-selected inverse dynamics based physiological criteria are integrated into the loss function to guide the training of neural networks. Experimental validations on two datasets, including one benchmark upper limb movement dataset and one self-collected lower limb movement dataset from six healthy subjects, are performed. The experimental results have shown that the selected BiGRU architecture outperforms other neural network models when trained using our specifically designed loss function, which illustrates the effectiveness and robustness of the proposed framework.

Submitted 6 December, 2024; originally announced December 2024.

Comments: 10 pages, 8 figures, Journal paper

arXiv:2412.04213 [pdf, other]

doi 10.1109/TNSRE.2024.3375320

Physics-informed Deep Learning for Muscle Force Prediction with Unlabeled sEMG Signals

Authors: Shuhao Ma, Jie Zhang, Chaoyang Shi, Pei Di, Ian D. Robertson, Zhi-Qiang Zhang

Abstract: Computational biomechanical analysis plays a pivotal role in understanding and improving human movements and physical functions. Although physics-based modeling methods can interpret the dynamic interaction between the neural drive to muscle dynamics and joint kinematics, they suffer from high computational latency. In recent years, data-driven methods have emerged as a promising alternative due to their fast execution speed, but label information is still required during training, which is not easy to acquire in practice. To tackle these issues, this paper presents a novel physics-informed deep learning method to predict muscle forces without any label information during model training. In addition, the proposed method could also identify personalized muscle-tendon parameters. To achieve this, the Hill muscle model-based forward dynamics is embedded into the deep neural network as the additional loss to further regulate the behavior of the deep neural network. Experimental validations on the wrist joint from six healthy subjects are performed, and a fully connected neural network (FNN) is selected to implement the proposed method. The predicted results of muscle forces show comparable or even lower root mean square error (RMSE) and higher coefficient of determination compared with baseline methods, which have to use the labeled surface electromyography (sEMG) signals, and it can also identify muscle-tendon parameters accurately, demonstrating the effectiveness of the proposed physics-informed deep learning method.

Submitted 5 December, 2024; originally announced December 2024.

Comments: 11 pages, 8 figures, journal

Journal ref: IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 32, pp. 1246-1256, 2024

arXiv:2411.17056 [pdf, ps, other]

Robust Max-Min Fair Beamforming Design for Rate Splitting Multiple Access-aided Visible Light Communications

Authors: Zhengqing Qiu, Yijie Mao, Shuai Ma, Bruno Clerckx

Abstract: This paper addresses the robust beamforming design for rate splitting multiple access (RSMA)-aided visible light communication (VLC) networks with imperfect channel state information at the transmitter (CSIT). In particular, we first derive the theoretical lower bound for the channel capacity of RSMA-aided VLC networks. Then we investigate the beamforming design to solve the max-min fairness (MMF) problem of RSMA-aided VLC networks under the practical optical power constraint and electrical power constraint while considering the practical imperfect CSIT scenario. To address the problem, we propose a constrained-concave-convex programming (CCCP)-based beamforming design algorithm which exploits the semidefinite relaxation (SDR) technique and a penalty method to deal with the rank-one constraint caused by SDR. Numerical results show that the proposed robust beamforming design algorithm for RSMA-aided VLC networks achieves superior performance over the existing ones for space-division multiple access (SDMA) and non-orthogonal multiple access (NOMA).
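
Illustrative note on the rank-one constraint mentioned in the abstract above: SDR replaces a beamforming vector w with the matrix W = w w^H and drops the rank-one requirement, so a beamformer has to be recovered from the relaxed solution afterwards. The sketch below is not the authors' CCCP or penalty algorithm. It only shows, assuming a positive semidefinite W is already available, how one might measure how close W is to rank one and extract its principal component. The function names and the example vector are hypothetical.

```python
import numpy as np

def rank_one_ratio(W):
    """Share of the trace carried by the largest eigenvalue (1.0 means rank one)."""
    eigvals = np.linalg.eigvalsh(W)  # ascending order for Hermitian input
    return eigvals[-1] / np.sum(eigvals)

def principal_beamformer(W):
    """Best rank-one approximation of W, returned as a vector w with W ~ w w^H."""
    eigvals, eigvecs = np.linalg.eigh(W)
    return np.sqrt(eigvals[-1]) * eigvecs[:, -1]

# Hypothetical example: build an exactly rank-one W and recover a beamformer from it.
w_true = np.array([1.0 + 0.5j, -0.3j, 0.8])
W = np.outer(w_true, w_true.conj())
w_hat = principal_beamformer(W)
ratio = rank_one_ratio(W)
```

The recovered `w_hat` matches `w_true` only up to a common phase factor, which is why the comparison is best made on the outer product rather than on the vectors themselves.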

Submitted 26 November, 2024; v1 submitted 25 November, 2024; originally announced November 2024.

arXiv:2411.14135 [pdf, other]

Compact Visual Data Representation for Green Multimedia -- A Human Visual System Perspective

Authors: Peilin Chen, Xiaohan Fang, Meng Wang, Shiqi Wang, Siwei Ma

Abstract: The Human Visual System (HVS), with its intricate sophistication, is capable of achieving ultra-compact information compression for visual signals. This remarkable ability is coupled with high generalization capability and energy efficiency. By contrast, the state-of-the-art Versatile Video Coding (VVC) standard achieves a compression ratio of around 1,000 times for raw visual data. This notable disparity motivates the research community to draw inspiration to effectively handle the immense volume of visual data in a green way. Therefore, this paper provides a survey of how visual data can be efficiently represented for green multimedia, in particular when the ultimate task is knowledge extraction instead of visual signal reconstruction. We introduce recent research efforts that promote green, sustainable, and efficient multimedia in this field. Moreover, we discuss how the deep understanding of the HVS can benefit the research community, and envision the development of future green multimedia technologies.

Submitted 26 December, 2024; v1 submitted 21 November, 2024; originally announced November 2024.

arXiv:2411.04762 [pdf, other]

JC5A: Service Delay Minimization for Aerial MEC-assisted Industrial Cyber-Physical Systems

Authors: Geng Sun, Jiaxu Wu, Zemin Sun, Long He, Jiacheng Wang, Dusit Niyato, Abbas Jamalipour, Shiwen Mao

Abstract: In the era of the sixth generation (6G) and industrial Internet of Things (IIoT), an industrial cyber-physical system (ICPS) drives the proliferation of sensor devices and computing-intensive tasks. To address the limited resources of IIoT sensor devices, unmanned aerial vehicle (UAV)-assisted mobile edge computing (MEC) has emerged as a promising solution, providing flexible and cost-effective services in close proximity of IIoT sensor devices (ISDs). However, leveraging aerial MEC to meet the delay-sensitive and computation-intensive requirements of the ISDs could face several challenges, including the limited communication, computation and caching (3C) resources, stringent offloading requirements for 3C services, and constrained on-board energy of UAVs. To address these issues, we first present a collaborative aerial MEC-assisted ICPS architecture by incorporating the computing capabilities of the macro base station (MBS) and UAVs. We then formulate a service delay minimization optimization problem (SDMOP). Since the SDMOP is proved to be an NP-hard problem, we propose a joint computation offloading, caching, communication resource allocation, computation resource allocation, and UAV trajectory control approach (JC5A). Specifically, JC5A consists of a block successive upper bound minimization method of multipliers (BSUMM) for computation offloading and service caching, a convex optimization-based method for communication and computation resource allocation, and a successive convex approximation (SCA)-based method for UAV trajectory control. Moreover, we theoretically prove the convergence and polynomial complexity of JC5A. Simulation results demonstrate that the proposed approach can achieve superior system performance compared to the benchmark approaches and algorithms.

Submitted 2 December, 2024; v1 submitted 7 November, 2024; originally announced November 2024.

arXiv:2410.14697 [pdf, other]

Learning Cortico-Muscular Dependence through Orthonormal Decomposition of Density Ratios

Authors: Shihan Ma, Bo Hu, Tianyu Jia, Alexander Kenneth Clarke, Blanka Zicher, Arnault H. Caillet, Dario Farina, Jose C. Principe

Abstract: The cortico-spinal neural pathway is fundamental for motor control and movement execution, and in humans it is typically studied using concurrent electroencephalography (EEG) and electromyography (EMG) recordings. However, current approaches for capturing high-level and contextual connectivity between these recordings have important limitations. Here, we present a novel application of statistical dependence estimators based on orthonormal decomposition of density ratios to model the relationship between cortical and muscle oscillations. Our method extends from traditional scalar-valued measures by learning eigenvalues, eigenfunctions, and projection spaces of density ratios from realizations of the signal, addressing the interpretability, scalability, and local temporal dependence of cortico-muscular connectivity. We experimentally demonstrate that eigenfunctions learned from cortico-muscular connectivity can accurately classify movements and subjects. Moreover, they reveal channel and temporal dependencies that confirm the activation of specific EEG channels during movement. Our code is available at https://github.com/bohu615/corticomuscular-eigen-encoder.

Submitted 19 December, 2024; v1 submitted 4 October, 2024; originally announced October 2024.

arXiv:2409.13398 [pdf]

Unsourced Sparse Multiple Access for 6G Massive Communication

Authors: Yifei Yuan, Yuhong Huang, Chunlin Yan, Sen Wang, Shuai Ma, Xiaodong Shen

Abstract: Massive communication is one of the key scenarios of 6G, where a connection density two orders of magnitude higher would be required to serve diverse services. As a promising direction, unsourced multiple access has been shown to significantly outperform orthogonal multiple access (OMA) and slotted-ALOHA in massive connections. In this paper we describe a design framework of unsourced sparse multiple access (USMA) that consists of two key modules: compressed sensing for preamble generation, and sparse interleaver division multiple access (SIDMA) for main packet transmission. Simulation results of the general design of USMA show that the theoretical bound can be approached within 1~1.5 dB by using simple channel codes such as convolutional codes. To illustrate the scalability of USMA, a customized design for ambient Internet of Things (A-IoT) is proposed, so that much less memory and computation are required. Simulation results with Rayleigh fading and realistic channel estimation show that the USMA-based A-IoT solution can deliver nearly 4 times the capacity and 6 times the efficiency for random access compared with traditional radio frequency identification (RFID) technology.

Submitted 15 November, 2024; v1 submitted 20 September, 2024; originally announced September 2024.

Comments: 7 pages, 5 figures and 1 table

arXiv:2409.10127 [pdf, ps, other]

Joint Beamforming and Illumination Pattern Design for Beam-Hopping LEO Satellite Communications

Authors: Jing Wang, Chenhao Qi, Shui Yu, Shiwen Mao

Abstract: Since hybrid beamforming (HBF) can approach the performance of fully-digital beamforming (FDBF) with much lower hardware complexity, we investigate the HBF design for beam-hopping (BH) low earth orbit (LEO) satellite communications (SatComs). Aiming at maximizing the sum-rate of totally illuminated beam positions during the whole BH period, we consider joint beamforming and illumination pattern design subject to the HBF constraints and sum-rate requirements. To address the non-convexity of the HBF constraints, we temporarily replace the HBF constraints with the FDBF constraints. Then we propose an FDBF and illumination pattern random search (FDBF-IPRS) scheme to optimize illumination patterns and fully-digital beamformers using constrained random search and fractional programming methods. To further reduce the computational complexity, we propose an FDBF and illumination pattern alternating optimization (FDBF-IPAO) scheme, where we relax the integer illumination pattern to continuous variables and after finishing all the iterations we quantize the continuous variables into integer ones. Based on the fully-digital beamformers designed by the FDBF-IPRS or FDBF-IPAO scheme, we propose an HBF alternating minimization algorithm to design the hybrid beamformers. Simulation results show that the proposed schemes can achieve satisfactory sum-rate performance for BH LEO SatComs.

Submitted 16 September, 2024; originally announced September 2024.

arXiv:2409.06946 [pdf, other]

Refracting Reconfigurable Intelligent Surface Assisted URLLC for Millimeter Wave High-Speed Train Communication Coverage Enhancement

Authors: Changzhu Liu, Ruisi He, Yong Niu, Shiwen Mao, Bo Ai, Ruifeng Chen

Abstract: High-speed train (HST) has garnered significant attention from both academia and industry due to the rapid development of railways worldwide. Millimeter wave (mmWave) communication, known for its large bandwidth, is an effective way to address performance bottlenecks in cellular network based HST wireless communication systems. However, mmWave signals suffer from significant path loss when traversing the carriage, posing substantial challenges to cellular networks. To address this issue, reconfigurable intelligent surfaces (RIS) have gained considerable interest for their ability to enhance cell coverage by reflecting signals toward the receiver. Ensuring communication reliability, a core performance indicator of ultra-reliable and low-latency communications (URLLC) in fifth-generation systems, is crucial for providing steady and reliable data transmissions along railways, particularly for delivering safety and control messages and monitoring HST signaling information. In this paper, we investigate a refracting RIS-assisted multi-user multiple-input single-output URLLC system in mmWave HST communications. We propose a sum rate maximization problem, subject to a base station beamforming constraint, as well as refracting RIS discrete phase shifts and reliability constraints. To solve this optimization problem, we design a joint optimization algorithm based on the alternating optimization method. This involves decoupling the original optimization problem into an active beamforming design and packet error probability optimization subproblem and a discrete phase shift design subproblem. These subproblems are addressed using the Lagrangian dual method and the local search method, respectively. Simulation results demonstrate the fast convergence of the proposed algorithm and highlight the benefits of refracting RIS adoption for sum rate improvement in mmWave HST networks.

Submitted 10 September, 2024; originally announced September 2024.

Comments: 11 figures, accepted by IEEE Transactions on Vehicular Technology

arXiv:2409.00956 [pdf]

Physics-Informed Neural Network Based Digital Image Correlation Method

Authors: Boda Li, Shichao Zhou, Qinwei Ma, Shaopeng Ma

Abstract: Digital Image Correlation (DIC) is a key technique in experimental mechanics for full-field deformation measurement, traditionally relying on subset matching to determine displacement fields. However, selecting optimal parameters like shape functions and subset size can be challenging in non-uniform deformation scenarios. Recent deep learning-based DIC approaches, both supervised and unsupervised, use neural networks to map speckle images to deformation fields, offering precise measurements without manual tuning. However, these methods require complex network architectures to extract speckle image features, which does not guarantee solution accuracy. This paper introduces PINN-DIC, a novel DIC method based on Physics-Informed Neural Networks (PINNs). Unlike traditional approaches, PINN-DIC uses a simple fully connected neural network that takes the coordinate domain as input and outputs the displacement field. By integrating the DIC governing equation into the loss function, PINN-DIC directly extracts the displacement field from reference and deformed speckle images through iterative optimization. Evaluations on simulated and real experiments demonstrate that PINN-DIC maintains the accuracy of deep learning-based DIC in non-uniform fields while offering three distinct advantages: 1) enhanced precision with a simpler network by directly fitting the displacement field from coordinates, 2) effective handling of irregular boundary displacement fields with minimal parameter adjustments, and 3) easy integration with other neural network-based mechanical analysis methods for comprehensive DIC result analysis.
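
Illustrative note on the loss described in the abstract above: a common form of the DIC governing equation is gray-level conservation, f(x) = g(x + u(x)), where f is the reference image, g the deformed image, and u the displacement field. The sketch below is not the PINN-DIC implementation. It assumes this conservation form, replaces the speckle images with a hypothetical analytic pattern, and replaces the neural network with a plain function, only to show that the residual loss vanishes at the true displacement and is positive elsewhere. All names and values are assumptions.

```python
import math

def speckle(x, y):
    """Smooth synthetic intensity pattern standing in for a speckle image."""
    return math.sin(3.0 * x) * math.cos(2.0 * y) + 0.5 * math.sin(5.0 * x + y)

# Hypothetical ground truth: a rigid translation of the whole field.
U_TRUE, V_TRUE = 0.10, -0.05

def reference(x, y):
    return speckle(x, y)

def deformed(x, y):
    # A material point at (x, y) moves to (x + U_TRUE, y + V_TRUE), so the
    # deformed image at (X, Y) shows the pattern from (X - U_TRUE, Y - V_TRUE).
    return speckle(x - U_TRUE, y - V_TRUE)

def dic_loss(displacement, n=20):
    """Mean squared gray-level residual over an n x n grid on [0, 1]^2."""
    total = 0.0
    for i in range(n):
        for j in range(n):
            x, y = i / (n - 1), j / (n - 1)
            u, v = displacement(x, y)  # a coordinate network would sit here
            residual = reference(x, y) - deformed(x + u, y + v)
            total += residual ** 2
    return total / (n * n)

loss_at_truth = dic_loss(lambda x, y: (U_TRUE, V_TRUE))
loss_at_zero = dic_loss(lambda x, y: (0.0, 0.0))
```

In a physics-informed setup, `displacement` would be a trainable network and `dic_loss` would be minimized by gradient descent; with real images, `deformed` would need sub-pixel interpolation rather than an analytic formula.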

Submitted 2 September, 2024; originally announced September 2024.

arXiv:2408.11398 [pdf, other]

Generative AI based Secure Wireless Sensing for ISAC Networks

Authors: Jiacheng Wang, Hongyang Du, Yinqiu Liu, Geng Sun, Dusit Niyato, Shiwen Mao, Dong In Kim, Xuemin Shen

Abstract: Integrated sensing and communications (ISAC) is expected to be a key technology for 6G, and channel state information (CSI) based sensing is a key component of ISAC. However, current research on ISAC focuses mainly on improving sensing performance, overlooking security issues, particularly the unauthorized sensing of users. In this paper, we propose a secure sensing system (DFSS) based on two distinct diffusion models. Specifically, we first propose a discrete conditional diffusion model to generate graphs with nodes and edges, guiding the ISAC system to appropriately activate wireless links and nodes, which ensures the sensing performance while minimizing the operation cost. Using the activated links and nodes, DFSS then employs the continuous conditional diffusion model to generate safeguarding signals, which are next modulated onto the pilot at the transmitter to mask fluctuations caused by user activities. As such, only ISAC devices authorized with the safeguarding signals can extract the true CSI for sensing, while unauthorized devices are unable to achieve the same sensing. Experimental results demonstrate that DFSS can reduce the activity recognition accuracy of the unauthorized devices by approximately 70%, effectively shielding the user from unauthorized surveillance.

Submitted 21 August, 2024; originally announced August 2024.

arXiv:2408.08833 [pdf, other]

Intra-symbol Differential Amplitude Shift Keying-aided Blind Detector for Ambient Backscatter Communication Systems

Authors: Shuaijun Ma, Peng Wei, Sa Xiao, Jianquan Wang, Wanbin Tang, Wei Xiang

Abstract: Ambient backscatter communications (AmBC) are a promising technology for addressing the energy consumption challenge in wireless communications through the reflection or absorption of surrounding radio frequency (RF) signals. However, AmBC grapples with the intricacies of ambient RF signals and the round-trip path loss. For traditional detectors, the incorporation of pilot sequences results in a reduction in spectral efficiency. Furthermore, traditional energy-based detectors are inherently susceptible to a notable error floor issue, attributed to the co-channel direct link interference (DLI). Consequently, this paper proposes a blind symbol detector without the prior knowledge of the channel state information, signal variance, and noise variance. By leveraging the intra-symbol differential amplitude shift keying (IDASK) scheme, this detector effectively redirects the majority of the DLI energy towards the largest eigenvalue of the received sample covariance matrix, thereby utilizing the second largest eigenvalue for efficient symbol detection. In addition, this paper conducts theoretical performance analyses of the proposed detector in terms of the false alarm probability, missed detection probability, and the bit-error rate (BER) lower bound. Simulation results demonstrate that the proposed blind detector exhibits a significant enhancement in symbol detection performance compared to its traditional counterparts.
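
Illustrative note on the eigenvalue-based statistic described in the abstract above: when one strong component dominates the received sample covariance matrix, its energy concentrates in the largest eigenvalue, and a weaker component in a different direction shows up in the second largest. The sketch below is not the IDASK detector from the paper. It assumes a hypothetical real-valued four-branch receiver with orthogonal channel vectors, made-up signal powers, and Gaussian sources, purely to show how the second largest eigenvalue separates the "tag on" and "tag off" cases.

```python
import numpy as np

def second_largest_eigenvalue(samples):
    """samples has shape (num_branches, num_snapshots)."""
    num_snapshots = samples.shape[1]
    covariance = samples @ samples.conj().T / num_snapshots
    eigenvalues = np.linalg.eigvalsh(covariance)  # ascending order
    return eigenvalues[-2]

rng = np.random.default_rng(0)
num_snapshots = 2000
# Hypothetical, mutually orthogonal channel vectors for the direct and backscatter links.
h_direct = np.array([1.0, 1.0, 1.0, 1.0])
h_backscatter = np.array([1.0, -1.0, 1.0, -1.0])

def received(tag_on):
    """Strong direct-link signal, optional weak backscatter signal, and noise."""
    direct = 10.0 * rng.standard_normal(num_snapshots)
    tag = rng.standard_normal(num_snapshots) if tag_on else np.zeros(num_snapshots)
    noise = 0.1 * rng.standard_normal((4, num_snapshots))
    return np.outer(h_direct, direct) + np.outer(h_backscatter, tag) + noise

stat_off = second_largest_eigenvalue(received(tag_on=False))
stat_on = second_largest_eigenvalue(received(tag_on=True))
```

Under these assumptions `stat_off` stays near the noise variance while `stat_on` rises to the backscatter power level, so a threshold between the two would separate the cases even though the direct link is far stronger than the tag signal.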

Submitted 16 August, 2024; originally announced August 2024.

arXiv:2407.15395 [pdf, other]

FAST-GSC: Fast and Adaptive Semantic Transmission for Generative Semantic Communication

Authors: Yiru Wang, Wanting Yang, Zehui Xiong, Yuping Zhao, Shiwen Mao, Tony Q. S. Quek, H. Vincent Poor

Abstract: The rapidly evolving field of generative artificial intelligence technology has introduced innovative approaches for developing semantic communication (SemCom) frameworks, leading to the emergence of a new paradigm, generative SemCom (GSC). However, the complex processes involved in semantic extraction and generative inference may result in considerable latency in resource-constrained scenarios. To tackle these issues, we introduce a new GSC framework that involves fast and adaptive semantic transmission (FAST-GSC). This framework incorporates one innovative communication mechanism and two enhancement strategies at the transmitter and receiver, respectively. Aiming to reduce task latency, our communication mechanism enables fast semantic transmission by parallelizing the processes of semantic extraction at the transmitter and inference at the receiver. Preliminary evaluations indicate that while this mechanism effectively reduces task latency, it could potentially compromise task performance. To address this issue, we propose two additional methods for enhancement. First, at the transmitter, we employ reinforcement learning to discern the intrinsic temporal dependencies among the semantic units and design their extraction and transmission sequence accordingly. Second, at the receiver, we design a semantic difference calculation module and propose a sequential conditional denoising approach to alleviate the stringent immediacy requirement for the reception of semantic features. Extensive experiments demonstrate that our proposed architecture achieves a performance score comparable to the conventional GSC architecture while realizing a 52% reduction in residual task latency that extends beyond the fixed inference duration.

Submitted 22 July, 2024; originally announced July 2024.

arXiv:2407.08919 [pdf, other]

Redefinition of Digital Twin and its Situation Awareness Framework Designing Towards Fourth Paradigm for Energy Internet of Things

Authors: Xing He, Yuezhong Tang, Shuyan Ma, Qian Ai, Fei Tao, Robert Qiu

Abstract: Traditional knowledge-based situation awareness (SA) modes struggle to adapt to the escalating complexity of today's Energy Internet of Things (EIoT), necessitating a pivotal paradigm shift. In response, this work introduces a pioneering data-driven SA framework, termed digital twin-based situation awareness (DT-SA), aiming to bridge existing gaps between data and demands, and further to enhance SA capabilities within the complex EIoT landscape. First, we redefine the concept of digital twin (DT) within the EIoT context, aligning it with data-intensive scientific discovery paradigm (the Fourth Paradigm) so as to waken EIoT's sleeping data; this contextual redefinition lays the cornerstone of our DT-SA framework for EIoT. Then, the framework is comprehensively explored through its four fundamental steps: digitalization, simulation, informatization, and intellectualization. These steps initiate a virtual ecosystem conducive to a continuously self-adaptive, self-learning, and self-evolving big model (BM), further contributing to the evolution and effectiveness of DT-SA in engineering. Our framework is characterized by the incorporation of system theory and Fourth Paradigm as guiding ideologies, DT as data engine, and BM as intelligence engine. This unique combination forms the backbone of our approach. This work extends beyond engineering, stepping into the domain of data science -- DT-SA not only enhances management practices for EIoT users/operators, but also propels advancements in pattern analysis and machine intelligence (PAMI) within the intricate fabric of a complex system. Numerous real-world cases validate our DT-SA framework.

Submitted 11 July, 2024; originally announced July 2024.

Comments: 16 pages, 15 figures. Accepted by IEEE Transactions on Systems, Man and Cybernetics: Systems

arXiv:2407.08424 [pdf, other]

Semantic Feature Division Multiple Access for Multi-user Digital Interference Networks

Authors: Shuai Ma, Chuanhui Zhang, Bin Shen, Youlong Wu, Hang Li, Shiyin Li, Guangming Shi, Naofal Al-Dhahir

Abstract: With the ever-increasing user density and quality of service (QoS) demand, 5G networks with limited spectrum resources are facing massive access challenges. To address these challenges, in this paper, we propose a novel discrete semantic feature division multiple access (SFDMA) paradigm for multi-user digital interference networks. Specifically, by utilizing deep learning technology, SFDMA extracts multi-user semantic information into discrete representations in distinguishable semantic subspaces, which enables multiple users to transmit simultaneously over the same time-frequency resources. Furthermore, based on a robust information bottleneck, we design an SFDMA-based multi-user digital semantic interference network for inference tasks, which can achieve approximate orthogonal transmission. Moreover, we propose an SFDMA-based multi-user digital semantic interference network for image reconstruction tasks, where the discrete outputs of the semantic encoders of the users are approximately orthogonal, which significantly reduces multi-user interference. Furthermore, we propose an Alpha-Beta-Gamma (ABG) formula for semantic communications, which is the first theoretical relationship between inference accuracy and transmission power. Then, we derive adaptive power control methods with closed-form expressions for inference tasks. Extensive simulations verify the effectiveness and superiority of the proposed SFDMA.

Submitted 11 July, 2024; originally announced July 2024.

arXiv:2407.04675 [pdf, other]

Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition

Authors: Ye Bai, Jingping Chen, Jitong Chen, Wei Chen, Zhuo Chen, Chuang Ding, Linhao Dong, Qianqian Dong, Yujiao Du, Kepan Gao, Lu Gao, Yi Guo, Minglun Han, Ting Han, Wenchao Hu, Xinying Hu, Yuxiang Hu, Deyu Hua, Lu Huang, Mingkun Huang, Youjia Huang, Jishuo Jin, Fanliu Kong, Zongwei Lan, Tianyu Li, et al. (30 additional authors not shown)

Abstract: A modern automatic speech recognition (ASR) model is required to accurately transcribe diverse speech signals (from different domains, languages, accents, etc.) given the specific contextual information in various application scenarios. Classic end-to-end models fused with extra language models perform well, but mainly in data matching scenarios and are gradually approaching a bottleneck. In this work, we introduce Seed-ASR, a large language model (LLM) based speech recognition model. Seed-ASR is developed based on the framework of audio conditioned LLM (AcLLM), leveraging the capabilities of LLMs by inputting continuous speech representations together with contextual information into the LLM. Through stage-wise large-scale training and the elicitation of context-aware capabilities in LLM, Seed-ASR demonstrates significant improvement over end-to-end models on comprehensive evaluation sets, including multiple domains, accents/dialects and languages. Additionally, Seed-ASR can be further deployed to support specific needs in various scenarios without requiring extra language models. Compared to recently released large ASR models, Seed-ASR achieves a 10%-40% reduction in word (or character, for Chinese) error rates on Chinese and English public test sets, further demonstrating its powerful performance.

Submitted 10 July, 2024; v1 submitted 5 July, 2024; originally announced July 2024.

arXiv:2407.01006 [pdf, other]

Multi-Functional Beamforming Design for Integrated Sensing, Communication, and Computation

Authors: Yapeng Zhao, Qingqing Wu, Wen Chen, Yong Zeng, Ruiqi Liu, Weidong Mei, Fen Hou, Shaodan Ma

Abstract: Integrated sensing and communication (ISAC) systems may face a heavy computation burden since the sensory data needs to be further processed. This paper studies a novel system that integrates sensing, communication, and computation, aiming to provide services for different objectives efficiently. This system consists of a multi-antenna multi-functional base station (BS), an edge server, a target, and multiple single-antenna communication users. The BS needs to allocate the available resources to efficiently provide sensing, communication, and computation services. Due to the heavy service burden and limited power budget, the BS can partially offload the tasks to the nearby edge server instead of computing them locally. We consider the estimation of the target response matrix, a general problem in radar sensing, and utilize the Cramér-Rao bound (CRB) as the corresponding performance metric. To tackle the non-convex optimization problem, we propose both semidefinite relaxation (SDR)-based alternating optimization and SDR-based successive convex approximation (SCA) algorithms to minimize the CRB of radar sensing while meeting the requirement of communication users and the need for task computing. Furthermore, we demonstrate that the optimal rank-one solutions of both the alternating and SCA algorithms can be directly obtained via the solver or further constructed even when dealing with multiple functionalities. Simulation results show that the proposed algorithms can provide higher target estimation performance than state-of-the-art benchmarks while satisfying the communication and computation constraints.

Submitted 1 July, 2024; originally announced July 2024.

arXiv:2406.09627 [pdf, other]

RobustSAM: Segment Anything Robustly on Degraded Images

Authors: Wei-Ting Chen, Yu-Jiet Vong, Sy-Yen Kuo, Sizhuo Ma, Jian Wang

Abstract: Segment Anything Model (SAM) has emerged as a transformative approach in image segmentation, acclaimed for its robust zero-shot segmentation capabilities and flexible prompting system. Nonetheless, its performance is challenged by images with degraded quality. Addressing this limitation, we propose the Robust Segment Anything Model (RobustSAM), which enhances SAM's performance on low-quality images while preserving its promptability and zero-shot generalization. Our method leverages the pre-trained SAM model with only marginal parameter increments and computational requirements. The additional parameters of RobustSAM can be optimized within 30 hours on eight GPUs, demonstrating its feasibility and practicality for typical research laboratories. We also introduce the Robust-Seg dataset, a collection of 688K image-mask pairs with different degradations designed to train and evaluate our model optimally. Extensive experiments across various segmentation tasks and datasets confirm RobustSAM's superior performance, especially under zero-shot conditions, underscoring its potential for extensive real-world application. Additionally, our method has been shown to effectively improve the performance of SAM-based downstream tasks such as single image dehazing and deblurring.

Submitted 13 June, 2024; originally announced June 2024.

Comments: Accepted by CVPR2024 (Highlight); Project Page: https://robustsam.github.io/

arXiv:2406.09622 [pdf, other]

DSL-FIQA: Assessing Facial Image Quality via Dual-Set Degradation Learning and Landmark-Guided Transformer

Authors: Wei-Ting Chen, Gurunandan Krishnan, Qiang Gao, Sy-Yen Kuo, Sizhuo Ma, Jian Wang

Abstract: Generic Face Image Quality Assessment (GFIQA) evaluates the perceptual quality of facial images, which is crucial in improving image restoration algorithms and selecting high-quality face images for downstream tasks. We present a novel transformer-based method for GFIQA, which is aided by two unique mechanisms. First, a Dual-Set Degradation Representation Learning (DSL) mechanism uses facial images with both synthetic and real degradations to decouple degradation from content, ensuring generalizability to real-world scenarios. This self-supervised method learns degradation features on a global scale, providing a robust alternative to conventional methods that use local patch information in degradation learning. Second, our transformer leverages facial landmarks to emphasize visually salient parts of a face image in evaluating its perceptual quality. We also introduce a balanced and diverse Comprehensive Generic Face IQA (CGFIQA-40k) dataset of 40K images carefully designed to overcome the biases, in particular the imbalances in skin tone and gender representation, in existing datasets. Extensive analysis and evaluation demonstrate the robustness of our method, marking a significant improvement over prior methods.

Submitted 13 June, 2024; originally announced June 2024.

Comments: Accepted by CVPR 2024, Project Page: https://dsl-fiqa.github.io/

arXiv:2406.09389 [pdf, other]

Sagiri: Low Dynamic Range Image Enhancement with Generative Diffusion Prior

Authors: Baiang Li, Sizhuo Ma, Yanhong Zeng, Xiaogang Xu, Youqing Fang, Zhao Zhang, Jian Wang, Kai Chen

Abstract: Capturing High Dynamic Range (HDR) scenery using 8-bit cameras often suffers from over-/underexposure, loss of fine details due to low bit-depth compression, skewed color distributions, and strong noise in dark areas. Traditional LDR image enhancement methods primarily focus on color mapping, which enhances the visual representation by expanding the image's color range and adjusting the brightness. However, these approaches fail to effectively restore content in dynamic range extremes, which are regions with pixel values close to 0 or 255. To address the full scope of challenges in HDR imaging and surpass the limitations of current models, we propose a novel two-stage approach. The first stage maps the color and brightness to an appropriate range while keeping the existing details, and the second stage utilizes a diffusion prior to generate content in dynamic range extremes lost during capture. This generative refinement module can also be used as a plug-and-play module to enhance and complement existing LDR enhancement models. The proposed method markedly improves the quality and details of LDR images, demonstrating superior performance through rigorous experimental validation. The project page is at https://sagiri0208.github.io

Submitted 13 June, 2024; originally announced June 2024.

Comments: https://sagiri0208.github.io

Showing 1–50 of 228 results for author: Ma, S