-
Image Restoration via Multi-domain Learning
Authors:
Xingyu Jiang,
Ning Gao,
Xiuhui Zhang,
Hongkun Dou,
Shaowen Fu,
Xiaoqing Zhong,
Hongjue Li,
Yue Deng
Abstract:
Due to adverse atmospheric and imaging conditions, natural images suffer from various degradation phenomena. Consequently, image restoration has emerged as a key solution and garnered substantial attention. Although recent Transformer architectures have demonstrated impressive success across various restoration tasks, their considerable model complexity poses significant challenges for both traini…
▽ More
Due to adverse atmospheric and imaging conditions, natural images suffer from various degradation phenomena. Consequently, image restoration has emerged as a key solution and garnered substantial attention. Although recent Transformer architectures have demonstrated impressive success across various restoration tasks, their considerable model complexity poses significant challenges for both training and real-time deployment. Furthermore, instead of investigating the commonalities among different degradations, most existing restoration methods focus on modifying Transformer under limited restoration priors. In this work, we first review various degradation phenomena under multi-domain perspective, identifying common priors. Then, we introduce a novel restoration framework, which integrates multi-domain learning into Transformer. Specifically, in Token Mixer, we propose a Spatial-Wavelet-Fourier multi-domain structure that facilitates local-region-global multi-receptive field modeling to replace vanilla self-attention. Additionally, in Feed-Forward Network, we incorporate multi-scale learning to fuse multi-domain features at different resolutions. Comprehensive experimental results across ten restoration tasks, such as dehazing, desnowing, motion deblurring, defocus deblurring, rain streak/raindrop removal, cloud removal, shadow removal, underwater enhancement and low-light enhancement, demonstrate that our proposed model outperforms state-of-the-art methods and achieves a favorable trade-off among restoration performance, parameter size, computational cost and inference latency. The code is available at: https://github.com/deng-ai-lab/SWFormer.
△ Less
Submitted 7 May, 2025;
originally announced May 2025.
-
ReeM: Ensemble Building Thermodynamics Model for Efficient HVAC Control via Hierarchical Reinforcement Learning
Authors:
Yang Deng,
Yaohui Liu,
Rui Liang,
Dafang Zhao,
Donghua Xie,
Ittetsu Taniguchi,
Dan Wang
Abstract:
The building thermodynamics model, which predicts real-time indoor temperature changes under potential HVAC (Heating, Ventilation, and Air Conditioning) control operations, is crucial for optimizing HVAC control in buildings. While pioneering studies have attempted to develop such models for various building environments, these models often require extensive data collection periods and rely heavil…
▽ More
The building thermodynamics model, which predicts real-time indoor temperature changes under potential HVAC (Heating, Ventilation, and Air Conditioning) control operations, is crucial for optimizing HVAC control in buildings. While pioneering studies have attempted to develop such models for various building environments, these models often require extensive data collection periods and rely heavily on expert knowledge, making the modeling process inefficient and limiting the reusability of the models. This paper explores a model ensemble perspective that utilizes existing developed models as base models to serve a target building environment, thereby providing accurate predictions while reducing the associated efforts. Given that building data streams are non-stationary and the number of base models may increase, we propose a Hierarchical Reinforcement Learning (HRL) approach to dynamically select and weight the base models. Our approach employs a two-tiered decision-making process: the high-level focuses on model selection, while the low-level determines the weights of the selected models. We thoroughly evaluate the proposed approach through offline experiments and an on-site case study, and the experimental results demonstrate the effectiveness of our method.
△ Less
Submitted 5 May, 2025;
originally announced May 2025.
-
Understanding the Mechanisms Behind Structural Influences on Link Prediction: A Case Study on FB15k-237
Authors:
Xiaobo Jiang,
Yadong Deng
Abstract:
FB15k-237 mitigates the data leakage issue by excluding inverse and symmetric relationship triples, however, this has led to substantial performance degradation and slow improvement progress. Traditional approaches demonstrate limited effectiveness on FB15k-237, primarily because the underlying mechanism by which structural features of the dataset influence model performance remains unexplored. To…
▽ More
FB15k-237 mitigates the data leakage issue by excluding inverse and symmetric relationship triples, however, this has led to substantial performance degradation and slow improvement progress. Traditional approaches demonstrate limited effectiveness on FB15k-237, primarily because the underlying mechanism by which structural features of the dataset influence model performance remains unexplored. To bridge this gap, we systematically investigate the impact mechanism of dataset structural features on link prediction performance. Firstly, we design a structured subgraph sampling strategy that ensures connectivity while constructing subgraphs with distinct structural features. Then, through correlation and sensitivity analyses conducted across several mainstream models, we observe that the distribution of relationship categories within subgraphs significantly affects performance, followed by the size of strongly connected components. Further exploration using the LIME model clarifies the intrinsic mechanism by which relationship categories influence link prediction performance, revealing that relationship categories primarily modulate the relative importance between entity embeddings and relationship embeddings and relationship embeddings, thereby affecting link prediction outcomes. These findings provide theoretical insights for addressing performance bottlenecks on FB15k-237, while the proposed analytical framework also offers methodological guidance for future studies dealing with structurally constrained datasets.
△ Less
Submitted 2 May, 2025;
originally announced May 2025.
-
NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement: Methods and Results
Authors:
Xin Li,
Kun Yuan,
Bingchen Li,
Fengbin Guan,
Yizhen Shao,
Zihao Yu,
Xijun Wang,
Yiting Lu,
Wei Luo,
Suhang Yao,
Ming Sun,
Chao Zhou,
Zhibo Chen,
Radu Timofte,
Yabin Zhang,
Ao-Xiang Zhang,
Tianwu Zhi,
Jianzhao Liu,
Yang Li,
Jingwen Xu,
Yiting Liao,
Yushen Zuo,
Mingyang Wu,
Renjie Li,
Shengyun Zhong
, et al. (88 additional authors not shown)
Abstract:
This paper presents a review for the NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement. The challenge comprises two tracks: (i) Efficient Video Quality Assessment (KVQ), and (ii) Diffusion-based Image Super-Resolution (KwaiSR). Track 1 aims to advance the development of lightweight and efficient video quality assessment (VQA) models, with an emphasis on eliminating re…
▽ More
This paper presents a review for the NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement. The challenge comprises two tracks: (i) Efficient Video Quality Assessment (KVQ), and (ii) Diffusion-based Image Super-Resolution (KwaiSR). Track 1 aims to advance the development of lightweight and efficient video quality assessment (VQA) models, with an emphasis on eliminating reliance on model ensembles, redundant weights, and other computationally expensive components in the previous IQA/VQA competitions. Track 2 introduces a new short-form UGC dataset tailored for single image super-resolution, i.e., the KwaiSR dataset. It consists of 1,800 synthetically generated S-UGC image pairs and 1,900 real-world S-UGC images, which are split into training, validation, and test sets using a ratio of 8:1:1. The primary objective of the challenge is to drive research that benefits the user experience of short-form UGC platforms such as Kwai and TikTok. This challenge attracted 266 participants and received 18 valid final submissions with corresponding fact sheets, significantly contributing to the progress of short-form UGC VQA and image superresolution. The project is publicly available at https://github.com/lixinustc/KVQE- ChallengeCVPR-NTIRE2025.
△ Less
Submitted 17 April, 2025;
originally announced April 2025.
-
The Tenth NTIRE 2025 Efficient Super-Resolution Challenge Report
Authors:
Bin Ren,
Hang Guo,
Lei Sun,
Zongwei Wu,
Radu Timofte,
Yawei Li,
Yao Zhang,
Xinning Chai,
Zhengxue Cheng,
Yingsheng Qin,
Yucai Yang,
Li Song,
Hongyuan Yu,
Pufan Xu,
Cheng Wan,
Zhijuan Huang,
Peng Guo,
Shuyuan Cui,
Chenjun Li,
Xuehai Hu,
Pan Pan,
Xin Zhang,
Heng Zhang,
Qing Luo,
Linyan Jiang
, et al. (122 additional authors not shown)
Abstract:
This paper presents a comprehensive review of the NTIRE 2025 Challenge on Single-Image Efficient Super-Resolution (ESR). The challenge aimed to advance the development of deep models that optimize key computational metrics, i.e., runtime, parameters, and FLOPs, while achieving a PSNR of at least 26.90 dB on the $\operatorname{DIV2K\_LSDIR\_valid}$ dataset and 26.99 dB on the…
▽ More
This paper presents a comprehensive review of the NTIRE 2025 Challenge on Single-Image Efficient Super-Resolution (ESR). The challenge aimed to advance the development of deep models that optimize key computational metrics, i.e., runtime, parameters, and FLOPs, while achieving a PSNR of at least 26.90 dB on the $\operatorname{DIV2K\_LSDIR\_valid}$ dataset and 26.99 dB on the $\operatorname{DIV2K\_LSDIR\_test}$ dataset. A robust participation saw \textbf{244} registered entrants, with \textbf{43} teams submitting valid entries. This report meticulously analyzes these methods and results, emphasizing groundbreaking advancements in state-of-the-art single-image ESR techniques. The analysis highlights innovative approaches and establishes benchmarks for future research in the field.
△ Less
Submitted 14 April, 2025;
originally announced April 2025.
-
Robust and Efficient Average Consensus with Non-Coherent Over-the-Air Aggregation
Authors:
Yuhang Deng,
Zheng Chen,
Erik G. Larsson
Abstract:
Non-coherent over-the-air (OTA) computation has garnered increasing attention for its advantages in facilitating information aggregation among distributed agents in resource-constrained networks without requiring precise channel estimation. A promising application scenario of this method is distributed average consensus in wireless multi-agent systems. However, in such scenario, non-coherent inter…
▽ More
Non-coherent over-the-air (OTA) computation has garnered increasing attention for its advantages in facilitating information aggregation among distributed agents in resource-constrained networks without requiring precise channel estimation. A promising application scenario of this method is distributed average consensus in wireless multi-agent systems. However, in such scenario, non-coherent interference from concurrent OTA transmissions can introduce bias in the consensus value. To address this issue, we develop a robust distributed average consensus algorithm by formulating the consensus problem as a distributed optimization problem. Using decentralized projected gradient descent (D-PGD), our proposed algorithm can achieve unbiased mean square average consensus even in the presence of non-coherent interference and noise. Additionally, we implement transmit power control and receive scaling mechanisms to further accelerate convergence. Simulation results demonstrate that our method can significantly enhance the convergence speed of the D-PGD algorithm for OTA average consensus without compromising accuracy.
△ Less
Submitted 8 April, 2025;
originally announced April 2025.
-
RAISE: Optimizing RIS Placement to Maximize Task Throughput in Multi-Server Vehicular Edge Computing
Authors:
Yanan Ma,
Zhengru Fang,
Longzhi Yuan,
Yiqin Deng,
Xianhao Chen,
Yuguang Fang
Abstract:
Given the limited computing capabilities on autonomous vehicles, onboard processing of large volumes of latency-sensitive tasks presents significant challenges. While vehicular edge computing (VEC) has emerged as a solution, offloading data-intensive tasks to roadside servers or other vehicles is hindered by large obstacles like trucks/buses and the surge in service demands during rush hours. To a…
▽ More
Given the limited computing capabilities on autonomous vehicles, onboard processing of large volumes of latency-sensitive tasks presents significant challenges. While vehicular edge computing (VEC) has emerged as a solution, offloading data-intensive tasks to roadside servers or other vehicles is hindered by large obstacles like trucks/buses and the surge in service demands during rush hours. To address these challenges, Reconfigurable Intelligent Surface (RIS) can be leveraged to mitigate interference from ground signals and reach more edge servers by elevating RIS adaptively. To this end, we propose RAISE, an optimization framework for RIS placement in multi-server VEC systems. Specifically, RAISE optimizes RIS altitude and tilt angle together with the optimal task assignment to maximize task throughput under deadline constraints. To find a solution, a two-layer optimization approach is proposed, where the inner layer exploits the unimodularity of the task assignment problem to derive the efficient optimal strategy while the outer layer develops a near-optimal hill climbing (HC) algorithm for RIS placement with low complexity. Extensive experiments demonstrate that the proposed RAISE framework consistently outperforms existing benchmarks.
△ Less
Submitted 22 March, 2025;
originally announced March 2025.
-
Exploring the Potential of Large Multimodal Models as Effective Alternatives for Pronunciation Assessment
Authors:
Ke Wang,
Lei He,
Kun Liu,
Yan Deng,
Wenning Wei,
Sheng Zhao
Abstract:
Large Multimodal Models (LMMs) have demonstrated exceptional performance across a wide range of domains. This paper explores their potential in pronunciation assessment tasks, with a particular focus on evaluating the capabilities of the Generative Pre-trained Transformer (GPT) model, specifically GPT-4o. Our study investigates its ability to process speech and audio for pronunciation assessment a…
▽ More
Large Multimodal Models (LMMs) have demonstrated exceptional performance across a wide range of domains. This paper explores their potential in pronunciation assessment tasks, with a particular focus on evaluating the capabilities of the Generative Pre-trained Transformer (GPT) model, specifically GPT-4o. Our study investigates its ability to process speech and audio for pronunciation assessment across multiple levels of granularity and dimensions, with an emphasis on feedback generation and scoring. For our experiments, we use the publicly available Speechocean762 dataset. The evaluation focuses on two key aspects: multi-level scoring and the practicality of the generated feedback. Scoring results are compared against the manual scores provided in the Speechocean762 dataset, while feedback quality is assessed using Large Language Models (LLMs). The findings highlight the effectiveness of integrating LMMs with traditional methods for pronunciation assessment, offering insights into the model's strengths and identifying areas for further improvement.
△ Less
Submitted 14 March, 2025;
originally announced March 2025.
-
Establishment and Solution of a Multi-Stage Decision Model Based on Hypothesis Testing and Dynamic Programming Algorithm
Authors:
Ziyang Liu,
Yurui Hu,
Yihan Deng
Abstract:
This paper introduces a novel multi-stage decision-making model that integrates hypothesis testing and dynamic programming algorithms to address complex decision-making scenarios.Initially,we develop a sampling inspection scheme that controls for both Type I and Type II errors using a simple random sampling method without replacement,ensuring the randomness and representativeness of the sample whi…
▽ More
This paper introduces a novel multi-stage decision-making model that integrates hypothesis testing and dynamic programming algorithms to address complex decision-making scenarios.Initially,we develop a sampling inspection scheme that controls for both Type I and Type II errors using a simple random sampling method without replacement,ensuring the randomness and representativeness of the sample while minimizing selection bias.Through the application of hypothesis testing theory,a hypothesis testing model concerning the defect rate is established,and formulas for the approximate distribution of the sample defect rate and the minimum sample size required under two different scenarios are derived. Subsequently,a multi-stage dynamic programming decision model is constructed.This involves defining the state transition functions and stage-specific objective functions,followed by obtaining six optimal decision strategies under various conditions through backward recursion.The results demonstrate the model's potent capability for multi-stage decision-making and its high interpretability,offering significant advantages in practical applications.
△ Less
Submitted 3 March, 2025;
originally announced March 2025.
-
Goal-Oriented Semantic Communication for Wireless Video Transmission via Generative AI
Authors:
Nan Li,
Yansha Deng,
Dusit Niyato
Abstract:
Efficient video transmission is essential for seamless communication and collaboration within the visually-driven digital landscape. To achieve low latency and high-quality video transmission over a bandwidth-constrained noisy wireless channel, we propose a stable diffusion (SD)-based goal-oriented semantic communication (GSC) framework. In this framework, we first design a semantic encoder that e…
▽ More
Efficient video transmission is essential for seamless communication and collaboration within the visually-driven digital landscape. To achieve low latency and high-quality video transmission over a bandwidth-constrained noisy wireless channel, we propose a stable diffusion (SD)-based goal-oriented semantic communication (GSC) framework. In this framework, we first design a semantic encoder that effectively identify the keyframes from video and extract the relevant semantic information (SI) to reduce the transmission data size. We then develop a semantic decoder to reconstruct the keyframes from the received SI and further generate the full video from the reconstructed keyframes using frame interpolation to ensure high-quality reconstruction. Recognizing the impact of wireless channel noise on SI transmission, we also propose an SD-based denoiser for GSC (SD-GSC) condition on an instantaneous channel gain to remove the channel noise from the received noisy SI under a known channel. For scenarios with an unknown channel, we further propose a parallel SD denoiser for GSC (PSD-GSC) to jointly learn the distribution of channel gains and denoise the received SI. It is shown that, with the known channel, our proposed SD-GSC outperforms state-of-the-art ADJSCC, Latent-Diff DNSC, DeepWiVe and DVST, improving Peak Signal-to-Noise Ratio (PSNR) by 69%, 58%, 33% and 38%, reducing mean squared error (MSE) by 52%, 50%, 41% and 45%, and reducing Fréchet Video Distance (FVD) by 38%, 32%, 22% and 24%, respectively. With the unknown channel, our PSD-GSC achieves a 17% improvement in PSNR, a 29% reduction in MSE, and a 19% reduction in FVD compared to MMSE equalizer-enhanced SD-GSC. These significant performance improvements demonstrate the robustness and superiority of our proposed methods in enhancing video transmission quality and efficiency under various channel conditions.
△ Less
Submitted 28 February, 2025;
originally announced February 2025.
-
OpenEarthSensing: Large-Scale Fine-Grained Benchmark for Open-World Remote Sensing
Authors:
Xiang Xiang,
Zhuo Xu,
Yao Deng,
Qinhao Zhou,
Yifan Liang,
Ke Chen,
Qingfang Zheng,
Yaowei Wang,
Xilin Chen,
Wen Gao
Abstract:
In open-world remote sensing, deployed models must continuously adapt to a steady influx of new data, which often exhibits various shifts compared to what the model encountered during the training phase. To effectively handle the new data, models are required to detect semantic shifts, adapt to covariate shifts, and continuously update themselves. These challenges give rise to a variety of open-wo…
▽ More
In open-world remote sensing, deployed models must continuously adapt to a steady influx of new data, which often exhibits various shifts compared to what the model encountered during the training phase. To effectively handle the new data, models are required to detect semantic shifts, adapt to covariate shifts, and continuously update themselves. These challenges give rise to a variety of open-world tasks. However, existing open-world remote sensing studies typically train and test within a single dataset to simulate open-world conditions. Currently, there is a lack of large-scale benchmarks capable of evaluating multiple open-world tasks. In this paper, we introduce OpenEarthSensing, a large-scale fine-grained benchmark for open-world remote sensing. OpenEarthSensing includes 189 scene and objects categories, covering the vast majority of potential semantic shifts that may occur in the real world. Additionally, OpenEarthSensing encompasses five data domains with significant covariate shifts, including two RGB satellite domians, one RGB aerial domian, one MS RGB domian, and one infrared domian. The various domains provide a more comprehensive testbed for evaluating the generalization performance of open-world models. We conduct the baseline evaluation of current mainstream open-world tasks and methods on OpenEarthSensing, demonstrating that it serves as a challenging benchmark for open-world remote sensing.
△ Less
Submitted 27 February, 2025;
originally announced February 2025.
-
When xURLLC Meets NOMA: A Stochastic Network Calculus Perspective
Authors:
Yuang Chen,
Hancheng Lu,
Langtin Qin,
Yansha Deng,
Arumugam Nallanathan
Abstract:
The advent of next-generation ultra-reliable and low-latency communications (xURLLC) presents stringent and unprecedented requirements for key performance indicators (KPIs). As a disruptive technology, non-orthogonal multiple access (NOMA) harbors the potential to fulfill these stringent KPIs essential for xURLLC. However, the immaturity of research on the tail distributions of these KPIs signific…
▽ More
The advent of next-generation ultra-reliable and low-latency communications (xURLLC) presents stringent and unprecedented requirements for key performance indicators (KPIs). As a disruptive technology, non-orthogonal multiple access (NOMA) harbors the potential to fulfill these stringent KPIs essential for xURLLC. However, the immaturity of research on the tail distributions of these KPIs significantly impedes the application of NOMA to xURLLC. Stochastic network calculus (SNC), as a potent methodology, is leveraged to provide dependable theoretical insights into tail distribution analysis and statistical QoS provisioning (SQP). In this article, we develop a NOMA-assisted uplink xURLLC network architecture that incorporates an SNC-based SQP theoretical framework (SNC-SQP) to support tail distribution analysis in terms of delay, age-of-information (AoI), and reliability. Based on SNC-SQP, an SQP-driven power optimization problem is proposed to minimize transmit power while guaranteeing xURLLC's KPIs on delay, AoI, reliability, and power consumption. Extensive simulations validate our proposed theoretical framework and demonstrate that the proposed power allocation scheme significantly reduces uplink transmit power and outperforms conventional schemes in terms of SQP performance.
△ Less
Submitted 11 January, 2025;
originally announced January 2025.
-
A Simplified Algorithm for Joint Real-Time Synchronization, NLoS Identification, and Multi-Agent Localization
Authors:
Yili Deng,
Jie Fan,
Jiguang He,
Baojia Luo,
Miaomiao Dong,
Zhongyi Huang
Abstract:
Real-time, high-precision localization in large-scale wireless networks faces two primary challenges: clock offsets caused by network asynchrony and non-line-of-sight (NLoS) conditions. To tackle these challenges, we propose a low-complexity real-time algorithm for joint synchronization and NLoS identification-based localization. For precise synchronization, we resolve clock offsets based on accum…
▽ More
Real-time, high-precision localization in large-scale wireless networks faces two primary challenges: clock offsets caused by network asynchrony and non-line-of-sight (NLoS) conditions. To tackle these challenges, we propose a low-complexity real-time algorithm for joint synchronization and NLoS identification-based localization. For precise synchronization, we resolve clock offsets based on accumulated time-of-arrival measurements from all the past time instances, modeling it as a large-scale linear least squares (LLS) problem. To alleviate the high computational burden of solving this LLS, we introduce the blockwise recursive Moore-Penrose inverse (BRMP) technique, a generalized recursive least squares approach, and derive a simplified formulation of BRMP tailored specifically for the real-time synchronization problem. Furthermore, we formulate joint NLoS identification and localization as a robust least squares regression (RLSR) problem and address it by using an efficient iterative approach. Simulations show that the proposed algorithm achieves sub-nanosecond synchronization accuracy and centimeter-level localization precision, while maintaining low computational overhead.
△ Less
Submitted 17 December, 2024;
originally announced December 2024.
-
MorphiNet: A Graph Subdivision Network for Adaptive Bi-ventricle Surface Reconstruction
Authors:
Yu Deng,
Yiyang Xu,
Linglong Qian,
Charlene Mauger,
Anastasia Nasopoulou,
Steven Williams,
Michelle Williams,
Steven Niederer,
David Newby,
Andrew McCulloch,
Jeff Omens,
Kuberan Pushprajah,
Alistair Young
Abstract:
Cardiac Magnetic Resonance (CMR) imaging is widely used for heart modelling and digital twin computational analysis due to its ability to visualize soft tissues and capture dynamic functions. However, the anisotropic nature of CMR images, characterized by large inter-slice distances and misalignments from cardiac motion, poses significant challenges to accurate model reconstruction. These limitati…
▽ More
Cardiac Magnetic Resonance (CMR) imaging is widely used for heart modelling and digital twin computational analysis due to its ability to visualize soft tissues and capture dynamic functions. However, the anisotropic nature of CMR images, characterized by large inter-slice distances and misalignments from cardiac motion, poses significant challenges to accurate model reconstruction. These limitations result in data loss and measurement inaccuracies, hindering the capture of detailed anatomical structures. This study introduces MorphiNet, a novel network that enhances heart model reconstruction by leveraging high-resolution Computer Tomography (CT) images, unpaired with CMR images, to learn heart anatomy. MorphiNet encodes anatomical structures as gradient fields, transforming template meshes into patient-specific geometries. A multi-layer graph subdivision network refines these geometries while maintaining dense point correspondence. The proposed method achieves high anatomy fidelity, demonstrating approximately 40% higher Dice scores, half the Hausdorff distance, and around 3 mm average surface error compared to state-of-the-art methods. MorphiNet delivers superior results with greater inference efficiency. This approach represents a significant advancement in addressing the challenges of CMR-based heart model reconstruction, potentially improving digital twin computational analyses of cardiac structure and functions.
△ Less
Submitted 14 December, 2024;
originally announced December 2024.
-
A reconfigurable calibration-free digital-to-time converter based on a high-speed transceiver
Authors:
Dexuan Kong,
Zaiming Fu,
Yujie Deng,
Ruiqi Wang
Abstract:
This paper proposes a high-speed transceiver-based method for implementing a digital-to-time converter (DTC). A real-time decoding technique is introduced to inject time information into high-speed pattern data. The stability of the high-speed clock ensures the high precision of the synthesized timing signal without the need for calibration. The reconfigurability of the clock resources provides th…
▽ More
This paper proposes a high-speed transceiver-based method for implementing a digital-to-time converter (DTC). A real-time decoding technique is introduced to inject time information into high-speed pattern data. The stability of the high-speed clock ensures the high precision of the synthesized timing signal without the need for calibration. The reconfigurability of the clock resources provides the DTC with variable resolution and enhanced flexibility for various applications. Based on this approach, a multifunctional DTC is designed to offer both timing sequence and random timing signal functionalities, catering to a wide range of application scenarios. The timing sequence function generates a continuously variable timing signal stream, while the random timing signal function produces random signals with uniformly distributed time intervals. Experimental results, using a Xilinx Kintex-7 FPGA, validate the effectiveness of the proposed methodology. The system achieves a resolution of 100 ps, a dynamic range from 1 ns to 40 μs, a DNL of -0.02/0.02 LSB, an INL of -0.04/0.03 LSB across the entire range. This approach can be readily adapted to various high-precision timing signal applications.
△ Less
Submitted 10 December, 2024; v1 submitted 9 December, 2024;
originally announced December 2024.
-
Ensuring Safety in Target Pursuit Control: A CBF-Safe Reinforcement Learning Approach
Authors:
Yaosheng Deng,
Junjie Gao,
Jiaping Xiao,
Mir Feroskhan
Abstract:
This paper addresses the target-pursuit problem, aiming to ensure each pursuer's safety regarding collision avoidance, sensing range, and input saturation. An input-constrained CBF is proposed to dynamically regulate the pursuer's control, ensuring effective target pursuit even when the target performs evasive maneuvers. To further ensure safety, two sets of CBF constraints are designed to regulat…
▽ More
This paper addresses the target-pursuit problem, aiming to ensure each pursuer's safety regarding collision avoidance, sensing range, and input saturation. An input-constrained CBF is proposed to dynamically regulate the pursuer's control, ensuring effective target pursuit even when the target performs evasive maneuvers. To further ensure safety, two sets of CBF constraints are designed to regulate the pursuer's position, enabling it to keep the target within the sensing range while avoiding collision in complex environments with external disturbances. These three CBFs collectively form our safety filter, which filters unsafe outputs from RL by solving a Quadratic Program (QP). Finally, the safety filter, combined with a switch strategy that enhances the feasibility of solving its QP, constitutes the Control Barrier Function (CBF)-Safe Reinforcement Learning (CSRL) algorithm, whose solutions are proven to satisfy the Karush-Kuhn-Tucker (KKT) conditions for all safety constraints. Simulation results validate the effectiveness of the CSRL algorithm, demonstrating its ability to handle complex pursuit scenarios while maintaining safety and improving control performance.
△ Less
Submitted 10 December, 2024; v1 submitted 26 November, 2024;
originally announced November 2024.
-
Goal-oriented Semantic Communications for Metaverse Construction via Generative AI and Optimal Transport
Authors:
Zhe Wang,
Nan Li,
Yansha Deng,
A. Hamid Aghvami
Abstract:
The emergence of the metaverse has boosted productivity and creativity, driving real-time updates and personalized content, which will substantially increase data traffic. However, current bit-oriented communication networks struggle to manage this high volume of dynamic information, restricting metaverse applications interactivity. To address this research gap, we propose a goal-oriented semantic…
▽ More
The emergence of the metaverse has boosted productivity and creativity, driving real-time updates and personalized content, which will substantially increase data traffic. However, current bit-oriented communication networks struggle to manage this high volume of dynamic information, restricting metaverse applications interactivity. To address this research gap, we propose a goal-oriented semantic communication (GSC) framework for metaverse. Building on an existing metaverse wireless construction task, our proposed GSC framework includes an hourglass network-based (HgNet) encoder to extract semantic information of objects in the metaverse; and a semantic decoder that uses this extracted information to reconstruct the metaverse content after wireless transmission, enabling efficient communication and real-time object behaviour updates to the scenery for metaverse construction task. To overcome the wireless channel noise at the receiver, we design an optimal transport (OT)-enabled semantic denoiser, which enhances the accuracy of metaverse scenery through wireless communication. Experimental results show that compared to the conventional metaverse construction, our proposed GSC framework significantly reduces wireless metaverse construction latency by 92.6\%, while improving metaverse object status accuracy and viewing experience by 45.6\% and 44.7\%, respectively.
△ Less
Submitted 30 March, 2025; v1 submitted 25 November, 2024;
originally announced November 2024.
-
Attentive Contextual Attention for Cloud Removal
Authors:
Wenli Huang,
Ye Deng,
Yang Wu,
Jinjun Wang
Abstract:
Cloud cover can significantly hinder the use of remote sensing images for Earth observation, prompting urgent advancements in cloud removal technology. Recently, deep learning strategies have shown strong potential in restoring cloud-obscured areas. These methods utilize convolution to extract intricate local features and attention mechanisms to gather long-range information, improving the overall…
▽ More
Cloud cover can significantly hinder the use of remote sensing images for Earth observation, prompting urgent advancements in cloud removal technology. Recently, deep learning strategies have shown strong potential in restoring cloud-obscured areas. These methods utilize convolution to extract intricate local features and attention mechanisms to gather long-range information, improving the overall comprehension of the scene. However, a common drawback of these approaches is that the resulting images often suffer from blurriness, artifacts, and inconsistencies. This is partly because attention mechanisms apply weights to all features based on generalized similarity scores, which can inadvertently introduce noise and irrelevant details from cloud-covered areas. To overcome this limitation and better capture relevant distant context, we introduce a novel approach named Attentive Contextual Attention (AC-Attention). This method enhances conventional attention mechanisms by dynamically learning data-driven attentive selection scores, enabling it to filter out noise and irrelevant features effectively. By integrating the AC-Attention module into the DSen2-CR cloud removal framework, we significantly improve the model's ability to capture essential distant information, leading to more effective cloud removal. Our extensive evaluation of various datasets shows that our method outperforms existing ones regarding image reconstruction quality. Additionally, we conducted ablation studies by integrating AC-Attention into multiple existing methods and widely used network architectures. These studies demonstrate the effectiveness and adaptability of AC-Attention and reveal its ability to focus on relevant features, thereby improving the overall performance of the networks. The code is available at \url{https://github.com/huangwenwenlili/ACA-CRNet}.
△ Less
Submitted 20 November, 2024;
originally announced November 2024.
-
Goal-oriented Semantic Communication for Robot Arm Reconstruction in Digital Twin: Feature and Temporal Selections
Authors:
Shutong Chen,
Emmanouil Spyrakos-Papastavridis,
Yichao Jin,
Yansha Deng
Abstract:
As one of the most promising technologies in industry, the Digital Twin (DT) facilitates real-time monitoring and predictive analysis for real-world systems by precisely reconstructing virtual replicas of physical entities. However, this reconstruction faces unprecedented challenges due to the everincreasing communication overhead, especially for digital robot arm reconstruction. To this end, we p…
▽ More
As one of the most promising technologies in industry, the Digital Twin (DT) facilitates real-time monitoring and predictive analysis for real-world systems by precisely reconstructing virtual replicas of physical entities. However, this reconstruction faces unprecedented challenges due to the everincreasing communication overhead, especially for digital robot arm reconstruction. To this end, we propose a novel goal-oriented semantic communication (GSC) framework to extract the GSC information for the robot arm reconstruction task in the DT, with the aim of minimising the communication load under the strict and relaxed reconstruction error constraints. Unlike the traditional reconstruction framework that periodically transmits a reconstruction message for real-time DT reconstruction, our framework implements a feature selection (FS) algorithm to extract the semantic information from the reconstruction message, and a deep reinforcement learning-based temporal selection algorithm to selectively transmit the semantic information over time. We validate our proposed GSC framework through both Pybullet simulations and lab experiments based on the Franka Research 3 robot arm. For a range of distinct robotic tasks, simulation results show that our framework can reduce the communication load by at least 59.5% under strict reconstruction error constraints and 80% under relaxed reconstruction error constraints, compared with traditional communication framework. Also, experimental results confirm the effectiveness of our framework, where the communication load is reduced by 53% in strict constraint case and 74% in relaxed constraint case. The demo is available at: https://youtu.be/2OdeHKxcgnk.
△ Less
Submitted 13 November, 2024;
originally announced November 2024.
-
Goal-Oriented Semantic Communication for Wireless Visual Question Answering
Authors:
Sige Liu,
Nan Li,
Yansha Deng,
Tony Q. S. Quek
Abstract:
The rapid progress of artificial intelligence (AI) and computer vision (CV) has facilitated the development of computation-intensive applications like Visual Question Answering (VQA), which integrates visual perception and natural language processing to generate answers. To overcome the limitations of traditional VQA constrained by local computation resources, edge computing has been incorporated…
▽ More
The rapid progress of artificial intelligence (AI) and computer vision (CV) has facilitated the development of computation-intensive applications like Visual Question Answering (VQA), which integrates visual perception and natural language processing to generate answers. To overcome the limitations of traditional VQA constrained by local computation resources, edge computing has been incorporated to provide extra computation capability at the edge side. Meanwhile, this brings new communication challenges between the local and edge, including limited bandwidth, channel noise, and multipath effects, which degrade VQA performance and user quality of experience (QoE), particularly during the transmission of large high-resolution images. To overcome these bottlenecks, we propose a goal-oriented semantic communication (GSC) framework that focuses on effectively extracting and transmitting semantic information most relevant to the VQA goals, improving the answering accuracy and enhancing the effectiveness and efficiency. The objective is to maximize the answering accuracy, and we propose a bounding box (BBox)-based image semantic extraction and ranking approach to prioritize the semantic information based on the goal of questions. We then extend it by incorporating a scene graphs (SG)-based approach to handle questions with complex relationships. Experimental results demonstrate that our GSC framework improves answering accuracy by up to 49% under AWGN channels and 59% under Rayleigh channels while reducing total latency by up to 65% compared to traditional bit-oriented transmission.
△ Less
Submitted 27 November, 2024; v1 submitted 3 November, 2024;
originally announced November 2024.
-
Equilibrium Adaptation-Based Control for Track Stand of Single-Track Two-Wheeled Robots
Authors:
Boyi Wang,
Yang Deng,
Feilong Jing,
Yiyong Sun,
Zhang Chen,
Bin Liang
Abstract:
Stationary balance control is challenging for single-track two-wheeled (STTW) robots due to the lack of elegant balancing mechanisms and the conflict between the limited attraction domain and external disturbances. To address the absence of balancing mechanisms, we draw inspiration from cyclists and leverage the track stand maneuver, which relies solely on steering and rear-wheel actuation. To ach…
▽ More
Stationary balance control is challenging for single-track two-wheeled (STTW) robots due to the lack of elegant balancing mechanisms and the conflict between the limited attraction domain and external disturbances. To address the absence of balancing mechanisms, we draw inspiration from cyclists and leverage the track stand maneuver, which relies solely on steering and rear-wheel actuation. To achieve accurate tracking in the presence of matched and mismatched disturbances, we propose an equilibrium adaptation-based control (EABC) scheme that can be seamlessly integrated with standard disturbance observers and controllers. This scheme enables adaptation to slow-varying disturbances by utilizing a disturbed equilibrium estimator, effectively handling both matched and mismatched disturbances in a unified manner while ensuring accurate tracking with zero steady-state error. We integrate the EABC scheme with nonlinear model predictive control (MPC) for the track stand of STTW robots and validate its effectiveness through two experimental scenarios. Our method demonstrates significant improvements in tracking accuracy, reducing errors by several orders of magnitude.
△ Less
Submitted 7 November, 2024; v1 submitted 25 October, 2024;
originally announced October 2024.
-
IC3M: In-Car Multimodal Multi-object Monitoring for Abnormal Status of Both Driver and Passengers
Authors:
Zihan Fang,
Zheng Lin,
Senkang Hu,
Hangcheng Cao,
Yiqin Deng,
Xianhao Chen,
Yuguang Fang
Abstract:
Recently, in-car monitoring has emerged as a promising technology for detecting early-stage abnormal status of the driver and providing timely alerts to prevent traffic accidents. Although training models with multimodal data enhances the reliability of abnormal status detection, the scarcity of labeled data and the imbalance of class distribution impede the extraction of critical abnormal state f…
▽ More
Recently, in-car monitoring has emerged as a promising technology for detecting early-stage abnormal status of the driver and providing timely alerts to prevent traffic accidents. Although training models with multimodal data enhances the reliability of abnormal status detection, the scarcity of labeled data and the imbalance of class distribution impede the extraction of critical abnormal state features, significantly deteriorating training performance. Furthermore, missing modalities due to environment and hardware limitations further exacerbate the challenge of abnormal status identification. More importantly, monitoring abnormal health conditions of passengers, particularly in elderly care, is of paramount importance but remains underexplored. To address these challenges, we introduce our IC3M, an efficient camera-rotation-based multimodal framework for monitoring both driver and passengers in a car. Our IC3M comprises two key modules: an adaptive threshold pseudo-labeling strategy and a missing modality reconstruction. The former customizes pseudo-labeling thresholds for different classes based on the class distribution, generating class-balanced pseudo labels to guide model training effectively, while the latter leverages crossmodality relationships learned from limited labels to accurately recover missing modalities by distribution transferring from available modalities. Extensive experimental results demonstrate that IC3M outperforms state-of-the-art benchmarks in accuracy, precision, and recall while exhibiting superior robustness under limited labeled data and severe missing modality.
△ Less
Submitted 21 November, 2024; v1 submitted 3 October, 2024;
originally announced October 2024.
-
Seed-Music: A Unified Framework for High Quality and Controlled Music Generation
Authors:
Ye Bai,
Haonan Chen,
Jitong Chen,
Zhuo Chen,
Yi Deng,
Xiaohong Dong,
Lamtharn Hantrakul,
Weituo Hao,
Qingqing Huang,
Zhongyi Huang,
Dongya Jia,
Feihu La,
Duc Le,
Bochen Li,
Chumin Li,
Hui Li,
Xingxing Li,
Shouda Liu,
Wei-Tsung Lu,
Yiqing Lu,
Andrew Shaw,
Janne Spijkervet,
Yakun Sun,
Bo Wang,
Ju-Chiang Wang
, et al. (13 additional authors not shown)
Abstract:
We introduce Seed-Music, a suite of music generation systems capable of producing high-quality music with fine-grained style control. Our unified framework leverages both auto-regressive language modeling and diffusion approaches to support two key music creation workflows: controlled music generation and post-production editing. For controlled music generation, our system enables vocal music gene…
▽ More
We introduce Seed-Music, a suite of music generation systems capable of producing high-quality music with fine-grained style control. Our unified framework leverages both auto-regressive language modeling and diffusion approaches to support two key music creation workflows: controlled music generation and post-production editing. For controlled music generation, our system enables vocal music generation with performance controls from multi-modal inputs, including style descriptions, audio references, musical scores, and voice prompts. For post-production editing, it offers interactive tools for editing lyrics and vocal melodies directly in the generated audio.
We encourage readers to listen to demo audio examples at https://team.doubao.com/seed-music "https://team.doubao.com/seed-music".
△ Less
Submitted 19 September, 2024; v1 submitted 13 September, 2024;
originally announced September 2024.
-
Smart CSI Processing for Accruate Commodity WiFi-based Humidity Sensing
Authors:
Yirui Deng,
Deepak Mishra,
Shaghik Atakaramians,
Aruna Seneviratne
Abstract:
Indoor humidity is a crucial factor affecting people's health and well-being. Wireless humidity sensing techniques are scalable and low-cost, making them a promising solution for measuring humidity in indoor environments without requiring additional devices. Such, machine learning (ML) assisted WiFi sensing is being envisioned as the key enabler for integrated sensing and communication (ISAC). How…
▽ More
Indoor humidity is a crucial factor affecting people's health and well-being. Wireless humidity sensing techniques are scalable and low-cost, making them a promising solution for measuring humidity in indoor environments without requiring additional devices. Such, machine learning (ML) assisted WiFi sensing is being envisioned as the key enabler for integrated sensing and communication (ISAC). However, the current WiFi-based sensing systems, such as WiHumidity, suffer from low accuracy. We propose an enhanced WiFi-based humidity detection framework to address this issue that utilizes innovative filtering and data processing techniques to exploit humidity-specific channel state information (CSI) signatures during RF sensing. These signals are then fed into ML algorithms for detecting different humidity levels. Specifically, our improved de-noising solution for the CSI captured by commodity hardware for WiFi sensing, combined with the k-th nearest neighbour ML algorithm and resolution tuning technique, helps improve humidity sensing accuracy. Our commercially available hardware-based experiments provide insights into achievable sensing resolution. Our empirical investigation shows that our enhanced framework can improve the accuracy of humidity sensing to 97%.
△ Less
Submitted 12 September, 2024;
originally announced September 2024.
-
Two-Stage Hierarchical and Explainable Feature Selection Framework for Dimensionality Reduction in Sleep Staging
Authors:
Yangfan Deng,
Hamad Albidah,
Ahmed Dallal,
Jijun Yin,
Zhi-Hong Mao
Abstract:
Sleep is crucial for human health, and EEG signals play a significant role in sleep research. Due to the high-dimensional nature of EEG signal data sequences, data visualization and clustering of different sleep stages have been challenges. To address these issues, we propose a two-stage hierarchical and explainable feature selection framework by incorporating a feature selection algorithm to impr…
▽ More
Sleep is crucial for human health, and EEG signals play a significant role in sleep research. Due to the high-dimensional nature of EEG signal data sequences, data visualization and clustering of different sleep stages have been challenges. To address these issues, we propose a two-stage hierarchical and explainable feature selection framework by incorporating a feature selection algorithm to improve the performance of dimensionality reduction. Inspired by topological data analysis, which can analyze the structure of high-dimensional data, we extract topological features from the EEG signals to compensate for the structural information loss that happens in traditional spectro-temporal data analysis. Supported by the topological visualization of the data from different sleep stages and the classification results, the proposed features are proven to be effective supplements to traditional features. Finally, we compare the performances of three dimensionality reduction algorithms: Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), and Uniform Manifold Approximation and Projection (UMAP). Among them, t-SNE achieved the highest accuracy of 79.8%, but considering the overall performance in terms of computational resources and metrics, UMAP is the optimal choice.
△ Less
Submitted 31 August, 2024;
originally announced September 2024.
-
Safe Adaptive Control for Uncertain Systems with Complex Input Constraints
Authors:
Yaosheng Deng,
Yang Bai,
Yujie Wang,
Masaki Ogura,
Mir Feroskhan
Abstract:
In this paper, we propose a novel adaptive Control Barrier Function (CBF) based controller for nonlinear systems with complex, time-varying input constraints. Conventional CBF approaches often struggle with feasibility issues and stringent assumptions when addressing input constraints. Unlike these methods, our approach converts the input-constraint problem into an output-constraint CBF design. Th…
▽ More
In this paper, we propose a novel adaptive Control Barrier Function (CBF) based controller for nonlinear systems with complex, time-varying input constraints. Conventional CBF approaches often struggle with feasibility issues and stringent assumptions when addressing input constraints. Unlike these methods, our approach converts the input-constraint problem into an output-constraint CBF design. This transformation simplifies the Quadratic Programming (QP) formulation and enhances compatibility with the CBF framework. We design an adaptive CBF-based controller to manage the mismatched uncertainties introduced by this transformation. Our method systematically addresses the challenges of complex, time-varying, and state-dependent input constraints. The efficacy of the proposed approach is validated using numerical examples.
△ Less
Submitted 18 August, 2024;
originally announced August 2024.
-
Goal-Oriented UAV Communication Design and Optimization for Target Tracking: A MachineLearning Approach
Authors:
Wenchao Wu,
Yanning Wu,
Yuanqing Yang,
Yansha Deng
Abstract:
To accomplish various tasks, safe and smooth control of unmanned aerial vehicles (UAVs) needs to be guaranteed, which cannot be met by existing ultra-reliable low latency communications (URLLC). This has attracted the attention of the communication field, where most existing work mainly focused on optimizing communication performance (i.e., delay) and ignored the performance of the task (i.e., tra…
▽ More
To accomplish various tasks, safe and smooth control of unmanned aerial vehicles (UAVs) needs to be guaranteed, which cannot be met by existing ultra-reliable low latency communications (URLLC). This has attracted the attention of the communication field, where most existing work mainly focused on optimizing communication performance (i.e., delay) and ignored the performance of the task (i.e., tracking accuracy). To explore the effectiveness of communication in completing a task, in this letter, we propose a goal-oriented communication framework adopting a deep reinforcement learning (DRL) algorithm with a proactive repetition scheme (DeepP) to optimize C&C data selection and the maximum number of repetitions in a real-time target tracking task, where a base station (BS) controls a UAV to track a mobile target. The effectiveness of our proposed approach is validated by comparing it with the traditional proportional integral derivative (PID) algorithm.
△ Less
Submitted 8 August, 2024;
originally announced August 2024.
-
Goal-oriented Semantic Communication for the Metaverse Application
Authors:
Zhe Wang,
Nan Li,
Yansha Deng
Abstract:
With the emergence of the metaverse and its role in enabling real-time simulation and analysis of real-world counterparts, an increasing number of personalized metaverse scenarios are being created to influence entertainment experiences and social behaviors. However, compared to traditional image and video entertainment applications, the exact transmission of the vast amount of metaverse-associate…
▽ More
With the emergence of the metaverse and its role in enabling real-time simulation and analysis of real-world counterparts, an increasing number of personalized metaverse scenarios are being created to influence entertainment experiences and social behaviors. However, compared to traditional image and video entertainment applications, the exact transmission of the vast amount of metaverse-associated information significantly challenges the capacity of existing bit-oriented communication networks. Moreover, the current metaverse also witnesses a growing goal shift for transmitting the meaning behind custom-designed content, such as user-designed buildings and avatars, rather than exact copies of physical objects. To meet this growing goal shift and bandwidth challenge, this paper proposes a goal-oriented semantic communication framework for metaverse application (GSCM) to explore and define semantic information through the goal levels. Specifically, we first analyze the traditional image communication framework in metaverse construction and then detail our proposed semantic information along with the end-to-end wireless communication. We then describe the designed modules of the GSCM framework, including goal-oriented semantic information extraction, base knowledge definition, and neural radiance field (NeRF) based metaverse construction. Finally, numerous experiments have been conducted to demonstrate that, compared to image communication, our proposed GSCM framework decreases transmission latency by up to 92.6% and enhances the virtual object operation accuracy and metaverse construction clearance by up to 45.6% and 44.7%, respectively.
△ Less
Submitted 7 August, 2024;
originally announced August 2024.
-
Goal-Oriented Semantic Communication for Wireless Image Transmission via Stable Diffusion
Authors:
Nan Li,
Yansha Deng
Abstract:
Efficient image transmission is essential for seamless communication and collaboration within the visually-driven digital landscape. To achieve low latency and high-quality image reconstruction over a bandwidth-constrained noisy wireless channel, we propose a stable diffusion (SD)-based goal-oriented semantic communication (GSC) framework. In this framework, we design a semantic autoencoder that e…
▽ More
Efficient image transmission is essential for seamless communication and collaboration within the visually-driven digital landscape. To achieve low latency and high-quality image reconstruction over a bandwidth-constrained noisy wireless channel, we propose a stable diffusion (SD)-based goal-oriented semantic communication (GSC) framework. In this framework, we design a semantic autoencoder that effectively extracts semantic information (SI) from images to reduce the transmission data size while ensuring high-quality reconstruction. Recognizing the impact of wireless channel noise on SI transmission, we propose an SD-based denoiser for GSC (SD-GSC) conditional on an instantaneous channel gain to remove the channel noise from the received noisy SI under known channel. For scenarios with unknown channel, we further propose a parallel SD denoiser for GSC (PSD-GSC) to jointly learn the distribution of channel gains and denoise the received SI. It is shown that, with the known channel, our SD-GSC outperforms state-of-the-art ADJSCC and Latent-Diff DNSC, improving Peak Signal-to-Noise Ratio (PSNR) by 32% and 21%, and reducing Fréchet Inception Distance (FID) by 40% and 35%, respectively. With the unknown channel, our PSD-GSC improves PSNR by 8% and reduces FID by 17% compared to MMSE equalizer-enhanced SD-GSC.
△ Less
Submitted 28 February, 2025; v1 submitted 1 August, 2024;
originally announced August 2024.
-
Task-oriented and Semantics-aware Communications for Augmented Reality
Authors:
Zhe Wang,
Yansha Deng
Abstract:
Upon the advent of the emerging metaverse and its related applications in Augmented Reality (AR), the current bit-oriented network struggles to support real-time changes for the vast amount of associated information, creating a significant bottleneck in its development. To address the above problem, we present a novel task-oriented and semantics-aware communication framework for augmented reality…
▽ More
Upon the advent of the emerging metaverse and its related applications in Augmented Reality (AR), the current bit-oriented network struggles to support real-time changes for the vast amount of associated information, creating a significant bottleneck in its development. To address the above problem, we present a novel task-oriented and semantics-aware communication framework for augmented reality (TSAR) to enhance communication efficiency and effectiveness significantly. We first present an analysis of the traditional wireless AR point cloud communication framework, followed by a detailed summary of our proposed semantic information extraction within the end-to-end communication. Then, we detail the components of the TSAR framework, incorporating semantics extraction with deep learning, task-oriented base knowledge selection, and avatar pose recovery. Through rigorous experimentation, we demonstrate that our proposed TSAR framework considerably outperforms traditional point cloud communication framework, reducing wireless AR application transmission latency by 95.6% and improving communication effectiveness in geometry and color aspects by up to 82.4% and 20.4%, respectively.
△ Less
Submitted 1 August, 2024;
originally announced August 2024.
-
A Holistic Optimization Framework for Energy Efficient UAV-assisted Fog Computing: Attitude Control, Trajectory Planning and Task Assignment
Authors:
Shuaijun Liu,
Jinqiu Du,
Yaxin Zheng,
Jiaying Yin,
Yuhui Deng,
Jingjin Wu
Abstract:
Unmanned Aerial Vehicles (UAVs) have significantly enhanced fog computing by acting as both flexible computation platforms and communication mobile relays. In this paper, we propose a holistic framework that jointly optimizes the total latency and energy consumption for UAV-assisted fog computing in a three-dimensional spatial domain with varying terrain elevations and dynamic task generations. Ou…
▽ More
Unmanned Aerial Vehicles (UAVs) have significantly enhanced fog computing by acting as both flexible computation platforms and communication mobile relays. In this paper, we propose a holistic framework that jointly optimizes the total latency and energy consumption for UAV-assisted fog computing in a three-dimensional spatial domain with varying terrain elevations and dynamic task generations. Our proposed framework considers three important and interdependent modules: attitude control, trajectory planning, and task assignment. We first establish a fuzzy proportional-integral-derivative control model to determine the UAV's attitude. Then, we propose an enhanced Ant Colony System (ACS) based algorithm, that includes a safety value and a decoupling mechanism to overcome the convergence issue in classical ACS, to compute the optimal UAV trajectory. Finally, we design an algorithm based on the Particle Swarm Optimization technique, to determine where each offloaded task should be executed. Under our proposed framework, the outcome of one module would affect the decision-making in one other, providing a holistic perspective of the system and thus leading to improved solutions. We demonstrate by extensive simulation results that our proposed framework can significantly improve the overall performance, measured by latency and energy consumption, compared to existing baseline approaches.
△ Less
Submitted 5 August, 2024; v1 submitted 20 July, 2024;
originally announced July 2024.
-
AnoPatch: Towards Better Consistency in Machine Anomalous Sound Detection
Authors:
Anbai Jiang,
Bing Han,
Zhiqiang Lv,
Yufeng Deng,
Wei-Qiang Zhang,
Xie Chen,
Yanmin Qian,
Jia Liu,
Pingyi Fan
Abstract:
Large pre-trained models have demonstrated dominant performances in multiple areas, where the consistency between pre-training and fine-tuning is the key to success. However, few works reported satisfactory results of pre-trained models for the machine anomalous sound detection (ASD) task. This may be caused by the inconsistency of the pre-trained model and the inductive bias of machine audio, res…
▽ More
Large pre-trained models have demonstrated dominant performances in multiple areas, where the consistency between pre-training and fine-tuning is the key to success. However, few works reported satisfactory results of pre-trained models for the machine anomalous sound detection (ASD) task. This may be caused by the inconsistency of the pre-trained model and the inductive bias of machine audio, resulting in inconsistency in data and architecture. Thus, we propose AnoPatch which utilizes a ViT backbone pre-trained on AudioSet and fine-tunes it on machine audio. It is believed that machine audio is more related to audio datasets than speech datasets, and modeling it from patch level suits the sparsity of machine audio. As a result, AnoPatch showcases state-of-the-art (SOTA) performances on the DCASE 2020 ASD dataset and the DCASE 2023 ASD dataset. We also compare multiple pre-trained models and empirically demonstrate that better consistency yields considerable improvement.
△ Less
Submitted 17 June, 2024;
originally announced June 2024.
-
Retrieval Augmented Generation in Prompt-based Text-to-Speech Synthesis with Context-Aware Contrastive Language-Audio Pretraining
Authors:
Jinlong Xue,
Yayue Deng,
Yingming Gao,
Ya Li
Abstract:
Recent prompt-based text-to-speech (TTS) models can clone an unseen speaker using only a short speech prompt. They leverage a strong in-context ability to mimic the speech prompts, including speaker style, prosody, and emotion. Therefore, the selection of a speech prompt greatly influences the generated speech, akin to the importance of a prompt in large language models (LLMs). However, current pr…
▽ More
Recent prompt-based text-to-speech (TTS) models can clone an unseen speaker using only a short speech prompt. They leverage a strong in-context ability to mimic the speech prompts, including speaker style, prosody, and emotion. Therefore, the selection of a speech prompt greatly influences the generated speech, akin to the importance of a prompt in large language models (LLMs). However, current prompt-based TTS models choose the speech prompt manually or simply at random. Hence, in this paper, we adapt retrieval augmented generation (RAG) from LLMs to prompt-based TTS. Unlike traditional RAG methods, we additionally consider contextual information during the retrieval process and present a Context-Aware Contrastive Language-Audio Pre-training (CA-CLAP) model to extract context-aware, style-related features. The objective and subjective evaluations demonstrate that our proposed RAG method outperforms baselines, and our CA-CLAP achieves better results than text-only retrieval methods.
△ Less
Submitted 5 June, 2024;
originally announced June 2024.
-
Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model
Authors:
Jinlong Xue,
Yayue Deng,
Yicheng Han,
Yingming Gao,
Ya Li
Abstract:
Recent advances in large language models (LLMs) and development of audio codecs greatly propel the zero-shot TTS. They can synthesize personalized speech with only a 3-second speech of an unseen speaker as acoustic prompt. However, they only support short speech prompts and cannot leverage longer context information, as required in audiobook and conversational TTS scenarios. In this paper, we intr…
▽ More
Recent advances in large language models (LLMs) and development of audio codecs greatly propel the zero-shot TTS. They can synthesize personalized speech with only a 3-second speech of an unseen speaker as acoustic prompt. However, they only support short speech prompts and cannot leverage longer context information, as required in audiobook and conversational TTS scenarios. In this paper, we introduce a novel audio codec-based TTS model to adapt context features with multiple enhancements. Inspired by the success of Qformer, we propose a multi-modal context-enhanced Qformer (MMCE-Qformer) to utilize additional multi-modal context information. Besides, we adapt a pretrained LLM to leverage its understanding ability to predict semantic tokens, and use a SoundStorm to generate acoustic tokens thereby enhancing audio quality and speaker similarity. The extensive objective and subjective evaluations show that our proposed method outperforms baselines across various context TTS scenarios.
△ Less
Submitted 5 June, 2024;
originally announced June 2024.
-
Learning Expressive Disentangled Speech Representations with Soft Speech Units and Adversarial Style Augmentation
Authors:
Yimin Deng,
Jianzong Wang,
Xulong Zhang,
Ning Cheng,
Jing Xiao
Abstract:
Voice conversion is the task to transform voice characteristics of source speech while preserving content information. Nowadays, self-supervised representation learning models are increasingly utilized in content extraction. However, in these representations, a lot of hidden speaker information leads to timbre leakage while the prosodic information of hidden units lacks use. To address these issue…
▽ More
Voice conversion is the task to transform voice characteristics of source speech while preserving content information. Nowadays, self-supervised representation learning models are increasingly utilized in content extraction. However, in these representations, a lot of hidden speaker information leads to timbre leakage while the prosodic information of hidden units lacks use. To address these issues, we propose a novel framework for expressive voice conversion called "SAVC" based on soft speech units from HuBert-soft. Taking soft speech units as input, we design an attribute encoder to extract content and prosody features respectively. Specifically, we first introduce statistic perturbation imposed by adversarial style augmentation to eliminate speaker information. Then the prosody is implicitly modeled on soft speech units with knowledge distillation. Experiment results show that the intelligibility and naturalness of converted speech outperform previous work.
△ Less
Submitted 1 May, 2024;
originally announced May 2024.
-
AI WALKUP: A Computer-Vision Approach to Quantifying MDS-UPDRS in Parkinson's Disease
Authors:
Xiang Xiang,
Zihan Zhang,
Jing Ma,
Yao Deng
Abstract:
Parkinson's Disease (PD) is the second most common neurodegenerative disorder. The existing assessment method for PD is usually the Movement Disorder Society - Unified Parkinson's Disease Rating Scale (MDS-UPDRS) to assess the severity of various types of motor symptoms and disease progression. However, manual assessment suffers from high subjectivity, lack of consistency, and high cost and low ef…
▽ More
Parkinson's Disease (PD) is the second most common neurodegenerative disorder. The existing assessment method for PD is usually the Movement Disorder Society - Unified Parkinson's Disease Rating Scale (MDS-UPDRS) to assess the severity of various types of motor symptoms and disease progression. However, manual assessment suffers from high subjectivity, lack of consistency, and high cost and low efficiency of manual communication. We want to use a computer vision based solution to capture human pose images based on a camera, reconstruct and perform motion analysis using algorithms, and extract the features of the amount of motion through feature engineering. The proposed approach can be deployed on different smartphones, and the video recording and artificial intelligence analysis can be done quickly and easily through our APP.
△ Less
Submitted 2 April, 2024;
originally announced April 2024.
-
Swarm navigation of cyborg-insects in unknown obstructed soft terrain
Authors:
Yang Bai,
Phuoc Thanh Tran Ngoc,
Huu Duoc Nguyen,
Duc Long Le,
Quang Huy Ha,
Kazuki Kai,
Yu Xiang See To,
Yaosheng Deng,
Jie Song,
Naoki Wakamiya,
Hirotaka Sato,
Masaki Ogura
Abstract:
Cyborg insects refer to hybrid robots that integrate living insects with miniature electronic controllers to enable robotic-like programmable control. These creatures exhibit advantages over conventional robots in adaption to complex terrain and sustained energy efficiency. Nevertheless, there is a lack of literature on the control of multi-cyborg systems. This research gap is due to the difficult…
▽ More
Cyborg insects refer to hybrid robots that integrate living insects with miniature electronic controllers to enable robotic-like programmable control. These creatures exhibit advantages over conventional robots in adaption to complex terrain and sustained energy efficiency. Nevertheless, there is a lack of literature on the control of multi-cyborg systems. This research gap is due to the difficulty in coordinating the movements of a cyborg system under the presence of insects' inherent individual variability in their reactions to control input. Regarding this issue, we propose a swarm navigation algorithm and verify it under experiments. This research advances swarm robotics by integrating biological organisms with control theory to develop intelligent autonomous systems for real-world applications.
△ Less
Submitted 21 December, 2024; v1 submitted 26 March, 2024;
originally announced March 2024.
-
Enhancing xURLLC with RSMA-Assisted Massive-MIMO Networks: Performance Analysis and Optimization
Authors:
Yuang Chen,
Hancheng Lu,
Chenwu Zhang,
Yansha Deng,
Arumugam Nallanathan
Abstract:
Massive interconnection has sparked people's envisioning for next-generation ultra-reliable and low-latency communications (xURLLC), prompting the design of customized next-generation advanced transceivers (NGAT). Rate-splitting multiple access (RSMA) has emerged as a pivotal technology for NGAT design, given its robustness to imperfect channel state information (CSI) and resilience to quality of…
▽ More
Massive interconnection has sparked people's envisioning for next-generation ultra-reliable and low-latency communications (xURLLC), prompting the design of customized next-generation advanced transceivers (NGAT). Rate-splitting multiple access (RSMA) has emerged as a pivotal technology for NGAT design, given its robustness to imperfect channel state information (CSI) and resilience to quality of service (QoS). Additionally, xURLLC urgently appeals to large-scale access techniques, thus massive multiple-input multiple-output (mMIMO) is anticipated to integrate with RSMA to enhance xURLLC. In this paper, we develop an innovative RSMA-assisted massive-MIMO xURLLC (RSMA-mMIMO-xURLLC) network architecture tailored to accommodate xURLLC's critical QoS constraints in finite blocklength (FBL) regimes. Leveraging uplink pilot training under imperfect CSI at the transmitter, we estimate channel gains and customize linear precoders for efficient downlink short-packet data transmission. Subsequently, we formulate a joint rate-splitting, beamforming, and transmit antenna selection optimization problem to maximize the total effective transmission rate (ETR). Addressing this multi-variable coupled non-convex problem, we decompose it into three corresponding subproblems and propose a low-complexity joint iterative algorithm for efficient optimization. Extensive simulations substantiate that compared with non-orthogonal multiple access (NOMA) and space division multiple access (SDMA), the developed architecture improves the total ETR by 15.3% and 41.91%, respectively, as well as accommodates larger-scale access.
△ Less
Submitted 25 February, 2024;
originally announced February 2024.
-
Federated Reinforcement Learning for Uplink Centric Broadband Communication Optimization over Unlicensed Spectrum
Authors:
Hui Zhou,
Yansha Deng
Abstract:
To provide Uplink Centric Broadband Communication (UCBC), New Radio Unlicensed (NR-U) network has been standardized to exploit the unlicensed spectrum using Listen Before Talk (LBT) scheme to fairly coexist with the incumbent Wireless Fidelity (WiFi) network. Existing access schemes over unlicensed spectrum are required to perform Clear Channel Assessment (CCA) before transmissions, where fixed En…
▽ More
To provide Uplink Centric Broadband Communication (UCBC), New Radio Unlicensed (NR-U) network has been standardized to exploit the unlicensed spectrum using Listen Before Talk (LBT) scheme to fairly coexist with the incumbent Wireless Fidelity (WiFi) network. Existing access schemes over unlicensed spectrum are required to perform Clear Channel Assessment (CCA) before transmissions, where fixed Energy Detection (ED) thresholds are adopted to identify the channel as idle or busy. However, fixed ED thresholds setting prevents devices from accessing the channel effectively and efficiently, which leads to the hidden node (HN) and exposed node (EN) problems. In this paper, we first develop a centralized double Deep Q-Network (DDQN) algorithm to optimize the uplink system throughput, where the agent is deployed at the central server to dynamically adjust the ED thresholds for NR-U and WiFi networks. Considering that heterogeneous NR-U and WiFi networks, in practice, cannot share the raw data with the central server directly, we then develop a federated DDQN algorithm, where two agents are deployed in the NR-U and WiFi networks, respectively. Our results have shown that the uplink system throughput increases by over 100%, where cell throughput of NR-U network rises by 150%, and cell throughput of WiFi network decreases by 30%. To guarantee the cell throughput of WiFi network, we redesign the reward function to punish the agent when the cell throughput of WiFi network is below the threshold, and our revised design can still provide over 50% uplink system throughput gain.
△ Less
Submitted 18 February, 2024;
originally announced February 2024.
-
Learning Disentangled Speech Representations with Contrastive Learning and Time-Invariant Retrieval
Authors:
Yimin Deng,
Huaizhen Tang,
Xulong Zhang,
Ning Cheng,
Jing Xiao,
Jianzong Wang
Abstract:
Voice conversion refers to transferring speaker identity with well-preserved content. Better disentanglement of speech representations leads to better voice conversion. Recent studies have found that phonetic information from input audio has the potential ability to well represent content. Besides, the speaker-style modeling with pre-trained models making the process more complex. To tackle these…
▽ More
Voice conversion refers to transferring speaker identity with well-preserved content. Better disentanglement of speech representations leads to better voice conversion. Recent studies have found that phonetic information from input audio has the potential ability to well represent content. Besides, the speaker-style modeling with pre-trained models making the process more complex. To tackle these issues, we introduce a new method named "CTVC" which utilizes disentangled speech representations with contrastive learning and time-invariant retrieval. Specifically, a similarity-based compression module is used to facilitate a more intimate connection between the frame-level hidden features and linguistic information at phoneme-level. Additionally, a time-invariant retrieval is proposed for timbre extraction based on multiple segmentations and mutual information. Experimental results demonstrate that "CTVC" outperforms previous studies and improves the sound quality and similarity of converted results.
△ Less
Submitted 17 January, 2024; v1 submitted 15 January, 2024;
originally announced January 2024.
-
Collaborative Perception for Connected and Autonomous Driving: Challenges, Possible Solutions and Opportunities
Authors:
Senkang Hu,
Zhengru Fang,
Yiqin Deng,
Xianhao Chen,
Yuguang Fang
Abstract:
Autonomous driving has attracted significant attention from both academia and industries, which is expected to offer a safer and more efficient driving system. However, current autonomous driving systems are mostly based on a single vehicle, which has significant limitations which still poses threats to driving safety. Collaborative perception with connected and autonomous vehicles (CAVs) shows a…
▽ More
Autonomous driving has attracted significant attention from both academia and industries, which is expected to offer a safer and more efficient driving system. However, current autonomous driving systems are mostly based on a single vehicle, which has significant limitations which still poses threats to driving safety. Collaborative perception with connected and autonomous vehicles (CAVs) shows a promising solution to overcoming these limitations. In this article, we first identify the challenges of collaborative perception, such as data sharing asynchrony, data volume, and pose errors. Then, we discuss the possible solutions to address these challenges with various technologies, where the research opportunities are also elaborated. Furthermore, we propose a scheme to deal with communication efficiency and latency problems, which is a channel-aware collaborative perception framework to dynamically adjust the communication graph and minimize latency, thereby improving perception performance while increasing communication efficiency. Finally, we conduct experiments to demonstrate the effectiveness of our proposed scheme.
△ Less
Submitted 15 April, 2025; v1 submitted 3 January, 2024;
originally announced January 2024.
-
Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation
Authors:
Jinlong Xue,
Yayue Deng,
Yingming Gao,
Ya Li
Abstract:
Recent advancements in diffusion models and large language models (LLMs) have significantly propelled the field of AIGC. Text-to-Audio (TTA), a burgeoning AIGC application designed to generate audio from natural language prompts, is attracting increasing attention. However, existing TTA studies often struggle with generation quality and text-audio alignment, especially for complex textual inputs.…
▽ More
Recent advancements in diffusion models and large language models (LLMs) have significantly propelled the field of AIGC. Text-to-Audio (TTA), a burgeoning AIGC application designed to generate audio from natural language prompts, is attracting increasing attention. However, existing TTA studies often struggle with generation quality and text-audio alignment, especially for complex textual inputs. Drawing inspiration from state-of-the-art Text-to-Image (T2I) diffusion models, we introduce Auffusion, a TTA system adapting T2I model frameworks to TTA task, by effectively leveraging their inherent generative strengths and precise cross-modal alignment. Our objective and subjective evaluations demonstrate that Auffusion surpasses previous TTA approaches using limited data and computational resource. Furthermore, previous studies in T2I recognizes the significant impact of encoder choice on cross-modal alignment, like fine-grained details and object bindings, while similar evaluation is lacking in prior TTA works. Through comprehensive ablation studies and innovative cross-attention map visualizations, we provide insightful assessments of text-audio alignment in TTA. Our findings reveal Auffusion's superior capability in generating audios that accurately match textual descriptions, which further demonstrated in several related tasks, such as audio style transfer, inpainting and other manipulations. Our implementation and demos are available at https://auffusion.github.io.
△ Less
Submitted 2 January, 2024;
originally announced January 2024.
-
Frame-level emotional state alignment method for speech emotion recognition
Authors:
Qifei Li,
Yingming Gao,
Cong Wang,
Yayue Deng,
Jinlong Xue,
Yichen Han,
Ya Li
Abstract:
Speech emotion recognition (SER) systems aim to recognize human emotional state during human-computer interaction. Most existing SER systems are trained based on utterance-level labels. However, not all frames in an audio have affective states consistent with utterance-level label, which makes it difficult for the model to distinguish the true emotion of the audio and perform poorly. To address th…
▽ More
Speech emotion recognition (SER) systems aim to recognize human emotional state during human-computer interaction. Most existing SER systems are trained based on utterance-level labels. However, not all frames in an audio have affective states consistent with utterance-level label, which makes it difficult for the model to distinguish the true emotion of the audio and perform poorly. To address this problem, we propose a frame-level emotional state alignment method for SER. First, we fine-tune HuBERT model to obtain a SER system with task-adaptive pretraining (TAPT) method, and extract embeddings from its transformer layers to form frame-level pseudo-emotion labels with clustering. Then, the pseudo labels are used to pretrain HuBERT. Hence, the each frame output of HuBERT has corresponding emotional information. Finally, we fine-tune the above pretrained HuBERT for SER by adding an attention layer on the top of it, which can focus only on those frames that are emotionally more consistent with utterance-level label. The experimental results performed on IEMOCAP indicate that our proposed method performs better than state-of-the-art (SOTA) methods.
△ Less
Submitted 26 December, 2023;
originally announced December 2023.
-
Goal-oriented Semantic Communications for Robotic Waypoint Transmission: The Value and Age of Information Approach
Authors:
Wenchao Wu,
Yuanqing Yang,
Yansha Deng,
A. Hamid Aghvami
Abstract:
The ultra-reliable and low-latency communication (URLLC) service of the fifth-generation (5G) mobile communication network struggles to support safe robot operation. Nowadays, the sixth-generation (6G) mobile communication network is proposed to provide hyper-reliable and low-latency communication to enable safer control for robots. However, current 5G/ 6G research mainly focused on improving comm…
▽ More
The ultra-reliable and low-latency communication (URLLC) service of the fifth-generation (5G) mobile communication network struggles to support safe robot operation. Nowadays, the sixth-generation (6G) mobile communication network is proposed to provide hyper-reliable and low-latency communication to enable safer control for robots. However, current 5G/ 6G research mainly focused on improving communication performance, while the robotics community mostly assumed communication to be ideal. To jointly consider communication and robotic control with a focus on the specific robotic task, we propose goal-oriented semantic communication in robotic control (GSRC) to exploit the context of data and its importance in achieving the task at both transmitter and receiver. At the transmitter, we propose a deep reinforcement learning algorithm to generate optimal control and command (C&C) data and a proactive repetition scheme (DeepPro) to increase the successful transmission probability. At the receiver, we design the value of information (VoI) and age of information (AoI) based queue ordering mechanism (VA-QOM) to rank the queue based on the semantic information extracted from AoI and VoI. The simulation results validate that our proposed GSRC framework achieves a 91.5% improvement in the mean square error compared to the traditional unmanned aerial vehicle control framework.
△ Less
Submitted 12 November, 2024; v1 submitted 20 December, 2023;
originally announced December 2023.
-
Localization and Discrete Beamforming with a Large Reconfigurable Intelligent Surface
Authors:
Baojia Luo,
Yili Deng,
Miaomiao Dong,
Zhongyi Huang,
Xiang Chen,
Wei Han,
Bo Bai
Abstract:
In millimeter-wave (mmWave) cellular systems, reconfigurable intelligent surfaces (RISs) are foreseeably deployed with a large number of reflecting elements to achieve high beamforming gains. The large-sized RIS will make radio links fall in the near-field localization regime with spatial non-stationarity issues. Moreover, the discrete phase restriction on the RIS reflection coefficient incurs exp…
▽ More
In millimeter-wave (mmWave) cellular systems, reconfigurable intelligent surfaces (RISs) are foreseeably deployed with a large number of reflecting elements to achieve high beamforming gains. The large-sized RIS will make radio links fall in the near-field localization regime with spatial non-stationarity issues. Moreover, the discrete phase restriction on the RIS reflection coefficient incurs exponential complexity for discrete beamforming. It remains an open problem to find the optimal RIS reflection coefficient design in polynomial time. To address these issues, we propose a scalable partitioned-far-field protocol that considers both the near-filed non-stationarity and discrete beamforming. The protocol approximates near-field signal propagation using a partitioned-far-field representation to inherit the sparsity from the sophisticated far-field and facilitate the near-field localization scheme. To improve the theoretical localization performance, we propose a fast passive beamforming (FPB) algorithm that optimally solves the discrete RIS beamforming problem, reducing the search complexity from exponential order to linear order. Furthermore, by exploiting the partitioned structure of RIS, we introduce a two-stage coarse-to-fine localization algorithm that leverages both the time delay and angle information. Numerical results demonstrate that centimeter-level localization precision is achieved under medium and high signal-to-noise ratios (SNR), revealing that RISs can provide support for low-cost and high-precision localization in future cellular systems.
△ Less
Submitted 19 December, 2023;
originally announced December 2023.
-
CLN-VC: Text-Free Voice Conversion Based on Fine-Grained Style Control and Contrastive Learning with Negative Samples Augmentation
Authors:
Yimin Deng,
Xulong Zhang,
Jianzong Wang,
Ning Cheng,
Jing Xiao
Abstract:
Better disentanglement of speech representation is essential to improve the quality of voice conversion. Recently contrastive learning is applied to voice conversion successfully based on speaker labels. However, the performance of model will reduce in conversion between similar speakers. Hence, we propose an augmented negative sample selection to address the issue. Specifically, we create hard ne…
▽ More
Better disentanglement of speech representation is essential to improve the quality of voice conversion. Recently contrastive learning is applied to voice conversion successfully based on speaker labels. However, the performance of model will reduce in conversion between similar speakers. Hence, we propose an augmented negative sample selection to address the issue. Specifically, we create hard negative samples based on the proposed speaker fusion module to improve learning ability of speaker encoder. Furthermore, considering the fine-grain modeling of speaker style, we employ a reference encoder to extract fine-grained style and conduct the augmented contrastive learning on global style. The experimental results show that the proposed method outperforms previous work in voice conversion tasks.
△ Less
Submitted 14 November, 2023;
originally announced November 2023.
-
Acoustic Model Fusion for End-to-end Speech Recognition
Authors:
Zhihong Lei,
Mingbin Xu,
Shiyi Han,
Leo Liu,
Zhen Huang,
Tim Ng,
Yuanyuan Zhang,
Ernest Pusateri,
Mirko Hannemann,
Yaqiao Deng,
Man-Hung Siu
Abstract:
Recent advances in deep learning and automatic speech recognition (ASR) have enabled the end-to-end (E2E) ASR system and boosted the accuracy to a new level. The E2E systems implicitly model all conventional ASR components, such as the acoustic model (AM) and the language model (LM), in a single network trained on audio-text pairs. Despite this simpler system architecture, fusing a separate LM, tr…
▽ More
Recent advances in deep learning and automatic speech recognition (ASR) have enabled the end-to-end (E2E) ASR system and boosted the accuracy to a new level. The E2E systems implicitly model all conventional ASR components, such as the acoustic model (AM) and the language model (LM), in a single network trained on audio-text pairs. Despite this simpler system architecture, fusing a separate LM, trained exclusively on text corpora, into the E2E system has proven to be beneficial. However, the application of LM fusion presents certain drawbacks, such as its inability to address the domain mismatch issue inherent to the internal AM. Drawing inspiration from the concept of LM fusion, we propose the integration of an external AM into the E2E system to better address the domain mismatch. By implementing this novel approach, we have achieved a significant reduction in the word error rate, with an impressive drop of up to 14.3% across varied test sets. We also discovered that this AM fusion approach is particularly beneficial in enhancing named entity recognition.
△ Less
Submitted 10 October, 2023;
originally announced October 2023.
-
PMVC: Data Augmentation-Based Prosody Modeling for Expressive Voice Conversion
Authors:
Yimin Deng,
Huaizhen Tang,
Xulong Zhang,
Jianzong Wang,
Ning Cheng,
Jing Xiao
Abstract:
Voice conversion as the style transfer task applied to speech, refers to converting one person's speech into a new speech that sounds like another person's. Up to now, there has been a lot of research devoted to better implementation of VC tasks. However, a good voice conversion model should not only match the timbre information of the target speaker, but also expressive information such as prosod…
▽ More
Voice conversion as the style transfer task applied to speech, refers to converting one person's speech into a new speech that sounds like another person's. Up to now, there has been a lot of research devoted to better implementation of VC tasks. However, a good voice conversion model should not only match the timbre information of the target speaker, but also expressive information such as prosody, pace, pause, etc. In this context, prosody modeling is crucial for achieving expressive voice conversion that sounds natural and convincing. Unfortunately, prosody modeling is important but challenging, especially without text transcriptions. In this paper, we firstly propose a novel voice conversion framework named 'PMVC', which effectively separates and models the content, timbre, and prosodic information from the speech without text transcriptions. Specially, we introduce a new speech augmentation algorithm for robust prosody extraction. And building upon this, mask and predict mechanism is applied in the disentanglement of prosody and content information. The experimental results on the AIShell-3 corpus supports our improvement of naturalness and similarity of converted speech.
△ Less
Submitted 21 August, 2023;
originally announced August 2023.
-
Task-Oriented Semantics-Aware Communication for Wireless UAV Control and Command Transmission
Authors:
Yujie Xu,
Zhou Hui,
Yansha Deng
Abstract:
To guarantee the safety and smooth control of Unmanned Aerial Vehicle (UAV) operation, the new control and command (C&C) data type imposes stringent quality of service (QoS) requirements on the cellular network. However, the existing bit-oriented communication framework is already approaching the Shannon capacity limit, which can hardly guarantee the ultra-reliable low latency communications (URLL…
▽ More
To guarantee the safety and smooth control of Unmanned Aerial Vehicle (UAV) operation, the new control and command (C&C) data type imposes stringent quality of service (QoS) requirements on the cellular network. However, the existing bit-oriented communication framework is already approaching the Shannon capacity limit, which can hardly guarantee the ultra-reliable low latency communications (URLLC) service for C&C transmission. To solve the problem, task-oriented semantics-aware (TOSA) communication has been proposed recently by jointly exploiting the context of data and its importance to the UAV control task. However, to the best of our knowledge, an explicit and systematic TOSA communication framework for emerging C&C data type remains unknown. Therefore, in this paper, we propose a TOSA communication framework for C&C transmission and define its value of information based on both the similarity and age of information (AoI) of C&C signals. We also propose a deep reinforcement learning (DRL) algorithm to maximize the TOSA information. Last but not least, we present the simulation results to validate the effectiveness of our proposed TOSA communication framework.
△ Less
Submitted 25 June, 2023;
originally announced June 2023.
-
DIAS: A Dataset and Benchmark for Intracranial Artery Segmentation in DSA sequences
Authors:
Wentao Liu,
Tong Tian,
Lemeng Wang,
Weijin Xu,
Lei Li,
Haoyuan Li,
Wenyi Zhao,
Siyu Tian,
Xipeng Pan,
Huihua Yang,
Feng Gao,
Yiming Deng,
Xin Yang,
Ruisheng Su
Abstract:
The automated segmentation of Intracranial Arteries (IA) in Digital Subtraction Angiography (DSA) plays a crucial role in the quantification of vascular morphology, significantly contributing to computer-assisted stroke research and clinical practice. Current research primarily focuses on the segmentation of single-frame DSA using proprietary datasets. However, these methods face challenges due to…
▽ More
The automated segmentation of Intracranial Arteries (IA) in Digital Subtraction Angiography (DSA) plays a crucial role in the quantification of vascular morphology, significantly contributing to computer-assisted stroke research and clinical practice. Current research primarily focuses on the segmentation of single-frame DSA using proprietary datasets. However, these methods face challenges due to the inherent limitation of single-frame DSA, which only partially displays vascular contrast, thereby hindering accurate vascular structure representation. In this work, we introduce DIAS, a dataset specifically developed for IA segmentation in DSA sequences. We establish a comprehensive benchmark for evaluating DIAS, covering full, weak, and semi-supervised segmentation methods. Specifically, we propose the vessel sequence segmentation network, in which the sequence feature extraction module effectively captures spatiotemporal representations of intravascular contrast, achieving intracranial artery segmentation in 2D+Time DSA sequences. For weakly-supervised IA segmentation, we propose a novel scribble learning-based image segmentation framework, which, under the guidance of scribble labels, employs cross pseudo-supervision and consistency regularization to improve the performance of the segmentation network. Furthermore, we introduce the random patch-based self-training framework, aimed at alleviating the performance constraints encountered in IA segmentation due to the limited availability of annotated DSA data. Our extensive experiments on the DIAS dataset demonstrate the effectiveness of these methods as potential baselines for future research and clinical applications. The dataset and code are publicly available at https://doi.org/10.5281/zenodo.11396520 and https://github.com/lseventeen/DIAS.
△ Less
Submitted 13 June, 2024; v1 submitted 21 June, 2023;
originally announced June 2023.