ViTaL: A Multimodality Dataset and Benchmark for Multi-pathological Ovarian Tumor Recognition

Authors: You Zhou, Lijiang Chen, Guangxia Cui, Wenpei Bai, Yu Guo, Shuchang Lyu, Guangliang Cheng, Qi Zhao

Abstract: Ovarian tumor, as a common gynecological disease, can rapidly deteriorate into serious health crises when undetected early, thus posing significant threats to the health of women. Deep neural networks have the potential to identify ovarian tumors, thereby reducing mortality rates, but limited public datasets hinder progress. To address this gap, we introduce a vital ovarian tumor pathological recognition dataset called ViTaL that contains Visual, Tabular and Linguistic modality data of 496 patients across six pathological categories. The ViTaL dataset comprises three subsets corresponding to different patient data modalities: visual data from 2216 two-dimensional ultrasound images, tabular data from medical examinations of 496 patients, and linguistic data from ultrasound reports of 496 patients. It is insufficient to merely distinguish between benign and malignant ovarian tumors in clinical practice. To enable multi-pathology classification of ovarian tumor, we propose a ViTaL-Net based on the Triplet Hierarchical Offset Attention Mechanism (THOAM) to minimize the loss incurred during feature fusion of multi-modal data. This mechanism could effectively enhance the relevance and complementarity between information from different modalities. ViTaL-Net serves as a benchmark for the task of multi-pathology, multi-modality classification of ovarian tumors.
In our comprehensive experiments, the proposed method exhibited satisfactory performance, achieving accuracies exceeding 90% on the two most common pathological types of ovarian tumor and an overall performance of 85%. Our dataset and code are available at https://github.com/GGbond-study/vitalnet.

Submitted 6 July, 2025; originally announced July 2025.

arXiv:2507.01348 [pdf, ps, other]

SpeechAccentLLM: A Unified Framework for Foreign Accent Conversion and Text to Speech

Authors: Zhuangfei Cheng, Guangyan Zhang, Zehai Tu, Yangyang Song, Shuiyang Mao, Xiaoqi Jiao, Jingyu Li, Yiwen Guo, Jiasong Wu

Abstract: Foreign accent conversion (FAC) in speech processing remains a challenging task. Building on the remarkable success of large language models (LLMs) in Text-to-Speech (TTS) tasks, this study investigates the adaptation of LLM-based techniques for FAC, which we term SpeechAccentLLM. At the core of this framework, we introduce SpeechCodeVAE, the first model to integrate connectionist temporal classification (CTC) directly into codebook discretization for speech content tokenization. This novel architecture generates tokens with a unique "locality" property, as validated by experiments demonstrating optimal trade-offs among content faithfulness, temporal coherence, and structural recoverability. Then, to address data scarcity for the FAC module, we adopted a multitask learning strategy that jointly trains the FAC and TTS modules. Beyond mitigating data limitations, this approach yielded accelerated convergence and superior speech quality compared to standalone FAC training. Moreover, leveraging the salient properties of our discrete speech representations, we introduce SpeechRestorer, a postprocessing architecture designed to refine LLM-generated outputs. This module effectively mitigates stochastic errors prevalent in LLM inference pipelines while enhancing prosodic continuity, as validated by ablation experiments.

Submitted 8 July, 2025; v1 submitted 2 July, 2025; originally announced July 2025.

Comments: 10 pages, includes references, 4 figures, 4 tables

ACM Class: I.2.7

arXiv:2506.22023 [pdf, ps, other]

Robust and Efficient Autoregressive Speech Synthesis with Dynamic Chunk-wise Prediction Policy

Authors: Bohan Li, Zhihan Li, Haoran Wang, Hanglei Zhang, Yiwei Guo, Hankun Wang, Xie Chen, Kai Yu

Abstract: Recently, autoregressive (AR) language models have emerged as a dominant approach in speech synthesis, offering expressive generation and scalable training. However, conventional AR speech synthesis models relying on the next-token prediction paradigm often encounter significant challenges when handling long speech sequences. These models often struggle to construct stable frame-to-frame attention, leading to increased latency and degraded synthesis quality, thereby limiting their feasibility for real-time applications. To address these limitations, we introduce a novel dynamic chunk-wise autoregressive synthesis framework, termed DCAR, designed to enhance both efficiency and intelligibility robustness in AR speech generation. DCAR introduces a chunk-to-frame attention mechanism through training with multi-token prediction, enabling dynamic chunk prediction in variable speech contexts using a lightweight module trained on-policy. DCAR dynamically adjusts the token prediction span, significantly reducing the sequence length dependency while obtaining high synthesis quality. Comprehensive empirical evaluations demonstrate that DCAR substantially outperforms traditional next-token prediction models, achieving up to 72.27% intelligibility improvement and 2.61x inference speedup simultaneously on the test set. Furthermore, we conduct comprehensive analysis to support it as a versatile foundation for next-generation speech synthesis systems.

Submitted 27 June, 2025; originally announced June 2025.

Comments: 17 pages, 8 figures, 5 tables

arXiv:2506.21074 [pdf, ps, other]

CodecSlime: Temporal Redundancy Compression of Neural Speech Codec via Dynamic Frame Rate

Authors: Hankun Wang, Yiwei Guo, Chongtian Shao, Bohan Li, Xie Chen, Kai Yu

Abstract: Neural speech codecs have been widely used in audio compression and various downstream tasks. Current mainstream codecs are fixed-frame-rate (FFR), which allocate the same number of tokens to every equal-duration slice. However, speech is inherently non-uniform in temporal information density. As a result, many tokens are wasted on steady-state segments like long vowels and silences. To address this mismatch, we present CodecSlime, a plugin-style method for compressing temporal redundancy through supporting dynamic frame rate (DFR) on neural speech codecs for the first time. Our method is unsupervised and architecture-agnostic, combining two key innovations, ScheDFR and Melt-and-Cool, for adapting inference and training, respectively. When integrated into a typical VQ-GAN codec backbone and operating at 40 Hz DFR (≈ 600 bps), the reconstruction WER of CodecSlime is reduced by up to 46% relative to conventional FFR baselines with the same model architecture and similar bitrates, while other metrics are also competitive. CodecSlime also enables flexible trade-offs between reconstruction quality and bitrate: a single model supports inference at multiple frame rates and consistently outperforms FFR models at the corresponding frame rates. Audio samples are available at https://acadarmeria.github.io/codecslime/.

Submitted 26 June, 2025; originally announced June 2025.

Comments: 16 pages, 5 figures, 9 tables

arXiv:2506.10562 [pdf]

Joint System Modeling Approach for Fault Simulation of Starter/Generator and Gas Generator in All-Electric APU

Authors: Haotian Mao, Yingqing Guo

Abstract: This paper presents a joint system modeling approach for fault simulation of an all-electric auxiliary power unit (APU), integrating starter/generator turn-to-turn short circuit (TTSC) faults with gas generator gas-path faults. To address challenges in electromechanical coupling and in balancing simulation precision against computational efficiency, we propose a multi-rate continuous-discrete hybrid simulation architecture. This architecture treats the starter/generator as a continuous system with variable step size in Simulink, while modeling the gas generator as a discrete system with fixed step size in a dynamic-link library (DLL) environment. For the starter/generator fault modeling, a multi-loop approach is deployed to accurately simulate TTSC faults. For the gas generator, we develop an improved GasTurb-DLL modeling method (IGDM) that enhances uncertainty modeling, state-space representation, and tool chain compatibility. Finally, the proposed methodology was implemented in a case study based on the APS5000 all-electric APU structure and parameters. Model validation was conducted by comparing simulation results (covering steady-state, transient, healthy, and fault conditions) with reference data from third-party software and literature. The close agreement confirms both the model's accuracy and the effectiveness of our modeling methodology.
This work establishes a modeling foundation for investigating the opportunities and challenges in fault detection and isolation (FDI) brought by the full electrification of the APU, including joint fault estimation and diagnosis and coupled electromechanical fault characteristics.

Submitted 12 June, 2025; originally announced June 2025.

arXiv:2506.08404 [pdf, ps, other]

Compact Amplified Laser Power Stabilization Using Robust Active Disturbance Rejection Control with Sensor Noise Decoupling

Authors: Yanpei Shi, Jingxuan Zhang, Zhuo Shi, Chenyao Zhang, Yuze Guo, Rui Feng

Abstract: Laser power instability, encompassing random jitter and slow drift, severely limits the performance of optically pumped magnetometers (OPMs) in detecting ultra-weak magnetic fields, especially in large-scale OPM arrays for magnetoencephalography. Although a unified amplified laser (AL) architecture improves integration, fluctuations in the pump beam progressively degrade performance across all channels, exacerbated by environmental disturbances and system uncertainties. To address this challenge, this paper presents a compact AL power stabilization approach based on an innovative dual-loop active disturbance rejection control (DLADRC) strategy, while integrating a comprehensive quantitative stability analysis through novel exponential decay estimates for extended state observers (ESOs) and control error dynamics. As validated through physical experimental results, the proposed method significantly improves AL's long-term stability with sensor noise decoupling, achieving an over 85.7% reduction in 1-hour power instability and a tenfold decrease in Allan variance for correlation times from 10^2 s to 10^3 s, compared to standard ADRC. Crucially, the strategy demonstrates robust effectiveness across diverse operating scenarios, enabling AL-based OPM systems to achieve their full potential in high-sensitivity biomagnetic field detection.

Submitted 9 June, 2025; originally announced June 2025.
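
Note on the Allan variance figure quoted in the abstract above: the listing does not say how the authors computed it, so the following is only a minimal sketch of the standard non-overlapping estimator applied to a sampled power record. The function name `allan_variance` and the averaging factor `m` are illustrative assumptions; the correlation time is `m` multiplied by the sampling interval.

```python
from statistics import fmean


def allan_variance(samples, m):
    """Non-overlapping Allan variance for an averaging factor of m samples.

    Illustrative helper, not taken from the paper.
    """
    n_bins = len(samples) // m
    if n_bins < 2:
        raise ValueError("need at least two full averaging bins")
    # Average consecutive, non-overlapping blocks of m samples.
    means = [fmean(samples[i * m:(i + 1) * m]) for i in range(n_bins)]
    # Half the mean squared difference between adjacent block averages.
    sq_diffs = [(b - a) ** 2 for a, b in zip(means, means[1:])]
    return 0.5 * fmean(sq_diffs)
```

Evaluating this for several values of `m` gives the variance as a function of correlation time, which is the kind of curve a "tenfold decrease over a range of correlation times" would be read from.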

arXiv:2506.07358 [pdf, ps, other]

Lightweight Joint Audio-Visual Deepfake Detection via Single-Stream Multi-Modal Learning Framework

Authors: Kuiyuan Zhang, Wenjie Pei, Rushi Lan, Yifang Guo, Zhongyun Hua

Abstract: Deepfakes are AI-synthesized multimedia data that may be abused for spreading misinformation. Deepfake generation involves both visual and audio manipulation. To detect audio-visual deepfakes, previous studies commonly employ two relatively independent sub-models to learn audio and visual features, respectively, and fuse them subsequently for deepfake detection. However, this may underutilize the inherent correlations between audio and visual features. Moreover, utilizing two isolated feature learning sub-models can result in redundant neural layers, making the overall model inefficient and impractical for resource-constrained environments. In this work, we design a lightweight network for audio-visual deepfake detection via a single-stream multi-modal learning framework. Specifically, we introduce a collaborative audio-visual learning block to efficiently integrate multi-modal information while learning the visual and audio features. By iteratively employing this block, our single-stream network achieves a continuous fusion of multi-modal features across its layers. Thus, our network efficiently captures visual and audio features without the need for excessive block stacking, resulting in a lightweight network design. Furthermore, we propose a multi-modal classification module that can boost the dependence of the visual and audio classifiers on modality content. It also enhances the overall resistance of the video classifier against mismatches between audio and visual modalities.
We conduct experiments on the DF-TIMIT, FakeAVCeleb, and DFDC benchmark datasets. Compared to state-of-the-art audio-visual joint detection methods, our method is significantly more lightweight, with only 0.48M parameters, yet it achieves superior performance on both uni-modal and multi-modal deepfakes, as well as on unseen types of deepfakes.

Submitted 8 June, 2025; originally announced June 2025.

arXiv:2506.00358 [pdf, other]

AVROBUSTBENCH: Benchmarking the Robustness of Audio-Visual Recognition Models at Test-Time

Authors: Sarthak Kumar Maharana, Saksham Singh Kushwaha, Baoming Zhang, Adrian Rodriguez, Songtao Wei, Yapeng Tian, Yunhui Guo

Abstract: While recent audio-visual models have demonstrated impressive performance, their robustness to distributional shifts at test-time remains not fully understood. Existing robustness benchmarks mainly focus on single modalities, making them insufficient for thoroughly assessing the robustness of audio-visual models. Motivated by real-world scenarios where shifts can occur simultaneously in both audio and visual modalities, we introduce AVROBUSTBENCH, a comprehensive benchmark designed to evaluate the test-time robustness of audio-visual recognition models. AVROBUSTBENCH comprises four audio-visual benchmark datasets, AUDIOSET-2C, VGGSOUND-2C, KINETICS-2C, and EPICKITCHENS-2C, each incorporating 75 bimodal audio-visual corruptions that are co-occurring and correlated. Through extensive evaluations, we observe that state-of-the-art supervised and self-supervised audio-visual models exhibit declining robustness as corruption severity increases. Furthermore, online test-time adaptation (TTA) methods, on VGGSOUND-2C and KINETICS-2C, offer minimal improvements in performance under bimodal corruptions. We further propose AV2C, a simple TTA approach enabling on-the-fly cross-modal fusion by penalizing high-entropy samples, which achieves improvements on VGGSOUND-2C. We hope that AVROBUSTBENCH will steer the development of more effective and robust audio-visual TTA approaches.
Our code is available at https://github.com/sarthaxxxxx/AV-C-Robustness-Benchmark.

Submitted 30 May, 2025; originally announced June 2025.

Comments: Under review. For uniformity, all TTA experiments are done with a batch size of 16

arXiv:2505.23379 [pdf, ps, other]

Vision-Integrated High-Quality Neural Speech Coding

Authors: Yao Guo, Yang Ai, Rui-Chen Zheng, Hui-Peng Du, Xiao-Hang Jiang, Zhen-Hua Ling

Abstract: This paper proposes a novel vision-integrated neural speech codec (VNSC), which aims to enhance speech coding quality by leveraging visual modality information. In VNSC, the image analysis-synthesis module extracts visual features from lip images, while the feature fusion module facilitates interaction between the image analysis-synthesis module and the speech coding module, transmitting visual information to assist the speech coding process. Depending on whether visual information is available during the inference stage, the feature fusion module integrates visual features into the speech coding module using either explicit integration or implicit distillation strategies. Experimental results confirm that integrating visual information effectively improves the quality of the decoded speech and enhances the noise robustness of the neural speech codec, without increasing the bitrate.

Submitted 29 May, 2025; originally announced May 2025.

Comments: Accepted by Interspeech 2025

arXiv:2505.22515 [pdf, ps, other]

Towards General Discrete Speech Codec for Complex Acoustic Environments: A Study of Reconstruction and Downstream Task Consistency

Authors: Haoran Wang, Guanyu Chen, Bohan Li, Hankun Wang, Yiwei Guo, Zhihan Li, Xie Chen, Kai Yu

Abstract: Neural speech codecs excel in reconstructing clean speech signals; however, their efficacy in complex acoustic environments and downstream signal processing tasks remains underexplored. In this study, we introduce a novel benchmark named Environment-Resilient Speech Codec Benchmark (ERSB) to systematically evaluate whether neural speech codecs are environment-resilient. Specifically, we assess two key capabilities: (1) robust reconstruction, which measures the preservation of both speech and non-speech acoustic details, and (2) downstream task consistency, which ensures minimal deviation in downstream signal processing tasks when using reconstructed speech instead of the original. Our comprehensive experiments reveal that complex acoustic environments significantly degrade signal reconstruction and downstream task consistency. This work highlights the limitations of current speech codecs and raises a future direction that improves them for greater environmental resilience.

Submitted 28 May, 2025; originally announced May 2025.

Comments: Initial Upload

arXiv:2505.19539 [pdf, ps, other]

Water Level Sensing via Communication Signals in a Bi-Static System

Authors: Zhongqin Wang, J. Andrew Zhang, Kai Wu, Y. Jay Guo

Abstract: Accurate water level sensing is essential for flood monitoring, agricultural irrigation, and water resource optimization. Traditional methods require dedicated sensor deployments, leading to high installation costs, vulnerability to interference, and limited resolution. This work proposes PMNs-WaterSense, a novel scheme leveraging Channel State Information (CSI) from existing mobile networks for water level sensing. Our scheme begins with a CSI-power method to eliminate phase offsets caused by clock asynchrony in bi-static systems. We then apply multi-domain filtering across the time (Doppler), frequency (delay), and spatial (Angle-of-Arrival, AoA) domains to extract phase features that finely capture variations in path length over water. To resolve the 2π phase ambiguity, we introduce a Kalman filter-based unwrapping technique. Additionally, we exploit transceiver geometry to convert path length variations into water level height changes, even with limited antenna configurations. We validate our framework through controlled experiments with 28 GHz mmWave and 3.1 GHz LTE signals in real time, achieving average height estimation errors of 0.025 cm and 0.198 cm, respectively. Moreover, real-world river monitoring with 2.6 GHz LTE signals achieves an average error of 4.8 cm for a 1-meter water level change, demonstrating its effectiveness in practical deployments.

Submitted 26 May, 2025; originally announced May 2025.
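
Note on the phase-to-height chain described in the abstract above: the sketch below is not the authors' method. It swaps their Kalman filter-based unwrapping for a plain difference-based unwrap, assumes the sign convention that phase equals -2π·d/λ, and assumes a simple mirror-image (specular reflection) geometry in which the reflected path length is sqrt(D² + (h_t + h_r)²) for horizontal separation D and antenna heights h_t, h_r above the water. All function and parameter names are hypothetical.

```python
import math


def unwrap(phases):
    """Remove 2*pi jumps, assuming the true phase moves by less than pi per sample."""
    out = [phases[0]]
    for p in phases[1:]:
        step = p - out[-1]
        # Wrap the step into [-pi, pi) before accumulating it.
        step = (step + math.pi) % (2 * math.pi) - math.pi
        out.append(out[-1] + step)
    return out


def path_length_change(phases, wavelength_m):
    """Path length change relative to the first sample (assumes phase = -2*pi*d/lambda)."""
    unwrapped = unwrap(phases)
    return [-(p - unwrapped[0]) * wavelength_m / (2 * math.pi) for p in unwrapped]


def water_rise(delta_d_m, horizontal_dist_m, height_sum_m):
    """Water level rise implied by a reflected-path change under the mirror-image model.

    height_sum_m is the initial h_t + h_r; a rise of x lowers both heights by x.
    """
    l0 = math.hypot(horizontal_dist_m, height_sum_m)
    new_height_sum = math.sqrt((l0 + delta_d_m) ** 2 - horizontal_dist_m ** 2)
    return (height_sum_m - new_height_sum) / 2
```

The plain unwrap fails once noise or fast level changes push the per-sample phase step past π, which is presumably the case a filter-based unwrapper is meant to handle.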

arXiv:2505.18641 [pdf, ps, other]

FDMA-Based Passive Multiple Users SWIPT Utilizing Resonant Beams

Authors: Yixuan Guo, Mingliang Xiong, Wen Fang, Qingwei Jiang, Qingwen Liu, Gang Yan

Abstract: The rapid development of IoT technology has led to a shortage of spectrum resources and energy, giving rise to simultaneous wireless information and power transfer (SWIPT) technology. However, traditional multiple input multiple output (MIMO)-based SWIPT faces challenges in target detection. We have designed a passive multi-user resonant beam system (MU-RBS) that can achieve efficient power transfer and communication through adaptive beam alignment. Frequency division multiple access (FDMA) is employed in the downlink (DL) channel, while frequency conversion is utilized in the uplink (UL) channel to avoid echo interference and co-channel interference. The system architecture design and corresponding mathematical model are presented. The simulation results show that MU-RBS can achieve adaptive beam-forming without the target transmitting pilot signals and has high directivity. As the number of iterations increases, the power transmission efficiency, signal-to-noise ratio and spectral efficiency of the UL and DL are continuously optimized until the system reaches the optimal state.

Submitted 24 May, 2025; originally announced May 2025.

arXiv:2505.16845 [pdf, ps, other]

Unlocking Temporal Flexibility: Neural Speech Codec with Variable Frame Rate

Authors: Hanglei Zhang, Yiwei Guo, Zhihan Li, Xiang Hao, Xie Chen, Kai Yu

Abstract: Most neural speech codecs achieve bitrate adjustment through intra-frame mechanisms, such as codebook dropout, at a Constant Frame Rate (CFR). However, speech segments inherently have time-varying information density (e.g., silent intervals versus voiced regions). This property makes CFR not optimal in terms of bitrate and token sequence length, hindering efficiency in real-time applications. In this work, we propose a Temporally Flexible Coding (TFC) technique, introducing variable frame rate (VFR) into neural speech codecs for the first time. TFC enables seamlessly tunable average frame rates and dynamically allocates frame rates based on temporal entropy. Experimental results show that a codec with TFC achieves optimal reconstruction quality with high flexibility, and maintains competitive performance even at lower frame rates. Our approach is promising for the integration with other efforts to develop low-frame-rate neural speech codecs for more efficient downstream tasks.

Submitted 22 May, 2025; originally announced May 2025.

Comments: Accepted to Interspeech 2025

arXiv:2505.16091 [pdf, ps, other]

OSCAR: One-Step Diffusion Codec for Image Compression Across Multiple Bit-rates

Authors: Jinpei Guo, Yifei Ji, Zheng Chen, Kai Liu, Min Liu, Wang Rao, Wenbo Li, Yong Guo, Yulun Zhang

Abstract: Pretrained latent diffusion models have shown strong potential for lossy image compression, owing to their powerful generative priors. Most existing diffusion-based methods reconstruct images by iteratively denoising from random noise, guided by compressed latent representations. While these approaches have achieved high reconstruction quality, their multi-step sampling process incurs substantial computational overhead. Moreover, they typically require training separate models for different compression bit-rates, leading to significant training and storage costs. To address these challenges, we propose a one-step diffusion codec across multiple bit-rates, termed OSCAR. Specifically, our method views compressed latents as noisy variants of the original latents, where the level of distortion depends on the bit-rate. This perspective allows them to be modeled as intermediate states along a diffusion trajectory. By establishing a mapping from the compression bit-rate to a pseudo diffusion timestep, we condition a single generative model to support reconstructions at multiple bit-rates. Meanwhile, we argue that the compressed latents retain rich structural information, thereby making one-step denoising feasible. Thus, OSCAR replaces iterative sampling with a single denoising pass, significantly improving inference efficiency. Extensive experiments demonstrate that OSCAR achieves superior performance in both quantitative and visual quality metrics. The code and models will be released at https://github.com/jp-guo/OSCAR.

Submitted 28 May, 2025; v1 submitted 21 May, 2025; originally announced May 2025.

arXiv:2505.11516 [pdf, other]

SELECT: A Submodular Approach for Active LiDAR Semantic Segmentation

Authors: Ruiyu Mao, Sarthak Kumar Maharana, Xulong Tang, Yunhui Guo

Abstract: LiDAR-based semantic segmentation plays a vital role in autonomous driving by enabling detailed understanding of 3D environments. However, annotating LiDAR point clouds is extremely costly and requires assigning semantic labels to millions of points with complex geometric structures. Active Learning (AL) has emerged as a promising approach to reduce labeling costs by querying only the most informative samples. Yet, existing AL methods face critical challenges when applied to large-scale 3D data: outdoor scenes contain an overwhelming number of points and suffer from severe class imbalance, where rare classes have far fewer points than dominant classes. To address these issues, we propose SELECT, a voxel-centric submodular approach tailored for active LiDAR semantic segmentation. Our method targets both scalability problems and class imbalance through three coordinated stages. First, we perform Voxel-Level Submodular Subset Selection, which efficiently identifies representative voxels without pairwise comparisons, ensuring scalability. Second, we estimate Voxel-Level Model Uncertainty using Monte Carlo dropout, aggregating point-wise uncertainties to identify informative voxels. Finally, we introduce Submodular Maximization for Point-Level Class Balancing, which selects a subset of points that enhances label diversity, explicitly mitigating class imbalance. Experiments on SemanticPOSS, SemanticKITTI, and nuScenes benchmarks demonstrate that SELECT achieves superior performance compared to prior active learning approaches for 3D semantic segmentation.

Submitted 6 May, 2025; originally announced May 2025.
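
Note on the second stage described in the abstract above (voxel-level uncertainty from Monte Carlo dropout): the listing does not give the uncertainty measure or the aggregation rule, so this sketch assumes predictive entropy per point and a plain mean over the points in a voxel. The function names and the input layout (one list of class-probability vectors per stochastic forward pass) are illustrative assumptions, not the authors' implementation.

```python
import math


def predictive_entropy(mc_probs):
    """Entropy of the class distribution averaged over T stochastic forward passes.

    mc_probs: list of T probability vectors for a single point.
    """
    n_passes = len(mc_probs)
    n_classes = len(mc_probs[0])
    mean = [sum(p[c] for p in mc_probs) / n_passes for c in range(n_classes)]
    # Skip zero-probability classes, whose contribution is zero in the limit.
    return -sum(q * math.log(q) for q in mean if q > 0)


def voxel_uncertainty(points_mc_probs):
    """Mean point-wise predictive entropy over all points in one voxel (assumed rule)."""
    entropies = [predictive_entropy(p) for p in points_mc_probs]
    return sum(entropies) / len(entropies)
```

Ranking voxels by such a score is one plausible way to pick "informative voxels"; other aggregations (maximum, sum, or a class-weighted mean) would change which voxels are selected.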

arXiv:2505.10577 [pdf, ps, other]

GRNN: Recurrent Neural Network based on Ghost Features for Video Super-Resolution

Authors: Yutong Guo

Abstract: Modern video super-resolution (VSR) systems based on convolutional neural networks (CNNs) require huge computational costs. The problem of feature redundancy is present in most models in many domains, but is rarely discussed in VSR. We experimentally observe that many features in VSR models are also similar to each other, so we propose to use "Ghost features" to reduce this redundancy. We also analyze the so-called "gradient disappearance" phenomenon generated by the conventional recurrent convolutional network (RNN) model, and combine the Ghost module with RNN to complete the modeling on time series. The current frame is used as input to the model together with the next frame, the output of the previous frame and the hidden state. Extensive experiments on several benchmark models and datasets show that the PSNR and SSIM of our proposed modality are improved to some extent. Some texture details in the video are also better preserved.

Submitted 13 May, 2025; originally announced May 2025.

Comments: Accepted by 2023 IEEE International Conference on Multimedia and Expo (ICME 2023)

arXiv:2505.03244 [pdf, other]

SonicRAG : High Fidelity Sound Effects Synthesis Based on Retrival Augmented Generation

Authors: Yu-Ren Guo, Wen-Kai Tai

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language processing (NLP) and multimodal learning, with successful applications in text generation and speech synthesis, enabling a deeper understanding and generation of multimodal content. In the field of sound effects (SFX) generation, LLMs have been leveraged to orchestrate multiple models for audio synthesis. However, due to the scarcity of annotated datasets and the complexity of temporal modeling, current SFX generation techniques still fall short in achieving high-fidelity audio. To address these limitations, this paper introduces a novel framework that integrates LLMs with existing sound effect databases, allowing for the retrieval, recombination, and synthesis of audio based on user requirements. By leveraging this approach, we enhance the diversity and quality of generated sound effects while eliminating the need for additional recording costs, offering a flexible and efficient solution for sound design and application.

Submitted 13 May, 2025; v1 submitted 6 May, 2025; originally announced May 2025.

Comments: 8 pages, 5 figures

arXiv:2504.10836 [pdf, other]

Uplink Assisted Joint Channel Estimation and CSI Feedback: An Approach Based on Deep Joint Source-Channel Coding

Authors: Yiran Guo, Wei Chen, Bo Ai

Abstract: In frequency division duplex (FDD) multiple-input multiple-output (MIMO) wireless communication systems, the acquisition of downlink channel state information (CSI) is essential for maximizing spatial resource utilization and improving system spectral efficiency. The separate design of modules in AI-based CSI feedback architectures under traditional modular communication frameworks, including channel estimation (CE), CSI compression and feedback, leads to sub-optimal performance. In this paper, we propose an uplink assisted joint CE and CSI feedback approach via deep learning for downlink CSI acquisition, which mitigates performance degradation caused by distribution bias across separately trained modules in traditional modular communication frameworks. The proposed network adopts a deep joint source-channel coding (DJSCC) architecture to mitigate the cliff effect encountered in the conventional separate source-channel coding. Furthermore, we exploit the uplink CSI as auxiliary information to enhance CSI reconstruction accuracy by leveraging the partial reciprocity between the uplink and downlink channels in FDD systems, without introducing additional overhead. The effectiveness of uplink CSI as assisted information and the necessity of an end-to-end multi-module joint training architecture are validated through comprehensive ablation and scalability experiments.

Submitted 14 April, 2025; originally announced April 2025.

arXiv:2504.07119 [pdf, other]

UAV-Assisted MEC for Disaster Response: Stackelberg Game-Based Resource Optimization

Authors: Yafei Guo, Ziye Jia, Lei Zhang, Jia He, Yu Zhang, Qihui Wu

Abstract: The unmanned aerial vehicle assisted multi-access edge computing (UAV-MEC) technology has been widely applied in the sixth-generation era. However, due to the limitations of energy and computing resources in disaster areas, how to efficiently offload the tasks of damaged user equipments (UEs) to UAVs is a key issue. In this work, we consider a task offloading scenario assisted by multiple UAV-MECs, which are deployed inside three-dimensional corridors and provide computation services for UEs. In detail, a ground UAV controller acts as the central decision-making unit that deploys the UAV-MECs and allocates the computational resources. Then, we model the relationship between the UAV controller and UEs based on the Stackelberg game. The problem is formulated to maximize the utility of both the UAV controller and UEs. To tackle the problem, we design a K-means based UAV localization and availability response mechanism to pre-deploy the UAV-MECs. Then, a chess-like particle swarm optimization probability based strategy selection learning optimization algorithm is proposed to deal with the resource allocation. Finally, extensive simulation results verify that the proposed scheme can significantly improve the utility of the UAV controller and UEs in various scenarios compared with baseline schemes.

Submitted 26 March, 2025; originally announced April 2025.

arXiv:2504.02061 [pdf, other]

Aligned Better, Listen Better for Audio-Visual Large Language Models

Authors: Yuxin Guo, Shuailei Ma, Shijie Ma, Xiaoyi Bao, Chen-Wei Xie, Kecheng Zheng, Tingyu Weng, Siyang Sun, Yun Zheng, Wei Zou

Abstract: Audio is essential for multimodal video understanding. On the one hand, video inherently contains audio, which supplies complementary information to vision. Besides, video large language models (Video-LLMs) can encounter many audio-centric settings. However, existing Video-LLMs and Audio-Visual Large Language Models (AV-LLMs) exhibit deficiencies in exploiting audio information, leading to weak understanding and hallucinations. To solve the issues, we delve into the model architecture and dataset. (1) From the architectural perspective, we propose a fine-grained AV-LLM, namely Dolphin. The concurrent alignment of audio and visual modalities in both temporal and spatial dimensions ensures a comprehensive and accurate understanding of videos. Specifically, we devise an audio-visual multi-scale adapter for multi-scale information aggregation, which achieves spatial alignment. For temporal alignment, we propose audio-visual interleaved merging. (2) From the dataset perspective, we curate an audio-visual caption and instruction-tuning dataset, called AVU. It comprises 5.2 million diverse, open-ended data tuples (video, audio, question, answer) and introduces a novel data partitioning strategy. Extensive experiments show our model not only achieves remarkable performance in audio-visual understanding, but also mitigates potential hallucinations.

Submitted 2 April, 2025; originally announced April 2025.

Comments: Accepted to ICLR 2025

arXiv:2504.01038 [pdf, other]

An Integrated AI-Enabled System Using One Class Twin Cross Learning (OCT-X) for Early Gastric Cancer Detection

Authors: Xian-Xian Liu, Yuanyuan Wei, Mingkun Xu, Yongze Guo, Hongwei Zhang, Huicong Dong, Qun Song, Qi Zhao, Wei Luo, Feng Tien, Juntao Gao, Simon Fong

Abstract: Early detection of gastric cancer, a leading cause of cancer-related mortality worldwide, remains hampered by the limitations of current diagnostic technologies, leading to high rates of misdiagnosis and missed diagnoses. To address these challenges, we propose an integrated system that synergizes advanced hardware and software technologies to balance speed-accuracy. Our study introduces the One Class Twin Cross Learning (OCT-X) algorithm. Leveraging a novel fast double-threshold grid search strategy (FDT-GS) and a patch-based deep fully convolutional network, OCT-X maximizes diagnostic accuracy through real-time data processing and seamless lesion surveillance. The hardware component includes an all-in-one point-of-care testing (POCT) device with high-resolution imaging sensors, real-time data processing, and wireless connectivity, facilitated by the NI CompactDAQ and LabVIEW software. Our integrated system achieved an unprecedented diagnostic accuracy of 99.70%, significantly outperforming existing models by up to 4.47%, and demonstrated a 10% improvement in multirate adaptability. These findings underscore the potential of OCT-X as well as the integrated system in clinical diagnostics, offering a path toward more accurate, efficient, and less invasive early gastric cancer detection. Future research will explore broader applications, further advancing oncological diagnostics. Code is available at https://github.com/liu37972/Multirate-Location-on-OCT-X-Learning.git.

Submitted 31 March, 2025; originally announced April 2025.

Comments: 26 pages, 4 figures, 6 tables

arXiv:2503.24086 [pdf, other]

Distributed AC Optimal Power Flow: A Scalable Solution for Large-Scale Problems

Authors: Xinliang Dai, Yuning Jiang, Yi Guo, Colin N. Jones, Moritz Diehl, Veit Hagenmeyer

Abstract: This paper introduces a novel distributed optimization framework for large-scale AC Optimal Power Flow (OPF) problems, offering both theoretical convergence guarantees and rapid convergence in practice. By integrating smoothing techniques and the Schur complement, the proposed approach addresses the scalability challenges and reduces communication overhead in distributed AC OPF. Additionally, optimal network decomposition enables efficient parallel processing under the single program multiple data (SPMD) paradigm. Extensive simulations on large-scale benchmarks across various operating scenarios indicate that the proposed framework outperforms the state-of-the-art centralized solver IPOPT on modest hardware. This paves the way for more scalable and efficient distributed optimization in future power system applications.

Submitted 4 April, 2025; v1 submitted 31 March, 2025; originally announced March 2025.

arXiv:2503.21942 [pdf, other]

Enhancing Mobile Crowdsensing Efficiency: A Coverage-aware Resource Allocation Approach

Authors: Yaru Fu, Yue Zhang, Zheng Shi, Yongna Guo, Yalin Liu

Abstract: In this study, we investigate the resource management challenges in next-generation mobile crowdsensing networks with the goal of minimizing task completion latency while ensuring coverage performance, an essential metric for comprehensive data collection across the monitored area that has been commonly overlooked in existing studies. To this end, we formulate a weighted latency and coverage gap minimization problem via jointly optimizing user selection, subchannel allocation, and sensing task allocation. The formulated minimization problem is a non-convex mixed-integer programming problem. To facilitate the analysis, we decompose the original optimization problem into two subproblems. One focuses on optimizing sensing task and subband allocation under fixed sensing user selection, which is optimally solved by the Hungarian algorithm via problem reformulation. Building upon these findings, we introduce a time-efficient two-sided swapping method to refine the scheduled user set and enhance system performance. Extensive numerical results demonstrate the effectiveness of our proposed approach compared to various benchmark strategies.

Submitted 27 March, 2025; originally announced March 2025.

arXiv:2503.14986 [pdf]

Enhancing Fault Detection and Isolation in an All-Electric Auxiliary Power Unit (APU) Gas Generator by Utilizing Starter/Generator Signal

Authors: Haotian Mao, Khashayar Khorasani, Yingqing Guo

Abstract: This study proposes a novel paradigm for enhancing fault detection and isolation (FDI) of gas generators in all-electric auxiliary power units (APUs) by utilizing shaft power information from the starter/generator. First, we conduct a pioneering investigation into the challenges and opportunities for FDI brought about by APU electrification. Our analysis reveals that the electrification of the APU opens up new possibilities for utilizing shaft power estimates from the starter/generator to improve gas generator FDI. We then provide comprehensive theoretical and analytical evidence demonstrating why, how, and to what extent the shaft power information from the starter/generator can fundamentally enhance the estimation accuracy of system states and health parameters of the gas generator, while also identifying the key factors influencing these improvements in FDI performance. The effectiveness of the proposed paradigm and its theoretical foundations are validated through extensive Monte Carlo simulations. Furthermore, through comprehensive comparative analysis with state-of-the-art gas generator fault diagnosis methods, our experimental results not only demonstrate the superior performance of the proposed approach but also validate that the diagnostic capabilities of existing advanced FDI techniques can be substantially enhanced by incorporating shaft power information. The observed performance improvement patterns strongly align with our theoretical analysis, verifying both the effectiveness and guiding significance of our theoretical framework.
These research findings provide a unique perspective in answering three fundamental questions: why joint fault diagnosis of the starter/generator and gas generator is essential, how it can be implemented, and what factors determine its effectiveness, thereby opening up promising new avenues for FDI technologies in all-electric APU systems.

Submitted 19 March, 2025; originally announced March 2025.

arXiv:2503.14892 [pdf, other]

Degradation Alchemy: Self-Supervised Unknown-to-Known Transformation for Blind Hyperspectral Image Fusion

Authors: He Huang, Yong Chen, Yujun Guo, Wei He

Abstract: Hyperspectral image (HSI) fusion is an efficient technique that combines low-resolution HSI (LR-HSI) and high-resolution multispectral images (HR-MSI) to generate high-resolution HSI (HR-HSI). Existing supervised learning methods (SLMs) can yield promising results when test data degradation matches the training ones, but they face challenges in generalizing to unknown degradations. To unleash the potential and generalization ability of SLMs, we propose a novel self-supervised unknown-to-known degradation transformation framework (U2K) for blind HSI fusion, which adaptively transforms unknown degradation into the same type of degradation as those handled by pre-trained SLMs. Specifically, the proposed U2K framework consists of: (1) spatial and spectral Degradation Wrapping (DW) modules that map HR-HSI to unknown degraded HR-MSI and LR-HSI, and (2) Degradation Transformation (DT) modules that convert these wrapped data into predefined degradation patterns. The transformed HR-MSI and LR-HSI pairs are then processed by a pre-trained network to reconstruct the target HR-HSI. We train the U2K framework in a self-supervised manner using consistency loss and greedy alternating optimization, significantly improving the flexibility of blind HSI fusion. Extensive experiments confirm the effectiveness of our proposed U2K framework in boosting the adaptability of five existing SLMs under various degradation settings and surpassing state-of-the-art blind methods.

Submitted 19 March, 2025; originally announced March 2025.

arXiv:2503.10522 [pdf, other]

AudioX: Diffusion Transformer for Anything-to-Audio Generation

Authors: Zeyue Tian, Yizhu Jin, Zhaoyang Liu, Ruibin Yuan, Xu Tan, Qifeng Chen, Wei Xue, Yike Guo

Abstract: Audio and music generation have emerged as crucial tasks in many applications, yet existing approaches face significant limitations: they operate in isolation without unified capabilities across modalities, suffer from scarce high-quality, multi-modal training data, and struggle to effectively integrate diverse inputs. In this work, we propose AudioX, a unified Diffusion Transformer model for Anything-to-Audio and Music Generation. Unlike previous domain-specific models, AudioX can generate both general audio and music with high quality, while offering flexible natural language control and seamless processing of various modalities including text, video, image, music, and audio. Its key innovation is a multi-modal masked training strategy that masks inputs across modalities and forces the model to learn from masked inputs, yielding robust and unified cross-modal representations. To address data scarcity, we curate two comprehensive datasets: vggsound-caps with 190K audio captions based on the VGGSound dataset, and V2M-caps with 6 million music captions derived from the V2M dataset. Extensive experiments demonstrate that AudioX not only matches or outperforms state-of-the-art specialized models, but also offers remarkable versatility in handling diverse input modalities and generation tasks within a unified architecture. The code and datasets will be available at https://zeyuet.github.io/AudioX/

Submitted 23 April, 2025; v1 submitted 13 March, 2025; originally announced March 2025.

Comments: The code and datasets will be available at https://zeyuet.github.io/AudioX/

arXiv:2503.08638 [pdf, other]

YuE: Scaling Open Foundation Models for Long-Form Music Generation

Authors: Ruibin Yuan, Hanfeng Lin, Shuyue Guo, Ge Zhang, Jiahao Pan, Yongyi Zang, Haohe Liu, Yiming Liang, Wenye Ma, Xingjian Du, Xinrun Du, Zhen Ye, Tianyu Zheng, Yinghao Ma, Minghao Liu, Zeyue Tian, Ziya Zhou, Liumeng Xue, Xingwei Qu, Yizhi Li, Shangda Wu, Tianhao Shen, Ziyang Ma, Jun Zhan, Chunhui Wang , et al. (32 additional authors not shown)

Abstract: We tackle the task of long-form music generation, particularly the challenging lyrics-to-song problem, by introducing YuE, a family of open foundation models based on the LLaMA2 architecture. Specifically, YuE scales to trillions of tokens and generates up to five minutes of music while maintaining lyrical alignment, coherent musical structure, and engaging vocal melodies with appropriate accompaniment. It achieves this through (1) track-decoupled next-token prediction to overcome dense mixture signals, (2) structural progressive conditioning for long-context lyrical alignment, and (3) a multitask, multiphase pre-training recipe to converge and generalize. In addition, we redesign the in-context learning technique for music generation, enabling versatile style transfer (e.g., converting Japanese city pop into an English rap while preserving the original accompaniment) and bidirectional generation. Through extensive evaluation, we demonstrate that YuE matches or even surpasses some of the proprietary systems in musicality and vocal agility. In addition, fine-tuning YuE enables additional controls and enhanced support for tail languages. Furthermore, beyond generation, we show that YuE's learned representations can perform well on music understanding tasks, where the results of YuE match or exceed state-of-the-art methods on the MARBLE benchmark. Keywords: lyrics2song, song generation, long-form, foundation model, music generation

Submitted 11 March, 2025; originally announced March 2025.

Comments: https://github.com/multimodal-art-projection/YuE

arXiv:2503.05843 [pdf, other]

Decadal analysis of sea surface temperature patterns, climatology, and anomalies in temperate coastal waters with Landsat-8 TIRS observations

Authors: Yiqing Guo, Nagur Cherukuru, Eric Lehmann, Xiubin Qi, Mark Doubelld, S. L. Kesav Unnithan, Ming Feng

Abstract: Sea surface temperature (SST) is a fundamental physical parameter characterising the thermal state of sea surface. Due to the intricate thermal interactions between land, sea, and atmosphere, the spatial gradients of SST in coastal waters often appear at finer spatial scales than those in open ocean waters. The Thermal Infrared Sensor (TIRS) onboard Landsat-8, with its 100-meter spatial resolution, offers a unique opportunity to uncover fine-scale coastal SST patterns that would otherwise be overlooked by coarser-resolution thermal sensors. In this study, we first analysed the spatiotemporal patterns of SST in South Australia's temperate coastal waters from 2014 to 2023 by developing an operational approach for SST retrieval from the Landsat-8 TIRS sensor. A buoy was deployed off the coast of Port Lincoln, South Australia, to validate the quality of SST retrievals. Then the daily baseline climatology of SST with 100 m resolution was constructed, which allowed for the detection and analysis of anomalous SST events.
Our results suggest the following: (1) the satellite-derived SST data aligned well with the in-situ measured SST values; (2) the semi-enclosed, shallow regions of Upper Spencer Gulf and Upper St Vincent Gulf showed higher temperatures during summer and cooler temperatures during winter than waters closer to the open ocean, resulting in a higher seasonal variation in SST; (3) the near-shore shallow areas in Spencer Gulf and St Vincent Gulf, and regions surrounding Kangaroo Island, were identified to have a higher probability of SST anomalies compared to the rest of the study area; and (4) anomalous SST events were more likely to happen during the warm months than the cool months. We hope these findings would be helpful in supporting the fishing and aquaculture industries in the coastal waters of South Australia.

Submitted 13 May, 2025; v1 submitted 6 March, 2025; originally announced March 2025.

Comments: Submitted to GIScience & Remote Sensing

arXiv:2503.04157 [pdf, other]

Deep Joint CSI Estimation-Feedback-Precoding for MU-MIMO OFDM Systems

Authors: Yiran Guo, Wei Chen, Bo Ai, Lun Li

Abstract: As the number of antennas in frequency-division duplex (FDD) multiple-input multiple-output (MIMO) systems increases, acquiring channel state information (CSI) becomes increasingly challenging due to limited spectral resources and feedback overhead. In this paper, we propose an end-to-end network that conducts joint design with pilot design, CSI estimation, CSI feedback, and precoding design in the multi-user MIMO orthogonal frequency-division multiplexing (OFDM) scenario. Multiple communication modules are jointly designed and trained with a common optimization objective to prevent mismatches between modules and discrepancies between individual module objectives and the final system goal. Experimental results demonstrate that, under the same feedback and CE overheads, the proposed joint multi-module end-to-end network achieves a higher multi-user downlink spectral efficiency than traditional algorithms based on separate architecture and partially separated artificial intelligence-based network architectures under comparable channel quality. Furthermore, compared to conventional separate architecture, the proposed network architecture with joint architecture reduces the computational burden and model storage overhead at the UE side, facilitating the deployment of low-overhead multi-module joint architectures in practice. While slightly increasing storage requirements at the base station, it reduces computational complexity and precoding design delay, effectively reducing the effects of channel aging challenges.

Submitted 6 March, 2025; originally announced March 2025.

arXiv:2503.01710 [pdf, other]

Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens

Authors: Xinsheng Wang, Mingqi Jiang, Ziyang Ma, Ziyu Zhang, Songxiang Liu, Linqin Li, Zheng Liang, Qixi Zheng, Rui Wang, Xiaoqin Feng, Weizhen Bian, Zhen Ye, Sitong Cheng, Ruibin Yuan, Zhixian Zhao, Xinfa Zhu, Jiahao Pan, Liumeng Xue, Pengcheng Zhu, Yunlin Chen, Zhifei Li, Xie Chen, Lei Xie, Yike Guo, Wei Xue

Abstract: Recent advancements in large language models (LLMs) have driven significant progress in zero-shot text-to-speech (TTS) synthesis. However, existing foundation models rely on multi-stage processing or complex architectures for predicting multiple codebooks, limiting efficiency and integration flexibility. To overcome these challenges, we introduce Spark-TTS, a novel system powered by BiCodec, a single-stream speech codec that decomposes speech into two complementary token types: low-bitrate semantic tokens for linguistic content and fixed-length global tokens for speaker attributes. This disentangled representation, combined with the Qwen2.5 LLM and a chain-of-thought (CoT) generation approach, enables both coarse-grained control (e.g., gender, speaking style) and fine-grained adjustments (e.g., precise pitch values, speaking rate). To facilitate research in controllable TTS, we introduce VoxBox, a meticulously curated 100,000-hour dataset with comprehensive attribute annotations. Extensive experiments demonstrate that Spark-TTS not only achieves state-of-the-art zero-shot voice cloning but also generates highly customizable voices that surpass the limitations of reference-based synthesis. Source code, pre-trained models, and audio samples are available at https://github.com/SparkAudio/Spark-TTS.

Submitted 3 March, 2025; originally announced March 2025.

Comments: Submitted to ACL 2025

arXiv:2503.01265 [pdf, other]

Interactive Gadolinium-Free MRI Synthesis: A Transformer with Localization Prompt Learning

Authors: Linhao Li, Changhui Su, Yu Guo, Huimao Zhang, Dong Liang, Kun Shang

Abstract: Contrast-enhanced magnetic resonance imaging (CE-MRI) is crucial for tumor detection and diagnosis, but the use of gadolinium-based contrast agents (GBCAs) in clinical settings raises safety concerns due to potential health risks. To circumvent these issues while preserving diagnostic accuracy, we propose a novel Transformer with Localization Prompts (TLP) framework for synthesizing CE-MRI from non-contrast MR images. Our architecture introduces three key innovations: a hierarchical backbone that uses efficient Transformer to process multi-scale features; a multi-stage fusion system consisting of Local and Global Fusion modules that hierarchically integrate complementary information via spatial attention operations and cross-attention mechanisms, respectively; and a Fuzzy Prompt Generation (FPG) module that enhances the TLP model's generalization by emulating radiologists' manual annotation through stochastic feature perturbation. The framework uniquely enables interactive clinical integration by allowing radiologists to input diagnostic prompts during inference, synergizing artificial intelligence with medical expertise. This research establishes a new paradigm for contrast-free MRI synthesis while addressing critical clinical needs for safer diagnostic procedures. Codes are available at https://github.com/ChanghuiSu/TLP.

Submitted 3 March, 2025; originally announced March 2025.

arXiv:2502.18913 [pdf, other]

CS-Dialogue: A 104-Hour Dataset of Spontaneous Mandarin-English Code-Switching Dialogues for Speech Recognition

Authors: Jiaming Zhou, Yujie Guo, Shiwan Zhao, Haoqin Sun, Hui Wang, Jiabei He, Aobo Kong, Shiyao Wang, Xi Yang, Yequan Wang, Yonghua Lin, Yong Qin

Abstract: Code-switching (CS), the alternation between two or more languages within a single conversation, presents significant challenges for automatic speech recognition (ASR) systems. Existing Mandarin-English code-switching datasets often suffer from limitations in size, spontaneity, and the lack of full-length dialogue recordings with transcriptions, hindering the development of robust ASR models for real-world conversational scenarios. This paper introduces CS-Dialogue, a novel large-scale Mandarin-English code-switching speech dataset comprising 104 hours of spontaneous conversations from 200 speakers. Unlike previous datasets, CS-Dialogue provides full-length dialogue recordings with complete transcriptions, capturing naturalistic code-switching patterns in continuous speech. We describe the data collection and annotation processes, present detailed statistics of the dataset, and establish benchmark ASR performance using state-of-the-art models. Our experiments, using Transformer, Conformer, and Branchformer, demonstrate the challenges of code-switching ASR, and show that existing pre-trained models such as Whisper still have room to improve. The CS-Dialogue dataset will be made freely available for all academic purposes.

Submitted 11 March, 2025; v1 submitted 26 February, 2025; originally announced February 2025.

arXiv:2502.16584 [pdf, other]

Audio-FLAN: A Preliminary Release

Authors: Liumeng Xue, Ziya Zhou, Jiahao Pan, Zixuan Li, Shuai Fan, Yinghao Ma, Sitong Cheng, Dongchao Yang, Haohan Guo, Yujia Xiao, Xinsheng Wang, Zixuan Shen, Chuanbo Zhu, Xinshen Zhang, Tianchi Liu, Ruibin Yuan, Zeyue Tian, Haohe Liu, Emmanouil Benetos, Ge Zhang, Yike Guo, Wei Xue

Abstract: Recent advancements in audio tokenization have significantly enhanced the integration of audio capabilities into large language models (LLMs). However, audio understanding and generation are often treated as distinct tasks, hindering the development of truly unified audio-language models. While instruction tuning has demonstrated remarkable success in improving generalization and zero-shot learning across text and vision, its application to audio remains largely unexplored. A major obstacle is the lack of comprehensive datasets that unify audio understanding and generation. To address this, we introduce Audio-FLAN, a large-scale instruction-tuning dataset covering 80 diverse tasks across speech, music, and sound domains, with over 100 million instances. Audio-FLAN lays the foundation for unified audio-language models that can seamlessly handle both understanding (e.g., transcription, comprehension) and generation (e.g., speech, music, sound) tasks across a wide range of audio domains in a zero-shot manner. The Audio-FLAN dataset is available on HuggingFace and GitHub and will be continuously updated.

Submitted 23 February, 2025; originally announced February 2025.

arXiv:2502.06490 [pdf, other]

Recent Advances in Discrete Speech Tokens: A Review

Authors: Yiwei Guo, Zhihan Li, Hankun Wang, Bohan Li, Chongtian Shao, Hanglei Zhang, Chenpeng Du, Xie Chen, Shujie Liu, Kai Yu

Abstract: The rapid advancement of speech generation technologies in the era of large language models (LLMs) has established discrete speech tokens as a foundational paradigm for speech representation. These tokens, characterized by their discrete, compact, and concise nature, are not only advantageous for efficient transmission and storage, but also inherently compatible with the language modeling framework, enabling seamless integration of speech into text-dominated LLM architectures. Current research categorizes discrete speech tokens into two principal classes: acoustic tokens and semantic tokens, each of which has evolved into a rich research domain characterized by unique design philosophies and methodological approaches. This survey systematically synthesizes the existing taxonomy and recent innovations in discrete speech tokenization, conducts a critical examination of the strengths and limitations of each paradigm, and presents systematic experimental comparisons across token types. Furthermore, we identify persistent challenges in the field and propose potential research directions, aiming to offer actionable insights to inspire future advancements in the development and application of discrete speech tokens.

Submitted 16 February, 2025; v1 submitted 10 February, 2025; originally announced February 2025.

Comments: 23 pages, 8 figures, 3 tables. Work in progress

arXiv:2502.05471 [pdf, other]

Enhancing Expressive Voice Conversion with Discrete Pitch-Conditioned Flow Matching Model

Authors: Jialong Zuo, Shengpeng Ji, Minghui Fang, Ziyue Jiang, Xize Cheng, Qian Yang, Wenrui Liu, Guangyan Zhang, Zehai Tu, Yiwen Guo, Zhou Zhao

Abstract: This paper introduces PFlow-VC, a conditional flow matching voice conversion model that leverages fine-grained discrete pitch tokens and target speaker prompt information for expressive voice conversion (VC). Previous VC works primarily focus on speaker conversion, with further exploration needed in enhancing expressiveness (such as prosody and emotion) for timbre conversion. Unlike previous methods, we adopt a simple and efficient approach to enhance the style expressiveness of voice conversion models. Specifically, we pretrain a self-supervised pitch VQVAE model to discretize speaker-irrelevant pitch information and leverage a masked pitch-conditioned flow matching model for Mel-spectrogram synthesis, which provides in-context pitch modeling capabilities for the speaker conversion model, effectively improving the voice style transfer capacity. Additionally, we improve timbre similarity by combining global timbre embeddings with time-varying timbre tokens. Experiments on the unseen LibriTTS test-clean set and the emotional speech dataset ESD show the superiority of the PFlow-VC model in both timbre conversion and style transfer. Audio samples are available on the demo page https://speechai-demo.github.io/PFlow-VC/.

Submitted 8 February, 2025; originally announced February 2025.

Comments: Accepted by ICASSP 2025

arXiv:2502.04988 [pdf, other]

CMamba: Learned Image Compression with State Space Models

Authors: Zhuojie Wu, Heming Du, Shuyun Wang, Ming Lu, Haiyang Sun, Yandong Guo, Xin Yu

Abstract: Learned Image Compression (LIC) has explored various architectures, such as Convolutional Neural Networks (CNNs) and transformers, in modeling image content distributions in order to achieve compression effectiveness. However, achieving high rate-distortion performance while maintaining low computational complexity (i.e., parameters, FLOPs, and latency) remains challenging. In this paper, we propose a hybrid Convolution and State Space Models (SSMs) based image compression framework, termed CMamba, to achieve superior rate-distortion performance with low computational complexity. Specifically, CMamba introduces two key components: a Content-Adaptive SSM (CA-SSM) module and a Context-Aware Entropy (CAE) module. First, we observed that SSMs excel in modeling overall content but tend to lose high-frequency details. In contrast, CNNs are proficient at capturing local details. Motivated by this, we propose the CA-SSM module that can dynamically fuse global content extracted by SSM blocks and local details captured by CNN blocks in both encoding and decoding stages. As a result, important image content is well preserved during compression. Second, our proposed CAE module is designed to reduce spatial and channel redundancies in latent representations after encoding. Specifically, our CAE leverages SSMs to parameterize the spatial content in latent representations. Benefiting from SSMs, CAE significantly improves spatial compression efficiency while reducing spatial content redundancies. Moreover, along the channel dimension, CAE reduces inter-channel redundancies of latent representations in an autoregressive manner, which can fully exploit prior knowledge from previous channels without sacrificing efficiency. Experimental results demonstrate that CMamba achieves superior rate-distortion performance.

Submitted 7 February, 2025; originally announced February 2025.

arXiv:2502.04128 [pdf, other]

Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis

Authors: Zhen Ye, Xinfa Zhu, Chi-Min Chan, Xinsheng Wang, Xu Tan, Jiahe Lei, Yi Peng, Haohe Liu, Yizhu Jin, Zheqi Dai, Hongzhan Lin, Jianyi Chen, Xingjian Du, Liumeng Xue, Yunlin Chen, Zhifei Li, Lei Xie, Qiuqiang Kong, Yike Guo, Wei Xue

Abstract: Recent advances in text-based large language models (LLMs), particularly in the GPT series and the o1 model, have demonstrated the effectiveness of scaling both training-time and inference-time compute. However, current state-of-the-art TTS systems leveraging LLMs are often multi-stage, requiring separate models (e.g., diffusion models after LLM), complicating the decision of whether to scale a particular model during training or testing. This work makes the following contributions: First, we explore the scaling of train-time and inference-time compute for speech synthesis. Second, we propose a simple framework Llasa for speech synthesis that employs a single-layer vector quantizer (VQ) codec and a single Transformer architecture to fully align with standard LLMs such as Llama. Our experiments reveal that scaling train-time compute for Llasa consistently improves the naturalness of synthesized speech and enables the generation of more complex and accurate prosody patterns. Furthermore, from the perspective of scaling inference-time compute, we employ speech understanding models as verifiers during the search, finding that scaling inference-time compute shifts the sampling modes toward the preferences of specific verifiers, thereby improving emotional expressiveness, timbre consistency, and content accuracy. In addition, we have publicly released the checkpoints and training code for our TTS models (1B, 3B, 8B) and codec model.

Submitted 22 February, 2025; v1 submitted 6 February, 2025; originally announced February 2025.

arXiv:2502.03496 [pdf, other]

FreqPrior: Improving Video Diffusion Models with Frequency Filtering Gaussian Noise

Authors: Yunlong Yuan, Yuanfan Guo, Chunwei Wang, Wei Zhang, Hang Xu, Li Zhang

Abstract: Text-driven video generation has advanced significantly due to developments in diffusion models. Beyond the training and sampling phases, recent studies have investigated noise priors of diffusion models, as improved noise priors yield better generation results. One recent approach employs the Fourier transform to manipulate noise, marking the initial exploration of frequency operations in this context. However, it often generates videos that lack motion dynamics and imaging details. In this work, we provide a comprehensive theoretical analysis of the variance decay issue present in existing methods, which contributes to the loss of details and motion dynamics. Recognizing the critical impact of noise distribution on generation quality, we introduce FreqPrior, a novel noise initialization strategy that refines noise in the frequency domain. Our method features a novel filtering technique designed to address different frequency signals while maintaining a noise prior distribution that closely approximates a standard Gaussian distribution. Additionally, we propose a partial sampling process that perturbs the latent at an intermediate timestep while finding the noise prior, significantly reducing inference time without compromising quality. Extensive experiments on VBench demonstrate that our method achieves the highest scores in both quality and semantic assessments, resulting in the best overall total score. These results highlight the superiority of our proposed noise prior.

Submitted 19 February, 2025; v1 submitted 5 February, 2025; originally announced February 2025.

Comments: ICLR 2025

arXiv:2502.03338 [pdf, other]

Optimal PMU Placement for Kalman Filtering of DAE Power System Models

Authors: Milos Katanic, Yi Guo, John Lygeros, Gabriela Hug

Abstract: Optimal sensor placement is essential for minimizing costs and ensuring accurate state estimation in power systems. This paper introduces a novel method for optimal sensor placement for dynamic state estimation of power systems modeled by differential-algebraic equations. The method identifies optimal sensor locations by minimizing the steady-state covariance matrix of the Kalman filter, thus minimizing the error of joint differential and algebraic state estimation. The problem is reformulated as a mixed-integer semidefinite program and effectively solved using off-the-shelf numerical solvers. Numerical results demonstrate the merits of the proposed approach by benchmarking its performance in phasor measurement unit placement in comparison to greedy algorithms.

Submitted 5 February, 2025; originally announced February 2025.

arXiv:2502.00699 [pdf, other]

Measurement and Analysis of Scattering From Building Surfaces at Millimeter-Wave Frequency

Authors: Yulu Guo, Tongjia Zhang, Shu Sun, Meixia Tao, Ruifeng Gao

Abstract: In future air-to-ground integrated networks, the scattering effects from ground-based scatterers, such as buildings, cannot be neglected in millimeter-wave and higher frequency bands, and have a significant impact on channel characteristics. However, current scattering measurement studies primarily focus on single incident angles within the incident plane, leading to insufficient characterization of scattering properties. In this paper, we present scattering measurements conducted at 28 GHz on various real-world building surfaces with multiple incident angles and three-dimensional (3D) receiving angles. The measured data are analyzed in conjunction with parameterized scattering models in ray tracing and numerical simulations. Results indicate that for millimeter-wave channel modeling near building surfaces, it is crucial to account not only for surface materials but also for the scattering properties of the building surfaces with respect to the incident angle and receiving positions in 3D space.

Submitted 2 February, 2025; originally announced February 2025.

Comments: 6 pages, 7 figures. 2025 IEEE Wireless Communications and Networking Conference Workshops (WCNC Wkshps), Milan, Italy, 2025

arXiv:2502.00358 [pdf, other]

Do Audio-Visual Segmentation Models Truly Segment Sounding Objects?

Authors: Jia Li, Wenjie Zhao, Ziru Huang, Yunhui Guo, Yapeng Tian

Abstract: Unlike traditional visual segmentation, audio-visual segmentation (AVS) requires the model not only to identify and segment objects but also to determine whether they are sound sources. Recent AVS approaches, leveraging transformer architectures and powerful foundation models like SAM, have achieved impressive performance on standard benchmarks. Yet, an important question remains: Do these models genuinely integrate audio-visual cues to segment sounding objects? In this paper, we systematically investigate this issue in the context of robust AVS. Our study reveals a fundamental bias in current methods: they tend to generate segmentation masks based predominantly on visual salience, irrespective of the audio context. This bias results in unreliable predictions when sounds are absent or irrelevant. To address this challenge, we introduce AVSBench-Robust, a comprehensive benchmark incorporating diverse negative audio scenarios including silence, ambient noise, and off-screen sounds. We also propose a simple yet effective approach combining balanced training with negative samples and classifier-guided similarity learning. Our extensive experiments show that state-of-the-art AVS methods consistently fail under negative audio conditions, demonstrating the prevalence of visual bias. In contrast, our approach achieves remarkable improvements in both standard metrics and robustness measures, maintaining near-perfect false positive rates while preserving high-quality segmentation performance.

Submitted 20 February, 2025; v1 submitted 1 February, 2025; originally announced February 2025.

arXiv:2502.00043 [pdf, other]

Mitigating Traffic Oscillations in Mixed Traffic Flow with Scalable Deep Koopman Predictive Control

Authors: Hao Lyu, Yanyong Guo, Pan Liu, Nan Zheng, Ting Wang, Quansheng Yue

Abstract: The use of connected automated vehicles (CAVs) is advocated to mitigate traffic oscillations in mixed traffic flow consisting of CAVs and human-driven vehicles (HDVs). This study proposes an adaptive deep Koopman predictive control framework (AdapKoopPC) for regulating mixed traffic flow. First, a Koopman theory-based adaptive trajectory prediction deep network (AdapKoopnet) is designed for modeling HDV car-following behavior. AdapKoopnet enables the representation of HDV behavior by a linear model in a high-dimensional space. Second, model predictive control is employed to smooth the mixed traffic flow, where the combination of the linear dynamic model of CAVs and linear prediction blocks from AdapKoopnet is embedded as the predictive model into AdapKoopPC. Finally, the predictive performance of the proposed AdapKoopnet is verified using the HighD naturalistic driving dataset. Furthermore, the control performance of AdapKoopPC is validated by numerical simulations. Results demonstrate that AdapKoopnet provides more accurate predicted HDV trajectories than the baseline nonlinear models. Moreover, the proposed AdapKoopPC exhibits more effective control performance at lower computational cost than the baselines in mitigating traffic oscillations, especially at low CAV penetration rates. The code of the proposed AdapKoopPC is open source.

Submitted 22 April, 2025; v1 submitted 27 January, 2025; originally announced February 2025.

arXiv:2501.18878 [pdf, ps, other]

Integrated Sensing and Communication System Based on Radio Frequency Resonance Beam

Authors: Yixuan Guo, Shuaifan Xia, Mingliang Xiong, Qingwen Liu, Wen Fang, Qingwei Jiang, Gang Yan, Jiangchuan Mu

Abstract: To address the complex beam control in traditional multiple-input multiple-output (MIMO) systems, researchers have proposed adaptive beam alignment using retro-directive antenna (RDA) arrays. This approach creates echo resonance between the base station (BS) and user equipment (UE), significantly reducing computational load. However, conventional resonant beam systems (RBS) suffer from echo interference due to the shared uplink and downlink frequency. Therefore, this paper proposes an innovative resonance beam-based integrated sensing and communication (RB-ISAC) system designed for efficient passive sensing and bidirectional communication. In this system, the UE operates passively, with both the BS and UE utilizing a phase conjugation and frequency conversion structure to decouple uplink and downlink carrier frequencies, ensuring continuous electromagnetic wave oscillation between the two ends. Effective compensation for signal propagation loss enables resonance after multiple oscillations. At this point, the beam's field forms a low-diffraction-loss, highly focused pattern, automatically aligning the transmitter and receiver. This enables high-precision passive positioning alongside robust uplink and downlink communication. Simulation results demonstrate the proposed system achieves resonance within multiple iterations, supporting uplink and downlink communication up to 5 m, and enabling passive direction of arrival (DOA) estimation with an error under 2$^\circ$.

Submitted 5 June, 2025; v1 submitted 30 January, 2025; originally announced January 2025.

arXiv:2501.16471 [pdf, other]

SIM: Surface-based fMRI Analysis for Inter-Subject Multimodal Decoding from Movie-Watching Experiments

Authors: Simon Dahan, Gabriel Bénédict, Logan Z. J. Williams, Yourong Guo, Daniel Rueckert, Robert Leech, Emma C. Robinson

Abstract: Current AI frameworks for brain decoding and encoding typically train and test models within the same datasets. This limits their utility for brain computer interfaces (BCI) or neurofeedback, for which it would be useful to pool experiences across individuals to better simulate stimuli not sampled during training. A key obstacle to model generalisation is the degree of variability of inter-subject cortical organisation, which makes it difficult to align or compare cortical signals across participants. In this paper we address this through the use of surface vision transformers, which build a generalisable model of cortical functional dynamics, through encoding the topography of cortical networks and their interactions as a moving image across a surface. This is then combined with tri-modal self-supervised contrastive (CLIP) alignment of audio, video, and fMRI modalities to enable the retrieval of visual and auditory stimuli from patterns of cortical activity (and vice-versa). We validate our approach on 7T task-fMRI data from 174 healthy participants engaged in the movie-watching experiment from the Human Connectome Project (HCP). Results show that it is possible to detect which movie clips an individual is watching purely from their brain activity, even for individuals and movies not seen during training. Further analysis of attention maps reveals that our model captures individual patterns of brain activity that reflect semantic and visual systems. This opens the door to future personalised simulations of brain function. Code & pre-trained models will be made available at https://github.com/metrics-lab/sim; processed data for training will be available upon request at https://gin.g-node.org/Sdahan30/sim.

Submitted 27 January, 2025; originally announced January 2025.

Comments: 27 pages, accepted to ICLR 2025

arXiv:2501.14367 [pdf, other]

Joint System Latency and Data Freshness Optimization for Cache-enabled Mobile Crowdsensing Networks

Authors: Kexin Shi, Yaru Fu, Yongna Guo, Fu Lee Wang, Yan Zhang

Abstract: Mobile crowdsensing (MCS) networks enable large-scale data collection by leveraging the ubiquity of mobile devices. However, frequent sensing and data transmission can lead to significant resource consumption. To mitigate this issue, edge caching has been proposed as a solution for storing recently collected data. Nonetheless, this approach may compromise data freshness. In this paper, we investigate the trade-off between re-using cached task results and re-sensing tasks in cache-enabled MCS networks, aiming to minimize system latency while maintaining information freshness. To this end, we formulate a weighted delay and age of information (AoI) minimization problem, jointly optimizing sensing decisions, user selection, channel selection, task allocation, and caching strategies. The problem is a mixed-integer non-convex programming problem, which is intractable. Therefore, we decompose the long-term problem into sequential one-shot sub-problems and design a framework that optimizes the system latency, task sensing decision, and caching strategy sub-problems. When one task is re-sensing, the one-shot problem simplifies to the system latency minimization problem, which can be solved optimally. The task sensing decision is then made by comparing the system latency and AoI. Additionally, a Bayesian update strategy is developed to manage the cached task results. Building upon this framework, we propose a lightweight and time-efficient algorithm that makes real-time decisions for the long-term optimization problem. Extensive simulation results validate the effectiveness of our approach.

Submitted 24 January, 2025; originally announced January 2025.

arXiv:2501.08139 [pdf, other]

EEG-ReMinD: Enhancing Neurodegenerative EEG Decoding through Self-Supervised State Reconstruction-Primed Riemannian Dynamics

Authors: Zirui Wang, Zhenxi Song, Yi Guo, Yuxin Liu, Guoyang Xu, Min Zhang, Zhiguo Zhang

Abstract: The development of EEG decoding algorithms confronts challenges such as data sparsity, subject variability, and the need for precise annotations, all of which are vital for advancing brain-computer interfaces and enhancing the diagnosis of diseases. To address these issues, we propose a novel two-stage approach named Self-Supervised State Reconstruction-Primed Riemannian Dynamics (EEG-ReMinD), which mitigates reliance on supervised learning and integrates inherent geometric features. This approach efficiently handles EEG data corruptions and reduces the dependency on labels. EEG-ReMinD utilizes self-supervised and geometric learning techniques, along with an attention mechanism, to analyze the temporal dynamics of EEG features within the framework of Riemannian geometry, referred to as Riemannian dynamics. Comparative analyses on both intact and corrupted datasets from two different neurodegenerative disorders underscore the enhanced performance of EEG-ReMinD.

Submitted 14 January, 2025; originally announced January 2025.

arXiv:2501.07057 [pdf, other]

Optimization with Multi-sourced Reference Information and Unknown Trust: A Distributionally Robust Approach

Authors: Yanru Guo, Ruiwei Jiang, Siqian Shen

Abstract: In problems that involve input parameter information gathered from multiple data sources with varying reliability, incorporating users' trust about different sources in decision-optimization models can potentially improve solution performance and reliability. In this work, we propose a novel multi-reference distributionally robust optimization (MR-DRO) framework, where the model inputs are uncertain and their probability distributions can be statistically inferred from multiple data sources. Via nonparametric data fusion, we construct a Wasserstein ambiguity set to minimize the worst-case expected value of a stochastic objective function, accounting for both uncertainty and unknown reliability of information sources. We reformulate the MR-DRO model as a linear program given linear objective and constraints in the original problem. We also incorporate a dynamic trust update mechanism that adjusts the trust for each source based on its performance over time. In addition, we introduce the concept of probability dominance to identify sources with dominant trust. Via solving instances of resource allocation and portfolio optimization, we demonstrate the effectiveness of the trust-informed MR-DRO approach compared to traditional optimization frameworks relying on a single data source. Our results highlight the significance of integrating (dynamic) user trust in decision making under uncertainty, particularly when given diverse and potentially conflicting input data.

Submitted 12 January, 2025; originally announced January 2025.

Comments: 38 pages, 9 figures, 7 tables

arXiv:2501.01684 [pdf]

Millimeter-Wave Energy-Efficient Hybrid Beamforming Architecture and Algorithm

Authors: Hongpu Zhang, Yulu Guo, Liuxun Xue, Xingchen Liu, Shu Sun, Ruifeng Gao, Xianghao Yu, Meixia Tao

Abstract: This paper studies energy-efficient hybrid beamforming architectures and their algorithm design in millimeter-wave communication systems, aiming to address the challenges faced by existing hybrid beamforming due to low hardware flexibility and high power consumption. To solve the problems of existing hybrid beamforming, a novel energy-efficient hybrid beamforming architecture is proposed, where radio-frequency (RF) switch networks are introduced at the front and rear ends of the phase shifter network, enabling dynamic connections between the RF chains and the phase shifter array as well as the antenna array. The system model of the proposed architecture is established, including digital precoding and analog precoding processes, and the practical hardware limitations such as quantization errors of the digital-to-analog converter (DAC) and phase shifter resolution. In order to maximize the energy efficiency, this paper derives an energy efficiency model including spectral efficiency and system power consumption, and a hybrid precoding algorithm is proposed based on block coordinate descent to iteratively optimize the digital precoding matrix, analog precoding matrix, and DAC resolution. Simulation results under the NYUSIM-generated millimeter-wave channels show that the proposed hybrid beamforming architecture and precoding algorithm have higher energy efficiency than existing representative architectures and precoding algorithms under complete and partial channel state information, while the loss of spectral efficiency compared to fully connected architecture is less than 20%.

Submitted 3 January, 2025; originally announced January 2025.

Comments: 21 pages, in Chinese language, 8 figures, published to Mobile Communications

Journal ref: Mobile Communications, vol. 48, no. 12, pp. 86-96, December 2024

arXiv:2412.17048 [pdf, other]

Why Do Speech Language Models Fail to Generate Semantically Coherent Outputs? A Modality Evolving Perspective

Authors: Hankun Wang, Haoran Wang, Yiwei Guo, Zhihan Li, Chenpeng Du, Xie Chen, Kai Yu

Abstract: Although text-based large language models exhibit human-level writing ability and remarkable intelligence, speech language models (SLMs) still struggle to generate semantically coherent outputs. There are several potential reasons for this performance degradation: (A) speech tokens mainly provide phonetic information rather than semantic information, (B) the length of speech sequences is much longer than that of text sequences, and (C) paralinguistic information, such as prosody, introduces additional complexity and variability. In this paper, we explore the influence of three key factors separately by transiting the modality from text to speech in an evolving manner. Our findings reveal that the impact of the three factors varies. Factor A has a relatively minor impact, factor B influences syntactical and semantic modeling more obviously, and factor C exerts the most significant impact, particularly in the basic lexical modeling. Based on these findings, we provide insights into the unique challenges of training SLMs and highlight pathways to develop more effective end-to-end SLMs.

Submitted 22 December, 2024; originally announced December 2024.

arXiv:2412.15597 [pdf, other]

Resonant Beam Multi-Target DOA Estimation

Authors: Yixuan Guo, Qingwei Jiang, Mingliang Xiong, Wen Fang, Mingqing Liu, Qingqing Zhang, Qingwen Liu, Gang Yan

Abstract: With the increasing demand for internet of things (IoT) applications, especially for location-based services, how to locate passive mobile targets (MTs) with minimal beam control has become a challenge. Resonant beam systems are considered promising IoT technologies with advantages such as beam self-alignment and energy concentration. To establish a resonant system in the radio frequency (RF) band and achieve multi-target localization, this paper designs a multi-target resonant system architecture, allowing a single base station (BS) to independently connect with multiple MTs. By employing a retro-directive array, a multi-channel cyclic model is established to realize one-to-many electromagnetic wave propagation and MT direction-of-arrival (DOA) estimation through echo resonance. Simulation results show that the proposed system supports resonant establishment between the BS and multiple MTs. This helps the BS to still have high DOA estimation accuracy in the face of multiple passive MTs, and can ensure that the DOA error is less than 1 degree within a range of 6 meters at a 50-degree field of view, with higher accuracy than active beamforming localization systems.

Submitted 13 February, 2025; v1 submitted 20 December, 2024; originally announced December 2024.

Showing 1–50 of 389 results for author: Guo, Y