-
Predicting Asphalt Pavement Friction Using Texture-Based Image Indicator
Authors:
Bingjie Lu,
Zhengyang Lu,
Yijiashun Qi,
Hanzhe Guo,
Tianyao Sun,
Zunduo Zhao
Abstract:
Pavement skid resistance is of vital importance for road safety. The objective of this study is to propose and validate a texture-based image indicator to predict pavement friction. This index enables pavement friction to be measured easily and inexpensively using digital images. Three different types of asphalt surfaces (dense-graded asphalt mix, open-graded friction course, and chip seal) were evaluated subject to various tire polishing cycles. Images were taken with corresponding friction measured using a Dynamic Friction Tester (DFT) in the laboratory. The aggregate protrusion area is proposed as the indicator. Statistical models are established for each asphalt surface type to correlate the proposed indicator with friction coefficients. The results show that the adjusted R-squared values of all relationships are above 0.90. Compared to other image-based indicators in the literature, the proposed image indicator more accurately reflects the changes in pavement friction with the number of polishing cycles, proving its cost-effective use for considering pavement friction in the mix design stage.
Submitted 4 July, 2025;
originally announced July 2025.
-
Opportunistic Osteoporosis Diagnosis via Texture-Preserving Self-Supervision, Mixture of Experts and Multi-Task Integration
Authors:
Jiaxing Huang,
Heng Guo,
Le Lu,
Fan Yang,
Minfeng Xu,
Ge Yang,
Wei Luo
Abstract:
Osteoporosis, characterized by reduced bone mineral density (BMD) and compromised bone microstructure, increases fracture risk in aging populations. While dual-energy X-ray absorptiometry (DXA) is the clinical standard for BMD assessment, its limited accessibility hinders diagnosis in resource-limited regions. Opportunistic computed tomography (CT) analysis has emerged as a promising alternative for osteoporosis diagnosis using existing imaging data. Current approaches, however, face three limitations: (1) underutilization of unlabeled vertebral data, (2) systematic bias from device-specific DXA discrepancies, and (3) insufficient integration of clinical knowledge such as spatial BMD distribution patterns. To address these, we propose a unified deep learning framework with three innovations. First, a self-supervised learning method using radiomic representations to leverage unlabeled CT data and preserve bone texture. Second, a Mixture of Experts (MoE) architecture with learned gating mechanisms to enhance cross-device adaptability. Third, a multi-task learning framework integrating osteoporosis diagnosis, BMD regression, and vertebra location prediction. Validated across three clinical sites and an external hospital, our approach demonstrates superior generalizability and accuracy over existing methods for opportunistic osteoporosis screening and diagnosis.
Submitted 25 June, 2025;
originally announced June 2025.
-
Sensing-Aware Transmit Waveform/Receive Filter Design for OFDM-MBS Systems
Authors:
Xinghe Li,
Kainan Cheng,
Hongzhi Guo,
Huiyong Li,
Ziyang Cheng
Abstract:
In this letter, we study the problem of cooperative sensing design for an orthogonal frequency division multiplexing (OFDM) multiple base stations (MBS) system. We consider a practical scenario where the base stations (BSs) exploit certain subcarriers to realize a sensing function. Since the high sidelobe level (SLL) of OFDM waveforms degrades radar detection for weak targets, and the cross-correlation generated by other BSs further degrades detection performance, we devise a joint design scheme for OFDM sequence and receive filter by minimizing the integrated sidelobe level (ISL) while satisfying mainlobe level, peak-to-average power ratio (PAPR) and spectrum allocation constraints. To address this non-convex problem, we propose an alternating optimization (AO)-based algorithm. Numerical simulations validate the effectiveness of the proposed method, demonstrating the superiority of SLL reduction in the MBS system over the matched filtering method.
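As a rough illustration of the quantity being minimized, the sketch below computes an integrated sidelobe level for one sequence and one receive filter. It is a simplified, single-BS reading of ISL (the sum of squared correlation magnitudes at all nonzero lags), not the letter's formulation, which also accounts for cross-correlation between BSs and the stated constraints. The function names and the Barker-3 test sequence are illustrative assumptions.

```python
def correlation(x, h):
    """Aperiodic cross-correlation r[k] = sum_n x[n + k] * conj(h[n]), k = -(N-1)..(N-1)."""
    n = len(x)
    if len(h) != n:
        raise ValueError("sequence and filter must have the same length")
    r = {}
    for k in range(-(n - 1), n):
        acc = 0
        for i in range(n):
            j = i + k
            if 0 <= j < n:
                acc += x[j] * complex(h[i]).conjugate()
        r[k] = acc
    return r


def integrated_sidelobe_level(x, h):
    """Sum of squared magnitudes of the correlation at every nonzero lag."""
    r = correlation(x, h)
    return sum(abs(v) ** 2 for k, v in r.items() if k != 0)


# Hypothetical example: Barker-3 sequence with a matched filter (h equal to x).
seq = [1, 1, -1]
isl_matched = integrated_sidelobe_level(seq, seq)
mainlobe = abs(correlation(seq, seq)[0])
```

Setting `h` equal to `x` gives the matched-filter case that the letter uses as its baseline; a mismatched `h` trades some mainlobe energy for lower sidelobes.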
Submitted 30 June, 2025; v1 submitted 25 June, 2025;
originally announced June 2025.
-
DCD: A Semantic Segmentation Model for Fetal Ultrasound Four-Chamber View
Authors:
Donglian Li,
Hui Guo,
Minglang Chen,
Huizhen Chen,
Jialing Chen,
Bocheng Liang,
Pengchen Liang,
Ying Tan
Abstract:
Accurate segmentation of anatomical structures in the apical four-chamber (A4C) view of fetal echocardiography is essential for early diagnosis and prenatal evaluation of congenital heart disease (CHD). However, precise segmentation remains challenging due to ultrasound artifacts, speckle noise, anatomical variability, and boundary ambiguity across different gestational stages. To reduce the workload of sonographers and enhance segmentation accuracy, we propose DCD, an advanced deep learning-based model for automatic segmentation of key anatomical structures in the fetal A4C view. Our model incorporates a Dense Atrous Spatial Pyramid Pooling (Dense ASPP) module, enabling superior multi-scale feature extraction, and a Convolutional Block Attention Module (CBAM) to enhance adaptive feature representation. By effectively capturing both local and global contextual information, DCD achieves precise and robust segmentation, contributing to improved prenatal cardiac assessment.
Submitted 10 June, 2025;
originally announced June 2025.
-
SpeechVerifier: Robust Acoustic Fingerprint against Tampering Attacks via Watermarking
Authors:
Lingfeng Yao,
Chenpei Huang,
Shengyao Wang,
Junpei Xue,
Hanqing Guo,
Jiang Liu,
Xun Chen,
Miao Pan
Abstract:
With the surge of social media, maliciously tampered public speeches, especially those from influential figures, have seriously affected social stability and public trust. Existing speech tampering detection methods remain insufficient: they either rely on external reference data or fail to be both sensitive to attacks and robust to benign operations, such as compression and resampling. To tackle these challenges, we introduce SpeechVerifier to proactively verify speech integrity using only the published speech itself, i.e., without requiring any external references. Inspired by audio fingerprinting and watermarking, SpeechVerifier can (i) effectively detect tampering attacks, (ii) be robust to benign operations and (iii) verify the integrity only based on published speeches. Briefly, SpeechVerifier utilizes multiscale feature extraction to capture speech features across different temporal resolutions. Then, it employs contrastive learning to generate fingerprints that can detect modifications at varying granularities. These fingerprints are designed to be robust to benign operations, but exhibit significant changes when malicious tampering occurs. To enable speech verification in a self-contained manner, the generated fingerprints are then embedded into the speech signal by segment-wise watermarking. Without external references, SpeechVerifier can retrieve the fingerprint from the published audio and check it with the embedded watermark to verify the integrity of the speech. Extensive experimental results demonstrate that the proposed SpeechVerifier is effective in detecting tampering attacks and robust to benign operations.
Submitted 1 June, 2025; v1 submitted 27 May, 2025;
originally announced May 2025.
-
AISHELL-5: The First Open-Source In-Car Multi-Channel Multi-Speaker Speech Dataset for Automatic Speech Diarization and Recognition
Authors:
Yuhang Dai,
He Wang,
Xingchen Li,
Zihan Zhang,
Shuiyuan Wang,
Lei Xie,
Xin Xu,
Hongxiao Guo,
Shaoji Zhang,
Hui Bu,
Wei Chen
Abstract:
This paper delineates AISHELL-5, the first open-source in-car multi-channel multi-speaker Mandarin automatic speech recognition (ASR) dataset. AISHELL-5 includes two parts: (1) over 100 hours of multi-channel speech data recorded in an electric vehicle across more than 60 real driving scenarios, consisting of four far-field speech signals captured by microphones located on each car door, as well as near-field signals obtained from high-fidelity headset microphones worn by each speaker; and (2) a collection of 40 hours of real-world environmental noise recordings, which supports the in-car speech data simulation. Moreover, we also provide an open-access, reproducible baseline system based on this dataset. This system features a speech frontend model that employs speech source separation to extract each speaker's clean speech from the far-field signals, along with a speech recognition module that accurately transcribes the content of each individual speaker. Experimental results demonstrate the challenges faced by various mainstream ASR models when evaluated on AISHELL-5. We firmly believe the AISHELL-5 dataset will significantly advance the research on ASR systems under complex driving scenarios by establishing the first publicly available in-car ASR benchmark.
Submitted 28 May, 2025;
originally announced May 2025.
-
Physics-Informed Neural Network for Cross-Domain Predictive Control of Tapered Amplifier Thermal Stabilization
Authors:
Yanpei Shi,
Bo Feng,
Yuxin Zhong,
Haochen Guo,
Bangcheng Han,
Rui Feng
Abstract:
Thermally induced laser noise poses a critical limitation to the sensitivity of quantum sensor arrays employing ultra-stable amplified lasers, primarily stemming from nonlinear gain-temperature coupling effects in tapered amplifiers (TAs). To address this challenge, we present a robust intelligent control strategy that synergistically integrates an encoder-decoder physics-informed gated recurrent unit (PI-GRU) network with a model predictive control (MPC) framework. Our methodology incorporates physical soft constraints into the neural network architecture, yielding a predictive model with enhanced physical consistency that demonstrates robust extrapolation capabilities beyond the training data distribution. Leveraging the PI-GRU model's accurate multi-step predictive performance, we implement a hierarchical parallel MPC architecture capable of real-time thermal instability compensation. This hybrid approach achieves cross-domain consistent thermal stabilization in TAs under diverse laser power operations. Remarkably, while trained exclusively on low-power operational data, our system demonstrates exceptional generalization, improving prediction accuracy by 58.2% and temperature stability by 69.1% in previously unseen high-power operating regimes, as experimentally validated. The novel synchronization of physics-informed neural networks with advanced MPC frameworks presented in this work establishes a groundbreaking paradigm for addressing robustness challenges in cross-domain predictive control applications, overcoming conventional modeling limitations.
Submitted 27 May, 2025;
originally announced May 2025.
-
Extremum Seeking for PDE Systems using Physics-Informed Neural Networks
Authors:
Haojin Guo,
Zongyi Guo,
Jianguo Guo,
Tiago Roux Oliveira
Abstract:
Extremum Seeking (ES) is an effective real-time optimization method for PDE systems in cascade with nonlinear quadratic maps. To address PDEs in the feedback loop, a boundary control law and a re-design of the additive probing signal are mandatory. The latter, commonly called "trajectory generation" or "motion planning," involves designing perturbation signals that anticipate their propagation through PDEs. Specifically, this requires solving motion planning problems for systems governed by parabolic and hyperbolic PDEs. Physics-Informed Neural Networks (PINN) is a powerful tool for solving PDEs by embedding physical laws as constraints in the neural network's loss function, enabling efficient solutions for high-dimensional, nonlinear, and complex problems. This paper proposes a novel construction integrating PINN and ES, automating the motion planning process for specific PDE systems and eliminating the need for case-by-case analytical derivations. The proposed strategy efficiently extracts perturbation signals, optimizing the PDE system.
Submitted 21 May, 2025;
originally announced May 2025.
-
Predicting Neo-Adjuvant Chemotherapy Response in Triple-Negative Breast Cancer Using Pre-Treatment Histopathologic Images
Authors:
Hikmat Khan,
Ziyu Su,
Huina Zhang,
Yihong Wang,
Bohan Ning,
Shi Wei,
Hua Guo,
Zaibo Li,
Muhammad Khalid Khan Niazi
Abstract:
Triple-negative breast cancer (TNBC) is an aggressive subtype defined by the lack of estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2) expression, resulting in limited targeted treatment options. Neoadjuvant chemotherapy (NACT) is the standard treatment for early-stage TNBC, with pathologic complete response (pCR) serving as a key prognostic marker; however, only 40-50% of patients with TNBC achieve pCR. Accurate prediction of NACT response is crucial to optimize therapy, avoid ineffective treatments, and improve patient outcomes. In this study, we developed a deep learning model to predict NACT response using pre-treatment hematoxylin and eosin (H&E)-stained biopsy images. Our model achieved promising results in five-fold cross-validation (accuracy: 82%, AUC: 0.86, F1-score: 0.84, sensitivity: 0.85, specificity: 0.81, precision: 0.80). Analysis of model attention maps in conjunction with multiplexed immunohistochemistry (mIHC) data revealed that regions of high predictive importance consistently colocalized with tumor areas showing elevated PD-L1 expression, CD8+ T-cell infiltration, and CD163+ macrophage density - all established biomarkers of treatment response. Our findings indicate that incorporating IHC-derived immune profiling data could substantially improve model interpretability and predictive performance. Furthermore, this approach may accelerate the discovery of novel histopathological biomarkers for NACT and advance the development of personalized treatment strategies for TNBC patients.
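For readers less familiar with the metrics listed above, here is a minimal sketch of how accuracy, sensitivity, specificity, precision, and F1-score follow from binary confusion-matrix counts. The counts are hypothetical and chosen only for illustration; they are not taken from the paper, and the function name is an assumption. AUC is omitted because it requires ranked prediction scores rather than counts.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    sensitivity = tp / (tp + fn)   # recall: responders correctly identified
    specificity = tn / (tn + fp)   # non-responders correctly identified
    precision = tp / (tp + fp)     # predicted responders that truly respond
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts for illustration only (not the study's data).
metrics = classification_metrics(tp=40, fp=10, tn=40, fn=10)
```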
Submitted 19 May, 2025;
originally announced May 2025.
-
Cross-layer Integrated Sensing and Communication: A Joint Industrial and Academic Perspective
Authors:
Henk Wymeersch,
Nuutti Tervo,
Stefan Wänstedt,
Sharief Saleh,
Joerg Ahlendorf,
Ozgur Akgul,
Vasileios Tsekenis,
Sokratis Barmpounakis,
Liping Bai,
Martin Beale,
Rafael Berkvens,
Nabeel Nisar Bhat,
Hui Chen,
Shrayan Das,
Claude Desset,
Antonio de la Oliva,
Prajnamaya Dass,
Jeroen Famaey,
Hamed Farhadi,
Gerhard P. Fettweis,
Yu Ge,
Hao Guo,
Rreze Halili,
Katsuyuki Haneda,
Abdur Rahman Mohamed Ismail
, et al. (18 additional authors not shown)
Abstract:
Integrated sensing and communication (ISAC) enables radio systems to simultaneously sense and communicate with their environment. This paper, developed within the Hexa-X-II project funded by the European Union, presents a comprehensive cross-layer vision for ISAC in 6G networks, integrating insights from physical-layer design, hardware architectures, AI-driven intelligence, and protocol-level innovations. We begin by revisiting the foundational principles of ISAC, highlighting synergies and trade-offs between sensing and communication across different integration levels. Enabling technologies, such as multiband operation, massive and distributed MIMO, non-terrestrial networks, reconfigurable intelligent surfaces, and machine learning, are analyzed in conjunction with hardware considerations including waveform design, synchronization, and full-duplex operation. To bridge implementation and system-level evaluation, we introduce a quantitative cross-layer framework linking design parameters to key performance and value indicators. By synthesizing perspectives from both academia and industry, this paper outlines how deeply integrated ISAC can transform 6G into a programmable and context-aware platform supporting applications from reliable wireless access to autonomous mobility and digital twinning.
Submitted 16 May, 2025;
originally announced May 2025.
-
Bayesian Deep End-to-End Learning for MIMO-OFDM System with Delay-Domain Sparse Precoder
Authors:
Nilesh Kumar Jha,
Huayan Guo,
Vincent K. N. Lau
Abstract:
This paper introduces a novel precoder design aimed at reducing pilot overhead for effective channel estimation in multiple-input multiple-output orthogonal frequency division multiplexing (MIMO-OFDM) applications utilizing high-order modulation. We propose an innovative demodulation reference signal scheme that achieves up to an 8x reduction in overhead by implementing a delay-domain sparsity constraint on the precoder. Furthermore, we present a deep neural network (DNN)-based end-to-end architecture that integrates a propagation channel estimation module, a precoder design module, and an effective channel estimation module. Additionally, we propose a Bayesian model-assisted training framework that incorporates domain knowledge, resulting in an interpretable datapath design. Simulation results demonstrate that our proposed solution significantly outperforms various baseline schemes while exhibiting substantially lower computational complexity.
Submitted 29 April, 2025;
originally announced April 2025.
-
NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images: Methods and Results
Authors:
Xin Li,
Yeying Jin,
Xin Jin,
Zongwei Wu,
Bingchen Li,
Yufei Wang,
Wenhan Yang,
Yu Li,
Zhibo Chen,
Bihan Wen,
Robby T. Tan,
Radu Timofte,
Qiyu Rong,
Hongyuan Jing,
Mengmeng Zhang,
Jinglong Li,
Xiangyu Lu,
Yi Ren,
Yuting Liu,
Meng Zhang,
Xiang Chen,
Qiyuan Guan,
Jiangxin Dong,
Jinshan Pan,
Conglin Gou
, et al. (112 additional authors not shown)
Abstract:
This paper reviews the NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images. This challenge received a wide range of impressive solutions, which were developed and evaluated using our collected real-world Raindrop Clarity dataset. Unlike existing deraining datasets, our Raindrop Clarity dataset is more diverse and challenging in degradation types and contents, which include day raindrop-focused, day background-focused, night raindrop-focused, and night background-focused degradations. This dataset is divided into three subsets for competition: 14,139 images for training, 240 images for validation, and 731 images for testing. The primary objective of this challenge is to establish a new and powerful benchmark for the task of removing raindrops under varying lighting and focus conditions. A total of 361 participants took part in the competition, with 32 teams submitting valid solutions and fact sheets for the final testing phase. These submissions achieved state-of-the-art (SOTA) performance on the Raindrop Clarity dataset. The project can be found at https://lixinustc.github.io/CVPR-NTIRE2025-RainDrop-Competition.github.io/.
Submitted 19 April, 2025; v1 submitted 17 April, 2025;
originally announced April 2025.
-
The Tenth NTIRE 2025 Efficient Super-Resolution Challenge Report
Authors:
Bin Ren,
Hang Guo,
Lei Sun,
Zongwei Wu,
Radu Timofte,
Yawei Li,
Yao Zhang,
Xinning Chai,
Zhengxue Cheng,
Yingsheng Qin,
Yucai Yang,
Li Song,
Hongyuan Yu,
Pufan Xu,
Cheng Wan,
Zhijuan Huang,
Peng Guo,
Shuyuan Cui,
Chenjun Li,
Xuehai Hu,
Pan Pan,
Xin Zhang,
Heng Zhang,
Qing Luo,
Linyan Jiang
, et al. (122 additional authors not shown)
Abstract:
This paper presents a comprehensive review of the NTIRE 2025 Challenge on Single-Image Efficient Super-Resolution (ESR). The challenge aimed to advance the development of deep models that optimize key computational metrics, i.e., runtime, parameters, and FLOPs, while achieving a PSNR of at least 26.90 dB on the DIV2K_LSDIR_valid dataset and 26.99 dB on the DIV2K_LSDIR_test dataset. A robust participation saw 244 registered entrants, with 43 teams submitting valid entries. This report meticulously analyzes these methods and results, emphasizing groundbreaking advancements in state-of-the-art single-image ESR techniques. The analysis highlights innovative approaches and establishes benchmarks for future research in the field.
Submitted 14 April, 2025;
originally announced April 2025.
-
PIDSR: Complementary Polarized Image Demosaicing and Super-Resolution
Authors:
Shuangfan Zhou,
Chu Zhou,
Youwei Lyu,
Heng Guo,
Zhanyu Ma,
Boxin Shi,
Imari Sato
Abstract:
Polarization cameras can capture multiple polarized images with different polarizer angles in a single shot, bringing convenience to polarization-based downstream tasks. However, their direct outputs are color-polarization filter array (CPFA) raw images, requiring demosaicing to reconstruct full-resolution, full-color polarized images; unfortunately, this necessary step introduces artifacts that make polarization-related parameters such as the degree of polarization (DoP) and angle of polarization (AoP) prone to error. Besides, limited by the hardware design, the resolution of a polarization camera is often much lower than that of a conventional RGB camera. Existing polarized image demosaicing (PID) methods are limited in that they cannot enhance resolution, while polarized image super-resolution (PISR) methods, though designed to obtain high-resolution (HR) polarized images from the demosaicing results, tend to retain or even amplify errors in the DoP and AoP introduced by demosaicing artifacts. In this paper, we propose PIDSR, a joint framework that performs complementary Polarized Image Demosaicing and Super-Resolution, showing the ability to robustly obtain high-quality HR polarized images with more accurate DoP and AoP from a CPFA raw image in a direct manner. Experiments show our PIDSR not only achieves state-of-the-art performance on both synthetic and real data, but also facilitates downstream tasks.
Submitted 22 April, 2025; v1 submitted 10 April, 2025;
originally announced April 2025.
-
Panoptic: True Joint mmWave Communication and Sensing with Compressive Sidelobe Forming
Authors:
Heyu Guo,
Ruiyi Shen,
Florian Kosterhon,
Yasaman Ghasempour
Abstract:
The integration of communication and sensing functions within mmWave systems has gained attention due to the potential for enhanced passive sensing and improved communication reliability. State-of-the-art techniques separate these two functions in frequency, use of hardware, or time, i.e., sending known preambles for channel sensing or unknown symbols for communications. In this paper, we introduce Panoptic, a novel system architecture for integrated communication and sensing sharing the same hardware, frequency, and time resources. Panoptic jointly detects unknown symbols and channel components from data-modulated signals. The core idea is a new beam manipulation technique, which we call compressive sidelobe forming, that maintains a directional mainlobe toward the intended communication nodes while acquiring unique spatial information through pseudorandom sidelobe perturbations. We implemented Panoptic on 60 GHz mmWave radios and conducted extensive over-the-air experiments. Our results show that Panoptic achieves reflector angular localization error of less than 2° while at the same time supporting mmWave data communication with a negligible BER penalty when compared with conventional communication-only mmWave systems.
Submitted 8 April, 2025;
originally announced April 2025.
-
RapidPD: Rapid Human and Pet Presence Detection System for Smart Vehicles via Wi-Fi
Authors:
Hancheng Guo,
Zhen Chen,
Mo Huang,
Xiu Yin Zhang
Abstract:
Heatstroke and life-threatening incidents resulting from children and animals being left in vehicles pose a critical global safety issue. Current presence detection solutions often require specialized hardware or suffer from detection delays that do not meet safety standards. To tackle this issue, by re-modeling channel state information (CSI) with theoretical analysis of path propagation, this study introduces RapidPD, an innovative system utilizing CSI in the subcarrier dimension to detect the presence of humans and pets in vehicles. The system models the impact of motion on CSI and introduces motion statistics in the subcarrier dimension using a multi-layer autocorrelation method to quantify environmental changes. RapidPD is implemented using commercial Wi-Fi chipsets and tested in real vehicle environments with data collected from 10 living organisms. Experimental results demonstrate that RapidPD achieves a detection accuracy of 99.05% and a true positive rate of 99.32% within a 1-second time window at a low sampling rate of 20 Hz. These findings represent a significant advancement in vehicle safety and provide a foundation for the widespread adoption of presence detection systems.
Submitted 1 April, 2025;
originally announced April 2025.
-
FireRedTTS-1S: An Upgraded Streamable Foundation Text-to-Speech System
Authors:
Hao-Han Guo,
Yao Hu,
Fei-Yu Shen,
Xu Tang,
Yi-Chen Wu,
Feng-Long Xie,
Kun Xie
Abstract:
In this work, we upgrade FireRedTTS to a new version, FireRedTTS-1S, a high-quality streaming foundation text-to-speech system. FireRedTTS-1S achieves streaming speech generation via two steps: text-to-semantic decoding and semantic-to-acoustic decoding. In text-to-semantic decoding, a semantic-aware speech tokenizer converts the speech signal into semantic tokens, which can be synthesized from the text via a language model in an auto-regressive manner. Meanwhile, the semantic-to-acoustic decoding module simultaneously translates generated semantic tokens into the speech signal in a streaming way. We implement two approaches to achieve this module: 1) a chunk-wise streamable flow-matching approach, and 2) a multi-stream language model-based approach. They both present high-quality and streamable speech generation but differ in real-time factor (RTF) and latency. Specifically, flow-matching decoding can generate speech by chunks, presenting a lower RTF of 0.1 but a higher latency of 300ms. Instead, the multi-stream language model generates speech by frames in an autoregressive manner, presenting a higher RTF of 0.3 but a low latency of 150ms. In experiments on zero-shot voice cloning, the objective results validate FireRedTTS-1S as a high-quality foundation model with comparable intelligibility and speaker similarity over industrial baseline systems. Furthermore, the subjective score of FireRedTTS-1S highlights its impressive synthesis performance, achieving comparable quality to the ground-truth recordings. These results validate FireRedTTS-1S as a high-quality streaming foundation TTS system.
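The real-time factor quoted above is conventionally the ratio of synthesis time to the duration of the audio produced, so values below 1 mean faster-than-real-time generation. The sketch below shows that conventional definition; the abstract does not spell out how RTF was measured, so treat the formula, the function name, and the timings as assumptions for illustration.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Conventional RTF: time spent synthesizing divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


# Hypothetical timings: 1 s and 3 s of compute for 10 s of generated speech.
rtf_chunked = real_time_factor(1.0, 10.0)
rtf_framewise = real_time_factor(3.0, 10.0)
```

RTF measures throughput, while latency measures the wait before the first audio is available, which is why a decoder can have the lower RTF and the higher latency at the same time.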
Submitted 26 May, 2025; v1 submitted 26 March, 2025;
originally announced March 2025.
-
FedSCA: Federated Tuning with Similarity-guided Collaborative Aggregation for Heterogeneous Medical Image Segmentation
Authors:
Yumin Zhang,
Yan Gao,
Haoran Duan,
Hanqing Guo,
Tejal Shah,
Rajiv Ranjan,
Bo Wei
Abstract:
Transformer-based foundation models (FMs) have recently demonstrated remarkable performance in medical image segmentation. However, scaling these models is challenging due to the limited size of medical image datasets within isolated hospitals, where data centralization is restricted due to privacy concerns. These constraints, combined with the data-intensive nature of FMs, hinder their broader application. Integrating federated learning (FL) with foundation models (FLFM) fine-tuning offers a potential solution to these challenges by enabling collaborative model training without data sharing, thus allowing FMs to take advantage of a diverse pool of sensitive medical image data across hospitals/clients. However, non-independent and identically distributed (non-IID) data among clients, paired with computational and communication constraints in federated environments, presents an additional challenge that limits further performance improvements and remains inadequately addressed in existing studies. In this work, we propose a novel FLFM fine-tuning framework, Federated tuning with Similarity-guided Collaborative Aggregation (FedSCA), encompassing all phases of the FL process. This includes (1) specially designed parameter-efficient fine-tuning (PEFT) for local client training to enhance computational efficiency; (2) partial low-level adapter transmission for communication efficiency; and (3) similarity-guided collaborative aggregation (SGCA) on the server side to address non-IID issues. Extensive experiments on three FL benchmarks for medical image segmentation demonstrate the effectiveness of our proposed FedSCA, establishing new SOTA performance.
Submitted 19 March, 2025;
originally announced March 2025.
-
A Continual Learning-driven Model for Accurate and Generalizable Segmentation of Clinically Comprehensive and Fine-grained Whole-body Anatomies in CT
Authors:
Dazhou Guo,
Zhanghexuan Ji,
Yanzhou Su,
Dandan Zheng,
Heng Guo,
Puyang Wang,
Ke Yan,
Yirui Wang,
Qinji Yu,
Zi Li,
Minfeng Xu,
Jianfeng Zhang,
Haoshen Li,
Jia Ge,
Tsung-Ying Ho,
Bing-Shen Huang,
Tashan Ai,
Kuaile Zhao,
Na Shen,
Qifeng Wang,
Yun Bian,
Tingyu Wu,
Peng Du,
Hua Zhang,
Feng-Ming Kong
, et al. (9 additional authors not shown)
Abstract:
Precision medicine in the quantitative management of chronic diseases and oncology would be greatly improved if the Computed Tomography (CT) scan of any patient could be segmented, parsed and analyzed in a precise and detailed way. However, there is no such fully annotated CT dataset with all anatomies delineated for training because of the exceptionally high manual cost, the need for specialized clinical expertise, and the time required to finish the task. To this end, we propose a novel continual learning-driven CT model that can segment complete anatomies presented using dozens of previously partially labeled datasets, dynamically expanding its capacity to segment new ones without compromising previously learned organ knowledge. Existing multi-dataset approaches are not able to dynamically segment new anatomies without catastrophic forgetting and would encounter optimization difficulty or infeasibility when segmenting hundreds of anatomies across the whole range of body regions. Our single unified CT segmentation model, CL-Net, can highly accurately segment a clinically comprehensive set of 235 fine-grained whole-body anatomies. Composed of a universal encoder and multiple optimized and pruned decoders, CL-Net is developed using 13,952 CT scans from 20 public and 16 private high-quality partially labeled CT datasets of various vendors, different contrast phases, and pathologies. Extensive evaluation demonstrates that CL-Net consistently outperforms the upper limit of an ensemble of 36 specialist nnUNets trained per dataset with the complexity of 5% model size and significantly surpasses the segmentation accuracy of recent leading Segment Anything-style medical image foundation models by large margins. Our continual learning-driven CL-Net model would lay a solid foundation to facilitate many downstream tasks of oncology and chronic diseases using the most widely adopted CT imaging.
Submitted 16 March, 2025;
originally announced March 2025.
-
Dual-domain Modulation Network for Lightweight Image Super-Resolution
Authors:
Wenjie Li,
Heng Guo,
Yuefeng Hou,
Guangwei Gao,
Zhanyu Ma
Abstract:
Lightweight image super-resolution (SR) aims to reconstruct high-resolution images from low-resolution images with limited computational costs. We find that existing frequency-based SR methods cannot balance the reconstruction of overall structures and high-frequency parts. Meanwhile, these methods are inefficient for handling frequency features and unsuitable for lightweight SR. In this paper, we show that introducing both wavelet and Fourier information allows our model to consider both high-frequency features and overall SR structure reconstruction while reducing costs. Specifically, we propose a dual-domain modulation network that utilizes a wavelet-domain modulation self-Transformer (WMT) plus Fourier supervision to modulate frequency features in addition to spatial domain modulation. Compared to existing frequency-based SR modules, our WMT is more suitable for frequency learning in lightweight SR. Experimental results show that our method achieves PSNR comparable to SRFormer and MambaIR with less than 50% and 60% of their FLOPs and inference speeds 15.4x and 5.4x faster, respectively, demonstrating the effectiveness of our method in both SR quality and lightweight design. Code will be released upon acceptance.
Submitted 13 March, 2025;
originally announced March 2025.
-
Model-Agnostic Uncertainty Quantification for Fast NFC Tag Identification using RF Fingerprinting
Authors:
Dickson Akuoko Sarpong,
Adam Kamrath,
Rohit Bhusal,
Hongzhi Guo
Abstract:
Near Field Communication (NFC) is widely used in security applications such as door access systems and ID cards. However, clone attacks can replicate digital information, enabling unauthorized access. RF fingerprinting offers a robust defense by extracting unique physical-layer features from NFC cards that cannot be cloned. While RF fingerprinting has been extensively applied to Internet of Things (IoT) device authentication, NFC tags present distinct characteristics that require specialized approaches. This paper focuses on RF fingerprinting for the ISO15693 NFC tag, which is a widely used international standard, by leveraging multi-channel, multi-rate data sampling to enhance accuracy. Deep learning and Random Forest models are employed to identify NFC tags, while uncertainty quantification, particularly Conformal Prediction, accelerates the identification process with high confidence and precision. A software-defined radio (SDR) testbed is developed to transmit customized commands and collect multi-channel multi-rate NFC signals. The multi-channel multi-rate NFC signals are progressively collected to ensure fast and accurate identification. Experimental results demonstrate that the proposed system achieves high accuracy by adaptively utilizing the optimal combination of NFC signals. The developed solution is model-agnostic and can be used with any machine learning-based NFC tag identification method.
Submitted 12 March, 2025;
originally announced March 2025.
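The abstract above pairs Conformal Prediction with progressively collected signals. The sketch below illustrates one way such a loop could work and is not the authors' implementation. It assumes split conformal prediction with the nonconformity score 1 - p(label), averages class probabilities over the signals seen so far, and stops once the prediction set holds a single tag. The tag names, probabilities, and calibration scores are hypothetical, and for a valid coverage guarantee the calibration scores would have to be computed with the same averaging procedure.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Split-conformal quantile of calibration nonconformity scores."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the finite-sample quantile
    if k > n:
        return float("inf")  # too few calibration points: keep every label
    return sorted(cal_scores)[k - 1]


def prediction_set(probs, threshold):
    """Labels whose nonconformity score 1 - p(label) is within the threshold."""
    return {label for label, p in probs.items() if 1.0 - p <= threshold}


def identify(signal_probs, threshold):
    """Consume per-signal class probabilities until one tag remains."""
    running = {}
    labels = set()
    used = 0
    for used, probs in enumerate(signal_probs, start=1):
        for label, p in probs.items():
            running[label] = running.get(label, 0.0) + p
        mean = {label: total / used for label, total in running.items()}
        labels = prediction_set(mean, threshold)
        if len(labels) == 1:
            break  # confident enough: stop collecting signals
    return labels, used


# Hypothetical calibration scores and per-signal classifier outputs.
cal_scores = [0.05, 0.10, 0.12, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.60]
threshold = conformal_threshold(cal_scores, alpha=0.1)
signals = [
    {"tag_A": 0.45, "tag_B": 0.45, "tag_C": 0.10},  # ambiguous first signal
    {"tag_A": 0.70, "tag_B": 0.20, "tag_C": 0.10},  # second signal resolves it
    {"tag_A": 0.80, "tag_B": 0.10, "tag_C": 0.10},  # never needed
]
labels, used = identify(signals, threshold)
```

In this toy run the first signal leaves two candidate tags, so a second signal is collected; the third is never read, which is the source of the speed-up.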
-
Reinforcement Learning Based Symbolic Regression for Load Modeling
Authors:
Ding Lin,
Han Guo,
Jianhui Wang,
Meng Yue,
Tianqiao Zhao
Abstract:
With the increasing penetration of renewable energy sources, growing demand variability, and evolving grid control strategies, accurate and efficient load modeling has become a critical yet challenging task. Traditional methods, such as fixed-form parametric models and data-driven approaches, often struggle to balance accuracy, computational efficiency, and interpretability. This paper introduces a novel symbolic regression algorithm based on the Actor-Critic reinforcement learning framework, specifically tailored for dynamic load modeling. The algorithm employs a trainable expression tree with controlled depth and a predefined set of operators to generate compact and interpretable mathematical expressions. The Actor network probabilistically selects operators for the symbolic expression, while the Critic evaluates the resulting expression tree through a loss function. To further enhance performance, a candidate pool mechanism is implemented to store high-performing expressions, which are subsequently fine-tuned using gradient descent. By focusing on simplicity and precision, the proposed method significantly reduces computational complexity while preserving interpretability. Experimental results validate its superior performance compared to existing benchmarks, which offers a robust and scalable solution for dynamic load modeling and system analysis in modern power systems.
Submitted 9 March, 2025;
originally announced March 2025.
-
Diffusion Model Based Probabilistic Day-ahead Load Forecasting
Authors:
Ding Lin,
Han Guo,
Jianhui Wang
Abstract:
Accurate probabilistic load forecasting is crucial for maintaining the safety and stability of power systems. However, the mainstream approach, multi-step prediction, suffers from cumulative errors and latency issues, which limit its effectiveness in probabilistic day-ahead load forecasting (PDALF). To overcome these challenges, we introduce DALNet, a novel denoising diffusion model designed to generate load curves rather than relying on direct prediction. By shifting the focus to curve generation, DALNet captures the complex distribution of actual load time-series data under specific conditions with greater fidelity. To further enhance DALNet, we propose the temporal multi-scale attention block (TMSAB), a mechanism designed to integrate both positional and temporal information for improved forecasting precision. Furthermore, we utilize kernel density estimation (KDE) to reconstruct the distribution of generated load curves and employ KL divergence to compare them with the actual data distribution. Experimental results demonstrate that DALNet excels in load forecasting accuracy and offers a novel perspective for other predictive tasks within power systems.
Submitted 9 March, 2025;
originally announced March 2025.
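The evaluation step named in the abstract above (KDE to reconstruct a distribution, then KL divergence against the actual data) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: it uses a one-dimensional Gaussian kernel with a hand-picked bandwidth, approximates the KL integral with a Riemann sum on a fixed grid, and the load values are invented for the example.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a density estimate of `samples` built from Gaussian kernels."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def kl_divergence(p, q, grid, eps=1e-12):
    """Approximate KL(p || q) by a Riemann sum over an evenly spaced grid."""
    step = grid[1] - grid[0]
    total = 0.0
    for x in grid:
        px, qx = p(x) + eps, q(x) + eps  # eps guards against log(0)
        total += px * math.log(px / qx) * step
    return total


# Hypothetical loads (MW) at one hour of the day: observed vs. generated.
actual = [50.0, 52.0, 51.0, 49.5, 50.5, 53.0]
generated = [50.5, 51.5, 50.0, 49.0, 52.5, 51.0]
grid = [40.0 + 0.05 * i for i in range(401)]  # 40 MW to 60 MW

p = gaussian_kde(actual, bandwidth=1.0)
q = gaussian_kde(generated, bandwidth=1.0)
kl_same = kl_divergence(p, p, grid)  # identical distributions
kl_diff = kl_divergence(p, q, grid)  # generated vs. actual
```

A smaller `kl_diff` would indicate that the generated curves follow the observed distribution more closely at that hour. In practice the bandwidth is usually chosen by a rule of thumb or cross-validation rather than fixed by hand.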
-
Vision-Based Cooperative MAV-Capturing-MAV
Authors:
Canlun Zheng,
Yize Mi,
Hanqing Guo,
Huaben Chen,
Shiyu Zhao
Abstract:
MAV-capturing-MAV (MCM) is one of the few effective methods for physically countering misused or malicious MAVs. This paper presents a vision-based cooperative MCM system, where multiple pursuer MAVs equipped with onboard vision systems detect, localize, and pursue a target MAV. To enhance robustness, a distributed state estimation and control framework enables the pursuer MAVs to autonomously coordinate their actions. Pursuer trajectories are optimized using Model Predictive Control (MPC) and executed via a low-level SO(3) controller, ensuring smooth and stable pursuit. Once the capture conditions are satisfied, the pursuer MAVs automatically deploy a flying net to intercept the target. These capture conditions are determined based on the predicted motion of the net. To enable real-time decision-making, we propose a lightweight computational method to approximate the net motion, avoiding the prohibitive cost of solving the full net dynamics. The effectiveness of the proposed system is validated through simulations and real-world experiments. In real-world tests, our approach successfully captures a moving target traveling at 4 meters per second with an acceleration of 1 meter per second squared, achieving a success rate of 64.7 percent.
Submitted 8 March, 2025;
originally announced March 2025.
-
PodAgent: A Comprehensive Framework for Podcast Generation
Authors:
Yujia Xiao,
Lei He,
Haohan Guo,
Fenglong Xie,
Tan Lee
Abstract:
Existing automatic audio generation methods struggle to generate podcast-like audio programs effectively. The key challenges lie in in-depth content generation and in appropriate, expressive voice production. This paper proposes PodAgent, a comprehensive framework for creating audio programs. PodAgent 1) generates informative topic-discussion content by designing a Host-Guest-Writer multi-agent collaboration system, 2) builds a voice pool for suitable voice-role matching and 3) utilizes an LLM-enhanced speech synthesis method to generate expressive conversational speech. Given the absence of standardized evaluation criteria for podcast-like audio generation, we developed comprehensive assessment guidelines to effectively evaluate the model's performance. Experimental results demonstrate PodAgent's effectiveness, significantly surpassing direct GPT-4 generation in topic-discussion dialogue content, achieving an 87.4% voice-matching accuracy, and producing more expressive speech through LLM-guided synthesis. Demo page: https://podcast-agent.github.io/demo/. Source code: https://github.com/yujxx/PodAgent.
Submitted 1 March, 2025;
originally announced March 2025.
-
Audio-FLAN: A Preliminary Release
Authors:
Liumeng Xue,
Ziya Zhou,
Jiahao Pan,
Zixuan Li,
Shuai Fan,
Yinghao Ma,
Sitong Cheng,
Dongchao Yang,
Haohan Guo,
Yujia Xiao,
Xinsheng Wang,
Zixuan Shen,
Chuanbo Zhu,
Xinshen Zhang,
Tianchi Liu,
Ruibin Yuan,
Zeyue Tian,
Haohe Liu,
Emmanouil Benetos,
Ge Zhang,
Yike Guo,
Wei Xue
Abstract:
Recent advancements in audio tokenization have significantly enhanced the integration of audio capabilities into large language models (LLMs). However, audio understanding and generation are often treated as distinct tasks, hindering the development of truly unified audio-language models. While instruction tuning has demonstrated remarkable success in improving generalization and zero-shot learning across text and vision, its application to audio remains largely unexplored. A major obstacle is the lack of comprehensive datasets that unify audio understanding and generation. To address this, we introduce Audio-FLAN, a large-scale instruction-tuning dataset covering 80 diverse tasks across speech, music, and sound domains, with over 100 million instances. Audio-FLAN lays the foundation for unified audio-language models that can seamlessly handle both understanding (e.g., transcription, comprehension) and generation (e.g., speech, music, sound) tasks across a wide range of audio domains in a zero-shot manner. The Audio-FLAN dataset is available on HuggingFace and GitHub and will be continuously updated.
Submitted 23 February, 2025;
originally announced February 2025.
-
Safe Reinforcement Learning-based Control for Hydrogen Diesel Dual-Fuel Engines
Authors:
Vasu Sharma,
Alexander Winkler,
Armin Norouzi,
Jakob Andert,
David Gordon,
Hongsheng Guo
Abstract:
The urgent energy transition requirements towards a sustainable future stretch across various industries and are a significant challenge facing humanity. Hydrogen promises a clean, carbon-free future, with the opportunity to integrate with existing solutions in the transportation sector. However, adding hydrogen to existing technologies such as diesel engines requires additional modeling effort. Reinforcement Learning (RL) enables interactive data-driven learning that eliminates the need for mathematical modeling. The algorithms, however, may not be real-time capable and need large amounts of data to work in practice. This paper presents a novel approach which uses offline model learning with RL to demonstrate safe control of a 4.5 L Hydrogen Diesel Dual-Fuel (H2DF) engine. The controllers are demonstrated to be constraint compliant and can leverage a novel state-augmentation approach for sample-efficient learning. The offline policy is subsequently experimentally validated on the real engine where the control algorithm is executed on a Raspberry Pi controller and requires 6 times less computation time compared to online Model Predictive Control (MPC) optimization.
Submitted 13 February, 2025;
originally announced February 2025.
-
ASVspoof 5: Design, Collection and Validation of Resources for Spoofing, Deepfake, and Adversarial Attack Detection Using Crowdsourced Speech
Authors:
Xin Wang,
Héctor Delgado,
Hemlata Tak,
Jee-weon Jung,
Hye-jin Shim,
Massimiliano Todisco,
Ivan Kukanov,
Xuechen Liu,
Md Sahidullah,
Tomi Kinnunen,
Nicholas Evans,
Kong Aik Lee,
Junichi Yamagishi,
Myeonghun Jeong,
Ge Zhu,
Yongyi Zang,
You Zhang,
Soumi Maiti,
Florian Lux,
Nicolas Müller,
Wangyou Zhang,
Chengzhe Sun,
Shuwei Hou,
Siwei Lyu,
Sébastien Le Maguer
, et al. (4 additional authors not shown)
Abstract:
ASVspoof 5 is the fifth edition in a series of challenges which promote the study of speech spoofing and deepfake attacks as well as the design of detection solutions. We introduce the ASVspoof 5 database which is generated in a crowdsourced fashion from data collected in diverse acoustic conditions (cf. studio-quality data for earlier ASVspoof databases) and from ~2,000 speakers (cf. ~100 earlier). The database contains attacks generated with 32 different algorithms, also crowdsourced, and optimised to varying degrees using new surrogate detection models. Among them are attacks generated with a mix of legacy and contemporary text-to-speech synthesis and voice conversion models, in addition to adversarial attacks which are incorporated for the first time. ASVspoof 5 protocols comprise seven speaker-disjoint partitions. They include two distinct partitions for the training of different sets of attack models, two more for the development and evaluation of surrogate detection models, and then three additional partitions which comprise the ASVspoof 5 training, development and evaluation sets. An auxiliary set of data collected from an additional 30k speakers can also be used to train speaker encoders for the implementation of attack algorithms. Also described herein is an experimental validation of the new ASVspoof 5 database using a set of automatic speaker verification and spoof/deepfake baseline detectors. With the exception of protocols and tools for the generation of spoofed/deepfake speech, the resources described in this paper, already used by participants of the ASVspoof 5 challenge in 2024, are now all freely available to the community.
Submitted 24 April, 2025; v1 submitted 12 February, 2025;
originally announced February 2025.
-
A Cooperative Bearing-Rate Approach for Observability-Enhanced Target Motion Estimation
Authors:
Canlun Zheng,
Hanqing Guo,
Shiyu Zhao
Abstract:
Vision-based target motion estimation is a fundamental problem in many robotic tasks. The existing methods have the limitation of low observability and, hence, face challenges in tracking highly maneuverable targets. Motivated by the aerial target pursuit task where a target may maneuver in 3D space, this paper studies how to further enhance observability by incorporating the bearing rate information that has not been well explored in the literature. The main contribution of this paper is to propose a new cooperative estimator called STT-R (Spatial-Temporal Triangulation with bearing Rate), which is designed under the framework of distributed recursive least squares. This theoretical result is further verified by numerical simulation and real-world experiments. It is shown that the proposed STT-R algorithm can effectively generate more accurate estimations and effectively reduce the lag in velocity estimation, enabling tracking of more maneuverable targets.
Submitted 11 February, 2025;
originally announced February 2025.
-
Spatial-Angular Representation Learning for High-Fidelity Continuous Super-Resolution in Diffusion MRI
Authors:
Ruoyou Wu,
Jian Cheng,
Cheng Li,
Juan Zou,
Wenxin Fan,
Hua Guo,
Yong Liang,
Shanshan Wang
Abstract:
Diffusion magnetic resonance imaging (dMRI) often suffers from low spatial and angular resolution due to inherent limitations in imaging hardware and system noise, adversely affecting the accurate estimation of microstructural parameters with fine anatomical details. Deep learning-based super-resolution techniques have shown promise in enhancing dMRI resolution without increasing acquisition time. However, most existing methods are confined to either spatial or angular super-resolution, limiting their effectiveness in capturing detailed microstructural features. Furthermore, traditional pixel-wise loss functions struggle to recover intricate image details essential for high-resolution reconstruction. To address these challenges, we propose SARL-dMRI, a novel Spatial-Angular Representation Learning framework for high-fidelity, continuous super-resolution in dMRI. SARL-dMRI explores implicit neural representations and spherical harmonics to model continuous spatial and angular representations, simultaneously enhancing both spatial and angular resolution while improving microstructural parameter estimation accuracy. To further preserve image fidelity, a data-fidelity module and wavelet-based frequency loss are introduced, ensuring the super-resolved images remain consistent with the original input and retain fine details. Extensive experiments demonstrate that, compared to five other state-of-the-art methods, our method significantly enhances dMRI data resolution, improves the accuracy of microstructural parameter estimation, and provides better generalization capabilities. It maintains stable performance even under a 45× downsampling factor.
Submitted 27 January, 2025;
originally announced January 2025.
-
Baichuan-Omni-1.5 Technical Report
Authors:
Yadong Li,
Jun Liu,
Tao Zhang,
Tao Zhang,
Song Chen,
Tianpeng Li,
Zehuan Li,
Lijun Liu,
Lingfeng Ming,
Guosheng Dong,
Da Pan,
Chong Li,
Yuanbo Fang,
Dongdong Kuang,
Mingrui Wang,
Chenglin Zhu,
Youwei Zhang,
Hongyu Guo,
Fengyu Zhang,
Yuran Wang,
Bowen Ding,
Wei Song,
Xu Li,
Yuqi Huo,
Zheng Liang
, et al. (68 additional authors not shown)
Abstract:
We introduce Baichuan-Omni-1.5, an omni-modal model that not only has omni-modal understanding capabilities but also provides end-to-end audio generation capabilities. To achieve fluent and high-quality interaction across modalities without compromising the capabilities of any modality, we prioritized optimizing three key aspects. First, we establish a comprehensive data cleaning and synthesis pipeline for multimodal data, obtaining about 500B high-quality data (text, audio, and vision). Second, an audio-tokenizer (Baichuan-Audio-Tokenizer) has been designed to capture both semantic and acoustic information from audio, enabling seamless integration and enhanced compatibility with MLLM. Lastly, we designed a multi-stage training strategy that progressively integrates multimodal alignment and multitask fine-tuning, ensuring effective synergy across all modalities. Baichuan-Omni-1.5 leads contemporary models (including GPT4o-mini and MiniCPM-o 2.6) in terms of comprehensive omni-modal capabilities. Notably, it achieves results comparable to leading models such as Qwen2-VL-72B across various multimodal medical benchmarks.
Submitted 25 January, 2025;
originally announced January 2025.
-
Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey
Authors:
Liang Chen,
Zekun Wang,
Shuhuai Ren,
Lei Li,
Haozhe Zhao,
Yunshui Li,
Zefan Cai,
Hongcheng Guo,
Lei Zhang,
Yizhe Xiong,
Yichi Zhang,
Ruoyu Wu,
Qingxiu Dong,
Ge Zhang,
Jian Yang,
Lingwei Meng,
Shujie Hu,
Yulong Chen,
Junyang Lin,
Shuai Bai,
Andreas Vlachos,
Xu Tan,
Minjia Zhang,
Wen Xiao,
Aaron Yee
, et al. (2 additional authors not shown)
Abstract:
Building on the foundations of language modeling in natural language processing, Next Token Prediction (NTP) has evolved into a versatile training objective for machine learning tasks across various modalities, achieving considerable success. As Large Language Models (LLMs) have advanced to unify understanding and generation tasks within the textual modality, recent research has shown that tasks from different modalities can also be effectively encapsulated within the NTP framework, transforming the multimodal information into tokens and predicting the next one given the context. This survey introduces a comprehensive taxonomy that unifies both understanding and generation within multimodal learning through the lens of NTP. The proposed taxonomy covers five key aspects: Multimodal tokenization, MMNTP model architectures, unified task representation, datasets & evaluation, and open challenges. This new taxonomy aims to aid researchers in their exploration of multimodal intelligence. An associated GitHub repository collecting the latest papers and repos is available at https://github.com/LMM101/Awesome-Multimodal-Next-Token-Prediction
Submitted 29 December, 2024; v1 submitted 16 December, 2024;
originally announced December 2024.
-
Mel-Refine: A Plug-and-Play Approach to Refine Mel-Spectrogram in Audio Generation
Authors:
Hongming Guo,
Ruibo Fu,
Yizhong Geng,
Shuai Liu,
Shuchen Shi,
Tao Wang,
Chunyu Qiang,
Chenxing Li,
Ya Li,
Zhengqi Wen,
Yukun Liu,
Xuefei Liu
Abstract:
Text-to-audio (TTA) models are capable of generating diverse audio from textual prompts. However, most mainstream TTA models, which predominantly rely on Mel-spectrograms, still face challenges in producing audio with rich content. The intricate details and texture required in Mel-spectrograms for such audio often surpass the models' capacity, leading to outputs that are blurred or lack coherence. In this paper, we begin by investigating the critical role of U-Net in Mel-spectrogram generation. Our analysis shows that in the U-Net structure, high-frequency components in skip-connections and the backbone influence texture and detail, while low-frequency components in the backbone are critical for the diffusion denoising process. We further propose "Mel-Refine", a plug-and-play approach that enhances Mel-spectrogram texture and detail by adjusting different component weights during inference. Our method requires no additional training or fine-tuning and is fully compatible with any diffusion-based TTA architecture. Experimental results show that our approach boosts performance metrics of the latest TTA model Tango2 by 25%, demonstrating its effectiveness.
Submitted 11 December, 2024;
originally announced December 2024.
-
Diff5T: Benchmarking Human Brain Diffusion MRI with an Extensive 5.0 Tesla K-Space and Spatial Dataset
Authors:
Shanshan Wang,
Shoujun Yu,
Jian Cheng,
Sen Jia,
Changjun Tie,
Jiayu Zhu,
Haohao Peng,
Yijing Dong,
Jianzhong He,
Fan Zhang,
Yaowen Xing,
Xiuqin Jia,
Qi Yang,
Qiyuan Tian,
Hua Guo,
Guobin Li,
Hairong Zheng
Abstract:
Diffusion magnetic resonance imaging (dMRI) provides critical insights into the microstructural and connectional organization of the human brain. However, the availability of high-field, open-access datasets that include raw k-space data for advanced research remains limited. To address this gap, we introduce Diff5T, a first comprehensive 5.0 Tesla diffusion MRI dataset focusing on the human brain. This dataset includes raw k-space data and reconstructed diffusion images, acquired using a variety of imaging protocols. Diff5T is designed to support the development and benchmarking of innovative methods in artifact correction, image reconstruction, image preprocessing, diffusion modelling and tractography. The dataset features a wide range of diffusion parameters, including multiple b-values and gradient directions, allowing extensive research applications in studying human brain microstructure and connectivity. With its emphasis on open accessibility and detailed benchmarks, Diff5T serves as a valuable resource for advancing human brain mapping research using diffusion MRI, fostering reproducibility, and enabling collaboration across the neuroscience and medical imaging communities.
Submitted 9 December, 2024;
originally announced December 2024.
-
Near-Field Measurement System for the Upper Mid-Band
Authors:
Ali Rasteh,
Raghavendra Palayam Hari,
Hao Guo,
Marco Mezzavilla,
Sundeep Rangan
Abstract:
The upper mid-band (or FR3, spanning 6-24 GHz) is a crucial frequency range for next-generation mobile networks, offering a favorable balance between coverage and spectrum efficiency. From another perspective, systems operating in the near field in both indoor and outdoor environments can support line-of-sight multiple input multiple output (MIMO) communications and benefit from the FR3 bands. In this paper, a novel method is proposed to measure the near-field parameters leveraging a recently developed reflection model where the near-field paths can be described by their image points. We show that these image points can be accurately estimated via triangulation from multiple measurements with a small number of antennas in each measurement, thus affording a low-cost procedure for near-field multi-path parameter extraction. A preliminary experimental apparatus is presented comprising 2 transmit and 2 receive antennas mounted on a linear track to measure the 2x2 MIMO channel at various displacements. The system uses a recently developed wideband radio frequency (RF) transceiver board with fast frequency switching, an FPGA for fast baseband processing, and a new parameter extraction method to recover paths and spherical characteristics from the multiple 2x2 measurements.
Submitted 3 December, 2024;
originally announced December 2024.
-
On the Surprising Effectiveness of Spectrum Clipping in Learning Stable Linear Dynamics
Authors:
Hanyao Guo,
Yunhai Han,
Harish Ravichandar
Abstract:
When learning stable linear dynamical systems from data, three important properties are desirable: i) predictive accuracy, ii) provable stability, and iii) computational efficiency. Unconstrained minimization of reconstruction errors leads to high accuracy and efficiency but cannot guarantee stability. Existing methods to enforce stability often preserve accuracy, but do so only at the cost of increased computation. In this work, we investigate whether a straightforward approach can simultaneously offer all three desiderata of learning stable linear systems. Specifically, we consider a post-hoc approach that manipulates the spectrum of the learned system matrix that was computed using unconstrained least squares. We call this approach spectrum clipping (SC) as it involves eigendecomposition and subsequent reconstruction of the system matrix after clipping to one any eigenvalues that are larger than one (without altering the eigenvectors). Through comprehensive experiments involving two different applications and publicly available benchmark datasets, we show that this simple technique can efficiently learn highly accurate linear systems that are provably stable. Notably, we find that SC can match or outperform strong baselines while being orders of magnitude faster. We also show that SC can be readily combined with Koopman operators to learn stable nonlinear dynamics, such as those underlying complex dexterous manipulation skills involving multi-fingered robotic hands. Finally, we find that SC can learn stable robot policies even when the training data includes unsuccessful or truncated demonstrations. Our code and dataset can be found at https://github.com/GT-STAR-Lab/spec_clip.
Submitted 17 May, 2025; v1 submitted 2 December, 2024;
originally announced December 2024.
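The spectrum clipping idea summarized in the abstract above can be illustrated with a short NumPy sketch. This is not the authors' implementation (their code is at the repository linked in the abstract); it is a minimal reading of the description under stated assumptions: the system is discrete-time, the learned matrix is diagonalizable, and "larger than one" is interpreted as eigenvalue magnitude greater than one, so each offending eigenvalue is rescaled onto the unit circle while keeping its phase. The function names `fit_least_squares` and `spectrum_clip` are hypothetical.

```python
import numpy as np


def fit_least_squares(X, Y):
    """Unconstrained least-squares estimate of A in Y ≈ A X.

    X and Y are (state_dim, num_samples) arrays holding consecutive states.
    """
    return Y @ np.linalg.pinv(X)


def spectrum_clip(A, max_magnitude=1.0):
    """Rescale eigenvalues whose magnitude exceeds max_magnitude.

    Assumption: A is diagonalizable. The eigenvectors are left unchanged.
    """
    eigvals, eigvecs = np.linalg.eig(A)
    magnitudes = np.abs(eigvals)
    # The scale factor is exactly 1 for eigenvalues already inside the disc.
    scale = max_magnitude / np.maximum(magnitudes, max_magnitude)
    clipped = eigvals * scale
    A_clipped = eigvecs @ np.diag(clipped) @ np.linalg.inv(eigvecs)
    # For a real A, complex eigenvalues come in conjugate pairs, so any
    # imaginary part of the reconstruction is numerical round-off.
    return A_clipped.real if np.isrealobj(A) else A_clipped


# Illustrative matrix with one eigenvalue (1.2) outside the unit circle.
A = np.array([[1.2, 0.1],
              [0.0, 0.5]])
A_sc = spectrum_clip(A)
print(np.abs(np.linalg.eigvals(A_sc)))
```

Because the clipping is a single eigendecomposition applied after the least-squares fit, its cost does not depend on the number of training samples, which is consistent with the efficiency argument made in the abstract.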
-
Computation-power Coupled Modeling for IDCs and Collaborative Optimization in ADNs
Authors:
Chuyi Li,
Kedi Zheng,
Hongye Guo,
Chongqing Kang,
Qixin Chen
Abstract:
The batch and online workloads of Internet data centers (IDCs) offer temporal and spatial scheduling flexibility. Given that power generation costs vary over time and location, harnessing the flexibility of IDCs' energy consumption through workload regulation can optimize the power flow within the system. This paper focuses on geographically distributed IDCs managed by an Internet service company (ISC), which are aggregated as a controllable load. The load flexibility resulting from spatial load regulation of online workload is taken into account. A two-step workload scheduling mechanism is adopted, and a computation-power coupling model of the ISC is established to facilitate collaborative optimization in active distribution networks (ADNs). To address the model-solving problem based on the assumption of scheduling homogeneity, a model reconstruction method is proposed. An efficient iterative algorithm is designed to solve the reconstructed model. Furthermore, the Nash bargaining solution is employed to coordinate the different optimization objectives of the ISC and power system operators, thereby avoiding subjective arbitrariness. Experimental cases based on a 33-node distribution system are designed to verify the effectiveness of the model and algorithm in optimizing the ISC's energy consumption and power flow within the system.
Submitted 25 November, 2024;
originally announced November 2024.
-
MambaIRv2: Attentive State Space Restoration
Authors:
Hang Guo,
Yong Guo,
Yaohua Zha,
Yulun Zhang,
Wenbo Li,
Tao Dai,
Shu-Tao Xia,
Yawei Li
Abstract:
Mamba-based image restoration backbones have recently demonstrated significant potential in balancing global reception and computational efficiency. However, the inherent causal modeling limitation of Mamba, where each token depends solely on its predecessors in the scanned sequence, restricts the full utilization of pixels across the image and thus presents new challenges in image restoration. In this work, we propose MambaIRv2, which equips Mamba with a non-causal modeling ability similar to that of ViTs to reach the attentive state space restoration model. Specifically, the proposed attentive state-space equation allows attending beyond the scanned sequence and facilitates image unfolding with just one single scan. Moreover, we further introduce a semantic-guided neighboring mechanism to encourage interaction between distant but similar pixels. Extensive experiments show that MambaIRv2 outperforms SRFormer by 0.35dB PSNR for lightweight SR, even with 9.3% fewer parameters, and surpasses HAT on classic SR by up to 0.29dB. Code is available at https://github.com/csguoh/MambaIR.
Submitted 10 March, 2025; v1 submitted 22 November, 2024;
originally announced November 2024.
-
A Data-Driven Pool Strategy for Price-Makers Under Imperfect Information
Authors:
Kedi Zheng,
Hongye Guo,
Qixin Chen
Abstract:
This paper studies the pool strategy for price-makers under imperfect information. In this setting, market participants cannot obtain essential transmission parameters of the power system. Thus, price-makers should estimate the market results with respect to their offer curves using available historical information. The linear programming model of economic dispatch is analyzed with the theory of rim multi-parametric linear programming (rim-MPLP). The characteristics of system patterns (combinations of status flags for generating units and transmission lines) are revealed. A multi-class classification model based on a support vector machine (SVM) is trained to map the offer curves to system patterns, and is then integrated into the decision framework of the price-maker. The performance of the proposed method is validated on the IEEE 30-bus system, the Illinois synthetic 200-bus system, and the South Carolina synthetic 500-bus system.
Submitted 21 November, 2024;
originally announced November 2024.
-
An Experimental Multi-Band Channel Characterization in the Upper Mid-Band
Authors:
Roberto Bomfin,
Ahmad Bazzi,
Hao Guo,
Hyeongtaek Lee,
Marco Mezzavilla,
Sundeep Rangan,
Junil Choi,
Marwa Chafii
Abstract:
This paper provides a multi-band channel measurement analysis in frequency range 3 (FR3). The study focuses on the lower FR3 frequencies 6.5 GHz and 8.75 GHz with a setup tailored to the context of integrated sensing and communication (ISAC), where the data are collected with and without the presence of a target. A method based on multiple signal classification (MUSIC) is used to refine the delays of the channel impulse response estimates. The results reveal that the channel at the lower frequency, 6.5 GHz, has additional distinguishable multipath components in the presence of the target, while the one associated with the higher frequency, 8.75 GHz, has more blockage. The set of results reported in this paper serves as a benchmark for future multi-band studies in the FR3 spectrum.
Submitted 19 November, 2024;
originally announced November 2024.
-
Retinal Vessel Segmentation via Neuron Programming
Authors:
Tingting Wu,
Ruyi Min,
Peixuan Song,
Hengtao Guo,
Tieyong Zeng,
Feng-Lei Fan
Abstract:
The accurate segmentation of retinal blood vessels plays a crucial role in the early diagnosis and treatment of various ophthalmic diseases. Designing a network model for this task requires meticulous tuning and extensive experimentation to handle the tiny and intertwined morphology of retinal blood vessels. To tackle this challenge, Neural Architecture Search (NAS) methods have been developed to fully explore the space of potential network architectures and find the most powerful one. Inspired by neuronal diversity, which is the biological foundation of all kinds of intelligent behaviors in our brain, this paper introduces a novel and foundational approach to neural network design, termed "neuron programming", which automatically searches for the neuronal types to use in a network in order to enhance its representation ability at the neuronal level. This is complementary to the architecture-level enhancement done by NAS. Additionally, to mitigate the time and computational intensity of neuron programming, we develop a hypernetwork that leverages the search-derived architectural information to predict optimal neuronal configurations. Comprehensive experiments validate that neuron programming can achieve competitive performance in retinal blood vessel segmentation, demonstrating the strong potential of neuronal diversity in medical image analysis.
Submitted 17 November, 2024;
originally announced November 2024.
-
Debatts: Zero-Shot Debating Text-to-Speech Synthesis
Authors:
Yiqiao Huang,
Yuancheng Wang,
Jiaqi Li,
Haotian Guo,
Haorui He,
Shunsi Zhang,
Zhizheng Wu
Abstract:
In debating, rebuttal is one of the most critical stages, where a speaker addresses the arguments presented by the opposing side. During this process, the speaker synthesizes their own persuasive articulation given the context from the opposing side. This work proposes a novel zero-shot text-to-speech synthesis system for rebuttal, namely Debatts. Debatts takes two speech prompts, one from the opposing side (i.e., the opponent) and one from the speaker. The prompt from the opponent is supposed to provide debating-style prosody, and the prompt from the speaker provides identity information. In particular, we pretrain the Debatts system on an in-the-wild dataset and integrate an additional reference encoder that takes the debating prompt for style. In addition, we create a debating dataset to develop Debatts. In this setting, Debatts can generate debating-style speech in rebuttal for any voice. Experimental results confirm the effectiveness of the proposed system in comparison with classic zero-shot TTS systems.
Submitted 4 December, 2024; v1 submitted 10 November, 2024;
originally announced November 2024.
-
Site-Specific Outdoor Propagation Assessment and Ray-Tracing Analysis for Wireless Digital Twins
Authors:
Morteza Ghaderi Aram,
Hao Guo,
Mingsheng Yin,
Tommy Svensson
Abstract:
Digital twinning is becoming increasingly vital in the design and real-time control of future wireless networks by providing precise, cost-effective simulations, predictive insights, and real-time data integration. This paper explores the application of digital twinning in optimizing wireless communication systems within urban environments, where building arrangements can critically impact network performance. We develop a digital twin platform to simulate and analyze how factors such as building positioning, base station placement, and antenna design influence wireless propagation. The ray-tracing software package of MATLAB is compared with Remcom Wireless InSite. Using a realistic radiation pattern of a base transceiver station (BTS) antenna, ray-tracing simulations for signal propagation and interactions in urban landscapes are then extensively examined. By analyzing radio heat maps alongside antenna patterns, we gain valuable insights into optimizing wireless deployment strategies. This study highlights the potential of digital twinning as a critical tool for urban planners and network engineers.
Submitted 18 October, 2024;
originally announced October 2024.
-
Reinforcement Learning Based Bidding Framework with High-dimensional Bids in Power Markets
Authors:
Jinyu Liu,
Hongye Guo,
Yun Li,
Qinghu Tang,
Fuquan Huang,
Tunan Chen,
Haiwang Zhong,
Qixin Chen
Abstract:
Over the past decade, bidding in power markets has attracted widespread attention. Reinforcement Learning (RL) has been widely used for power market bidding as a powerful AI tool to make decisions under real-world uncertainties. However, current RL methods mostly employ low-dimensional bids, which significantly diverge from the N price-power pairs commonly used in current power markets. The N-pair bidding format, denoted as High Dimensional Bids (HDBs), has not been fully integrated into existing RL-based bidding methods. The loss of flexibility in current RL bidding methods could greatly limit bidding profits and make it difficult to tackle the rising uncertainties brought by renewable energy generation. In this paper, we propose a framework to fully utilize HDBs for RL-based bidding methods. First, we employ a special type of neural network called Neural Network Supply Functions (NNSFs) to generate HDBs in the form of N price-power pairs. Second, we embed the NNSF into a Markov Decision Process (MDP) to make it compatible with most existing RL methods. Finally, experiments on Energy Storage Systems (ESSs) in the PJM Real-Time (RT) power market show that the proposed bidding method with HDBs can significantly improve bidding flexibility, thereby improving the profit of state-of-the-art RL bidding methods.
Submitted 14 October, 2024;
originally announced October 2024.
-
Speaking from Coarse to Fine: Improving Neural Codec Language Model via Multi-Scale Speech Coding and Generation
Authors:
Haohan Guo,
Fenglong Xie,
Dongchao Yang,
Xixin Wu,
Helen Meng
Abstract:
The neural codec language model (CLM) has demonstrated remarkable performance in text-to-speech (TTS) synthesis. However, troubled by "recency bias", CLM lacks sufficient attention to coarse-grained information at a higher temporal scale, often producing unnatural or even unintelligible speech. This work proposes CoFi-Speech, a coarse-to-fine CLM-TTS approach, employing multi-scale speech coding and generation to address this issue. We train a multi-scale neural codec, CoFi-Codec, to encode speech into a multi-scale discrete representation, comprising multiple token sequences with different time resolutions. Then, we propose CoFi-LM that can generate this representation in two modes: the single-LM-based chain-of-scale generation and the multiple-LM-based stack-of-scale generation. In experiments, CoFi-Speech significantly outperforms single-scale baseline systems on naturalness and speaker similarity in zero-shot TTS. The analysis of multi-scale coding demonstrates the effectiveness of CoFi-Codec in learning multi-scale discrete speech representations while keeping high-quality speech reconstruction. The coarse-to-fine multi-scale generation, especially for the stack-of-scale approach, is also validated as a crucial approach in pursuing a high-quality neural codec language model for TTS.
Submitted 17 September, 2024;
originally announced September 2024.
-
Speaker Contrastive Learning for Source Speaker Tracing
Authors:
Qing Wang,
Hongmei Guo,
Jian Kang,
Mengjie Du,
Jie Li,
Xiao-Lei Zhang,
Lei Xie
Abstract:
As a form of biometric authentication technology, the security of speaker verification (SV) systems is of utmost importance. However, SV systems are inherently vulnerable to various types of attacks that can compromise their accuracy and reliability. One such attack is voice conversion, which modifies a person's speech to sound like another person by altering various vocal characteristics. This poses a significant threat to SV systems. To address this challenge, the Source Speaker Tracing Challenge (SSTC) in IEEE SLT2024 aims to identify the source speaker information in manipulated speech signals. Specifically, SSTC focuses on source speaker verification against voice conversion to determine whether two converted speech samples originate from the same source speaker. In this study, we propose a speaker contrastive learning-based approach for source speaker tracing to learn the latent source speaker information in converted speech. To learn a more source-speaker-related representation, we employ speaker contrastive loss during the training of the embedding extractor. This speaker contrastive loss helps identify the true source speaker embedding among several distractor speaker embeddings, enabling the embedding extractor to learn the source speaker information potentially present in the converted speech. Experiments demonstrate that our proposed speaker contrastive learning system achieves the lowest EER of 16.788% on the challenge test set, securing first place in the challenge.
Submitted 16 September, 2024;
originally announced September 2024.
-
FireRedTTS: A Foundation Text-To-Speech Framework for Industry-Level Generative Speech Applications
Authors:
Hao-Han Guo,
Yao Hu,
Kun Liu,
Fei-Yu Shen,
Xu Tang,
Yi-Chen Wu,
Feng-Long Xie,
Kun Xie,
Kai-Tuo Xu
Abstract:
This work proposes FireRedTTS, a foundation text-to-speech framework, to meet the growing demands for personalized and diverse generative speech applications. The framework comprises three parts: data processing, foundation system, and downstream applications. First, we comprehensively present our data processing pipeline, which transforms massive raw audio into a large-scale high-quality TTS dataset with rich annotations and a wide coverage of content, speaking style, and timbre. Then, we propose a language-model-based foundation TTS system. The speech signal is compressed into discrete semantic tokens via a semantic-aware speech tokenizer, and these tokens can be generated by a language model from the prompt text and audio. Next, a two-stage waveform generator is proposed to decode them into a high-fidelity waveform. We present two applications of this system: voice cloning for dubbing and human-like speech generation for chatbots. The experimental results demonstrate the solid in-context learning capability of FireRedTTS, which can stably synthesize high-quality speech consistent with the prompt text and audio. For dubbing, FireRedTTS can clone target voices in a zero-shot way for the UGC scenario and adapt to studio-level expressive voice characters in the PUGC scenario via few-shot fine-tuning with a 1-hour recording. Moreover, FireRedTTS achieves controllable human-like speech generation in a casual style with paralinguistic behaviors and emotions via instruction tuning, to better serve spoken chatbots.
Submitted 11 April, 2025; v1 submitted 5 September, 2024;
originally announced September 2024.
-
SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model Based Text-to-Speech Synthesis
Authors:
Haohan Guo,
Fenglong Xie,
Kun Xie,
Dongchao Yang,
Dake Guo,
Xixin Wu,
Helen Meng
Abstract:
Long speech sequences have been troubling language model (LM)-based TTS approaches in terms of modeling complexity and efficiency. This work proposes SoCodec, a semantic-ordered multi-stream speech codec, to address this issue. It compresses speech into a shorter, multi-stream discrete semantic sequence with multiple tokens at each frame. Meanwhile, ordered product quantization is proposed to constrain this sequence into an ordered representation. It can be applied with a multi-stream delayed LM to achieve better autoregressive generation along both time and stream axes in TTS. The experimental results demonstrate the effectiveness of the proposed approach, which achieves superior performance over baseline systems even when compressing the frameshift of speech from 20ms to 240ms (12x). The ablation studies further validate the importance of learning the proposed ordered multi-stream semantic representation in pursuing shorter speech sequences for efficient LM-based TTS.
Submitted 2 September, 2024;
originally announced September 2024.
-
MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer
Authors:
Yuancheng Wang,
Haoyue Zhan,
Liwei Liu,
Ruihong Zeng,
Haotian Guo,
Jiachen Zheng,
Qiang Zhang,
Xueyao Zhang,
Shunsi Zhang,
Zhizheng Wu
Abstract:
Recent large-scale text-to-speech (TTS) systems are usually grouped as autoregressive and non-autoregressive systems. Autoregressive systems implicitly model duration but exhibit certain deficiencies in robustness and lack duration controllability. Non-autoregressive systems require explicit alignment information between text and speech during training and predict durations for linguistic units (e.g., phones), which may compromise their naturalness. In this paper, we introduce Masked Generative Codec Transformer (MaskGCT), a fully non-autoregressive TTS model that eliminates the need for explicit alignment information between text and speech supervision, as well as phone-level duration prediction. MaskGCT is a two-stage model: in the first stage, the model uses text to predict semantic tokens extracted from a speech self-supervised learning (SSL) model, and in the second stage, the model predicts acoustic tokens conditioned on these semantic tokens. MaskGCT follows the mask-and-predict learning paradigm. During training, MaskGCT learns to predict masked semantic or acoustic tokens based on given conditions and prompts. During inference, the model generates tokens of a specified length in a parallel manner. Experiments with 100K hours of in-the-wild speech demonstrate that MaskGCT outperforms the current state-of-the-art zero-shot TTS systems in terms of quality, similarity, and intelligibility. Audio samples are available at https://maskgct.github.io/. We release our code and model checkpoints at https://github.com/open-mmlab/Amphion/blob/main/models/tts/maskgct.
Submitted 20 October, 2024; v1 submitted 1 September, 2024;
originally announced September 2024.
-
Global-Local Distillation Network-Based Audio-Visual Speaker Tracking with Incomplete Modalities
Authors:
Yidi Li,
Yihan Li,
Yixin Guo,
Bin Ren,
Zhenhuan Xu,
Hao Guo,
Hong Liu,
Nicu Sebe
Abstract:
In speaker tracking research, integrating and complementing multi-modal data is a crucial strategy for improving the accuracy and robustness of tracking systems. However, tracking with incomplete modalities remains a challenging issue due to noisy observations caused by occlusion, acoustic noise, and sensor failures. In particular, when data are missing in multiple modalities, the performance of existing multi-modal fusion methods tends to degrade. To this end, we propose a Global-Local Distillation-based Tracker (GLDTracker) for robust audio-visual speaker tracking. GLDTracker is driven by a teacher-student distillation model, enabling the flexible fusion of incomplete information from each modality. The teacher network processes global signals captured by camera and microphone arrays, and the student network handles local information subject to visual occlusion and missing audio channels. By transferring knowledge from teacher to student, the student network can better adapt to complex dynamic scenes with incomplete observations. In the student network, a global feature reconstruction module based on a generative adversarial network is constructed to reconstruct global features from feature embeddings with missing local information. Furthermore, a multi-modal multi-level fusion attention is introduced to integrate the incomplete feature and the reconstructed feature, leveraging the complementarity and consistency of audio-visual and global-local features. Experimental results on the AV16.3 dataset demonstrate that the proposed GLDTracker outperforms existing state-of-the-art audio-visual trackers and achieves leading performance on both standard and incomplete-modality datasets, highlighting its superiority and robustness in complex conditions. The code and models will be available.
Submitted 17 February, 2025; v1 submitted 26 August, 2024;
originally announced August 2024.