Skip to main content

Showing 1–50 of 80 results for author: Guo, P

Searching in archive eess. Search in all archives.
.
  1. arXiv:2507.06971  [pdf, ps, other

    cs.CV cs.RO eess.IV

    Hallucinating 360°: Panoramic Street-View Generation via Local Scenes Diffusion and Probabilistic Prompting

    Authors: Fei Teng, Kai Luo, Sheng Wu, Siyu Li, Pujun Guo, Jiale Wei, Kunyu Peng, Jiaming Zhang, Kailun Yang

    Abstract: Panoramic perception holds significant potential for autonomous driving, enabling vehicles to acquire a comprehensive 360° surround view in a single shot. However, autonomous driving is a data-driven task. Complete panoramic data acquisition requires complex sampling systems and annotation pipelines, which are time-consuming and labor-intensive. Although existing street view generation models have… ▽ More

    Submitted 9 July, 2025; v1 submitted 9 July, 2025; originally announced July 2025.

    Comments: The source code will be publicly available at https://github.com/Bryant-Teng/Percep360

  2. arXiv:2505.04522  [pdf, ps, other

    eess.IV cs.CV

    Text2CT: Towards 3D CT Volume Generation from Free-text Descriptions Using Diffusion Model

    Authors: Pengfei Guo, Can Zhao, Dong Yang, Yufan He, Vishwesh Nath, Ziyue Xu, Pedro R. A. S. Bassi, Zongwei Zhou, Benjamin D. Simon, Stephanie Anne Harmon, Baris Turkbey, Daguang Xu

    Abstract: Generating 3D CT volumes from descriptive free-text inputs presents a transformative opportunity in diagnostics and research. In this paper, we introduce Text2CT, a novel approach for synthesizing 3D CT volumes from textual descriptions using the diffusion model. Unlike previous methods that rely on fixed-format text input, Text2CT employs a novel prompt formulation that enables generation from di… ▽ More

    Submitted 7 May, 2025; originally announced May 2025.

  3. arXiv:2504.10686  [pdf, other

    cs.CV eess.IV

    The Tenth NTIRE 2025 Efficient Super-Resolution Challenge Report

    Authors: Bin Ren, Hang Guo, Lei Sun, Zongwei Wu, Radu Timofte, Yawei Li, Yao Zhang, Xinning Chai, Zhengxue Cheng, Yingsheng Qin, Yucai Yang, Li Song, Hongyuan Yu, Pufan Xu, Cheng Wan, Zhijuan Huang, Peng Guo, Shuyuan Cui, Chenjun Li, Xuehai Hu, Pan Pan, Xin Zhang, Heng Zhang, Qing Luo, Linyan Jiang , et al. (122 additional authors not shown)

    Abstract: This paper presents a comprehensive review of the NTIRE 2025 Challenge on Single-Image Efficient Super-Resolution (ESR). The challenge aimed to advance the development of deep models that optimize key computational metrics, i.e., runtime, parameters, and FLOPs, while achieving a PSNR of at least 26.90 dB on the $\operatorname{DIV2K\_LSDIR\_valid}$ dataset and 26.99 dB on the… ▽ More

    Submitted 14 April, 2025; originally announced April 2025.

    Comments: Accepted by CVPR2025 NTIRE Workshop, Efficient Super-Resolution Challenge Report. 50 pages

  4. arXiv:2504.01375  [pdf, other

    eess.SP

    Simultaneous Pre-compensation for Bandwidth Limitation and Fiber Dispersion in Cost-Sensitive IM/DD Transmission Systems

    Authors: Zhe Zhao, Aiying Yang, Xiaoqian Huang, Peng Guo, Shuhua Zhao, Tianjia Xu, Wenkai Wan, Tianwai Bo, Zhongwei Tan, Yi Dong, Yaojun Qiao

    Abstract: We propose a pre-compensation scheme for bandwidth limitation and fiber dispersion (pre-BL-EDC) based on the modified Gerchberg-Saxton (GS) algorithm. Experimental results demonstrate 1.0/1.0/2.0 dB gains compared to modified GS pre-EDC for 20/28/32 Gbit/s bandwidth-limited systems.

    Submitted 2 April, 2025; originally announced April 2025.

  5. arXiv:2503.12482  [pdf, other

    eess.SP

    Fuzzy Clustering for Low-Complexity Time Domain Chromatic Dispersion Compensation Scheme in Coherent Optical Fiber Communication Systems

    Authors: Wenkai Wan, Aiying Yang, Peng Guo, Zhe Zhao, Tianjia Xu, Jinxuan Wu, Zhiheng Liu

    Abstract: Chromatic dispersion compensation (CDC), implemented in either the time-domain or frequency-domain, is crucial for enhancing power efficiency in the digital signal processing of modern optical fiber communication systems. Developing low-complexity CDC schemes is essential for hardware implemention, particularly for high-speed and long-haul optical fiber communication systems. In this work, we prop… ▽ More

    Submitted 16 March, 2025; originally announced March 2025.

  6. arXiv:2503.11133  [pdf, other

    cs.CV eess.IV

    SpaceSeg: A High-Precision Intelligent Perception Segmentation Method for Multi-Spacecraft On-Orbit Targets

    Authors: Hao Liu, Pengyu Guo, Siyuan Yang, Zeqing Jiang, Qinglei Hu, Dongyu Li

    Abstract: With the continuous advancement of human exploration into deep space, intelligent perception and high-precision segmentation technology for on-orbit multi-spacecraft targets have become critical factors for ensuring the success of modern space missions. However, the complex deep space environment, diverse imaging conditions, and high variability in spacecraft morphology pose significant challenges… ▽ More

    Submitted 14 March, 2025; originally announced March 2025.

  7. arXiv:2502.05845  [pdf

    eess.SY

    Exploiting the Hidden Capacity of MMC Through Accurate Quantification of Modulation Indices

    Authors: Qianhao Sun, Jingwei Meng, Ruofan Li, Mingchao Xia, Qifang Chen, Jiejie Zhou, Meiqi Fan, Peiqian Guo

    Abstract: The modular multilevel converter (MMC) has become increasingly important in voltage-source converter-based high-voltage direct current (VSC-HVDC) systems. Direct and indirect modulation are widely used as mainstream modulation techniques in MMCs. However, due to the challenge of quantitatively evaluating the operation of different modulation schemes, the academic and industrial communities still h… ▽ More

    Submitted 9 February, 2025; originally announced February 2025.

  8. arXiv:2501.13306  [pdf, other

    cs.SD cs.CL eess.AS

    OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia

    Authors: Xuelong Geng, Kun Wei, Qijie Shao, Shuiyun Liu, Zhennan Lin, Zhixian Zhao, Guojian Li, Wenjie Tian, Peikun Chen, Yangze Li, Pengcheng Guo, Mingchen Shao, Shuiyuan Wang, Yuang Cao, Chengyou Wang, Tianyi Xu, Yuhang Dai, Xinfa Zhu, Yue Li, Li Zhang, Lei Xie

    Abstract: Large Language Models (LLMs) have made significant progress in various downstream tasks, inspiring the development of Speech Understanding Language Models (SULMs) to enable comprehensive speech-based interactions. However, most advanced SULMs are developed by the industry, leveraging large-scale datasets and computational resources that are not readily available to the academic community. Moreover… ▽ More

    Submitted 16 February, 2025; v1 submitted 22 January, 2025; originally announced January 2025.

    Comments: OSUM Technical Report v2. The experimental results reported herein differ from those in v1 because of adding new data and training in more steps

  9. arXiv:2501.05127  [pdf, other

    cs.SD eess.AS

    DiffAttack: Diffusion-based Timbre-reserved Adversarial Attack in Speaker Identification

    Authors: Qing Wang, Jixun Yao, Zhaokai Sun, Pengcheng Guo, Lei Xie, John H. L. Hansen

    Abstract: Being a form of biometric identification, the security of the speaker identification (SID) system is of utmost importance. To better understand the robustness of SID systems, we aim to perform more realistic attacks in SID, which are challenging for both humans and machines to detect. In this study, we propose DiffAttack, a novel timbre-reserved adversarial attack approach that exploits the capabi… ▽ More

    Submitted 9 January, 2025; originally announced January 2025.

    Comments: 5 pages,4 figures, accepted by ICASSP 2025

  10. arXiv:2412.18589  [pdf, other

    eess.IV cs.CV

    Text-Driven Tumor Synthesis

    Authors: Xinran Li, Yi Shuai, Chen Liu, Qi Chen, Qilong Wu, Pengfei Guo, Dong Yang, Can Zhao, Pedro R. A. S. Bassi, Daguang Xu, Kang Wang, Yang Yang, Alan Yuille, Zongwei Zhou

    Abstract: Tumor synthesis can generate examples that AI often misses or over-detects, improving AI performance by training on these challenging cases. However, existing synthesis methods, which are typically unconditional -- generating images from random variables -- or conditioned only by tumor shapes, lack controllability over specific tumor characteristics such as texture, heterogeneity, boundaries, and… ▽ More

    Submitted 24 December, 2024; originally announced December 2024.

  11. arXiv:2412.05589  [pdf, other

    eess.AS cs.SD

    SQ-Whisper: Speaker-Querying based Whisper Model for Target-Speaker ASR

    Authors: Pengcheng Guo, Xuankai Chang, Hang Lv, Shinji Watanabe, Lei Xie

    Abstract: Benefiting from massive and diverse data sources, speech foundation models exhibit strong generalization and knowledge transfer capabilities to a wide range of downstream tasks. However, a limitation arises from their exclusive handling of single-speaker speech input, making them ineffective in recognizing multi-speaker overlapped speech, a common occurrence in real-world scenarios. In this study,… ▽ More

    Submitted 7 December, 2024; originally announced December 2024.

    Comments: Accepted by IEEE/ACM TASLP

  12. arXiv:2409.19660  [pdf, other

    cs.CV eess.IV

    All-in-One Image Coding for Joint Human-Machine Vision with Multi-Path Aggregation

    Authors: Xu Zhang, Peiyao Guo, Ming Lu, Zhan Ma

    Abstract: Image coding for multi-task applications, catering to both human perception and machine vision, has been extensively investigated. Existing methods often rely on multiple task-specific encoder-decoder pairs, leading to high overhead of parameter and bitrate usage, or face challenges in multi-objective optimization under a unified representation, failing to achieve both performance and efficiency.… ▽ More

    Submitted 29 September, 2024; originally announced September 2024.

    Comments: NeurIPS 2024

  13. arXiv:2409.11169  [pdf, other

    eess.IV cs.AI cs.CV

    MAISI: Medical AI for Synthetic Imaging

    Authors: Pengfei Guo, Can Zhao, Dong Yang, Ziyue Xu, Vishwesh Nath, Yucheng Tang, Benjamin Simon, Mason Belue, Stephanie Harmon, Baris Turkbey, Daguang Xu

    Abstract: Medical imaging analysis faces challenges such as data scarcity, high annotation costs, and privacy concerns. This paper introduces the Medical AI for Synthetic Imaging (MAISI), an innovative approach using the diffusion model to generate synthetic 3D computed tomography (CT) images to address those challenges. MAISI leverages the foundation volume compression network and the latent diffusion mode… ▽ More

    Submitted 29 October, 2024; v1 submitted 13 September, 2024; originally announced September 2024.

    Comments: WACV25 accepted. https://monai.io/research/maisi

  14. arXiv:2409.10076  [pdf, other

    cs.SD cs.HC eess.AS

    Optimizing Dysarthria Wake-Up Word Spotting: An End-to-End Approach for SLT 2024 LRDWWS Challenge

    Authors: Shuiyun Liu, Yuxiang Kong, Pengcheng Guo, Weiji Zhuang, Peng Gao, Yujun Wang, Lei Xie

    Abstract: Speech has emerged as a widely embraced user interface across diverse applications. However, for individuals with dysarthria, the inherent variability in their speech poses significant challenges. This paper presents an end-to-end Pretrain-based Dual-filter Dysarthria Wake-up word Spotting (PD-DWS) system for the SLT 2024 Low-Resource Dysarthria Wake-Up Word Spotting Challenge. Specifically, our s… ▽ More

    Submitted 16 September, 2024; originally announced September 2024.

    Comments: 8 pages, Accepted to SLT 2024

  15. arXiv:2409.04173  [pdf, other

    eess.AS

    NPU-NTU System for Voice Privacy 2024 Challenge

    Authors: Jixun Yao, Nikita Kuzmin, Qing Wang, Pengcheng Guo, Ziqian Ning, Dake Guo, Kong Aik Lee, Eng-Siong Chng, Lei Xie

    Abstract: Speaker anonymization is an effective privacy protection solution that conceals the speaker's identity while preserving the linguistic content and paralinguistic information of the original speech. To establish a fair benchmark and facilitate comparison of speaker anonymization systems, the VoicePrivacy Challenge (VPC) was held in 2020 and 2022, with a new edition planned for 2024. In this paper,… ▽ More

    Submitted 4 February, 2025; v1 submitted 6 September, 2024; originally announced September 2024.

    Comments: System description for VPC 2024

  16. arXiv:2408.10680  [pdf, other

    cs.CL cs.SD eess.AS

    Towards Rehearsal-Free Multilingual ASR: A LoRA-based Case Study on Whisper

    Authors: Tianyi Xu, Kaixun Huang, Pengcheng Guo, Yu Zhou, Longtao Huang, Hui Xue, Lei Xie

    Abstract: Pre-trained multilingual speech foundation models, like Whisper, have shown impressive performance across different languages. However, adapting these models to new or specific languages is computationally extensive and faces catastrophic forgetting problems. Addressing these issues, our study investigates strategies to enhance the model on new languages in the absence of original training data, w… ▽ More

    Submitted 20 August, 2024; originally announced August 2024.

  17. arXiv:2408.02085  [pdf, other

    cs.CV cs.AI cs.CL eess.SP

    Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models

    Authors: Yulei Qin, Yuncheng Yang, Pengcheng Guo, Gang Li, Hang Shao, Yuchen Shi, Zihan Xu, Yun Gu, Ke Li, Xing Sun

    Abstract: Instruction tuning plays a critical role in aligning large language models (LLMs) with human preference. Despite the vast amount of open instruction datasets, naively training a LLM on all existing instructions may not be optimal and practical. To pinpoint the most beneficial datapoints, data assessment and selection methods have been proposed in the fields of natural language processing (NLP) and… ▽ More

    Submitted 28 December, 2024; v1 submitted 4 August, 2024; originally announced August 2024.

    Comments: Accepted to TMLR with Survey Certificate, review, survey, 37 pages, 5 figures, 4 tables

  18. arXiv:2407.11629  [pdf, other

    eess.AS

    MUSA: Multi-lingual Speaker Anonymization via Serial Disentanglement

    Authors: Jixun Yao, Qing Wang, Pengcheng Guo, Ziqian Ning, Yuguang Yang, Yu Pan, Lei Xie

    Abstract: Speaker anonymization is an effective privacy protection solution designed to conceal the speaker's identity while preserving the linguistic content and para-linguistic information of the original speech. While most prior studies focus solely on a single language, an ideal speaker anonymization system should be capable of handling multiple languages. This paper proposes MUSA, a Multi-lingual Speak… ▽ More

    Submitted 16 July, 2024; originally announced July 2024.

    Comments: Submitted to TASLP

  19. arXiv:2407.03307  [pdf, other

    eess.IV cs.CV

    HoloHisto: End-to-end Gigapixel WSI Segmentation with 4K Resolution Sequential Tokenization

    Authors: Yucheng Tang, Yufan He, Vishwesh Nath, Pengfeig Guo, Ruining Deng, Tianyuan Yao, Quan Liu, Can Cui, Mengmeng Yin, Ziyue Xu, Holger Roth, Daguang Xu, Haichun Yang, Yuankai Huo

    Abstract: In digital pathology, the traditional method for deep learning-based image segmentation typically involves a two-stage process: initially segmenting high-resolution whole slide images (WSI) into smaller patches (e.g., 256x256, 512x512, 1024x1024) and subsequently reconstructing them to their original scale. This method often struggles to capture the complex details and vast scope of WSIs. In this… ▽ More

    Submitted 3 July, 2024; originally announced July 2024.

  20. arXiv:2405.19366  [pdf, other

    eess.SP cs.AI

    ECG Semantic Integrator (ESI): A Foundation ECG Model Pretrained with LLM-Enhanced Cardiological Text

    Authors: Han Yu, Peikun Guo, Akane Sano

    Abstract: The utilization of deep learning on electrocardiogram (ECG) analysis has brought the advanced accuracy and efficiency of cardiac healthcare diagnostics. By leveraging the capabilities of deep learning in semantic understanding, especially in feature extraction and representation learning, this study introduces a new multimodal contrastive pretaining framework that aims to improve the quality and r… ▽ More

    Submitted 23 October, 2024; v1 submitted 26 May, 2024; originally announced May 2024.

  21. arXiv:2405.10786  [pdf, other

    eess.AS

    Distinctive and Natural Speaker Anonymization via Singular Value Transformation-assisted Matrix

    Authors: Jixun Yao, Qing Wang, Pengcheng Guo, Ziqian Ning, Lei Xie

    Abstract: Speaker anonymization is an effective privacy protection solution that aims to conceal the speaker's identity while preserving the naturalness and distinctiveness of the original speech. Mainstream approaches use an utterance-level vector from a pre-trained automatic speaker verification (ASV) model to represent speaker identity, which is then averaged or modified for anonymization. However, these… ▽ More

    Submitted 17 May, 2024; originally announced May 2024.

    Comments: Accepted by IEEE/ACM Transactions on Audio, Speech, and Language Processing

  22. arXiv:2405.02132  [pdf, other

    cs.SD cs.CL eess.AS

    Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets

    Authors: Xuelong Geng, Tianyi Xu, Kun Wei, Bingshen Mu, Hongfei Xue, He Wang, Yangze Li, Pengcheng Guo, Yuhang Dai, Longhao Li, Mingchen Shao, Lei Xie

    Abstract: Large Language Models (LLMs) have demonstrated unparalleled effectiveness in various NLP tasks, and integrating LLMs with automatic speech recognition (ASR) is becoming a mainstream paradigm. Building upon this momentum, our research delves into an in-depth examination of this paradigm on a large open-source Chinese dataset. Specifically, our research aims to evaluate the impact of various configu… ▽ More

    Submitted 4 November, 2024; v1 submitted 3 May, 2024; originally announced May 2024.

  23. arXiv:2405.00259   

    physics.med-ph eess.IV

    Optimization of Dark-Field CT for Lung Imaging

    Authors: Peiyuan Guo, Simon Spindler, Li Zhang, Zhentian Wang

    Abstract: Background: X-ray grating-based dark-field imaging can sense the small angle scattering caused by an object's micro-structure. This technique is sensitive to lung's porous alveoli and is able to detect lung disease at an early stage. Up to now, a human-scale dark-field CT has been built for lung imaging. Purpose: This study aimed to develop a more thorough optimization method for dark-field lung C… ▽ More

    Submitted 1 May, 2024; v1 submitted 30 April, 2024; originally announced May 2024.

    Comments: There is a mistake in subsection 2.3, where the content is not correct because of the incorrect parameter we set, which leads to the following calculations in the following sections potentially incorrect

  24. arXiv:2401.06788  [pdf, other

    eess.AS cs.AI cs.SD

    The NPU-ASLP-LiAuto System Description for Visual Speech Recognition in CNVSRC 2023

    Authors: He Wang, Pengcheng Guo, Wei Chen, Pan Zhou, Lei Xie

    Abstract: This paper delineates the visual speech recognition (VSR) system introduced by the NPU-ASLP-LiAuto (Team 237) in the first Chinese Continuous Visual Speech Recognition Challenge (CNVSRC) 2023, engaging in the fixed and open tracks of Single-Speaker VSR Task, and the open track of Multi-Speaker VSR Task. In terms of data processing, we leverage the lip motion extractor from the baseline1 to produce… ▽ More

    Submitted 29 February, 2024; v1 submitted 7 January, 2024; originally announced January 2024.

    Comments: Included in CNVSRC Workshop 2023, NCMMSC 2023

  25. arXiv:2401.04148  [pdf, other

    cs.LG cs.AI eess.SP

    Online Test-Time Adaptation of Spatial-Temporal Traffic Flow Forecasting

    Authors: Pengxin Guo, Pengrong Jin, Ziyue Li, Lei Bai, Yu Zhang

    Abstract: Accurate spatial-temporal traffic flow forecasting is crucial in aiding traffic managers in implementing control measures and assisting drivers in selecting optimal travel routes. Traditional deep-learning based methods for traffic flow forecasting typically rely on historical data to train their models, which are then used to make predictions on future data. However, the performance of the traine… ▽ More

    Submitted 8 January, 2024; originally announced January 2024.

  26. arXiv:2401.03697  [pdf, other

    cs.SD eess.AS

    An audio-quality-based multi-strategy approach for target speaker extraction in the MISP 2023 Challenge

    Authors: Runduo Han, Xiaopeng Yan, Weiming Xu, Pengcheng Guo, Jiayao Sun, He Wang, Quan Lu, Ning Jiang, Lei Xie

    Abstract: This paper describes our audio-quality-based multi-strategy approach for the audio-visual target speaker extraction (AVTSE) task in the Multi-modal Information based Speech Processing (MISP) 2023 Challenge. Specifically, our approach adopts different extraction strategies based on the audio quality, striking a balance between interference removal and speech preservation, which benifits the back-en… ▽ More

    Submitted 6 March, 2024; v1 submitted 8 January, 2024; originally announced January 2024.

    Comments: Accepted by ICASSP 2024

  27. arXiv:2401.03473  [pdf, ps, other

    cs.SD cs.AI eess.AS

    ICMC-ASR: The ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition Challenge

    Authors: He Wang, Pengcheng Guo, Yue Li, Ao Zhang, Jiayao Sun, Lei Xie, Wei Chen, Pan Zhou, Hui Bu, Xin Xu, Binbin Zhang, Zhuo Chen, Jian Wu, Longbiao Wang, Eng Siong Chng, Sun Li

    Abstract: To promote speech processing and recognition research in driving scenarios, we build on the success of the Intelligent Cockpit Speech Recognition Challenge (ICSRC) held at ISCSLP 2022 and launch the ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition (ICMC-ASR) Challenge. This challenge collects over 100 hours of multi-channel speech data recorded inside a new energy vehicle and 40 hours… ▽ More

    Submitted 20 February, 2024; v1 submitted 7 January, 2024; originally announced January 2024.

    Comments: Accepted at ICASSP 2024

  28. MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition

    Authors: He Wang, Pengcheng Guo, Pan Zhou, Lei Xie

    Abstract: While automatic speech recognition (ASR) systems degrade significantly in noisy environments, audio-visual speech recognition (AVSR) systems aim to complement the audio stream with noise-invariant visual cues and improve the system's robustness. However, current studies mainly focus on fusing the well-learned modality features, like the output of modality-specific encoders, without considering the… ▽ More

    Submitted 8 April, 2024; v1 submitted 7 January, 2024; originally announced January 2024.

    Comments: 5 pages, 3 figures Accepted at ICASSP 2024

  29. arXiv:2312.09746  [pdf, other

    cs.SD eess.AS

    Automatic channel selection and spatial feature integration for multi-channel speech recognition across various array topologies

    Authors: Bingshen Mu, Pengcheng Guo, Dake Guo, Pan Zhou, Wei Chen, Lei Xie

    Abstract: Automatic Speech Recognition (ASR) has shown remarkable progress, yet it still faces challenges in real-world distant scenarios across various array topologies each with multiple recording devices. The focal point of the CHiME-7 Distant ASR task is to devise a unified system capable of generalizing various array topologies that have multiple recording devices and offering reliable recognition perf… ▽ More

    Submitted 15 December, 2023; originally announced December 2023.

    Comments: Accepted by ICASSP 2024

  30. Decoupling and Interacting Multi-Task Learning Network for Joint Speech and Accent Recognition

    Authors: Qijie Shao, Pengcheng Guo, Jinghao Yan, Pengfei Hu, Lei Xie

    Abstract: Accents, as variations from standard pronunciation, pose significant challenges for speech recognition systems. Although joint automatic speech recognition (ASR) and accent recognition (AR) training has been proven effective in handling multi-accent scenarios, current multi-task ASR-AR approaches overlook the granularity differences between tasks. Fine-grained units capture pronunciation-related a… ▽ More

    Submitted 17 November, 2023; v1 submitted 12 November, 2023; originally announced November 2023.

    Comments: Accepted by IEEE Transactions on Audio, Speech and Language Processing (TASLP)

  31. CDSD: Chinese Dysarthria Speech Database

    Authors: Yan Wang, Mengyi Sun, Xinchen Kang, Jingting Li, Pengfei Guo, Ming Gao, Su-Jing Wang

    Abstract: Dysarthric speech poses significant challenges for individuals with dysarthria, impacting their ability to communicate socially. Despite the widespread use of Automatic Speech Recognition (ASR), accurately recognizing dysarthric speech remains a formidable task, largely due to the limited availability of dysarthric speech data. To address this gap, we developed the Chinese Dysarthria Speech Databa… ▽ More

    Submitted 13 February, 2025; v1 submitted 24 October, 2023; originally announced October 2023.

    Comments: Accepted at INTERSPEECH 2024 Yan Wang and Mengyi Sun contributed equally to this research

    Journal ref: Interspeech 2024

  32. arXiv:2310.04863  [pdf, other

    cs.SD eess.AS

    SA-Paraformer: Non-autoregressive End-to-End Speaker-Attributed ASR

    Authors: Yangze Li, Fan Yu, Yuhao Liang, Pengcheng Guo, Mohan Shi, Zhihao Du, Shiliang Zhang, Lei Xie

    Abstract: Joint modeling of multi-speaker ASR and speaker diarization has recently shown promising results in speaker-attributed automatic speech recognition (SA-ASR).Although being able to obtain state-of-the-art (SOTA) performance, most of the studies are based on an autoregressive (AR) decoder which generates tokens one-by-one and results in a large real-time factor (RTF). To speed up inference, we intro… ▽ More

    Submitted 7 October, 2023; originally announced October 2023.

  33. arXiv:2309.15800  [pdf, other

    cs.CL cs.SD eess.AS

    Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study

    Authors: Xuankai Chang, Brian Yan, Kwanghee Choi, Jeeweon Jung, Yichen Lu, Soumi Maiti, Roshan Sharma, Jiatong Shi, Jinchuan Tian, Shinji Watanabe, Yuya Fujita, Takashi Maekaku, Pengcheng Guo, Yao-Fei Cheng, Pavel Denisov, Kohei Saijo, Hsiu-Hsuan Wang

    Abstract: Speech signals, typically sampled at rates in the tens of thousands per second, contain redundancies, evoking inefficiencies in sequence modeling. High-dimensional speech features such as spectrograms are often used as the input for the subsequent model. However, they can still be redundant. Recent investigations proposed the use of discrete speech units derived from self-supervised learning repre… ▽ More

    Submitted 27 September, 2023; originally announced September 2023.

    Comments: Submitted to IEEE ICASSP 2024

  34. arXiv:2309.00929  [pdf, other

    cs.SD eess.AS

    Timbre-reserved Adversarial Attack in Speaker Identification

    Authors: Qing Wang, Jixun Yao, Li Zhang, Pengcheng Guo, Lei Xie

    Abstract: As a type of biometric identification, a speaker identification (SID) system is confronted with various kinds of attacks. The spoofing attacks typically imitate the timbre of the target speakers, while the adversarial attacks confuse the SID system by adding a well-designed adversarial perturbation to an arbitrary speech. Although the spoofing attack copies a similar timbre as the victim, it does… ▽ More

    Submitted 2 September, 2023; originally announced September 2023.

    Comments: 11 pages, 8 figures

  35. arXiv:2306.00804  [pdf, other

    cs.SD cs.CL eess.AS

    Adaptive Contextual Biasing for Transducer Based Streaming Speech Recognition

    Authors: Tianyi Xu, Zhanheng Yang, Kaixun Huang, Pengcheng Guo, Ao Zhang, Biao Li, Changru Chen, Chao Li, Lei Xie

    Abstract: By incorporating additional contextual information, deep biasing methods have emerged as a promising solution for speech recognition of personalized words. However, for real-world voice assistants, always biasing on such personalized words with high prediction scores can significantly degrade the performance of recognizing common words. To address this issue, we propose an adaptive contextual bias… ▽ More

    Submitted 15 August, 2023; v1 submitted 1 June, 2023; originally announced June 2023.

  36. arXiv:2305.19020  [pdf, other

    cs.SD eess.AS

    Pseudo-Siamese Network based Timbre-reserved Black-box Adversarial Attack in Speaker Identification

    Authors: Qing Wang, Jixun Yao, Ziqian Wang, Pengcheng Guo, Lei Xie

    Abstract: In this study, we propose a timbre-reserved adversarial attack approach for speaker identification (SID) to not only exploit the weakness of the SID model but also preserve the timbre of the target speaker in a black-box attack setting. Particularly, we generate timbre-reserved fake audio by adding an adversarial constraint during the training of the voice conversion model. Then, we leverage a pse… ▽ More

    Submitted 30 May, 2023; originally announced May 2023.

    Comments: 5 pages

  37. arXiv:2305.13716  [pdf, other

    cs.SD cs.CL eess.AS

    BA-SOT: Boundary-Aware Serialized Output Training for Multi-Talker ASR

    Authors: Yuhao Liang, Fan Yu, Yangze Li, Pengcheng Guo, Shiliang Zhang, Qian Chen, Lei Xie

    Abstract: The recently proposed serialized output training (SOT) simplifies multi-talker automatic speech recognition (ASR) by generating speaker transcriptions separated by a special token. However, frequent speaker changes can make speaker change prediction difficult. To address this, we propose boundary-aware serialized output training (BA-SOT), which explicitly incorporates boundary knowledge into the d… ▽ More

    Submitted 5 October, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: Accepted by INTERSPEECH 2023

  38. TranUSR: Phoneme-to-word Transcoder Based Unified Speech Representation Learning for Cross-lingual Speech Recognition

    Authors: Hongfei Xue, Qijie Shao, Peikun Chen, Pengcheng Guo, Lei Xie, Jie Liu

    Abstract: UniSpeech has achieved superior performance in cross-lingual automatic speech recognition (ASR) by explicitly aligning latent representations to phoneme units using multi-task self-supervised learning. While the learned representations transfer well from high-resource to low-resource languages, predicting words directly from these phonetic representations in downstream ASR is challenging. In this… ▽ More

    Submitted 8 October, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

    Comments: 5 pages, 3 figures. Accepted by INTERSPEECH 2023

  39. arXiv:2305.12493  [pdf, other

    eess.AS cs.CL cs.SD

    Contextualized End-to-End Speech Recognition with Contextual Phrase Prediction Network

    Authors: Kaixun Huang, Ao Zhang, Zhanheng Yang, Pengcheng Guo, Bingshen Mu, Tianyi Xu, Lei Xie

    Abstract: Contextual information plays a crucial role in speech recognition technologies and incorporating it into the end-to-end speech recognition models has drawn immense interest recently. However, previous deep bias methods lacked explicit supervision for bias tasks. In this study, we introduce a contextual phrase prediction network for an attention-based deep bias method. This network predicts context… ▽ More

    Submitted 12 July, 2023; v1 submitted 21 May, 2023; originally announced May 2023.

    Comments: Accepted by interspeech2023

  40. arXiv:2303.06341  [pdf, other

    eess.AS

    The NPU-ASLP System for Audio-Visual Speech Recognition in MISP 2022 Challenge

    Authors: Pengcheng Guo, He Wang, Bingshen Mu, Ao Zhang, Peikun Chen

    Abstract: This paper describes our NPU-ASLP system for the Audio-Visual Diarization and Recognition (AVDR) task in the Multi-modal Information based Speech Processing (MISP) 2022 Challenge. Specifically, the weighted prediction error (WPE) and guided source separation (GSS) techniques are used to reduce reverberation and generate clean signals for each single speaker first. Then, we explore the effectivenes… ▽ More

    Submitted 11 March, 2023; originally announced March 2023.

    Comments: 2 pages, accepted by ICASSP 2023

  41. arXiv:2302.13523  [pdf, other

    cs.SD eess.AS

    VE-KWS: Visual Modality Enhanced End-to-End Keyword Spotting

    Authors: Ao Zhang, He Wang, Pengcheng Guo, Yihui Fu, Lei Xie, Yingying Gao, Shilei Zhang, Junlan Feng

    Abstract: The performance of the keyword spotting (KWS) system based on audio modality, commonly measured in false alarms and false rejects, degrades significantly under the far field and noisy conditions. Therefore, audio-visual keyword spotting, which leverages complementary relationships over multiple modalities, has recently gained much attention. However, current studies mainly focus on combining the e… ▽ More

    Submitted 14 March, 2023; v1 submitted 27 February, 2023; originally announced February 2023.

    Comments: 5 pages. Accepted at ICASSP2023

  42. arXiv:2211.13443  [pdf, other

    cs.SD eess.AS

    TESSP: Text-Enhanced Self-Supervised Speech Pre-training

    Authors: Zhuoyuan Yao, Shuo Ren, Sanyuan Chen, Ziyang Ma, Pengcheng Guo, Lei Xie

    Abstract: Self-supervised speech pre-training empowers the model with the contextual structure inherent in the speech signal while self-supervised text pre-training empowers the model with linguistic information. Both of them are beneficial for downstream speech tasks such as ASR. However, the distinct pre-training objectives make it challenging to jointly optimize the speech and text representation in the… ▽ More

    Submitted 24 November, 2022; originally announced November 2022.

    Comments: 9 pages, 4 figures

  43. arXiv:2211.03038  [pdf, other

    eess.AS cs.CR cs.SD

    Distinguishable Speaker Anonymization based on Formant and Fundamental Frequency Scaling

    Authors: Jixun Yao, Qing Wang, Yi Lei, Pengcheng Guo, Lei Xie, Namin Wang, Jie Liu

    Abstract: Speech data on the Internet are proliferating exponentially because of the emergence of social media, and the sharing of such personal data raises obvious security and privacy concerns. One solution to mitigate these concerns involves concealing speaker identities before sharing speech data, also referred to as speaker anonymization. In our previous work, we have developed an automatic speaker ver… ▽ More

    Submitted 6 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023

  44. arXiv:2211.03036  [pdf, other

    eess.AS cs.SD

    Preserving background sound in noise-robust voice conversion via multi-task learning

    Authors: Jixun Yao, Yi Lei, Qing Wang, Pengcheng Guo, Ziqian Ning, Lei Xie, Hai Li, Junhui Liu, Danming Xie

    Abstract: Background sound is an informative form of art that is helpful in providing a more immersive experience in real-application voice conversion (VC) scenarios. However, prior research about VC, mainly focusing on clean voices, pay rare attention to VC with background sound. The critical problem for preserving background sound in VC is inevitable speech distortion by the neural separation model and th… ▽ More

    Submitted 6 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023

  45. arXiv:2210.05265  [pdf, other

    cs.SD eess.AS

    MFCCA:Multi-Frame Cross-Channel attention for multi-speaker ASR in Multi-party meeting scenario

    Authors: Fan Yu, Shiliang Zhang, Pengcheng Guo, Yuhao Liang, Zhihao Du, Yuxiao Lin, Lei Xie

    Abstract: Recently cross-channel attention, which better leverages multi-channel signals from microphone array, has shown promising results in the multi-party meeting scenario. Cross-channel attention focuses on either learning global correlations between sequences of different channels or exploiting fine-grained channel-wise information effectively at each time step. Considering the delay of microphone arr… ▽ More

    Submitted 11 October, 2022; originally announced October 2022.

    Comments: Accepted by SLT 2022

  46. arXiv:2209.11969  [pdf, other

    eess.AS cs.SD

    NWPU-ASLP System for the VoicePrivacy 2022 Challenge

    Authors: Jixun Yao, Qing Wang, Li Zhang, Pengcheng Guo, Yuhao Liang, Lei Xie

    Abstract: This paper presents the NWPU-ASLP speaker anonymization system for VoicePrivacy 2022 Challenge. Our submission does not involve additional Automatic Speaker Verification (ASV) model or x-vector pool. Our system consists of four modules, including feature extractor, acoustic model, anonymization module, and neural vocoder. First, the feature extractor extracts the Phonetic Posteriorgram (PPG) and p… ▽ More

    Submitted 24 September, 2022; originally announced September 2022.

    Comments: VoicePrivacy 2022 Challenge

  47. arXiv:2207.00883  [pdf, other

    cs.SD cs.CL eess.AS

    Improving Transformer-based Conversational ASR by Inter-Sentential Attention Mechanism

    Authors: Kun Wei, Pengcheng Guo, Ning Jiang

    Abstract: Transformer-based models have demonstrated their effectiveness in automatic speech recognition (ASR) tasks and even shown superior performance over the conventional hybrid framework. The main idea of Transformers is to capture the long-range global context within an utterance by self-attention layers. However, for scenarios like conversational speech, such utterance-level modeling will neglect con… ▽ More

    Submitted 2 July, 2022; originally announced July 2022.

    Comments: Accepted by Interspeech2022

  48. arXiv:2206.06065  [pdf

    eess.IV cs.CV

    Deep ensemble learning for segmenting tuberculosis-consistent manifestations in chest radiographs

    Authors: Sivaramakrishnan Rajaraman, Feng Yang, Ghada Zamzmi, Peng Guo, Zhiyun Xue, Sameer K Antani

    Abstract: Automated segmentation of tuberculosis (TB)-consistent lesions in chest X-rays (CXRs) using deep learning (DL) methods can help reduce radiologist effort, supplement clinical decision-making, and potentially result in improved patient treatment. The majority of works in the literature discuss training automatic segmentation models using coarse bounding box annotations. However, the granularity of… ▽ More

    Submitted 13 June, 2022; originally announced June 2022.

    Comments: 13 pages, 6 figures

    MSC Class: 68T07

  49. arXiv:2204.11669  [pdf

    eess.IV cs.AI physics.med-ph

    Deep-learning-enabled Brain Hemodynamic Mapping Using Resting-state fMRI

    Authors: Xirui Hou, Pengfei Guo, Puyang Wang, Peiying Liu, Doris D. M. Lin, Hongli Fan, Yang Li, Zhiliang Wei, Zixuan Lin, Dengrong Jiang, Jin Jin, Catherine Kelly, Jay J. Pillai, Judy Huang, Marco C. Pinho, Binu P. Thomas, Babu G. Welch, Denise C. Park, Vishal M. Patel, Argye E. Hillis, Hanzhang Lu

    Abstract: Cerebrovascular disease is a leading cause of death globally. Prevention and early intervention are known to be the most effective forms of its management. Non-invasive imaging methods hold great promises for early stratification, but at present lack the sensitivity for personalized prognosis. Resting-state functional magnetic resonance imaging (rs-fMRI), a powerful tool previously used for mappin… ▽ More

    Submitted 25 April, 2022; originally announced April 2022.

    Journal ref: npj Digital Medicine (2023) 116

  50. arXiv:2204.03398  [pdf, other

    cs.SD eess.AS

    Linguistic-Acoustic Similarity Based Accent Shift for Accent Recognition

    Authors: Qijie Shao, Jinghao Yan, Jian Kang, Pengcheng Guo, Xian Shi, Pengfei Hu, Lei Xie

    Abstract: General accent recognition (AR) models tend to directly extract low-level information from spectrums, which always significantly overfit on speakers or channels. Considering accent can be regarded as a series of shifts relative to native pronunciation, distinguishing accents will be an easier task with accent shift as input. But due to the lack of native utterance as an anchor, estimating the acce… ▽ More

    Submitted 1 July, 2022; v1 submitted 7 April, 2022; originally announced April 2022.

    Comments: Accepted by Interspeech 2022