Showing 1–50 of 51 results for author: Deng, C

Searching in archive eess.
  1. arXiv:2509.18579  [pdf, ps, other]

    eess.AS cs.CL cs.SD

    Teaching Audio Models to Reason: A Unified Framework for Source- and Layer-wise Distillation

    Authors: Runyan Yang, Yuke Si, Yingying Gao, Junlan Feng, Chao Deng, Shilei Zhang

    Abstract: While large audio language models excel at tasks like ASR and emotion recognition, they still struggle with complex reasoning due to the modality gap between audio and text as well as the lack of structured intermediate supervision. To address this, we propose a unified knowledge distillation framework to transfer reasoning capabilities from a high-capacity textual teacher model to a student audio…

    Submitted 22 September, 2025; originally announced September 2025.

    Comments: 5 pages; submitted to ICASSP 2026

  2. arXiv:2509.18570  [pdf, ps, other]

    eess.AS cs.CL cs.SD

    HarmoniFuse: A Component-Selective and Prompt-Adaptive Framework for Multi-Task Speech Language Modeling

    Authors: Yuke Si, Runyan Yang, Yingying Gao, Junlan Feng, Chao Deng, Shilei Zhang

    Abstract: Recent advances in large language models have facilitated the development of unified speech language models (SLMs) capable of supporting multiple speech tasks within a shared architecture. However, tasks such as automatic speech recognition (ASR) and speech emotion recognition (SER) rely on distinct types of information: ASR primarily depends on linguistic content, whereas SER requires the integra…

    Submitted 22 September, 2025; originally announced September 2025.

    Comments: 5 pages; submitted to ICASSP 2026

  3. arXiv:2509.12508  [pdf, ps, other]

    cs.CL cs.AI cs.SD eess.AS

    FunAudio-ASR Technical Report

    Authors: Keyu An, Yanni Chen, Chong Deng, Changfeng Gao, Zhifu Gao, Bo Gong, Xiangang Li, Yabin Li, Xiang Lv, Yunjie Ji, Yiheng Jiang, Bin Ma, Haoneng Luo, Chongjia Ni, Zexu Pan, Yiping Peng, Zhendong Peng, Peiyao Wang, Hao Wang, Wen Wang, Wupeng Wang, Biao Tian, Zhentao Tan, Nan Yang, Bin Yuan, et al. (7 additional authors not shown)

    Abstract: In recent years, automatic speech recognition (ASR) has witnessed transformative advancements driven by three complementary paradigms: data scaling, model size scaling, and deep integration with large language models (LLMs). However, LLMs are prone to hallucination, which can significantly degrade user experience in real-world ASR applications. In this paper, we present FunAudio-ASR, a large-scale…

    Submitted 17 September, 2025; v1 submitted 15 September, 2025; originally announced September 2025.

    Comments: Authors are listed in alphabetical order

  4. arXiv:2509.04685  [pdf, ps, other]

    eess.AS cs.SD

    Say More with Less: Variable-Frame-Rate Speech Tokenization via Adaptive Clustering and Implicit Duration Coding

    Authors: Rui-Chen Zheng, Wenrui Liu, Hui-Peng Du, Qinglin Zhang, Chong Deng, Qian Chen, Wen Wang, Yang Ai, Zhen-Hua Ling

    Abstract: Existing speech tokenizers typically assign a fixed number of tokens per second, regardless of the varying information density or temporal fluctuations in the speech signal. This uniform token allocation mismatches the intrinsic structure of speech, where information is distributed unevenly over time. To address this, we propose VARSTok, a VAriable-frame-Rate Speech Tokenizer that adapts token all…

    Submitted 4 September, 2025; originally announced September 2025.

  5. arXiv:2508.10456  [pdf, ps, other]

    eess.AS

    Exploring Cross-Utterance Speech Contexts for Conformer-Transducer Speech Recognition Systems

    Authors: Mingyu Cui, Mengzhe Geng, Jiajun Deng, Chengxi Deng, Jiawen Kang, Shujie Hu, Guinan Li, Tianzi Wang, Zhaoqing Li, Xie Chen, Xunying Liu

    Abstract: This paper investigates four types of cross-utterance speech contexts modeling approaches for streaming and non-streaming Conformer-Transducer (C-T) ASR systems: i) input audio feature concatenation; ii) cross-utterance Encoder embedding concatenation; iii) cross-utterance Encoder embedding pooling projection; or iv) a novel chunk-based approach applied to C-T models for the first time. An effici…

    Submitted 14 August, 2025; originally announced August 2025.

  6. arXiv:2507.12019  [pdf, ps, other]

    cs.IT eess.SP

    The Role of Rank in Mismatched Low-Rank Symmetric Matrix Estimation

    Authors: Panpan Niu, Yuhao Liu, Teng Fu, Jie Fan, Chaowen Deng, Zhongyi Huang

    Abstract: We investigate the performance of a Bayesian statistician tasked with recovering a rank-\(k\) signal matrix \(\bS \bS^{\top} \in \mathbb{R}^{n \times n}\), corrupted by element-wise additive Gaussian noise. This problem lies at the core of numerous applications in machine learning, signal processing, and statistics. We derive an analytic expression for the asymptotic mean-square error (MSE) of the…

    Submitted 16 July, 2025; originally announced July 2025.

  7. arXiv:2506.04682   

    cs.CV eess.SP

    MARS: Radio Map Super-resolution and Reconstruction Method under Sparse Channel Measurements

    Authors: Chuyun Deng, Na Liu, Wei Xie, Lianming Xu, Li Wang

    Abstract: Radio maps reflect the spatial distribution of signal strength and are essential for applications like smart cities, IoT, and wireless network planning. However, reconstructing accurate radio maps from sparse measurements remains challenging. Traditional interpolation and inpainting methods lack environmental awareness, while many deep learning approaches depend on detailed scene data, limiting ge…

    Submitted 8 July, 2025; v1 submitted 5 June, 2025; originally announced June 2025.

    Comments: The authors withdraw this submission to substantially revise the introduction and experimental sections and incorporate new content. The manuscript has not been submitted or published elsewhere. A revised version may be submitted in the future

  8. arXiv:2505.24224  [pdf, ps, other]

    eess.AS

    MOPSA: Mixture of Prompt-Experts Based Speaker Adaptation for Elderly Speech Recognition

    Authors: Chengxi Deng, Xurong Xie, Shujie Hu, Mengzhe Geng, Yicong Jiang, Jiankun Zhao, Jiajun Deng, Guinan Li, Youjun Chen, Huimeng Wang, Haoning Xu, Mingyu Cui, Xunying Liu

    Abstract: This paper proposes a novel Mixture of Prompt-Experts based Speaker Adaptation approach (MOPSA) for elderly speech recognition. It allows zero-shot, real-time adaptation to unseen speakers, and leverages domain knowledge tailored to elderly speakers. Top-K most distinctive speaker prompt clusters derived using K-means serve as experts. A router network is trained to dynamically combine clustered p…

    Submitted 30 May, 2025; originally announced May 2025.

    Comments: Accepted by Interspeech 2025

  9. arXiv:2505.23236  [pdf, ps, other]

    cs.SD cs.HC eess.AS

    Towards LLM-Empowered Fine-Grained Speech Descriptors for Explainable Emotion Recognition

    Authors: Youjun Chen, Xurong Xie, Haoning Xu, Mengzhe Geng, Guinan Li, Chengxi Deng, Huimeng Wang, Shujie Hu, Xunying Liu

    Abstract: This paper presents a novel end-to-end LLM-empowered explainable speech emotion recognition (SER) approach. Fine-grained speech emotion descriptor (SED) features, e.g., pitch, tone and emphasis, are disentangled from HuBERT SSL representations via alternating LLM fine-tuning to joint SER-SED prediction and ASR tasks. VAE compressed HuBERT features derived via Information Bottleneck (IB) are used t…

    Submitted 29 May, 2025; originally announced May 2025.

    Comments: Accepted by Interspeech 2025

  10. arXiv:2505.22608  [pdf, ps, other]

    cs.SD cs.AI eess.AS

    Effective and Efficient One-pass Compression of Speech Foundation Models Using Sparsity-aware Self-pinching Gates

    Authors: Haoning Xu, Zhaoqing Li, Youjun Chen, Huimeng Wang, Guinan Li, Mengzhe Geng, Chengxi Deng, Xunying Liu

    Abstract: This paper presents a novel approach for speech foundation models compression that tightly integrates model pruning and parameter update into a single stage. Highly compact layer-level tied self-pinching gates each containing only a single learnable threshold are jointly trained with uncompressed models and used in fine-grained neuron level pruning. Experiments conducted on the LibriSpeech-100hr c…

    Submitted 28 May, 2025; originally announced May 2025.

    Comments: Submitted to Interspeech 2025

  11. arXiv:2505.22072  [pdf, other]

    cs.SD eess.AS

    On-the-fly Routing for Zero-shot MoE Speaker Adaptation of Speech Foundation Models for Dysarthric Speech Recognition

    Authors: Shujie Hu, Xurong Xie, Mengzhe Geng, Jiajun Deng, Huimeng Wang, Guinan Li, Chengxi Deng, Tianzi Wang, Mingyu Cui, Helen Meng, Xunying Liu

    Abstract: This paper proposes a novel MoE-based speaker adaptation framework for foundation models based dysarthric speech recognition. This approach enables zero-shot adaptation and real-time processing while incorporating domain knowledge. Speech impairment severity and gender conditioned adapter experts are dynamically combined using on-the-fly predicted speaker-dependent routing parameters. KL-divergenc…

    Submitted 28 May, 2025; originally announced May 2025.

    Comments: Accepted by Interspeech 2025

  12. arXiv:2505.13826  [pdf, ps, other]

    eess.AS cs.SD

    Pushing the Frontiers of Self-Distillation Prototypes Network with Dimension Regularization and Score Normalization

    Authors: Yafeng Chen, Chong Deng, Hui Wang, Yiheng Jiang, Han Yin, Qian Chen, Wen Wang

    Abstract: Developing robust speaker verification (SV) systems without speaker labels has been a longstanding challenge. Earlier research has highlighted a considerable performance gap between self-supervised and fully supervised approaches. In this paper, we enhance the non-contrastive self-supervised framework, Self-Distillation Prototypes Network (SDPN), by introducing dimension regularization that explic…

    Submitted 19 May, 2025; originally announced May 2025.

  13. arXiv:2502.06156  [pdf, ps, other]

    hep-ph eess.SY

    Axial current as the origin of quantum intrinsic orbital angular momentum

    Authors: Orkash Amat, Nurimangul Nurmamat, Yong-Feng Huang, Cheng-Ming Li, Jin-Jun Geng, Chen-Ran Hu, Ze-Cheng Zou, Xiao-Fei Dong, Chen Deng, Fan Xu, Xiao-li Zhang, Chen Du

    Abstract: We show that it is impossible to experimentally observe the quantum intrinsic orbital angular momentum (IOAM) effect without its axial current. Broadly speaking, we argue that the spiral or interference characteristics of the axial current density determine the occurrence of nonlinear or tunneling effects in any spacetime-dependent quantum systems. Our findings offer a comprehensive theoretical fra…

    Submitted 10 February, 2025; originally announced February 2025.

    Comments: 5 pages, 2 figures

  14. arXiv:2501.06282  [pdf, other]

    cs.CL cs.AI cs.HC cs.SD eess.AS

    MinMo: A Multimodal Large Language Model for Seamless Voice Interaction

    Authors: Qian Chen, Yafeng Chen, Yanni Chen, Mengzhe Chen, Yingda Chen, Chong Deng, Zhihao Du, Ruize Gao, Changfeng Gao, Zhifu Gao, Yabin Li, Xiang Lv, Jiaqing Liu, Haoneng Luo, Bin Ma, Chongjia Ni, Xian Shi, Jialong Tang, Hui Wang, Hao Wang, Wen Wang, Yuxuan Wang, Yunlan Xu, Fan Yu, Zhijie Yan, et al. (11 additional authors not shown)

    Abstract: Recent advancements in large language models (LLMs) and multimodal speech-text models have laid the groundwork for seamless voice interactions, enabling real-time, natural, and human-like conversations. Previous models for voice interactions are categorized as native and aligned. Native models integrate speech and text processing in one framework but struggle with issues like differing sequence le…

    Submitted 10 January, 2025; originally announced January 2025.

    Comments: Work in progress. Authors are listed in alphabetical order by family name

  15. arXiv:2412.10117  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

    Authors: Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xiang Lv, Tianyu Zhao, Zhifu Gao, Yexin Yang, Changfeng Gao, Hui Wang, Fan Yu, Huadai Liu, Zhengyan Sheng, Yue Gu, Chong Deng, Wen Wang, Shiliang Zhang, Zhijie Yan, Jingren Zhou

    Abstract: In our previous work, we introduced CosyVoice, a multilingual speech synthesis model based on supervised discrete speech tokens. By employing progressive semantic decoding with two popular generative models, language models (LMs) and Flow Matching, CosyVoice demonstrated high prosody naturalness, content consistency, and speaker similarity in speech in-context learning. Recently, significant progr…

    Submitted 25 December, 2024; v1 submitted 13 December, 2024; originally announced December 2024.

    Comments: Tech report, work in progress

  16. arXiv:2410.17799  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation

    Authors: Qinglin Zhang, Luyao Cheng, Chong Deng, Qian Chen, Wen Wang, Siqi Zheng, Jiaqing Liu, Hai Yu, Chaohong Tan, Zhihao Du, Shiliang Zhang

    Abstract: Full-duplex spoken dialogue systems significantly surpass traditional turn-based dialogue systems, as they allow simultaneous bidirectional communication, closely mirroring human-human interactions. However, achieving low latency and natural interactions in full-duplex dialogue systems remains a significant challenge, especially considering human conversation dynamics such as interruptions, backch…

    Submitted 3 January, 2025; v1 submitted 23 October, 2024; originally announced October 2024.

    Comments: Work in progress

  17. arXiv:2409.13292  [pdf, other]

    eess.AS cs.SD

    Exploring Text-Queried Sound Event Detection with Audio Source Separation

    Authors: Han Yin, Jisheng Bai, Yang Xiao, Hui Wang, Siqi Zheng, Yafeng Chen, Rohan Kumar Das, Chong Deng, Jianfeng Chen

    Abstract: In sound event detection (SED), overlapping sound events pose a significant challenge, as certain events can be easily masked by background noise or other events, resulting in poor detection performance. To address this issue, we propose the text-queried SED (TQ-SED) framework. Specifically, we first pre-train a language-queried audio source separation (LASS) model to separate the audio tracks cor…

    Submitted 10 January, 2025; v1 submitted 20 September, 2024; originally announced September 2024.

    Comments: Accepted by ICASSP 2025

  18. arXiv:2408.00365  [pdf, other]

    cs.AI cs.CV eess.IV

    Multimodal Fusion and Coherence Modeling for Video Topic Segmentation

    Authors: Hai Yu, Chong Deng, Qinglin Zhang, Jiaqing Liu, Qian Chen, Wen Wang

    Abstract: The video topic segmentation (VTS) task segments videos into intelligible, non-overlapping topics, facilitating efficient comprehension of video content and quick access to specific content. VTS is also critical to various downstream video understanding tasks. Traditional VTS methods using shallow features or unsupervised approaches struggle to accurately discern the nuances of topical transitions…

    Submitted 29 December, 2024; v1 submitted 1 August, 2024; originally announced August 2024.

  19. arXiv:2407.04051  [pdf, other]

    cs.SD cs.AI eess.AS

    FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

    Authors: Keyu An, Qian Chen, Chong Deng, Zhihao Du, Changfeng Gao, Zhifu Gao, Yue Gu, Ting He, Hangrui Hu, Kai Hu, Shengpeng Ji, Yabin Li, Zerui Li, Heng Lu, Haoneng Luo, Xiang Lv, Bin Ma, Ziyang Ma, Chongjia Ni, Changhe Song, Jiaqi Shi, Xian Shi, Hao Wang, Wen Wang, Yuxuan Wang, et al. (8 additional authors not shown)

    Abstract: This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, sp…

    Submitted 10 July, 2024; v1 submitted 4 July, 2024; originally announced July 2024.

    Comments: Work in progress. Authors are listed in alphabetical order by family name

  20. arXiv:2406.09444  [pdf, other]

    eess.AS cs.CL cs.SD

    GenDistiller: Distilling Pre-trained Language Models based on an Autoregressive Generative Model

    Authors: Yingying Gao, Shilei Zhang, Chao Deng, Junlan Feng

    Abstract: Pre-trained speech language models such as HuBERT and WavLM leverage unlabeled speech data for self-supervised learning and offer powerful representations for numerous downstream tasks. Despite the success of these models, their high requirements for memory and computing resource hinder their application on resource restricted devices. Therefore, this paper introduces GenDistiller, a novel knowled…

    Submitted 21 June, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

    Comments: arXiv admin note: text overlap with arXiv:2310.13418

  21. arXiv:2406.07801  [pdf, other]

    cs.CL cs.SD eess.AS

    PolySpeech: Exploring Unified Multitask Speech Models for Competitiveness with Single-task Models

    Authors: Runyan Yang, Huibao Yang, Xiqing Zhang, Tiantian Ye, Ying Liu, Yingying Gao, Shilei Zhang, Chao Deng, Junlan Feng

    Abstract: Recently, there have been attempts to integrate various speech processing tasks into a unified model. However, few previous works directly demonstrated that joint optimization of diverse tasks in multitask speech models has positive influence on the performance of individual tasks. In this paper we present a multitask speech model -- PolySpeech, which supports speech recognition, speech synthesis,…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: 5 pages, 2 figures

  22. arXiv:2405.10463  [pdf, other]

    physics.optics eess.IV physics.bio-ph

    Single-shot volumetric fluorescence imaging with neural fields

    Authors: Oumeng Zhang, Haowen Zhou, Brandon Y. Feng, Elin M. Larsson, Reinaldo E. Alcalde, Siyuan Yin, Catherine Deng, Changhuei Yang

    Abstract: Single-shot volumetric fluorescence (SVF) imaging offers a significant advantage over traditional imaging methods that require scanning across multiple axial planes as it can capture biological processes with high temporal resolution. The key challenges in SVF imaging include requiring sparsity constraints, eliminating depth ambiguity in the reconstruction, and maintaining high resolution across a…

    Submitted 21 January, 2025; v1 submitted 16 May, 2024; originally announced May 2024.

  23. arXiv:2403.19971  [pdf, other]

    eess.AS eess.SP

    3D-Speaker-Toolkit: An Open-Source Toolkit for Multimodal Speaker Verification and Diarization

    Authors: Yafeng Chen, Siqi Zheng, Hui Wang, Luyao Cheng, Tinglong Zhu, Rongjie Huang, Chong Deng, Qian Chen, Shiliang Zhang, Wen Wang, Xihao Li

    Abstract: We introduce 3D-Speaker-Toolkit, an open-source toolkit for multimodal speaker verification and diarization, designed for meeting the needs of academic researchers and industrial practitioners. The 3D-Speaker-Toolkit adeptly leverages the combined strengths of acoustic, semantic, and visual data, seamlessly fusing these modalities to offer robust speaker recognition capabilities. The acoustic modu…

    Submitted 26 December, 2024; v1 submitted 29 March, 2024; originally announced March 2024.

  24. arXiv:2402.12746  [pdf, ps, other]

    eess.AS cs.SD

    Plugin Speech Enhancement: A Universal Speech Enhancement Framework Inspired by Dynamic Neural Network

    Authors: Yanan Chen, Zihao Cui, Yingying Gao, Junlan Feng, Chao Deng, Shilei Zhang

    Abstract: The expectation to deploy a universal neural network for speech enhancement, with the aim of improving noise robustness across diverse speech processing tasks, faces challenges due to the existing lack of awareness within static speech enhancement frameworks regarding the expected speech in downstream modules. These limitations impede the effectiveness of static speech enhancement approaches in ac…

    Submitted 20 February, 2024; originally announced February 2024.

  25. arXiv:2401.14421  [pdf, other]

    cs.LG cs.MA eess.SY stat.ML

    Multi-Agent Based Transfer Learning for Data-Driven Air Traffic Applications

    Authors: Chuhao Deng, Hong-Cheol Choi, Hyunsang Park, Inseok Hwang

    Abstract: Research in developing data-driven models for Air Traffic Management (ATM) has gained a tremendous interest in recent years. However, data-driven models are known to have long training time and require large datasets to achieve good performance. To address the two issues, this paper proposes a Multi-Agent Bidirectional Encoder Representations from Transformers (MA-BERT) model that fully considers…

    Submitted 23 January, 2024; originally announced January 2024.

    Comments: 12 pages, 8 figures, submitted for IEEE Transactions on Intelligent Transportation System

  26. arXiv:2311.04534  [pdf, other]

    cs.CL cs.SD eess.AS

    Loss Masking Is Not Needed in Decoder-only Transformer for Discrete-token-based ASR

    Authors: Qian Chen, Wen Wang, Qinglin Zhang, Siqi Zheng, Shiliang Zhang, Chong Deng, Yukun Ma, Hai Yu, Jiaqing Liu, Chong Zhang

    Abstract: Recently, unified speech-text models, such as SpeechGPT, VioLA, and AudioPaLM, have achieved remarkable performance on various speech tasks. These models discretize speech signals into tokens (speech discretization) and use a shared vocabulary for both text and speech tokens. Then they train a single decoder-only Transformer on a mixture of speech tasks. However, these models rely on the Loss Mask…

    Submitted 4 February, 2024; v1 submitted 8 November, 2023; originally announced November 2023.

    Comments: 5 pages, accepted by ICASSP 2024

  27. arXiv:2310.17664  [pdf, other]

    cs.LG eess.AS eess.SP

    Cascaded Multi-task Adaptive Learning Based on Neural Architecture Search

    Authors: Yingying Gao, Shilei Zhang, Zihao Cui, Chao Deng, Junlan Feng

    Abstract: Cascading multiple pre-trained models is an effective way to compose an end-to-end system. However, fine-tuning the full cascaded model is parameter and memory inefficient and our observations reveal that only applying adapter modules on cascaded model can not achieve considerable performance as fine-tuning. We propose an automatic and effective adaptive learning method to optimize end-to-end casc…

    Submitted 23 October, 2023; originally announced October 2023.

  28. arXiv:2310.13418  [pdf, other]

    eess.AS eess.SP

    GenDistiller: Distilling Pre-trained Language Models based on Generative Models

    Authors: Yingying Gao, Shilei Zhang, Zihao Cui, Yanhan Xu, Chao Deng, Junlan Feng

    Abstract: Self-supervised pre-trained models such as HuBERT and WavLM leverage unlabeled speech data for representation learning and offer significant improvements for numerous downstream tasks. Despite the success of these methods, their large memory and strong computational requirements hinder their application on resource restricted devices. Therefore, this paper introduces GenDistiller, a novel knowledge d…

    Submitted 20 October, 2023; originally announced October 2023.

  29. arXiv:2308.02774  [pdf, other]

    eess.AS cs.SD

    Self-Distillation Prototypes Network: Learning Robust Speaker Representations without Supervision

    Authors: Yafeng Chen, Siqi Zheng, Hui Wang, Luyao Cheng, Qian Chen, Chong Deng, Shiliang Zhang, Wen Wang

    Abstract: Training speaker-discriminative and robust speaker verification systems without explicit speaker labels remains a persistent challenge. In this paper, we propose a novel self-supervised speaker verification approach, Self-Distillation Prototypes Network (SDPN), which effectively facilitates self-supervised speaker representation learning. SDPN assigns the representation of the augmented views of a…

    Submitted 26 December, 2024; v1 submitted 4 August, 2023; originally announced August 2023.

    Comments: arXiv admin note: text overlap with arXiv:2211.04168

  30. arXiv:2305.10821  [pdf, other]

    eess.AS

    Locate and Beamform: Two-dimensional Locating All-neural Beamformer for Multi-channel Speech Separation

    Authors: Yanjie Fu, Meng Ge, Honglong Wang, Nan Li, Haoran Yin, Longbiao Wang, Gaoyan Zhang, Jianwu Dang, Chengyun Deng, Fei Wang

    Abstract: Recently, stunning improvements on multi-channel speech separation have been achieved by neural beamformers when direction information is available. However, most of them neglect to utilize speaker's 2-dimensional (2D) location cues contained in mixture signal, which limits the performance when two sources come from close directions. In this paper, we propose an end-to-end beamforming network for…

    Submitted 2 June, 2023; v1 submitted 18 May, 2023; originally announced May 2023.

    Comments: Accepted by Interspeech 2023. arXiv admin note: substantial text overlap with arXiv:2212.03401

  31. arXiv:2303.13932  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Overview of the ICASSP 2023 General Meeting Understanding and Generation Challenge (MUG)

    Authors: Qinglin Zhang, Chong Deng, Jiaqing Liu, Hai Yu, Qian Chen, Wen Wang, Zhijie Yan, Jinglin Liu, Yi Ren, Zhou Zhao

    Abstract: ICASSP2023 General Meeting Understanding and Generation Challenge (MUG) focuses on prompting a wide range of spoken language processing (SLP) research on meeting transcripts, as SLP applications are critical to improve users' efficiency in grasping important information in meetings. MUG includes five tracks, including topic segmentation, topic-level and session-level extractive summarization, topi…

    Submitted 24 March, 2023; originally announced March 2023.

    Comments: Paper accepted to the 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2023), Rhodes, Greece

  32. arXiv:2303.00952  [pdf, other]

    cs.CV cs.RO eess.IV

    Towards Activated Muscle Group Estimation in the Wild

    Authors: Kunyu Peng, David Schneider, Alina Roitberg, Kailun Yang, Jiaming Zhang, Chen Deng, Kaiyu Zhang, M. Saquib Sarfraz, Rainer Stiefelhagen

    Abstract: In this paper, we tackle the new task of video-based Activated Muscle Group Estimation (AMGE) aiming at identifying active muscle regions during physical activity in the wild. To this intent, we provide the MuscleMap dataset featuring >15K video clips with 135 different activities and 20 labeled muscle groups. This dataset opens the vistas to multiple video-based applications in sports and rehabil…

    Submitted 5 August, 2024; v1 submitted 1 March, 2023; originally announced March 2023.

    Comments: Accepted to ACM MM 2024. The database and code can be found at https://github.com/KPeng9510/MuscleMap

  33. arXiv:2212.03401  [pdf, other]

    eess.AS cs.LG cs.SD

    MIMO-DBnet: Multi-channel Input and Multiple Outputs DOA-aware Beamforming Network for Speech Separation

    Authors: Yanjie Fu, Haoran Yin, Meng Ge, Longbiao Wang, Gaoyan Zhang, Jianwu Dang, Chengyun Deng, Fei Wang

    Abstract: Recently, many deep learning based beamformers have been proposed for multi-channel speech separation. Nevertheless, most of them rely on extra cues known in advance, such as speaker feature, face image or directional information. In this paper, we propose an end-to-end beamforming network for direction guided speech separation given merely the mixture signal, namely MIMO-DBnet. Specifically, we d…

    Submitted 6 December, 2022; originally announced December 2022.

    Comments: Submitted to ICASSP 2023

  34. arXiv:2211.00206  [pdf]

    eess.SY

    A Primary Frequency Control Strategy for Variable-Speed Pumped-Storage Plant in Power Generation Based on Adaptive Model Predictive Control

    Authors: Zhenghua Xu, Changhong Deng, Qiuling Yang

    Abstract: Variable-speed pumped-storage (VSPS) has great potential in helping solve the frequency control problem caused by low inertia, owing to its remarkable flexibility beyond conventional fixed-speed one, to make better use of which, a primary frequency control strategy based on adaptive model predictive control (AMPC) is proposed in this paper for VSPS plant in power generation.

    Submitted 31 October, 2022; originally announced November 2022.

    Comments: 8 pages, 9 figures

  35. arXiv:2210.01434  [pdf, ps, other]

    eess.SP

    Beamforming Design and Trajectory Optimization for UAV-Empowered Adaptable Integrated Sensing and Communication

    Authors: Cailian Deng, Xuming Fang, Xianbin Wang

    Abstract: Unmanned aerial vehicle (UAV) has high flexibility and controllable mobility, therefore it is considered as a promising enabler for future integrated sensing and communication (ISAC). In this paper, we propose a novel adaptable ISAC (AISAC) mechanism in the UAV-enabled system, where the UAV performs sensing on demand during communication and the sensing duration is configured flexibly according to…

    Submitted 4 October, 2022; originally announced October 2022.

    Comments: This work has been submitted to the IEEE for possible publication

  36. arXiv:2209.13915  [pdf, ps, other]

    eess.SP

    Joint Optimization of Resource Allocation and Trajectory Control for Mobile Group Users in Fixed-Wing UAV-Enabled Wireless Network

    Authors: Xuezhen Yan, Xuming Fang, Cailian Deng, Xianbin Wang

    Abstract: Owing to the controlling flexibility and cost-effectiveness, fixed-wing unmanned aerial vehicles (UAVs) are expected to serve as flying base stations (BSs) in the air-ground integrated network. By exploiting the mobility of UAVs, controllable coverage can be provided for mobile group users (MGUs) under challenging scenarios or even somewhere without communication infrastructure. However, in such d…

    Submitted 28 September, 2022; originally announced September 2022.

    Comments: 30 pages, 9 figures

  37. arXiv:2208.13952  [pdf, other]

    eess.SP physics.optics

    Micro-Vibration Modes Reconstruction Based on Micro-Doppler Coincidence Imaging

    Authors: Shuang Liu, Chenjin Deng, Chaoran Wang, Zunwang Bo, Shensheng Han, Zihuai Lin

    Abstract: Micro-vibration, a ubiquitous nature phenomenon, can be seen as a characteristic feature on the objects, these vibrations always have tiny amplitudes which are much less than the wavelengths of the sensing systems, thus these motions information can only be reflected in the phase item of echo. Normally the conventional radar system can detect these micro vibrations through the time frequency analy…

    Submitted 29 August, 2022; originally announced August 2022.

  38. arXiv:2206.12774  [pdf, other]

    eess.AS cs.CL cs.SD

    Meta Auxiliary Learning for Low-resource Spoken Language Understanding

    Authors: Yingying Gao, Junlan Feng, Chao Deng, Shilei Zhang

    Abstract: Spoken language understanding (SLU) treats automatic speech recognition (ASR) and natural language understanding (NLU) as a unified task and usually suffers from data scarcity. We exploit an ASR and NLU joint training method based on meta auxiliary learning to improve the performance of low-resource SLU task by only taking advantage of abundant manual transcriptions of speech data. One obvious adv…

    Submitted 25 June, 2022; originally announced June 2022.

  39. arXiv:2206.08031  [pdf, other]

    eess.AS

    A CTC Triggered Siamese Network with Spatial-Temporal Dropout for Speech Recognition

    Authors: Yingying Gao, Junlan Feng, Tianrui Wang, Chao Deng, Shilei Zhang

    Abstract: Siamese networks have shown effective results in unsupervised visual representation learning. These models are designed to learn an invariant representation of two augmentations for one input by maximizing their similarity. In this paper, we propose an effective Siamese network to improve the robustness of End-to-End automatic speech recognition (ASR). We introduce spatial-temporal dropout to supp…

    Submitted 22 June, 2022; v1 submitted 16 June, 2022; originally announced June 2022.

  40. arXiv:2202.04250  [pdf, other]

    cs.NI eess.SP

    GenAD: General Representations of Multivariate Time Series for Anomaly Detection

    Authors: Xiaolei Hua, Lin Zhu, Shenglin Zhang, Zeyan Li, Su Wang, Dong Zhou, Shuo Wang, Chao Deng

    Abstract: The reliability of wireless base stations in China Mobile is of vital importance, because the cell phone users are connected to the stations and the behaviors of the stations are directly related to user experience. Although the monitoring of the station behaviors can be realized by anomaly detection on multivariate time series, due to complex correlations and various temporal patterns of multivar…

    Submitted 8 February, 2022; originally announced February 2022.

  41. arXiv:2111.14220  [pdf, other]

    cs.LG eess.SP

    On the Robustness and Generalization of Deep Learning Driven Full Waveform Inversion

    Authors: Chengyuan Deng, Youzuo Lin

    Abstract: The data-driven approach has been demonstrated as a promising technique to solve complicated scientific problems. Full Waveform Inversion (FWI) is commonly epitomized as an image-to-image translation task, which motivates the use of deep neural networks as an end-to-end solution. Despite being trained with synthetic data, the deep learning-driven FWI is expected to perform well when evaluated with…

    Submitted 28 November, 2021; originally announced November 2021.

  42. arXiv:2111.02926  [pdf, other]

    cs.LG eess.SP

    OpenFWI: Large-Scale Multi-Structural Benchmark Datasets for Seismic Full Waveform Inversion

    Authors: Chengyuan Deng, Shihang Feng, Hanchen Wang, Xitong Zhang, Peng Jin, Yinan Feng, Qili Zeng, Yinpeng Chen, Youzuo Lin

    Abstract: Full waveform inversion (FWI) is widely used in geophysics to reconstruct high-resolution velocity maps from seismic data. The recent success of data-driven FWI methods results in a rapidly increasing demand for open datasets to serve the geophysics community. We present OpenFWI, a collection of large-scale multi-structural benchmark datasets, to facilitate diversified, rigorous, and reproducible…

    Submitted 23 June, 2023; v1 submitted 4 November, 2021; originally announced November 2021.

    Comments: This manuscript has been accepted by NeurIPS 2022 dataset and benchmark track

  43. arXiv:2106.15765  [pdf, other]

    eess.IV cs.CV physics.optics

    10-mega pixel snapshot compressive imaging with a hybrid coded aperture

    Authors: Zhihong Zhang, Chao Deng, Yang Liu, Xin Yuan, Jinli Suo, Qionghai Dai

    Abstract: High resolution images are widely used in our daily life, whereas high-speed video capture is challenging due to the low frame rate of cameras working at the high resolution mode. Digging deeper, the main bottleneck lies in the low throughput of existing imaging systems. Towards this end, snapshot compressive imaging (SCI) was proposed as a promising solution to improve the throughput of imaging s…

    Submitted 15 August, 2021; v1 submitted 29 June, 2021; originally announced June 2021.

    Comments: 11 pages, 8 figures, accepted by Photonics Research

  44. arXiv:2011.02109  [pdf]

    eess.AS

    Deep Multi-task Network for Delay Estimation and Echo Cancellation

    Authors: Yi Zhang, Chengyun Deng, Shiqian Ma, Yongtao Sha, Hui Song

    Abstract: Echo path delay (or ref-delay) estimation is a big challenge in acoustic echo cancellation. Different devices may introduce various ref-delays in practice. Ref-delay inconsistency slows down the convergence of adaptive filters, and also degrades the performance of deep learning models due to 'unseen' ref-delays in the training set. In this paper, a multi-task network is proposed to address both ref…

    Submitted 11 August, 2022; v1 submitted 3 November, 2020; originally announced November 2020.

    Comments: Accepted by Interspeech 2020

  45. arXiv:2011.02102  [pdf, other]

    eess.AS

    Robust Speaker Extraction Network Based on Iterative Refined Adaptation

    Authors: Chengyun Deng, Shiqian Ma, Yi Zhang, Yongtao Sha, Hui Zhang, Hui Song, Xiangang Li

    Abstract: Speaker extraction aims to extract the target speech signal from a multi-talker environment with interference speakers and surrounding noise, given the target speaker's reference information. Most speaker extraction systems achieve satisfactory performance on the premise that the test speakers have been encountered during training time. Such systems suffer from performance degradation given unseen tar…

    Submitted 11 August, 2022; v1 submitted 3 November, 2020; originally announced November 2020.

    Comments: Accepted by Interspeech 2021

  46. arXiv:2007.14974  [pdf, other]

    eess.AS cs.SD

    On Loss Functions and Recurrency Training for GAN-based Speech Enhancement Systems

    Authors: Zhuohuang Zhang, Chengyun Deng, Yi Shen, Donald S. Williamson, Yongtao Sha, Yi Zhang, Hui Song, Xiangang Li

    Abstract: Recent work has shown that it is feasible to use generative adversarial networks (GANs) for speech enhancement; however, these approaches have not been compared to state-of-the-art (SOTA) non GAN-based approaches. Additionally, many loss functions have been proposed for GAN-based approaches, but they have not been adequately compared. In this study, we propose novel convolutional recurrent GAN (CR…

    Submitted 26 December, 2020; v1 submitted 29 July, 2020; originally announced July 2020.

    Comments: Accepted by Interspeech 2020, 5 pages, 2 figures

  47. arXiv:2007.13401  [pdf, ps, other]

    eess.SP

    IEEE 802.11be-Wi-Fi 7: New Challenges and Opportunities

    Authors: Cailian Deng, Xuming Fang, Xiao Han, Xianbin Wang, Li Yan, Rong He, Yan Long, Yuchen Guo

    Abstract: With the emergence of 4k/8k video, the throughput requirement of video delivery will keep growing to tens of Gbps. Other new high-throughput and low-latency video applications, including augmented reality (AR), virtual reality (VR), and online gaming, are also proliferating. Due to the related stringent requirements, supporting these applications over a wireless local area network (WLAN) is far beyond t…

    Submitted 3 August, 2020; v1 submitted 27 July, 2020; originally announced July 2020.

    Comments: Accepted for publication in IEEE Communications Surveys and Tutorials

  48. arXiv:1912.01852  [pdf, other]

    cs.SD cs.CL eess.AS

    PitchNet: Unsupervised Singing Voice Conversion with Pitch Adversarial Network

    Authors: Chengqi Deng, Chengzhu Yu, Heng Lu, Chao Weng, Dong Yu

    Abstract: Singing voice conversion converts one singer's voice into another's without changing the singing content. Recent work shows that unsupervised singing voice conversion can be achieved with an autoencoder-based approach [1]. However, the converted singing voice can easily be out of key, showing that the existing approach cannot model the pitch information precisely. In this paper, we propose…

    Submitted 18 February, 2020; v1 submitted 4 December, 2019; originally announced December 2019.

    Comments: Accepted by ICASSP 2020

  49. arXiv:1901.07042  [pdf, other]

    cs.CV cs.LG eess.IV

    MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs

    Authors: Alistair E. W. Johnson, Tom J. Pollard, Nathaniel R. Greenbaum, Matthew P. Lungren, Chih-ying Deng, Yifan Peng, Zhiyong Lu, Roger G. Mark, Seth J. Berkowitz, Steven Horng

    Abstract: Chest radiography is an extremely powerful imaging modality, allowing for a detailed inspection of a patient's thorax, but requiring specialized training for proper interpretation. With the advent of high performance general purpose computer vision algorithms, the accurate automated analysis of chest radiographs is becoming increasingly of interest to researchers. However, a key challenge in the d…

    Submitted 14 November, 2019; v1 submitted 21 January, 2019; originally announced January 2019.

  50. arXiv:1811.03455  [pdf, other]

    eess.IV

    High fidelity single-pixel imaging

    Authors: Chao Deng, Xuemei Hu, Xiaoxu Li, Jinli Suo, Zhili Zhang, Qionghai Dai

    Abstract: Single-pixel imaging (SPI) is an emerging technique which has attracted wide attention in various research fields. However, restricted by the low reconstruction quality and large number of measurements, its practical application is still in its infancy. Inspired by the fact that natural scenes exhibit unique degenerate structures in the low dimensional subspace, we propose to take advantage of the…

    Submitted 7 November, 2018; originally announced November 2018.

    Comments: 5 pages, 6 figures