Skip to main content

Showing 1–50 of 187 results for author: Yu, D

Searching in archive eess. Search in all archives.
.
  1. arXiv:2506.21555  [pdf, ps, other

    cs.CL cs.SD eess.AS

    Efficient Multilingual ASR Finetuning via LoRA Language Experts

    Authors: Jiahong Li, Yiwen Shao, Jianheng Zhuo, Chenda Li, Liliang Tang, Dong Yu, Yanmin Qian

    Abstract: Recent advancements in deep learning have significantly enhanced multilingual automatic speech recognition (ASR) due to the development of advanced model architectures and available large-scale multilingual datasets. Despite that, multilingual ASR still suffers from the curse of multilinguality in that different languages tend to interfere with each other, making it difficult for the ASR model to… ▽ More

    Submitted 11 June, 2025; originally announced June 2025.

    Comments: Accepted in Interspeech 2025

  2. arXiv:2506.07520  [pdf, ps, other

    cs.SD cs.AI eess.AS

    LeVo: High-Quality Song Generation with Multi-Preference Alignment

    Authors: Shun Lei, Yaoxun Xu, Zhiwei Lin, Huaicheng Zhang, Wei Tan, Hangting Chen, Jianwei Yu, Yixuan Zhang, Chenyu Yang, Haina Zhu, Shuai Wang, Zhiyong Wu, Dong Yu

    Abstract: Recent advances in large language models (LLMs) and audio language models have significantly improved music generation, particularly in lyrics-to-song generation. However, existing approaches still struggle with the complex composition of songs and the scarcity of high-quality data, leading to limitations in sound quality, musicality, instruction following, and vocal-instrument harmony. To address… ▽ More

    Submitted 15 June, 2025; v1 submitted 9 June, 2025; originally announced June 2025.

  3. arXiv:2506.05891  [pdf, ps, other

    cs.SD eess.AS

    WAKE: Watermarking Audio with Key Enrichment

    Authors: Yaoxun Xu, Jianwei Yu, Hangting Chen, Zhiyong Wu, Xixin Wu, Dong Yu, Rongzhi Gu, Yi Luo

    Abstract: As deep learning advances in audio generation, challenges in audio security and copyright protection highlight the need for robust audio watermarking. Recent neural network-based methods have made progress but still face three main issues: preventing unauthorized access, decoding initial watermarks after multiple embeddings, and embedding varying lengths of watermarks. To address these issues, we… ▽ More

    Submitted 6 June, 2025; originally announced June 2025.

    Comments: Accepted by InterSpeech2025

  4. arXiv:2505.22045  [pdf, ps, other

    cs.MM cs.CV cs.SD eess.AS

    Mitigating Audiovisual Mismatch in Visual-Guide Audio Captioning

    Authors: Le Xu, Chenxing Li, Yong Ren, Yujie Chen, Yu Gu, Ruibo Fu, Shan Yang, Dong Yu

    Abstract: Current vision-guided audio captioning systems frequently fail to address audiovisual misalignment in real-world scenarios, such as dubbed content or off-screen sounds. To bridge this critical gap, we present an entropy-aware gated fusion framework that dynamically modulates visual information flow through cross-modal uncertainty quantification. Our novel approach employs attention entropy analysi… ▽ More

    Submitted 28 May, 2025; originally announced May 2025.

    Comments: Accepted by INTERSPEECH 2025

  5. arXiv:2505.21527  [pdf, ps, other

    eess.AS cs.AI cs.CL cs.SD

    VietASR: Achieving Industry-level Vietnamese ASR with 50-hour labeled data and Large-Scale Speech Pretraining

    Authors: Jianheng Zhuo, Yifan Yang, Yiwen Shao, Yong Xu, Dong Yu, Kai Yu, Xie Chen

    Abstract: Automatic speech recognition (ASR) has made remarkable progress but heavily relies on large-scale labeled data, which is scarce for low-resource languages like Vietnamese. While existing systems such as Whisper, USM, and MMS achieve promising performance, their efficacy remains inadequate in terms of training costs, latency, and accessibility. To address these issues, we propose VietASR, a novel A… ▽ More

    Submitted 29 May, 2025; v1 submitted 23 May, 2025; originally announced May 2025.

  6. arXiv:2505.13062  [pdf, other

    cs.MM cs.SD eess.AS

    Hearing from Silence: Reasoning Audio Descriptions from Silent Videos via Vision-Language Model

    Authors: Yong Ren, Chenxing Li, Le Xu, Hao Gu, Duzhen Zhang, Yujie Chen, Manjie Xu, Ruibo Fu, Shan Yang, Dong Yu

    Abstract: Humans can intuitively infer sounds from silent videos, but whether multimodal large language models can perform modal-mismatch reasoning without accessing target modalities remains relatively unexplored. Current text-assisted-video-to-audio (VT2A) methods excel in video foley tasks but struggle to acquire audio descriptions during inference. We introduce the task of Reasoning Audio Descriptions f… ▽ More

    Submitted 27 May, 2025; v1 submitted 19 May, 2025; originally announced May 2025.

    Comments: Accepted by Interspeech 2025

  7. arXiv:2504.08661  [pdf, other

    cs.RO eess.SY

    Safe Flow Matching: Robot Motion Planning with Control Barrier Functions

    Authors: Xiaobing Dai, Zewen Yang, Dian Yu, Shanshan Zhang, Hamid Sadeghian, Sami Haddadin, Sandra Hirche

    Abstract: Recent advances in generative modeling have led to promising results in robot motion planning, particularly through diffusion and flow matching (FM)-based models that capture complex, multimodal trajectory distributions. However, these methods are typically trained offline and remain limited when faced with new environments with constraints, often lacking explicit mechanisms to ensure safety durin… ▽ More

    Submitted 3 May, 2025; v1 submitted 11 April, 2025; originally announced April 2025.

  8. Overview of Variable Rate Coding in JPEG AI

    Authors: Panqi Jia, Fabian Brand, Dequan Yu, Alexander Karabutov, Elena Alshina, Andre Kaup

    Abstract: Empirical evidence has demonstrated that learning-based image compression can outperform classical compression frameworks. This has led to the ongoing standardization of learned-based image codecs, namely Joint Photographic Experts Group (JPEG) AI. The objective of JPEG AI is to enhance compression efficiency and provide a software and hardwarefriendly solution. Based on our research, JPEG AI repr… ▽ More

    Submitted 20 March, 2025; originally announced March 2025.

  9. arXiv:2503.12936  [pdf, other

    eess.AS

    FNSE-SBGAN: Far-field Speech Enhancement with Schrodinger Bridge and Generative Adversarial Networks

    Authors: Tong Lei, Qinwen Hu, Ziyao Lin, Andong Li, Rilin Chen, Meng Yu, Dong Yu, Jing Lu

    Abstract: The prevailing method for neural speech enhancement predominantly utilizes fully-supervised deep learning with simulated pairs of far-field noisy-reverberant speech and clean speech. Nonetheless, these models frequently demonstrate restricted generalizability to mixtures recorded in real-world conditions. To address this issue, this study investigates training enhancement models directly on real m… ▽ More

    Submitted 15 April, 2025; v1 submitted 17 March, 2025; originally announced March 2025.

    Comments: 13 pages, 6 figures

  10. arXiv:2502.16897  [pdf, other

    eess.AS

    Balancing Speech Understanding and Generation Using Continual Pre-training for Codec-based Speech LLM

    Authors: Jiatong Shi, Chunlei Zhang, Jinchuan Tian, Junrui Ni, Hao Zhang, Shinji Watanabe, Dong Yu

    Abstract: Recent efforts have extended textual LLMs to the speech domain. Yet, a key challenge remains, which is balancing speech understanding and generation while avoiding catastrophic forgetting when integrating acoustically rich codec-based representations into models originally trained on text. In this work, we propose a novel approach that leverages continual pre-training (CPT) on a pre-trained textua… ▽ More

    Submitted 24 February, 2025; originally announced February 2025.

  11. arXiv:2502.14145  [pdf, other

    cs.CL eess.AS

    LLM-Enhanced Dialogue Management for Full-Duplex Spoken Dialogue Systems

    Authors: Hao Zhang, Weiwei Li, Rilin Chen, Vinay Kothapally, Meng Yu, Dong Yu

    Abstract: Achieving full-duplex communication in spoken dialogue systems (SDS) requires real-time coordination between listening, speaking, and thinking. This paper proposes a semantic voice activity detection (VAD) module as a dialogue manager (DM) to efficiently manage turn-taking in full-duplex SDS. Implemented as a lightweight (0.5B) LLM fine-tuned on full-duplex conversation data, the semantic VAD pred… ▽ More

    Submitted 24 February, 2025; v1 submitted 19 February, 2025; originally announced February 2025.

    Comments: In submission to INTERSPEECH 2025

  12. arXiv:2501.06670  [pdf, other

    eess.SY

    A Geometric Analysis-Based Safety Assessment Framework for MASS Route Decision-Making in Restricted Waters

    Authors: Zilong Xu, Zihao Wang, He Li, Dingli Yu, Zaili Yang, Jin Wang

    Abstract: To enhance the safety of Maritime Autonomous Surface Ships (MASS) navigating in restricted waters, this paper aims to develop a geometric analysis-based route safety assessment (GARSA) framework, specifically designed for their route decision-making in irregularly shaped waterways. Utilizing line and point geometric elements to define waterway boundaries, the framework enables to construct a dynam… ▽ More

    Submitted 11 January, 2025; originally announced January 2025.

  13. arXiv:2501.01102  [pdf, other

    eess.AS cs.AI cs.SD

    Disambiguation of Chinese Polyphones in an End-to-End Framework with Semantic Features Extracted by Pre-trained BERT

    Authors: Dongyang Dai, Zhiyong Wu, Shiyin Kang, Xixin Wu, Jia Jia, Dan Su, Dong Yu, Helen Meng

    Abstract: Grapheme-to-phoneme (G2P) conversion serves as an essential component in Chinese Mandarin text-to-speech (TTS) system, where polyphone disambiguation is the core issue. In this paper, we propose an end-to-end framework to predict the pronunciation of a polyphonic character, which accepts sentence containing polyphonic character as input in the form of Chinese character sequence without the necessi… ▽ More

    Submitted 2 January, 2025; originally announced January 2025.

    Comments: Accepted at INTERSPEECH 2019

    Journal ref: Proc. Interspeech 2019, pp. 2090-2094

  14. arXiv:2411.16729  [pdf, other

    cs.SD cs.AI cs.GR cs.HC cs.MM eess.AS

    DiM-Gestor: Co-Speech Gesture Generation with Adaptive Layer Normalization Mamba-2

    Authors: Fan Zhang, Siyuan Zhao, Naye Ji, Zhaohan Wang, Jingmei Wu, Fuxing Gao, Zhenqing Ye, Leyao Yan, Lanxin Dai, Weidong Geng, Xin Lyu, Bozuo Zhao, Dingguo Yu, Hui Du, Bin Hu

    Abstract: Speech-driven gesture generation using transformer-based generative models represents a rapidly advancing area within virtual human creation. However, existing models face significant challenges due to their quadratic time and space complexities, limiting scalability and efficiency. To address these limitations, we introduce DiM-Gestor, an innovative end-to-end generative model leveraging the Mamb… ▽ More

    Submitted 23 November, 2024; originally announced November 2024.

    Comments: 13 pages, 11 figures

  15. arXiv:2410.06544  [pdf, other

    cs.SD eess.AS

    SRC-gAudio: Sampling-Rate-Controlled Audio Generation

    Authors: Chenxing Li, Manjie Xu, Dong Yu

    Abstract: We introduce SRC-gAudio, a novel audio generation model designed to facilitate text-to-audio generation across a wide range of sampling rates within a single model architecture. SRC-gAudio incorporates the sampling rate as part of the generation condition to guide the diffusion-based audio generation process. Our model enables the generation of audio at multiple sampling rates with a single unifie… ▽ More

    Submitted 9 October, 2024; originally announced October 2024.

    Comments: Accepted by APSIPA2024

  16. arXiv:2410.03751  [pdf, other

    cs.CL cs.SD eess.AS

    Recent Advances in Speech Language Models: A Survey

    Authors: Wenqian Cui, Dianzhi Yu, Xiaoqi Jiao, Ziqiao Meng, Guangyan Zhang, Qichao Wang, Yiwen Guo, Irwin King

    Abstract: Large Language Models (LLMs) have recently garnered significant attention, primarily for their capabilities in text-based interactions. However, natural human interaction often relies on speech, necessitating a shift towards voice-based models. A straightforward approach to achieve this involves a pipeline of ``Automatic Speech Recognition (ASR) + LLM + Text-to-Speech (TTS)", where input speech is… ▽ More

    Submitted 5 February, 2025; v1 submitted 1 October, 2024; originally announced October 2024.

    Comments: Work in progress

  17. arXiv:2410.01150  [pdf, other

    eess.AS cs.SD

    Restorative Speech Enhancement: A Progressive Approach Using SE and Codec Modules

    Authors: Hsin-Tien Chiang, Hao Zhang, Yong Xu, Meng Yu, Dong Yu

    Abstract: In challenging environments with significant noise and reverberation, traditional speech enhancement (SE) methods often lead to over-suppressed speech, creating artifacts during listening and harming downstream tasks performance. To overcome these limitations, we propose a novel approach called Restorative SE (RestSE), which combines a lightweight SE module with a generative codec module to progre… ▽ More

    Submitted 1 October, 2024; originally announced October 2024.

    Comments: Paper in submission

  18. arXiv:2409.14709  [pdf, other

    eess.AS cs.SD

    Video-to-Audio Generation with Fine-grained Temporal Semantics

    Authors: Yuchen Hu, Yu Gu, Chenxing Li, Rilin Chen, Dong Yu

    Abstract: With recent advances of AIGC, video generation have gained a surge of research interest in both academia and industry (e.g., Sora). However, it remains a challenge to produce temporally aligned audio to synchronize the generated video, considering the complicated semantic information included in the latter. In this work, inspired by the recent success of text-to-audio (TTA) generation, we first in… ▽ More

    Submitted 23 September, 2024; originally announced September 2024.

  19. arXiv:2409.10819  [pdf, ps, other

    eess.AS cs.SD

    EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer

    Authors: Jiarui Hai, Yong Xu, Hao Zhang, Chenxing Li, Helin Wang, Mounya Elhilali, Dong Yu

    Abstract: We introduce EzAudio, a text-to-audio (T2A) generation framework designed to produce high-quality, natural-sounding sound effects. Core designs include: (1) We propose EzAudio-DiT, an optimized Diffusion Transformer (DiT) designed for audio latent representations, improving convergence speed, as well as parameter and memory efficiency. (2) We apply a classifier-free guidance (CFG) rescaling techni… ▽ More

    Submitted 19 June, 2025; v1 submitted 16 September, 2024; originally announced September 2024.

    Comments: Accepted at Interspeech 2025

  20. arXiv:2409.08601  [pdf, other

    cs.SD cs.MM eess.AS

    STA-V2A: Video-to-Audio Generation with Semantic and Temporal Alignment

    Authors: Yong Ren, Chenxing Li, Manjie Xu, Wei Liang, Yu Gu, Rilin Chen, Dong Yu

    Abstract: Visual and auditory perception are two crucial ways humans experience the world. Text-to-video generation has made remarkable progress over the past year, but the absence of harmonious audio in generated video limits its broader applications. In this paper, we propose Semantic and Temporal Aligned Video-to-Audio (STA-V2A), an approach that enhances audio generation from videos by extracting both l… ▽ More

    Submitted 24 March, 2025; v1 submitted 13 September, 2024; originally announced September 2024.

    Comments: ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

  21. arXiv:2409.07556  [pdf, other

    eess.AS cs.SD

    SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis

    Authors: Helin Wang, Meng Yu, Jiarui Hai, Chen Chen, Yuchen Hu, Rilin Chen, Najim Dehak, Dong Yu

    Abstract: In this paper, we introduce SSR-Speech, a neural codec autoregressive model designed for stable, safe, and robust zero-shot textbased speech editing and text-to-speech synthesis. SSR-Speech is built on a Transformer decoder and incorporates classifier-free guidance to enhance the stability of the generation process. A watermark Encodec is proposed to embed frame-level watermarks into the edited re… ▽ More

    Submitted 1 January, 2025; v1 submitted 11 September, 2024; originally announced September 2024.

    Comments: ICASSP 2025

  22. arXiv:2409.06954  [pdf, other

    eess.AS

    Neural Ambisonic Encoding For Multi-Speaker Scenarios Using A Circular Microphone Array

    Authors: Yue Qiao, Vinay Kothapally, Meng Yu, Dong Yu

    Abstract: Spatial audio formats like Ambisonics are playback device layout-agnostic and well-suited for applications such as teleconferencing and virtual reality. Conventional Ambisonic encoding methods often rely on spherical microphone arrays for efficient sound field capture, which limits their flexibility in practical scenarios. We propose a deep learning (DL)-based approach, leveraging a two-stage netw… ▽ More

    Submitted 16 September, 2024; v1 submitted 10 September, 2024; originally announced September 2024.

    Comments: Submitted to ICASSP 2025

  23. arXiv:2409.01622  [pdf

    eess.IV cs.AI cs.CV

    T1-contrast Enhanced MRI Generation from Multi-parametric MRI for Glioma Patients with Latent Tumor Conditioning

    Authors: Zach Eidex, Mojtaba Safari, Richard L. J. Qiu, David S. Yu, Hui-Kuo Shu, Hui Mao, Xiaofeng Yang

    Abstract: Objective: Gadolinium-based contrast agents (GBCAs) are commonly used in MRI scans of patients with gliomas to enhance brain tumor characterization using T1-weighted (T1W) MRI. However, there is growing concern about GBCA toxicity. This study develops a deep-learning framework to generate T1-postcontrast (T1C) from pre-contrast multiparametric MRI. Approach: We propose the tumor-aware vision trans… ▽ More

    Submitted 3 September, 2024; originally announced September 2024.

    Comments: arXiv admin note: text overlap with arXiv:2407.02616

  24. arXiv:2408.17431  [pdf, other

    eess.AS cs.AI

    Advancing Multi-talker ASR Performance with Large Language Models

    Authors: Mohan Shi, Zengrui Jin, Yaoxun Xu, Yong Xu, Shi-Xiong Zhang, Kun Wei, Yiwen Shao, Chunlei Zhang, Dong Yu

    Abstract: Recognizing overlapping speech from multiple speakers in conversational scenarios is one of the most challenging problem for automatic speech recognition (ASR). Serialized output training (SOT) is a classic method to address multi-talker ASR, with the idea of concatenating transcriptions from multiple speakers according to the emission times of their speech for training. However, SOT-style transcr… ▽ More

    Submitted 30 August, 2024; originally announced August 2024.

    Comments: 8 pages, accepted by IEEE SLT 2024

  25. arXiv:2408.01320  [pdf, ps, other

    eess.SP cs.IT

    Generalized Reduced-WMMSE Approach for Cell-Free Massive MIMO With Per-AP Power Constraints

    Authors: Wonsik Yoo, Daesung Yu, Hoon Lee, Seok-Hwan Park

    Abstract: The optimization of cooperative beamforming vectors in cell-free massive MIMO (mMIMO) systems is presented where multi-antenna access points (APs) support downlink data transmission of multiple users. Albeit the successes of the weighted minimum mean squared error (WMMSE) algorithm and their variants, they lack careful investigations about computational complexity that scales with the number of an… ▽ More

    Submitted 2 August, 2024; originally announced August 2024.

    Comments: accepted for publication in IEEE Wireless Communications Letters

  26. arXiv:2407.07464  [pdf, other

    cs.SD cs.CV cs.MM eess.AS

    Video-to-Audio Generation with Hidden Alignment

    Authors: Manjie Xu, Chenxing Li, Xinyi Tu, Yong Ren, Rilin Chen, Yu Gu, Wei Liang, Dong Yu

    Abstract: Generating semantically and temporally aligned audio content in accordance with video input has become a focal point for researchers, particularly following the remarkable breakthrough in text-to-video generation. In this work, we aim to offer insights into the video-to-audio generation paradigm, focusing on three crucial aspects: vision encoders, auxiliary embeddings, and data augmentation techni… ▽ More

    Submitted 11 March, 2025; v1 submitted 10 July, 2024; originally announced July 2024.

    Comments: https://sites.google.com/view/vta-ldm

  27. SMRU: Split-and-Merge Recurrent-based UNet for Acoustic Echo Cancellation and Noise Suppression

    Authors: Zhihang Sun, Andong Li, Rilin Chen, Hao Zhang, Meng Yu, Yi Zhou, Dong Yu

    Abstract: The proliferation of deep neural networks has spawned the rapid development of acoustic echo cancellation and noise suppression, and plenty of prior arts have been proposed, which yield promising performance. Nevertheless, they rarely consider the deployment generality in different processing scenarios, such as edge devices, and cloud processing. To this end, this paper proposes a general model, t… ▽ More

    Submitted 24 January, 2025; v1 submitted 16 June, 2024; originally announced June 2024.

    Comments: 8 pages, Accepted to SLT 2024

    Journal ref: 2024 IEEE Spoken Language Technology Workshop (SLT), pp. 317-324, 2024

  28. arXiv:2406.09589  [pdf, other

    eess.AS

    Multi-Channel Multi-Speaker ASR Using Target Speaker's Solo Segment

    Authors: Yiwen Shao, Shi-Xiong Zhang, Yong Xu, Meng Yu, Dong Yu, Daniel Povey, Sanjeev Khudanpur

    Abstract: In the field of multi-channel, multi-speaker Automatic Speech Recognition (ASR), the task of discerning and accurately transcribing a target speaker's speech within background noise remains a formidable challenge. Traditional approaches often rely on microphone array configurations and the information of the target speaker's location or voiceprint. This study introduces the Solo Spatial Feature (S… ▽ More

    Submitted 17 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

    Comments: Accepted for presentation at Interspeech 2024

  29. arXiv:2406.04350  [pdf, other

    cs.SD cs.AI cs.LG eess.AS

    Prompt-guided Precise Audio Editing with Diffusion Models

    Authors: Manjie Xu, Chenxing Li, Duzhen zhang, Dan Su, Wei Liang, Dong Yu

    Abstract: Audio editing involves the arbitrary manipulation of audio content through precise control. Although text-guided diffusion models have made significant advancements in text-to-audio generation, they still face challenges in finding a flexible and precise way to modify target events within an audio track. We present a novel approach, referred to as PPAE, which serves as a general module for diffusi… ▽ More

    Submitted 11 May, 2024; originally announced June 2024.

    Comments: Accepted by ICML 2024

  30. arXiv:2406.00976  [pdf, other

    cs.CL cs.SD eess.AS

    Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer

    Authors: Yongxin Zhu, Dan Su, Liqiang He, Linli Xu, Dong Yu

    Abstract: While recent advancements in speech language models have achieved significant progress, they face remarkable challenges in modeling the long acoustic sequences of neural audio codecs. In this paper, we introduce \textbf{G}enerative \textbf{P}re-trained \textbf{S}peech \textbf{T}ransformer (GPST), a hierarchical transformer designed for efficient speech language modeling. GPST quantizes audio wavef… ▽ More

    Submitted 1 November, 2024; v1 submitted 3 June, 2024; originally announced June 2024.

    Comments: Accept in ACL2024-main

  31. arXiv:2404.08549  [pdf

    eess.IV cs.CV physics.bio-ph

    Practical Guidelines for Cell Segmentation Models Under Optical Aberrations in Microscopy

    Authors: Boyuan Peng, Jiaju Chen, P. Bilha Githinji, Ijaz Gul, Qihui Ye, Minjiang Chen, Peiwu Qin, Xingru Huang, Chenggang Yan, Dongmei Yu, Jiansong Ji, Zhenglin Chen

    Abstract: Cell segmentation is essential in biomedical research for analyzing cellular morphology and behavior. Deep learning methods, particularly convolutional neural networks (CNNs), have revolutionized cell segmentation by extracting intricate features from images. However, the robustness of these methods under microscope optical aberrations remains a critical challenge. This study evaluates cell image… ▽ More

    Submitted 25 August, 2024; v1 submitted 12 April, 2024; originally announced April 2024.

  32. arXiv:2404.08453  [pdf, other

    cs.LG eess.SY

    Lightweight Multi-System Multivariate Interconnection and Divergence Discovery

    Authors: Mulugeta Weldezgina Asres, Christian Walter Omlin, Jay Dittmann, Pavel Parygin, Joshua Hiltbrand, Seth I. Cooper, Grace Cummings, David Yu

    Abstract: Identifying outlier behavior among sensors and subsystems is essential for discovering faults and facilitating diagnostics in large systems. At the same time, exploring large systems with numerous multivariate data sets is challenging. This study presents a lightweight interconnection and divergence discovery mechanism (LIDD) to identify abnormal behavior in multi-system environments. The approach… ▽ More

    Submitted 12 April, 2024; originally announced April 2024.

    Comments: 8 pages, 12 figures

  33. arXiv:2404.08285  [pdf

    cs.CV cs.AI eess.SY

    A Survey of Neural Network Robustness Assessment in Image Recognition

    Authors: Jie Wang, Jun Ai, Minyan Lu, Haoran Su, Dan Yu, Yutao Zhang, Junda Zhu, Jingyu Liu

    Abstract: In recent years, there has been significant attention given to the robustness assessment of neural networks. Robustness plays a critical role in ensuring reliable operation of artificial intelligence (AI) systems in complex and uncertain environments. Deep learning's robustness problem is particularly significant, highlighted by the discovery of adversarial attacks on image classification models.… ▽ More

    Submitted 15 April, 2024; v1 submitted 12 April, 2024; originally announced April 2024.

    Comments: Corrected typos and grammatical errors in Section 5

  34. arXiv:2403.02307  [pdf, other

    eess.IV cs.CV

    Harnessing Intra-group Variations Via a Population-Level Context for Pathology Detection

    Authors: P. Bilha Githinji, Xi Yuan, Zhenglin Chen, Ijaz Gul, Dingqi Shang, Wen Liang, Jianming Deng, Dan Zeng, Dongmei yu, Chenggang Yan, Peiwu Qin

    Abstract: Realizing sufficient separability between the distributions of healthy and pathological samples is a critical obstacle for pathology detection convolutional models. Moreover, these models exhibit a bias for contrast-based images, with diminished performance on texture-based medical images. This study introduces the notion of a population-level context for pathology detection and employs a graph th… ▽ More

    Submitted 25 July, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

  35. arXiv:2402.01828  [pdf, other

    cs.CL cs.AI cs.SD eess.AS

    Retrieval Augmented End-to-End Spoken Dialog Models

    Authors: Mingqiu Wang, Izhak Shafran, Hagen Soltau, Wei Han, Yuan Cao, Dian Yu, Laurent El Shafey

    Abstract: We recently developed SLM, a joint speech and language model, which fuses a pretrained foundational speech model and a large language model (LLM), while preserving the in-context learning capability intrinsic to the pretrained LLM. In this paper, we apply SLM to speech dialog applications where the dialog states are inferred directly from the audio signal. Task-oriented dialogs often contain dom… ▽ More

    Submitted 2 February, 2024; originally announced February 2024.

    Journal ref: Proc. ICASSP 2024

  36. arXiv:2312.06101  [pdf, other

    eess.IV cs.CV

    Hundred-Kilobyte Lookup Tables for Efficient Single-Image Super-Resolution

    Authors: Binxiao Huang, Jason Chun Lok Li, Jie Ran, Boyu Li, Jiajun Zhou, Dahai Yu, Ngai Wong

    Abstract: Conventional super-resolution (SR) schemes make heavy use of convolutional neural networks (CNNs), which involve intensive multiply-accumulate (MAC) operations, and require specialized hardware such as graphics processing units. This contradicts the regime of edge AI that often runs on devices strained by power, computing, and storage resources. Such a challenge has motivated a series of lookup ta… ▽ More

    Submitted 8 May, 2024; v1 submitted 10 December, 2023; originally announced December 2023.

  37. arXiv:2311.13075  [pdf, other

    eess.AS

    Deep Audio Zooming: Beamwidth-Controllable Neural Beamformer

    Authors: Meng Yu, Dong Yu

    Abstract: Audio zooming, a signal processing technique, enables selective focusing and enhancement of sound signals from a specified region, attenuating others. While traditional beamforming and neural beamforming techniques, centered on creating a directional array, necessitate the designation of a singular target direction, they often overlook the concept of a field of view (FOV), that defines an angular… ▽ More

    Submitted 21 November, 2023; originally announced November 2023.

    Comments: 6 pages, 5 figures

  38. Real-Time Machine-Learning-Based Optimization Using Input Convex Long Short-Term Memory Network

    Authors: Zihao Wang, Donghan Yu, Zhe Wu

    Abstract: Neural network-based optimization and control methods, often referred to as black-box approaches, are increasingly gaining attention in energy and manufacturing systems, particularly in situations where first-principles models are either unavailable or inaccurate. However, their non-convex nature significantly slows down the optimization and control processes, limiting their application in real-ti… ▽ More

    Submitted 10 September, 2024; v1 submitted 13 November, 2023; originally announced November 2023.

    Comments: Applied Energy

  39. arXiv:2311.00146  [pdf, other

    eess.AS cs.AI

    RIR-SF: Room Impulse Response Based Spatial Feature for Target Speech Recognition in Multi-Channel Multi-Speaker Scenarios

    Authors: Yiwen Shao, Shi-Xiong Zhang, Dong Yu

    Abstract: Automatic speech recognition (ASR) on multi-talker recordings is challenging. Current methods using 3D spatial data from multi-channel audio and visual cues focus mainly on direct waves from the target speaker, overlooking reflection wave impacts, which hinders performance in reverberant environments. Our research introduces RIR-SF, a novel spatial feature based on room impulse response (RIR) that… ▽ More

    Submitted 11 June, 2024; v1 submitted 31 October, 2023; originally announced November 2023.

    Comments: Accepted for presentation at Interspeech 2024

  40. arXiv:2310.16367  [pdf, other

    eess.AS

    UniX-Encoder: A Universal $X$-Channel Speech Encoder for Ad-Hoc Microphone Array Speech Processing

    Authors: Zili Huang, Yiwen Shao, Shi-Xiong Zhang, Dong Yu

    Abstract: The speech field is evolving to solve more challenging scenarios, such as multi-channel recordings with multiple simultaneous talkers. Given the many types of microphone setups out there, we present the UniX-Encoder. It's a universal encoder designed for multiple tasks, and worked with any microphone array, in both solo and multi-talker environments. Our research enhances previous multi-channel sp… ▽ More

    Submitted 25 October, 2023; originally announced October 2023.

    Comments: Submitted to ICASSP 2024

  41. arXiv:2310.11954  [pdf, other

    cs.CL cs.MM eess.AS

    MusicAgent: An AI Agent for Music Understanding and Generation with Large Language Models

    Authors: Dingyao Yu, Kaitao Song, Peiling Lu, Tianyu He, Xu Tan, Wei Ye, Shikun Zhang, Jiang Bian

    Abstract: AI-empowered music processing is a diverse field that encompasses dozens of tasks, ranging from generation tasks (e.g., timbre synthesis) to comprehension tasks (e.g., music classification). For developers and amateurs, it is very difficult to grasp all of these task to satisfy their requirements in music processing, especially considering the huge differences in the representations of music data… ▽ More

    Submitted 25 October, 2023; v1 submitted 18 October, 2023; originally announced October 2023.

  42. arXiv:2310.10992  [pdf, other

    cs.SD eess.AS

    A High Fidelity and Low Complexity Neural Audio Coding

    Authors: Wenzhe Liu, Wei Xiao, Meng Wang, Shan Yang, Yupeng Shi, Yuyong Kang, Dan Su, Shidong Shang, Dong Yu

    Abstract: Audio coding is an essential module in the real-time communication system. Neural audio codecs can compress audio samples with a low bitrate due to the strong modeling and generative capabilities of deep neural networks. To address the poor high-frequency expression and high computational cost and storage consumption, we proposed an integrated framework that utilizes a neural network to model wide… ▽ More

    Submitted 17 October, 2023; originally announced October 2023.

  43. arXiv:2310.01292  [pdf, other

    cs.CV cs.LG eess.IV

    Efficient Remote Sensing Segmentation With Generative Adversarial Transformer

    Authors: Luyi Qiu, Dayu Yu, Xiaofeng Zhang, Chenxiao Zhang

    Abstract: Most deep learning methods that achieve high segmentation accuracy require deep network architectures that are too heavy and complex to run on embedded devices with limited storage and memory space. To address this issue, this paper proposes an efficient Generative Adversarial Transfomer (GATrans) for achieving high-precision semantic segmentation while maintaining an extremely efficient size. The… ▽ More

    Submitted 2 October, 2023; originally announced October 2023.

  44. arXiv:2310.00900  [pdf, other

    cs.SD cs.AI cs.CL eess.AS

    uSee: Unified Speech Enhancement and Editing with Conditional Diffusion Models

    Authors: Muqiao Yang, Chunlei Zhang, Yong Xu, Zhongweiyang Xu, Heming Wang, Bhiksha Raj, Dong Yu

    Abstract: Speech enhancement aims to improve the quality of speech signals in terms of quality and intelligibility, and speech editing refers to the process of editing the speech according to specific user needs. In this paper, we propose a Unified Speech Enhancement and Editing (uSee) model with conditional diffusion models to handle various tasks at the same time in a generative manner. Specifically, by p… ▽ More

    Submitted 2 October, 2023; originally announced October 2023.

  45. arXiv:2310.00230  [pdf, other

    cs.CL cs.SD eess.AS

    SLM: Bridge the thin gap between speech and text foundation models

    Authors: Mingqiu Wang, Wei Han, Izhak Shafran, Zelin Wu, Chung-Cheng Chiu, Yuan Cao, Yongqiang Wang, Nanxin Chen, Yu Zhang, Hagen Soltau, Paul Rubenstein, Lukas Zilka, Dian Yu, Zhong Meng, Golan Pundak, Nikhil Siddhartha, Johan Schalkwyk, Yonghui Wu

    Abstract: We present a joint Speech and Language Model (SLM), a multitask, multilingual, and dual-modal model that takes advantage of pretrained foundational speech and language models. SLM freezes the pretrained foundation models to maximally preserves their capabilities, and only trains a simple adapter with just 1\% (156M) of the foundation models' parameters. This adaptation not only leads SLM to achiev… ▽ More

    Submitted 29 September, 2023; originally announced October 2023.

  46. arXiv:2309.16049  [pdf, other

    eess.AS cs.SD eess.SP

    Neural Network Augmented Kalman Filter for Robust Acoustic Howling Suppression

    Authors: Yixuan Zhang, Hao Zhang, Meng Yu, Dong Yu

    Abstract: Acoustic howling suppression (AHS) is a critical challenge in audio communication systems. In this paper, we propose a novel approach that leverages the power of neural networks (NN) to enhance the performance of traditional Kalman filter algorithms for AHS. Specifically, our method involves the integration of NN modules into the Kalman filter, enabling refining reference signal, a key factor in e… ▽ More

    Submitted 27 September, 2023; originally announced September 2023.

    Comments: Paper in submission

  47. arXiv:2309.16048  [pdf, other

    eess.AS cs.SD eess.SP

    Advancing Acoustic Howling Suppression through Recursive Training of Neural Networks

    Authors: Hao Zhang, Yixuan Zhang, Meng Yu, Dong Yu

    Abstract: In this paper, we introduce a novel training framework designed to comprehensively address the acoustic howling issue by examining its fundamental formation process. This framework integrates a neural network (NN) module into the closed-loop system during training with signals generated recursively on the fly to closely mimic the streaming process of acoustic howling suppression (AHS). The propose… ▽ More

    Submitted 27 September, 2023; originally announced September 2023.

    Comments: Paper in submission

  48. arXiv:2309.09028  [pdf, other

    eess.AS cs.SD

    Unifying Robustness and Fidelity: A Comprehensive Study of Pretrained Generative Methods for Speech Enhancement in Adverse Conditions

    Authors: Heming Wang, Meng Yu, Hao Zhang, Chunlei Zhang, Zhongweiyang Xu, Muqiao Yang, Yixuan Zhang, Dong Yu

    Abstract: Enhancing speech signal quality in adverse acoustic environments is a persistent challenge in speech processing. Existing deep learning based enhancement methods often struggle to effectively remove background noise and reverberation in real-world scenarios, hampering listening experiences. To address these challenges, we propose a novel approach that uses pre-trained generative methods to resynth… ▽ More

    Submitted 16 September, 2023; originally announced September 2023.

    Comments: Paper in submission

  49. arXiv:2309.07432  [pdf, other

    cs.SD eess.AS

    SpatialCodec: Neural Spatial Speech Coding

    Authors: Zhongweiyang Xu, Yong Xu, Vinay Kothapally, Heming Wang, Muqiao Yang, Dong Yu

    Abstract: In this work, we address the challenge of encoding speech captured by a microphone array using deep learning techniques with the aim of preserving and accurately reconstructing crucial spatial cues embedded in multi-channel recordings. We propose a neural spatial audio coding framework that achieves a high compression ratio, leveraging single-channel neural sub-band codec and SpatialCodec. Our app… ▽ More

    Submitted 8 July, 2024; v1 submitted 14 September, 2023; originally announced September 2023.

    Comments: Accepted by ICASSP2024

  50. arXiv:2309.07416  [pdf, other

    cs.SD eess.AS

    BANC: Towards Efficient Binaural Audio Neural Codec for Overlapping Speech

    Authors: Anton Ratnarajah, Shi-Xiong Zhang, Dong Yu

    Abstract: We introduce BANC, a neural binaural audio codec designed for efficient speech compression in single and two-speaker scenarios while preserving the spatial location information of each speaker. Our key contributions are as follows: 1) The ability of our proposed model to compress and decode overlapping speech. 2) A novel architecture that compresses speech content and spatial cues separately, ensu… ▽ More

    Submitted 24 November, 2024; v1 submitted 14 September, 2023; originally announced September 2023.

    Comments: More results and source code are available at https://anton-jeran.github.io/MAD/