
Showing 1–50 of 91 results for author: Kong, Q

Searching in archive eess.
  1. arXiv:2506.19742  [pdf, ps, other]

    eess.IV cs.AI cs.CV

    NeRF-based CBCT Reconstruction needs Normalization and Initialization

    Authors: Zhuowei Xu, Han Li, Dai Sun, Zhicheng Li, Yujia Li, Qingpeng Kong, Zhiwei Cheng, Nassir Navab, S. Kevin Zhou

    Abstract: Cone Beam Computed Tomography (CBCT) is widely used in medical imaging. However, the limited number and intensity of X-ray projections make reconstruction an ill-posed problem with severe artifacts. NeRF-based methods have achieved great success in this task. However, they suffer from a local-global training mismatch between their two key components: the hash encoder and the neural network. Specif…

    Submitted 24 June, 2025; originally announced June 2025.

  2. arXiv:2505.21827  [pdf, ps, other]

    cs.SD cs.AI cs.LG eess.AS

    Music Source Restoration

    Authors: Yongyi Zang, Zheqi Dai, Mark D. Plumbley, Qiuqiang Kong

    Abstract: We introduce Music Source Restoration (MSR), a novel task addressing the gap between idealized source separation and real-world music production. Current Music Source Separation (MSS) approaches assume mixtures are simple sums of sources, ignoring signal degradations employed during music production like equalization, compression, and reverb. MSR models mixtures as degraded sums of individually de…

    Submitted 27 May, 2025; originally announced May 2025.

    Comments: A modified version of this paper is in review

  3. arXiv:2505.19534  [pdf, ps, other]

    cs.SD cs.AI cs.LG eess.AS

    Training-Free Multi-Step Audio Source Separation

    Authors: Yongyi Zang, Jingyi Li, Qiuqiang Kong

    Abstract: Audio source separation aims to separate a mixture into target sources. Previous audio source separation systems usually conduct one-step inference, which does not fully explore the separation ability of models. In this work, we reveal that pretrained one-step audio source separation models can be leveraged for multi-step separation without additional training. We propose a simple yet effective in…

    Submitted 26 May, 2025; originally announced May 2025.

  4. arXiv:2505.17582  [pdf, ps, other]

    eess.IV cs.CV cs.RO

    Distance Estimation in Outdoor Driving Environments Using Phase-only Correlation Method with Event Cameras

    Authors: Masataka Kobayashi, Shintaro Shiba, Quan Kong, Norimasa Kobori, Tsukasa Shimizu, Shan Lu, Takaya Yamazato

    Abstract: With the growing adoption of autonomous driving, the advancement of sensor technology is crucial for ensuring safety and reliable operation. Sensor fusion techniques that combine multiple sensors such as LiDAR, radar, and cameras have proven effective, but the integration of multiple devices increases both hardware complexity and cost. Therefore, developing a single sensor capable of performing mu…

    Submitted 23 May, 2025; originally announced May 2025.

    Comments: 6 pages, 7 figures. To appear in IEEE Intelligent Vehicles Symposium (IV) 2025

    ACM Class: I.4.8; I.2.10; I.5.4

  5. arXiv:2505.05501  [pdf, other]

    cs.CV cs.AI eess.IV

    Preliminary Explorations with GPT-4o(mni) Native Image Generation

    Authors: Pu Cao, Feng Zhou, Junyi Ji, Qingye Kong, Zhixiang Lv, Mingjian Zhang, Xuekun Zhao, Siqi Wu, Yinghui Lin, Qing Song, Lu Yang

    Abstract: Recently, the visual generation ability by GPT-4o(mni) has been unlocked by OpenAI. It demonstrates a very remarkable generation capability with excellent multimodal condition understanding and varied task instructions. In this paper, we aim to explore the capabilities of GPT-4o across various tasks. Inspired by previous study, we constructed a task taxonomy along with a carefully curated set of t…

    Submitted 6 May, 2025; originally announced May 2025.

  6. arXiv:2504.18521  [pdf, other]

    cs.CV cs.RO eess.SP

    E-VLC: A Real-World Dataset for Event-based Visible Light Communication And Localization

    Authors: Shintaro Shiba, Quan Kong, Norimasa Kobori

    Abstract: Optical communication using modulated LEDs (e.g., visible light communication) is an emerging application for event cameras, thanks to their high spatio-temporal resolutions. Event cameras can be used simply to decode the LED signals and also to localize the camera relative to the LED marker positions. However, there is no public dataset to benchmark the decoding and localization in various real-w…

    Submitted 25 April, 2025; originally announced April 2025.

    Comments: 10 pages, 9 figures, 5 tables, CVPRW on EventVision 2025

  7. arXiv:2503.18078  [pdf, other]

    eess.SP

    GenMetaLoc: Learning to Learn Environment-Aware Fingerprint Generation for Sample Efficient Wireless Localization

    Authors: Jun Gao, Feng Yin, Wenzhong Yan, Qinglei Kong, Lexi Xu, Shuguang Cui

    Abstract: Existing fingerprinting-based localization methods often require extensive data collection and struggle to generalize to new environments. In contrast to previous environment-unknown MetaLoc, we propose GenMetaLoc in this paper, which first introduces meta-learning to enable the generation of dense fingerprint databases from an environment-aware perspective. In the model aspect, the learning-to-le…

    Submitted 23 March, 2025; originally announced March 2025.

  8. arXiv:2502.04128  [pdf, other]

    eess.AS cs.AI cs.CL cs.MM cs.SD

    Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis

    Authors: Zhen Ye, Xinfa Zhu, Chi-Min Chan, Xinsheng Wang, Xu Tan, Jiahe Lei, Yi Peng, Haohe Liu, Yizhu Jin, Zheqi Dai, Hongzhan Lin, Jianyi Chen, Xingjian Du, Liumeng Xue, Yunlin Chen, Zhifei Li, Lei Xie, Qiuqiang Kong, Yike Guo, Wei Xue

    Abstract: Recent advances in text-based large language models (LLMs), particularly in the GPT series and the o1 model, have demonstrated the effectiveness of scaling both training-time and inference-time compute. However, current state-of-the-art TTS systems leveraging LLMs are often multi-stage, requiring separate models (e.g., diffusion models after LLM), complicating the decision of whether to scale a pa…

    Submitted 22 February, 2025; v1 submitted 6 February, 2025; originally announced February 2025.

  9. arXiv:2501.13514  [pdf, other]

    eess.IV cs.CV

    Self-Supervised Diffusion MRI Denoising via Iterative and Stable Refinement

    Authors: Chenxu Wu, Qingpeng Kong, Zihang Jiang, S. Kevin Zhou

    Abstract: Magnetic Resonance Imaging (MRI), including diffusion MRI (dMRI), serves as a "microscope" for anatomical structures and routinely mitigates the influence of low signal-to-noise ratio scans by compromising temporal or spatial resolution. However, these compromises fail to meet clinical demands for both efficiency and precision. Consequently, denoising is a vital preprocessing step, particularly…

    Submitted 9 March, 2025; v1 submitted 23 January, 2025; originally announced January 2025.

    Comments: 40 pages, 34 figures

    Journal ref: ICLR 2025

  10. arXiv:2501.03038  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    Piano Transcription by Hierarchical Language Modeling with Pretrained Roll-based Encoders

    Authors: Dichucheng Li, Yongyi Zang, Qiuqiang Kong

    Abstract: Automatic Music Transcription (AMT), aiming to get musical notes from raw audio, typically uses frame-level systems with piano-roll outputs or language model (LM)-based systems with note-level predictions. However, frame-level systems require manual thresholding, while the LM-based systems struggle with long sequences. In this paper, we propose a hybrid method combining pre-trained roll-based enco…

    Submitted 7 January, 2025; v1 submitted 6 January, 2025; originally announced January 2025.

    Comments: Accepted by ICASSP 2025

  11. arXiv:2410.09472  [pdf, other]

    cs.SD cs.AI eess.AS

    DRCap: Decoding CLAP Latents with Retrieval-Augmented Generation for Zero-shot Audio Captioning

    Authors: Xiquan Li, Wenxi Chen, Ziyang Ma, Xuenan Xu, Yuzhe Liang, Zhisheng Zheng, Qiuqiang Kong, Xie Chen

    Abstract: While automated audio captioning (AAC) has made notable progress, traditional fully supervised AAC models still face two critical challenges: the need for expensive audio-text pair data for training and performance degradation when transferring across domains. To overcome these limitations, we present DRCap, a data-efficient and flexible zero-shot audio captioning system that requires text-only da…

    Submitted 6 January, 2025; v1 submitted 12 October, 2024; originally announced October 2024.

  12. arXiv:2409.09642  [pdf, other]

    eess.AS cs.LG cs.SD

    Extract and Diffuse: Latent Integration for Improved Diffusion-based Speech and Vocal Enhancement

    Authors: Yudong Yang, Zhan Liu, Wenyi Yu, Guangzhi Sun, Qiuqiang Kong, Chao Zhang

    Abstract: Diffusion-based generative models have recently achieved remarkable results in speech and vocal enhancement due to their ability to model complex speech data distributions. While these models generalize well to unseen acoustic environments, they may not achieve the same level of fidelity as the discriminative models specifically trained to enhance particular acoustic conditions. In this paper, we…

    Submitted 15 September, 2024; originally announced September 2024.

  13. arXiv:2409.09398  [pdf, other]

    eess.AS cs.SD

    Language-Queried Target Sound Extraction Without Parallel Training Data

    Authors: Hao Ma, Zhiyuan Peng, Xu Li, Yukai Li, Mingjie Shao, Qiuqiang Kong, Ju Liu

    Abstract: Language-queried target sound extraction (TSE) aims to extract specific sounds from mixtures based on language queries. Traditional fully-supervised training schemes require extensively annotated parallel audio-text data, which are labor-intensive. We introduce a parallel-data-free training scheme, requiring only unlabelled audio clips for TSE model training by utilizing the contrastive language-a…

    Submitted 21 March, 2025; v1 submitted 14 September, 2024; originally announced September 2024.

    Comments: Accepted by ICASSP 2025

  14. arXiv:2409.03055  [pdf, other]

    cs.SD eess.AS

    SymPAC: Scalable Symbolic Music Generation With Prompts And Constraints

    Authors: Haonan Chen, Jordan B. L. Smith, Janne Spijkervet, Ju-Chiang Wang, Pei Zou, Bochen Li, Qiuqiang Kong, Xingjian Du

    Abstract: Progress in the task of symbolic music generation may be lagging behind other tasks like audio and text generation, in part because of the scarcity of symbolic training data. In this paper, we leverage the greater scale of audio music data by applying pre-trained MIR models (for transcription, beat tracking, structure analysis, etc.) to extract symbolic events and encode them into token sequences.…

    Submitted 9 September, 2024; v1 submitted 4 September, 2024; originally announced September 2024.

    Comments: ISMIR 2024

  15. arXiv:2408.17175  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD

    Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model

    Authors: Zhen Ye, Peiwen Sun, Jiahe Lei, Hongzhan Lin, Xu Tan, Zheqi Dai, Qiuqiang Kong, Jianyi Chen, Jiahao Pan, Qifeng Liu, Yike Guo, Wei Xue

    Abstract: Recent advancements in audio generation have been significantly propelled by the capabilities of Large Language Models (LLMs). The existing research on audio LLM has primarily focused on enhancing the architecture and scale of audio language models, as well as leveraging larger datasets, and generally, acoustic codecs, such as EnCodec, are used for audio tokenization. However, these codecs were or…

    Submitted 27 November, 2024; v1 submitted 30 August, 2024; originally announced August 2024.

  16. arXiv:2408.14340  [pdf, other]

    cs.SD cs.AI cs.CL cs.LG eess.AS

    Foundation Models for Music: A Survey

    Authors: Yinghao Ma, Anders Øland, Anton Ragni, Bleiz MacSen Del Sette, Charalampos Saitis, Chris Donahue, Chenghua Lin, Christos Plachouras, Emmanouil Benetos, Elona Shatri, Fabio Morreale, Ge Zhang, György Fazekas, Gus Xia, Huan Zhang, Ilaria Manco, Jiawen Huang, Julien Guinot, Liwei Lin, Luca Marinelli, Max W. Y. Lam, Megha Sharma, Qiuqiang Kong, Roger B. Dannenberg, Ruibin Yuan, et al. (17 additional authors not shown)

    Abstract: In recent years, foundation models (FMs) such as large language models (LLMs) and latent diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This comprehensive review examines state-of-the-art (SOTA) pre-trained models and foundation models in music, spanning from representation learning, generative learning and multimodal learning. We first contextualise the signifi…

    Submitted 3 September, 2024; v1 submitted 26 August, 2024; originally announced August 2024.

  17. arXiv:2407.11745  [pdf, other]

    eess.AS cs.AI cs.SD

    Universal Sound Separation with Self-Supervised Audio Masked Autoencoder

    Authors: Junqi Zhao, Xubo Liu, Jinzheng Zhao, Yi Yuan, Qiuqiang Kong, Mark D. Plumbley, Wenwu Wang

    Abstract: Universal sound separation (USS) is a task of separating mixtures of arbitrary sound sources. Typically, universal separation models are trained from scratch in a supervised manner, using labeled data. Self-supervised learning (SSL) is an emerging deep learning approach that leverages unlabeled data to obtain task-agnostic representations, which can benefit many downstream tasks. In this paper, we…

    Submitted 6 November, 2024; v1 submitted 16 July, 2024; originally announced July 2024.

  18. arXiv:2406.11462  [pdf, other]

    cs.MM cs.GR cs.SD eess.AS

    MusicScore: A Dataset for Music Score Modeling and Generation

    Authors: Yuheng Lin, Zheqi Dai, Qiuqiang Kong

    Abstract: Music scores are written representations of music and contain rich information about musical components. The visual information on music scores includes notes, rests, staff lines, clefs, dynamics, and articulations. This visual information in music scores contains more semantic information than audio and symbolic representations of music. Previous music score datasets have limited sizes and are ma…

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: Dataset paper, dataset link: https://huggingface.co/datasets/ZheqiDAI/MusicScore

  19. arXiv:2406.02233  [pdf, other]

    eess.AS

    Towards Out-of-Distribution Detection in Vocoder Recognition via Latent Feature Reconstruction

    Authors: Renmingyue Du, Jixun Yao, Qiuqiang Kong, Yin Cao

    Abstract: Advancements in synthesized speech have created a growing threat of impersonation, making it crucial to develop deepfake algorithm recognition. One significant aspect is out-of-distribution (OOD) detection, which has gained notable attention due to its important role in deepfake algorithm recognition. However, most of the current approaches for detecting OOD in deepfake algorithm recognition rely…

    Submitted 4 June, 2024; originally announced June 2024.

    Comments: 5 pages, 4 figures

  20. arXiv:2403.09527  [pdf, other]

    eess.AS

    WavCraft: Audio Editing and Generation with Large Language Models

    Authors: Jinhua Liang, Huan Zhang, Haohe Liu, Yin Cao, Qiuqiang Kong, Xubo Liu, Wenwu Wang, Mark D. Plumbley, Huy Phan, Emmanouil Benetos

    Abstract: We introduce WavCraft, a collective system that leverages large language models (LLMs) to connect diverse task-specific models for audio content creation and editing. Specifically, WavCraft describes the content of raw audio materials in natural language and prompts the LLM conditioned on audio descriptions and user requests. WavCraft leverages the in-context learning ability of the LLM to decompo…

    Submitted 10 May, 2024; v1 submitted 14 March, 2024; originally announced March 2024.

  21. arXiv:2312.16422  [pdf, other]

    eess.AS cs.SD

    Selective-Memory Meta-Learning with Environment Representations for Sound Event Localization and Detection

    Authors: Jinbo Hu, Yin Cao, Ming Wu, Qiuqiang Kong, Feiran Yang, Mark D. Plumbley, Jun Yang

    Abstract: Environment shifts and conflicts present significant challenges for learning-based sound event localization and detection (SELD) methods. SELD systems, when trained in particular acoustic settings, often show restricted generalization capabilities for diverse acoustic environments. Furthermore, obtaining annotated samples for spatial sound events is notably costly. Deploying a SELD system in a new…

    Submitted 5 October, 2024; v1 submitted 27 December, 2023; originally announced December 2023.

    Comments: 14 pages, 11 figures, accepted by IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP)

  22. arXiv:2310.10159  [pdf, other]

    cs.SD cs.CL eess.AS

    Joint Music and Language Attention Models for Zero-shot Music Tagging

    Authors: Xingjian Du, Zhesong Yu, Jiaju Lin, Bilei Zhu, Qiuqiang Kong

    Abstract: Music tagging is a task to predict the tags of music recordings. However, previous music tagging research primarily focuses on close-set music tagging tasks which cannot be generalized to new tags. In this work, we propose a zero-shot music tagging system modeled by a joint music and language attention (JMLA) model to address the open-set music tagging problem. The JMLA model consists of an audio…

    Submitted 16 October, 2023; originally announced October 2023.

    Comments: Keywords: Music tagging, joint music and language attention models, Music Foundation Model.

  23. arXiv:2310.09853  [pdf, other]

    cs.SD cs.AI cs.LG cs.MM eess.AS

    MERTech: Instrument Playing Technique Detection Using Self-Supervised Pretrained Model With Multi-Task Finetuning

    Authors: Dichucheng Li, Yinghao Ma, Weixing Wei, Qiuqiang Kong, Yulun Wu, Mingjin Che, Fan Xia, Emmanouil Benetos, Wei Li

    Abstract: Instrument playing techniques (IPTs) constitute a pivotal component of musical expression. However, the development of automatic IPT detection methods suffers from limited labeled data and inherent class imbalance issues. In this paper, we propose to apply a self-supervised learning model pre-trained on large-scale unlabeled music data and finetune it on IPT detection tasks. This approach addresse…

    Submitted 15 October, 2023; originally announced October 2023.

    Comments: submitted to ICASSP 2024

  24. arXiv:2310.08950  [pdf, ps, other]

    cs.SD eess.AS

    Transformer-based Autoencoder with ID Constraint for Unsupervised Anomalous Sound Detection

    Authors: Jian Guan, Youde Liu, Qiuqiang Kong, Feiyang Xiao, Qiaoxi Zhu, Jiantong Tian, Wenwu Wang

    Abstract: Unsupervised anomalous sound detection (ASD) aims to detect unknown anomalous sounds of devices when only normal sound data is available. The autoencoder (AE) and self-supervised learning based methods are two mainstream methods. However, the AE-based methods could be limited as the feature learned from normal sounds can also fit with anomalous sounds, reducing the ability of the model in detectin…

    Submitted 13 October, 2023; originally announced October 2023.

    Comments: Accepted by EURASIP Journal on Audio, Speech, and Music Processing

  25. arXiv:2309.02612  [pdf, other]

    cs.SD eess.AS

    Music Source Separation with Band-Split RoPE Transformer

    Authors: Wei-Tsung Lu, Ju-Chiang Wang, Qiuqiang Kong, Yun-Ning Hung

    Abstract: Music source separation (MSS) aims to separate a music recording into multiple musically distinct stems, such as vocals, bass, drums, and more. Recently, deep learning approaches such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have been used, but the improvement is still limited. In this paper, we propose a novel frequency-domain approach based on a Band-Split RoP…

    Submitted 9 September, 2023; v1 submitted 5 September, 2023; originally announced September 2023.

    Comments: This paper explains the SAMI-ByteDance MSS system submitted to Sound Demixing Challenge (SDX23) Music Separation Track. Version 2 of paper fixed some typos

  26. arXiv:2308.05734  [pdf, other]

    cs.SD cs.AI cs.MM eess.AS eess.SP

    AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining

    Authors: Haohe Liu, Yi Yuan, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Qiao Tian, Yuping Wang, Wenwu Wang, Yuxuan Wang, Mark D. Plumbley

    Abstract: Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can significantly differ from those of other types. To bring us closer to a unified perspective of audio generation, this paper proposes a framework that utilizes the same learn…

    Submitted 11 May, 2024; v1 submitted 10 August, 2023; originally announced August 2023.

    Comments: Accepted by IEEE/ACM Transactions on Audio, Speech and Language Processing. Project page is https://audioldm.github.io/audioldm2

  27. arXiv:2308.05037  [pdf, other]

    eess.AS cs.AI cs.MM cs.SD

    Separate Anything You Describe

    Authors: Xubo Liu, Qiuqiang Kong, Yan Zhao, Haohe Liu, Yi Yuan, Yuzhuo Liu, Rui Xia, Yuxuan Wang, Mark D. Plumbley, Wenwu Wang

    Abstract: Language-queried audio source separation (LASS) is a new paradigm for computational auditory scene analysis (CASA). LASS aims to separate a target sound from an audio mixture given a natural language query, which provides a natural and scalable interface for digital audio applications. Recent works on LASS, despite attaining promising separation performance on specific sources (e.g., musical instr…

    Submitted 1 December, 2024; v1 submitted 9 August, 2023; originally announced August 2023.

    Comments: Code, benchmark and pre-trained models: https://github.com/Audio-AGI/AudioSep

  28. arXiv:2307.14335  [pdf, other]

    cs.SD cs.AI cs.MM eess.AS

    WavJourney: Compositional Audio Creation with Large Language Models

    Authors: Xubo Liu, Zhongkai Zhu, Haohe Liu, Yi Yuan, Meng Cui, Qiushi Huang, Jinhua Liang, Yin Cao, Qiuqiang Kong, Mark D. Plumbley, Wenwu Wang

    Abstract: Despite breakthroughs in audio generation models, their capabilities are often confined to domain-specific conditions such as speech transcriptions and audio captions. However, real-world audio creation aims to generate harmonious audio containing various elements such as speech, music, and sound effects with controllable conditions, which is challenging to address using existing audio generation…

    Submitted 26 November, 2023; v1 submitted 26 July, 2023; originally announced July 2023.

    Comments: GitHub: https://github.com/Audio-AGI/WavJourney

  29. arXiv:2305.10666  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    A unified front-end framework for English text-to-speech synthesis

    Authors: Zelin Ying, Chen Li, Yu Dong, Qiuqiang Kong, Qiao Tian, Yuanyuan Huo, Yuxuan Wang

    Abstract: The front-end is a critical component of English text-to-speech (TTS) systems, responsible for extracting linguistic features that are essential for a text-to-speech model to synthesize speech, such as prosodies and phonemes. The English TTS front-end typically consists of a text normalization (TN) module, a prosody word prosody phrase (PWPP) module, and a grapheme-to-phoneme (G2P) module. However…

    Submitted 25 March, 2024; v1 submitted 17 May, 2023; originally announced May 2023.

    Comments: Accepted in ICASSP 2024

  30. arXiv:2305.07447  [pdf, other]

    cs.SD eess.AS

    Universal Source Separation with Weakly Labelled Data

    Authors: Qiuqiang Kong, Ke Chen, Haohe Liu, Xingjian Du, Taylor Berg-Kirkpatrick, Shlomo Dubnov, Mark D. Plumbley

    Abstract: Universal source separation (USS) is a fundamental research task for computational auditory scene analysis, which aims to separate mono recordings into individual source tracks. There are three potential challenges awaiting the solution to the audio source separation task. First, previous audio source separation systems mainly focus on separating one or a limited number of specific sources. There…

    Submitted 11 May, 2023; originally announced May 2023.

  31. arXiv:2305.07204  [pdf, other]

    eess.AS cs.SD

    Multi-level Temporal-channel Speaker Retrieval for Zero-shot Voice Conversion

    Authors: Zhichao Wang, Liumeng Xue, Qiuqiang Kong, Lei Xie, Yuanzhe Chen, Qiao Tian, Yuping Wang

    Abstract: Zero-shot voice conversion (VC) converts source speech into the voice of any desired speaker using only one utterance of the speaker without requiring additional model updates. Typical methods use a speaker representation from a pre-trained speaker verification (SV) model or learn speaker representation during VC training to achieve zero-shot VC. However, existing speaker modeling methods overlook…

    Submitted 18 May, 2024; v1 submitted 11 May, 2023; originally announced May 2023.

    Comments: Submitted to TASLP

  32. arXiv:2303.17395  [pdf, other]

    eess.AS cs.CL cs.MM cs.SD

    WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research

    Authors: Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, Tom Ko, Chengqi Zhao, Mark D. Plumbley, Yuexian Zou, Wenwu Wang

    Abstract: The advancement of audio-language (AL) multimodal learning tasks has been significant in recent years. However, researchers face challenges due to the costly and time-consuming collection process of existing audio-language datasets, which are limited in size. To address this data scarcity issue, we introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approx…

    Submitted 18 July, 2024; v1 submitted 30 March, 2023; originally announced March 2023.

    Comments: Accepted to TASLP

  33. arXiv:2302.00286  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    Jointist: Simultaneous Improvement of Multi-instrument Transcription and Music Source Separation via Joint Training

    Authors: Kin Wai Cheuk, Keunwoo Choi, Qiuqiang Kong, Bochen Li, Minz Won, Ju-Chiang Wang, Yun-Ning Hung, Dorien Herremans

    Abstract: In this paper, we introduce Jointist, an instrument-aware multi-instrument framework that is capable of transcribing, recognizing, and separating multiple musical instruments from an audio clip. Jointist consists of an instrument recognition module that conditions the other two modules: a transcription module that outputs instrument-specific piano rolls, and a source separation module that utilize…

    Submitted 1 February, 2023; v1 submitted 1 February, 2023; originally announced February 2023.

    Comments: arXiv admin note: text overlap with arXiv:2206.10805

  34. arXiv:2211.12195  [pdf, other]

    eess.AS cs.AI cs.LG cs.SD eess.SP

    Ontology-aware Learning and Evaluation for Audio Tagging

    Authors: Haohe Liu, Qiuqiang Kong, Xubo Liu, Xinhao Mei, Wenwu Wang, Mark D. Plumbley

    Abstract: This study defines a new evaluation metric for audio tagging tasks to overcome the limitation of the conventional mean average precision (mAP) metric, which treats different kinds of sound as independent classes without considering their relations. Also, due to the ambiguities in sound labeling, the labels in the training and evaluation set are not guaranteed to be accurate and exhaustive, which p…

    Submitted 22 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023. The code is open-sourced at https://github.com/haoheliu/ontology-aware-audio-tagging

    Journal ref: Proc. Interspeech 2023

  35. arXiv:2211.04258  [pdf, other]

    eess.SP eess.SY

    MetaLoc: Learning to Learn Wireless Localization

    Authors: Jun Gao, Dongze Wu, Feng Yin, Qinglei Kong, Lexi Xu, Shuguang Cui

    Abstract: Existing localization methods that intensively leverage the environment-specific received signal strength (RSS) or channel state information (CSI) of wireless signals are rather accurate in certain environments. However, these methods, whether based on pure statistical signal processing or data-driven approaches, often struggle to generalize to new environments, which results in considerable time…

    Submitted 29 August, 2023; v1 submitted 8 November, 2022; originally announced November 2022.

    Comments: to be published in IEEE JSAC (Special Issue: 5G/6G Precise Positioning on Cooperative Intelligent Transportation Systems (C-ITS) and Connected Automated Vehicles (CAV))

  36. arXiv:2211.02301  [pdf, other]

    cs.SD cs.AI eess.AS

    Binaural Rendering of Ambisonic Signals by Neural Networks

    Authors: Yin Zhu, Qiuqiang Kong, Junjie Shi, Shilei Liu, Xuzhou Ye, Ju-chiang Wang, Junping Zhang

    Abstract: Binaural rendering of ambisonic signals is of broad interest to virtual reality and immersive media. Conventional methods often require manually measured Head-Related Transfer Functions (HRTFs). To address this issue, we collect a paired ambisonic-binaural dataset and propose a deep learning framework in an end-to-end manner. Experimental results show that neural networks outperform the convention…

    Submitted 4 November, 2022; originally announced November 2022.

  37. arXiv:2210.16428  [pdf, other]

    eess.AS cs.AI cs.MM cs.SD

    Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention

    Authors: Xubo Liu, Qiushi Huang, Xinhao Mei, Haohe Liu, Qiuqiang Kong, Jianyuan Sun, Shengchen Li, Tom Ko, Yu Zhang, Lilian H. Tang, Mark D. Plumbley, Volkan Kılıç, Wenwu Wang

    Abstract: Audio captioning aims to generate text descriptions of audio clips. In the real world, many objects produce similar sounds. How to accurately recognize ambiguous sounds is a major challenge for audio captioning. In this work, inspired by inherent human multimodal perception, we propose visually-aware audio captioning, which makes use of visual information to help the description of ambiguous sound…

    Submitted 28 May, 2023; v1 submitted 28 October, 2022; originally announced October 2022.

    Comments: INTERSPEECH 2023

  38. arXiv:2210.15158  [pdf, other]

    eess.AS cs.SD

    Streaming Voice Conversion Via Intermediate Bottleneck Features And Non-streaming Teacher Guidance

    Authors: Yuanzhe Chen, Ming Tu, Tang Li, Xin Li, Qiuqiang Kong, Jiaxin Li, Zhichao Wang, Qiao Tian, Yuping Wang, Yuxuan Wang

    Abstract: Streaming voice conversion (VC) is the task of converting the voice of one person to another in real-time. Previous streaming VC methods use phonetic posteriorgrams (PPGs) extracted from automatic speech recognition (ASR) systems to represent speaker-independent information. However, PPGs lack the prosody and vocalization information of the source speaker, and streaming PPGs contain undesired leak…

    Submitted 26 October, 2022; originally announced October 2022.

    Comments: The paper has been submitted to ICASSP2023

  39. arXiv:2210.12345  [pdf, other]

    cs.SD eess.AS

    Neural Sound Field Decomposition with Super-resolution of Sound Direction

    Authors: Qiuqiang Kong, Shilei Liu, Junjie Shi, Xuzhou Ye, Yin Cao, Qiaoxi Zhu, Yong Xu, Yuxuan Wang

    Abstract: Sound field decomposition predicts waveforms in arbitrary directions using signals from a limited number of microphones as inputs. Sound field decomposition is fundamental to downstream tasks, including source localization, source separation, and spatial audio reproduction. Conventional sound field decomposition methods such as Ambisonics have limited spatial decomposition resolution. This paper p…

    Submitted 22 October, 2022; originally announced October 2022.

    Comments: 12 pages

  40. arXiv:2210.01719  [pdf, other]

    cs.SD cs.AI cs.MM eess.AS eess.SP

    Learning Temporal Resolution in Spectrogram for Audio Classification

    Authors: Haohe Liu, Xubo Liu, Qiuqiang Kong, Wenwu Wang, Mark D. Plumbley

    Abstract: The audio spectrogram is a time-frequency representation that has been widely used for audio classification. One of the key attributes of the audio spectrogram is the temporal resolution, which depends on the hop size used in the Short-Time Fourier Transform (STFT). Previous works generally assume the hop size should be a constant value (e.g., 10 ms). However, a fixed temporal resolution is not al…

    Submitted 12 January, 2024; v1 submitted 4 October, 2022; originally announced October 2022.

    Comments: Accepted by the 38th Annual AAAI Conference on Artificial Intelligence

  41. arXiv:2210.00943  [pdf, other]

    eess.AS cs.AI cs.SD eess.SP

    Simple Pooling Front-ends For Efficient Audio Classification

    Authors: Xubo Liu, Haohe Liu, Qiuqiang Kong, Xinhao Mei, Mark D. Plumbley, Wenwu Wang

    Abstract: Recently, there has been increasing interest in building efficient audio neural networks for on-device scenarios. Most existing approaches are designed to reduce the size of audio neural networks using methods such as model pruning. In this work, we show that instead of reducing model size using complex methods, eliminating the temporal redundancy in the input audio features (e.g., mel-spectrogram…

    Submitted 6 May, 2023; v1 submitted 3 October, 2022; originally announced October 2022.

    Comments: ICASSP 2023

  42. arXiv:2209.01802  [pdf, other]

    eess.AS cs.SD

    Sound Event Localization and Detection for Real Spatial Sound Scenes: Event-Independent Network and Data Augmentation Chains

    Authors: Jinbo Hu, Yin Cao, Ming Wu, Qiuqiang Kong, Feiran Yang, Mark D. Plumbley, Jun Yang

    Abstract: Sound event localization and detection (SELD) is a joint task of sound event detection and direction-of-arrival estimation. In DCASE 2022 Task 3, types of data transform from computationally generated spatial recordings to recordings of real-sound scenes. Our system submitted to the DCASE 2022 Task 3 is based on our previous proposed Event-Independent Network V2 (EINV2) with a novel data augmentat…

    Submitted 9 September, 2022; v1 submitted 5 September, 2022; originally announced September 2022.

    Comments: Submitted to DCASE 2022 Workshop. Code is available at https://github.com/Jinbo-Hu/DCASE2022-TASK3

  43. arXiv:2207.10547  [pdf, other]

    cs.SD eess.AS

    Surrey System for DCASE 2022 Task 5: Few-shot Bioacoustic Event Detection with Segment-level Metric Learning

    Authors: Haohe Liu, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Wenwu Wang, Mark D. Plumbley

    Abstract: Few-shot audio event detection is a task that detects the occurrence time of a novel sound class given a few examples. In this work, we propose a system based on segment-level metric learning for the DCASE 2022 challenge of few-shot bioacoustic event detection (task 5). We make better utilization of the negative data within each sound class to build the loss function, and use transductive inferenc…

    Submitted 21 July, 2022; originally announced July 2022.

    Comments: Technical Report of the system that ranks 2nd in the DCASE Challenge Task 5. arXiv admin note: text overlap with arXiv:2207.07773

  44. arXiv:2207.07773  [pdf, other]

    eess.AS cs.AI cs.SD eess.SP

    Segment-level Metric Learning for Few-shot Bioacoustic Event Detection

    Authors: Haohe Liu, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Wenwu Wang, Mark D. Plumbley

    Abstract: Few-shot bioacoustic event detection is a task that detects the occurrence time of a novel sound given a few examples. Previous methods employ metric learning to build a latent space with the labeled part of different sound classes, also known as positive events. In this study, we propose a segment-level few-shot learning framework that utilizes both the positive and negative events during model o…

    Submitted 15 July, 2022; originally announced July 2022.

    Comments: 2nd place in the DCASE 2022 Challenge Task 5. Submitted to the DCASE 2022 workshop

  45. arXiv:2206.10805  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    Jointist: Joint Learning for Multi-instrument Transcription and Its Applications

    Authors: Kin Wai Cheuk, Keunwoo Choi, Qiuqiang Kong, Bochen Li, Minz Won, Amy Hung, Ju-Chiang Wang, Dorien Herremans

    Abstract: In this paper, we introduce Jointist, an instrument-aware multi-instrument framework that is capable of transcribing, recognizing, and separating multiple musical instruments from an audio clip. Jointist consists of the instrument recognition module that conditions the other modules: the transcription module that outputs instrument-specific piano rolls, and the source separation module that utiliz…

    Submitted 28 June, 2022; v1 submitted 21 June, 2022; originally announced June 2022.

    Comments: Submitted to ISMIR

  46. VoiceFixer: A Unified Framework for High-Fidelity Speech Restoration

    Authors: Haohe Liu, Xubo Liu, Qiuqiang Kong, Qiao Tian, Yan Zhao, DeLiang Wang, Chuanzeng Huang, Yuxuan Wang

    Abstract: Speech restoration aims to remove distortions in speech signals. Prior methods mainly focus on a single type of distortion, such as speech denoising or dereverberation. However, speech signals can be degraded by several different distortions simultaneously in the real world. It is thus important to extend speech restoration models to deal with multiple distortions. In this paper, we introduce Voic…

    Submitted 17 April, 2022; v1 submitted 12 April, 2022; originally announced April 2022.

    Comments: Submitted to INTERSPEECH 2022

    Journal ref: Proc. Interspeech 2022

  47. arXiv:2203.15147  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD eess.SP

    Separate What You Describe: Language-Queried Audio Source Separation

    Authors: Xubo Liu, Haohe Liu, Qiuqiang Kong, Xinhao Mei, Jinzheng Zhao, Qiushi Huang, Mark D. Plumbley, Wenwu Wang

    Abstract: In this paper, we introduce the task of language-queried audio source separation (LASS), which aims to separate a target source from an audio mixture based on a natural language query of the target source (e.g., "a man tells a joke followed by people laughing"). A unique challenge in LASS is associated with the complexity of natural language description and its relation with the audio sources. To…

    Submitted 28 March, 2022; originally announced March 2022.

    Comments: Submitted to INTERSPEECH 2022, 5 pages, 3 figures

  48. arXiv:2203.14941  [pdf, other]

    eess.AS cs.AI cs.LG cs.SD eess.SP

    Neural Vocoder is All You Need for Speech Super-resolution

    Authors: Haohe Liu, Woosung Choi, Xubo Liu, Qiuqiang Kong, Qiao Tian, DeLiang Wang

    Abstract: Speech super-resolution (SR) is a task to increase speech sampling rate by generating high-frequency components. Existing speech SR methods are trained in constrained experimental settings, such as a fixed upsampling ratio. These strong constraints can potentially lead to poor generalization ability in mismatched real-world cases. In this paper, we propose a neural vocoder based speech super-resol…

    Submitted 28 March, 2022; originally announced March 2022.

    Comments: Submitted to INTERSPEECH 2022

    Journal ref: Proc. Interspeech 2022

  49. arXiv:2203.10228  [pdf, other]

    cs.SD eess.AS

    A Track-Wise Ensemble Event Independent Network for Polyphonic Sound Event Localization and Detection

    Authors: Jinbo Hu, Yin Cao, Ming Wu, Qiuqiang Kong, Feiran Yang, Mark D. Plumbley, Jun Yang

    Abstract: Polyphonic sound event localization and detection (SELD) aims at detecting types of sound events with corresponding temporal activities and spatial locations. In this paper, a track-wise ensemble event independent network with a novel data augmentation method is proposed. The proposed model is based on our previous proposed Event-Independent Network V2 and is extended by conformer blocks and dense…

    Submitted 18 March, 2022; originally announced March 2022.

    Comments: 6 pages, 2 figures, submitted to IEEE ICASSP 2022

  50. arXiv:2112.04685  [pdf, other]

    cs.SD cs.AI eess.AS

    CWS-PResUNet: Music Source Separation with Channel-wise Subband Phase-aware ResUNet

    Authors: Haohe Liu, Qiuqiang Kong, Jiafeng Liu

    Abstract: Music source separation (MSS) shows active progress with deep learning models in recent years. Many MSS models perform separations on spectrograms by estimating bounded ratio masks and reusing the phases of the mixture. When using convolutional neural networks (CNN), weights are usually shared within a spectrogram during convolution regardless of the different patterns between frequency bands. In…

    Submitted 8 December, 2021; originally announced December 2021.

    Comments: Published at MDX Workshop @ ISMIR 2021