Showing 1–14 of 14 results for author: Lam, M W Y

  1. arXiv:2503.19611

    cs.SD cs.AI cs.MM eess.AS eess.SP

    Analyzable Chain-of-Musical-Thought Prompting for High-Fidelity Music Generation

    Authors: Max W. Y. Lam, Yijin Xing, Weiya You, Jingcheng Wu, Zongyu Yin, Fuqiang Jiang, Hangyu Liu, Feng Liu, Xingda Li, Wei-Tsung Lu, Hanyu Chen, Tong Feng, Tianwei Zhao, Chien-Hung Liu, Xuchen Song, Yang Li, Yahui Zhou

    Abstract: Autoregressive (AR) models have demonstrated impressive capabilities in generating high-fidelity music. However, the conventional next-token prediction paradigm in AR models does not align with the human creative process in music composition, potentially compromising the musicality of generated samples. To overcome this limitation, we introduce MusiCoT, a novel chain-of-thought (CoT) prompting tec…

    Submitted 25 March, 2025; originally announced March 2025.

    Comments: Preprint

  2. arXiv:2409.06029

    cs.SD cs.AI eess.AS

    SongCreator: Lyrics-based Universal Song Generation

    Authors: Shun Lei, Yixuan Zhou, Boshi Tang, Max W. Y. Lam, Feng Liu, Hangyu Liu, Jingcheng Wu, Shiyin Kang, Zhiyong Wu, Helen Meng

    Abstract: Music is an integral part of human culture, embodying human intelligence and creativity, of which songs compose an essential part. While various aspects of song generation have been explored by previous works, such as singing voice, vocal composition and instrumental arrangement, etc., generating songs with both vocals and accompaniment given lyrics remains a significant challenge, hindering the a…

    Submitted 30 October, 2024; v1 submitted 9 September, 2024; originally announced September 2024.

    Comments: Accepted by NeurIPS 2024

  3. arXiv:2408.14340

    cs.SD cs.AI cs.CL cs.LG eess.AS

    Foundation Models for Music: A Survey

    Authors: Yinghao Ma, Anders Øland, Anton Ragni, Bleiz MacSen Del Sette, Charalampos Saitis, Chris Donahue, Chenghua Lin, Christos Plachouras, Emmanouil Benetos, Elona Shatri, Fabio Morreale, Ge Zhang, György Fazekas, Gus Xia, Huan Zhang, Ilaria Manco, Jiawen Huang, Julien Guinot, Liwei Lin, Luca Marinelli, Max W. Y. Lam, Megha Sharma, Qiuqiang Kong, Roger B. Dannenberg, Ruibin Yuan, et al. (17 additional authors not shown)

    Abstract: In recent years, foundation models (FMs) such as large language models (LLMs) and latent diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This comprehensive review examines state-of-the-art (SOTA) pre-trained models and foundation models in music, spanning from representation learning, generative learning and multimodal learning. We first contextualise the signifi…

    Submitted 3 September, 2024; v1 submitted 26 August, 2024; originally announced August 2024.

  4. arXiv:2305.16749

    cs.SD eess.AS

    Diverse and Expressive Speech Prosody Prediction with Denoising Diffusion Probabilistic Model

    Authors: Xiang Li, Songxiang Liu, Max W. Y. Lam, Zhiyong Wu, Chao Weng, Helen Meng

    Abstract: Expressive human speech generally abounds with rich and flexible speech prosody variations. The speech prosody predictors in existing expressive speech synthesis methods mostly produce deterministic predictions, which are learned by directly minimizing the norm of prosody prediction error. Its unimodal nature leads to a mismatch with ground truth distribution and harms the model's ability in makin…

    Submitted 7 October, 2023; v1 submitted 26 May, 2023; originally announced May 2023.

    Comments: Proceedings of Interspeech 2023 (doi: 10.21437/Interspeech.2023-715), demo site at https://thuhcsi.github.io/interspeech2023-DiffVar/

  5. arXiv:2305.15719

    cs.SD cs.AI cs.LG eess.AS

    Efficient Neural Music Generation

    Authors: Max W. Y. Lam, Qiao Tian, Tang Li, Zongyu Yin, Siyuan Feng, Ming Tu, Yuliang Ji, Rui Xia, Mingbo Ma, Xuchen Song, Jitong Chen, Yuping Wang, Yuxuan Wang

    Abstract: Recent progress in music generation has been remarkably advanced by the state-of-the-art MusicLM, which comprises a hierarchy of three LMs, respectively, for semantic, coarse acoustic, and fine acoustic modelings. Yet, sampling with the MusicLM requires processing through these LMs one by one to obtain the fine-grained acoustic tokens, making it computationally expensive and prohibitive for a real…

    Submitted 25 May, 2023; originally announced May 2023.

  6. arXiv:2204.09934

    eess.AS cs.LG cs.SD

    FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis

    Authors: Rongjie Huang, Max W. Y. Lam, Jun Wang, Dan Su, Dong Yu, Yi Ren, Zhou Zhao

    Abstract: Denoising diffusion probabilistic models (DDPMs) have recently achieved leading performances in many generative tasks. However, the inherited iterative sampling process costs hindered their applications to speech synthesis. This paper proposes FastDiff, a fast conditional diffusion model for high-quality speech synthesis. FastDiff employs a stack of time-aware location-variable convolutions of div…

    Submitted 21 April, 2022; originally announced April 2022.

    Comments: Accepted by IJCAI 2022

  7. arXiv:2203.13508

    eess.AS cs.AI cs.LG cs.SD eess.SP

    BDDM: Bilateral Denoising Diffusion Models for Fast and High-Quality Speech Synthesis

    Authors: Max W. Y. Lam, Jun Wang, Dan Su, Dong Yu

    Abstract: Diffusion probabilistic models (DPMs) and their extensions have emerged as competitive generative models yet confront challenges of efficient sampling. We propose a new bilateral denoising diffusion model (BDDM) that parameterizes both the forward and reverse processes with a schedule network and a score network, which can train with a novel bilateral modeling objective. We show that the new surro…

    Submitted 25 March, 2022; originally announced March 2022.

    Comments: Accepted in ICLR 2022. arXiv admin note: text overlap with arXiv:2108.11514

    Journal ref: International Conference on Learning Representations 2022

  8. arXiv:2108.11514

    cs.LG cs.AI cs.SD eess.AS eess.SP

    Bilateral Denoising Diffusion Models

    Authors: Max W. Y. Lam, Jun Wang, Rongjie Huang, Dan Su, Dong Yu

    Abstract: Denoising diffusion probabilistic models (DDPMs) have emerged as competitive generative models yet brought challenges to efficient sampling. In this paper, we propose novel bilateral denoising diffusion models (BDDMs), which take significantly fewer steps to generate high-quality samples. From a bilateral modeling objective, BDDMs parameterize the forward and reverse processes with a score network…

    Submitted 14 September, 2021; v1 submitted 26 August, 2021; originally announced August 2021.

  9. arXiv:2106.04275

    cs.SD cs.AI eess.AS eess.SP

    Raw Waveform Encoder with Multi-Scale Globally Attentive Locally Recurrent Networks for End-to-End Speech Recognition

    Authors: Max W. Y. Lam, Jun Wang, Chao Weng, Dan Su, Dong Yu

    Abstract: End-to-end speech recognition generally uses hand-engineered acoustic features as input and excludes the feature extraction module from its joint optimization. To extract learnable and adaptive features and mitigate information loss, we propose a new encoder that adopts globally attentive locally recurrent (GALR) networks and directly takes raw waveform as input. We observe improved ASR performanc…

    Submitted 8 June, 2021; originally announced June 2021.

    Comments: Accepted in Interspeech 2021

  10. arXiv:2103.01461

    eess.AS cs.AI cs.LG cs.SD eess.SP

    Tune-In: Training Under Negative Environments with Interference for Attention Networks Simulating Cocktail Party Effect

    Authors: Jun Wang, Max W. Y. Lam, Dan Su, Dong Yu

    Abstract: We study the cocktail party problem and propose a novel attention network called Tune-In, abbreviated for training under negative environments with interference. It firstly learns two separate spaces of speaker-knowledge and speech-stimuli based on a shared feature space, where a new block structure is designed as the building block for all spaces, and then cooperatively solves different tasks. Be…

    Submitted 1 March, 2021; originally announced March 2021.

    Comments: Accepted in AAAI 2021

  11. arXiv:2103.00819

    eess.AS cs.AI cs.LG cs.SD eess.SP

    Sandglasset: A Light Multi-Granularity Self-attentive Network For Time-Domain Speech Separation

    Authors: Max W. Y. Lam, Jun Wang, Dan Su, Dong Yu

    Abstract: One of the leading single-channel speech separation (SS) models is based on a TasNet with a dual-path segmentation technique, where the size of each segment remains unchanged throughout all layers. In contrast, our key finding is that multi-granularity features are essential for enhancing contextual modeling and computational efficiency. We introduce a self-attentive network with a novel sandglass…

    Submitted 8 March, 2021; v1 submitted 1 March, 2021; originally announced March 2021.

    Comments: Accepted in ICASSP 2021

  12. arXiv:2103.00816

    eess.AS cs.AI cs.LG cs.SD eess.SP

    Contrastive Separative Coding for Self-supervised Representation Learning

    Authors: Jun Wang, Max W. Y. Lam, Dan Su, Dong Yu

    Abstract: To extract robust deep representations from long sequential modeling of speech data, we propose a self-supervised learning approach, namely Contrastive Separative Coding (CSC). Our key finding is to learn such representations by separating the target signal from contrastive interfering signals. First, a multi-task separative encoder is built to extract shared separable and discriminative embedding…

    Submitted 1 March, 2021; originally announced March 2021.

    Comments: Accepted in ICASSP 2021

  13. arXiv:2101.05014

    eess.AS cs.AI cs.LG cs.SD eess.SP

    Effective Low-Cost Time-Domain Audio Separation Using Globally Attentive Locally Recurrent Networks

    Authors: Max W. Y. Lam, Jun Wang, Dan Su, Dong Yu

    Abstract: Recent research on the time-domain audio separation networks (TasNets) has brought great success to speech separation. Nevertheless, conventional TasNets struggle to satisfy the memory and latency constraints in industrial applications. In this regard, we design a low-cost high-performance architecture, namely, globally attentive locally recurrent (GALR) network. Alike the dual-path RNN (DPRNN), w…

    Submitted 13 January, 2021; originally announced January 2021.

    Comments: Accepted in IEEE SLT 2021

  14. arXiv:1910.13253

    eess.AS cs.LG cs.SD

    Mixup-breakdown: a consistency training method for improving generalization of speech separation models

    Authors: Max W. Y. Lam, Jun Wang, Dan Su, Dong Yu

    Abstract: Deep-learning based speech separation models confront poor generalization problem that even the state-of-the-art models could abruptly fail when evaluating them in mismatch conditions. To address this problem, we propose an easy-to-implement yet effective consistency based semi-supervised learning (SSL) approach, namely Mixup-Breakdown training (MBT). It learns a teacher model to "breakdown" unlab…

    Submitted 3 March, 2020; v1 submitted 27 October, 2019; originally announced October 2019.

    Comments: Accepted in a Lesson session in ICASSP 2020