Showing 1–12 of 12 results for author: Wong, J H

  1. arXiv:2506.11403  [pdf, ps, other]

    cs.SD cs.AI eess.AS

    A correlation-permutation approach for speech-music encoders model merging

    Authors: Fabian Ritter-Gutierrez, Yi-Cheng Lin, Jeremy H. M Wong, Hung-yi Lee, Eng Siong Chng, Nancy F. Chen

    Abstract: Creating a unified speech and music model requires expensive pre-training. Model merging can instead create a unified audio model with minimal computational expense. However, direct merging is challenging when the models are not aligned in the weight space. Motivated by Git Re-Basin, we introduce a correlation-permutation approach that aligns a music encoder's internal layers with a speech encode…

    Submitted 12 June, 2025; originally announced June 2025.

    Comments: Under review

  2. arXiv:2506.06820  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Beyond Classification: Towards Speech Emotion Reasoning with Multitask AudioLLMs

    Authors: Wenyu Zhang, Yingxu He, Geyu Lin, Zhuohan Liu, Shuo Sun, Bin Wang, Xunlong Zou, Jeremy H. M. Wong, Qiongqiong Wang, Hardik B. Sailor, Nancy F. Chen, Ai Ti Aw

    Abstract: Audio Large Language Models (AudioLLMs) have achieved strong results in semantic tasks like speech recognition and translation, but remain limited in modeling paralinguistic cues such as emotion. Existing approaches often treat emotion understanding as a classification problem, offering little insight into the underlying rationale behind predictions. In this work, we explore emotion reasoning, a s…

    Submitted 7 June, 2025; originally announced June 2025.

  3. arXiv:2505.13270  [pdf, ps, other]

    cs.SD eess.AS

    Distilling a speech and music encoder with task arithmetic

    Authors: Fabian Ritter-Gutierrez, Yi-Cheng Lin, Jui-Chiang Wei, Jeremy H. M Wong, Eng Siong Chng, Nancy F. Chen, Hung-yi Lee

    Abstract: Despite the progress in self-supervised learning (SSL) for speech and music, existing models treat these domains separately, limiting their capacity for unified audio understanding. A unified model is desirable for applications that require general representations, e.g. audio large language models. Nonetheless, directly training a general model for speech and music is computationally expensive. Kn…

    Submitted 19 May, 2025; originally announced May 2025.

    Comments: Accepted at INTERSPEECH 2025

  4. arXiv:2502.20311  [pdf, other]

    cs.LG cs.SD eess.AS

    Adapting Automatic Speech Recognition for Accented Air Traffic Control Communications

    Authors: Marcus Yu Zhe Wee, Justin Juin Hng Wong, Lynus Lim, Joe Yu Wei Tan, Prannaya Gupta, Dillion Lim, En Hao Tew, Aloysius Keng Siew Han, Yong Zhi Lim

    Abstract: Effective communication in Air Traffic Control (ATC) is critical to maintaining aviation safety, yet the challenges posed by accented English remain largely unaddressed in Automatic Speech Recognition (ASR) systems. Existing models struggle with transcription accuracy for Southeast Asian-accented (SEA-accented) speech, particularly in noisy ATC environments. This study presents the development of…

    Submitted 27 February, 2025; originally announced February 2025.

  5. arXiv:2412.11538  [pdf, other]

    cs.CL cs.AI eess.AS

    MERaLiON-SpeechEncoder: Towards a Speech Foundation Model for Singapore and Beyond

    Authors: Muhammad Huzaifah, Geyu Lin, Tianchi Liu, Hardik B. Sailor, Kye Min Tan, Tarun K. Vangani, Qiongqiong Wang, Jeremy H. M. Wong, Nancy F. Chen, Ai Ti Aw

    Abstract: This technical report describes the MERaLiON-SpeechEncoder, a foundation model designed to support a wide range of downstream speech applications. Developed as part of Singapore's National Multimodal Large Language Model Programme, the MERaLiON-SpeechEncoder is tailored to address the speech processing needs in Singapore and the surrounding Southeast Asian region. The model currently supports main…

    Submitted 20 December, 2024; v1 submitted 16 December, 2024; originally announced December 2024.

  6. arXiv:2409.14666  [pdf, other]

    cs.AI

    Semi-supervised Learning For Robust Speech Evaluation

    Authors: Huayun Zhang, Jeremy H. M. Wong, Geyu Lin, Nancy F. Chen

    Abstract: Speech evaluation measures a learner's oral proficiency using automatic models. Corpora for training such models often pose sparsity challenges given that there often is limited scored data from teachers, in addition to the score distribution across proficiency levels being often imbalanced among student cohorts. Automatic scoring is thus not robust when faced with under-represented samples or out-…

    Submitted 22 September, 2024; originally announced September 2024.

    Comments: 6 pages

  7. arXiv:2406.02963  [pdf, other]

    cs.SD eess.AS

    Dataset-Distillation Generative Model for Speech Emotion Recognition

    Authors: Fabian Ritter-Gutierrez, Kuan-Po Huang, Jeremy H. M Wong, Dianwen Ng, Hung-yi Lee, Nancy F. Chen, Eng Siong Chng

    Abstract: Deep learning models for speech rely on large datasets, presenting computational challenges. Yet, performance hinges on training data size. Dataset Distillation (DD) aims to learn a smaller dataset without much performance degradation when training with it. DD has been investigated in computer vision but not yet in speech. This paper presents the first approach for DD to speech targeting Speech Em…

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: Accepted at Interspeech 2024

  8. arXiv:2312.12153  [pdf, other]

    cs.SD eess.AS

    Noise robust distillation of self-supervised speech models via correlation metrics

    Authors: Fabian Ritter-Gutierrez, Kuan-Po Huang, Dianwen Ng, Jeremy H. M. Wong, Hung-yi Lee, Eng Siong Chng, Nancy F. Chen

    Abstract: Compared to large speech foundation models, small distilled models exhibit degraded noise robustness. The student's robustness can be improved by introducing noise at the inputs during pre-training. Despite this, using the standard distillation loss still yields a student with degraded performance. Thus, this paper proposes improving student robustness via distillation with correlation metrics. Te…

    Submitted 19 December, 2023; originally announced December 2023.

    Comments: 6 pages

  9. arXiv:2306.02719  [pdf, ps, other]

    cs.CL cs.LG cs.SD eess.AS

    Multiple output samples per input in a single-output Gaussian process

    Authors: Jeremy H. M. Wong, Huayun Zhang, Nancy F. Chen

    Abstract: The standard Gaussian Process (GP) only considers a single output sample per input in the training set. Datasets for subjective tasks, such as spoken language assessment, may be annotated with output labels from multiple human raters per input. This paper proposes to generalise the GP to allow for these multiple output samples in the training set, and thus make use of available output uncertainty…

    Submitted 25 January, 2024; v1 submitted 5 June, 2023; originally announced June 2023.

    Comments: This paper is presented in the "Symposium for Celebrating 40 Years of Bayesian Learning in Speech and Language Processing and Beyond", which is a satellite event of the ASRU workshop, on 20 December 2023. https://bayesian40.github.io/

  10. arXiv:2109.11140  [pdf, other]

    cs.SD cs.AI cs.CL cs.LG

    Joint speaker diarisation and tracking in switching state-space model

    Authors: Jeremy H. M. Wong, Yifan Gong

    Abstract: Speakers may move around while diarisation is being performed. When a microphone array is used, the instantaneous locations of where the sounds originated from can be estimated, and previous investigations have shown that such information can be complementary to speaker embeddings in the diarisation task. However, these approaches often assume that speakers are fairly stationary throughout a meeti…

    Submitted 23 September, 2021; originally announced September 2021.

  11. arXiv:2109.10598  [pdf, other]

    cs.LG cs.CL cs.SD eess.AS

    Diarisation using location tracking with agglomerative clustering

    Authors: Jeremy H. M. Wong, Igor Abramovski, Xiong Xiao, Yifan Gong

    Abstract: Previous works have shown that spatial location information can be complementary to speaker embeddings for a speaker diarisation task. However, the models used often assume that speakers are fairly stationary throughout a meeting. This paper proposes to relax this assumption, by explicitly modelling the movements of speakers within an Agglomerative Hierarchical Clustering (AHC) diarisation framewo…

    Submitted 23 September, 2021; v1 submitted 22 September, 2021; originally announced September 2021.

  12. arXiv:2003.07482  [pdf, other]

    eess.AS cs.CL cs.SD

    High-Accuracy and Low-Latency Speech Recognition with Two-Head Contextual Layer Trajectory LSTM Model

    Authors: Jinyu Li, Rui Zhao, Eric Sun, Jeremy H. M. Wong, Amit Das, Zhong Meng, Yifan Gong

    Abstract: While the community keeps promoting end-to-end models over conventional hybrid models, which usually are long short-term memory (LSTM) models trained with a cross entropy criterion followed by a sequence discriminative training criterion, we argue that such conventional hybrid models can still be significantly improved. In this paper, we detail our recent efforts to improve conventional hybrid LST…

    Submitted 16 March, 2020; originally announced March 2020.

    Comments: Accepted by ICASSP 2020