
Showing 1–16 of 16 results for author: Zezario, R E

Searching in archive eess.
  1. arXiv:2506.21951

    eess.AS

    HighRateMOS: Sampling-Rate Aware Modeling for Speech Quality Assessment

    Authors: Wenze Ren, Yi-Cheng Lin, Wen-Chin Huang, Ryandhimas E. Zezario, Szu-Wei Fu, Sung-Feng Huang, Erica Cooper, Haibin Wu, Hung-Yu Wei, Hsin-Min Wang, Hung-yi Lee, Yu Tsao

    Abstract: Modern speech quality prediction models are trained on audio data resampled to a specific sampling rate. When faced with higher-rate audio at test time, these models can produce biased scores. We introduce HighRateMOS, the first non-intrusive mean opinion score (MOS) model that explicitly considers sampling rate. HighRateMOS ensembles three model variants that exploit the following information: (i…

    Submitted 27 June, 2025; originally announced June 2025.

    Comments: Under review; 3 pages + 1 page of references

  2. arXiv:2506.09549

    eess.AS cs.SD eess.SP

    A Study on Speech Assessment with Visual Cues

    Authors: Shafique Ahmed, Ryandhimas E. Zezario, Nasir Saleem, Amir Hussain, Hsin-Min Wang, Yu Tsao

    Abstract: Non-intrusive assessment of speech quality and intelligibility is essential when clean reference signals are unavailable. In this work, we propose a multimodal framework that integrates audio features and visual cues to predict PESQ and STOI scores. It employs a dual-branch architecture, where spectral features are extracted using STFT, and visual embeddings are obtained via a visual encoder. Thes…

    Submitted 11 June, 2025; originally announced June 2025.

    Comments: Accepted to Interspeech 2025

  3. arXiv:2502.10822

    eess.AS cs.AI cs.SD

    NeuroAMP: A Novel End-to-end General Purpose Deep Neural Amplifier for Personalized Hearing Aids

    Authors: Shafique Ahmed, Ryandhimas E. Zezario, Hui-Guan Yuan, Amir Hussain, Hsin-Min Wang, Wei-Ho Chung, Yu Tsao

    Abstract: The prevalence of hearing aids is increasing. However, optimizing the amplification processes of hearing aids remains challenging due to the complexity of integrating multiple modular components in traditional methods. To address this challenge, we present NeuroAMP, a novel deep neural network designed for end-to-end, personalized amplification in hearing aids. NeuroAMP leverages both spectral fea…

    Submitted 15 February, 2025; originally announced February 2025.

  4. arXiv:2409.09914

    eess.AS cs.SD

    A Study on Zero-shot Non-intrusive Speech Assessment using Large Language Models

    Authors: Ryandhimas E. Zezario, Sabato M. Siniscalchi, Hsin-Min Wang, Yu Tsao

    Abstract: This work investigates two strategies for zero-shot non-intrusive speech assessment leveraging large language models. First, we explore the audio analysis capabilities of GPT-4o. Second, we propose GPT-Whisper, which uses Whisper as an audio-to-text module and evaluates the naturalness of text via targeted prompt engineering. We evaluate the assessment metrics predicted by GPT-4o and GPT-Whisper,…

    Submitted 20 January, 2025; v1 submitted 15 September, 2024; originally announced September 2024.

    Comments: Accepted to IEEE ICASSP 2025

  5. arXiv:2409.07001

    cs.SD eess.AS

    The VoiceMOS Challenge 2024: Beyond Speech Quality Prediction

    Authors: Wen-Chin Huang, Szu-Wei Fu, Erica Cooper, Ryandhimas E. Zezario, Tomoki Toda, Hsin-Min Wang, Junichi Yamagishi, Yu Tsao

    Abstract: We present the third edition of the VoiceMOS Challenge, a scientific initiative designed to advance research into automatic prediction of human speech ratings. There were three tracks. The first track was on predicting the quality of "zoomed-in" high-quality samples from speech synthesis systems. The second track was to predict ratings of samples from singing voice synthesis and voice conversion…

    Submitted 11 September, 2024; originally announced September 2024.

    Comments: Accepted to SLT 2024

  6. arXiv:2401.01145

    eess.AS cs.LG cs.SD

    HAAQI-Net: A Non-intrusive Neural Music Audio Quality Assessment Model for Hearing Aids

    Authors: Dyah A. M. G. Wisnu, Stefano Rini, Ryandhimas E. Zezario, Hsin-Min Wang, Yu Tsao

    Abstract: This paper introduces HAAQI-Net, a non-intrusive deep learning-based music audio quality assessment model for hearing aid users. Unlike traditional methods like the Hearing Aid Audio Quality Index (HAAQI) that require intrusive reference signal comparisons, HAAQI-Net offers a more accessible and computationally efficient alternative. By utilizing a Bidirectional Long Short-Term Memory (BLSTM) arch…

    Submitted 9 January, 2025; v1 submitted 2 January, 2024; originally announced January 2024.

    Comments: Accepted by IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), 2025

  7. arXiv:2309.12766

    eess.AS cs.SD

    A Study on Incorporating Whisper for Robust Speech Assessment

    Authors: Ryandhimas E. Zezario, Yu-Wen Chen, Szu-Wei Fu, Yu Tsao, Hsin-Min Wang, Chiou-Shann Fuh

    Abstract: This research introduces an enhanced version of the multi-objective speech assessment model, MOSA-Net+, by leveraging the acoustic features from Whisper, a large-scale weakly supervised model. We first investigate the effectiveness of Whisper in deploying a more robust speech assessment model. After that, we explore combining representations from Whisper and SSL models. The experimental results r…

    Submitted 29 April, 2024; v1 submitted 22 September, 2023; originally announced September 2023.

    Comments: Accepted to IEEE ICME 2024

  8. arXiv:2309.09548

    eess.AS cs.LG cs.SD

    Non-Intrusive Speech Intelligibility Prediction for Hearing Aids using Whisper and Metadata

    Authors: Ryandhimas E. Zezario, Fei Chen, Chiou-Shann Fuh, Hsin-Min Wang, Yu Tsao

    Abstract: Automated speech intelligibility assessment is pivotal for hearing aid (HA) development. In this paper, we present three novel methods to improve intelligibility prediction accuracy and introduce MBI-Net+, an enhanced version of MBI-Net, the top-performing system in the 1st Clarity Prediction Challenge. MBI-Net+ leverages Whisper's embeddings to create cross-domain acoustic features and includes m…

    Submitted 13 June, 2024; v1 submitted 18 September, 2023; originally announced September 2023.

    Comments: Accepted to Interspeech 2024

  9. arXiv:2308.09262

    eess.AS cs.LG cs.SD

    Multi-Task Pseudo-Label Learning for Non-Intrusive Speech Quality Assessment Model

    Authors: Ryandhimas E. Zezario, Bo-Ren Brian Bai, Chiou-Shann Fuh, Hsin-Min Wang, Yu Tsao

    Abstract: This study proposes a multi-task pseudo-label learning (MPL)-based non-intrusive speech quality assessment model called MTQ-Net. MPL consists of two stages: obtaining pseudo-label scores from a pretrained model and performing multi-task learning. The 3QUEST metrics, namely Speech-MOS (S-MOS), Noise-MOS (N-MOS), and General-MOS (G-MOS), are the assessment targets. The pretrained MOSA-Net model is u…

    Submitted 13 March, 2024; v1 submitted 17 August, 2023; originally announced August 2023.

    Comments: Accepted to IEEE ICASSP 2024

  10. arXiv:2204.03310

    eess.AS cs.LG cs.SD

    MTI-Net: A Multi-Target Speech Intelligibility Prediction Model

    Authors: Ryandhimas E. Zezario, Szu-wei Fu, Fei Chen, Chiou-Shann Fuh, Hsin-Min Wang, Yu Tsao

    Abstract: Recently, deep learning (DL)-based non-intrusive speech assessment models have attracted great attention. Many studies report that these DL-based models yield satisfactory assessment performance and good flexibility, but their performance in unseen environments remains a challenge. Furthermore, compared to quality scores, fewer studies have developed deep learning models to estimate intelligibility sco…

    Submitted 30 August, 2022; v1 submitted 7 April, 2022; originally announced April 2022.

    Comments: Accepted to Interspeech 2022

  11. arXiv:2204.03305

    eess.AS cs.LG cs.SD

    MBI-Net: A Non-Intrusive Multi-Branched Speech Intelligibility Prediction Model for Hearing Aids

    Authors: Ryandhimas E. Zezario, Fei Chen, Chiou-Shann Fuh, Hsin-Min Wang, Yu Tsao

    Abstract: Improving the user's hearing ability to understand speech in noisy environments is critical to the development of hearing aid (HA) devices. For this, it is important to derive a metric that can fairly predict speech intelligibility for HA users. A straightforward approach is to conduct a subjective listening test and use the test results as an evaluation metric. However, conducting large-scale lis…

    Submitted 30 August, 2022; v1 submitted 7 April, 2022; originally announced April 2022.

    Comments: Accepted to Interspeech 2022

  12. arXiv:2111.02363

    eess.AS cs.LG cs.SD

    Deep Learning-based Non-Intrusive Multi-Objective Speech Assessment Model with Cross-Domain Features

    Authors: Ryandhimas E. Zezario, Szu-Wei Fu, Fei Chen, Chiou-Shann Fuh, Hsin-Min Wang, Yu Tsao

    Abstract: In this study, we propose a cross-domain multi-objective speech assessment model called MOSA-Net, which can estimate multiple speech assessment metrics simultaneously. Experimental results show that MOSA-Net can improve the linear correlation coefficient (LCC) by 0.026 (0.990 vs 0.964 in seen noise environments) and 0.012 (0.969 vs 0.957 in unseen noise environments) in perceptual evaluation of sp…

    Submitted 19 December, 2024; v1 submitted 3 November, 2021; originally announced November 2021.

    Comments: Accepted by IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 31, pp. 54-70, 2023

  13. arXiv:2012.09359

    eess.AS cs.LG cs.SD

    Speech Enhancement with Zero-Shot Model Selection

    Authors: Ryandhimas E. Zezario, Chiou-Shann Fuh, Hsin-Min Wang, Yu Tsao

    Abstract: Recent research on speech enhancement (SE) has seen the emergence of deep-learning-based methods. It is still a challenging task to determine the effective ways to increase the generalizability of SE under diverse test conditions. In this study, we combine zero-shot learning and ensemble learning to propose a zero-shot model selection (ZMOS) approach to increase the generalization of SE performanc…

    Submitted 31 August, 2021; v1 submitted 16 December, 2020; originally announced December 2020.

    Comments: Accepted to EUSIPCO 2021

  14. arXiv:2011.04292

    cs.SD cs.LG eess.AS

    STOI-Net: A Deep Learning based Non-Intrusive Speech Intelligibility Assessment Model

    Authors: Ryandhimas E. Zezario, Szu-Wei Fu, Chiou-Shann Fuh, Yu Tsao, Hsin-Min Wang

    Abstract: The calculation of most objective speech intelligibility assessment metrics requires clean speech as a reference. Such a requirement may limit the applicability of these metrics in real-world scenarios. To overcome this limitation, we propose a deep learning-based non-intrusive speech intelligibility assessment model, namely STOI-Net. The input and output of STOI-Net are speech spectral features a…

    Submitted 9 November, 2020; originally announced November 2020.

    Comments: Accepted to APSIPA 2020

  15. arXiv:2006.10296

    eess.AS cs.LG cs.SD

    Boosting Objective Scores of a Speech Enhancement Model by MetricGAN Post-processing

    Authors: Szu-Wei Fu, Chien-Feng Liao, Tsun-An Hsieh, Kuo-Hsuan Hung, Syu-Siang Wang, Cheng Yu, Heng-Cheng Kuo, Ryandhimas E. Zezario, You-Jin Li, Shang-Yi Chuang, Yen-Ju Lu, Yu Tsao

    Abstract: The Transformer architecture has demonstrated a superior ability compared to recurrent neural networks in many different natural language processing applications. Therefore, our study applies a modified Transformer in a speech enhancement task. Specifically, positional encoding in the Transformer may not be necessary for speech enhancement, and hence, it is replaced by convolutional layers. To fur…

    Submitted 3 March, 2021; v1 submitted 18 June, 2020; originally announced June 2020.

    Comments: Accepted to APSIPA 2020

  16. arXiv:2001.01538

    eess.AS cs.SD

    Speech Enhancement based on Denoising Autoencoder with Multi-branched Encoders

    Authors: Cheng Yu, Ryandhimas E. Zezario, Syu-Siang Wang, Jonathan Sherman, Yi-Yen Hsieh, Xugang Lu, Hsin-Min Wang, Yu Tsao

    Abstract: Deep learning-based models have greatly advanced the performance of speech enhancement (SE) systems. However, two problems remain unsolved, which are closely related to model generalizability to noisy conditions: (1) mismatched noisy condition during testing, i.e., the performance is generally sub-optimal when models are tested with unseen noise types that are not involved in the training data; (2…

    Submitted 24 December, 2020; v1 submitted 6 January, 2020; originally announced January 2020.