
Showing 1–50 of 141 results for author: Toda, T

  1. arXiv:2506.12059  [pdf, ps, other]

    eess.AS cs.AI cs.CL cs.SD

    CMT-LLM: Contextual Multi-Talker ASR Utilizing Large Language Models

    Authors: Jiajun He, Naoki Sawada, Koichi Miyazaki, Tomoki Toda

    Abstract: In real-world applications, automatic speech recognition (ASR) systems must handle overlapping speech from multiple speakers and recognize rare words like technical terms. Traditional methods address multi-talker ASR and contextual biasing separately, limiting performance in complex scenarios. We propose a unified framework that combines multi-talker overlapping speech recognition and contextual b…

    Submitted 31 May, 2025; originally announced June 2025.

    Comments: Accepted by INTERSPEECH 2025

  2. arXiv:2506.11064  [pdf, ps, other]

    eess.AS cs.AI cs.CL cs.SD

    PMF-CEC: Phoneme-augmented Multimodal Fusion for Context-aware ASR Error Correction with Error-specific Selective Decoding

    Authors: Jiajun He, Tomoki Toda

    Abstract: End-to-end automatic speech recognition (ASR) models often struggle to accurately recognize rare words. Previously, we introduced an ASR postprocessing method called error detection and context-aware error correction (ED-CEC), which leverages contextual information such as named entities and technical terms to improve the accuracy of ASR transcripts. Although ED-CEC achieves a notable success in c…

    Submitted 31 May, 2025; originally announced June 2025.

    Comments: Accepted by IEEE TASLP 2025

  3. arXiv:2506.03554  [pdf, ps, other]

    cs.SD eess.AS

    Comparative Analysis of Fast and High-Fidelity Neural Vocoders for Low-Latency Streaming Synthesis in Resource-Constrained Environments

    Authors: Reo Yoneyama, Masaya Kawamura, Ryo Terashima, Ryuichi Yamamoto, Tomoki Toda

    Abstract: In real-time speech synthesis, neural vocoders often require low-latency synthesis through causal processing and streaming. However, streaming introduces inefficiencies absent in batch synthesis, such as limited parallelism, inter-frame dependency management, and parameter loading overhead. This paper proposes multi-stream Wavehax (MS-Wavehax), an efficient neural vocoder for low-latency streaming…

    Submitted 4 June, 2025; originally announced June 2025.

    Comments: Accepted to Interspeech 2025

  4. arXiv:2506.00865  [pdf, ps, other]

    cs.AI cs.LG

    GIA-MIC: Multimodal Emotion Recognition with Gated Interactive Attention and Modality-Invariant Learning Constraints

    Authors: Jiajun He, Jinyi Mi, Tomoki Toda

    Abstract: Multimodal emotion recognition (MER) extracts emotions from multimodal data, including visual, speech, and text inputs, playing a key role in human-computer interaction. Attention-based fusion methods dominate MER research, achieving strong classification performance. However, two key challenges remain: effectively extracting modality-specific features and capturing cross-modal similarities despit…

    Submitted 1 June, 2025; originally announced June 2025.

    Comments: Accepted by INTERSPEECH 2025

  5. arXiv:2505.18982  [pdf, ps, other]

    cs.SD eess.AS

    Serial-OE: Anomalous sound detection based on serial method with outlier exposure capable of using small amounts of anomalous data for training

    Authors: Ibuki Kuroyanagi, Tomoki Hayashi, Kazuya Takeda, Tomoki Toda

    Abstract: We introduce Serial-OE, a new approach to anomalous sound detection (ASD) that leverages small amounts of anomalous data to improve the performance. Conventional ASD methods rely primarily on the modeling of normal data, due to the cost of collecting anomalous data from various possible types of equipment breakdowns. Our method improves upon existing ASD systems by implementing an outlier exposure…

    Submitted 25 May, 2025; originally announced May 2025.

    Comments: 39 pages, 5 figures, 5 tables, APSIPA Transactions on Signal and Information Processing

  6. arXiv:2505.18980  [pdf, ps, other]

    cs.SD eess.AS

    Improving Anomalous Sound Detection through Pseudo-anomalous Set Selection and Pseudo-label Utilization under Unlabeled Conditions

    Authors: Ibuki Kuroyanagi, Takuya Fujimura, Kazuya Takeda, Tomoki Toda

    Abstract: This paper addresses performance degradation in anomalous sound detection (ASD) when neither sufficiently similar machine data nor operational state labels are available. We present an integrated pipeline that combines three complementary components derived from prior work and extends them to the unlabeled ASD setting. First, we adapt an anomaly score based selector to curate external audio data r…

    Submitted 25 May, 2025; originally announced May 2025.

    Comments: 33 pages, 3 figures, 7 tables, APSIPA Transactions on Signal and Information Processing

  7. arXiv:2505.15061  [pdf, ps, other]

    cs.SD eess.AS

    SHEET: A Multi-purpose Open-source Speech Human Evaluation Estimation Toolkit

    Authors: Wen-Chin Huang, Erica Cooper, Tomoki Toda

    Abstract: We introduce SHEET, a multi-purpose open-source toolkit designed to accelerate subjective speech quality assessment (SSQA) research. SHEET stands for the Speech Human Evaluation Estimation Toolkit, which focuses on data-driven deep neural network-based models trained to predict human-labeled quality scores of speech samples. SHEET provides comprehensive training and evaluation scripts, multi-datas…

    Submitted 20 May, 2025; originally announced May 2025.

    Comments: INTERSPEECH 2025. Codebase: https://github.com/unilight/sheet

  8. arXiv:2503.18486  [pdf, other]

    cs.SD eess.AS

    Music Similarity Representation Learning Focusing on Individual Instruments with Source Separation and Human Preference

    Authors: Takehiro Imamura, Yuka Hashizume, Wen-Chin Huang, Tomoki Toda

    Abstract: This paper proposes music similarity representation learning (MSRL) based on individual instrument sounds (InMSRL) utilizing music source separation (MSS) and human preference without requiring clean instrument sounds during inference. We propose three methods that effectively improve performance. First, we introduce end-to-end fine-tuning (E2E-FT) for the Cascade approach that sequentially perfor…

    Submitted 24 March, 2025; originally announced March 2025.

  9. arXiv:2503.17281  [pdf, other]

    cs.SD eess.AS

    Learning disentangled representations for instrument-based music similarity

    Authors: Yuka Hashizume, Li Li, Atsushi Miyashita, Tomoki Toda

    Abstract: A flexible recommendation and retrieval system requires music similarity in terms of multiple partial elements of musical pieces to allow users to select the element they want to focus on. A method for music similarity learning using multiple networks with individual instrumental signals is effective but faces the problem that using each clean instrumental signal as a query is impractical for retr…

    Submitted 21 March, 2025; originally announced March 2025.

    Comments: arXiv admin note: text overlap with arXiv:2404.06682

  10. arXiv:2503.14854  [pdf, other]

    eess.AS

    Analysis and Extension of Noisy-target Training for Unsupervised Target Signal Enhancement

    Authors: Takuya Fujimura, Tomoki Toda

    Abstract: Deep neural network-based target signal enhancement (TSE) is usually trained in a supervised manner using clean target signals. However, collecting clean target signals is costly and such signals are not always available. Thus, it is desirable to develop an unsupervised method that does not rely on clean target signals. Among various studies on unsupervised TSE methods, Noisy-target Training (NyTT…

    Submitted 18 March, 2025; originally announced March 2025.

  11. arXiv:2503.12388  [pdf, other]

    cs.SD eess.AS

    Serenade: A Singing Style Conversion Framework Based On Audio Infilling

    Authors: Lester Phillip Violeta, Wen-Chin Huang, Tomoki Toda

    Abstract: We propose Serenade, a novel framework for the singing style conversion (SSC) task. Although singer identity conversion has made great strides in the previous years, converting the singing style of a singer has been an unexplored research area. We find three main challenges in SSC: modeling the target style, disentangling source style, and retaining the source melody. To model the target singing s…

    Submitted 16 March, 2025; originally announced March 2025.

    Comments: Preprint under review

  12. arXiv:2503.10435  [pdf, other]

    eess.AS cs.SD

    Handling Domain Shifts for Anomalous Sound Detection: A Review of DCASE-Related Work

    Authors: Kevin Wilkinghoff, Takuya Fujimura, Keisuke Imoto, Jonathan Le Roux, Zheng-Hua Tan, Tomoki Toda

    Abstract: When detecting anomalous sounds in complex environments, one of the main difficulties is that trained models must be sensitive to subtle differences in monitored target signals, while many practical applications also require them to be insensitive to changes in acoustic domains. Examples of such domain shifts include changing the type of microphone or the location of acoustic sensors, which can ha…

    Submitted 13 March, 2025; originally announced March 2025.

  13. arXiv:2502.02138  [pdf, other]

    cs.SD eess.AS

    Investigation of perceptual music similarity focusing on each instrumental part

    Authors: Yuka Hashizume, Tomoki Toda

    Abstract: This paper presents an investigation of perceptual similarity between music tracks focusing on each individual instrumental part based on a large-scale listening test towards developing an instrumental-part-based music retrieval. In the listening test, 586 subjects evaluate the perceptual similarity of the audio tracks through an ABX test. We use the music tracks and their stems in the test set of…

    Submitted 4 February, 2025; originally announced February 2025.

    Comments: Accepted ICASSP 2025

  14. arXiv:2411.06807  [pdf, other]

    cs.SD eess.AS

    Wavehax: Aliasing-Free Neural Waveform Synthesis Based on 2D Convolution and Harmonic Prior for Reliable Complex Spectrogram Estimation

    Authors: Reo Yoneyama, Atsushi Miyashita, Ryuichi Yamamoto, Tomoki Toda

    Abstract: Neural vocoders often struggle with aliasing in latent feature spaces, caused by time-domain nonlinear operations and resampling layers. Aliasing folds high-frequency components into the low-frequency range, making aliased and original frequency components indistinguishable and introducing two practical issues. First, aliasing complicates the waveform generation process, as the subsequent layers m…

    Submitted 11 November, 2024; originally announced November 2024.

    Comments: 13 pages, 5 figures, Submitted to IEEE/ACM Trans. ASLP

  15. arXiv:2411.03715  [pdf, other]

    cs.SD eess.AS

    MOS-Bench: Benchmarking Generalization Abilities of Subjective Speech Quality Assessment Models

    Authors: Wen-Chin Huang, Erica Cooper, Tomoki Toda

    Abstract: Subjective speech quality assessment (SSQA) is critical for evaluating speech samples as perceived by human listeners. While model-based SSQA has enjoyed great success thanks to the development of deep neural networks (DNNs), generalization remains a key challenge, especially for unseen, out-of-domain data. To benchmark the generalization abilities of SSQA models, we present MOS-Bench, a diverse c…

    Submitted 6 November, 2024; originally announced November 2024.

    Comments: Submitted to Transactions on Audio, Speech and Language Processing. This work has been submitted to the IEEE for possible publication

  16. arXiv:2409.19614  [pdf, other]

    cs.SD eess.AS

    Improved Architecture for High-resolution Piano Transcription to Efficiently Capture Acoustic Characteristics of Music Signals

    Authors: Jinyi Mi, Sehun Kim, Tomoki Toda

    Abstract: Automatic music transcription (AMT), aiming to convert musical signals into musical notation, is one of the important tasks in music information retrieval. Recently, previous works have applied high-resolution labels, i.e., the continuous onset and offset times of piano notes, as training targets, achieving substantial improvements in transcription performance. However, there still remain some iss…

    Submitted 29 September, 2024; originally announced September 2024.

    Comments: Accepted to APSIPA ASC 2024

  17. arXiv:2409.19585  [pdf, other]

    cs.SD cs.CL eess.AS

    Two-stage Framework for Robust Speech Emotion Recognition Using Target Speaker Extraction in Human Speech Noise Conditions

    Authors: Jinyi Mi, Xiaohan Shi, Ding Ma, Jiajun He, Takuya Fujimura, Tomoki Toda

    Abstract: Developing a robust speech emotion recognition (SER) system in noisy conditions faces challenges posed by different noise properties. Most previous studies have not considered the impact of human speech noise, thus limiting the application scope of SER. In this paper, we propose a novel two-stage framework for the problem by cascading target speaker extraction (TSE) method and SER. We first train…

    Submitted 17 December, 2024; v1 submitted 29 September, 2024; originally announced September 2024.

    Comments: This is the preprint version of the paper accepted at APSIPA ASC 2024

  18. arXiv:2409.09332  [pdf, other]

    eess.AS cs.SD

    Improvements of Discriminative Feature Space Training for Anomalous Sound Detection in Unlabeled Conditions

    Authors: Takuya Fujimura, Ibuki Kuroyanagi, Tomoki Toda

    Abstract: In anomalous sound detection, the discriminative method has demonstrated superior performance. This approach constructs a discriminative feature space through the classification of the meta-information labels for normal sounds. This feature space reflects the differences in machine sounds and effectively captures anomalous sounds. However, its performance significantly degrades when the meta-infor…

    Submitted 14 September, 2024; originally announced September 2024.

    Comments: Submitted to ICASSP2025

  19. arXiv:2409.07001  [pdf, other]

    cs.SD eess.AS

    The VoiceMOS Challenge 2024: Beyond Speech Quality Prediction

    Authors: Wen-Chin Huang, Szu-Wei Fu, Erica Cooper, Ryandhimas E. Zezario, Tomoki Toda, Hsin-Min Wang, Junichi Yamagishi, Yu Tsao

    Abstract: We present the third edition of the VoiceMOS Challenge, a scientific initiative designed to advance research into automatic prediction of human speech ratings. There were three tracks. The first track was on predicting the quality of "zoomed-in" high-quality samples from speech synthesis systems. The second track was to predict ratings of samples from singing voice synthesis and voice conversion…

    Submitted 11 September, 2024; originally announced September 2024.

    Comments: Accepted to SLT2024

  20. arXiv:2408.16132  [pdf, other]

    eess.AS cs.MM cs.SD

    SVDD 2024: The Inaugural Singing Voice Deepfake Detection Challenge

    Authors: You Zhang, Yongyi Zang, Jiatong Shi, Ryuichi Yamamoto, Tomoki Toda, Zhiyao Duan

    Abstract: With the advancements in singing voice generation and the growing presence of AI singers on media platforms, the inaugural Singing Voice Deepfake Detection (SVDD) Challenge aims to advance research in identifying AI-generated singing voices from authentic singers. This challenge features two tracks: a controlled setting track (CtrSVDD) and an in-the-wild scenario track (WildSVDD). The CtrSVDD trac…

    Submitted 23 September, 2024; v1 submitted 28 August, 2024; originally announced August 2024.

    Comments: 6 pages, Accepted by 2024 IEEE Spoken Language Technology Workshop (SLT 2024)

  21. arXiv:2406.06208  [pdf, other]

    cs.SD eess.AS

    Quantifying the effect of speech pathology on automatic and human speaker verification

    Authors: Bence Mark Halpern, Thomas Tienkamp, Wen-Chin Huang, Lester Phillip Violeta, Teja Rebernik, Sebastiaan de Visscher, Max Witjes, Martijn Wieling, Defne Abur, Tomoki Toda

    Abstract: This study investigates how surgical intervention for speech pathology (specifically, as a result of oral cancer surgery) impacts the performance of an automatic speaker verification (ASV) system. Using two recently collected Dutch datasets with parallel pre and post-surgery audio from the same speaker, NKI-OC-VC and SPOKE, we assess the extent to which speech pathology influences ASV performance,…

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: 5 pages, 2 figures, 2 tables. Accepted to Interspeech 2024

    ACM Class: I.2.7

  22. arXiv:2406.06201  [pdf, other]

    cs.CV cs.AI

    2DP-2MRC: 2-Dimensional Pointer-based Machine Reading Comprehension Method for Multimodal Moment Retrieval

    Authors: Jiajun He, Tomoki Toda

    Abstract: Moment retrieval aims to locate the most relevant moment in an untrimmed video based on a given natural language query. Existing solutions can be roughly categorized into moment-based and clip-based methods. The former often involves heavy computations, while the latter, due to overlooking coarse-grained information, typically underperforms compared to moment-based models. Hence, this paper propos…

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: Accepted by INTERSPEECH 2024

  23. CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection

    Authors: Yongyi Zang, Jiatong Shi, You Zhang, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Shengyuan Xu, Wenxiao Zhao, Jing Guo, Tomoki Toda, Zhiyao Duan

    Abstract: Recent singing voice synthesis and conversion advancements necessitate robust singing voice deepfake detection (SVDD) models. Current SVDD datasets face challenges due to limited controllability, diversity in deepfake methods, and licensing restrictions. Addressing these gaps, we introduce CtrSVDD, a large-scale, diverse collection of bonafide and deepfake singing vocals. These vocals are synthesi…

    Submitted 18 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

    Journal ref: Proceedings of Interspeech 2024

  24. arXiv:2405.11767  [pdf, other]

    eess.AS cs.CR cs.SD

    Multi-speaker Text-to-speech Training with Speaker Anonymized Data

    Authors: Wen-Chin Huang, Yi-Chiao Wu, Tomoki Toda

    Abstract: The trend of scaling up speech generation models poses a threat of biometric information leakage of the identities of the voices in the training data, raising privacy and security concerns. In this paper, we investigate training multi-speaker text-to-speech (TTS) models using data that underwent speaker anonymization (SA), a process that tends to hide the speaker identity of the input speech while…

    Submitted 19 May, 2024; originally announced May 2024.

    Comments: 5 pages. Submitted to Signal Processing Letters. Audio sample page: https://unilight.github.io/Publication-Demos/publications/sa-tts-spl/index.html

  25. arXiv:2405.05244  [pdf, other]

    eess.AS cs.AI cs.MM cs.SD

    SVDD Challenge 2024: A Singing Voice Deepfake Detection Challenge Evaluation Plan

    Authors: You Zhang, Yongyi Zang, Jiatong Shi, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Tomoki Toda, Zhiyao Duan

    Abstract: The rapid advancement of AI-generated singing voices, which now closely mimic natural human singing and align seamlessly with musical scores, has led to heightened concerns for artists and the music industry. Unlike spoken voice, singing voice presents unique challenges due to its musical nature and the presence of strong background music, making singing voice deepfake detection (SVDD) a specializ…

    Submitted 8 May, 2024; originally announced May 2024.

    Comments: Evaluation plan of the SVDD Challenge @ SLT 2024

  26. arXiv:2404.06682  [pdf, other]

    cs.SD eess.AS

    Learning Multidimensional Disentangled Representations of Instrumental Sounds for Musical Similarity Assessment

    Authors: Yuka Hashizume, Li Li, Atsushi Miyashita, Tomoki Toda

    Abstract: To achieve a flexible recommendation and retrieval system, it is desirable to calculate music similarity by focusing on multiple partial elements of musical pieces and allowing the users to select the element they want to focus on. A previous study proposed using multiple individual networks for calculating music similarity based on each instrumental sound, but it is impractical to use each signal…

    Submitted 9 April, 2024; originally announced April 2024.

  27. arXiv:2403.11508  [pdf, other]

    eess.AS

    Discriminative Neighborhood Smoothing for Generative Anomalous Sound Detection

    Authors: Takuya Fujimura, Keisuke Imoto, Tomoki Toda

    Abstract: We propose discriminative neighborhood smoothing of generative anomaly scores for anomalous sound detection. While the discriminative approach is known to often achieve better performance than generative approaches, we have found that it sometimes causes significant performance degradation due to the discrepancy between the training and test data, making it less robust than the generative approach…

    Submitted 18 March, 2024; originally announced March 2024.

    Comments: Submitted to EUSIPCO 2024

  28. arXiv:2403.06100  [pdf, other]

    cs.HC cs.CL cs.LG eess.AS stat.ML

    Automatic design optimization of preference-based subjective evaluation with online learning in crowdsourcing environment

    Authors: Yusuke Yasuda, Tomoki Toda

    Abstract: A preference-based subjective evaluation is a key method for evaluating generative media reliably. However, its huge combinations of pairs prohibit it from being applied to large-scale evaluation using crowdsourcing. To address this issue, we propose an automatic optimization method for preference-based subjective evaluation in terms of pair combination selections and allocation of evaluation volu…

    Submitted 10 March, 2024; originally announced March 2024.

  29. arXiv:2401.13260  [pdf, other]

    cs.CL cs.MM cs.SD eess.AS

    MF-AED-AEC: Speech Emotion Recognition by Leveraging Multimodal Fusion, ASR Error Detection, and ASR Error Correction

    Authors: Jiajun He, Xiaohan Shi, Xingfeng Li, Tomoki Toda

    Abstract: The prevalent approach in speech emotion recognition (SER) involves integrating both audio and textual information to comprehensively identify the speaker's emotion, with the text generally obtained through automatic speech recognition (ASR). An essential issue of this approach is that ASR errors from the text modality can worsen the performance of SER. Previous studies have proposed using an auxi…

    Submitted 28 May, 2024; v1 submitted 24 January, 2024; originally announced January 2024.

    Comments: Accepted by ICASSP 2024

  30. arXiv:2311.13097  [pdf, other]

    astro-ph.EP astro-ph.GA astro-ph.SR

    KMT-2023-BLG-1431Lb: A New $q < 10^{-4}$ Microlensing Planet from a Subtle Signature

    Authors: Aislyn Bell, Jiyuan Zhang, Youn Kil Jung, Jennifer C. Yee, Hongjing Yang, Takahiro Sumi, Andrzej Udalski, Michael D. Albrow, Sun-Ju Chung, Andrew Gould, Cheongho Han, Kyu-Ha Hwang, Yoon-Hyun Ryu, In-Gu Shin, Yossi Shvartzvald, Weicheng Zang, Sang-Mok Cha, Dong-Jin Kim, Seung-Lee Kim, Chung-Uk Lee, Dong-Joo Lee, Yongseok Lee, Byeong-Gon Park, Richard W. Pogge, Yunyi Tang, et al. (48 additional authors not shown)

    Abstract: The current studies of microlensing planets are limited by small number statistics. Follow-up observations of high-magnification microlensing events can efficiently form a statistical planetary sample. Since 2020, the Korea Microlensing Telescope Network (KMTNet) and the Las Cumbres Observatory (LCO) global network have been conducting a follow-up program for high-magnification KMTNet events. Here…

    Submitted 21 November, 2023; originally announced November 2023.

    Comments: PASP submitted. arXiv admin note: text overlap with arXiv:2301.06779

  31. arXiv:2311.07093  [pdf, other]

    cs.SD cs.CL eess.AS

    On the Effectiveness of ASR Representations in Real-world Noisy Speech Emotion Recognition

    Authors: Xiaohan Shi, Jiajun He, Xingfeng Li, Tomoki Toda

    Abstract: This paper proposes an efficient attempt to noisy speech emotion recognition (NSER). Conventional NSER approaches have proven effective in mitigating the impact of artificial noise sources, such as white Gaussian noise, but are limited to non-stationary noises in real-world environments due to their complexity and uncertainty. To overcome this limitation, we introduce a new method for NSER by adop…

    Submitted 12 January, 2025; v1 submitted 13 November, 2023; originally announced November 2023.

    Comments: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing

  32. arXiv:2310.05203  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD eess.SP

    A Comparative Study of Voice Conversion Models with Large-Scale Speech and Singing Data: The T13 Systems for the Singing Voice Conversion Challenge 2023

    Authors: Ryuichi Yamamoto, Reo Yoneyama, Lester Phillip Violeta, Wen-Chin Huang, Tomoki Toda

    Abstract: This paper presents our systems (denoted as T13) for the singing voice conversion challenge (SVCC) 2023. For both in-domain and cross-domain English singing voice conversion (SVC) tasks (Task 1 and Task 2), we adopt a recognition-synthesis approach with self-supervised learning-based representation. To achieve data-efficient SVC with a limited amount of target singer/speaker's data (150 to 160 utt…

    Submitted 8 October, 2023; originally announced October 2023.

    Comments: Accepted to ASRU 2023

  33. arXiv:2310.05129  [pdf, other]

    cs.AI

    ED-CEC: Improving Rare Word Recognition Using ASR Postprocessing Based on Error Detection and Context-Aware Error Correction

    Authors: Jiajun He, Zekun Yang, Tomoki Toda

    Abstract: Automatic speech recognition (ASR) systems often encounter difficulties in accurately recognizing rare words, leading to errors that can have a negative impact on downstream tasks such as keyword spotting, intent detection, and text summarization. To address this challenge, we present a novel ASR postprocessing method that focuses on improving the recognition of rare words through error detection…

    Submitted 8 October, 2023; originally announced October 2023.

    Comments: 6 pages, 5 figures, conference

  34. arXiv:2310.02640  [pdf, other]

    eess.AS

    The VoiceMOS Challenge 2023: Zero-shot Subjective Speech Quality Prediction for Multiple Domains

    Authors: Erica Cooper, Wen-Chin Huang, Yu Tsao, Hsin-Min Wang, Tomoki Toda, Junichi Yamagishi

    Abstract: We present the second edition of the VoiceMOS Challenge, a scientific event that aims to promote the study of automatic prediction of the mean opinion score (MOS) of synthesized and processed speech. This year, we emphasize real-world and challenging zero-shot out-of-domain MOS prediction with three tracks for three different voice evaluation scenarios. Ten teams from industry and academia in seve…

    Submitted 6 October, 2023; v1 submitted 4 October, 2023; originally announced October 2023.

    Comments: Accepted to ASRU 2023

  35. arXiv:2310.02570  [pdf, other]

    cs.SD eess.AS

    Improving severity preservation of healthy-to-pathological voice conversion with global style tokens

    Authors: Bence Mark Halpern, Wen-Chin Huang, Lester Phillip Violeta, R. J. J. H. van Son, Tomoki Toda

    Abstract: In healthy-to-pathological voice conversion (H2P-VC), healthy speech is converted into pathological speech while preserving the identity. The paper improves on a previous two-stage approach to H2P-VC where (1) speech is created first with the appropriate severity, (2) then the speaker identity of the voice is converted while preserving the severity of the voice. Specifically, we propose improvements to (2)…

    Submitted 4 October, 2023; originally announced October 2023.

    Comments: 7 pages, 3 figures, 5 tables. Accepted to IEEE Automatic Speech Recognition and Understanding Workshop 2023

    ACM Class: I.2.7

  36. arXiv:2309.09627  [pdf, other]

    cs.SD eess.AS

    Electrolaryngeal Speech Intelligibility Enhancement Through Robust Linguistic Encoders

    Authors: Lester Phillip Violeta, Wen-Chin Huang, Ding Ma, Ryuichi Yamamoto, Kazuhiro Kobayashi, Tomoki Toda

    Abstract: We propose a novel framework for electrolaryngeal speech intelligibility enhancement through the use of robust linguistic encoders. Pretraining and fine-tuning approaches have proven to work well in this task, but in most cases, various mismatches, such as the speech type mismatch (electrolaryngeal vs. typical) or a speaker mismatch between the datasets used in each stage, can deteriorate the conv…

    Submitted 20 January, 2024; v1 submitted 18 September, 2023; originally announced September 2023.

    Comments: Accepted to ICASSP 2024. Demo page: lesterphillip.github.io/icassp2024_el_sie

  37. arXiv:2309.08141  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD eess.SP

    Audio Difference Learning for Audio Captioning

    Authors: Tatsuya Komatsu, Yusuke Fujita, Kazuya Takeda, Tomoki Toda

    Abstract: This study introduces a novel training paradigm, audio difference learning, for improving audio captioning. The fundamental concept of the proposed learning method is to create a feature representation space that preserves the relationship between audio, enabling the generation of captions that detail intricate audio information. This method employs a reference audio along with the input audio, bo…

    Submitted 15 September, 2023; originally announced September 2023.

    Comments: submitted to ICASSP2024

  38. arXiv:2309.07598  [pdf, other]

    cs.SD eess.AS

    AAS-VC: On the Generalization Ability of Automatic Alignment Search based Non-autoregressive Sequence-to-sequence Voice Conversion

    Authors: Wen-Chin Huang, Kazuhiro Kobayashi, Tomoki Toda

    Abstract: Non-autoregressive (non-AR) sequence-to-sequence (seq2seq) models for voice conversion (VC) are attractive in their ability to effectively model the temporal structure while enjoying boosted intelligibility and fast inference thanks to non-AR modeling. However, the dependency of current non-AR seq2seq VC models on ground truth durations extracted from an external AR model greatly limits its generaliz…

    Submitted 15 September, 2023; v1 submitted 14 September, 2023; originally announced September 2023.

    Comments: Submitted to ICASSP 2024. Demo: https://unilight.github.io/Publication-Demos/publications/aas-vc/index.html. Code: https://github.com/unilight/seq2seq-vc

  39. arXiv:2309.02133  [pdf, other]

    cs.SD cs.CL eess.AS

    Evaluating Methods for Ground-Truth-Free Foreign Accent Conversion

    Authors: Wen-Chin Huang, Tomoki Toda

    Abstract: Foreign accent conversion (FAC) is a special application of voice conversion (VC) which aims to convert the accented speech of a non-native speaker to a native-sounding speech with the same speaker identity. FAC is difficult since the native speech from the desired non-native speaker to be used as the training target is impossible to collect. In this work, we evaluate three recently proposed metho…

    Submitted 5 September, 2023; originally announced September 2023.

    Comments: Accepted to the 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). Demo page: https://unilight.github.io/Publication-Demos/publications/fac-evaluate. Code: https://github.com/unilight/seq2seq-vc

  40. Preference-based training framework for automatic speech quality assessment using deep neural network

    Authors: Cheng-Hung Hu, Yusuke Yasuda, Tomoki Toda

    Abstract: One objective of Speech Quality Assessment (SQA) is to estimate the ranks of synthetic speech systems. However, recent SQA models are typically trained using low-precision direct scores such as mean opinion scores (MOS) as the training objective, which is not straightforward to estimate ranking. Although it is effective for predicting quality scores of individual sentences, this approach does not…

    Submitted 29 August, 2023; originally announced August 2023.

    Comments: Accepted by Interspeech 2023, oral

  41. arXiv:2307.14274  [pdf, other]

    astro-ph.EP astro-ph.GA astro-ph.SR

    OGLE-2019-BLG-0825: Constraints on the Source System and Effect on Binary-lens Parameters arising from a Five Day Xallarap Effect in a Candidate Planetary Microlensing Event

    Authors: Yuki K. Satoh, Naoki Koshimoto, David P. Bennett, Takahiro Sumi, Nicholas J. Rattenbury, Daisuke Suzuki, Shota Miyazaki, Ian A. Bond, Andrzej Udalski, Andrew Gould, Valerio Bozza, Martin Dominik, Yuki Hirao, Iona Kondo, Rintaro Kirikawa, Ryusei Hamada, Fumio Abe, Richard Barry, Aparna Bhattacharya, Hirosane Fujii, Akihiko Fukui, Katsuki Fujita, Tomoya Ikeno, Stela Ishitani Silva, Yoshitaka Itow, et al. (64 additional authors not shown)

    Abstract: We present an analysis of microlensing event OGLE-2019-BLG-0825. This event was identified as a planetary candidate by preliminary modeling. We find that significant residuals from the best-fit static binary-lens model exist and a xallarap effect can fit the residuals very well and significantly improves $χ^2$ values. On the other hand, by including the xallarap effect in our models, we find that…

    Submitted 26 July, 2023; originally announced July 2023.

    Comments: 19 pages, 7 figures, 6 tables. Accepted by AJ

  42. arXiv:2307.00753  [pdf, ps, other]

    astro-ph.EP astro-ph.GA

    KMT-2022-BLG-0475Lb and KMT-2022-BLG-1480Lb: Microlensing ice giants detected via non-caustic-crossing channel

    Authors: Cheongho Han, Chung-Uk Lee, Ian A. Bond, Weicheng Zang, Sun-Ju Chung, Michael D. Albrow, Andrew Gould, Kyu-Ha Hwang, Youn Kil Jung, Yoon-Hyun Ryu, In-Gu Shin, Yossi Shvartzvald, Hongjing Yang, Jennifer C. Yee, Sang-Mok Cha, Doeon Kim, Dong-Jin Kim, Seung-Lee Kim, Dong-Joo Lee, Yongseok Lee, Byeong-Gon Park, Richard W. Pogge, Shude Mao, Wei Zhu, Fumio Abe, et al. (27 additional authors not shown)

    Abstract: We investigate the microlensing data collected in the 2022 season from the high-cadence microlensing surveys in order to find weak signals produced by planetary companions to lenses. From these searches, we find that two lensing events KMT-2022-BLG-0475 and KMT-2022-BLG-1480 exhibit weak short-term anomalies. From the detailed modeling of the lensing light curves, we identify that the anomalies ar…

    Submitted 3 July, 2023; originally announced July 2023.

    Comments: 10 pages, 10 figures

  43. arXiv:2306.14422  [pdf, other]

    cs.SD cs.CL eess.AS

    The Singing Voice Conversion Challenge 2023

    Authors: Wen-Chin Huang, Lester Phillip Violeta, Songxiang Liu, Jiatong Shi, Tomoki Toda

    Abstract: We present the latest iteration of the voice conversion challenge (VCC) series, a bi-annual scientific event aiming to compare and understand different voice conversion (VC) systems based on a common dataset. This year we shifted our focus to singing voice conversion (SVC), thus named the challenge the Singing Voice Conversion Challenge (SVCC). A new database was constructed for two tasks, namely…

    Submitted 6 July, 2023; v1 submitted 26 June, 2023; originally announced June 2023.

  44. arXiv:2306.13953  [pdf, other]

    cs.SD eess.AS

    An Analysis of Personalized Speech Recognition System Development for the Deaf and Hard-of-Hearing

    Authors: Lester Phillip Violeta, Tomoki Toda

    Abstract: Deaf or hard-of-hearing (DHH) speakers typically have atypical speech caused by deafness. With the growing support of speech-based devices and software applications, more work needs to be done to make these devices inclusive to everyone. To do so, we analyze the use of openly-available automatic speech recognition (ASR) tools with a DHH Japanese speaker dataset. As these out-of-the-box ASR models…

    Submitted 24 June, 2023; originally announced June 2023.

    Comments: Submitted to APSIPA 2023

  45. arXiv:2305.15628  [pdf, ps, other]

    astro-ph.EP astro-ph.GA astro-ph.IM

    KMT-2021-BLG-1150Lb: Microlensing planet detected through a densely covered planetary-caustic signal

    Authors: Cheongho Han, Youn Kil Jung, Ian A. Bond, Andrew Gould, Sun-Ju Chung, Michael D. Albrow, Kyu-Ha Hwang, Yoon-Hyun Ryu, In-Gu Shin, Yossi Shvartzvald, Hongjing Yang, Jennifer C. Yee, Weicheng Zang, Sang-Mok Cha, Doeon Kim, Dong-Jin Kim, Seung-Lee Kim, Chung-Uk Lee, Dong-Joo Lee, Yongseok Lee, Byeong-Gon Park, Richard W. Pogge, Fumio Abe, Richard Barry, David P. Bennett, et al. (27 additional authors not shown)

    Abstract: Recently, there have been reports of various types of degeneracies in the interpretation of planetary signals induced by planetary caustics. In this work, we check whether such degeneracies persist in the case of well-covered signals by analyzing the lensing event KMT-2021-BLG-1150, for which the light curve exhibits a densely and continuously covered short-term anomaly. In order to identify degen…

    Submitted 24 May, 2023; originally announced May 2023.

    Comments: 9 pages, 8 figures

  46. arXiv:2305.06605  [pdf, ps, other]

    astro-ph.SR astro-ph.EP

    Probable brown dwarf companions detected in binary microlensing events during the 2018-2020 seasons of the KMTNet survey

    Authors: Cheongho Han, Youn Kil Jung, Doeon Kim, Andrew Gould, Valerio Bozza, Ian A. Bond, Sun-Ju Chung, Michael D. Albrow, Kyu-Ha Hwang, Yoon-Hyun Ryu, In-Gu Shin, Yossi Shvartzvald, Hongjing Yang, Weicheng Zang, Sang-Mok Cha, Dong-Jin Kim, Hyoun-Woo Kim, Seung-Lee Kim, Chung-Uk Lee, Dong-Joo Lee, Jennifer C. Yee, Yongseok Lee, Byeong-Gon Park, Richard W. Pogge, Fumio Abe, et al. (26 additional authors not shown)

    Abstract: We inspect the microlensing data of the KMTNet survey collected during the 2018-2020 seasons in order to find lensing events produced by binaries with brown-dwarf companions. In order to pick out binary-lens events with candidate BD lens companions, we conduct systematic analyses of all anomalous lensing events observed during the seasons. By applying the selection criterion with mass ratio betwe…

    Submitted 11 May, 2023; originally announced May 2023.

    Comments: 10 pages, 8 figures

  47. arXiv:2304.02815  [pdf, ps, other]

    astro-ph.EP astro-ph.GA

    MOA-2022-BLG-249Lb: Nearby microlensing super-Earth planet detected from high-cadence surveys

    Authors: Cheongho Han, Andrew Gould, Youn Kil Jung, Ian A. Bond, Weicheng Zang, Sun-Ju Chung, Michael D. Albrow, Kyu-Ha Hwang, Yoon-Hyun Ryu, In-Gu Shin, Yossi Shvartzvald, Hongjing Yang, Jennifer C. Yee, Sang-Mok Cha, Doeon Kim, Dong-Jin Kim, Seung-Lee Kim, Chung-Uk Lee, Dong-Joo Lee, Yongseok Lee, Byeong-Gon Park, Richard W. Pogge, Shude Mao, Wei Zhu, Fumio Abe, et al. (29 additional authors not shown)

    Abstract: We investigate the data collected by the high-cadence microlensing surveys during the 2022 season in search of planetary signals appearing in the light curves of microlensing events. From this search, we find that the lensing event MOA-2022-BLG-249 exhibits a brief positive anomaly that lasted for about 1 day with a maximum deviation of $\sim 0.2$ mag from a single-source single-lens model. We an…

    Submitted 5 April, 2023; originally announced April 2023.

    Comments: 10 pages, 9 figures

  48. Precise lifetime measurement of $^4_Λ$H hypernucleus using in-flight $^4$He$(K^-, π^0)^4_Λ$H reaction

    Authors: T. Akaishi, H. Asano, X. Chen, A. Clozza, C. Curceanu, R. Del Grande, C. Guaraldo, C. Han, T. Hashimoto, M. Iliescu, K. Inoue, S. Ishimoto, K. Itahashi, M. Iwasaki, Y. Ma, M. Miliucci, R. Murayama, H. Noumi, H. Ohnishi, S. Okada, H. Outa, K. Piscicchia, A. Sakaguchi, F. Sakuma, M. Sato, et al. (13 additional authors not shown)

    Abstract: We present a new measurement of the $^4_Λ$H hypernuclear lifetime using in-flight $K^-$ + $^4$He $\rightarrow$ $^4_Λ$H + $π^0$ reaction at the J-PARC hadron facility. We demonstrate, for the first time, the effective selection of the hypernuclear bound state using only the $γ$-ray energy decayed from $π^0$. This opens the possibility for a systematic study of isospin partner hypernuclei through co…

    Submitted 27 August, 2023; v1 submitted 14 February, 2023; originally announced February 2023.

  49. arXiv:2301.06779  [pdf]

    astro-ph.EP astro-ph.GA

    KMT-2022-BLG-0440Lb: A New $q < 10^{-4}$ Microlensing Planet with the Central-Resonant Caustic Degeneracy Broken

    Authors: Jiyuan Zhang, Weicheng Zang, Youn Kil Jung, Hongjing Yang, Andrew Gould, Takahiro Sumi, Shude Mao, Subo Dong, Michael D. Albrow, Sun-Ju Chung, Cheongho Han, Kyu-Ha Hwang, Yoon-Hyun Ryu, In-Gu Shin, Yossi Shvartzvald, Jennifer C. Yee, Sang-Mok Cha, Dong-Jin Kim, Hyoun-Woo Kim, Seung-Lee Kim, Chung-Uk Lee, Dong-Joo Lee, Yongseok Lee, Byeong-Gon Park, Richard W. Pogge, et al. (35 additional authors not shown)

    Abstract: We present the observations and analysis of a high-magnification microlensing planetary event, KMT-2022-BLG-0440, for which the weak and short-lived planetary signal was covered by both the KMTNet survey and follow-up observations. The binary-lens models with a central caustic provide the best fits, with a planet/host mass ratio, $q = 0.75$--$1.00 \times 10^{-4}$ at $1σ$. The binary-lens models wi…

    Submitted 2 May, 2023; v1 submitted 17 January, 2023; originally announced January 2023.

    Comments: MNRAS accepted

  50. arXiv:2212.08329  [pdf, other]

    eess.AS cs.CL stat.ML

    Text-to-speech synthesis based on latent variable conversion using diffusion probabilistic model and variational autoencoder

    Authors: Yusuke Yasuda, Tomoki Toda

    Abstract: Text-to-speech synthesis (TTS) is a task to convert texts into speech. Two of the factors that have been driving TTS are the advancements of probabilistic models and latent representation learning. We propose a TTS method based on latent variable conversion using a diffusion probabilistic model and the variational autoencoder (VAE). In our TTS method, we use a waveform model based on VAE, a diffus…

    Submitted 16 December, 2022; originally announced December 2022.

    Comments: Submitted to ICASSP 2023