Search | arXiv e-print repository

arXiv:2506.12059 [pdf, ps, other]

CMT-LLM: Contextual Multi-Talker ASR Utilizing Large Language Models

Authors: Jiajun He, Naoki Sawada, Koichi Miyazaki, Tomoki Toda

Abstract: In real-world applications, automatic speech recognition (ASR) systems must handle overlapping speech from multiple speakers and recognize rare words like technical terms. Traditional methods address multi-talker ASR and contextual biasing separately, limiting performance in complex scenarios. We propose a unified framework that combines multi-talker overlapping speech recognition and contextual biasing into a single task. Our ASR method integrates pretrained speech encoders and large language models (LLMs), using optimized finetuning strategies. We also introduce a two-stage filtering algorithm to efficiently identify relevant rare words from large biasing lists and incorporate them into the LLM's prompt input, enhancing rare word recognition. Experiments show that our approach outperforms traditional contextual biasing methods, achieving a WER of 7.9% on LibriMix and 32.9% on AMI SDM when the biasing size is 1,000, demonstrating its effectiveness in complex speech scenarios.

Submitted 31 May, 2025; originally announced June 2025.

Comments: Accepted by INTERSPEECH 2025

arXiv:2506.11064 [pdf, ps, other]

PMF-CEC: Phoneme-augmented Multimodal Fusion for Context-aware ASR Error Correction with Error-specific Selective Decoding

Authors: Jiajun He, Tomoki Toda

Abstract: End-to-end automatic speech recognition (ASR) models often struggle to accurately recognize rare words. Previously, we introduced an ASR postprocessing method called error detection and context-aware error correction (ED-CEC), which leverages contextual information such as named entities and technical terms to improve the accuracy of ASR transcripts. Although ED-CEC achieves a notable success in correcting rare words, its accuracy remains low when dealing with rare words that have similar pronunciations but different spellings. To address this issue, we proposed a phoneme-augmented multimodal fusion method for context-aware error correction (PMF-CEC) method on the basis of ED-CEC, which allowed for better differentiation between target rare words and homophones. Additionally, we observed that the previous ASR error detection module suffers from overdetection. To mitigate this, we introduced a retention probability mechanism to filter out editing operations with confidence scores below a set threshold, preserving the original operation to improve error detection accuracy. Experiments conducted on five datasets demonstrated that our proposed PMF-CEC maintains reasonable inference speed while further reducing the biased word error rate compared with ED-CEC, showing a stronger advantage in correcting homophones. Moreover, our method outperforms other contextual biasing methods, and remains valuable compared with LLM-based methods in terms of faster inference and better robustness under large biasing lists.

Submitted 31 May, 2025; originally announced June 2025.

Comments: Accepted by IEEE TASLP 2025

arXiv:2506.03554 [pdf, ps, other]

Comparative Analysis of Fast and High-Fidelity Neural Vocoders for Low-Latency Streaming Synthesis in Resource-Constrained Environments

Authors: Reo Yoneyama, Masaya Kawamura, Ryo Terashima, Ryuichi Yamamoto, Tomoki Toda

Abstract: In real-time speech synthesis, neural vocoders often require low-latency synthesis through causal processing and streaming. However, streaming introduces inefficiencies absent in batch synthesis, such as limited parallelism, inter-frame dependency management, and parameter loading overhead. This paper proposes multi-stream Wavehax (MS-Wavehax), an efficient neural vocoder for low-latency streaming, by extending the aliasing-free neural vocoder Wavehax with multi-stream decomposition. We analyze the latency-throughput trade-off in a CPU-only environment and identify key bottlenecks in streaming neural vocoders. Our findings provide practical insights for optimizing chunk sizes and designing vocoders tailored to specific application demands and hardware constraints. Furthermore, our subjective evaluations show that MS-Wavehax delivers high speech quality under causal and non-causal conditions while being remarkably compact and easily deployable in resource-constrained environments.

Submitted 4 June, 2025; originally announced June 2025.

Comments: Accepted to Interspeech 2025

arXiv:2506.00865 [pdf, ps, other]

GIA-MIC: Multimodal Emotion Recognition with Gated Interactive Attention and Modality-Invariant Learning Constraints

Authors: Jiajun He, Jinyi Mi, Tomoki Toda

Abstract: Multimodal emotion recognition (MER) extracts emotions from multimodal data, including visual, speech, and text inputs, playing a key role in human-computer interaction. Attention-based fusion methods dominate MER research, achieving strong classification performance. However, two key challenges remain: effectively extracting modality-specific features and capturing cross-modal similarities despite distribution differences caused by modality heterogeneity. To address these, we propose a gated interactive attention mechanism to adaptively extract modality-specific features while enhancing emotional information through pairwise interactions. Additionally, we introduce a modality-invariant generator to learn modality-invariant representations and constrain domain shifts by aligning cross-modal similarities. Experiments on IEMOCAP demonstrate that our method outperforms state-of-the-art MER approaches, achieving WA 80.7% and UA 81.3%.

Submitted 1 June, 2025; originally announced June 2025.

Comments: Accepted by INTERSPEECH 2025

arXiv:2505.18982 [pdf, ps, other]

Serial-OE: Anomalous sound detection based on serial method with outlier exposure capable of using small amounts of anomalous data for training

Authors: Ibuki Kuroyanagi, Tomoki Hayashi, Kazuya Takeda, Tomoki Toda

Abstract: We introduce Serial-OE, a new approach to anomalous sound detection (ASD) that leverages small amounts of anomalous data to improve the performance. Conventional ASD methods rely primarily on the modeling of normal data, due to the cost of collecting anomalous data from various possible types of equipment breakdowns. Our method improves upon existing ASD systems by implementing an outlier exposure framework that utilizes normal and pseudo-anomalous data for training, with the capability to also use small amounts of real anomalous data. A comprehensive evaluation using the DCASE2020 Task2 dataset shows that our method outperforms state-of-the-art ASD models. We also investigate the impact on performance of using a small amount of anomalous data during training, of using data without machine ID information, and of using contaminated training data. Our experimental results reveal the potential of using a very limited amount of anomalous data during training to address the limitations of existing methods using only normal data for training due to the scarcity of anomalous data. This study contributes to the field by presenting a method that can be dynamically adapted to include anomalous data during the operational phase of an ASD system, paving the way for more accurate ASD.

Submitted 25 May, 2025; originally announced May 2025.

Comments: 39 pages, 5 figures, 5 tables, APSIPA Transactions on Signal and Information Processing

arXiv:2505.18980 [pdf, ps, other]

Improving Anomalous Sound Detection through Pseudo-anomalous Set Selection and Pseudo-label Utilization under Unlabeled Conditions

Authors: Ibuki Kuroyanagi, Takuya Fujimura, Kazuya Takeda, Tomoki Toda

Abstract: This paper addresses performance degradation in anomalous sound detection (ASD) when neither sufficiently similar machine data nor operational state labels are available. We present an integrated pipeline that combines three complementary components derived from prior work and extends them to the unlabeled ASD setting. First, we adapt an anomaly score based selector to curate external audio data resembling the normal sounds of the target machine. Second, we utilize triplet learning to assign pseudo-labels to unlabeled data, enabling finer classification of operational sounds and detection of subtle anomalies. Third, we employ iterative training to refine both the pseudo-anomalous set selection and pseudo-label assignment, progressively improving detection accuracy. Experiments on the DCASE2022-2024 Task 2 datasets demonstrate that, in unlabeled settings, our approach achieves an average AUC increase of over 6.6 points compared to conventional methods. In labeled settings, incorporating external data from the pseudo-anomalous set further boosts performance. These results highlight the practicality and robustness of our methods in scenarios with scarce machine data and labels, facilitating ASD deployment across diverse industrial settings with minimal annotation effort.

Submitted 25 May, 2025; originally announced May 2025.

Comments: 33 pages, 3 figures, 7 tables, APSIPA Transactions on Signal and Information Processing

arXiv:2505.15061 [pdf, ps, other]

SHEET: A Multi-purpose Open-source Speech Human Evaluation Estimation Toolkit

Authors: Wen-Chin Huang, Erica Cooper, Tomoki Toda

Abstract: We introduce SHEET, a multi-purpose open-source toolkit designed to accelerate subjective speech quality assessment (SSQA) research. SHEET stands for the Speech Human Evaluation Estimation Toolkit, which focuses on data-driven deep neural network-based models trained to predict human-labeled quality scores of speech samples. SHEET provides comprehensive training and evaluation scripts, multi-dataset and multi-model support, as well as pre-trained models accessible via Torch Hub and HuggingFace Spaces. To demonstrate its capabilities, we re-evaluated SSL-MOS, a speech self-supervised learning (SSL)-based SSQA model widely used in recent scientific papers, on an extensive list of speech SSL models. Experiments were conducted on two representative SSQA datasets named BVCC and NISQA, and we identified the optimal speech SSL model, whose performance surpassed the original SSL-MOS implementation and was comparable to state-of-the-art methods.

Submitted 20 May, 2025; originally announced May 2025.

Comments: INTERSPEECH 2025. Codebase: https://github.com/unilight/sheet

arXiv:2503.18486 [pdf, other]

Music Similarity Representation Learning Focusing on Individual Instruments with Source Separation and Human Preference

Authors: Takehiro Imamura, Yuka Hashizume, Wen-Chin Huang, Tomoki Toda

Abstract: This paper proposes music similarity representation learning (MSRL) based on individual instrument sounds (InMSRL) utilizing music source separation (MSS) and human preference without requiring clean instrument sounds during inference. We propose three methods that effectively improve performance. First, we introduce end-to-end fine-tuning (E2E-FT) for the Cascade approach that sequentially performs MSS and music similarity feature extraction. E2E-FT allows the model to minimize the adverse effects of a separation error on the feature extraction. Second, we propose multi-task learning for the Direct approach that directly extracts disentangled music similarity features using a single music similarity feature extractor. Multi-task learning, which is based on the disentangled music similarity feature extraction and MSS based on reconstruction with disentangled music similarity features, further enhances instrument feature disentanglement. Third, we employ perception-aware fine-tuning (PAFT). PAFT utilizes human preference, allowing the model to perform InMSRL aligned with human perceptual similarity. We conduct experimental evaluations and demonstrate that 1) E2E-FT for Cascade significantly improves InMSRL performance, 2) the multi-task learning for Direct is also helpful to improve disentanglement performance in the feature extraction, 3) PAFT significantly enhances the perceptual InMSRL performance, and 4) Cascade with E2E-FT and PAFT outperforms Direct with the multi-task learning and PAFT.

Submitted 24 March, 2025; originally announced March 2025.

arXiv:2503.17281 [pdf, other]

Learning disentangled representations for instrument-based music similarity

Authors: Yuka Hashizume, Li Li, Atsushi Miyashita, Tomoki Toda

Abstract: A flexible recommendation and retrieval system requires music similarity in terms of multiple partial elements of musical pieces to allow users to select the element they want to focus on. A method for music similarity learning using multiple networks with individual instrumental signals is effective but faces the problem that using each clean instrumental signal as a query is impractical for retrieval systems and using separated instrumental sounds reduces accuracy owing to artifacts. In this paper, we present instrumental-part-based music similarity learning with a single network that takes mixed sounds as input instead of individual instrumental sounds. Specifically, we designed a single similarity embedding space with disentangled dimensions for each instrument, extracted by Conditional Similarity Networks, which are trained using the triplet loss with masks. Experimental results showed that (1) the proposed method can obtain more accurate feature representation than using individual networks using separated sounds as input in the evaluation of an instrument that had low accuracy, (2) each sub-embedding space can hold the characteristics of the corresponding instrument, and (3) the selection of similar musical pieces focusing on each instrumental sound by the proposed method can obtain human acceptance, especially when focusing on timbre.

Submitted 21 March, 2025; originally announced March 2025.

Comments: arXiv admin note: text overlap with arXiv:2404.06682

arXiv:2503.14854 [pdf, other]

Analysis and Extension of Noisy-target Training for Unsupervised Target Signal Enhancement

Authors: Takuya Fujimura, Tomoki Toda

Abstract: Deep neural network-based target signal enhancement (TSE) is usually trained in a supervised manner using clean target signals. However, collecting clean target signals is costly and such signals are not always available. Thus, it is desirable to develop an unsupervised method that does not rely on clean target signals. Among various studies on unsupervised TSE methods, Noisy-target Training (NyTT) has been established as a fundamental method. NyTT simply replaces clean target signals with noisy ones in the typical supervised training, and it has been experimentally shown to achieve TSE. Despite its effectiveness and simplicity, its mechanism and detailed behavior are still unclear. In this paper, to advance NyTT and, thus, unsupervised methods as a whole, we analyze NyTT from various perspectives. We experimentally demonstrate the mechanism of NyTT, the desirable conditions, and the effectiveness of utilizing noisy signals in situations where a small number of clean target signals are available. Furthermore, we propose an improved version of NyTT based on its properties and explore its capabilities in the dereverberation and declipping tasks, beyond the denoising task.

Submitted 18 March, 2025; originally announced March 2025.

arXiv:2503.12388 [pdf, other]

Serenade: A Singing Style Conversion Framework Based On Audio Infilling

Authors: Lester Phillip Violeta, Wen-Chin Huang, Tomoki Toda

Abstract: We propose Serenade, a novel framework for the singing style conversion (SSC) task. Although singer identity conversion has made great strides in the previous years, converting the singing style of a singer has been an unexplored research area. We find three main challenges in SSC: modeling the target style, disentangling source style, and retaining the source melody. To model the target singing style, we use an audio infilling task by predicting a masked segment of the target mel-spectrogram with a flow-matching model using the complement of the masked target mel-spectrogram along with disentangled acoustic features. On the other hand, to disentangle the source singing style, we use a cyclic training approach, where we use synthetic converted samples as source inputs and reconstruct the original source mel-spectrogram as a target. Finally, to retain the source melody better, we investigate a post-processing module using a source-filter-based vocoder and resynthesize the converted waveforms using the original F0 patterns. Our results showed that the Serenade framework can handle generalized SSC tasks with the best overall similarity score, especially in modeling breathy and mixed singing styles. Moreover, although resynthesizing with the original F0 patterns alleviated out-of-tune singing and improved naturalness, we found a slight tradeoff in similarity due to not changing the F0 patterns into the target style.

Submitted 16 March, 2025; originally announced March 2025.

Comments: Preprint under review

arXiv:2503.10435 [pdf, other]

Handling Domain Shifts for Anomalous Sound Detection: A Review of DCASE-Related Work

Authors: Kevin Wilkinghoff, Takuya Fujimura, Keisuke Imoto, Jonathan Le Roux, Zheng-Hua Tan, Tomoki Toda

Abstract: When detecting anomalous sounds in complex environments, one of the main difficulties is that trained models must be sensitive to subtle differences in monitored target signals, while many practical applications also require them to be insensitive to changes in acoustic domains. Examples of such domain shifts include changing the type of microphone or the location of acoustic sensors, which can have a much stronger impact on the acoustic signal than subtle anomalies themselves. Moreover, users typically aim to train a model only on source domain data, which they may have a relatively large collection of, and they hope that such a trained model will be able to generalize well to an unseen target domain by providing only a minimal number of samples to characterize the acoustic signals in that domain. In this work, we review and discuss recent publications focusing on this domain generalization problem for anomalous sound detection in the context of the DCASE challenges on acoustic machine condition monitoring.

Submitted 13 March, 2025; originally announced March 2025.

arXiv:2502.02138 [pdf, other]

Investigation of perceptual music similarity focusing on each instrumental part

Authors: Yuka Hashizume, Tomoki Toda

Abstract: This paper presents an investigation of perceptual similarity between music tracks focusing on each individual instrumental part based on a large-scale listening test towards developing an instrumental-part-based music retrieval. In the listening test, 586 subjects evaluate the perceptual similarity of the audio tracks through an ABX test. We use the music tracks and their stems in the test set of the slakh2100 dataset. The perceptual similarity is evaluated based on four perspectives: timbre, rhythm, melody, and overall. We have analyzed the results of the listening test and have found that 1) perceptual music similarity varies depending on which instrumental part is focused on within each track; 2) rhythm and melody tend to have a larger impact on the perceptual music similarity than timbre except for the melody of drums; and 3) the previously proposed music similarity features tend to capture the perceptual similarity on timbre mainly.

Submitted 4 February, 2025; originally announced February 2025.

Comments: Accepted ICASSP 2025

arXiv:2411.06807 [pdf, other]

Wavehax: Aliasing-Free Neural Waveform Synthesis Based on 2D Convolution and Harmonic Prior for Reliable Complex Spectrogram Estimation

Authors: Reo Yoneyama, Atsushi Miyashita, Ryuichi Yamamoto, Tomoki Toda

Abstract: Neural vocoders often struggle with aliasing in latent feature spaces, caused by time-domain nonlinear operations and resampling layers. Aliasing folds high-frequency components into the low-frequency range, making aliased and original frequency components indistinguishable and introducing two practical issues. First, aliasing complicates the waveform generation process, as the subsequent layers must address these aliasing effects, increasing the computational complexity. Second, it limits extrapolation performance, particularly in handling high fundamental frequencies, which degrades the perceptual quality of generated speech waveforms. This paper demonstrates that 1) time-domain nonlinear operations inevitably introduce aliasing but provide a strong inductive bias for harmonic generation, and 2) time-frequency-domain processing can achieve aliasing-free waveform synthesis but lacks the inductive bias for effective harmonic generation. Building on this insight, we propose Wavehax, an aliasing-free neural WAVEform generator that integrates 2D convolution and a HArmonic prior for reliable Complex Spectrogram estimation. Experimental results show that Wavehax achieves speech quality comparable to existing high-fidelity neural vocoders and exhibits exceptional robustness in scenarios requiring high fundamental frequency extrapolation, where aliasing effects become typically severe. Moreover, Wavehax requires less than 5% of the multiply-accumulate operations and model parameters compared to HiFi-GAN V1, while achieving over four times faster CPU inference speed.

Submitted 11 November, 2024; originally announced November 2024.

Comments: 13 pages, 5 figures, Submitted to IEEE/ACM Trans. ASLP

arXiv:2411.03715 [pdf, other]

MOS-Bench: Benchmarking Generalization Abilities of Subjective Speech Quality Assessment Models

Authors: Wen-Chin Huang, Erica Cooper, Tomoki Toda

Abstract: Subjective speech quality assessment (SSQA) is critical for evaluating speech samples as perceived by human listeners. While model-based SSQA has enjoyed great success thanks to the development of deep neural networks (DNNs), generalization remains a key challenge, especially for unseen, out-of-domain data. To benchmark the generalization abilities of SSQA models, we present MOS-Bench, a diverse collection of datasets. In addition, we also introduce SHEET, an open-source toolkit containing complete recipes to conduct SSQA experiments. We provided benchmark results for MOS-Bench, and we also explored multi-dataset training to enhance generalization. Additionally, we proposed a new performance metric, best score difference/ratio, and used latent space visualizations to explain model behavior, offering valuable insights for future research.

Submitted 6 November, 2024; originally announced November 2024.

Comments: Submitted to Transactions on Audio, Speech and Language Processing. This work has been submitted to the IEEE for possible publication

arXiv:2409.19614 [pdf, other]

Improved Architecture for High-resolution Piano Transcription to Efficiently Capture Acoustic Characteristics of Music Signals

Authors: Jinyi Mi, Sehun Kim, Tomoki Toda

Abstract: Automatic music transcription (AMT), aiming to convert musical signals into musical notation, is one of the important tasks in music information retrieval. Recently, previous works have applied high-resolution labels, i.e., the continuous onset and offset times of piano notes, as training targets, achieving substantial improvements in transcription performance. However, there still remain some issues to be addressed, e.g., the harmonics of notes are sometimes recognized as false positive notes, and the size of AMT model tends to be larger to improve the transcription performance. To address these issues, we propose an improved high-resolution piano transcription model to well capture specific acoustic characteristics of music signals. First, we employ the Constant-Q Transform as the input representation to better adapt to musical signals. Moreover, we have designed two architectures: the first is based on a convolutional recurrent neural network (CRNN) with dilated convolution, and the second is an encoder-decoder architecture that combines CRNN with a non-autoregressive Transformer decoder. We conduct systematic experiments for our models. Compared to the high-resolution AMT system used as a baseline, our models effectively achieve 1) consistent improvement in note-level metrics, and 2) the significant smaller model size, which shed lights on future work.

Submitted 29 September, 2024; originally announced September 2024.

Comments: Accepted to APSIPA ASC 2024

arXiv:2409.19585 [pdf, other]

Two-stage Framework for Robust Speech Emotion Recognition Using Target Speaker Extraction in Human Speech Noise Conditions

Authors: Jinyi Mi, Xiaohan Shi, Ding Ma, Jiajun He, Takuya Fujimura, Tomoki Toda

Abstract: Developing a robust speech emotion recognition (SER) system in noisy conditions faces challenges posed by different noise properties. Most previous studies have not considered the impact of human speech noise, thus limiting the application scope of SER. In this paper, we propose a novel two-stage framework for the problem by cascading target speaker extraction (TSE) method and SER. We first train a TSE model to extract the speech of target speaker from a mixture. Then, in the second stage, we utilize the extracted speech for SER training. Additionally, we explore a joint training of TSE and SER models in the second stage. Our developed system achieves a 14.33% improvement in unweighted accuracy (UA) compared to a baseline without using TSE method, demonstrating the effectiveness of our framework in mitigating the impact of human speech noise. Moreover, we conduct experiments considering speaker gender, showing that our framework performs particularly well in different-gender mixture.

Submitted 17 December, 2024; v1 submitted 29 September, 2024; originally announced September 2024.

Comments: This is the preprint version of the paper accepted at APSIPA ASC 2024

arXiv:2409.09332 [pdf, other]

Improvements of Discriminative Feature Space Training for Anomalous Sound Detection in Unlabeled Conditions

Authors: Takuya Fujimura, Ibuki Kuroyanagi, Tomoki Toda

Abstract: In anomalous sound detection, the discriminative method has demonstrated superior performance. This approach constructs a discriminative feature space through the classification of the meta-information labels for normal sounds. This feature space reflects the differences in machine sounds and effectively captures anomalous sounds. However, its performance significantly degrades when the meta-information labels are missing. In this paper, we improve the performance of a discriminative method under unlabeled conditions by two approaches. First, we enhance the feature extractor to perform better under unlabeled conditions. Our enhanced feature extractor utilizes multi-resolution spectrograms with a new training strategy. Second, we propose various pseudo-labeling methods to effectively train the feature extractor. The experimental evaluations show that the proposed feature extractor and pseudo-labeling methods significantly improve performance under unlabeled conditions.

Submitted 14 September, 2024; originally announced September 2024.

Comments: Submitted to ICASSP2025

arXiv:2409.07001 [pdf, other]

The VoiceMOS Challenge 2024: Beyond Speech Quality Prediction

Authors: Wen-Chin Huang, Szu-Wei Fu, Erica Cooper, Ryandhimas E. Zezario, Tomoki Toda, Hsin-Min Wang, Junichi Yamagishi, Yu Tsao

Abstract: We present the third edition of the VoiceMOS Challenge, a scientific initiative designed to advance research into automatic prediction of human speech ratings. There were three tracks. The first track was on predicting the quality of "zoomed-in" high-quality samples from speech synthesis systems. The second track was to predict ratings of samples from singing voice synthesis and voice conversion with a large variety of systems, listeners, and languages. The third track was semi-supervised quality prediction for noisy, clean, and enhanced speech, where a very small amount of labeled training data was provided. Among the eight teams from both academia and industry, we found that many were able to outperform the baseline systems. Successful techniques included retrieval-based methods and the use of non-self-supervised representations like spectrograms and pitch histograms. These results showed that the challenge has advanced the field of subjective speech rating prediction.

Submitted 11 September, 2024; originally announced September 2024.

Comments: Accepted to SLT2024

arXiv:2408.16132 [pdf, other]

SVDD 2024: The Inaugural Singing Voice Deepfake Detection Challenge

Authors: You Zhang, Yongyi Zang, Jiatong Shi, Ryuichi Yamamoto, Tomoki Toda, Zhiyao Duan

Abstract: With the advancements in singing voice generation and the growing presence of AI singers on media platforms, the inaugural Singing Voice Deepfake Detection (SVDD) Challenge aims to advance research in identifying AI-generated singing voices from authentic singers. This challenge features two tracks: a controlled setting track (CtrSVDD) and an in-the-wild scenario track (WildSVDD). The CtrSVDD track utilizes publicly available singing vocal data to generate deepfakes using state-of-the-art singing voice synthesis and conversion systems. Meanwhile, the WildSVDD track expands upon the existing SingFake dataset, which includes data sourced from popular user-generated content websites. For the CtrSVDD track, we received submissions from 47 teams, with 37 surpassing our baselines and the top team achieving a 1.65% equal error rate. For the WildSVDD track, we benchmarked the baselines. This paper reviews these results, discusses key findings, and outlines future directions for SVDD research.

Submitted 23 September, 2024; v1 submitted 28 August, 2024; originally announced August 2024.

Comments: 6 pages, Accepted by 2024 IEEE Spoken Language Technology Workshop (SLT 2024)

arXiv:2406.06208 [pdf, other]

Quantifying the effect of speech pathology on automatic and human speaker verification

Authors: Bence Mark Halpern, Thomas Tienkamp, Wen-Chin Huang, Lester Phillip Violeta, Teja Rebernik, Sebastiaan de Visscher, Max Witjes, Martijn Wieling, Defne Abur, Tomoki Toda

Abstract: This study investigates how surgical intervention for speech pathology (specifically, as a result of oral cancer surgery) impacts the performance of an automatic speaker verification (ASV) system. Using two recently collected Dutch datasets with parallel pre- and post-surgery audio from the same speaker, NKI-OC-VC and SPOKE, we assess the extent to which speech pathology influences ASV performance, and whether objective/subjective measures of speech severity are correlated with the performance. Finally, we carry out a perceptual study to compare judgements of ASV and human listeners. Our findings reveal that pathological speech negatively affects ASV performance, and the severity of the speech is negatively correlated with the performance. There is moderate agreement between perceptual and objective scores of speaker similarity and severity; however, we could not clearly establish in the perceptual study whether the same phenomenon also exists in human perception.

Submitted 10 June, 2024; originally announced June 2024.

Comments: 5 pages, 2 figures, 2 tables. Accepted to Interspeech 2024

ACM Class: I.2.7

arXiv:2406.06201 [pdf, other]

2DP-2MRC: 2-Dimensional Pointer-based Machine Reading Comprehension Method for Multimodal Moment Retrieval

Authors: Jiajun He, Tomoki Toda

Abstract: Moment retrieval aims to locate the most relevant moment in an untrimmed video based on a given natural language query. Existing solutions can be roughly categorized into moment-based and clip-based methods. The former often involves heavy computations, while the latter, due to overlooking coarse-grained information, typically underperforms compared to moment-based models. Hence, this paper proposes a novel 2-Dimensional Pointer-based Machine Reading Comprehension for Moment Retrieval Choice (2DP-2MRC) model to address the issue of imprecise localization in clip-based methods while maintaining lower computational complexity than moment-based methods. Specifically, we introduce an AV-Encoder to capture coarse-grained information at the moment and video levels. Additionally, a 2D pointer encoder module is introduced to further enhance boundary detection for the target moment. Extensive experiments on the HiREST dataset demonstrate that 2DP-2MRC significantly outperforms existing baseline models.

Submitted 10 June, 2024; originally announced June 2024.

Comments: Accepted by INTERSPEECH 2024

arXiv:2406.02438 [pdf, other]

doi 10.21437/Interspeech.2024-2242

CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection

Authors: Yongyi Zang, Jiatong Shi, You Zhang, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Shengyuan Xu, Wenxiao Zhao, Jing Guo, Tomoki Toda, Zhiyao Duan

Abstract: Recent singing voice synthesis and conversion advancements necessitate robust singing voice deepfake detection (SVDD) models. Current SVDD datasets face challenges due to limited controllability, diversity in deepfake methods, and licensing restrictions. Addressing these gaps, we introduce CtrSVDD, a large-scale, diverse collection of bonafide and deepfake singing vocals. These vocals are synthesized using state-of-the-art methods from publicly accessible singing voice datasets. CtrSVDD includes 47.64 hours of bonafide and 260.34 hours of deepfake singing vocals, spanning 14 deepfake methods and involving 164 singer identities. We also present a baseline system with flexible front-end features, evaluated against a structured train/dev/eval split. The experiments show the importance of feature selection and highlight a need for generalization towards deepfake methods that deviate further from training distribution. The CtrSVDD dataset and baselines are publicly accessible.

Submitted 18 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

Comments: Accepted by Interspeech 2024

Journal ref: Proceedings of Interspeech 2024

arXiv:2405.11767 [pdf, other]

Multi-speaker Text-to-speech Training with Speaker Anonymized Data

Authors: Wen-Chin Huang, Yi-Chiao Wu, Tomoki Toda

Abstract: The trend of scaling up speech generation models poses a threat of biometric information leakage of the identities of the voices in the training data, raising privacy and security concerns. In this paper, we investigate training multi-speaker text-to-speech (TTS) models using data that underwent speaker anonymization (SA), a process that tends to hide the speaker identity of the input speech while maintaining other attributes. Two signal processing-based and three deep neural network-based SA methods were used to anonymize VCTK, a multi-speaker TTS dataset, which is further used to train an end-to-end TTS model, VITS, to perform unseen speaker TTS during the testing phase. We conducted extensive objective and subjective experiments to evaluate the anonymized training data, as well as the performance of the downstream TTS model trained using those data. Importantly, we found that UTMOS, a data-driven subjective rating predictor model, and GVD, a metric that measures the gain of voice distinctiveness, are good indicators of the downstream TTS performance. We summarize insights in the hope of helping future researchers determine the goodness of the SA system for multi-speaker TTS training.

Submitted 19 May, 2024; originally announced May 2024.

Comments: 5 pages. Submitted to Signal Processing Letters. Audio sample page: https://unilight.github.io/Publication-Demos/publications/sa-tts-spl/index.html

arXiv:2405.05244 [pdf, other]

SVDD Challenge 2024: A Singing Voice Deepfake Detection Challenge Evaluation Plan

Authors: You Zhang, Yongyi Zang, Jiatong Shi, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Tomoki Toda, Zhiyao Duan

Abstract: The rapid advancement of AI-generated singing voices, which now closely mimic natural human singing and align seamlessly with musical scores, has led to heightened concerns for artists and the music industry. Unlike spoken voice, singing voice presents unique challenges due to its musical nature and the presence of strong background music, making singing voice deepfake detection (SVDD) a specialized field requiring focused attention. To promote SVDD research, we recently proposed the "SVDD Challenge," the very first research challenge focusing on SVDD for lab-controlled and in-the-wild bonafide and deepfake singing voice recordings. The challenge will be held in conjunction with the 2024 IEEE Spoken Language Technology Workshop (SLT 2024).

Submitted 8 May, 2024; originally announced May 2024.

Comments: Evaluation plan of the SVDD Challenge @ SLT 2024

arXiv:2404.06682 [pdf, other]

Learning Multidimensional Disentangled Representations of Instrumental Sounds for Musical Similarity Assessment

Authors: Yuka Hashizume, Li Li, Atsushi Miyashita, Tomoki Toda

Abstract: To achieve a flexible recommendation and retrieval system, it is desirable to calculate music similarity by focusing on multiple partial elements of musical pieces and allowing the users to select the element they want to focus on. A previous study proposed using multiple individual networks for calculating music similarity based on each instrumental sound, but it is impractical to use each signal as a query in search systems. Using separated instrumental sounds instead resulted in lower accuracy due to artifacts. In this paper, we propose a method to compute similarities focusing on each instrumental sound with a single network that takes mixed sounds as input instead of individual instrumental sounds. Specifically, we design a single similarity embedding space with disentangled dimensions for each instrument, extracted by Conditional Similarity Networks, which is trained by the triplet loss using masks. Experimental results have shown that (1) the proposed method can obtain more accurate feature representation than using individual networks using separated sounds as input, (2) each sub-embedding space can hold the characteristics of the corresponding instrument, and (3) the selection of similar musical pieces focusing on each instrumental sound by the proposed method can obtain human consent, especially in drums and guitar.

Submitted 9 April, 2024; originally announced April 2024.

arXiv:2403.11508 [pdf, other]

Discriminative Neighborhood Smoothing for Generative Anomalous Sound Detection

Authors: Takuya Fujimura, Keisuke Imoto, Tomoki Toda

Abstract: We propose discriminative neighborhood smoothing of generative anomaly scores for anomalous sound detection. While the discriminative approach is known to often achieve better performance than generative approaches, we have found that it sometimes causes significant performance degradation due to the discrepancy between the training and test data, making it less robust than the generative approach. Our proposed method aims to compensate for the disadvantages of generative and discriminative approaches by combining them. Generative anomaly scores are smoothed using multiple samples with similar discriminative features to improve the performance of the generative approach in an ensemble manner while keeping its robustness. Experimental results show that our proposed method greatly improves on the original generative method, including an absolute improvement of 22% in AUC, and works robustly, while a discriminative method suffers from the discrepancy.

Submitted 18 March, 2024; originally announced March 2024.

Comments: Submitted to EUSIPCO 2024

arXiv:2403.06100 [pdf, other]

Automatic design optimization of preference-based subjective evaluation with online learning in crowdsourcing environment

Authors: Yusuke Yasuda, Tomoki Toda

Abstract: A preference-based subjective evaluation is a key method for evaluating generative media reliably. However, its huge number of pair combinations prohibits it from being applied to large-scale evaluation using crowdsourcing. To address this issue, we propose an automatic optimization method for preference-based subjective evaluation in terms of pair combination selections and allocation of evaluation volumes with online learning in a crowdsourcing environment. We use a preference-based online learning method based on a sorting algorithm to identify the total order of evaluation targets with minimum sample volumes. Our online learning algorithm supports parallel and asynchronous execution under fixed-budget conditions required for crowdsourcing. Our experiment on preference-based subjective evaluation of synthetic speech shows that our method successfully optimizes the test by reducing pair combinations from 351 to 83 and allocating optimal evaluation volumes for each pair ranging from 30 to 663 without compromising evaluation accuracy or wasting budget allocations.

Submitted 10 March, 2024; originally announced March 2024.

arXiv:2401.13260 [pdf, other]

MF-AED-AEC: Speech Emotion Recognition by Leveraging Multimodal Fusion, ASR Error Detection, and ASR Error Correction

Authors: Jiajun He, Xiaohan Shi, Xingfeng Li, Tomoki Toda

Abstract: The prevalent approach in speech emotion recognition (SER) involves integrating both audio and textual information to comprehensively identify the speaker's emotion, with the text generally obtained through automatic speech recognition (ASR). An essential issue of this approach is that ASR errors from the text modality can worsen the performance of SER. Previous studies have proposed using an auxiliary ASR error detection task to adaptively assign weights to each word in ASR hypotheses. However, this approach has limited improvement potential because it does not address the coherence of semantic information in the text. Additionally, the inherent heterogeneity of different modalities leads to distribution gaps between their representations, making their fusion challenging. Therefore, in this paper, we incorporate two auxiliary tasks, ASR error detection (AED) and ASR error correction (AEC), to enhance the semantic coherence of ASR text, and further introduce a novel multi-modal fusion (MF) method to learn shared representations across modalities. We refer to our method as MF-AED-AEC. Experimental results indicate that MF-AED-AEC significantly outperforms the baseline model by a margin of 4.1%.

Submitted 28 May, 2024; v1 submitted 24 January, 2024; originally announced January 2024.

Comments: Accepted by ICASSP 2024

arXiv:2311.13097 [pdf, other]

KMT-2023-BLG-1431Lb: A New $q < 10^{-4}$ Microlensing Planet from a Subtle Signature

Authors: Aislyn Bell, Jiyuan Zhang, Youn Kil Jung, Jennifer C. Yee, Hongjing Yang, Takahiro Sumi, Andrzej Udalski, Michael D. Albrow, Sun-Ju Chung, Andrew Gould, Cheongho Han, Kyu-Ha Hwang, Yoon-Hyun Ryu, In-Gu Shin, Yossi Shvartzvald, Weicheng Zang, Sang-Mok Cha, Dong-Jin Kim, Seung-Lee Kim, Chung-Uk Lee, Dong-Joo Lee, Yongseok Lee, Byeong-Gon Park, Richard W. Pogge, Yunyi Tang, et al. (48 additional authors not shown)

Abstract: The current studies of microlensing planets are limited by small number statistics. Follow-up observations of high-magnification microlensing events can efficiently form a statistical planetary sample. Since 2020, the Korea Microlensing Telescope Network (KMTNet) and the Las Cumbres Observatory (LCO) global network have been conducting a follow-up program for high-magnification KMTNet events. Here, we report the detection and analysis of a microlensing planetary event, KMT-2023-BLG-1431, for which the subtle (0.05 magnitude) and short-lived (5 hours) planetary signature was characterized by the follow-up from KMTNet and LCO. A binary-lens single-source (2L1S) analysis reveals a planet/host mass ratio of $q = (0.72 \pm 0.07) \times 10^{-4}$, and the single-lens binary-source (1L2S) model is excluded by $\Delta\chi^2 = 80$. A Bayesian analysis using a Galactic model yields estimates of the host star mass of $M_{\rm host} = 0.57^{+0.33}_{-0.29}~M_\odot$, the planetary mass of $M_{\rm planet} = 13.5_{-6.8}^{+8.1}~M_{\oplus}$, and the lens distance of $D_{\rm L} = 6.9_{-1.7}^{+0.8}$ kpc. The projected planet-host separation is $a_\perp = 2.3_{-0.5}^{+0.5}$ au or $a_\perp = 3.2_{-0.8}^{+0.7}$ au, subject to the close/wide degeneracy. We also find that without the follow-up data, the survey-only data cannot break the degeneracy of central/resonant caustics and the degeneracy of 2L1S/1L2S models, showing the importance of follow-up observations for current microlensing surveys.

Submitted 21 November, 2023; originally announced November 2023.

Comments: PASP submitted. arXiv admin note: text overlap with arXiv:2301.06779

arXiv:2311.07093 [pdf, other]

On the Effectiveness of ASR Representations in Real-world Noisy Speech Emotion Recognition

Authors: Xiaohan Shi, Jiajun He, Xingfeng Li, Tomoki Toda

Abstract: This paper proposes an efficient approach to noisy speech emotion recognition (NSER). Conventional NSER approaches have proven effective in mitigating the impact of artificial noise sources, such as white Gaussian noise, but are limited in handling non-stationary noises in real-world environments due to their complexity and uncertainty. To overcome this limitation, we introduce a new method for NSER by adopting the automatic speech recognition (ASR) model as a noise-robust feature extractor to eliminate non-vocal information in noisy speech. We first obtain intermediate layer information from the ASR model as a feature representation for emotional speech and then apply this representation for the downstream NSER task. Our experimental results show that the proposed method 1) achieves better NSER performance compared with the conventional noise reduction method, 2) outperforms self-supervised learning approaches, and 3) even outperforms text-based approaches using ASR transcription or the ground truth transcription of noisy speech.

Submitted 12 January, 2025; v1 submitted 13 November, 2023; originally announced November 2023.

Comments: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing

arXiv:2310.05203 [pdf, other]

A Comparative Study of Voice Conversion Models with Large-Scale Speech and Singing Data: The T13 Systems for the Singing Voice Conversion Challenge 2023

Authors: Ryuichi Yamamoto, Reo Yoneyama, Lester Phillip Violeta, Wen-Chin Huang, Tomoki Toda

Abstract: This paper presents our systems (denoted as T13) for the singing voice conversion challenge (SVCC) 2023. For both in-domain and cross-domain English singing voice conversion (SVC) tasks (Task 1 and Task 2), we adopt a recognition-synthesis approach with self-supervised learning-based representation. To achieve data-efficient SVC with a limited amount of target singer/speaker's data (150 to 160 utterances for SVCC 2023), we first train a diffusion-based any-to-any voice conversion model using publicly available large-scale 750 hours of speech and singing data. Then, we finetune the model for each target singer/speaker of Task 1 and Task 2. Large-scale listening tests conducted by SVCC 2023 show that our T13 system achieves competitive naturalness and speaker similarity for the harder cross-domain SVC (Task 2), which implies the generalization ability of our proposed method. Our objective evaluation results show that using large datasets is particularly beneficial for cross-domain SVC.

Submitted 8 October, 2023; originally announced October 2023.

Comments: Accepted to ASRU 2023

arXiv:2310.05129 [pdf, other]

ED-CEC: Improving Rare Word Recognition Using ASR Postprocessing Based on Error Detection and Context-Aware Error Correction

Authors: Jiajun He, Zekun Yang, Tomoki Toda

Abstract: Automatic speech recognition (ASR) systems often encounter difficulties in accurately recognizing rare words, leading to errors that can have a negative impact on downstream tasks such as keyword spotting, intent detection, and text summarization. To address this challenge, we present a novel ASR postprocessing method that focuses on improving the recognition of rare words through error detection and context-aware error correction. Our method optimizes the decoding process by targeting only the predicted error positions, minimizing unnecessary computations. Moreover, we leverage a rare word list to provide additional contextual knowledge, enabling the model to better correct rare words. Experimental results across five datasets demonstrate that our proposed method achieves significantly lower word error rates (WERs) than previous approaches while maintaining a reasonable inference speed. Furthermore, our approach exhibits promising robustness across different ASR systems.

Submitted 8 October, 2023; originally announced October 2023.

Comments: 6 pages, 5 figures, conference

arXiv:2310.02640 [pdf, other]

The VoiceMOS Challenge 2023: Zero-shot Subjective Speech Quality Prediction for Multiple Domains

Authors: Erica Cooper, Wen-Chin Huang, Yu Tsao, Hsin-Min Wang, Tomoki Toda, Junichi Yamagishi

Abstract: We present the second edition of the VoiceMOS Challenge, a scientific event that aims to promote the study of automatic prediction of the mean opinion score (MOS) of synthesized and processed speech. This year, we emphasize real-world and challenging zero-shot out-of-domain MOS prediction with three tracks for three different voice evaluation scenarios. Ten teams from industry and academia in seven different countries participated. Surprisingly, we found that the two sub-tracks of French text-to-speech synthesis had large differences in their predictability, and that singing voice-converted samples were not as difficult to predict as we had expected. Use of diverse datasets and listener information during training appeared to be successful approaches.

Submitted 6 October, 2023; v1 submitted 4 October, 2023; originally announced October 2023.

Comments: Accepted to ASRU 2023

arXiv:2310.02570 [pdf, other]

Improving severity preservation of healthy-to-pathological voice conversion with global style tokens

Authors: Bence Mark Halpern, Wen-Chin Huang, Lester Phillip Violeta, R. J. J. H. van Son, Tomoki Toda

Abstract: In healthy-to-pathological voice conversion (H2P-VC), healthy speech is converted into pathological speech while preserving the speaker identity. The paper improves on a previous two-stage approach to H2P-VC where (1) speech is first created with the appropriate severity, and (2) the speaker identity of the voice is then converted while preserving the severity of the voice. Specifically, we propose improvements to (2) by using phonetic posteriorgrams (PPG) and global style tokens (GST). Furthermore, we present a new dataset that contains parallel recordings of pathological and healthy speakers with the same identity, which allows more precise evaluation. Listening tests by expert listeners show that the framework preserves the severity of the source sample while modelling the target speaker's voice. We also show that (a) pathology impacts x-vectors but not all speaker information is lost, and (b) choosing source speakers based on severity labels alone is insufficient.

Submitted 4 October, 2023; originally announced October 2023.

Comments: 7 pages, 3 figures, 5 tables. Accepted to IEEE Automatic Speech Recognition and Understanding Workshop 2023

ACM Class: I.2.7

arXiv:2309.09627 [pdf, other]

Electrolaryngeal Speech Intelligibility Enhancement Through Robust Linguistic Encoders

Authors: Lester Phillip Violeta, Wen-Chin Huang, Ding Ma, Ryuichi Yamamoto, Kazuhiro Kobayashi, Tomoki Toda

Abstract: We propose a novel framework for electrolaryngeal speech intelligibility enhancement through the use of robust linguistic encoders. Pretraining and fine-tuning approaches have proven to work well in this task, but in most cases, various mismatches, such as the speech type mismatch (electrolaryngeal vs. typical) or a speaker mismatch between the datasets used in each stage, can deteriorate the conversion performance of this framework. To resolve this issue, we propose a linguistic encoder robust enough to project both EL and typical speech in the same latent space, while still being able to extract accurate linguistic information, creating a unified representation to reduce the speech type mismatch. Furthermore, we introduce HuBERT output features to the proposed framework for reducing the speaker mismatch, making it possible to effectively use a large-scale parallel dataset during pretraining. We show that compared to the conventional framework using mel-spectrogram input and output features, using the proposed framework enables the model to synthesize more intelligible and naturally sounding speech, as shown by a significant 16% improvement in character error rate and 0.83 improvement in naturalness score.

Submitted 20 January, 2024; v1 submitted 18 September, 2023; originally announced September 2023.

Comments: Accepted to ICASSP 2024. Demo page: lesterphillip.github.io/icassp2024_el_sie

arXiv:2309.08141 [pdf, other]

Audio Difference Learning for Audio Captioning

Authors: Tatsuya Komatsu, Yusuke Fujita, Kazuya Takeda, Tomoki Toda

Abstract: This study introduces a novel training paradigm, audio difference learning, for improving audio captioning. The fundamental concept of the proposed learning method is to create a feature representation space that preserves the relationship between audio, enabling the generation of captions that detail intricate audio information. This method employs a reference audio along with the input audio, both of which are transformed into feature representations via a shared encoder. Captions are then generated from these differential features to describe their differences. Furthermore, a unique technique is proposed that involves mixing the input audio with additional audio, and using the additional audio as a reference. This results in the difference between the mixed audio and the reference audio reverting back to the original input audio. This allows the original input's caption to be used as the caption for their difference, eliminating the need for additional annotations for the differences. In the experiments using the Clotho and ESC50 datasets, the proposed method demonstrated an improvement in the SPIDEr score by 7% compared to conventional methods.

Submitted 15 September, 2023; originally announced September 2023.

Comments: Submitted to ICASSP 2024

arXiv:2309.07598 [pdf, other]

AAS-VC: On the Generalization Ability of Automatic Alignment Search based Non-autoregressive Sequence-to-sequence Voice Conversion

Authors: Wen-Chin Huang, Kazuhiro Kobayashi, Tomoki Toda

Abstract: Non-autoregressive (non-AR) sequence-to-sequence (seq2seq) models for voice conversion (VC) are attractive in their ability to effectively model the temporal structure while enjoying boosted intelligibility and fast inference thanks to non-AR modeling. However, the dependency of current non-AR seq2seq VC models on ground truth durations extracted from an external AR model greatly limits their generalization ability to smaller training datasets. In this paper, we first demonstrate the above-mentioned problem by varying the training data size. Then, we present AAS-VC, a non-AR seq2seq VC model based on automatic alignment search (AAS), which removes the dependency on external durations and serves as a proper inductive bias to provide the required generalization ability for small datasets. Experimental results show that AAS-VC can generalize better to a training dataset of only 5 minutes. We also conducted ablation studies to justify several model design choices. The audio samples and implementation are available online.

Submitted 15 September, 2023; v1 submitted 14 September, 2023; originally announced September 2023.

Comments: Submitted to ICASSP 2024. Demo: https://unilight.github.io/Publication-Demos/publications/aas-vc/index.html. Code: https://github.com/unilight/seq2seq-vc

arXiv:2309.02133 [pdf, other]

Evaluating Methods for Ground-Truth-Free Foreign Accent Conversion

Authors: Wen-Chin Huang, Tomoki Toda

Abstract: Foreign accent conversion (FAC) is a special application of voice conversion (VC) which aims to convert the accented speech of a non-native speaker to a native-sounding speech with the same speaker identity. FAC is difficult since the native speech from the desired non-native speaker to be used as the training target is impossible to collect. In this work, we evaluate three recently proposed methods for ground-truth-free FAC, where all of them aim to harness the power of sequence-to-sequence (seq2seq) and non-parallel VC models to properly convert the accent and control the speaker identity. Our experimental evaluation results show that no single method was significantly better than the others in all evaluation axes, which is in contrast to conclusions drawn in previous studies. We also explain the effectiveness of these methods with the training input and output of the seq2seq model and examine the design choice of the non-parallel VC model, and show that intelligibility measures such as word error rates do not correlate well with subjective accentedness. Finally, our implementation is open-sourced to promote reproducible research and help future researchers improve upon the compared systems.

Submitted 5 September, 2023; originally announced September 2023.

Comments: Accepted to the 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). Demo page: https://unilight.github.io/Publication-Demos/publications/fac-evaluate. Code: https://github.com/unilight/seq2seq-vc

arXiv:2308.15203 [pdf, other]

doi 10.21437/Interspeech.2023-589

Preference-based training framework for automatic speech quality assessment using deep neural network

Authors: Cheng-Hung Hu, Yusuke Yasuda, Tomoki Toda

Abstract: One objective of Speech Quality Assessment (SQA) is to estimate the ranks of synthetic speech systems. However, recent SQA models are typically trained using low-precision direct scores such as mean opinion scores (MOS) as the training objective, from which rankings are not straightforward to estimate. Although it is effective for predicting quality scores of individual sentences, this approach does not account for speech and system preferences when ranking multiple systems. We propose a training framework of SQA models that can be trained with only preference scores derived from pairs of MOS to improve ranking prediction. Our experiment reveals conditions where our framework works the best in terms of pair generation, aggregation functions to derive system score from utterance preferences, and threshold functions to determine preference from a pair of MOS. Our results demonstrate that our proposed method significantly outperforms the baseline model in Spearman's Rank Correlation Coefficient.

Submitted 29 August, 2023; originally announced August 2023.

Comments: Accepted by Interspeech 2023, oral

arXiv:2307.14274 [pdf, other]

OGLE-2019-BLG-0825: Constraints on the Source System and Effect on Binary-lens Parameters arising from a Five Day Xallarap Effect in a Candidate Planetary Microlensing Event

Authors: Yuki K. Satoh, Naoki Koshimoto, David P. Bennett, Takahiro Sumi, Nicholas J. Rattenbury, Daisuke Suzuki, Shota Miyazaki, Ian A. Bond, Andrzej Udalski, Andrew Gould, Valerio Bozza, Martin Dominik, Yuki Hirao, Iona Kondo, Rintaro Kirikawa, Ryusei Hamada, Fumio Abe, Richard Barry, Aparna Bhattacharya, Hirosane Fujii, Akihiko Fukui, Katsuki Fujita, Tomoya Ikeno, Stela Ishitani Silva, Yoshitaka Itow, et al. (64 additional authors not shown)

Abstract: We present an analysis of microlensing event OGLE-2019-BLG-0825. This event was identified as a planetary candidate by preliminary modeling. We find that significant residuals from the best-fit static binary-lens model exist and a xallarap effect can fit the residuals very well and significantly improves $χ^2$ values. On the other hand, by including the xallarap effect in our models, we find that binary-lens parameters like mass-ratio, $q$, and separation, $s$, cannot be constrained well. However, we also find that the parameters for the source system like the orbital period and semi-major axis are consistent between all the models we analyzed. We therefore constrain the properties of the source system better than the properties of the lens system. The source system comprises a G-type main-sequence star orbited by a brown dwarf with a period of $P\sim5$ days. This analysis is the first to demonstrate that the xallarap effect does affect binary-lens parameters in planetary events. It would not be common for the presence or absence of the xallarap effect to affect lens parameters in events with long orbital periods of the source system or events with transits to caustics, but in other cases, such as this event, the xallarap effect can affect binary-lens parameters.

Submitted 26 July, 2023; originally announced July 2023.

Comments: 19 pages, 7 figures, 6 tables. Accepted by AJ

arXiv:2307.00753 [pdf, ps, other]

KMT-2022-BLG-0475Lb and KMT-2022-BLG-1480Lb: Microlensing ice giants detected via non-caustic-crossing channel

Authors: Cheongho Han, Chung-Uk Lee, Ian A. Bond, Weicheng Zang, Sun-Ju Chung, Michael D. Albrow, Andrew Gould, Kyu-Ha Hwang, Youn Kil Jung, Yoon-Hyun Ryu, In-Gu Shin, Yossi Shvartzvald, Hongjing Yang, Jennifer C. Yee, Sang-Mok Cha, Doeon Kim, Dong-Jin Kim, Seung-Lee Kim, Dong-Joo Lee, Yongseok Lee, Byeong-Gon Park, Richard W. Pogge, Shude Mao, Wei Zhu, Fumio Abe, et al. (27 additional authors not shown)

Abstract: We investigate the microlensing data collected in the 2022 season from the high-cadence microlensing surveys in order to find weak signals produced by planetary companions to lenses. From these searches, we find that two lensing events KMT-2022-BLG-0475 and KMT-2022-BLG-1480 exhibit weak short-term anomalies. From the detailed modeling of the lensing light curves, we identify that the anomalies are produced by planetary companions with a mass ratio to the primary of $q\sim 1.8\times 10^{-4}$ for KMT-2022-BLG-0475L and a ratio $q\sim 4.3\times 10^{-4}$ for KMT-2022-BLG-1480L. It is estimated that the host and planet masses and the projected planet-host separation are $(M_{\rm h}/M_\odot, M_{\rm p}/M_{\rm U}, a_\perp/{\rm au}) = (0.43^{+0.35}_{-0.23}, 1.73^{+1.42}_{-0.92}, 2.03^{+0.25}_{-0.38})$ for KMT-2022-BLG-0475L, and $(0.18^{+0.16}_{-0.09}, 1.82^{+1.60}_{-0.92}, 1.22^{+0.15}_{-0.14})$ for KMT-2022-BLG-1480L, where $M_{\rm U}$ denotes the mass of Uranus. Both planetary systems share common characteristics that the primaries of the lenses are early-mid M dwarfs lying in the Galactic bulge and the companions are ice giants lying beyond the snow lines of the planetary systems.

Submitted 3 July, 2023; originally announced July 2023.

Comments: 10 pages, 10 figures

arXiv:2306.14422 [pdf, other]

The Singing Voice Conversion Challenge 2023

Authors: Wen-Chin Huang, Lester Phillip Violeta, Songxiang Liu, Jiatong Shi, Tomoki Toda

Abstract: We present the latest iteration of the voice conversion challenge (VCC) series, a bi-annual scientific event aiming to compare and understand different voice conversion (VC) systems based on a common dataset. This year we shifted our focus to singing voice conversion (SVC), and thus named the challenge the Singing Voice Conversion Challenge (SVCC). A new database was constructed for two tasks, namely in-domain and cross-domain SVC. The challenge was run for two months, and in total we received 26 submissions, including 2 baselines. Through a large-scale crowd-sourced listening test, we observed that for both tasks, although human-level naturalness was achieved by the top system, no team was able to obtain a similarity score as high as the target speakers. Also, as expected, cross-domain SVC is harder than in-domain SVC, especially in the similarity aspect. We also investigated whether existing objective measurements were able to predict perceptual performance, and found that only a few of them could reach a significant correlation.

Submitted 6 July, 2023; v1 submitted 26 June, 2023; originally announced June 2023.

arXiv:2306.13953 [pdf, other]

An Analysis of Personalized Speech Recognition System Development for the Deaf and Hard-of-Hearing

Authors: Lester Phillip Violeta, Tomoki Toda

Abstract: Deaf or hard-of-hearing (DHH) speakers typically have atypical speech caused by deafness. With the growing support of speech-based devices and software applications, more work needs to be done to make these devices inclusive to everyone. To do so, we analyze the use of openly-available automatic speech recognition (ASR) tools with a DHH Japanese speaker dataset. As these out-of-the-box ASR models typically do not perform well on DHH speech, we provide a thorough analysis of creating personalized ASR systems. We collected a large DHH speaker dataset of four speakers totaling around 28.05 hours and thoroughly analyzed the performance of different training frameworks by varying the training data sizes. Our findings show that 1000 utterances (or 1-2 hours) from a target speaker can already significantly improve the model performance with a minimal amount of work needed; thus, we recommend that researchers collect at least 1000 utterances to make an efficient personalized ASR system. In cases where 1000 utterances are difficult to collect, we also discover significant improvements in using previously proposed data augmentation techniques such as intermediate fine-tuning when only 200 utterances are available.

Submitted 24 June, 2023; originally announced June 2023.

Comments: Submitted to APSIPA 2023

arXiv:2305.15628 [pdf, ps, other]

doi 10.1051/0004-6361/202346596

KMT-2021-BLG-1150Lb: Microlensing planet detected through a densely covered planetary-caustic signal

Authors: Cheongho Han, Youn Kil Jung, Ian A. Bond, Andrew Gould, Sun-Ju Chung, Michael D. Albrow, Kyu-Ha Hwang, Yoon-Hyun Ryu, In-Gu Shin, Yossi Shvartzvald, Hongjing Yang, Jennifer C. Yee, Weicheng Zang, Sang-Mok Cha, Doeon Kim, Dong-Jin Kim, Seung-Lee Kim, Chung-Uk Lee, Dong-Joo Lee, Yongseok Lee, Byeong-Gon Park, Richard W. Pogge, Fumio Abe, Richard Barry, David P. Bennett, et al. (27 additional authors not shown)

Abstract: Recently, there have been reports of various types of degeneracies in the interpretation of planetary signals induced by planetary caustics. In this work, we check whether such degeneracies persist in the case of well-covered signals by analyzing the lensing event KMT-2021-BLG-1150, for which the light curve exhibits a densely and continuously covered short-term anomaly. In order to identify degenerate solutions, we thoroughly investigate the parameter space by conducting dense grid searches for the lensing parameters. We then check the severity of the degeneracy among the identified solutions. We identify a pair of planetary solutions resulting from the well-known inner-outer degeneracy, and find that interpreting the anomaly is not subject to any degeneracy other than the inner-outer degeneracy. The measured parameters of the planet separation (normalized to the Einstein radius) and mass ratio between the lens components are $(s, q)_{\rm in}\sim (1.297, 1.10\times 10^{-3})$ for the inner solution and $(s, q)_{\rm out}\sim (1.242, 1.15\times 10^{-3})$ for the outer solution. According to a Bayesian estimation, the lens is a planetary system consisting of a planet with a mass $M_{\rm p}=0.88^{+0.38}_{-0.36}~M_{\rm J}$ and its host with a mass $M_{\rm h}=0.73^{+0.32}_{-0.30}~M_\odot$ lying toward the Galactic center at a distance $D_{\rm L} =3.8^{+1.3}_{-1.2}$~kpc. By conducting analyses using mock data sets prepared to mimic those obtained with data gaps and under various observational cadences, it is found that gaps in data can result in various degenerate solutions, while the observational cadence does not pose a serious degeneracy problem as long as the anomaly feature can be delineated.

Submitted 24 May, 2023; originally announced May 2023.

Comments: 9 pages, 8 figures

arXiv:2305.06605 [pdf, ps, other]

doi 10.1051/0004-6361/202245455

Probable brown dwarf companions detected in binary microlensing events during the 2018-2020 seasons of the KMTNet survey

Authors: Cheongho Han, Youn Kil Jung, Doeon Kim, Andrew Gould, Valerio Bozza, Ian A. Bond, Sun-Ju Chung, Michael D. Albrow, Kyu-Ha Hwang, Yoon-Hyun Ryu, In-Gu Shin, Yossi Shvartzvald, Hongjing Yang, Weicheng Zang, Sang-Mok Cha, Dong-Jin Kim, Hyoun-Woo Kim, Seung-Lee Kim, Chung-Uk Lee, Dong-Joo Lee, Jennifer C. Yee, Yongseok Lee, Byeong-Gon Park, Richard W. Pogge, Fumio Abe, et al. (26 additional authors not shown)

Abstract: We inspect the microlensing data of the KMTNet survey collected during the 2018--2020 seasons in order to find lensing events produced by binaries with brown-dwarf companions. In order to pick out binary-lens events with candidate BD lens companions, we conduct systematic analyses of all anomalous lensing events observed during the seasons. By applying the selection criterion with mass ratio between the lens components of $0.03\lesssim q\lesssim 0.1$, we identify four binary-lens events with candidate BD companions, including KMT-2018-BLG-0321, KMT-2018-BLG-0885, KMT-2019-BLG-0297, and KMT-2019-BLG-0335. For the individual events, we present the interpretations of the lens systems and measure the observables that can constrain the physical lens parameters. The masses of the lens companions estimated from the Bayesian analyses based on the measured observables indicate that the probabilities for the lens companions to be in the brown-dwarf mass regime are high: 59\%, 68\%, 66\%, and 66\% for the four events respectively.

Submitted 11 May, 2023; originally announced May 2023.

Comments: 10 pages, 8 figures

arXiv:2304.02815 [pdf, ps, other]

doi 10.1051/0004-6361/202346166

MOA-2022-BLG-249Lb: Nearby microlensing super-Earth planet detected from high-cadence surveys

Authors: Cheongho Han, Andrew Gould, Youn Kil Jung, Ian A. Bond, Weicheng Zang, Sun-Ju Chung, Michael D. Albrow, Kyu-Ha Hwang, Yoon-Hyun Ryu, In-Gu Shin, Yossi Shvartzvald, Hongjing Yang, Jennifer C. Yee, Sang-Mok Cha, Doeon Kim, Dong-Jin Kim, Seung-Lee Kim, Chung-Uk Lee, Dong-Joo Lee, Yongseok Lee, Byeong-Gon Park, Richard W. Pogge, Shude Mao, Wei Zhu, Fumio Abe, et al. (29 additional authors not shown)

Abstract: We investigate the data collected by the high-cadence microlensing surveys during the 2022 season in search for planetary signals appearing in the light curves of microlensing events. From this search, we find that the lensing event MOA-2022-BLG-249 exhibits a brief positive anomaly that lasted for about 1 day with a maximum deviation of $\sim 0.2$~mag from a single-source single-lens model. We analyze the light curve under the two interpretations of the anomaly: one originated by a low-mass companion to the lens (planetary model) and the other originated by a faint companion to the source (binary-source model). It is found that the anomaly is better explained by the planetary model than the binary-source model. We identify two solutions rooted in the inner--outer degeneracy, for both of which the estimated planet-to-host mass ratio, $q\sim 8\times 10^{-5}$, is very small. With the constraints provided by the microlens parallax and the lower limit on the Einstein radius, as well as the blend-flux constraint, we find that the lens is a planetary system, in which a super-Earth planet, with a mass $(4.83\pm 1.44)~M_\oplus$, orbits a low-mass host star, with a mass $(0.18\pm 0.05)~M_\odot$, lying in the Galactic disk at a distance $(2.00\pm 0.42)$~kpc. The planet detection demonstrates the elevated microlensing sensitivity of the current high-cadence lensing surveys to low-mass planets.

Submitted 5 April, 2023; originally announced April 2023.

Comments: 10 pages, 9 figures

arXiv:2302.07443 [pdf, other]

doi 10.1016/j.physletb.2023.138128

Precise lifetime measurement of $^4_Λ$H hypernucleus using in-flight $^4$He$(K^-, π^0)^4_Λ$H reaction

Authors: T. Akaishi, H. Asano, X. Chen, A. Clozza, C. Curceanu, R. Del Grande, C. Guaraldo, C. Han, T. Hashimoto, M. Iliescu, K. Inoue, S. Ishimoto, K. Itahashi, M. Iwasaki, Y. Ma, M. Miliucci, R. Murayama, H. Noumi, H. Ohnishi, S. Okada, H. Outa, K. Piscicchia, A. Sakaguchi, F. Sakuma, M. Sato, et al. (13 additional authors not shown)

Abstract: We present a new measurement of the $^4_Λ$H hypernuclear lifetime using in-flight $K^-$ + $^4$He $\rightarrow$ $^4_Λ$H + $π^0$ reaction at the J-PARC hadron facility. We demonstrate, for the first time, the effective selection of the hypernuclear bound state using only the $γ$-ray energy decayed from $π^0$. This opens the possibility for a systematic study of isospin partner hypernuclei through comparison with data from ($K^-$, $π^-$) reaction. As the first application of this method, our result for the $^4_Λ$H lifetime, $τ(^4_Λ\mathrm{H}) = 206 \pm 8 (\mathrm{stat.}) \pm 12 (\mathrm{syst.})\ \mathrm{ps}$, is one of the most precise measurements to date. We are also preparing to measure the lifetime of the hypertriton ($^3_Λ$H) using the same setup in the near future.

Submitted 27 August, 2023; v1 submitted 14 February, 2023; originally announced February 2023.

arXiv:2301.06779 [pdf]

doi 10.1093/mnras/stad1398

KMT-2022-BLG-0440Lb: A New $q < 10^{-4}$ Microlensing Planet with the Central-Resonant Caustic Degeneracy Broken

Authors: Jiyuan Zhang, Weicheng Zang, Youn Kil Jung, Hongjing Yang, Andrew Gould, Takahiro Sumi, Shude Mao, Subo Dong, Michael D. Albrow, Sun-Ju Chung, Cheongho Han, Kyu-Ha Hwang, Yoon-Hyun Ryu, In-Gu Shin, Yossi Shvartzvald, Jennifer C. Yee, Sang-Mok Cha, Dong-Jin Kim, Hyoun-Woo Kim, Seung-Lee Kim, Chung-Uk Lee, Dong-Joo Lee, Yongseok Lee, Byeong-Gon Park, Richard W. Pogge, et al. (35 additional authors not shown)

Abstract: We present the observations and analysis of a high-magnification microlensing planetary event, KMT-2022-BLG-0440, for which the weak and short-lived planetary signal was covered by both the KMTNet survey and follow-up observations. The binary-lens models with a central caustic provide the best fits, with a planet/host mass ratio, $q = 0.75$--$1.00 \times 10^{-4}$ at $1σ$. The binary-lens models with a resonant caustic and a brown-dwarf mass ratio are both excluded by $Δχ^2 > 70$. The binary-source model can fit the anomaly well but is rejected by the ``color argument'' on the second source. From Bayesian analyses, it is estimated that the host star is likely a K or M dwarf located in the Galactic disk, the planet probably has a Neptune-mass, and the projected planet-host separation is $1.9^{+0.6}_{-0.7}$ or $4.6^{+1.4}_{-1.7}$ au, subject to the close/wide degeneracy. This is the third $q < 10^{-4}$ planet from a high-magnification planetary signal ($A \gtrsim 65$). Together with another such planet, KMT-2021-BLG-0171Lb, the ongoing follow-up program for the KMTNet high-magnification events has demonstrated its ability in detecting high-magnification planetary signals for $q < 10^{-4}$ planets, which are challenging for the current microlensing surveys.

Submitted 2 May, 2023; v1 submitted 17 January, 2023; originally announced January 2023.

Comments: MNRAS accepted

arXiv:2212.08329 [pdf, other]

Text-to-speech synthesis based on latent variable conversion using diffusion probabilistic model and variational autoencoder

Authors: Yusuke Yasuda, Tomoki Toda

Abstract: Text-to-speech synthesis (TTS) is a task to convert texts into speech. Two of the factors that have been driving TTS are the advancements of probabilistic models and latent representation learning. We propose a TTS method based on latent variable conversion using a diffusion probabilistic model and the variational autoencoder (VAE). In our TTS method, we use a waveform model based on VAE, a diffusion model that predicts the distribution of latent variables in the waveform model from texts, and an alignment model that learns alignments between the text and speech latent sequences. Our method integrates diffusion with VAE by modeling both mean and variance parameters with diffusion, where the target distribution is determined by approximation from VAE. This latent variable conversion framework potentially enables us to flexibly incorporate various latent feature extractors. Our experiments show that our method is robust to linguistic labels with poor orthography and alignment errors.

Submitted 16 December, 2022; originally announced December 2022.

Comments: Submitted to ICASSP 2023

Showing 1–50 of 141 results for author: Toda, T