Showing 1–38 of 38 results for author: Koizumi, Y

  1. arXiv:2505.05077  [pdf, other]

    cs.SD eess.AS

    ReverbMiipher: Generative Speech Restoration meets Reverberation Characteristics Controllability

    Authors: Wataru Nakata, Yuma Koizumi, Shigeki Karita, Robin Scheibler, Haruko Ishikawa, Adriana Guevara-Rukoz, Heiga Zen, Michiel Bacchiani

    Abstract: Reverberation encodes spatial information regarding the acoustic source environment, yet traditional Speech Restoration (SR) usually completely removes reverberation. We propose ReverbMiipher, an SR model extending parametric resynthesis framework, designed to denoise speech while preserving and enabling control over reverberation. ReverbMiipher incorporates a dedicated ReverbEncoder to extract a…

    Submitted 8 May, 2025; originally announced May 2025.

    Comments: 5 pages, 5 figures

  2. arXiv:2505.04457  [pdf, other]

    cs.SD cs.CL eess.AS

    Miipher-2: A Universal Speech Restoration Model for Million-Hour Scale Data Restoration

    Authors: Shigeki Karita, Yuma Koizumi, Heiga Zen, Haruko Ishikawa, Robin Scheibler, Michiel Bacchiani

    Abstract: Training data cleaning is a new application for generative model-based speech restoration (SR). This paper introduces Miipher-2, an SR model designed for million-hour scale data, for training data cleaning for large-scale generative models like large language models. Key challenges addressed include generalization to unseen languages, operation without explicit conditioning (e.g., text, speaker ID…

    Submitted 8 May, 2025; v1 submitted 7 May, 2025; originally announced May 2025.

  3. arXiv:2408.06227  [pdf]

    cs.CL cs.AI cs.SD eess.AS

    FLEURS-R: A Restored Multilingual Speech Corpus for Generation Tasks

    Authors: Min Ma, Yuma Koizumi, Shigeki Karita, Heiga Zen, Jason Riesa, Haruko Ishikawa, Michiel Bacchiani

    Abstract: This paper introduces FLEURS-R, a speech restoration applied version of the Few-shot Learning Evaluation of Universal Representations of Speech (FLEURS) corpus. FLEURS-R maintains an N-way parallel speech corpus in 102 languages as FLEURS, with improved audio quality and fidelity by applying the speech restoration model Miipher. The aim of FLEURS-R is to advance speech technology in more languages…

    Submitted 12 August, 2024; originally announced August 2024.

    Journal ref: INTERSPEECH 2024

  4. arXiv:2305.18802  [pdf, other]

    eess.AS cs.SD

    LibriTTS-R: A Restored Multi-Speaker Text-to-Speech Corpus

    Authors: Yuma Koizumi, Heiga Zen, Shigeki Karita, Yifan Ding, Kohei Yatabe, Nobuyuki Morioka, Michiel Bacchiani, Yu Zhang, Wei Han, Ankur Bapna

    Abstract: This paper introduces a new speech dataset called "LibriTTS-R" designed for text-to-speech (TTS) use. It is derived by applying speech restoration to the LibriTTS corpus, which consists of 585 hours of speech data at 24 kHz sampling rate from 2,456 speakers and the corresponding texts. The constituent samples of LibriTTS-R are identical to those of LibriTTS, with only the sound quality improved.…

    Submitted 30 May, 2023; originally announced May 2023.

    Comments: Accepted to Interspeech 2023

  5. arXiv:2305.07828  [pdf, other]

    cs.SD cs.LG eess.AS

    Description and Discussion on DCASE 2023 Challenge Task 2: First-Shot Unsupervised Anomalous Sound Detection for Machine Condition Monitoring

    Authors: Kota Dohi, Keisuke Imoto, Noboru Harada, Daisuke Niizumi, Yuma Koizumi, Tomoya Nishida, Harsh Purohit, Ryo Tanabe, Takashi Endo, Yohei Kawaguchi

    Abstract: We present the task description of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 Challenge Task 2: "First-shot unsupervised anomalous sound detection (ASD) for machine condition monitoring". The main goal is to enable rapid deployment of ASD systems for new kinds of machines without the need for hyperparameter tuning. In the past ASD tasks, developed methods tuned h…

    Submitted 2 November, 2023; v1 submitted 12 May, 2023; originally announced May 2023.

    Comments: anomaly detection, acoustic condition monitoring, domain shift, first-shot problem, DCASE Challenge, Accepted in DCASE2023 Workshop

  6. arXiv:2303.01664  [pdf, other]

    cs.SD cs.LG eess.AS

    Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech and Text Representations

    Authors: Yuma Koizumi, Heiga Zen, Shigeki Karita, Yifan Ding, Kohei Yatabe, Nobuyuki Morioka, Yu Zhang, Wei Han, Ankur Bapna, Michiel Bacchiani

    Abstract: Speech restoration (SR) is a task of converting degraded speech signals into high-quality ones. In this study, we propose a robust SR model called Miipher, and apply Miipher to a new SR application: increasing the amount of high-quality training data for speech generation by converting speech samples collected from the Web to studio-quality. To make our SR model robust against various degradation,…

    Submitted 14 August, 2023; v1 submitted 2 March, 2023; originally announced March 2023.

    Comments: Accepted to WASPAA 2023

  7. arXiv:2210.01029  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    WaveFit: An Iterative and Non-autoregressive Neural Vocoder based on Fixed-Point Iteration

    Authors: Yuma Koizumi, Kohei Yatabe, Heiga Zen, Michiel Bacchiani

    Abstract: Denoising diffusion probabilistic models (DDPMs) and generative adversarial networks (GANs) are popular generative models for neural vocoders. The DDPMs and GANs can be characterized by the iterative denoising framework and adversarial training, respectively. This study proposes a fast and high-quality neural vocoder called WaveFit, which integrates the essence of GANs into a DDPM-like it…

    Submitted 3 October, 2022; originally announced October 2022.

    Comments: Accepted to IEEE SLT 2022

  8. arXiv:2206.05876  [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    Description and Discussion on DCASE 2022 Challenge Task 2: Unsupervised Anomalous Sound Detection for Machine Condition Monitoring Applying Domain Generalization Techniques

    Authors: Kota Dohi, Keisuke Imoto, Noboru Harada, Daisuke Niizumi, Yuma Koizumi, Tomoya Nishida, Harsh Purohit, Takashi Endo, Masaaki Yamamoto, Yohei Kawaguchi

    Abstract: We present the task description and discussion on the results of the DCASE 2022 Challenge Task 2: "Unsupervised anomalous sound detection (ASD) for machine condition monitoring applying domain generalization techniques". Domain shifts are a critical problem for the application of ASD systems. Because domain shifts can change the acoustic characteristics of data, a model trained in a source domai…

    Submitted 21 November, 2022; v1 submitted 12 June, 2022; originally announced June 2022.

    Comments: arXiv admin note: substantial text overlap with arXiv:2106.04492

  9. arXiv:2204.12092  [pdf, other]

    eess.AS cs.SD

    Mask scalar prediction for improving robust automatic speech recognition

    Authors: Arun Narayanan, James Walker, Sankaran Panchapagesan, Nathan Howard, Yuma Koizumi

    Abstract: Using neural network based acoustic frontends for improving robustness of streaming automatic speech recognition (ASR) systems is challenging because of the causality constraints and the resulting distortion that the frontend processing introduces in speech. Time-frequency masking based approaches have been shown to work well, but they need additional hyper-parameters to scale the mask to limit sp…

    Submitted 26 April, 2022; originally announced April 2022.

    Comments: Submitted to Interspeech 2022

  10. arXiv:2203.16749  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping

    Authors: Yuma Koizumi, Heiga Zen, Kohei Yatabe, Nanxin Chen, Michiel Bacchiani

    Abstract: Neural vocoder using denoising diffusion probabilistic model (DDPM) has been improved by adaptation of the diffusion noise distribution to given acoustic features. In this study, we propose SpecGrad that adapts the diffusion noise so that its time-varying spectral envelope becomes close to the conditioning log-mel spectrogram. This adaptation by time-varying filtering improves the sound quality es…

    Submitted 4 August, 2022; v1 submitted 30 March, 2022; originally announced March 2022.

    Comments: Accepted to Interspeech 2022

  11. arXiv:2111.00764  [pdf, other]

    eess.AS cs.SD

    SNRi Target Training for Joint Speech Enhancement and Recognition

    Authors: Yuma Koizumi, Shigeki Karita, Arun Narayanan, Sankaran Panchapagesan, Michiel Bacchiani

    Abstract: Speech enhancement (SE) is used as a frontend in speech applications including automatic speech recognition (ASR) and telecommunication. A difficulty in using the SE frontend is that the appropriate noise reduction level differs depending on applications and/or noise characteristics. In this study, we propose "signal-to-noise ratio improvement (SNRi) target training"; the SE frontend is trained to…

    Submitted 28 March, 2022; v1 submitted 1 November, 2021; originally announced November 2021.

    Comments: Submitted to Interspeech 2022 (v1 has been rejected from ICASSP 2022)

  12. arXiv:2106.15813  [pdf, other]

    eess.AS cs.SD

    DF-Conformer: Integrated architecture of Conv-TasNet and Conformer using linear complexity self-attention for speech enhancement

    Authors: Yuma Koizumi, Shigeki Karita, Scott Wisdom, Hakan Erdogan, John R. Hershey, Llion Jones, Michiel Bacchiani

    Abstract: Single-channel speech enhancement (SE) is an important task in speech processing. A widely used framework combines an analysis/synthesis filterbank with a mask prediction network, such as the Conv-TasNet architecture. In such systems, the denoising performance and computational efficiency are mainly affected by the structure of the mask prediction network. In this study, we aim to improve the sequ…

    Submitted 5 August, 2021; v1 submitted 30 June, 2021; originally announced June 2021.

    Comments: 5 pages, 2 figures. Accepted for WASPAA 2021

  13. arXiv:2106.04492  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Description and Discussion on DCASE 2021 Challenge Task 2: Unsupervised Anomalous Sound Detection for Machine Condition Monitoring under Domain Shifted Conditions

    Authors: Yohei Kawaguchi, Keisuke Imoto, Yuma Koizumi, Noboru Harada, Daisuke Niizumi, Kota Dohi, Ryo Tanabe, Harsh Purohit, Takashi Endo

    Abstract: We present the task description and discussion on the results of the DCASE 2021 Challenge Task 2. In 2020, we organized an unsupervised anomalous sound detection (ASD) task, identifying whether a given sound was normal or anomalous without anomalous training data. In 2021, we organized an advanced unsupervised ASD task under domain-shift conditions, which focuses on the inevitable problem of the p…

    Submitted 27 September, 2021; v1 submitted 8 June, 2021; originally announced June 2021.

    Comments: Accepted to DCASE 2021 Workshop

  14. arXiv:2105.04079  [pdf, ps, other]

    cs.SD cs.LG eess.AS

    Sampling-Frequency-Independent Audio Source Separation Using Convolution Layer Based on Impulse Invariant Method

    Authors: Koichi Saito, Tomohiko Nakamura, Kohei Yatabe, Yuma Koizumi, Hiroshi Saruwatari

    Abstract: Audio source separation is often used as preprocessing of various applications, and one of its ultimate goals is to construct a single versatile model capable of dealing with the varieties of audio signals. Since sampling frequency, one of the audio signal varieties, is usually application specific, the preceding audio source separation model should be able to deal with audio signals of all sampli…

    Submitted 9 May, 2021; originally announced May 2021.

    Comments: 5 pages, 3 figures, accepted for European Signal Processing Conference 2021 (EUSIPCO 2021)

  15. arXiv:2101.08625  [pdf, other]

    eess.AS

    Noisy-target Training: A Training Strategy for DNN-based Speech Enhancement without Clean Speech

    Authors: Takuya Fujimura, Yuma Koizumi, Kohei Yatabe, Ryoichi Miyazaki

    Abstract: Deep neural network (DNN)-based speech enhancement ordinarily requires clean speech signals as the training target. However, collecting clean signals is very costly because they must be recorded in a studio. This requirement currently restricts the amount of training data for speech enhancement to less than 1/1000 of that of speech recognition which does not need clean signals. Increasing the amou…

    Submitted 10 May, 2021; v1 submitted 21 January, 2021; originally announced January 2021.

  16. arXiv:2012.07331  [pdf, other]

    eess.AS cs.CL cs.SD

    Audio Captioning using Pre-Trained Large-Scale Language Model Guided by Audio-based Similar Caption Retrieval

    Authors: Yuma Koizumi, Yasunori Ohishi, Daisuke Niizumi, Daiki Takeuchi, Masahiro Yasuda

    Abstract: The goal of audio captioning is to translate input audio into its description using natural language. One of the problems in audio captioning is the lack of training data due to the difficulty in collecting audio-caption pairs by crawling the web. In this study, to overcome this problem, we propose to use a pre-trained large-scale language model. Since an audio input cannot be directly inputted in…

    Submitted 14 December, 2020; originally announced December 2020.

    Comments: Submitted to ICASSP 2021

  17. arXiv:2009.11436  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Effects of Word-frequency based Pre- and Post- Processings for Audio Captioning

    Authors: Daiki Takeuchi, Yuma Koizumi, Yasunori Ohishi, Noboru Harada, Kunio Kashino

    Abstract: The system we used for Task 6 (Automated Audio Captioning) of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge combines three elements, namely, data augmentation, multi-task learning, and post-processing, for audio captioning. The system received the highest evaluation scores, but which of the individual elements most fully contributed to its performance has not ye…

    Submitted 23 September, 2020; originally announced September 2020.

    Comments: Accepted to DCASE2020 Workshop

  18. arXiv:2007.00225  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    The NTT DCASE2020 Challenge Task 6 system: Automated Audio Captioning with Keywords and Sentence Length Estimation

    Authors: Yuma Koizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Kunio Kashino

    Abstract: This technical report describes the system participating in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge, Task 6: automated audio captioning. Our submission focuses on solving two indeterminacy problems in automated audio captioning: word selection indeterminacy and sentence length indeterminacy. We simultaneously solve the main caption generation and sub i…

    Submitted 1 July, 2020; originally announced July 2020.

    Comments: Technical Report of DCASE2020 Challenge Task 6

  19. arXiv:2007.00222  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    A Transformer-based Audio Captioning Model with Keyword Estimation

    Authors: Yuma Koizumi, Ryo Masumura, Kyosuke Nishida, Masahiro Yasuda, Shoichiro Saito

    Abstract: One of the problems with automated audio captioning (AAC) is the indeterminacy in word selection corresponding to the audio event/scene. Since one acoustic event/scene can be described with several words, it results in a combinatorial explosion of possible captions and difficulty in training. To solve this problem, we propose a Transformer-based audio-captioning model with keyword estimation calle…

    Submitted 8 August, 2020; v1 submitted 1 July, 2020; originally announced July 2020.

    Comments: Accepted to Interspeech 2020

  20. arXiv:2006.05822  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Description and Discussion on DCASE2020 Challenge Task2: Unsupervised Anomalous Sound Detection for Machine Condition Monitoring

    Authors: Yuma Koizumi, Yohei Kawaguchi, Keisuke Imoto, Toshiki Nakamura, Yuki Nikaido, Ryo Tanabe, Harsh Purohit, Kaori Suefusa, Takashi Endo, Masahiro Yasuda, Noboru Harada

    Abstract: In this paper, we present the task description and discuss the results of the DCASE 2020 Challenge Task 2: Unsupervised Detection of Anomalous Sounds for Machine Condition Monitoring. The goal of anomalous sound detection (ASD) is to identify whether the sound emitted from a target machine is normal or anomalous. The main challenge of this task is to detect unknown anomalous sounds under the condi…

    Submitted 8 August, 2020; v1 submitted 10 June, 2020; originally announced June 2020.

    Comments: Submitted to DCASE2020 Workshop

  21. arXiv:2006.05712  [pdf, ps, other]

    eess.AS cs.LG cs.SD

    Listen to What You Want: Neural Network-based Universal Sound Selector

    Authors: Tsubasa Ochiai, Marc Delcroix, Yuma Koizumi, Hiroaki Ito, Keisuke Kinoshita, Shoko Araki

    Abstract: Being able to control the acoustic events (AEs) to which we want to listen would allow the development of more controllable hearable devices. This paper addresses the AE sound selection (or removal) problems, which we define as the extraction (or suppression) of all the sounds that belong to one or multiple desired AE classes. Although this problem could be addressed with a combination of source se…

    Submitted 10 June, 2020; originally announced June 2020.

    Comments: 5 pages, 2 figures, submitted to INTERSPEECH 2020

  22. arXiv:2002.05994  [pdf, ps, other]

    eess.AS cs.SD

    Sound Event Localization based on Sound Intensity Vector Refined By DNN-Based Denoising and Source Separation

    Authors: Masahiro Yasuda, Yuma Koizumi, Shoichiro Saito, Hisashi Uematsu, Keisuke Imoto

    Abstract: We propose a direction-of-arrival (DOA) estimation method for Sound Event Localization and Detection (SELD). Direct estimation of DOA using a deep neural network (DNN), i.e., a completely data-driven approach, achieves high accuracy. However, there is a gap in the accuracy between DOA estimation for single and overlapping sources because they cannot incorporate physical knowledge. Meanwhile, although…

    Submitted 14 February, 2020; originally announced February 2020.

    Comments: 5 pages, 3 figures, to appear in IEEE ICASSP 2020

  23. arXiv:2002.05879  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Stable Training of DNN for Speech Enhancement based on Perceptually-Motivated Black-Box Cost Function

    Authors: Masaki Kawanaka, Yuma Koizumi, Ryoichi Miyazaki, Kohei Yatabe

    Abstract: Improving subjective sound quality of enhanced signals is one of the most important missions in speech enhancement. For evaluating the subjective quality, several methods related to perceptually-motivated objective sound quality assessment (OSQA) have been proposed such as PESQ (perceptual evaluation of speech quality). However, direct use of such measures for training deep neural network (DNN) is…

    Submitted 14 February, 2020; originally announced February 2020.

    Comments: accepted to the 45th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2020)

  24. arXiv:2002.05873  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention

    Authors: Yuma Koizumi, Kohei Yatabe, Marc Delcroix, Yoshiki Masuyama, Daiki Takeuchi

    Abstract: This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features; we extract a speaker representation used for adaptation directly from the test utterance. Conventional studies of deep neural network (DNN)-based speech enhancement mainly focus on building a speaker-independent model. Meanwhile, in speech applications including speech recognition and s…

    Submitted 14 February, 2020; originally announced February 2020.

    Comments: 5 pages, to appear in IEEE ICASSP 2020

  25. arXiv:2002.05848  [pdf, ps, other]

    cs.SD eess.AS

    Sound Event Detection by Multitask Learning of Sound Events and Scenes with Soft Scene Labels

    Authors: Keisuke Imoto, Noriyuki Tonami, Yuma Koizumi, Masahiro Yasuda, Ryosuke Yamanishi, Yoichi Yamashita

    Abstract: Sound event detection (SED) and acoustic scene classification (ASC) are major tasks in environmental sound analysis. Considering that sound events and scenes are closely related to each other, some works have addressed joint analyses of sound events and acoustic scenes based on multitask learning (MTL), in which the knowledge of sound events and scenes can help in estimating them mutually. The con…

    Submitted 13 February, 2020; originally announced February 2020.

    Comments: Accepted to ICASSP 2020

  26. arXiv:2002.05843  [pdf, other]

    eess.AS cs.SD

    Real-time speech enhancement using equilibriated RNN

    Authors: Daiki Takeuchi, Kohei Yatabe, Yuma Koizumi, Yasuhiro Oikawa, Noboru Harada

    Abstract: We propose a speech enhancement method using a causal deep neural network (DNN) for real-time applications. DNN has been widely used for estimating a time-frequency (T-F) mask which enhances a speech signal. One popular DNN structure for that is a recurrent neural network (RNN) owing to its capability of effectively modelling time-sequential data like speech. In particular, the long short-term mem…

    Submitted 13 February, 2020; originally announced February 2020.

    Comments: To appear in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2020)

  27. arXiv:2002.05832  [pdf, other]

    eess.AS cs.SD

    Phase reconstruction based on recurrent phase unwrapping with deep neural networks

    Authors: Yoshiki Masuyama, Kohei Yatabe, Yuma Koizumi, Yasuhiro Oikawa, Noboru Harada

    Abstract: Phase reconstruction, which estimates phase from a given amplitude spectrogram, is an active research field in acoustical signal processing with many applications including audio synthesis. To take advantage of rich knowledge from data, several studies presented deep neural network (DNN)-based phase reconstruction methods. However, the training of a DNN for phase reconstruction is not an easy tas…

    Submitted 13 February, 2020; originally announced February 2020.

    Comments: To appear at the 45th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2020)

  28. arXiv:1911.10764  [pdf, other]

    eess.AS cs.SD

    Invertible DNN-based nonlinear time-frequency transform for speech enhancement

    Authors: Daiki Takeuchi, Kohei Yatabe, Yuma Koizumi, Yasuhiro Oikawa, Noboru Harada

    Abstract: We propose an end-to-end speech enhancement method with trainable time-frequency (T-F) transform based on invertible deep neural network (DNN). The recent development of speech enhancement is brought by using DNN. The ordinary DNN-based speech enhancement employs T-F transform, typically the short-time Fourier transform (STFT), and estimates a T-F mask using DNN. On the other hand, some methods ha…

    Submitted 13 February, 2020; v1 submitted 25 November, 2019; originally announced November 2019.

    Comments: To appear in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2020)

  29. arXiv:1910.04415  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    DOA Estimation by DNN-based Denoising and Dereverberation from Sound Intensity Vector

    Authors: Masahiro Yasuda, Yuma Koizumi, Luca Mazzon, Shoichiro Saito, Hisashi Uematsu

    Abstract: We propose a direction of arrival (DOA) estimation method that combines sound-intensity vector (IV)-based DOA estimation and DNN-based denoising and dereverberation. Since the accuracy of IV-based DOA estimation degrades due to environmental noise and reverberation, two DNNs are used to remove such effects from the observed IVs. DOA is then estimated from the refined IVs based on the physics of wa…

    Submitted 10 October, 2019; originally announced October 2019.

    Comments: 4 pages

  30. arXiv:1910.04388  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    First Order Ambisonics Domain Spatial Augmentation for DNN-based Direction of Arrival Estimation

    Authors: Luca Mazzon, Yuma Koizumi, Masahiro Yasuda, Noboru Harada

    Abstract: In this paper, we propose a novel data augmentation method for training neural networks for Direction of Arrival (DOA) estimation. This method focuses on expanding the representation of the DOA subspace of a dataset. Given some input data, it applies a transformation to it in order to change its DOA information and simulate new potentially unseen one. Such transformation, in general, is a combinat…

    Submitted 10 October, 2019; originally announced October 2019.

    Comments: 5 pages, to appear in DCASE 2019

  31. arXiv:1908.03299  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    ToyADMOS: A Dataset of Miniature-Machine Operating Sounds for Anomalous Sound Detection

    Authors: Yuma Koizumi, Shoichiro Saito, Hisashi Uematsu, Noboru Harada, Keisuke Imoto

    Abstract: This paper introduces a new dataset called "ToyADMOS" designed for anomaly detection in machine operating sounds (ADMOS). To the best of our knowledge, no large-scale datasets are available for ADMOS, although large-scale datasets have contributed to recent advancements in acoustic signal processing. This is because anomalous sound data are difficult to collect. To build a large-scale dataset for ADM…

    Submitted 8 August, 2019; originally announced August 2019.

    Comments: 5 pages, to appear in IEEE WASPAA 2019

  32. arXiv:1907.08338  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Batch Uniformization for Minimizing Maximum Anomaly Score of DNN-based Anomaly Detection in Sounds

    Authors: Yuma Koizumi, Shoichiro Saito, Masataka Yamaguchi, Shin Murata, Noboru Harada

    Abstract: Use of an autoencoder (AE) as a normal model is a state-of-the-art technique for unsupervised-anomaly detection in sounds (ADS). The AE is trained to minimize the sample mean of the anomaly score of normal sounds in a mini-batch. One problem with this approach is that the anomaly score of rare-normal sounds becomes higher than that of frequent-normal sounds, because the sample mean is strongly aff…

    Submitted 18 July, 2019; originally announced July 2019.

    Comments: 5 pages, to appear in IEEE WASPAA 2019

  33. arXiv:1903.08876  [pdf, other]

    eess.AS cs.SD

    Data-driven design of perfect reconstruction filterbank for DNN-based sound source enhancement

    Authors: Daiki Takeuchi, Kohei Yatabe, Yuma Koizumi, Yasuhiro Oikawa, Noboru Harada

    Abstract: We propose a data-driven design method of perfect-reconstruction filterbank (PRFB) for sound-source enhancement (SSE) based on deep neural network (DNN). DNNs have been used to estimate a time-frequency (T-F) mask in the short-time Fourier transform (STFT) domain. Their training is more stable when a simple cost function such as mean-squared error (MSE) is utilized compared to some advanced cost such…

    Submitted 21 March, 2019; originally announced March 2019.

    Comments: 5 pages, to appear in IEEE ICASSP 2019 (Paper Code: AASP-P8.8, Session: Spatial Audio, Audio Enhancement and Bandwidth Extension)

  34. arXiv:1903.03971  [pdf, other]

    cs.SD cs.LG eess.AS

    Deep Griffin-Lim Iteration

    Authors: Yoshiki Masuyama, Kohei Yatabe, Yuma Koizumi, Yasuhiro Oikawa, Noboru Harada

    Abstract: This paper presents a novel phase reconstruction method (only from a given amplitude spectrogram) by combining a signal-processing-based approach and a deep neural network (DNN). To retrieve a time-domain signal from its amplitude spectrogram, the corresponding phase is required. One of the popular phase reconstruction methods is the Griffin-Lim algorithm (GLA), which is based on the redundancy of…

    Submitted 10 March, 2019; originally announced March 2019.

    Comments: 5 pages, to appear in IEEE ICASSP 2019 (Paper Code: AASP-L3.1, Session: Source Separation and Speech Enhancement I)

  35. arXiv:1812.05796  [pdf, other]

    stat.ML cs.LG cs.SD eess.AS

    AdaFlow: Domain-Adaptive Density Estimator with Application to Anomaly Detection and Unpaired Cross-Domain Translation

    Authors: Masataka Yamaguchi, Yuma Koizumi, Noboru Harada

    Abstract: We tackle unsupervised anomaly detection (UAD), a problem of detecting data that significantly differ from normal data. UAD is typically solved by using density estimation. Recently, deep neural network (DNN)-based density estimators, such as Normalizing Flows, have been attracting attention. However, one of their drawbacks is the difficulty in adapting them to the change in the normal data's dist…

    Submitted 13 March, 2019; v1 submitted 14 December, 2018; originally announced December 2018.

    Comments: Accepted to ICASSP2019

  36. arXiv:1811.02438  [pdf, other]

    eess.AS cs.LG cs.SD eess.SP stat.ML

    Trainable Adaptive Window Switching for Speech Enhancement

    Authors: Yuma Koizumi, Noboru Harada, Yoichi Haneda

    Abstract: This study proposes a trainable adaptive window switching (AWS) method and applies it to a deep-neural-network (DNN) for speech enhancement in the modified discrete cosine transform domain. Time-frequency (T-F) mask processing in the short-time Fourier transform (STFT)-domain is a typical speech enhancement method. To recover the target signal precisely, DNN-based short-time frequency transforms hav…

    Submitted 19 February, 2019; v1 submitted 5 November, 2018; originally announced November 2018.

    Comments: accepted to the 44th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2019)

  37. arXiv:1810.09137  [pdf, other]

    stat.ML cs.LG cs.SD eess.AS

    DNN-based Source Enhancement to Increase Objective Sound Quality Assessment Score

    Authors: Yuma Koizumi, Kenta Niwa, Yusuke Hioka, Kazunori Kobayashi, Yoichi Haneda

    Abstract: We propose a training method for deep neural network (DNN)-based source enhancement to increase objective sound quality assessment (OSQA) scores such as the perceptual evaluation of speech quality (PESQ). In many conventional studies, DNNs have been used as a mapping function to estimate time-frequency masks and trained to minimize an analytically tractable objective function such as the mean squa…

    Submitted 22 October, 2018; originally announced October 2018.

    Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol.26, Issue.10, 2018

  38. arXiv:1810.09133  [pdf, other]

    stat.ML cs.LG cs.SD eess.AS

    Unsupervised Detection of Anomalous Sound based on Deep Learning and the Neyman-Pearson Lemma

    Authors: Yuma Koizumi, Shoichiro Saito, Hisashi Uematsu, Yuta Kawachi, Noboru Harada

    Abstract: This paper proposes a novel optimization principle and its implementation for unsupervised anomaly detection in sound (ADS) using an autoencoder (AE). The goal of unsupervised-ADS is to detect unknown anomalous sound without training data of anomalous sound. Use of an AE as a normal model is a state-of-the-art technique for unsupervised-ADS. To decrease the false positive rate (FPR), the AE is tra…

    Submitted 22 October, 2018; originally announced October 2018.

    Comments: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018