Showing 1–39 of 39 results for author: Yatabe, K

Searching in archive eess.
  1. arXiv:2506.03550  [pdf, ps, other]

    cs.SD eess.AS

    Local Equivariance Error-Based Metrics for Evaluating Sampling-Frequency-Independent Property of Neural Network

    Authors: Kanami Imamura, Tomohiko Nakamura, Norihiro Takamune, Kohei Yatabe, Hiroshi Saruwatari

    Abstract: Audio signal processing methods based on deep neural networks (DNNs) are typically trained only at a single sampling frequency (SF) and therefore require signal resampling to handle untrained SFs. However, recent studies have shown that signal resampling can degrade performance with untrained SFs. This problem has been overlooked because most studies evaluate only the performance at trained SFs. I…

    Submitted 4 June, 2025; originally announced June 2025.

    Comments: 5 pages, 4 figures, accepted for European Signal Processing Conference 2025 (EUSIPCO 2025)

  2. arXiv:2409.20516  [pdf, other]

    eess.AS cs.SD eess.SP

    Proposal of protocols for speech materials acquisition and presentation assisted by tools based on structured test signals

    Authors: Hideki Kawahara, Ken-Ichi Sakakibara, Mitsunori Mizumachi, Kohei Yatabe

    Abstract: We propose protocols for acquiring speech materials, making them reusable for future investigations, and presenting them for subjective experiments. We also provide means to evaluate existing speech materials' compatibility with target applications. We built these protocols and tools based on structured test signals and analysis methods, including a new family of the Time-Stretched Pulse (TSP). Ov…

    Submitted 30 September, 2024; originally announced September 2024.

    Comments: 6 pages 6 figures, accepted ORIENTAL COCOSDA 2024

    MSC Class: 68-04 ACM Class: J.2

  3. arXiv:2409.09294  [pdf, other]

    cs.SD eess.AS

    Subband Splitting: Simple, Efficient and Effective Technique for Solving Block Permutation Problem in Determined Blind Source Separation

    Authors: Kazuki Matsumoto, Kohei Yatabe

    Abstract: Solving the permutation problem is essential for determined blind source separation (BSS). Existing methods, such as independent vector analysis (IVA) and independent low-rank matrix analysis (ILRMA), tackle the permutation problem by modeling the co-occurrence of the frequency components of source signals. One of the remaining challenges in these methods is the block permutation problem, which ma…

    Submitted 14 March, 2025; v1 submitted 14 September, 2024; originally announced September 2024.

    Comments: Submitted to Acoustical Science and Technology

  4. arXiv:2309.12581  [pdf, other]

    eess.AS cs.LG cs.SD

    Sampling-Frequency-Independent Universal Sound Separation

    Authors: Tomohiko Nakamura, Kohei Yatabe

    Abstract: This paper proposes a universal sound separation (USS) method capable of handling untrained sampling frequencies (SFs). The USS aims at separating arbitrary sources of different types and can be the key technique to realize a source separator that can be universally used as a preprocessor for any downstream tasks. To realize a universal source separator, there are two essential properties: univers…

    Submitted 21 September, 2023; originally announced September 2023.

    Comments: Submitted to ICASSP2024

  5. arXiv:2309.02767  [pdf, ps, other]

    cs.SD eess.AS

    Simultaneous Measurement of Multiple Acoustic Attributes Using Structured Periodic Test Signals Including Music and Other Sound Materials

    Authors: Hideki Kawahara, Kohei Yatabe, Ken-Ichi Sakakibara, Mitsunori Mizumachi, Tatsuya Kitamura

    Abstract: We introduce a general framework for measuring acoustic properties such as linear time-invariant (LTI) response, signal-dependent time-invariant (SDTI) component, and random and time-varying (RTV) component simultaneously using structured periodic test signals. The framework also enables music pieces and other sound materials as test signals by "safeguarding" them by adding slight deterministic "no…

    Submitted 6 September, 2023; originally announced September 2023.

    Comments: 8 pages, 17 figures, accepted for APSIPA ASC 2023

    MSC Class: 68-04 ACM Class: J.2

  6. arXiv:2308.01665  [pdf, other]

    eess.SP cs.SD eess.AS

    Versatile Time-Frequency Representations Realized by Convex Penalty on Magnitude Spectrogram

    Authors: Keidai Arai, Koki Yamada, Kohei Yatabe

    Abstract: Sparse time-frequency (T-F) representations have been an important research topic for more than several decades. Among them, optimization-based methods (in particular, extensions of basis pursuit) allow us to design the representations through objective functions. Since acoustic signal processing utilizes models of spectrogram, the flexibility of optimization-based T-F representations is helpful f…

    Submitted 3 August, 2023; originally announced August 2023.

    Comments: 5 pages, 3 figures

  7. Algorithms of Sampling-Frequency-Independent Layers for Non-integer Strides

    Authors: Kanami Imamura, Tomohiko Nakamura, Norihiro Takamune, Kohei Yatabe, Hiroshi Saruwatari

    Abstract: In this paper, we propose algorithms for handling non-integer strides in sampling-frequency-independent (SFI) convolutional and transposed convolutional layers. The SFI layers have been developed for handling various sampling frequencies (SFs) by a single neural network. They are replaceable with their non-SFI counterparts and can be introduced into various network architectures. However, they cou…

    Submitted 19 June, 2023; originally announced June 2023.

    Comments: 5 pages, 3 figures, accepted for European Signal Processing Conference 2023 (EUSIPCO 2023)

    Journal ref: European Signal Processing Conference, Sep. 2023, pp. 326--330

  8. arXiv:2305.18802  [pdf, other]

    eess.AS cs.SD

    LibriTTS-R: A Restored Multi-Speaker Text-to-Speech Corpus

    Authors: Yuma Koizumi, Heiga Zen, Shigeki Karita, Yifan Ding, Kohei Yatabe, Nobuyuki Morioka, Michiel Bacchiani, Yu Zhang, Wei Han, Ankur Bapna

    Abstract: This paper introduces a new speech dataset called "LibriTTS-R" designed for text-to-speech (TTS) use. It is derived by applying speech restoration to the LibriTTS corpus, which consists of 585 hours of speech data at 24 kHz sampling rate from 2,456 speakers and the corresponding texts. The constituent samples of LibriTTS-R are identical to those of LibriTTS, with only the sound quality improved.…

    Submitted 30 May, 2023; originally announced May 2023.

    Comments: Accepted to Interspeech 2023

  9. arXiv:2303.01664  [pdf, other]

    cs.SD cs.LG eess.AS

    Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech and Text Representations

    Authors: Yuma Koizumi, Heiga Zen, Shigeki Karita, Yifan Ding, Kohei Yatabe, Nobuyuki Morioka, Yu Zhang, Wei Han, Ankur Bapna, Michiel Bacchiani

    Abstract: Speech restoration (SR) is a task of converting degraded speech signals into high-quality ones. In this study, we propose a robust SR model called Miipher, and apply Miipher to a new SR application: increasing the amount of high-quality training data for speech generation by converting speech samples collected from the Web to studio-quality. To make our SR model robust against various degradation,…

    Submitted 14 August, 2023; v1 submitted 2 March, 2023; originally announced March 2023.

    Comments: Accepted to WASPAA 2023

  10. arXiv:2211.08246  [pdf, other]

    cs.SD eess.AS eess.SP

    Online Phase Reconstruction via DNN-based Phase Differences Estimation

    Authors: Yoshiki Masuyama, Kohei Yatabe, Kento Nagatomo, Yasuhiro Oikawa

    Abstract: This paper presents a two-stage online phase reconstruction framework using causal deep neural networks (DNNs). Phase reconstruction is a task of recovering phase of the short-time Fourier transform (STFT) coefficients only from the corresponding magnitude. However, phase is sensitive to waveform shifts and not easy to estimate from the magnitude even with a DNN. To overcome this problem, we propo…

    Submitted 12 November, 2022; originally announced November 2022.

    Comments: Accepted to IEEE/ACM Trans. Audio, Speech, and Language Processing

  11. arXiv:2210.01029  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    WaveFit: An Iterative and Non-autoregressive Neural Vocoder based on Fixed-Point Iteration

    Authors: Yuma Koizumi, Kohei Yatabe, Heiga Zen, Michiel Bacchiani

    Abstract: Denoising diffusion probabilistic models (DDPMs) and generative adversarial networks (GANs) are popular generative models for neural vocoders. The DDPMs and GANs can be characterized by the iterative denoising framework and adversarial training, respectively. This study proposes a fast and high-quality neural vocoder called WaveFit, which integrates the essence of GANs into a DDPM-like it…

    Submitted 3 October, 2022; originally announced October 2022.

    Comments: Accepted to IEEE SLT 2022

  12. arXiv:2204.00911  [pdf, ps, other]

    cs.SD eess.AS

    Measuring pitch extractors' response to frequency-modulated multi-component signals

    Authors: Hideki Kawahara, Kohei Yatabe, Ken-Ichi Sakakibara, Tatsuya Kitamura, Hideki Banno, Masanori Morise

    Abstract: This article focuses on the research tool for investigating the fundamental frequencies of voiced sounds. We introduce an objective and informative measurement method of pitch extractors' response to frequency-modulated tones. The method uses a new test signal for acoustic system analysis. The test signal enables simultaneous measurement of the extractors' responses. They are the modulation freque…

    Submitted 2 April, 2022; originally announced April 2022.

    Comments: 11 pages, 9 figures, The following article has been submitted to/accepted by The Acoustical Society of America. After it is published, it will be found at http://asa.scitation.org/journal/jas

    MSC Class: 94A12; 93C80; 42-08

  13. arXiv:2204.00902  [pdf, ps, other]

    cs.SD eess.AS eess.SP

    An objective test tool for pitch extractors' response attributes

    Authors: Hideki Kawahara, Kohei Yatabe, Ken-Ichi Sakakibara, Tatsuya Kitamura, Hideki Banno, Masanori Morise

    Abstract: We propose an objective measurement method for pitch extractors' responses to frequency-modulated signals. It enables us to evaluate different pitch extractors with unified criteria. The method uses extended time-stretched pulses combined by binary orthogonal sequences. It provides simultaneous measurement results consisting of the linear and the non-linear time-invariant responses and random and…

    Submitted 24 June, 2022; v1 submitted 2 April, 2022; originally announced April 2022.

    Comments: 5 pages, 9 figures, submitted to Interspeech2022. arXiv admin note: text overlap with arXiv:2111.03629

    MSC Class: 94A12; 93C80; 42-08

  14. arXiv:2203.16749  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping

    Authors: Yuma Koizumi, Heiga Zen, Kohei Yatabe, Nanxin Chen, Michiel Bacchiani

    Abstract: Neural vocoder using denoising diffusion probabilistic model (DDPM) has been improved by adaptation of the diffusion noise distribution to given acoustic features. In this study, we propose SpecGrad that adapts the diffusion noise so that its time-varying spectral envelope becomes close to the conditioning log-mel spectrogram. This adaptation by time-varying filtering improves the sound quality es…

    Submitted 4 August, 2022; v1 submitted 30 March, 2022; originally announced March 2022.

    Comments: Accepted to Interspeech 2022

  15. arXiv:2202.08458  [pdf, other]

    eess.AS cs.SD

    Wearable SELD dataset: Dataset for sound event localization and detection using wearable devices around head

    Authors: Kento Nagatomo, Masahiro Yasuda, Kohei Yatabe, Shoichiro Saito, Yasuhiro Oikawa

    Abstract: Sound event localization and detection (SELD) is a combined task of identifying the sound event and its direction. Deep neural networks (DNNs) are utilized to associate them with the sound signals observed by a microphone array. Although ambisonic microphones are popular in the literature of SELD, they might limit the range of applications due to their predetermined geometry. Some applications (in…

    Submitted 17 February, 2022; originally announced February 2022.

    Comments: 5 pages, 6 figures, accepted to IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), 2022

  16. arXiv:2202.08028  [pdf, other]

    eess.AS cs.SD eess.SP

    APPLADE: Adjustable Plug-and-play Audio Declipper Combining DNN with Sparse Optimization

    Authors: Tomoro Tanaka, Kohei Yatabe, Masahiro Yasuda, Yasuhiro Oikawa

    Abstract: In this paper, we propose an audio declipping method that takes advantage of both sparse optimization and deep learning. Since sparsity-based audio declipping methods have been developed upon constrained optimization, they are adjustable and well-studied in theory. However, they always uniformly promote sparsity and ignore the individual properties of a signal. Deep neural network (DNN)-based met…

    Submitted 16 February, 2022; originally announced February 2022.

    Comments: 5 pages, 7 figures, accepted to IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), 2022

  17. arXiv:2112.11373  [pdf, ps, other]

    cs.SD eess.AS

    Safeguarding test signals for acoustic measurement using arbitrary sounds

    Authors: Hideki Kawahara, Kohei Yatabe

    Abstract: We propose a simple method to measure acoustic responses using any sounds by converting them into a form suitable for measurement. This method enables us to use music pieces for measuring acoustic conditions. It is advantageous to measure such conditions without annoying test sounds to listeners. In addition, applying the underlying idea of simultaneous measurement of multiple paths provides practically valua…

    Submitted 21 December, 2021; originally announced December 2021.

    Comments: 4 pages, 10 figures, submitted to Acoustical Science and Technology

    MSC Class: 42-04; 42-08; 68-04

  18. arXiv:2111.03629   

    cs.SD cs.HC eess.AS eess.SP

    Objective measurement of pitch extractors' responses to frequency modulated sounds and two reference pitch extraction methods for analyzing voice pitch responses to auditory stimulation

    Authors: Hideki Kawahara, Kohei Yatabe, Ken-Ichi Sakakibara, Tatsuya Kitamura, Hideki Banno, Masanori Morise

    Abstract: We propose an objective measurement method for pitch extractors' responses to frequency-modulated signals. The method simultaneously measures the linear and the non-linear time-invariant responses and random and time-varying responses. It uses extended time-stretched pulses combined by binary orthogonal sequences. Our recent finding of involuntary voice pitch response to auditory stimulation while…

    Submitted 27 June, 2022; v1 submitted 5 November, 2021; originally announced November 2021.

    Comments: ICASSP2022 rejected this. The substantially revised version was submitted to Interspeech2022 and accepted. It is arXiv:2204.00911

    MSC Class: 94A12; 93C80; 42-08

  19. arXiv:2111.01593  [pdf, other]

    eess.SP eess.AS math.NA math.OC

    Design of Tight Minimum-Sidelobe Windows by Riemannian Newton's Method

    Authors: Daichi Kitahara, Kohei Yatabe

    Abstract: The short-time Fourier transform (STFT), or the discrete Gabor transform (DGT), has been extensively used in signal analysis and processing. Their properties are characterized by a window function. For signal processing, designing a special window called tight window is important because it is known to make DGT-domain processing robust to error. In this paper, we propose a method of designing tigh…

    Submitted 5 December, 2021; v1 submitted 2 November, 2021; originally announced November 2021.

  20. arXiv:2105.04079  [pdf, ps, other]

    cs.SD cs.LG eess.AS

    Sampling-Frequency-Independent Audio Source Separation Using Convolution Layer Based on Impulse Invariant Method

    Authors: Koichi Saito, Tomohiko Nakamura, Kohei Yatabe, Yuma Koizumi, Hiroshi Saruwatari

    Abstract: Audio source separation is often used as preprocessing of various applications, and one of its ultimate goals is to construct a single versatile model capable of dealing with the varieties of audio signals. Since sampling frequency, one of the audio signal varieties, is usually application specific, the preceding audio source separation model should be able to deal with audio signals of all sampli…

    Submitted 9 May, 2021; originally announced May 2021.

    Comments: 5 pages, 3 figures, accepted for European Signal Processing Conference 2021 (EUSIPCO 2021)

  21. arXiv:2105.03345  [pdf, other]

    eess.SP

    Sparse time-frequency representation via atomic norm minimization

    Authors: Tsubasa Kusano, Kohei Yatabe, Yasuhiro Oikawa

    Abstract: Nonstationary signals are commonly analyzed and processed in the time-frequency (T-F) domain that is obtained by the discrete Gabor transform (DGT). The T-F representation obtained by DGT is spread due to windowing, which may degrade the performance of T-F domain analysis and processing. To obtain a well-localized T-F representation, sparsity-aware methods using $\ell_1$-norm have been studied. Ho…

    Submitted 7 May, 2021; originally announced May 2021.

    Comments: Accepted to ICASSP 2021. There was a mistake in the algorithm and it has been corrected

  22. arXiv:2104.01444  [pdf, ps, other]

    cs.SD eess.AS eess.SP

    Mixture of orthogonal sequences made from extended time-stretched pulses enables measurement of involuntary voice fundamental frequency response to pitch perturbation

    Authors: Hideki Kawahara, Toshie Matsui, Kohei Yatabe, Ken-Ichi Sakakibara, Minoru Tsuzaki, Masanori Morise, Toshio Irino

    Abstract: Auditory feedback plays an essential role in the regulation of the fundamental frequency of voiced sounds. The fundamental frequency also responds to auditory stimulation other than the speaker's voice. We propose to use this response of the fundamental frequency of sustained vowels to frequency-modulated test signals for investigating involuntary control of voice pitch. This involuntary response…

    Submitted 3 April, 2021; originally announced April 2021.

    Comments: 5 pages, 9 figures, submitted to Interspeech2021

    MSC Class: 92C55

  23. arXiv:2101.08625  [pdf, other]

    eess.AS

    Noisy-target Training: A Training Strategy for DNN-based Speech Enhancement without Clean Speech

    Authors: Takuya Fujimura, Yuma Koizumi, Kohei Yatabe, Ryoichi Miyazaki

    Abstract: Deep neural network (DNN)-based speech enhancement ordinarily requires clean speech signals as the training target. However, collecting clean signals is very costly because they must be recorded in a studio. This requirement currently restricts the amount of training data for speech enhancement to less than 1/1000 of that of speech recognition which does not need clean signals. Increasing the amou…

    Submitted 10 May, 2021; v1 submitted 21 January, 2021; originally announced January 2021.

  24. arXiv:2010.13185  [pdf, ps, other]

    cs.SD eess.AS

    Cascaded all-pass filters with randomized center frequencies and phase polarity for acoustic and speech measurement and data augmentation

    Authors: Hideki Kawahara, Kohei Yatabe

    Abstract: We introduce a new member of TSP (Time Stretched Pulse) for acoustic and speech measurement infrastructure, based on a simple all-pass filter and systematic randomization. This new infrastructure fundamentally upgrades our previous measurement procedure, which enables simultaneous measurement of multiple attributes, including non-linear ones without requiring extra filtering nor post-processing. O…

    Submitted 12 February, 2021; v1 submitted 25 October, 2020; originally announced October 2020.

    Comments: 5 pages, 5 figures, Accepted ICASSP2021(Review comment by all reviewers: Very original)

    MSC Class: 68U06(Primary); 68T06; 68W06(Secondary)

  25. arXiv:2007.13976  [pdf, other]

    cs.SD cs.CV eess.AS

    Self-supervised Neural Audio-Visual Sound Source Localization via Probabilistic Spatial Modeling

    Authors: Yoshiki Masuyama, Yoshiaki Bando, Kohei Yatabe, Yoko Sasaki, Masaki Onishi, Yasuhiro Oikawa

    Abstract: Detecting sound source objects within visual observation is important for autonomous robots to comprehend surrounding environments. Since sounding objects have a large variety with different appearances in our living environments, labeling all sounding objects is impossible in practice. This calls for self-supervised learning which does not require manual labeling. Most of conventional self-superv…

    Submitted 27 July, 2020; originally announced July 2020.

    Comments: Accepted for publication in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

  26. Consistent Independent Low-Rank Matrix Analysis for Determined Blind Source Separation

    Authors: Daichi Kitamura, Kohei Yatabe

    Abstract: Independent low-rank matrix analysis (ILRMA) is the state-of-the-art algorithm for blind source separation (BSS) in the determined situation (the number of microphones is greater than or equal to that of source signals). ILRMA achieves a great separation performance by modeling the power spectrograms of the source signals via the nonnegative matrix factorization (NMF). Such a highly developed sour…

    Submitted 1 November, 2020; v1 submitted 1 July, 2020; originally announced July 2020.

    Comments: Submitted to EURASIP J. Adv. Signal. Process. Accepted on Oct. 30, 2020

  27. arXiv:2006.13590  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Gamma Boltzmann Machine for Simultaneously Modeling Linear- and Log-amplitude Spectra

    Authors: Toru Nakashika, Kohei Yatabe

    Abstract: In audio applications, one of the most important representations of audio signals is the amplitude spectrogram. It is utilized in many machine-learning-based information processing methods including the ones using the restricted Boltzmann machines (RBM). However, the ordinary Gaussian-Bernoulli RBM (the most popular RBM among its variations) cannot directly handle amplitude spectra because the Gau…

    Submitted 25 June, 2020; v1 submitted 24 June, 2020; originally announced June 2020.

    Comments: Submitted to APSIPA2020

  28. arXiv:2005.09873  [pdf, other]

    eess.AS cs.SD eess.SP

    Consistent ICA: Determined BSS meets spectrogram consistency

    Authors: Kohei Yatabe

    Abstract: Multichannel audio blind source separation (BSS) in the determined situation (the number of microphones is equal to that of the sources), or determined BSS, is performed by multichannel linear filtering in the time-frequency domain to handle the convolutive mixing process. Ordinarily, the filter treats each frequency independently, which causes the well-known permutation problem, i.e., the problem…

    Submitted 20 May, 2020; originally announced May 2020.

  29. arXiv:2004.14091  [pdf, other]

    eess.AS cs.SD eess.SP

    Determined BSS based on time-frequency masking and its application to harmonic vector analysis

    Authors: Kohei Yatabe, Daichi Kitamura

    Abstract: This paper proposes harmonic vector analysis (HVA) based on a general algorithmic framework of audio blind source separation (BSS) that is also presented in this paper. BSS for a convolutive audio mixture is usually performed by multichannel linear filtering when the numbers of microphones and sources are equal (determined situation). This paper addresses such determined BSS based on batch process…

    Submitted 14 April, 2021; v1 submitted 29 April, 2020; originally announced April 2020.

  30. arXiv:2002.05879  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Stable Training of DNN for Speech Enhancement based on Perceptually-Motivated Black-Box Cost Function

    Authors: Masaki Kawanaka, Yuma Koizumi, Ryoichi Miyazaki, Kohei Yatabe

    Abstract: Improving subjective sound quality of enhanced signals is one of the most important missions in speech enhancement. For evaluating the subjective quality, several methods related to perceptually-motivated objective sound quality assessment (OSQA) have been proposed such as PESQ (perceptual evaluation of speech quality). However, direct use of such measures for training deep neural network (DNN) is…

    Submitted 14 February, 2020; originally announced February 2020.

    Comments: accepted to the 45th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2020)

  31. arXiv:2002.05873  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention

    Authors: Yuma Koizumi, Kohei Yatabe, Marc Delcroix, Yoshiki Masuyama, Daiki Takeuchi

    Abstract: This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features; we extract a speaker representation used for adaptation directly from the test utterance. Conventional studies of deep neural network (DNN)-based speech enhancement mainly focus on building a speaker independent model. Meanwhile, in speech applications including speech recognition and s…

    Submitted 14 February, 2020; originally announced February 2020.

    Comments: 5 pages, to appear in IEEE ICASSP 2020

  32. arXiv:2002.05843  [pdf, other]

    eess.AS cs.SD

    Real-time speech enhancement using equilibriated RNN

    Authors: Daiki Takeuchi, Kohei Yatabe, Yuma Koizumi, Yasuhiro Oikawa, Noboru Harada

    Abstract: We propose a speech enhancement method using a causal deep neural network (DNN) for real-time applications. DNN has been widely used for estimating a time-frequency (T-F) mask which enhances a speech signal. One popular DNN structure for that is a recurrent neural network (RNN) owing to its capability of effectively modelling time-sequential data like speech. In particular, the long short-term mem…

    Submitted 13 February, 2020; originally announced February 2020.

    Comments: To appear in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2020)

  33. arXiv:2002.05832  [pdf, other]

    eess.AS cs.SD

    Phase reconstruction based on recurrent phase unwrapping with deep neural networks

    Authors: Yoshiki Masuyama, Kohei Yatabe, Yuma Koizumi, Yasuhiro Oikawa, Noboru Harada

    Abstract: Phase reconstruction, which estimates phase from a given amplitude spectrogram, is an active research field in acoustical signal processing with many applications including audio synthesis. To take advantage of rich knowledge from data, several studies presented deep neural network (DNN)-based phase reconstruction methods. However, the training of a DNN for phase reconstruction is not an easy tas…

    Submitted 13 February, 2020; originally announced February 2020.

    Comments: To appear at the 45th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2020)

  34. arXiv:1911.10764  [pdf, other]

    eess.AS cs.SD

    Invertible DNN-based nonlinear time-frequency transform for speech enhancement

    Authors: Daiki Takeuchi, Kohei Yatabe, Yuma Koizumi, Yasuhiro Oikawa, Noboru Harada

    Abstract: We propose an end-to-end speech enhancement method with trainable time-frequency (T-F) transform based on invertible deep neural network (DNN). The recent development of speech enhancement is brought by using DNN. The ordinary DNN-based speech enhancement employs T-F transform, typically the short-time Fourier transform (STFT), and estimates a T-F mask using DNN. On the other hand, some methods ha…

    Submitted 13 February, 2020; v1 submitted 25 November, 2019; originally announced November 2019.

    Comments: To appear in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2020)

  35. arXiv:1903.08876  [pdf, other]

    eess.AS cs.SD

    Data-driven design of perfect reconstruction filterbank for DNN-based sound source enhancement

    Authors: Daiki Takeuchi, Kohei Yatabe, Yuma Koizumi, Yasuhiro Oikawa, Noboru Harada

    Abstract: We propose a data-driven design method of perfect-reconstruction filterbank (PRFB) for sound-source enhancement (SSE) based on deep neural network (DNN). DNNs have been used to estimate a time-frequency (T-F) mask in the short-time Fourier transform (STFT) domain. Their training is more stable when a simple cost function such as mean-squared error (MSE) is utilized compared to some advanced cost such…

    Submitted 21 March, 2019; originally announced March 2019.

    Comments: 5 pages, to appear in IEEE ICASSP 2019 (Paper Code: AASP-P8.8, Session: Spatial Audio, Audio Enhancement and Bandwidth Extension)

  36. arXiv:1903.05603  [pdf, other]

    eess.AS cs.SD eess.SP

    Low-rankness of Complex-valued Spectrogram and Its Application to Phase-aware Audio Processing

    Authors: Yoshiki Masuyama, Kohei Yatabe, Yasuhiro Oikawa

    Abstract: Low-rankness of amplitude spectrograms has been effectively utilized in audio signal processing methods including non-negative matrix factorization. However, such methods have a fundamental limitation owing to their amplitude-only treatment where the phase of the observed signal is utilized for resynthesizing the estimated signal. In order to address this limitation, we directly treat a complex-va…

    Submitted 13 March, 2019; originally announced March 2019.

    Comments: 5 pages, to appear in IEEE ICASSP 2019 (Paper Code: AASP-P13.9, Session: Acoustic Scene Classification and Music Signal Analysis)

  37. arXiv:1903.05600  [pdf, other]

    eess.AS cs.SD eess.SP

    Phase-aware Harmonic/Percussive Source Separation via Convex Optimization

    Authors: Yoshiki Masuyama, Kohei Yatabe, Yasuhiro Oikawa

    Abstract: Decomposition of an audio mixture into harmonic and percussive components, namely harmonic/percussive source separation (HPSS), is a useful pre-processing tool for many audio applications. Popular approaches to HPSS exploit the distinctive source-specific structures of power spectrograms. However, such approaches consider only power spectrograms, and the phase remains intact for resynthesizing the…

    Submitted 13 March, 2019; originally announced March 2019.

    Comments: 5 pages, to appear in IEEE ICASSP 2019 (Paper Code: AASP-P16.5, Session: Music Signal Analysis, Feedback and Echo Cancellation and Equalization)

  38. arXiv:1903.03971  [pdf, other]

    cs.SD cs.LG eess.AS

    Deep Griffin-Lim Iteration

    Authors: Yoshiki Masuyama, Kohei Yatabe, Yuma Koizumi, Yasuhiro Oikawa, Noboru Harada

    Abstract: This paper presents a novel phase reconstruction method (only from a given amplitude spectrogram) by combining a signal-processing-based approach and a deep neural network (DNN). To retrieve a time-domain signal from its amplitude spectrogram, the corresponding phase is required. One of the popular phase reconstruction methods is the Griffin-Lim algorithm (GLA), which is based on the redundancy of…

    Submitted 10 March, 2019; originally announced March 2019.

    Comments: 5 pages, to appear in IEEE ICASSP 2019 (Paper Code: AASP-L3.1, Session: Source Separation and Speech Enhancement I)

  39. arXiv:1811.08783  [pdf, other]

    eess.SP cs.SD eess.AS

    Designing nearly tight window for improving time-frequency masking

    Authors: Tsubasa Kusano, Yoshiki Masuyama, Kohei Yatabe, Yasuhiro Oikawa

    Abstract: Many audio signal processing methods are formulated in the time-frequency (T-F) domain which is obtained by the short-time Fourier transform (STFT). The properties of the STFT are fully characterized by window function, number of frequency channels, and time-shift. Thus, designing a better window is important for improving the performance of the processing especially when a less redundant T-F repr…

    Submitted 4 February, 2019; v1 submitted 17 November, 2018; originally announced November 2018.