-
Local Equivariance Error-Based Metrics for Evaluating Sampling-Frequency-Independent Property of Neural Network
Authors:
Kanami Imamura,
Tomohiko Nakamura,
Norihiro Takamune,
Kohei Yatabe,
Hiroshi Saruwatari
Abstract:
Audio signal processing methods based on deep neural networks (DNNs) are typically trained only at a single sampling frequency (SF) and therefore require signal resampling to handle untrained SFs. However, recent studies have shown that signal resampling can degrade performance with untrained SFs. This problem has been overlooked because most studies evaluate only the performance at trained SFs. In this paper, to assess the robustness of DNNs to SF changes, which we refer to as the SF-independent (SFI) property, we propose three metrics to quantify the SFI property on the basis of local equivariance error (LEE). LEE measures the robustness of DNNs to input transformations. By using signal resampling as input transformation, we extend LEE to measure the robustness of audio source separation methods to signal resampling. The proposed metrics are constructed to quantify the SFI property in specific network components responsible for predicting time-frequency masks. Experiments on music source separation demonstrated a strong correlation between the proposed metrics and performance degradation at untrained SFs.
Submitted 4 June, 2025;
originally announced June 2025.
-
Proposal of protocols for speech materials acquisition and presentation assisted by tools based on structured test signals
Authors:
Hideki Kawahara,
Ken-Ichi Sakakibara,
Mitsunori Mizumachi,
Kohei Yatabe
Abstract:
We propose protocols for acquiring speech materials, making them reusable for future investigations, and presenting them for subjective experiments. We also provide means to evaluate existing speech materials' compatibility with target applications. We built these protocols and tools based on structured test signals and analysis methods, including a new family of the Time-Stretched Pulse (TSP). Over a billion times more powerful computational (including software development) resources than a half-century ago enabled these protocols and tools to be accessible to under-resourced environments.
Submitted 30 September, 2024;
originally announced September 2024.
-
Subband Splitting: Simple, Efficient and Effective Technique for Solving Block Permutation Problem in Determined Blind Source Separation
Authors:
Kazuki Matsumoto,
Kohei Yatabe
Abstract:
Solving the permutation problem is essential for determined blind source separation (BSS). Existing methods, such as independent vector analysis (IVA) and independent low-rank matrix analysis (ILRMA), tackle the permutation problem by modeling the co-occurrence of the frequency components of source signals. One of the remaining challenges in these methods is the block permutation problem, which may cause severe performance degradation. In this paper, we propose a simple and effective technique for solving the block permutation problem. The proposed technique splits the entire frequency band into several overlapping subbands and sequentially applies BSS methods (e.g., IVA, ILRMA, or any other method) to each subband. Since the splitting reduces the size of the problem, the BSS methods can effectively work in each subband. Then, the permutations among the subbands are aligned by using the separation result in one subband as the initial values for the other subbands. Additionally, we propose SS-IVA and SS-ILRMA by combining subband splitting (SS) with IVA and ILRMA. Experimental results demonstrated that our technique remarkably improves the separation performance without increasing computational cost. In particular, our SS-ILRMA achieved separation performance comparable to that of the oracle method (frequency-domain independent component analysis with the ideal permutation solver). Moreover, SS-ILRMA converged faster than conventional IVA and ILRMA.
Submitted 14 March, 2025; v1 submitted 14 September, 2024;
originally announced September 2024.
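The subband splitting described in the abstract above can be illustrated with a minimal sketch. It only shows how frequency-bin indices might be divided into overlapping subbands and how a separation routine could be run over them in sequence, passing each result on as the next initial value. The function names, the equal-width splitting rule, the chained initialization order, and the `separate` callable are all assumptions made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
def overlapping_subbands(n_bins, n_subbands, overlap):
    """Split range(n_bins) into n_subbands index ranges sharing `overlap` bins.

    Hypothetical helper: equal-width subbands are an assumption of this sketch.
    Returns a list of (start, stop) pairs with stop exclusive.
    """
    if n_subbands < 1 or overlap < 0:
        raise ValueError("n_subbands must be >= 1 and overlap must be >= 0")
    # Choose the smallest width whose overlapped tiling covers all bins.
    total = n_bins + (n_subbands - 1) * overlap
    width = -(-total // n_subbands)  # ceiling division
    hop = width - overlap
    if hop <= 0 or (n_subbands - 1) * hop >= n_bins:
        raise ValueError("overlap is too large for this configuration")
    return [(k * hop, min(k * hop + width, n_bins)) for k in range(n_subbands)]


def run_sequentially(bands, separate, init=None):
    """Apply `separate(band, init)` to each subband in order.

    `separate` is a user-supplied placeholder for any BSS routine; its result
    for one subband is handed to the next subband as the initial value.
    """
    results = []
    for band in bands:
        init = separate(band, init)
        results.append(init)
    return results


# Usage: 513 frequency bins, 4 subbands, 16 shared bins between neighbours.
bands = overlapping_subbands(513, 4, 16)
print(bands)
```

The shared bins between neighbouring subbands are what would let a later subband inherit the source ordering found in an earlier one; how the initial values are actually formed from those bins is left to the separation routine in this sketch.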
-
Sampling-Frequency-Independent Universal Sound Separation
Authors:
Tomohiko Nakamura,
Kohei Yatabe
Abstract:
This paper proposes a universal sound separation (USS) method capable of handling untrained sampling frequencies (SFs). The USS aims at separating arbitrary sources of different types and can be the key technique to realize a source separator that can be universally used as a preprocessor for any downstream tasks. To realize a universal source separator, there are two essential properties: universalities with respect to source types and recording conditions. The former property has been studied in the USS literature, which has greatly increased the number of source types that can be handled by a single neural network. However, the latter property (e.g., SF) has received less attention despite its necessity. Since the SF varies widely depending on the downstream tasks, the universal source separator must handle a wide variety of SFs. In this paper, to encompass the two properties, we propose an SF-independent (SFI) extension of a computationally efficient USS network, SuDoRM-RF. The proposed network uses our previously proposed SFI convolutional layers, which can handle various SFs by generating convolutional kernels in accordance with an input SF. Experiments show that signal resampling can degrade the USS performance and the proposed method works more consistently than signal-resampling-based methods for various SFs.
Submitted 21 September, 2023;
originally announced September 2023.
-
Simultaneous Measurement of Multiple Acoustic Attributes Using Structured Periodic Test Signals Including Music and Other Sound Materials
Authors:
Hideki Kawahara,
Kohei Yatabe,
Ken-Ichi Sakakibara,
Mitsunori Mizumachi,
Tatsuya Kitamura
Abstract:
We introduce a general framework for measuring acoustic properties such as the linear time-invariant (LTI) response, signal-dependent time-invariant (SDTI) component, and random and time-varying (RTV) component simultaneously using structured periodic test signals. The framework also enables the use of music pieces and other sound materials as test signals by "safeguarding" them through the addition of slight deterministic "noise." Measurements using swept-sine, MLS (Maximum Length Sequence), and their variants are special cases of the proposed framework. We implemented interactive and real-time measuring tools based on this framework and made them open-source. Furthermore, we applied this framework to assess pitch extractors objectively.
Submitted 6 September, 2023;
originally announced September 2023.
-
Versatile Time-Frequency Representations Realized by Convex Penalty on Magnitude Spectrogram
Authors:
Keidai Arai,
Koki Yamada,
Kohei Yatabe
Abstract:
Sparse time-frequency (T-F) representations have been an important research topic for several decades. Among them, optimization-based methods (in particular, extensions of basis pursuit) allow us to design the representations through objective functions. Since acoustic signal processing utilizes models of spectrograms, the flexibility of optimization-based T-F representations is helpful for adjusting the representation for each application. However, acoustic applications often require models of the magnitude of T-F representations obtained by the discrete Gabor transform (DGT). Adjusting a T-F representation to such a magnitude model (e.g., smoothness of the magnitude of DGT coefficients) results in a non-convex optimization problem that is difficult to solve. In this paper, instead of tackling difficult non-convex problems, we propose a convex optimization-based framework that realizes a T-F representation whose magnitude has characteristics specified by the user. We analyze the properties of the proposed method and provide numerical examples of sparse T-F representations having, e.g., low-rank or smooth magnitude, which have not been realized before.
Submitted 3 August, 2023;
originally announced August 2023.
-
Algorithms of Sampling-Frequency-Independent Layers for Non-integer Strides
Authors:
Kanami Imamura,
Tomohiko Nakamura,
Norihiro Takamune,
Kohei Yatabe,
Hiroshi Saruwatari
Abstract:
In this paper, we propose algorithms for handling non-integer strides in sampling-frequency-independent (SFI) convolutional and transposed convolutional layers. The SFI layers have been developed for handling various sampling frequencies (SFs) by a single neural network. They are replaceable with their non-SFI counterparts and can be introduced into various network architectures. However, they could not handle some specific configurations when combined with non-SFI layers. For example, an SFI extension of Conv-TasNet, a standard audio source separation model, cannot handle some pairs of trained and target SFs because the strides of the SFI layers become non-integers. This problem cannot be solved by simple rounding or signal resampling, both of which result in significant performance degradation. To overcome this problem, we propose algorithms for handling non-integer strides by using windowed sinc interpolation. The proposed algorithms realize the continuous-time representations of features using the interpolation and enable us to sample instants with the desired stride. Experimental results on music source separation showed that the proposed algorithms outperformed the rounding- and signal-resampling-based methods at SFs lower than the trained SF.
Submitted 19 June, 2023;
originally announced June 2023.
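The idea of sampling a feature sequence at a non-integer stride via windowed sinc interpolation, as described in the abstract above, can be sketched in a few lines. This is only an illustration on a 1-D sequence: the Hann window, the `half_width` parameter, and the function names are assumptions of this sketch, and the actual algorithms operate inside SFI convolutional and transposed convolutional layers rather than on a stand-alone array.

```python
import numpy as np


def windowed_sinc_sample(x, positions, half_width=8):
    """Evaluate a discrete sequence at fractional positions.

    Each output is a weighted sum of nearby samples, with weights given by a
    sinc kernel tapered by a Hann window of support (-half_width, half_width).
    Samples outside the sequence are treated as zero.
    """
    x = np.asarray(x, dtype=float)
    out = np.zeros(len(positions))
    for i, t in enumerate(positions):
        center = int(np.floor(t))
        n = np.arange(center - half_width + 1, center + half_width + 1)
        n = n[(n >= 0) & (n < len(x))]
        d = t - n  # distance from each sample to the target instant
        window = 0.5 * (1.0 + np.cos(np.pi * d / half_width))
        out[i] = np.sum(x[n] * np.sinc(d) * window)  # np.sinc is sin(pi d)/(pi d)
    return out


def strided_positions(length, stride):
    """Sampling instants 0, stride, 2*stride, ... up to the last sample index."""
    return np.arange(0.0, length - 1 + 1e-9, stride)


# Usage: sample a slow sinusoid with a stride of 1.5 samples.
signal = np.sin(2 * np.pi * 0.05 * np.arange(200))
instants = strided_positions(len(signal), 1.5)
resampled = windowed_sinc_sample(signal, instants)
print(instants[:4], resampled[:4])
```

At integer positions the kernel reduces to a unit impulse, so the original samples are reproduced; at fractional positions the value is interpolated, which is what makes a stride such as 1.5 meaningful.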
-
LibriTTS-R: A Restored Multi-Speaker Text-to-Speech Corpus
Authors:
Yuma Koizumi,
Heiga Zen,
Shigeki Karita,
Yifan Ding,
Kohei Yatabe,
Nobuyuki Morioka,
Michiel Bacchiani,
Yu Zhang,
Wei Han,
Ankur Bapna
Abstract:
This paper introduces a new speech dataset called "LibriTTS-R" designed for text-to-speech (TTS) use. It is derived by applying speech restoration to the LibriTTS corpus, which consists of 585 hours of speech data at 24 kHz sampling rate from 2,456 speakers and the corresponding texts. The constituent samples of LibriTTS-R are identical to those of LibriTTS, with only the sound quality improved. Experimental results show that the LibriTTS-R ground-truth samples have significantly improved sound quality compared to those in LibriTTS. In addition, neural end-to-end TTS trained with LibriTTS-R achieved speech naturalness on par with that of the ground-truth samples. The corpus is freely available for download from http://www.openslr.org/141/.
Submitted 30 May, 2023;
originally announced May 2023.
-
Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech and Text Representations
Authors:
Yuma Koizumi,
Heiga Zen,
Shigeki Karita,
Yifan Ding,
Kohei Yatabe,
Nobuyuki Morioka,
Yu Zhang,
Wei Han,
Ankur Bapna,
Michiel Bacchiani
Abstract:
Speech restoration (SR) is a task of converting degraded speech signals into high-quality ones. In this study, we propose a robust SR model called Miipher, and apply Miipher to a new SR application: increasing the amount of high-quality training data for speech generation by converting speech samples collected from the Web to studio-quality. To make our SR model robust against various degradations, we use (i) a speech representation extracted from w2v-BERT for the input feature, and (ii) a text representation extracted from transcripts via PnG-BERT as a linguistic conditioning feature. Experiments show that Miipher (i) is robust against various audio degradations and (ii) enables us to train a high-quality text-to-speech (TTS) model from restored speech samples collected from the Web. Audio samples are available at our demo page: google.github.io/df-conformer/miipher/
Submitted 14 August, 2023; v1 submitted 2 March, 2023;
originally announced March 2023.
-
Online Phase Reconstruction via DNN-based Phase Differences Estimation
Authors:
Yoshiki Masuyama,
Kohei Yatabe,
Kento Nagatomo,
Yasuhiro Oikawa
Abstract:
This paper presents a two-stage online phase reconstruction framework using causal deep neural networks (DNNs). Phase reconstruction is a task of recovering phase of the short-time Fourier transform (STFT) coefficients only from the corresponding magnitude. However, phase is sensitive to waveform shifts and not easy to estimate from the magnitude even with a DNN. To overcome this problem, we propose to use DNNs for estimating differences of phase between adjacent time-frequency bins. We show that convolutional neural networks are suitable for phase difference estimation, according to the theoretical relation between partial derivatives of STFT phase and magnitude. The estimated phase differences are used for reconstructing phase by solving a weighted least squares problem in a frame-by-frame manner. In contrast to existing DNN-based phase reconstruction methods, the proposed framework is causal and does not require any iterative procedure. The experiments showed that the proposed method outperforms existing online methods and a DNN-based method for phase reconstruction.
Submitted 12 November, 2022;
originally announced November 2022.
-
WaveFit: An Iterative and Non-autoregressive Neural Vocoder based on Fixed-Point Iteration
Authors:
Yuma Koizumi,
Kohei Yatabe,
Heiga Zen,
Michiel Bacchiani
Abstract:
Denoising diffusion probabilistic models (DDPMs) and generative adversarial networks (GANs) are popular generative models for neural vocoders. The DDPMs and GANs can be characterized by the iterative denoising framework and adversarial training, respectively. This study proposes a fast and high-quality neural vocoder called WaveFit, which integrates the essence of GANs into a DDPM-like iterative framework based on fixed-point iteration. WaveFit iteratively denoises an input signal, and trains a deep neural network (DNN) for minimizing an adversarial loss calculated from intermediate outputs at all iterations. Subjective (side-by-side) listening tests showed no statistically significant differences in naturalness between human natural speech and those synthesized by WaveFit with five iterations. Furthermore, the inference speed of WaveFit was more than 240 times faster than WaveRNN. Audio demos are available at google.github.io/df-conformer/wavefit/.
Submitted 3 October, 2022;
originally announced October 2022.
-
Measuring pitch extractors' response to frequency-modulated multi-component signals
Authors:
Hideki Kawahara,
Kohei Yatabe,
Ken-Ichi Sakakibara,
Tatsuya Kitamura,
Hideki Banno,
Masanori Morise
Abstract:
This article focuses on the research tool for investigating the fundamental frequencies of voiced sounds. We introduce an objective and informative measurement method of pitch extractors' response to frequency-modulated tones. The method uses a new test signal for acoustic system analysis. The test signal enables simultaneous measurement of the extractors' responses. They are the modulation frequency response and the total distortion, including intermodulation distortions. We applied this method to various pitch extractors and placed them on several performance maps. We used the proposed method to fine-tune one of the extractors to make it the best fit tool for scientific research of voice fundamental frequencies.
Submitted 2 April, 2022;
originally announced April 2022.
-
An objective test tool for pitch extractors' response attributes
Authors:
Hideki Kawahara,
Kohei Yatabe,
Ken-Ichi Sakakibara,
Tatsuya Kitamura,
Hideki Banno,
Masanori Morise
Abstract:
We propose an objective measurement method for pitch extractors' responses to frequency-modulated signals. It enables us to evaluate different pitch extractors with unified criteria. The method uses extended time-stretched pulses combined by binary orthogonal sequences. It provides simultaneous measurement results consisting of the linear and the non-linear time-invariant responses and random and time-varying responses. We tested representative pitch extractors using fundamental frequencies spanning 80 Hz to 800 Hz with 1/48 octave steps and produced more than 2000 modulation frequency response plots. We found that making scientific visualization by animating these plots enables us to understand different pitch extractors' behavior at once. Such efficient and effortless inspection is impossible by inspecting all individual plots. The proposed measurement method with visualization leads to further improvement of the performance of one of the extractors mentioned above. In other words, our procedure turns the specific pitch extractor into the most reliable measuring equipment that is crucial for scientific research. We open-sourced MATLAB codes of the proposed objective measurement method and visualization procedure.
Submitted 24 June, 2022; v1 submitted 2 April, 2022;
originally announced April 2022.
-
SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping
Authors:
Yuma Koizumi,
Heiga Zen,
Kohei Yatabe,
Nanxin Chen,
Michiel Bacchiani
Abstract:
Neural vocoders using denoising diffusion probabilistic models (DDPMs) have been improved by adapting the diffusion noise distribution to given acoustic features. In this study, we propose SpecGrad, which adapts the diffusion noise so that its time-varying spectral envelope becomes close to the conditioning log-mel spectrogram. This adaptation by time-varying filtering improves the sound quality, especially in the high-frequency bands. It is processed in the time-frequency domain to keep the computational cost almost the same as that of conventional DDPM-based neural vocoders. Experimental results showed that SpecGrad generates higher-fidelity speech waveforms than conventional DDPM-based neural vocoders in both analysis-synthesis and speech enhancement scenarios. Audio demos are available at wavegrad.github.io/specgrad/.
Submitted 4 August, 2022; v1 submitted 30 March, 2022;
originally announced March 2022.
-
Wearable SELD dataset: Dataset for sound event localization and detection using wearable devices around head
Authors:
Kento Nagatomo,
Masahiro Yasuda,
Kohei Yatabe,
Shoichiro Saito,
Yasuhiro Oikawa
Abstract:
Sound event localization and detection (SELD) is a combined task of identifying the sound event and its direction. Deep neural networks (DNNs) are utilized to associate them with the sound signals observed by a microphone array. Although ambisonic microphones are popular in the literature of SELD, they might limit the range of applications due to their predetermined geometry. Some applications (including those for pedestrians that perform SELD while walking) require a wearable microphone array whose geometry can be designed to suit the task. In this paper, for the development of such a wearable SELD, we propose a dataset named the Wearable SELD dataset. It consists of data recorded by 24 microphones placed on a head and torso simulator (HATS) with some accessories mimicking wearable devices (glasses, earphones, and headphones). We also provide experimental results of SELD using the proposed dataset and SELDNet to investigate the effect of microphone configuration.
Submitted 17 February, 2022;
originally announced February 2022.
-
APPLADE: Adjustable Plug-and-play Audio Declipper Combining DNN with Sparse Optimization
Authors:
Tomoro Tanaka,
Kohei Yatabe,
Masahiro Yasuda,
Yasuhiro Oikawa
Abstract:
In this paper, we propose an audio declipping method that takes advantage of both sparse optimization and deep learning. Since sparsity-based audio declipping methods have been developed upon constrained optimization, they are adjustable and well-studied in theory. However, they always uniformly promote sparsity and ignore the individual properties of a signal. Deep neural network (DNN)-based methods can learn the properties of target signals and use them for audio declipping. Still, they cannot perform well if the training data have mismatches and/or constraints in the time domain are not imposed. In the proposed method, we use a DNN in an optimization algorithm. It is inspired by an idea called plug-and-play (PnP) and enables us to promote sparsity based on the learned information of data, considering constraints in the time domain. Our experiments confirmed that the proposed method is stable and robust to mismatches between training and test data.
Submitted 16 February, 2022;
originally announced February 2022.
-
Safeguarding test signals for acoustic measurement using arbitrary sounds
Authors:
Hideki Kawahara,
Kohei Yatabe
Abstract:
We propose a simple method to measure acoustic responses using any sounds by converting them into a form suitable for measurement. This method enables us to use music pieces for measuring acoustic conditions. It is advantageous to measure such conditions without test sounds that annoy listeners. In addition, applying the underlying idea of simultaneous measurement of multiple paths provides practically valuable features. For example, it is possible to measure deviations (temporally stable, random, and time-varying) and the impulse response while reproducing slightly modified contents under target conditions. The key idea of the proposed method is to add relatively small deterministic signals that sound like noise to the original sounds. We call the converted sounds safeguarded test signals.
Submitted 21 December, 2021;
originally announced December 2021.
-
Objective measurement of pitch extractors' responses to frequency modulated sounds and two reference pitch extraction methods for analyzing voice pitch responses to auditory stimulation
Authors:
Hideki Kawahara,
Kohei Yatabe,
Ken-Ichi Sakakibara,
Tatsuya Kitamura,
Hideki Banno,
Masanori Morise
Abstract:
We propose an objective measurement method for pitch extractors' responses to frequency-modulated signals. The method simultaneously measures the linear and the non-linear time-invariant responses and random and time-varying responses. It uses extended time-stretched pulses combined by binary orthogonal sequences. Our recent finding of involuntary voice pitch response to auditory stimulation while voicing motivated this proposal. The involuntary voice pitch response provides means to investigate voice chain subsystems individually and objectively. This response analysis requires reliable and precise pitch extraction. We found that existing pitch extractors failed to correctly analyze signals used for auditory stimulation by using the proposed method. Therefore, we propose two reference pitch extractors based on the instantaneous frequency analysis and multi-resolution power spectrum analysis. The proposed extractors correctly analyze the test signals. We open-sourced MATLAB codes to measure pitch extractors and codes for conducting the voice pitch response experiment on our GitHub repository.
Submitted 27 June, 2022; v1 submitted 5 November, 2021;
originally announced November 2021.
-
Design of Tight Minimum-Sidelobe Windows by Riemannian Newton's Method
Authors:
Daichi Kitahara,
Kohei Yatabe
Abstract:
The short-time Fourier transform (STFT), or the discrete Gabor transform (DGT), has been extensively used in signal analysis and processing. Their properties are characterized by a window function. For signal processing, designing a special window called tight window is important because it is known to make DGT-domain processing robust to error. In this paper, we propose a method of designing tight windows that minimize the sidelobe energy. It is formulated as a constrained spectral concentration problem, and a Newton's method on an oblique manifold is derived to efficiently obtain a solution. Our numerical example showed that the proposed algorithm requires only several iterations to reach a stationary point.
Submitted 5 December, 2021; v1 submitted 2 November, 2021;
originally announced November 2021.
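As a small illustration of the tight-window notion mentioned in the abstract above: when the window is no longer than the number of frequency channels (the so-called painless case, an assumption of this sketch), a window is tight exactly when the sum of its squared, hop-shifted copies is constant. The check below verifies that property numerically. The function name and the example windows are hypothetical choices for illustration; the sketch does not implement the Riemannian Newton's method or the sidelobe-energy objective of the paper.

```python
import numpy as np


def tightness_ratio(window, hop):
    """Ratio max/min of the summed squared hop-shifted windows.

    A ratio of 1 means the overlap-added squared window is constant, which is
    the tightness condition in the painless case assumed here.
    """
    window = np.asarray(window, dtype=float)
    if len(window) % hop != 0:
        raise ValueError("window length must be a multiple of hop")
    total = np.zeros(hop)
    for k in range(len(window) // hop):
        total += window[k * hop:(k + 1) * hop] ** 2
    return total.max() / total.min()


# Usage: a periodic Hann window and its square root at 50% overlap.
length, hop = 64, 32
hann = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(length) / length)
print(tightness_ratio(np.sqrt(hann), hop))  # square-root Hann: tight
print(tightness_ratio(hann, hop))           # plain Hann: not tight at this hop
```

The square-root Hann window passes because the two shifted Hann copies sum to one, whereas the plain Hann window squared does not overlap-add to a constant at this hop size.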
-
Sampling-Frequency-Independent Audio Source Separation Using Convolution Layer Based on Impulse Invariant Method
Authors:
Koichi Saito,
Tomohiko Nakamura,
Kohei Yatabe,
Yuma Koizumi,
Hiroshi Saruwatari
Abstract:
Audio source separation is often used as preprocessing of various applications, and one of its ultimate goals is to construct a single versatile model capable of dealing with the varieties of audio signals. Since sampling frequency, one of the audio signal varieties, is usually application specific, the preceding audio source separation model should be able to deal with audio signals of all sampling frequencies specified in the target applications. However, conventional models based on deep neural networks (DNNs) are trained only at the sampling frequency specified by the training data, and there are no guarantees that they work with unseen sampling frequencies. In this paper, we propose a convolution layer capable of handling arbitrary sampling frequencies by a single DNN. Through music source separation experiments, we show that the introduction of the proposed layer enables a conventional audio source separation model to consistently work with even unseen sampling frequencies.
Submitted 9 May, 2021;
originally announced May 2021.
-
Sparse time-frequency representation via atomic norm minimization
Authors:
Tsubasa Kusano,
Kohei Yatabe,
Yasuhiro Oikawa
Abstract:
Nonstationary signals are commonly analyzed and processed in the time-frequency (T-F) domain that is obtained by the discrete Gabor transform (DGT). The T-F representation obtained by DGT is spread due to windowing, which may degrade the performance of T-F domain analysis and processing. To obtain a well-localized T-F representation, sparsity-aware methods using $\ell_1$-norm have been studied. However, they need to discretize a continuous parameter onto a grid, which causes a model mismatch. In this paper, we propose a method of estimating a sparse T-F representation using atomic norm. The atomic norm enables sparse optimization without discretization of continuous parameters. Numerical experiments show that the T-F representation obtained by the proposed method is sparser than the conventional methods.
Submitted 7 May, 2021;
originally announced May 2021.
-
Mixture of orthogonal sequences made from extended time-stretched pulses enables measurement of involuntary voice fundamental frequency response to pitch perturbation
Authors:
Hideki Kawahara,
Toshie Matsui,
Kohei Yatabe,
Ken-Ichi Sakakibara,
Minoru Tsuzaki,
Masanori Morise,
Toshio Irino
Abstract:
Auditory feedback plays an essential role in the regulation of the fundamental frequency of voiced sounds. The fundamental frequency also responds to auditory stimulation other than the speaker's voice. We propose to use this response of the fundamental frequency of sustained vowels to frequency-modulated test signals for investigating involuntary control of voice pitch. This involuntary response is difficult to identify and isolate by the conventional paradigm, which uses step-shaped pitch perturbation. We recently developed a versatile measurement method using a mixture of orthogonal sequences made from a set of extended time-stretched pulses (TSP). In this article, we extended our approach and designed a set of test signals using the mixture to modulate the fundamental frequency of artificial signals. For testing the response, the experimenter presents the modulated signal aurally while the subject is voicing sustained vowels. We developed a tool for conducting this test quickly and interactively. We make the tool available as an open-source and also provide executable GUI-based applications. Preliminary tests revealed that the proposed method consistently provides compensatory responses with about 100 ms latency, representing involuntary control. Finally, we discuss future applications of the proposed method for objective and non-invasive auditory response measurements.
Submitted 3 April, 2021;
originally announced April 2021.
-
Noisy-target Training: A Training Strategy for DNN-based Speech Enhancement without Clean Speech
Authors:
Takuya Fujimura,
Yuma Koizumi,
Kohei Yatabe,
Ryoichi Miyazaki
Abstract:
Deep neural network (DNN)-based speech enhancement ordinarily requires clean speech signals as the training target. However, collecting clean signals is very costly because they must be recorded in a studio. This requirement currently restricts the amount of training data for speech enhancement to less than 1/1000 of that of speech recognition which does not need clean signals. Increasing the amount of training data is important for improving the performance, and hence the requirement of clean signals should be relaxed. In this paper, we propose a training strategy that does not require clean signals. The proposed method only utilizes noisy signals for training, which enables us to use a variety of speech signals in the wild. Our experimental results showed that the proposed method can achieve the performance similar to that of a DNN trained with clean signals.
Submitted 10 May, 2021; v1 submitted 21 January, 2021;
originally announced January 2021.
-
Cascaded all-pass filters with randomized center frequencies and phase polarity for acoustic and speech measurement and data augmentation
Authors:
Hideki Kawahara,
Kohei Yatabe
Abstract:
We introduce a new member of TSP (Time Stretched Pulse) for acoustic and speech measurement infrastructure, based on a simple all-pass filter and systematic randomization. This new infrastructure fundamentally upgrades our previous measurement procedure, which enables simultaneous measurement of multiple attributes, including non-linear ones, without requiring extra filtering or post-processing. Our new proposal establishes a theoretically solid, flexible, and extensible foundation in acoustic measurement. Moreover, it is general enough to provide versatile research tools for other fields, such as biological signal analysis. We illustrate using acoustic measurements and data augmentation as representative examples among various prospective applications. We open-sourced a MATLAB implementation. It consists of an interactive and real-time acoustic tool, MATLAB functions, and supporting materials.
Submitted 12 February, 2021; v1 submitted 25 October, 2020;
originally announced October 2020.
-
Self-supervised Neural Audio-Visual Sound Source Localization via Probabilistic Spatial Modeling
Authors:
Yoshiki Masuyama,
Yoshiaki Bando,
Kohei Yatabe,
Yoko Sasaki,
Masaki Onishi,
Yasuhiro Oikawa
Abstract:
Detecting sound source objects within visual observation is important for autonomous robots to comprehend surrounding environments. Since sounding objects have a large variety with different appearances in our living environments, labeling all sounding objects is impossible in practice. This calls for self-supervised learning which does not require manual labeling. Most conventional self-supervised learning methods use monaural audio signals and images and cannot distinguish sound source objects having similar appearances due to poor spatial information in audio signals. To solve this problem, this paper presents a self-supervised training method using 360° images and multichannel audio signals. By incorporating the spatial information in multichannel audio signals, our method trains deep neural networks (DNNs) to distinguish multiple sound source objects. Our system for localizing sound source objects in the image is composed of audio and visual DNNs. The visual DNN is trained to localize sound source candidates within an input image. The audio DNN verifies whether each candidate actually produces sound or not. These DNNs are jointly trained in a self-supervised manner based on a probabilistic spatial audio model. Experimental results with simulated data showed that the DNNs trained by our method localized multiple speakers. We also demonstrate that the visual DNN detected objects including talking visitors and specific exhibits from real data recorded in a science museum.
Submitted 27 July, 2020;
originally announced July 2020.
-
Consistent Independent Low-Rank Matrix Analysis for Determined Blind Source Separation
Authors:
Daichi Kitamura,
Kohei Yatabe
Abstract:
Independent low-rank matrix analysis (ILRMA) is the state-of-the-art algorithm for blind source separation (BSS) in the determined situation (the number of microphones is greater than or equal to that of source signals). ILRMA achieves a great separation performance by modeling the power spectrograms of the source signals via the nonnegative matrix factorization (NMF). Such a highly developed source model can solve the permutation problem of the frequency-domain BSS to a large extent, which is the reason for the excellence of ILRMA. In this paper, we further improve the separation performance of ILRMA by additionally considering the general structure of spectrograms, which is called consistency, and hence we call the proposed method Consistent ILRMA. Since a spectrogram is calculated by an overlapping window (and a window function induces spectral smearing called main- and side-lobes), the time-frequency bins depend on each other. In other words, the time-frequency components are related to each other via the uncertainty principle. Such co-occurrence among the spectral components can function as an assistant for solving the permutation problem, which has been demonstrated by a recent study. On the basis of these facts, we propose an algorithm for realizing Consistent ILRMA by slightly modifying the original algorithm. Its performance was extensively evaluated through experiments performed with various window lengths and shift lengths. The results indicated several tendencies of the original and proposed ILRMA that include some topics not fully discussed in the literature. For example, the proposed Consistent ILRMA tends to outperform the original ILRMA when the window length is sufficiently long compared to the reverberation time of the mixing system.
Submitted 1 November, 2020; v1 submitted 1 July, 2020;
originally announced July 2020.
-
Gamma Boltzmann Machine for Simultaneously Modeling Linear- and Log-amplitude Spectra
Authors:
Toru Nakashika,
Kohei Yatabe
Abstract:
In audio applications, one of the most important representations of audio signals is the amplitude spectrogram. It is utilized in many machine-learning-based information processing methods including the ones using the restricted Boltzmann machines (RBM). However, the ordinary Gaussian-Bernoulli RBM (the most popular RBM among its variations) cannot directly handle amplitude spectra because the Gaussian distribution is a symmetric model allowing negative values which never appear in the amplitude. In this paper, after proposing a general gamma Boltzmann machine, we propose a practical model called the gamma-Bernoulli RBM that simultaneously handles both linear- and log-amplitude spectrograms. Its conditional distribution of the observable data is given by the gamma distribution, and thus the proposed RBM can naturally handle the data represented by positive numbers as the amplitude spectra. It can also treat amplitude in the logarithmic scale which is important for audio signals from the perceptual point of view. The advantage of the proposed model compared to the ordinary Gaussian-Bernoulli RBM was confirmed by PESQ and MSE in the experiment of representing the amplitude spectrograms of speech signals.
Submitted 25 June, 2020; v1 submitted 24 June, 2020;
originally announced June 2020.
-
Consistent ICA: Determined BSS meets spectrogram consistency
Authors:
Kohei Yatabe
Abstract:
Multichannel audio blind source separation (BSS) in the determined situation (the number of microphones is equal to that of the sources), or determined BSS, is performed by multichannel linear filtering in the time-frequency domain to handle the convolutive mixing process. Ordinarily, the filter treats each frequency independently, which causes the well-known permutation problem, i.e., the problem of how to align the frequency-wise filters so that each separated component is correctly assigned to the corresponding sources. In this paper, it is shown that the general property of the time-frequency-domain representation called spectrogram consistency can be an assistant for solving the permutation problem.
Submitted 20 May, 2020;
originally announced May 2020.
-
Determined BSS based on time-frequency masking and its application to harmonic vector analysis
Authors:
Kohei Yatabe,
Daichi Kitamura
Abstract:
This paper proposes harmonic vector analysis (HVA) based on a general algorithmic framework of audio blind source separation (BSS) that is also presented in this paper. BSS for a convolutive audio mixture is usually performed by multichannel linear filtering when the numbers of microphones and sources are equal (determined situation). This paper addresses such determined BSS based on batch processing. To estimate the demixing filters, effective modeling of the source signals is important. One successful example is independent vector analysis (IVA) that models the signals via co-occurrence among the frequency components in each source. To give more freedom to the source modeling, a general framework of determined BSS is presented in this paper. It is based on the plug-and-play scheme using a primal-dual splitting algorithm and enables us to model the source signals implicitly through a time-frequency mask. By using the proposed framework, determined BSS algorithms can be developed by designing masks that enhance the source signals. As an example of its application, we propose HVA by defining a time-frequency mask that enhances the harmonic structure of audio signals via sparsity of cepstrum. The experiments showed that HVA outperforms IVA and independent low-rank matrix analysis (ILRMA) for both speech and music signals. A MATLAB code is provided along with the paper for a reference ( https://doi.org/10.24433/CO.9507820.v1 ).
Submitted 14 April, 2021; v1 submitted 29 April, 2020;
originally announced April 2020.
-
Stable Training of DNN for Speech Enhancement based on Perceptually-Motivated Black-Box Cost Function
Authors:
Masaki Kawanaka,
Yuma Koizumi,
Ryoichi Miyazaki,
Kohei Yatabe
Abstract:
Improving subjective sound quality of enhanced signals is one of the most important missions in speech enhancement. For evaluating the subjective quality, several methods related to perceptually-motivated objective sound quality assessment (OSQA) have been proposed such as PESQ (perceptual evaluation of speech quality). However, direct use of such measures for training deep neural network (DNN) is not allowed in most cases because popular OSQAs are non-differentiable with respect to DNN parameters. Therefore, the previous study has proposed to approximate the score of OSQAs by an auxiliary DNN so that its gradient can be used for training the primary DNN. One problem with this approach is instability of the training caused by the approximation error of the score. To overcome this problem, we propose to use stabilization techniques borrowed from reinforcement learning. The experiments, aimed to increase the score of PESQ as an example, show that the proposed method (i) can stably train a DNN to increase PESQ, (ii) achieved the state-of-the-art PESQ score on a public dataset, and (iii) resulted in better sound quality than conventional methods based on subjective evaluation.
Submitted 14 February, 2020;
originally announced February 2020.
-
Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention
Authors:
Yuma Koizumi,
Kohei Yatabe,
Marc Delcroix,
Yoshiki Masuyama,
Daiki Takeuchi
Abstract:
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features; we extract a speaker representation used for adaptation directly from the test utterance. Conventional studies of deep neural network (DNN)-based speech enhancement mainly focus on building a speaker-independent model. Meanwhile, in speech applications including speech recognition and synthesis, it is known that model adaptation to the target speaker improves the accuracy. Our research question is whether a DNN for speech enhancement can be adapted to unknown speakers without any auxiliary guidance signal in the test phase. To achieve this, we adopt multi-task learning of speech enhancement and speaker identification, and use the output of the final hidden layer of the speaker identification branch as an auxiliary feature. In addition, we use multi-head self-attention for capturing long-term dependencies in the speech and noise. Experimental results on a public dataset show that our strategy achieves state-of-the-art performance and also outperforms conventional methods in terms of subjective quality.
Submitted 14 February, 2020;
originally announced February 2020.
-
Real-time speech enhancement using equilibriated RNN
Authors:
Daiki Takeuchi,
Kohei Yatabe,
Yuma Koizumi,
Yasuhiro Oikawa,
Noboru Harada
Abstract:
We propose a speech enhancement method using a causal deep neural network (DNN) for real-time applications. DNN has been widely used for estimating a time-frequency (T-F) mask which enhances a speech signal. One popular DNN structure for that is a recurrent neural network (RNN) owing to its capability of effectively modelling time-sequential data like speech. In particular, the long short-term memory (LSTM) is often used to alleviate the vanishing/exploding gradient problem which makes the training of an RNN difficult. However, the number of parameters of LSTM is increased as the price of mitigating the difficulty of training, which requires more computational resources. For real-time speech enhancement, it is preferable to use a smaller network without losing the performance. In this paper, we propose to use the equilibriated recurrent neural network (ERNN) for avoiding the vanishing/exploding gradient problem without increasing the number of parameters. The proposed structure is causal, which requires only the information from the past, in order to apply it in real-time. Compared to the uni- and bi-directional LSTM networks, the proposed method achieved similar performance with much fewer parameters.
Submitted 13 February, 2020;
originally announced February 2020.
-
Phase reconstruction based on recurrent phase unwrapping with deep neural networks
Authors:
Yoshiki Masuyama,
Kohei Yatabe,
Yuma Koizumi,
Yasuhiro Oikawa,
Noboru Harada
Abstract:
Phase reconstruction, which estimates phase from a given amplitude spectrogram, is an active research field in acoustical signal processing with many applications including audio synthesis. To take advantage of rich knowledge from data, several studies presented deep neural network (DNN)-based phase reconstruction methods. However, the training of a DNN for phase reconstruction is not an easy task because phase is sensitive to the shift of a waveform. To overcome this problem, we propose a DNN-based two-stage phase reconstruction method. In the proposed method, DNNs estimate phase derivatives instead of phase itself, which allows us to avoid the sensitivity problem. Then, phase is recursively estimated based on the estimated derivatives, which is named recurrent phase unwrapping (RPU). The experimental results confirm that the proposed method outperformed the direct phase estimation by a DNN.
Submitted 13 February, 2020;
originally announced February 2020.
-
Invertible DNN-based nonlinear time-frequency transform for speech enhancement
Authors:
Daiki Takeuchi,
Kohei Yatabe,
Yuma Koizumi,
Yasuhiro Oikawa,
Noboru Harada
Abstract:
We propose an end-to-end speech enhancement method with trainable time-frequency (T-F) transform based on invertible deep neural network (DNN). The recent development of speech enhancement is brought by using DNN. The ordinary DNN-based speech enhancement employs T-F transform, typically the short-time Fourier transform (STFT), and estimates a T-F mask using DNN. On the other hand, some methods have considered end-to-end networks which directly estimate the enhanced signals without T-F transform. While end-to-end methods have shown promising results, they are black boxes and hard to understand. Therefore, some end-to-end methods used a DNN to learn the linear T-F transform which is much easier to understand. However, the learned transform may not have a property important for ordinary signal processing. In this paper, as the important property of the T-F transform, perfect reconstruction is considered. An invertible nonlinear T-F transform is constructed by DNNs and learned from data so that the obtained transform is a perfect-reconstruction filterbank.
Submitted 13 February, 2020; v1 submitted 25 November, 2019;
originally announced November 2019.
-
Data-driven design of perfect reconstruction filterbank for DNN-based sound source enhancement
Authors:
Daiki Takeuchi,
Kohei Yatabe,
Yuma Koizumi,
Yasuhiro Oikawa,
Noboru Harada
Abstract:
We propose a data-driven design method of perfect-reconstruction filterbank (PRFB) for sound-source enhancement (SSE) based on deep neural network (DNN). DNNs have been used to estimate a time-frequency (T-F) mask in the short-time Fourier transform (STFT) domain. Their training is more stable when a simple cost function such as mean-squared error (MSE) is utilized compared to advanced costs such as objective sound quality assessments. However, such a simple cost function inherits strong assumptions on the statistics of the target and/or noise which are often not satisfied, and the mismatch of assumption results in degraded performance. In this paper, we propose to design the frequency scale of PRFB from training data so that the assumption on MSE is satisfied. For designing the frequency scale, the warped filterbank frame (WFBF) is considered as PRFB. The frequency characteristic of learned WFBF was in between STFT and the wavelet transform, and its effectiveness was confirmed by comparison with a standard STFT-based DNN whose input feature is compressed into the mel scale.
Submitted 21 March, 2019;
originally announced March 2019.
-
Low-rankness of Complex-valued Spectrogram and Its Application to Phase-aware Audio Processing
Authors:
Yoshiki Masuyama,
Kohei Yatabe,
Yasuhiro Oikawa
Abstract:
Low-rankness of amplitude spectrograms has been effectively utilized in audio signal processing methods including non-negative matrix factorization. However, such methods have a fundamental limitation owing to their amplitude-only treatment where the phase of the observed signal is utilized for resynthesizing the estimated signal. In order to address this limitation, we directly treat a complex-va…
▽ More
Low-rankness of amplitude spectrograms has been effectively utilized in audio signal processing methods including non-negative matrix factorization. However, such methods have a fundamental limitation owing to their amplitude-only treatment where the phase of the observed signal is utilized for resynthesizing the estimated signal. In order to address this limitation, we directly treat a complex-valued spectrogram and show a complex-valued spectrogram of a sum of sinusoids can be approximately low-rank by modifying its phase. For evaluating the applicability of the proposed low-rank representation, we further propose a convex prior emphasizing harmonic signals, and it is applied to audio denoising.
△ Less
Submitted 13 March, 2019;
originally announced March 2019.
-
Phase-aware Harmonic/Percussive Source Separation via Convex Optimization
Authors:
Yoshiki Masuyama,
Kohei Yatabe,
Yasuhiro Oikawa
Abstract:
Decomposition of an audio mixture into harmonic and percussive components, namely harmonic/percussive source separation (HPSS), is a useful pre-processing tool for many audio applications. Popular approaches to HPSS exploit the distinctive source-specific structures of power spectrograms. However, such approaches consider only power spectrograms, and the phase remains intact for resynthesizing the separated signals. In this paper, we propose a phase-aware HPSS method based on the structure of the phase of harmonic components. It is formulated as a convex optimization problem in the time domain, which enables the simultaneous treatment of both amplitude and phase. The numerical experiment validates the effectiveness of the proposed method.
Submitted 13 March, 2019;
originally announced March 2019.
-
Deep Griffin-Lim Iteration
Authors:
Yoshiki Masuyama,
Kohei Yatabe,
Yuma Koizumi,
Yasuhiro Oikawa,
Noboru Harada
Abstract:
This paper presents a novel phase reconstruction method (only from a given amplitude spectrogram) by combining a signal-processing-based approach and a deep neural network (DNN). To retrieve a time-domain signal from its amplitude spectrogram, the corresponding phase is required. One of the popular phase reconstruction methods is the Griffin-Lim algorithm (GLA), which is based on the redundancy of the short-time Fourier transform. However, GLA often involves many iterations and produces low-quality signals owing to the lack of prior knowledge of the target signal. In order to address these issues, in this study, we propose an architecture which stacks a sub-block including two GLA-inspired fixed layers and a DNN. The number of stacked sub-blocks is adjustable, and we can trade the performance and computational load based on requirements of applications. The effectiveness of the proposed method is investigated by reconstructing phases from amplitude spectrograms of speeches.
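For orientation, the classical Griffin-Lim iteration referred to in the abstract can be sketched as follows. This is only the baseline GLA, not the proposed DNN-augmented architecture. The circular STFT, the periodic square-root Hann window, the sizes `L`, `HOP`, `N`, the zero initial phase, and all function names are assumptions made for illustration.

```python
import cmath
import math

L, HOP = 16, 8  # window length and hop (illustrative values)
# Periodic sqrt-Hann window: its squared shifts sum to 1 when HOP = L / 2.
WIN = [math.sqrt(0.5 - 0.5 * math.cos(2 * math.pi * n / L)) for n in range(L)]

def stft(x):
    """Circular STFT: one length-L DFT per frame, frames HOP samples apart."""
    N = len(x)
    return [[sum(x[(k * HOP + l) % N] * WIN[l]
                 * cmath.exp(-2j * math.pi * m * l / L) for l in range(L))
             for m in range(L)]
            for k in range(N // HOP)]

def istft(X, N):
    """Windowed overlap-add inverse; keeps the real part of each frame."""
    y = [0.0] * N
    for k, frame in enumerate(X):
        for l in range(L):
            s = sum(frame[m] * cmath.exp(2j * math.pi * m * l / L)
                    for m in range(L)) / L
            y[(k * HOP + l) % N] += WIN[l] * s.real
    return y

def griffin_lim(amp, N, n_iter=30):
    """Alternate between STFT consistency and the given amplitude."""
    X = [[complex(a) for a in frame] for frame in amp]  # zero initial phase
    errors = []
    for _ in range(n_iter):
        x = istft(X, N)
        C = stft(x)  # spectrogram of an actual signal (consistent)
        errors.append(math.sqrt(sum((abs(c) - a) ** 2
                                    for fc, fa in zip(C, amp)
                                    for c, a in zip(fc, fa))))
        # Keep the phase of C, restore the target amplitude.
        X = [[a * cmath.exp(1j * cmath.phase(c)) for c, a in zip(fc, fa)]
             for fc, fa in zip(C, amp)]
    return x, errors

N = 64
signal = [math.sin(2 * math.pi * 5 * n / N) + 0.5 * math.sin(2 * math.pi * 11 * n / N)
          for n in range(N)]
amp = [[abs(c) for c in frame] for frame in stft(signal)]
estimate, errors = griffin_lim(amp, N)
```

The list `errors` tracks the mismatch between the target amplitude and the amplitude of the re-analyzed estimate at each iteration; in this setting it should not increase from one iteration to the next, but a small mismatch does not imply that the original waveform has been recovered.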
Submitted 10 March, 2019;
originally announced March 2019.
-
Designing nearly tight window for improving time-frequency masking
Authors:
Tsubasa Kusano,
Yoshiki Masuyama,
Kohei Yatabe,
Yasuhiro Oikawa
Abstract:
Many audio signal processing methods are formulated in the time-frequency (T-F) domain which is obtained by the short-time Fourier transform (STFT). The properties of the STFT are fully characterized by window function, number of frequency channels, and time-shift. Thus, designing a better window is important for improving the performance of the processing especially when a less redundant T-F representation is desirable. While many window functions have been proposed in the literature, they are designed to have a good frequency response for analysis, which may not perform well in terms of signal processing. The window design must take the effect of the reconstruction (from the T-F domain into the time domain) into account for improving the performance. In this paper, an optimization-based design method of a nearly tight window is proposed to obtain a window performing well for the T-F domain signal processing.
Submitted 4 February, 2019; v1 submitted 17 November, 2018;
originally announced November 2018.