Microphone Array Geometry Independent Multi-Talker Distant ASR: NTT System for the DASR Task of the CHiME-8 Challenge

Authors: Naoyuki Kamo, Naohiro Tawara, Atsushi Ando, Takatomo Kano, Hiroshi Sato, Rintaro Ikeshita, Takafumi Moriya, Shota Horiguchi, Kohei Matsuura, Atsunori Ogawa, Alexis Plaquet, Takanori Ashihara, Tsubasa Ochiai, Masato Mimura, Marc Delcroix, Tomohiro Nakatani, Taichi Asami, Shoko Araki

Abstract: In this paper, we introduce a multi-talker distant automatic speech recognition (DASR) system we designed for the DASR task 1 of the CHiME-8 challenge. Our system performs speaker counting, diarization, and ASR. It handles various recording conditions, from dinner parties to professional meetings and from two to eight speakers. We perform diarization first, followed by speech enhancement, and then ASR, as in the challenge baseline. However, we introduced several key refinements. First, we derived a powerful speaker diarization relying on end-to-end speaker diarization with vector clustering (EEND-VC), multi-channel speaker counting using enhanced embeddings from EEND-VC, and target-speaker voice activity detection (TS-VAD). For speech enhancement, we introduced a novel microphone selection rule to better select the most relevant microphones among the distributed microphones and investigated improvements to beamforming. Finally, for ASR, we developed several models exploiting Whisper and WavLM speech foundation models. We present the results we submitted to the challenge and updated results we obtained afterward. Our strongest system achieves a 63% relative macro tcpWER improvement over the baseline and outperforms the challenge best results on the NOTSOFAR-1 meeting evaluation data among geometry-independent systems.

Submitted 18 June, 2025; v1 submitted 13 February, 2025; originally announced February 2025.

Comments: 55 pages, 12 figures

arXiv:2409.05554 [pdf, other]

NTT Multi-Speaker ASR System for the DASR Task of CHiME-8 Challenge

Authors: Naoyuki Kamo, Naohiro Tawara, Atsushi Ando, Takatomo Kano, Hiroshi Sato, Rintaro Ikeshita, Takafumi Moriya, Shota Horiguchi, Kohei Matsuura, Atsunori Ogawa, Alexis Plaquet, Takanori Ashihara, Tsubasa Ochiai, Masato Mimura, Marc Delcroix, Tomohiro Nakatani, Taichi Asami, Shoko Araki

Abstract: We present a distant automatic speech recognition (DASR) system developed for the CHiME-8 DASR track. It consists of a diarization-first pipeline. For diarization, we use end-to-end diarization with vector clustering (EEND-VC) followed by target speaker voice activity detection (TS-VAD) refinement. To deal with various numbers of speakers, we developed a new multi-channel speaker counting approach. We then apply guided source separation (GSS) with several improvements to the baseline system. Finally, we perform ASR using a combination of systems built from strong pre-trained models. Our proposed system achieves a macro tcpWER of 21.3 % on the dev set, which is a 57 % relative improvement over the baseline.

Submitted 9 September, 2024; originally announced September 2024.

Comments: 5 pages, 4 figures, CHiME8 challenge

arXiv:2404.14860 [pdf, other]

Rethinking Processing Distortions: Disentangling the Impact of Speech Enhancement Errors on Speech Recognition Performance

Authors: Tsubasa Ochiai, Kazuma Iwamoto, Marc Delcroix, Rintaro Ikeshita, Hiroshi Sato, Shoko Araki, Shigeru Katagiri

Abstract: It is challenging to improve automatic speech recognition (ASR) performance in noisy conditions with a single-channel speech enhancement (SE) front-end. This is generally attributed to the processing distortions caused by the nonlinear processing of single-channel SE front-ends. However, the causes of such degraded ASR performance have not been fully investigated. How to design single-channel SE front-ends in a way that significantly improves ASR performance remains an open research question. In this study, we investigate a signal-level numerical metric that can explain the cause of degradation in ASR performance. To this end, we propose a novel analysis scheme based on the orthogonal projection-based decomposition of SE errors. This scheme manually modifies the ratio of the decomposed interference, noise, and artifact errors, and it enables us to directly evaluate the impact of each error type on ASR performance. Our analysis reveals the particularly detrimental effect of artifact errors on ASR performance compared to the other types of errors. This provides us with a more principled definition of processing distortions that cause the ASR performance degradation. Then, we study two practical approaches for reducing the impact of artifact errors. First, we prove that the simple observation adding (OA) post-processing (i.e., interpolating the enhanced and observed signals) can monotonically improve the signal-to-artifact ratio. Second, we propose a novel training objective, called artifact-boosted signal-to-distortion ratio (AB-SDR), which forces the model to estimate the enhanced signals with fewer artifact errors. Through experiments, we confirm that both the OA and AB-SDR approaches are effective in decreasing artifact errors caused by single-channel SE front-ends, allowing them to significantly improve ASR performance.

Submitted 23 April, 2024; originally announced April 2024.

Comments: 13 pages, 6 figures, Submitted to IEEE/ACM Trans. Audio, Speech, and Language Processing

arXiv:2311.11599 [pdf, other]

How does end-to-end speech recognition training impact speech enhancement artifacts?

Authors: Kazuma Iwamoto, Tsubasa Ochiai, Marc Delcroix, Rintaro Ikeshita, Hiroshi Sato, Shoko Araki, Shigeru Katagiri

Abstract: Jointly training a speech enhancement (SE) front-end and an automatic speech recognition (ASR) back-end has been investigated as a way to mitigate the influence of processing distortion generated by single-channel SE on ASR. In this paper, we investigate the effect of such joint training on the signal-level characteristics of the enhanced signals from the viewpoint of the decomposed noise and artifact errors. The experimental analyses provide two novel findings: 1) ASR-level training of the SE front-end reduces the artifact errors while increasing the noise errors, and 2) simply interpolating the enhanced and observed signals, which achieves a similar effect of reducing artifacts and increasing noise, improves ASR performance without jointly modifying the SE and ASR modules, even for a strong ASR back-end using a WavLM feature extractor. Our findings provide a better understanding of the effect of joint training and a novel insight for designing an ASR-agnostic SE front-end.

Submitted 20 November, 2023; originally announced November 2023.

Comments: 5 pages, 1 figure, 1 table

arXiv:2311.11595 [pdf, ps, other]

Neural network-based virtual microphone estimation with virtual microphone and beamformer-level multi-task loss

Authors: Hanako Segawa, Tsubasa Ochiai, Marc Delcroix, Tomohiro Nakatani, Rintaro Ikeshita, Shoko Araki, Takeshi Yamada, Shoji Makino

Abstract: Array processing performance depends on the number of microphones available. Virtual microphone estimation (VME) has been proposed to increase the number of microphone signals artificially. Neural network-based VME (NN-VME) trains an NN with a VM-level loss to predict a signal at a microphone location that is available during training but not at inference. However, this training objective may not be optimal for a specific array processing back-end, such as beamforming. An alternative approach is to use a training objective considering the array-processing back-end, such as a loss on the beamformer output. This approach may generate signals optimal for beamforming but not physically grounded. To combine the advantages of both approaches, this paper proposes a multi-task loss for NN-VME that combines both VM-level and beamformer-level losses. We evaluate the proposed multi-task NN-VME on multi-talker underdetermined conditions and show that it achieves a 33.1 % relative WER improvement compared to using only real microphones and 10.8 % compared to using a prior NN-VME approach.

Submitted 20 November, 2023; originally announced November 2023.

Comments: 5 pages, 2 figures, 1 table

arXiv:2306.12820 [pdf, other]

NoisyILRMA: Diffuse-Noise-Aware Independent Low-Rank Matrix Analysis for Fast Blind Source Extraction

Authors: Koki Nishida, Norihiro Takamune, Rintaro Ikeshita, Daichi Kitamura, Hiroshi Saruwatari, Tomohiro Nakatani

Abstract: In this paper, we address the multichannel blind source extraction (BSE) of a single source in diffuse noise environments. To solve this problem even faster than by fast multichannel nonnegative matrix factorization (FastMNMF) and its variant, we propose a BSE method called NoisyILRMA, which is a modification of independent low-rank matrix analysis (ILRMA) to account for diffuse noise. NoisyILRMA can achieve considerably fast BSE by incorporating an algorithm developed for independent vector extraction. In addition, to improve the BSE performance of NoisyILRMA, we propose a mechanism to switch the source model with ILRMA-like nonnegative matrix factorization to a more expressive source model during optimization. In the experiment, we show that NoisyILRMA runs faster than a FastMNMF algorithm while maintaining the BSE performance. We also confirm that the switching mechanism improves the BSE performance of NoisyILRMA.

Submitted 22 June, 2023; originally announced June 2023.

Comments: 5 pages, 3 figures, accepted for European Signal Processing Conference 2023 (EUSIPCO 2023)

arXiv:2202.00875 [pdf, other]

ISS2: An Extension of Iterative Source Steering Algorithm for Majorization-Minimization-Based Independent Vector Analysis

Authors: Rintaro Ikeshita, Tomohiro Nakatani

Abstract: A majorization-minimization (MM) algorithm for independent vector analysis optimizes a separation matrix $W = [w_1, \ldots, w_m]^h \in \mathbb{C}^{m \times m}$ by minimizing a surrogate function of the form $\mathcal{L}(W) = \sum_{i = 1}^m w_i^h V_i w_i - \log | \det W |^2$, where $m \in \mathbb{N}$ is the number of sensors and positive definite matrices $V_1,\ldots,V_m \in \mathbb{C}^{m \times m}$ are constructed in each MM iteration. For $m \geq 3$, no algorithm has been found to obtain a global minimum of $\mathcal{L}(W)$. Instead, block coordinate descent (BCD) methods with closed-form update formulas have been developed for minimizing $\mathcal{L}(W)$ and shown to be effective. One such BCD is called iterative projection (IP) that updates one or two rows of $W$ in each iteration. Another BCD is called iterative source steering (ISS) that updates one column of the mixing matrix $A = W^{-1}$ in each iteration. Although the time complexity per iteration of ISS is $m$ times smaller than that of IP, the conventional ISS converges slower than the current fastest IP (called $\text{IP}_2$) that updates two rows of $W$ in each iteration. We here extend this ISS to $\text{ISS}_2$ that can update two columns of $A$ in each iteration while maintaining its small time complexity. To this end, we provide a unified way for developing new ISS type methods from which $\text{ISS}_2$ as well as the conventional ISS can be immediately obtained in a systematic manner. Numerical experiments to separate reverberant speech mixtures show that our $\text{ISS}_2$ converges in fewer MM iterations than the conventional ISS, and is comparable to $\text{IP}_2$.

Submitted 16 June, 2022; v1 submitted 1 February, 2022; originally announced February 2022.

Comments: Accepted for publication in the 30th European Signal Processing Conference (EUSIPCO 2022)
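
For readers who want to see the surrogate function from the ISS2 abstract in concrete form, the following is a minimal NumPy sketch that only evaluates $\mathcal{L}(W)$; it does not implement IP, ISS, or ISS2. The function name `mm_surrogate` and the array layout (rows of `W` hold $w_i^h$, and `V` stacks the matrices $V_1,\ldots,V_m$) are assumptions made for illustration.

```python
import numpy as np

def mm_surrogate(W, V):
    """Evaluate L(W) = sum_i w_i^h V_i w_i - log|det W|^2.

    W: (m, m) complex separation matrix whose i-th row is w_i^h.
    V: (m, m, m) stack of Hermitian positive definite matrices V_i.
    (This layout is an illustrative assumption, not taken from the paper.)
    """
    m = W.shape[0]
    quad = 0.0
    for i in range(m):
        row = W[i]                      # w_i^h as a row vector
        # w_i^h V_i w_i, with w_i = conj(row); real for Hermitian V_i
        quad += np.real(row @ V[i] @ row.conj())
    # slogdet returns (phase, log|det W|); log|det W|^2 = 2 * log|det W|
    _, log_abs_det = np.linalg.slogdet(W)
    return quad - 2.0 * log_abs_det

# Toy check with identity matrices: the quadratic term is m and the log term is 0.
m = 3
identity_stack = np.stack([np.eye(m, dtype=complex)] * m)
value_identity = mm_surrogate(np.eye(m, dtype=complex), identity_stack)
value_scaled = mm_surrogate(2.0 * np.eye(m, dtype=complex), identity_stack)
```

A BCD method such as IP or ISS would decrease this value by updating a few rows of `W` (or columns of its inverse) at a time; evaluating the function after each update is one simple way to monitor convergence.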

arXiv:2201.06685 [pdf, other]

How Bad Are Artifacts?: Analyzing the Impact of Speech Enhancement Errors on ASR

Authors: Kazuma Iwamoto, Tsubasa Ochiai, Marc Delcroix, Rintaro Ikeshita, Hiroshi Sato, Shoko Araki, Shigeru Katagiri

Abstract: It is challenging to improve automatic speech recognition (ASR) performance in noisy conditions with single-channel speech enhancement (SE). In this paper, we investigate the causes of ASR performance degradation by decomposing the SE errors using orthogonal projection-based decomposition (OPD). OPD decomposes the SE errors into noise and artifact components. The artifact component is defined as the SE error signal that cannot be represented as a linear combination of speech and noise sources. We propose manually scaling the error components to analyze their impact on ASR. We experimentally identify the artifact component as the main cause of performance degradation, and we find that mitigating the artifact can greatly improve ASR performance. Furthermore, we demonstrate that the simple observation adding (OA) technique (i.e., adding a scaled version of the observed signal to the enhanced speech) can monotonically increase the signal-to-artifact ratio under a mild condition. Accordingly, we experimentally confirm that OA improves ASR performance for both simulated and real recordings. The findings of this paper provide a better understanding of the influence of SE errors on ASR and open the door to future research on novel approaches for designing effective single-channel SE front-ends for ASR.

Submitted 30 March, 2022; v1 submitted 17 January, 2022; originally announced January 2022.

Comments: 5 pages, 5 figures, submitted to Interspeech 2022
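
The observation adding (OA) idea described in the abstract above can be illustrated with a small NumPy sketch. It is a simplified stand-in, not the paper's procedure: the projection here uses only the two instantaneous signals (no time shifts or filters), the signals are synthetic Gaussian noise, and the names `split_artifact`, `sar_db`, `observation_adding`, and the scale `alpha = 0.5` are assumptions chosen for illustration.

```python
import numpy as np

def split_artifact(enhanced, speech, noise):
    """Split `enhanced` into its least-squares projection onto
    span{speech, noise} and the residual, used here as the artifact."""
    basis = np.stack([speech, noise], axis=1)            # shape (T, 2)
    coeffs, *_ = np.linalg.lstsq(basis, enhanced, rcond=None)
    projected = basis @ coeffs
    return projected, enhanced - projected

def sar_db(enhanced, speech, noise):
    """Signal-to-artifact ratio in dB under the simplified decomposition."""
    projected, artifact = split_artifact(enhanced, speech, noise)
    return 10.0 * np.log10(np.sum(projected**2) / np.sum(artifact**2))

def observation_adding(enhanced, observed, alpha):
    """OA post-processing: add a scaled copy of the observed signal."""
    return enhanced + alpha * observed

# Synthetic toy signals (assumed for illustration only).
rng = np.random.default_rng(0)
T = 16000
speech = rng.standard_normal(T)
noise = rng.standard_normal(T)
observed = speech + noise
# A fake "enhanced" signal: speech, some residual noise, plus an artifact term.
enhanced = speech + 0.3 * noise + 0.2 * rng.standard_normal(T)

enhanced_oa = observation_adding(enhanced, observed, alpha=0.5)

_, artifact_before = split_artifact(enhanced, speech, noise)
_, artifact_after = split_artifact(enhanced_oa, speech, noise)
sar_before = sar_db(enhanced, speech, noise)
sar_after = sar_db(enhanced_oa, speech, noise)
```

Because the observed signal lies in the span of the speech and noise sources, adding it leaves the artifact residual unchanged in this toy setting and only grows the non-artifact part, which is why the signal-to-artifact ratio rises here. The price is more residual noise in the output, so the scale has to be tuned against ASR performance.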

arXiv:2111.10574 [pdf, ps, other]

Switching Independent Vector Analysis and Its Extension to Blind and Spatially Guided Convolutional Beamforming Algorithms

Authors: Tomohiro Nakatani, Rintaro Ikeshita, Keisuke Kinoshita, Hiroshi Sawada, Naoyuki Kamo, Shoko Araki

Abstract: This paper develops a framework that can perform denoising, dereverberation, and source separation accurately by using a relatively small number of microphones. It has been empirically confirmed that Independent Vector Analysis (IVA) can blindly separate N sources from their sound mixture even with diffuse noise when a sufficiently large number (=M) of microphones are available (i.e., M>>N). However, the estimation accuracy seriously degrades as the number of microphones, or more specifically M-N (>=0), decreases. To overcome this limitation of IVA, we propose switching IVA (swIVA) in this paper. With swIVA, time frames of an observed signal with time-varying characteristics are clustered into several groups, each of which can be well handled by IVA using a small number of microphones, and thus accurate estimation can be achieved by applying IVA individually to each of the groups. Conventionally, a switching mechanism was introduced into a beamformer; however, no blind source separation algorithms with a switching mechanism have been successfully developed until this paper. In order to incorporate dereverberation capability, this paper further extends swIVA to a blind Convolutional beamforming algorithm (swCIVA). It integrates swIVA and switching Weighted Prediction Error-based dereverberation (swWPE) in a jointly optimal way. We show that both swIVA and swCIVA can be optimized effectively based on blind signal processing, and that their performance can be further improved using a spatial guide for the initialization. Experiments show that both proposed methods largely outperform conventional IVA and its Convolutional beamforming extension (CIVA) in terms of objective signal quality and automatic speech recognition scores when using a relatively small number of microphones.

Submitted 24 February, 2022; v1 submitted 20 November, 2021; originally announced November 2021.

Comments: Submitted to IEEE/ACM Trans. Audio, Speech, and Language Processing on 27 July 2021, accepted on 22 Feb. 2022

arXiv:2108.01836 [pdf, ps, other]

doi 10.1109/ICASSP39728.2021.9414264

Blind and neural network-guided convolutional beamformer for joint denoising, dereverberation, and source separation

Authors: Tomohiro Nakatani, Rintaro Ikeshita, Keisuke Kinoshita, Hiroshi Sawada, Shoko Araki

Abstract: This paper proposes an approach for optimizing a Convolutional BeamFormer (CBF) that can jointly perform denoising (DN), dereverberation (DR), and source separation (SS). First, we develop a blind CBF optimization algorithm that requires no prior information on the sources or the room acoustics, by extending a conventional joint DR and SS method. For making the optimization computationally tractable, we incorporate two techniques into the approach: the Source-Wise Factorization (SW-Fact) of a CBF and the Independent Vector Extraction (IVE). To further improve the performance, we develop a method that integrates a neural network (NN)-based source power spectra estimation with CBF optimization by an inverse-Gamma prior. Experiments using noisy reverberant mixtures reveal that our proposed method with both blind and NN-guided scenarios greatly outperforms the conventional state-of-the-art NN-supported mask-based CBF in terms of the improvement in automatic speech recognition and signal distortion reduction performance.

Submitted 4 August, 2021; originally announced August 2021.

Comments: Accepted by IEEE ICASSP 2021

arXiv:2106.05529 [pdf, other]

Independent Deeply Learned Tensor Analysis for Determined Audio Source Separation

Authors: Naoki Narisawa, Rintaro Ikeshita, Norihiro Takamune, Daichi Kitamura, Tomohiko Nakamura, Hiroshi Saruwatari, Tomohiro Nakatani

Abstract: We address the determined audio source separation problem in the time-frequency domain. In independent deeply learned matrix analysis (IDLMA), it is assumed that the inter-frequency correlation of each source spectrum is zero, which is inappropriate for modeling nonstationary signals such as music signals. To account for the correlation between frequencies, independent positive semidefinite tensor analysis has been proposed. This unsupervised (blind) method, however, severely restricts the structure of frequency covariance matrices (FCMs) to reduce the number of model parameters. As an extension of these conventional approaches, we here propose a supervised method that models FCMs using deep neural networks (DNNs). It is difficult to directly infer FCMs using DNNs. Therefore, we also propose a new FCM model represented as a convex combination of a diagonal FCM and a rank-1 FCM. Our FCM model is flexible enough to not only consider inter-frequency correlation, but also capture the dynamics of time-varying FCMs of nonstationary signals. We infer the proposed FCMs using two DNNs: DNN for power spectrum estimation and DNN for time-domain signal estimation. An experimental result of separating music signals shows that the proposed method provides higher separation performance than IDLMA.

Submitted 10 June, 2021; originally announced June 2021.

Comments: 5 pages, 2 figures, accepted for European Signal Processing Conference 2021 (EUSIPCO 2021)

arXiv:2102.04696 [pdf, other]

doi 10.1109/LSP.2021.3074321

Independent Vector Extraction for Fast Joint Blind Source Separation and Dereverberation

Authors: Rintaro Ikeshita, Tomohiro Nakatani

Abstract: We address a blind source separation (BSS) problem in a noisy reverberant environment in which the number of microphones $M$ is greater than the number of sources of interest, and the other noise components can be approximated as stationary and Gaussian distributed. Conventional BSS algorithms for the optimization of a multi-input multi-output convolutional beamformer have suffered from a huge computational cost when $M$ is large. We here propose a computationally efficient method that integrates a weighted prediction error (WPE) dereverberation method and a fast BSS method called independent vector extraction (IVE), which has been developed for less reverberant environments. We show that, given the power spectrum for each source, the optimization problem of the new method can be reduced to that of IVE by exploiting the stationary condition, which makes the optimization easy to handle and computationally efficient. An experiment of speech signal separation shows that, compared to a conventional method that integrates WPE and independent vector analysis, our proposed method achieves much faster convergence while maintaining its separation performance.

Submitted 21 April, 2021; v1 submitted 9 February, 2021; originally announced February 2021.

Comments: Accepted to IEEE Signal Processing Letters

arXiv:2101.08563 [pdf, ps, other]

A Joint Diagonalization Based Efficient Approach to Underdetermined Blind Audio Source Separation Using the Multichannel Wiener Filter

Authors: Nobutaka Ito, Rintaro Ikeshita, Hiroshi Sawada, Tomohiro Nakatani

Abstract: This paper presents a computationally efficient approach to blind source separation (BSS) of audio signals, applicable even when there are more sources than microphones (i.e., the underdetermined case). When there are as many sources as microphones (i.e., the determined case), BSS can be performed computationally efficiently by independent component analysis (ICA). Unfortunately, however, ICA is basically inapplicable to the underdetermined case. Another BSS approach using the multichannel Wiener filter (MWF) is applicable even to this case, and encompasses full-rank spatial covariance analysis (FCA) and multichannel non-negative matrix factorization (MNMF). However, these methods require massive numbers of matrix inversions to design the MWF, and are thus computationally inefficient. To overcome this drawback, we exploit the well-known property of diagonal matrices that matrix inversion amounts to mere inversion of the diagonal elements and can thus be performed computationally efficiently. This makes it possible to drastically reduce the computational cost of the above matrix inversions based on a joint diagonalization (JD) idea, leading to computationally efficient BSS. Specifically, we restrict the N spatial covariance matrices (SCMs) of all N sources to a class of (exactly) jointly diagonalizable matrices. Based on this approach, we present FastFCA, a computationally efficient extension of FCA. We also present a unified framework for underdetermined and determined audio BSS, which highlights a theoretical connection between FastFCA and other methods. Moreover, we reveal that FastFCA can be regarded as a regularized version of approximate joint diagonalization (AJD).

Submitted 21 January, 2021; originally announced January 2021.

Comments: submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing
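
The joint diagonalization argument in the abstract above rests on a simple linear-algebra fact, which the following NumPy sketch checks numerically. It is only an illustration under stated assumptions, not the FastFCA algorithm: the parameterization $R_n = P\,\mathrm{diag}(d_n)\,P^h$ with a shared invertible matrix `P`, the random toy values, and all variable names are chosen here for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 3, 4   # M microphones, N sources (underdetermined: N > M)

# Shared invertible matrix P; the 3*I shift just keeps this toy example well conditioned.
P = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M)) + 3.0 * np.eye(M)
# Positive diagonal entries d_n for each source.
d = rng.uniform(0.5, 2.0, size=(N, M))

# Jointly diagonalizable SCMs: R_n = P diag(d_n) P^h.
scms = np.stack([P @ np.diag(d[n]) @ P.conj().T for n in range(N)])

# Mixture covariance as the sum of the source SCMs.
mix_cov = scms.sum(axis=0)

# Direct route: a full M x M matrix inversion of the mixture covariance.
direct_inverse = np.linalg.inv(mix_cov)

# Joint-diagonalization route: invert P once, then only invert diagonal entries.
P_inv = np.linalg.inv(P)
fast_inverse = P_inv.conj().T @ np.diag(1.0 / d.sum(axis=0)) @ P_inv
```

In this toy setting `P` is inverted once, after which any weighted sum of the SCMs can be inverted by taking reciprocals of a diagonal. Avoiding repeated full inversions in this way is the kind of saving the abstract attributes to the JD idea.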

arXiv:2101.04315 [pdf, ps, other]

Neural Network-based Virtual Microphone Estimator

Authors: Tsubasa Ochiai, Marc Delcroix, Tomohiro Nakatani, Rintaro Ikeshita, Keisuke Kinoshita, Shoko Araki

Abstract: Developing microphone array technologies for a small number of microphones is important due to the constraints of many devices. One direction to address this situation consists of virtually augmenting the number of microphone signals, e.g., based on several physical model assumptions. However, such assumptions are not necessarily met in realistic conditions. In this paper, as an alternative approach, we propose a neural network-based virtual microphone estimator (NN-VME). The NN-VME estimates virtual microphone signals directly in the time domain, by utilizing the precise estimation capability of the recent time-domain neural networks. We adopt a fully supervised learning framework that uses actual observations at the locations of the virtual microphones at training time. Consequently, the NN-VME can be trained using only multi-channel observations and thus directly on real recordings, avoiding the need for unrealistic physical model-based assumptions. Experiments on the CHiME-4 corpus show that the proposed NN-VME achieves high virtual microphone estimation performance even for real recordings and that a beamformer augmented with the NN-VME improves both the speech enhancement and recognition performance.

Submitted 12 January, 2021; originally announced January 2021.

Comments: 5 pages, 2 figures, submitted to ICASSP 2021

arXiv:2010.08959 [pdf, other]

doi 10.1109/TSP.2021.3076884

Block Coordinate Descent Algorithms for Auxiliary-Function-Based Independent Vector Extraction

Authors: Rintaro Ikeshita, Tomohiro Nakatani, Shoko Araki

Abstract: In this paper, we address the problem of extracting all super-Gaussian source signals from a linear mixture in which (i) the number of super-Gaussian sources $K$ is less than that of sensors $M$, and (ii) there are up to $M - K$ stationary Gaussian noises that do not need to be extracted. To solve this problem, independent vector extraction (IVE) using majorization-minimization and block coordinate descent (BCD) algorithms has been developed, attaining robust source extraction and low computational cost. We here improve the conventional BCDs for IVE by carefully exploiting the stationarity of the Gaussian noise components. We also newly develop a BCD for a semiblind IVE in which the transfer functions for several super-Gaussian sources are given a priori. Both algorithms consist of a closed-form formula and a generalized eigenvalue decomposition. In a numerical experiment of extracting speech signals from noisy mixtures, we show that when $K = 1$ in a blind case or at least $K - 1$ transfer functions are given in a semiblind case, the convergence of our proposed BCDs is significantly faster than those of the conventional ones.

Submitted 3 May, 2021; v1 submitted 18 October, 2020; originally announced October 2020.

Comments: Accepted by IEEE Transactions on Signal Processing

arXiv:2005.09843 [pdf, ps, other]

doi 10.1109/TASLP.2020.3013118

Jointly optimal denoising, dereverberation, and source separation

Authors: Tomohiro Nakatani, Christoph Boeddeker, Keisuke Kinoshita, Rintaro Ikeshita, Marc Delcroix, Reinhold Haeb-Umbach

Abstract: This paper proposes methods that can optimize a Convolutional BeamFormer (CBF) for jointly performing denoising, dereverberation, and source separation (DN+DR+SS) in a computationally efficient way. Conventionally, a cascade configuration composed of a Weighted Prediction Error minimization (WPE) dereverberation filter followed by a Minimum Variance Distortionless Response beamformer has been used as the state-of-the-art frontend of far-field speech recognition; however, overall optimality of this approach is not guaranteed. In the blind signal processing area, an approach for jointly optimizing dereverberation and source separation (DR+SS) has been proposed; however, this approach requires huge computing cost, and has not been extended for application to DN+DR+SS. To overcome the above limitations, this paper develops new approaches for jointly optimizing DN+DR+SS in a computationally much more efficient way. To this end, we first present an objective function to optimize a CBF for performing DN+DR+SS based on the maximum likelihood estimation, on an assumption that the steering vectors of the target signals are given or can be estimated, e.g., using a neural network. This paper refers to a CBF optimized by this objective function as a weighted Minimum-Power Distortionless Response (wMPDR) CBF. Then, we derive two algorithms for optimizing a wMPDR CBF based on two different ways of factorizing a CBF into WPE filters and beamformers. Experiments using noisy reverberant sound mixtures show that the proposed optimization approaches greatly improve the performance of the speech enhancement in comparison with the conventional cascade configuration in terms of the signal distortion measures and ASR performance. It is also shown that the proposed approaches can greatly reduce the computing cost with improved estimation accuracy in comparison with the conventional joint optimization approach.

Submitted 2 August, 2020; v1 submitted 20 May, 2020; originally announced May 2020.

Comments: Submitted to IEEE/ACM Trans. Audio, Speech, and Language Processing on 12 Feb 2020, Accepted to IEEE/ACM Trans. Audio, Speech, and Language Processing on 14 July 2020

arXiv:2003.02458 [pdf, ps, other]

doi 10.1109/ICASSP40776.2020.9053790

Overdetermined independent vector analysis

Authors: Rintaro Ikeshita, Tomohiro Nakatani, Shoko Araki

Abstract: We address the convolutive blind source separation problem for the (over-)determined case where (i) the number of nonstationary target-sources $K$ is less than that of microphones $M$, and (ii) there are up to $M - K$ stationary Gaussian noises that need not be extracted. Independent vector analysis (IVA) can solve the problem by separating into $M$ sources and selecting the top $K$ highly nonstationary signals among them, but this approach suffers from a waste of computation especially when $K \ll M$. Channel reductions in preprocessing of IVA by, e.g., principal component analysis have the risk of removing the target signals. We here extend IVA to resolve these issues. One such extension has been attained by assuming the orthogonality constraint (OC) that the sample correlation between the target and noise signals is to be zero. The proposed IVA, on the other hand, does not rely on OC and exploits only the independence between sources and the stationarity of the noises. This enables us to develop several efficient algorithms based on block coordinate descent methods with a problem specific acceleration. We clarify that one such algorithm exactly coincides with the conventional IVA with OC, and also explain that the other newly developed algorithms are faster than it. Experimental results show the improved computational load of the new algorithms compared to the conventional methods. In particular, a new algorithm specialized for $K = 1$ outperforms the others.

Submitted 5 March, 2020; originally announced March 2020.

Comments: To appear at the 45th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2020)

arXiv:2002.08582 [pdf, ps, other]

Convergence-guaranteed Independent Positive Semidefinite Tensor Analysis Based on Student's t Distribution

Authors: Tatsuki Kondo, Kanta Fukushige, Norihiro Takamune, Daichi Kitamura, Hiroshi Saruwatari, Rintaro Ikeshita, Tomohiro Nakatani

Abstract: In this paper, we address a blind source separation (BSS) problem and propose a new extended framework of independent positive semidefinite tensor analysis (IPSDTA). IPSDTA is a state-of-the-art BSS method that enables us to take interfrequency correlations into account, but the generative model is limited within the multivariate Gaussian distribution and its parameter optimization algorithm does not guarantee stable convergence. To resolve these problems, first, we propose to extend the generative model to a parametric multivariate Student's t distribution that can deal with various types of signal. Secondly, we derive a new parameter optimization algorithm that guarantees the monotonic nonincrease in the cost function, providing stable convergence. Experimental results reveal that the cost function in the conventional IPSDTA does not display monotonically nonincreasing properties. On the other hand, the proposed method guarantees the monotonic nonincrease in the cost function and outperforms the conventional ILRMA and IPSDTA in the source-separation performance.

Submitted 20 February, 2020; originally announced February 2020.

Comments: 5 pages, 3 figures, to appear in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2020
