Search | arXiv e-print repository

Mutual Learning of Single- and Multi-Channel End-to-End Neural Diarization

Authors: Shota Horiguchi, Yuki Takashima, Shinji Watanabe, Paola Garcia

Abstract: Due to the high performance of multi-channel speech processing, we can use the outputs from a multi-channel model as teacher labels when training a single-channel model with knowledge distillation. To the contrary, it is also known that single-channel speech data can benefit multi-channel models by mixing it with multi-channel speech data during training or by using it for model pretraining. This… ▽ More Due to the high performance of multi-channel speech processing, we can use the outputs from a multi-channel model as teacher labels when training a single-channel model with knowledge distillation. To the contrary, it is also known that single-channel speech data can benefit multi-channel models by mixing it with multi-channel speech data during training or by using it for model pretraining. This paper focuses on speaker diarization and proposes to conduct the above bi-directional knowledge transfer alternately. We first introduce an end-to-end neural diarization model that can handle both single- and multi-channel inputs. Using this model, we alternately conduct i) knowledge distillation from a multi-channel model to a single-channel model and ii) finetuning from the distilled single-channel model to a multi-channel model. Experimental results on two-speaker data show that the proposed method mutually improved single- and multi-channel speaker diarization performances. △ Less

Submitted 7 October, 2022; originally announced October 2022.

Comments: Accepted to IEEE SLT 2022

arXiv:2207.00216 [pdf, other]

Updating Only Encoders Prevents Catastrophic Forgetting of End-to-End ASR Models

Authors: Yuki Takashima, Shota Horiguchi, Shinji Watanabe, Paola García, Yohei Kawaguchi

Abstract: In this paper, we present an incremental domain adaptation technique to prevent catastrophic forgetting for an end-to-end automatic speech recognition (ASR) model. Conventional approaches require extra parameters of the same size as the model for optimization, and it is difficult to apply these approaches to end-to-end ASR models because they have a huge amount of parameters. To solve this problem… ▽ More In this paper, we present an incremental domain adaptation technique to prevent catastrophic forgetting for an end-to-end automatic speech recognition (ASR) model. Conventional approaches require extra parameters of the same size as the model for optimization, and it is difficult to apply these approaches to end-to-end ASR models because they have a huge amount of parameters. To solve this problem, we first investigate which parts of end-to-end ASR models contribute to high accuracy in the target domain while preventing catastrophic forgetting. We conduct experiments on incremental domain adaptation from the LibriSpeech dataset to the AMI meeting corpus with two popular end-to-end ASR models and found that adapting only the linear layers of their encoders can prevent catastrophic forgetting. Then, on the basis of this finding, we develop an element-wise parameter selection focused on specific layers to further reduce the number of fine-tuning parameters. Experimental results show that our approach consistently prevents catastrophic forgetting compared to parameter selection from the whole model. △ Less

Submitted 1 July, 2022; originally announced July 2022.

Comments: Accepted for Interspeech 2022

arXiv:2206.02432 [pdf, other]

doi 10.1109/TASLP.2022.3233237

Online Neural Diarization of Unlimited Numbers of Speakers Using Global and Local Attractors

Authors: Shota Horiguchi, Shinji Watanabe, Paola Garcia, Yuki Takashima, Yohei Kawaguchi

Abstract: A method to perform offline and online speaker diarization for an unlimited number of speakers is described in this paper. End-to-end neural diarization (EEND) has achieved overlap-aware speaker diarization by formulating it as a multi-label classification problem. It has also been extended for a flexible number of speakers by introducing speaker-wise attractors. However, the output number of spea… ▽ More A method to perform offline and online speaker diarization for an unlimited number of speakers is described in this paper. End-to-end neural diarization (EEND) has achieved overlap-aware speaker diarization by formulating it as a multi-label classification problem. It has also been extended for a flexible number of speakers by introducing speaker-wise attractors. However, the output number of speakers of attractor-based EEND is empirically capped; it cannot deal with cases where the number of speakers appearing during inference is higher than that during training because its speaker counting is trained in a fully supervised manner. Our method, EEND-GLA, solves this problem by introducing unsupervised clustering into attractor-based EEND. In the method, the input audio is first divided into short blocks, then attractor-based diarization is performed for each block, and finally, the results of each block are clustered on the basis of the similarity between locally-calculated attractors. While the number of output speakers is limited within each block, the total number of speakers estimated for the entire input can be higher than the limitation. To use EEND-GLA in an online manner, our method also extends the speaker-tracing buffer, which was originally proposed to enable online inference of conventional EEND. We introduce a block-wise buffer update to make the speaker-tracing buffer compatible with EEND-GLA. Finally, to improve online diarization, our method improves the buffer update method and revisits the variable chunk-size training of EEND. The experimental results demonstrate that EEND-GLA can perform speaker diarization of an unseen number of speakers in both offline and online inferences. △ Less

Submitted 22 December, 2022; v1 submitted 6 June, 2022; originally announced June 2022.

Comments: Accepted to IEEE/ACM TASLP

Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 706-720, 2023

arXiv:2203.12561 [pdf, other]

Scatter Ptychography

Authors: Qian Huang, Zhipeng Dong, Yuzuru Takashima, Timothy J. Schulz, David J. Brady

Abstract: Coherent illumination reflected by a remote target may be secondarily scattered by intermediate objects or materials. Here we show that phase retrieval on remotely observed images of such scattered fields enables imaging of the illuminated object at resolution proportional to $λR_s/A_s$, where $R_s$ is the range between the scatterer and the target and $A_s$ is the diameter of the observed scatter… ▽ More Coherent illumination reflected by a remote target may be secondarily scattered by intermediate objects or materials. Here we show that phase retrieval on remotely observed images of such scattered fields enables imaging of the illuminated object at resolution proportional to $λR_s/A_s$, where $R_s$ is the range between the scatterer and the target and $A_s$ is the diameter of the observed scatter. This resolution may exceed the resolution of directly viewing the target by the factor $R_cA_s/R_sA_c$, where $R_c$ is the range between the observer and the target and $A_c$ is the observing aperture. Here we use this technique to demonstrate $\approx 32\times$ resolution improvement relative to direct imaging. △ Less

Submitted 23 March, 2022; originally announced March 2022.

arXiv:2111.03517 [pdf, other]

doi 10.1364/OE.447499

Snapshot Ptychography on Array cameras

Authors: Chengyu Wang, Minghao Hu, Yuzuru Takashima, Timothy J. Schulz, David J. Brady

Abstract: We use convolutional neural networks to recover images optically down-sampled by $6.7\times$ using coherent aperture synthesis over a 16 camera array. Where conventional ptychography relies on scanning and oversampling, here we apply decompressive neural estimation to recover full resolution image from a single snapshot, although as shown in simulation multiple snapshots can be used to improve SNR… ▽ More We use convolutional neural networks to recover images optically down-sampled by $6.7\times$ using coherent aperture synthesis over a 16 camera array. Where conventional ptychography relies on scanning and oversampling, here we apply decompressive neural estimation to recover full resolution image from a single snapshot, although as shown in simulation multiple snapshots can be used to improve SNR. In place training on experimental measurements eliminates the need to directly calibrate the measurement system. We also present simulations of diverse array camera sampling strategies to explore how snapshot compressive systems might be optimized. △ Less

Submitted 5 November, 2021; originally announced November 2021.

arXiv:2110.04694 [pdf, other]

Multi-Channel End-to-End Neural Diarization with Distributed Microphones

Authors: Shota Horiguchi, Yuki Takashima, Paola Garcia, Shinji Watanabe, Yohei Kawaguchi

Abstract: Recent progress on end-to-end neural diarization (EEND) has enabled overlap-aware speaker diarization with a single neural network. This paper proposes to enhance EEND by using multi-channel signals from distributed microphones. We replace Transformer encoders in EEND with two types of encoders that process a multi-channel input: spatio-temporal and co-attention encoders. Both are independent of t… ▽ More Recent progress on end-to-end neural diarization (EEND) has enabled overlap-aware speaker diarization with a single neural network. This paper proposes to enhance EEND by using multi-channel signals from distributed microphones. We replace Transformer encoders in EEND with two types of encoders that process a multi-channel input: spatio-temporal and co-attention encoders. Both are independent of the number and geometry of microphones and suitable for distributed microphone settings. We also propose a model adaptation method using only single-channel recordings. With simulated and real-recorded datasets, we demonstrated that the proposed method outperformed conventional EEND when a multi-channel input was given while maintaining comparable performance with a single-channel input. We also showed that the proposed method performed well even when spatial information is inoperative given multi-channel inputs, such as in hybrid meetings in which the utterances of multiple remote participants are played back from the same loudspeaker. △ Less

Submitted 28 March, 2022; v1 submitted 9 October, 2021; originally announced October 2021.

Comments: Accepted to ICASSP 2022

arXiv:2107.01545 [pdf, other]

Towards Neural Diarization for Unlimited Numbers of Speakers Using Global and Local Attractors

Authors: Shota Horiguchi, Shinji Watanabe, Paola Garcia, Yawen Xue, Yuki Takashima, Yohei Kawaguchi

Abstract: Attractor-based end-to-end diarization is achieving comparable accuracy to the carefully tuned conventional clustering-based methods on challenging datasets. However, the main drawback is that it cannot deal with the case where the number of speakers is larger than the one observed during training. This is because its speaker counting relies on supervised learning. In this work, we introduce an un… ▽ More Attractor-based end-to-end diarization is achieving comparable accuracy to the carefully tuned conventional clustering-based methods on challenging datasets. However, the main drawback is that it cannot deal with the case where the number of speakers is larger than the one observed during training. This is because its speaker counting relies on supervised learning. In this work, we introduce an unsupervised clustering process embedded in the attractor-based end-to-end diarization. We first split a sequence of frame-wise embeddings into short subsequences and then perform attractor-based diarization for each subsequence. Given subsequence-wise diarization results, inter-subsequence speaker correspondence is obtained by unsupervised clustering of the vectors computed from the attractors from all the subsequences. This makes it possible to produce diarization results of a large number of speakers for the whole recording even if the number of output speakers for each subsequence is limited. Experimental results showed that our method could produce accurate diarization results of an unseen number of speakers. Our method achieved 11.84 %, 28.33 %, and 19.49 % on the CALLHOME, DIHARD II, and DIHARD III datasets, respectively, each of which is better than the conventional end-to-end diarization methods. △ Less

Submitted 23 September, 2021; v1 submitted 4 July, 2021; originally announced July 2021.

Comments: Accepted to ASRU 2021

arXiv:2106.04764 [pdf, other]

Semi-Supervised Training with Pseudo-Labeling for End-to-End Neural Diarization

Authors: Yuki Takashima, Yusuke Fujita, Shota Horiguchi, Shinji Watanabe, Paola García, Kenji Nagamatsu

Abstract: In this paper, we present a semi-supervised training technique using pseudo-labeling for end-to-end neural diarization (EEND). The EEND system has shown promising performance compared with traditional clustering-based methods, especially in the case of overlapping speech. However, to get a well-tuned model, EEND requires labeled data for all the joint speech activities of every speaker at each tim… ▽ More In this paper, we present a semi-supervised training technique using pseudo-labeling for end-to-end neural diarization (EEND). The EEND system has shown promising performance compared with traditional clustering-based methods, especially in the case of overlapping speech. However, to get a well-tuned model, EEND requires labeled data for all the joint speech activities of every speaker at each time frame in a recording. In this paper, we explore a pseudo-labeling approach that employs unlabeled data. First, we propose an iterative pseudo-label method for EEND, which trains the model using unlabeled data of a target condition. Then, we also propose a committee-based training method to improve the performance of EEND. To evaluate our proposed method, we conduct the experiments of model adaptation using labeled and unlabeled data. Experimental results on the CALLHOME dataset show that our proposed pseudo-label achieved a 37.4% relative diarization error rate reduction compared to a seed model. Moreover, we analyzed the results of semi-supervised adaptation with pseudo-labeling. We also show the effectiveness of our approach on the third DIHARD dataset. △ Less

Submitted 8 June, 2021; originally announced June 2021.

Comments: Accepted for Interspeech 2021

arXiv:2106.04078 [pdf, other]

End-to-End Speaker Diarization Conditioned on Speech Activity and Overlap Detection

Authors: Yuki Takashima, Yusuke Fujita, Shinji Watanabe, Shota Horiguchi, Paola García, Kenji Nagamatsu

Abstract: In this paper, we present a conditional multitask learning method for end-to-end neural speaker diarization (EEND). The EEND system has shown promising performance compared with traditional clustering-based methods, especially in the case of overlapping speech. In this paper, to further improve the performance of the EEND system, we propose a novel multitask learning framework that solves speaker… ▽ More In this paper, we present a conditional multitask learning method for end-to-end neural speaker diarization (EEND). The EEND system has shown promising performance compared with traditional clustering-based methods, especially in the case of overlapping speech. In this paper, to further improve the performance of the EEND system, we propose a novel multitask learning framework that solves speaker diarization and a desired subtask while explicitly considering the task dependency. We optimize speaker diarization conditioned on speech activity and overlap detection that are subtasks of speaker diarization, based on the probabilistic chain rule. Experimental results show that our proposed method can leverage a subtask to effectively model speaker diarization, and outperforms conventional EEND systems in terms of diarization error rate. △ Less

Submitted 7 June, 2021; originally announced June 2021.

Comments: Accepted for SLT 2021

Journal ref: IEEE Spoken Language Technology Workshop (SLT), 2021, pp. 849-856

arXiv:2102.01363 [pdf, other]

The Hitachi-JHU DIHARD III System: Competitive End-to-End Neural Diarization and X-Vector Clustering Systems Combined by DOVER-Lap

Authors: Shota Horiguchi, Nelson Yalta, Paola Garcia, Yuki Takashima, Yawen Xue, Desh Raj, Zili Huang, Yusuke Fujita, Shinji Watanabe, Sanjeev Khudanpur

Abstract: This paper provides a detailed description of the Hitachi-JHU system that was submitted to the Third DIHARD Speech Diarization Challenge. The system outputs the ensemble results of the five subsystems: two x-vector-based subsystems, two end-to-end neural diarization-based subsystems, and one hybrid subsystem. We refine each system and all five subsystems become competitive and complementary. After… ▽ More This paper provides a detailed description of the Hitachi-JHU system that was submitted to the Third DIHARD Speech Diarization Challenge. The system outputs the ensemble results of the five subsystems: two x-vector-based subsystems, two end-to-end neural diarization-based subsystems, and one hybrid subsystem. We refine each system and all five subsystems become competitive and complementary. After the DOVER-Lap based system combination, it achieved diarization error rates of 11.58 % and 14.09 % in Track 1 full and core, and 16.94 % and 20.01 % in Track 2 full and core, respectively. With their results, we won second place in all the tasks of the challenge. △ Less

Submitted 2 February, 2021; originally announced February 2021.

arXiv:2101.08473 [pdf, other]

Online Streaming End-to-End Neural Diarization Handling Overlapping Speech and Flexible Numbers of Speakers

Authors: Yawen Xue, Shota Horiguchi, Yusuke Fujita, Yuki Takashima, Shinji Watanabe, Paola Garcia, Kenji Nagamatsu

Abstract: We propose a streaming diarization method based on an end-to-end neural diarization (EEND) model, which handles flexible numbers of speakers and overlapping speech. In our previous study, the speaker-tracing buffer (STB) mechanism was proposed to achieve a chunk-wise streaming diarization using a pre-trained EEND model. STB traces the speaker information in previous chunks to map the speakers in a… ▽ More We propose a streaming diarization method based on an end-to-end neural diarization (EEND) model, which handles flexible numbers of speakers and overlapping speech. In our previous study, the speaker-tracing buffer (STB) mechanism was proposed to achieve a chunk-wise streaming diarization using a pre-trained EEND model. STB traces the speaker information in previous chunks to map the speakers in a new chunk. However, it only worked with two-speaker recordings. In this paper, we propose an extended STB for flexible numbers of speakers, FLEX-STB. The proposed method uses a zero-padding followed by speaker-tracing, which alleviates the difference in the number of speakers between a buffer and a current chunk. We also examine buffer update strategies to select important frames for tracing multiple speakers. Experiments on CALLHOME and DIHARD II datasets show that the proposed method achieves comparable performance to the offline EEND method with 1-second latency. The results also show that our proposed method outperforms recently proposed chunk-wise diarization methods based on EEND (BW-EDA-EEND). △ Less

Submitted 6 April, 2021; v1 submitted 21 January, 2021; originally announced January 2021.

Showing 1–11 of 11 results for author: Takashima, Y