-
Single-Channel Target Speech Extraction Utilizing Distance and Room Clues
Authors:
Runwu Shi,
Zirui Lin,
Benjamin Yen,
Jiang Wang,
Ragib Amin Nihal,
Kazuhiro Nakadai
Abstract:
This paper aims to achieve single-channel target speech extraction (TSE) in enclosures utilizing distance clues and room information. Recent works have verified the feasibility of distance clues for the TSE task, which can imply the sound source's direct-to-reverberation ratio (DRR) and thus can be utilized for speech separation and TSE systems. However, such distance clue is significantly influen…
▽ More
This paper aims to achieve single-channel target speech extraction (TSE) in enclosures utilizing distance clues and room information. Recent works have verified the feasibility of distance clues for the TSE task, which can imply the sound source's direct-to-reverberation ratio (DRR) and thus can be utilized for speech separation and TSE systems. However, such distance clue is significantly influenced by the room's acoustic characteristics, such as dimension and reverberation time, making it challenging for TSE systems that rely solely on distance clues to generalize across a variety of different rooms. To solve this, we suggest providing room environmental information (room dimensions and reverberation time) for distance-based TSE for better generalization capabilities. Especially, we propose a distance and environment-based TSE model in the time-frequency (TF) domain with learnable distance and room embedding. Results on both simulated and real collected datasets demonstrate its feasibility. Demonstration materials are available at https://runwushi.github.io/distance-room-demo-page/.
△ Less
Submitted 20 May, 2025;
originally announced May 2025.
-
An Efficient GPU-based Implementation for Noise Robust Sound Source Localization
Authors:
Zirui Lin,
Masayuki Takigahira,
Naoya Terakado,
Haris Gulzar,
Monikka Roslianna Busto,
Takeharu Eda,
Katsutoshi Itoyama,
Kazuhiro Nakadai,
Hideharu Amano
Abstract:
Robot audition, encompassing Sound Source Localization (SSL), Sound Source Separation (SSS), and Automatic Speech Recognition (ASR), enables robots and smart devices to acquire auditory capabilities similar to human hearing. Despite their wide applicability, processing multi-channel audio signals from microphone arrays in SSL involves computationally intensive matrix operations, which can hinder e…
▽ More
Robot audition, encompassing Sound Source Localization (SSL), Sound Source Separation (SSS), and Automatic Speech Recognition (ASR), enables robots and smart devices to acquire auditory capabilities similar to human hearing. Despite their wide applicability, processing multi-channel audio signals from microphone arrays in SSL involves computationally intensive matrix operations, which can hinder efficient deployment on Central Processing Units (CPUs), particularly in embedded systems with limited CPU resources. This paper introduces a GPU-based implementation of SSL for robot audition, utilizing the Generalized Singular Value Decomposition-based Multiple Signal Classification (GSVD-MUSIC), a noise-robust algorithm, within the HARK platform, an open-source software suite. For a 60-channel microphone array, the proposed implementation achieves significant performance improvements. On the Jetson AGX Orin, an embedded device powered by an NVIDIA GPU and ARM Cortex-A78AE v8.2 64-bit CPUs, we observe speedups of 5648.7x for GSVD calculations and 10.7x for the SSL module, while speedups of 4245.1x for GSVD calculation and 17.3x for the entire SSL module on a server configured with an NVIDIA A100 GPU and AMD EPYC 7352 CPUs, making real-time processing feasible for large-scale microphone arrays and providing ample capacity for real-time processing of potential subsequent machine learning or deep learning tasks.
△ Less
Submitted 8 May, 2025; v1 submitted 4 April, 2025;
originally announced April 2025.
-
Weakly Supervised Multiple Instance Learning for Whale Call Detection and Localization in Long-Duration Passive Acoustic Monitoring
Authors:
Ragib Amin Nihal,
Benjamin Yen,
Runwu Shi,
Kazuhiro Nakadai
Abstract:
Marine ecosystem monitoring via Passive Acoustic Monitoring (PAM) generates vast data, but deep learning often requires precise annotations and short segments. We introduce DSMIL-LocNet, a Multiple Instance Learning framework for whale call detection and localization using only bag-level labels. Our dual-stream model processes 2-30 minute audio segments, leveraging spectral and temporal features w…
▽ More
Marine ecosystem monitoring via Passive Acoustic Monitoring (PAM) generates vast data, but deep learning often requires precise annotations and short segments. We introduce DSMIL-LocNet, a Multiple Instance Learning framework for whale call detection and localization using only bag-level labels. Our dual-stream model processes 2-30 minute audio segments, leveraging spectral and temporal features with attention-based instance selection. Tests on Antarctic whale data show longer contexts improve classification (F1: 0.8-0.9) while medium instances ensure localization precision (0.65-0.70). This suggests MIL can enhance scalable marine monitoring. Code: https://github.com/Ragib-Amin-Nihal/DSMIL-Loc
△ Less
Submitted 28 February, 2025;
originally announced February 2025.
-
Bird Vocalization Embedding Extraction Using Self-Supervised Disentangled Representation Learning
Authors:
Runwu Shi,
Katsutoshi Itoyama,
Kazuhiro Nakadai
Abstract:
This paper addresses the extraction of the bird vocalization embedding from the whole song level using disentangled representation learning (DRL). Bird vocalization embeddings are necessary for large-scale bioacoustic tasks, and self-supervised methods such as Variational Autoencoder (VAE) have shown their performance in extracting such low-dimensional embeddings from vocalization segments on the…
▽ More
This paper addresses the extraction of the bird vocalization embedding from the whole song level using disentangled representation learning (DRL). Bird vocalization embeddings are necessary for large-scale bioacoustic tasks, and self-supervised methods such as Variational Autoencoder (VAE) have shown their performance in extracting such low-dimensional embeddings from vocalization segments on the note or syllable level. To extend the processing level to the entire song instead of cutting into segments, this paper regards each vocalization as the generalized and discriminative part and uses two encoders to learn these two parts. The proposed method is evaluated on the Great Tits dataset according to the clustering performance, and the results outperform the compared pre-trained models and vanilla VAE. Finally, this paper analyzes the informative part of the embedding, further compresses its dimension, and explains the disentangled performance of bird vocalizations.
△ Less
Submitted 28 December, 2024;
originally announced December 2024.
-
Distance Based Single-Channel Target Speech Extraction
Authors:
Runwu Shi,
Benjamin Yen,
Kazuhiro Nakadai
Abstract:
This paper aims to achieve single-channel target speech extraction (TSE) in enclosures by solely utilizing distance information. This is the first work that utilizes only distance cues without using speaker physiological information for single-channel TSE. Inspired by recent single-channel Distance-based separation and extraction methods, we introduce a novel model that efficiently fuses distance…
▽ More
This paper aims to achieve single-channel target speech extraction (TSE) in enclosures by solely utilizing distance information. This is the first work that utilizes only distance cues without using speaker physiological information for single-channel TSE. Inspired by recent single-channel Distance-based separation and extraction methods, we introduce a novel model that efficiently fuses distance information with time-frequency (TF) bins for TSE. Experimental results in both single-room and multi-room scenarios demonstrate the feasibility and effectiveness of our approach. This method can also be employed to estimate the distances of different speakers in mixed speech. Online demos are available at https://runwushi.github.io/distance-demo-page.
△ Less
Submitted 28 December, 2024;
originally announced December 2024.
-
Can all variations within the unified mask-based beamformer framework achieve identical peak extraction performance?
Authors:
Atsuo Hiroe,
Katsutoshi Itoyama,
Kazuhiro Nakadai
Abstract:
This study investigates mask-based beamformers (BFs), which estimate filters for target sound extraction (TSE) using time-frequency masks. Although multiple mask-based BFs have been proposed, no consensus has been reached on which one offers the best target-extraction performance. Previously, we found that maximum signal-to-noise ratio and minimum mean square error (MSE) BFs can achieve the same e…
▽ More
This study investigates mask-based beamformers (BFs), which estimate filters for target sound extraction (TSE) using time-frequency masks. Although multiple mask-based BFs have been proposed, no consensus has been reached on which one offers the best target-extraction performance. Previously, we found that maximum signal-to-noise ratio and minimum mean square error (MSE) BFs can achieve the same extraction performance as the theoretical upper-bound performance, with each BF containing a different optimal mask. However, two issues remained unsolved: only two BFs were covered, excluding the minimum variance distortionless response BF, and ideal scaling (IS) was employed to ideally adjust the output scale, which is not applicable to realistic scenarios. To address these issues, this study proposes a unified framework for mask-based BFs comprising two processes: filter estimation that can cover all possible BFs and scaling applicable to realistic scenarios by employing a mask to generate a scaling reference. Based on the operators and covariance matrices used in BF formulas, all possible BFs can be classified into 12 variations, including two new ones. Optimal masks for both processes are obtained by minimizing the MSE between the target and BF output. The experimental results using the CHiME-4 dataset suggested that 1) all 12 variations can achieve the theoretical upper-bound performance, and 2) mask-based scaling can behave like IS, even when constraining the temporal mean of a non-negative mask to one. These results can be explained by considering the practical parameter count of the masks. These findings contribute to 1) designing a TSE system, 2) improving scaling accuracy through mask-based scaling, and 3) estimating the extraction performance of a BF.
△ Less
Submitted 22 February, 2025; v1 submitted 21 July, 2024;
originally announced July 2024.
-
Is the Ideal Ratio Mask Really the Best? -- Exploring the Best Extraction Performance and Optimal Mask of Mask-based Beamformers
Authors:
Atsuo Hiroe,
Katsutoshi Itoyama,
Kazuhiro Nakadai
Abstract:
This study investigates mask-based beamformers (BFs), which estimate filters to extract target speech using time-frequency masks. Although several BF methods have been proposed, the following aspects are yet to be comprehensively investigated. 1) Which BF can provide the best extraction performance in terms of the closeness of the BF output to the target speech? 2) Is the optimal mask for the best…
▽ More
This study investigates mask-based beamformers (BFs), which estimate filters to extract target speech using time-frequency masks. Although several BF methods have been proposed, the following aspects are yet to be comprehensively investigated. 1) Which BF can provide the best extraction performance in terms of the closeness of the BF output to the target speech? 2) Is the optimal mask for the best performance common for all BFs? 3) Is the ideal ratio mask (IRM) identical to the optimal mask? Accordingly, we investigate these issues considering four mask-based BFs: the maximum signal-to-noise ratio BF, two variants of this, and the multichannel Wiener filter (MWF) BF. To obtain the optimal mask corresponding to the peak performance for each BF, we employ an approach that minimizes the mean square error between the BF output and target speech for each utterance. Via the experiments with the CHiME-3 dataset, we verify that the four BFs have the same peak performance as the upper bound provided by the ideal MWF BF, whereas the optimal mask depends on the adopted BF and differs from the IRM. These observations differ from the conventional idea that the optimal mask is common for all BFs and that peak performance differs for each BF. Hence, this study contributes to the design of mask-based BFs.
△ Less
Submitted 21 September, 2023;
originally announced September 2023.
-
Retraining-free Customized ASR for Enharmonic Words Based on a Named-Entity-Aware Model and Phoneme Similarity Estimation
Authors:
Yui Sudo,
Kazuya Hata,
Kazuhiro Nakadai
Abstract:
End-to-end automatic speech recognition (E2E-ASR) has the potential to improve performance, but a specific issue that needs to be addressed is the difficulty it has in handling enharmonic words: named entities (NEs) with the same pronunciation and part of speech that are spelled differently. This often occurs with Japanese personal names that have the same pronunciation but different Kanji charact…
▽ More
End-to-end automatic speech recognition (E2E-ASR) has the potential to improve performance, but a specific issue that needs to be addressed is the difficulty it has in handling enharmonic words: named entities (NEs) with the same pronunciation and part of speech that are spelled differently. This often occurs with Japanese personal names that have the same pronunciation but different Kanji characters. Since such NE words tend to be important keywords, ASR easily loses user trust if it misrecognizes them. To solve these problems, this paper proposes a novel retraining-free customized method for E2E-ASRs based on a named-entity-aware E2E-ASR model and phoneme similarity estimation. Experimental results show that the proposed method improves the target NE character error rate by 35.7% on average relative to the conventional E2E-ASR model when selecting personal names as a target NE.
△ Less
Submitted 28 May, 2023;
originally announced May 2023.
-
Metric-based multimodal meta-learning for human movement identification via footstep recognition
Authors:
Muhammad Shakeel,
Katsutoshi Itoyama,
Kenji Nishida,
Kazuhiro Nakadai
Abstract:
We describe a novel metric-based learning approach that introduces a multimodal framework and uses deep audio and geophone encoders in siamese configuration to design an adaptable and lightweight supervised model. This framework eliminates the need for expensive data labeling procedures and learns general-purpose representations from low multisensory data obtained from omnipresent sensing systems.…
▽ More
We describe a novel metric-based learning approach that introduces a multimodal framework and uses deep audio and geophone encoders in siamese configuration to design an adaptable and lightweight supervised model. This framework eliminates the need for expensive data labeling procedures and learns general-purpose representations from low multisensory data obtained from omnipresent sensing systems. These sensing systems provide numerous applications and various use cases in activity recognition tasks. Here, we intend to explore the human footstep movements from indoor environments and analyze representations from a small self-collected dataset of acoustic and vibration-based sensors. The core idea is to learn plausible similarities between two sensory traits and combining representations from audio and geophone signals. We present a generalized framework to learn embeddings from temporal and spatial features extracted from audio and geophone signals. We then extract the representations in a shared space to maximize the learning of a compatibility function between acoustic and geophone features. This, in turn, can be used effectively to carry out a classification task from the learned model, as demonstrated by assigning high similarity to the pairs with a human footstep movement and lower similarity to pairs containing no footstep movement. Performance analyses show that our proposed multimodal framework achieves a 19.99\% accuracy increase (in absolute terms) and avoided overfitting on the evaluation set when the training samples were increased from 200 pairs to just 500 pairs while satisfactorily learning the audio and geophone representations. Our results employ a metric-based contrastive learning approach for multi-sensor data to mitigate the impact of data scarcity and perform human movement identification with limited data size.
△ Less
Submitted 15 November, 2021;
originally announced November 2021.
-
CNN-based MultiChannel End-to-End Speech Recognition for everyday home environments
Authors:
Nelson Yalta,
Shinji Watanabe,
Takaaki Hori,
Kazuhiro Nakadai,
Tetsuya Ogata
Abstract:
Casual conversations involving multiple speakers and noises from surrounding devices are common in everyday environments, which degrades the performances of automatic speech recognition systems. These challenging characteristics of environments are the target of the CHiME-5 challenge. By employing a convolutional neural network (CNN)-based multichannel end-to-end speech recognition system, this st…
▽ More
Casual conversations involving multiple speakers and noises from surrounding devices are common in everyday environments, which degrades the performances of automatic speech recognition systems. These challenging characteristics of environments are the target of the CHiME-5 challenge. By employing a convolutional neural network (CNN)-based multichannel end-to-end speech recognition system, this study attempts to overcome the presents difficulties in everyday environments. The system comprises of an attention-based encoder-decoder neural network that directly generates a text as an output from a sound input. The multichannel CNN encoder, which uses residual connections and batch renormalization, is trained with augmented data, including white noise injection. The experimental results show that the word error rate is reduced by 8.5% and 0.6% absolute from a single channel end-to-end and the best baseline (LF-MMI TDNN) on the CHiME-5 corpus, respectively.
△ Less
Submitted 20 June, 2019; v1 submitted 6 November, 2018;
originally announced November 2018.
-
Weakly Supervised Deep Recurrent Neural Networks for Basic Dance Step Generation
Authors:
Nelson Yalta,
Shinji Watanabe,
Kazuhiro Nakadai,
Tetsuya Ogata
Abstract:
Synthesizing human's movements such as dancing is a flourishing research field which has several applications in computer graphics. Recent studies have demonstrated the advantages of deep neural networks (DNNs) for achieving remarkable performance in motion and music tasks with little effort for feature pre-processing. However, applying DNNs for generating dance to a piece of music is nevertheless…
▽ More
Synthesizing human's movements such as dancing is a flourishing research field which has several applications in computer graphics. Recent studies have demonstrated the advantages of deep neural networks (DNNs) for achieving remarkable performance in motion and music tasks with little effort for feature pre-processing. However, applying DNNs for generating dance to a piece of music is nevertheless challenging, because of 1) DNNs need to generate large sequences while mapping the music input, 2) the DNN needs to constraint the motion beat to the music, and 3) DNNs require a considerable amount of hand-crafted data. In this study, we propose a weakly supervised deep recurrent method for real-time basic dance generation with audio power spectrum as input. The proposed model employs convolutional layers and a multilayered Long Short-Term memory (LSTM) to process the audio input. Then, another deep LSTM layer decodes the target dance sequence. Notably, this end-to-end approach has 1) an auto-conditioned decode configuration that reduces accumulation of feedback error of large dance sequence, 2) uses a contrastive cost function to regulate the mapping between the music and motion beat, and 3) trains with weak labels generated from the motion beat, reducing the amount of hand-crafted data. We evaluate the proposed network based on i) the similarities between generated and the baseline dancer motion with a cross entropy measure for large dance sequences, and ii) accurate timing between the music and motion beat with an F-measure. Experimental results revealed that, after training using a small dataset, the model generates basic dance steps with low cross entropy and maintains an F-measure score similar to that of a baseline dancer.
△ Less
Submitted 20 June, 2019; v1 submitted 3 July, 2018;
originally announced July 2018.