Showing 1–11 of 11 results for author: Nakadai, K

Searching in archive eess.
  1. arXiv:2505.14433

    eess.AS cs.SD

    Single-Channel Target Speech Extraction Utilizing Distance and Room Clues

    Authors: Runwu Shi, Zirui Lin, Benjamin Yen, Jiang Wang, Ragib Amin Nihal, Kazuhiro Nakadai

    Abstract: This paper aims to achieve single-channel target speech extraction (TSE) in enclosures utilizing distance clues and room information. Recent works have verified the feasibility of distance clues for the TSE task, which can imply the sound source's direct-to-reverberation ratio (DRR) and thus can be utilized for speech separation and TSE systems. However, such distance clue is significantly influen…

    Submitted 20 May, 2025; originally announced May 2025.

    Comments: 5 pages, 3 figures, accepted by EUSIPCO 2025

  2. arXiv:2504.03373

    cs.SD cs.RO eess.AS

    An Efficient GPU-based Implementation for Noise Robust Sound Source Localization

    Authors: Zirui Lin, Masayuki Takigahira, Naoya Terakado, Haris Gulzar, Monikka Roslianna Busto, Takeharu Eda, Katsutoshi Itoyama, Kazuhiro Nakadai, Hideharu Amano

    Abstract: Robot audition, encompassing Sound Source Localization (SSL), Sound Source Separation (SSS), and Automatic Speech Recognition (ASR), enables robots and smart devices to acquire auditory capabilities similar to human hearing. Despite their wide applicability, processing multi-channel audio signals from microphone arrays in SSL involves computationally intensive matrix operations, which can hinder e…

    Submitted 8 May, 2025; v1 submitted 4 April, 2025; originally announced April 2025.

    Comments: 6 pages, 2 figures

  3. arXiv:2502.20838

    cs.SD cs.AI cs.LG eess.AS

    Weakly Supervised Multiple Instance Learning for Whale Call Detection and Localization in Long-Duration Passive Acoustic Monitoring

    Authors: Ragib Amin Nihal, Benjamin Yen, Runwu Shi, Kazuhiro Nakadai

    Abstract: Marine ecosystem monitoring via Passive Acoustic Monitoring (PAM) generates vast data, but deep learning often requires precise annotations and short segments. We introduce DSMIL-LocNet, a Multiple Instance Learning framework for whale call detection and localization using only bag-level labels. Our dual-stream model processes 2-30 minute audio segments, leveraging spectral and temporal features w…

    Submitted 28 February, 2025; originally announced February 2025.

  4. arXiv:2412.20146

    eess.AS cs.SD eess.SP

    Bird Vocalization Embedding Extraction Using Self-Supervised Disentangled Representation Learning

    Authors: Runwu Shi, Katsutoshi Itoyama, Kazuhiro Nakadai

    Abstract: This paper addresses the extraction of the bird vocalization embedding from the whole song level using disentangled representation learning (DRL). Bird vocalization embeddings are necessary for large-scale bioacoustic tasks, and self-supervised methods such as Variational Autoencoder (VAE) have shown their performance in extracting such low-dimensional embeddings from vocalization segments on the…

    Submitted 28 December, 2024; originally announced December 2024.

    Comments: Presented at Vocal Interactivity in-and-between Humans, Animals and Robots (VIHAR 2024), https://vihar-2024.vihar.org/assets/VIHAR_2024_proceedings.pdf

  5. arXiv:2412.20144

    eess.AS cs.SD

    Distance Based Single-Channel Target Speech Extraction

    Authors: Runwu Shi, Benjamin Yen, Kazuhiro Nakadai

    Abstract: This paper aims to achieve single-channel target speech extraction (TSE) in enclosures by solely utilizing distance information. This is the first work that utilizes only distance cues without using speaker physiological information for single-channel TSE. Inspired by recent single-channel distance-based separation and extraction methods, we introduce a novel model that efficiently fuses distance…

    Submitted 28 December, 2024; originally announced December 2024.

    Comments: 5 pages, 3 figures, accepted by ICASSP 2025

  6. arXiv:2407.15310

    eess.SP cs.SD eess.AS

    Can all variations within the unified mask-based beamformer framework achieve identical peak extraction performance?

    Authors: Atsuo Hiroe, Katsutoshi Itoyama, Kazuhiro Nakadai

    Abstract: This study investigates mask-based beamformers (BFs), which estimate filters for target sound extraction (TSE) using time-frequency masks. Although multiple mask-based BFs have been proposed, no consensus has been reached on which one offers the best target-extraction performance. Previously, we found that maximum signal-to-noise ratio and minimum mean square error (MSE) BFs can achieve the same e…

    Submitted 22 February, 2025; v1 submitted 21 July, 2024; originally announced July 2024.

    Comments: Accepted for publication in EURASIP journal on Audio, Speech, and Music Processing

    Journal ref: J Audio Speech Music Proc. 2024, 66 (2024)

  7. arXiv:2309.12065

    eess.AS cs.SD eess.SP

    Is the Ideal Ratio Mask Really the Best? -- Exploring the Best Extraction Performance and Optimal Mask of Mask-based Beamformers

    Authors: Atsuo Hiroe, Katsutoshi Itoyama, Kazuhiro Nakadai

    Abstract: This study investigates mask-based beamformers (BFs), which estimate filters to extract target speech using time-frequency masks. Although several BF methods have been proposed, the following aspects are yet to be comprehensively investigated. 1) Which BF can provide the best extraction performance in terms of the closeness of the BF output to the target speech? 2) Is the optimal mask for the best…

    Submitted 21 September, 2023; originally announced September 2023.

    Comments: Accepted in APSIPA 2023

  8. arXiv:2305.17846

    cs.SD cs.CL eess.AS

    Retraining-free Customized ASR for Enharmonic Words Based on a Named-Entity-Aware Model and Phoneme Similarity Estimation

    Authors: Yui Sudo, Kazuya Hata, Kazuhiro Nakadai

    Abstract: End-to-end automatic speech recognition (E2E-ASR) has the potential to improve performance, but a specific issue that needs to be addressed is the difficulty it has in handling enharmonic words: named entities (NEs) with the same pronunciation and part of speech that are spelled differently. This often occurs with Japanese personal names that have the same pronunciation but different Kanji charact…

    Submitted 28 May, 2023; originally announced May 2023.

    Comments: Accepted by INTERSPEECH 2023

  9. arXiv:2111.07979

    cs.SD cs.AI cs.LG eess.AS eess.SY q-bio.NC

    Metric-based multimodal meta-learning for human movement identification via footstep recognition

    Authors: Muhammad Shakeel, Katsutoshi Itoyama, Kenji Nishida, Kazuhiro Nakadai

    Abstract: We describe a novel metric-based learning approach that introduces a multimodal framework and uses deep audio and geophone encoders in siamese configuration to design an adaptable and lightweight supervised model. This framework eliminates the need for expensive data labeling procedures and learns general-purpose representations from low multisensory data obtained from omnipresent sensing systems.…

    Submitted 15 November, 2021; originally announced November 2021.

  10. arXiv:1811.02735

    eess.AS cs.CL cs.SD

    CNN-based MultiChannel End-to-End Speech Recognition for everyday home environments

    Authors: Nelson Yalta, Shinji Watanabe, Takaaki Hori, Kazuhiro Nakadai, Tetsuya Ogata

    Abstract: Casual conversations involving multiple speakers and noises from surrounding devices are common in everyday environments, which degrades the performances of automatic speech recognition systems. These challenging characteristics of environments are the target of the CHiME-5 challenge. By employing a convolutional neural network (CNN)-based multichannel end-to-end speech recognition system, this st…

    Submitted 20 June, 2019; v1 submitted 6 November, 2018; originally announced November 2018.

    Comments: 5 pages, 1 figure, EUSIPCO 2019

  11. arXiv:1807.01126

    cs.LG cs.SD eess.AS stat.ML

    Weakly Supervised Deep Recurrent Neural Networks for Basic Dance Step Generation

    Authors: Nelson Yalta, Shinji Watanabe, Kazuhiro Nakadai, Tetsuya Ogata

    Abstract: Synthesizing human's movements such as dancing is a flourishing research field which has several applications in computer graphics. Recent studies have demonstrated the advantages of deep neural networks (DNNs) for achieving remarkable performance in motion and music tasks with little effort for feature pre-processing. However, applying DNNs for generating dance to a piece of music is nevertheless…

    Submitted 20 June, 2019; v1 submitted 3 July, 2018; originally announced July 2018.

    Comments: 8 pages, 7 figures. Proc. IJCNN 2019