-
Cardiomyopathy Diagnosis Model from Endomyocardial Biopsy Specimens: Appropriate Feature Space and Class Boundary in Small Sample Size Data
Authors:
Masaya Mori,
Yuto Omae,
Yutaka Koyama,
Kazuyuki Hara,
Jun Toyotani,
Yasuo Okumura,
Hiroyuki Hao
Abstract:
As the number of patients with heart failure increases, machine learning (ML) has garnered attention in cardiomyopathy diagnosis, driven by the shortage of pathologists. However, endomyocardial biopsy specimens are often available only in small sample sizes and therefore require techniques such as feature extraction and dimensionality reduction. This study aims to determine whether texture features are effective for feature extraction in the pathological diagnosis of cardiomyopathy. Furthermore, model designs that contribute toward improving generalization performance are examined by applying feature selection (FS) and dimensional compression (DC) to several ML models. The obtained results were verified by visualizing the inter-class distribution differences and conducting statistical hypothesis testing based on texture features. Additionally, they were evaluated using predictive performance across different model designs with varying combinations of FS and DC (applied or not) and decision boundaries. The obtained results confirmed that texture features may be effective for the pathological diagnosis of cardiomyopathy. Moreover, when the ratio of features to the sample size is high, a multi-step process involving FS and DC improved the generalization performance, with the linear kernel support vector machine achieving the best results. This process was demonstrated to be potentially effective for models with reduced complexity, regardless of whether the decision boundaries were linear, curved, perpendicular, or parallel to the axes. These findings are expected to facilitate the development of an effective cardiomyopathy diagnostic model for its rapid adoption in medical practice.
Submitted 14 March, 2025;
originally announced March 2025.
-
FontCraft: Multimodal Font Design Using Interactive Bayesian Optimization
Authors:
Yuki Tatsukawa,
I-Chao Shen,
Mustafa Doga Dogan,
Anran Qi,
Yuki Koyama,
Ariel Shamir,
Takeo Igarashi
Abstract:
Creating new fonts requires a lot of human effort and professional typographic knowledge. Despite the rapid advancements of automatic font generation models, existing methods require users to prepare pre-designed characters with target styles using font-editing software, which poses a problem for non-expert users. To address this limitation, we propose FontCraft, a system that enables font generation without relying on pre-designed characters. Our approach integrates the exploration of a font-style latent space with human-in-the-loop preferential Bayesian optimization and multimodal references, facilitating efficient exploration and enhancing user control. Moreover, FontCraft allows users to revisit previous designs, retracting their earlier choices in the preferential Bayesian optimization process. Once users finish editing the style of a selected character, they can propagate it to the remaining characters and further refine them as needed. The system then generates a complete outline font in OpenType format. We evaluated the effectiveness of FontCraft through a user study comparing it to a baseline interface. Results from both quantitative and qualitative evaluations demonstrate that FontCraft enables non-expert users to design fonts efficiently.
Submitted 16 February, 2025;
originally announced February 2025.
-
Music Foundation Model as Generic Booster for Music Downstream Tasks
Authors:
WeiHsiang Liao,
Yuhta Takida,
Yukara Ikemiya,
Zhi Zhong,
Chieh-Hsin Lai,
Giorgio Fabbro,
Kazuki Shimada,
Keisuke Toyama,
Kinwai Cheuk,
Marco A. Martínez-Ramírez,
Shusuke Takahashi,
Stefan Uhlich,
Taketo Akama,
Woosung Choi,
Yuichiro Koyama,
Yuki Mitsufuji
Abstract:
We demonstrate the efficacy of using intermediate representations from a single foundation model to enhance various music downstream tasks. We introduce SoniDo, a music foundation model (MFM) designed to extract hierarchical features from target music samples. By leveraging hierarchical intermediate features, SoniDo constrains the information granularity, leading to improved performance across various downstream tasks including both understanding and generative tasks. We specifically evaluated this approach on representative tasks such as music tagging, music transcription, music source separation, and music mixing. Our results reveal that the features extracted from foundation models provide valuable enhancements in training downstream task models. This highlights the capability of using features extracted from music foundation models as a booster for downstream tasks. Our approach not only benefits existing task-specific models but also supports music downstream tasks constrained by data scarcity. This paves the way for more effective and accessible music processing solutions.
Submitted 5 November, 2024; v1 submitted 2 November, 2024;
originally announced November 2024.
-
A Practical Style Transfer Pipeline for 3D Animation: Insights from Production R&D
Authors:
Hideki Todo,
Yuki Koyama,
Kunihiro Sakai,
Akihiro Komiya,
Jun Kato
Abstract:
Our animation studio has developed a practical style transfer pipeline for creating stylized 3D animation, which is suitable for complex real-world production. This paper presents the insights from our development process, where we explored various options to balance quality, artist control, and workload, leading to several key decisions. For example, we chose patch-based texture synthesis over machine learning for better control and to avoid training data issues. We also addressed specifying style exemplars, managing multiple colors within a scene, controlling outlines and shadows, and reducing temporal noise. These insights were used to further refine our pipeline, ultimately enabling us to produce an experimental short film showcasing various styles.
Submitted 31 October, 2024;
originally announced October 2024.
-
FontCLIP: A Semantic Typography Visual-Language Model for Multilingual Font Applications
Authors:
Yuki Tatsukawa,
I-Chao Shen,
Anran Qi,
Yuki Koyama,
Takeo Igarashi,
Ariel Shamir
Abstract:
Acquiring the desired font for various design tasks can be challenging and requires professional typographic knowledge. While previous font retrieval or generation works have alleviated some of these difficulties, they often lack support for multiple languages and semantic attributes beyond the training data domains. To solve this problem, we present FontCLIP: a model that connects the semantic understanding of a large vision-language model with typographical knowledge. We integrate typography-specific knowledge into the comprehensive vision-language knowledge of a pretrained CLIP model through a novel finetuning approach. We propose to use a compound descriptive prompt that encapsulates adaptively sampled attributes from a font attribute dataset focusing on Roman alphabet characters. FontCLIP's semantic typographic latent space demonstrates two unprecedented generalization abilities. First, FontCLIP generalizes to different languages including Chinese, Japanese, and Korean (CJK), capturing the typographical features of fonts across different languages, even though it was only finetuned using fonts of Roman characters. Second, FontCLIP can recognize semantic attributes that are not present in the training data. FontCLIP's dual-modality and generalization abilities enable multilingual and cross-lingual font retrieval and letter shape optimization, reducing the burden of obtaining desired fonts.
Submitted 11 March, 2024;
originally announced March 2024.
-
Zero- and Few-shot Sound Event Localization and Detection
Authors:
Kazuki Shimada,
Kengo Uchida,
Yuichiro Koyama,
Takashi Shibuya,
Shusuke Takahashi,
Yuki Mitsufuji,
Tatsuya Kawahara
Abstract:
Sound event localization and detection (SELD) systems estimate direction-of-arrival (DOA) and temporal activation for sets of target classes. Neural network (NN)-based SELD systems have performed well in various sets of target classes, but they only output the DOA and temporal activation of preset classes trained before inference. To customize target classes after training, we tackle zero- and few-shot SELD tasks, in which we set new classes with a text sample or a few audio samples. While zero-shot sound classification tasks are achievable by embedding from contrastive language-audio pretraining (CLAP), zero-shot SELD tasks require assigning an activity and a DOA to each embedding, especially in overlapping cases. To tackle the assignment problem in overlapping cases, we propose an embed-ACCDOA model, which is trained to output track-wise CLAP embedding and corresponding activity-coupled Cartesian direction-of-arrival (ACCDOA). In our experimental evaluations on zero- and few-shot SELD tasks, the embed-ACCDOA model showed better location-dependent scores than a straightforward combination of the CLAP audio encoder and a DOA estimation model. Moreover, the proposed combination of the embed-ACCDOA model and CLAP audio encoder with zero- or few-shot samples performed comparably to an official baseline system trained with the complete training data in an evaluation dataset.
Submitted 17 January, 2024; v1 submitted 17 September, 2023;
originally announced September 2023.
-
STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events
Authors:
Kazuki Shimada,
Archontis Politis,
Parthasaarathy Sudarsanam,
Daniel Krause,
Kengo Uchida,
Sharath Adavanne,
Aapo Hakala,
Yuichiro Koyama,
Naoya Takahashi,
Shusuke Takahashi,
Tuomas Virtanen,
Yuki Mitsufuji
Abstract:
While direction of arrival (DOA) of sound events is generally estimated from multichannel audio data recorded with a microphone array, sound events usually derive from visually perceptible source objects, e.g., sounds of footsteps come from the feet of a walker. This paper proposes an audio-visual sound event localization and detection (SELD) task, which uses multichannel audio and video information to estimate the temporal activation and DOA of target sound events. Audio-visual SELD systems can detect and localize sound events using signals from a microphone array and audio-visual correspondence. We also introduce an audio-visual dataset, Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23), which consists of multichannel audio data recorded with a microphone array, video data, and spatiotemporal annotation of sound events. Sound scenes in STARSS23 are recorded with instructions, which guide recording participants to ensure adequate activity and occurrences of sound events. STARSS23 also provides human-annotated temporal activation labels and human-confirmed DOA labels, which are based on tracking results of a motion capture system. Our benchmark results demonstrate the benefits of using visual object positions in audio-visual SELD tasks. The data is available at https://zenodo.org/record/7880637.
Submitted 14 November, 2023; v1 submitted 15 June, 2023;
originally announced June 2023.
-
Diffusion-Based Speech Enhancement with Joint Generative and Predictive Decoders
Authors:
Hao Shi,
Kazuki Shimada,
Masato Hirano,
Takashi Shibuya,
Yuichiro Koyama,
Zhi Zhong,
Shusuke Takahashi,
Tatsuya Kawahara,
Yuki Mitsufuji
Abstract:
Diffusion-based generative speech enhancement (SE) has recently received attention, but reverse diffusion remains time-consuming. One solution is to initialize the reverse diffusion process with enhanced features estimated by a predictive SE system. However, the pipeline structure currently does not consider the combined use of generative and predictive decoders. The predictive decoder allows us to further exploit the complementarity between predictive and diffusion-based generative SE. In this paper, we propose a unified system that jointly uses generative and predictive decoders across two levels. The encoder encodes both generative and predictive information at the shared encoding level. At the decoded feature level, we fuse the two features decoded by the generative and predictive decoders. Specifically, the two SE modules are fused in the initial and final diffusion steps: the initial fusion initializes the diffusion process with the predictive SE to improve convergence, and the final fusion combines the two complementary SE outputs to enhance SE performance. Experiments conducted on the Voice-Bank dataset demonstrate that incorporating predictive information leads to faster decoding and higher PESQ scores compared with other score-based diffusion SE (StoRM and SGMSE+).
Submitted 28 February, 2024; v1 submitted 18 May, 2023;
originally announced May 2023.
-
Diffusion-based Signal Refiner for Speech Separation
Authors:
Masato Hirano,
Kazuki Shimada,
Yuichiro Koyama,
Shusuke Takahashi,
Yuki Mitsufuji
Abstract:
We have developed a diffusion-based speech refiner that improves the reference-free perceptual quality of the audio predicted by preceding single-channel speech separation models. Although modern deep neural network-based speech separation models have shown high performance in reference-based metrics, they often produce perceptually unnatural artifacts. The recent advancements made to diffusion models motivated us to tackle this problem by restoring the degraded parts of initial separations with a generative approach. Utilizing the denoising diffusion restoration model (DDRM) as a basis, we propose a shared DDRM-based refiner that generates samples conditioned on the global information of preceding outputs from arbitrary speech separation models. We experimentally show that our refiner can provide a clearer harmonic structure of speech and improve the reference-free metric of perceptual quality for arbitrary preceding model architectures. Furthermore, we tune the variance of the measurement noise based on preceding outputs, which results in higher scores in both reference-free and reference-based metrics. The separation quality can also be further improved by blending the discriminative and generative outputs.
Submitted 12 May, 2023; v1 submitted 9 May, 2023;
originally announced May 2023.
-
Vitreoretinal Surgical Robotic System with Autonomous Orbital Manipulation using Vector-Field Inequalities
Authors:
Yuki Koyama,
Murilo Marques Marinho,
Kanako Harada
Abstract:
Vitreoretinal surgery pertains to the treatment of delicate tissues on the fundus of the eye using thin instruments. Surgeons frequently rotate the eye during surgery, which is called orbital manipulation, to observe regions around the fundus without moving the patient. In this paper, we propose the autonomous orbital manipulation of the eye in robot-assisted vitreoretinal surgery with our tele-operated surgical system. In a simulation study, we preliminarily investigated the increase in the manipulability of our system using orbital manipulation. Furthermore, we demonstrated the feasibility of our method in experiments with a physical robot and a realistic eye model, showing an increase in the viewable area of the fundus when compared to a conventional technique. Source code and a minimal example are available at https://github.com/mmmarinho/icra2023_orbitalmanipulation.
Submitted 10 February, 2023;
originally announced February 2023.
-
A Structure-Guided Diffusion Model for Large-Hole Image Completion
Authors:
Daichi Horita,
Jiaolong Yang,
Dong Chen,
Yuki Koyama,
Kiyoharu Aizawa,
Nicu Sebe
Abstract:
Image completion techniques have made significant progress in filling missing regions (i.e., holes) in images. However, large-hole completion remains challenging due to limited structural information. In this paper, we address this problem by integrating explicit structural guidance into diffusion-based image completion, forming our structure-guided diffusion model (SGDM). It consists of two cascaded diffusion probabilistic models: structure and texture generators. The structure generator generates an edge image representing plausible structures within the holes, which is then used for guiding the texture generation process. To train both generators jointly, we devise a novel strategy that leverages optimal Bayesian denoising, which denoises the output of the structure generator in a single step and thus allows backpropagation. Our diffusion-based approach enables a diversity of plausible completions, while the editable edges allow for editing parts of an image. Our experiments on natural scene (Places) and face (CelebA-HQ) datasets demonstrate that our method achieves a superior or comparable visual quality compared to state-of-the-art approaches. The code is available for research purposes at https://github.com/UdonDa/Structure_Guided_Diffusion_Model.
Submitted 6 September, 2023; v1 submitted 18 November, 2022;
originally announced November 2022.
-
STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events
Authors:
Archontis Politis,
Kazuki Shimada,
Parthasaarathy Sudarsanam,
Sharath Adavanne,
Daniel Krause,
Yuichiro Koyama,
Naoya Takahashi,
Shusuke Takahashi,
Yuki Mitsufuji,
Tuomas Virtanen
Abstract:
This report presents the Sony-TAu Realistic Spatial Soundscapes 2022 (STARSS22) dataset for sound event localization and detection, comprised of spatial recordings of real scenes collected in various interiors of two different sites. The dataset is captured with a high resolution spherical microphone array and delivered in two 4-channel formats, first-order Ambisonics and tetrahedral microphone array. Sound events in the dataset belonging to 13 target sound classes are annotated both temporally and spatially through a combination of human annotation and optical tracking. The dataset serves as the development and evaluation dataset for Task 3 of the DCASE2022 Challenge on Sound Event Localization and Detection and introduces significant new challenges for the task compared to the previous iterations, which were based on synthetic spatialized sound scene recordings. Dataset specifications are detailed, including the recording and annotation process, target classes and their presence, and details on the development and evaluation splits. Additionally, the report presents the baseline system that accompanies the dataset in the challenge with emphasis on the differences with the baseline of the previous iterations; namely, introduction of the multi-ACCDOA representation to handle multiple simultaneous occurrences of events of the same class, and support for additional improved input features for the microphone array format. Results of the baseline indicate that with a suitable training strategy a reasonable detection and localization performance can be achieved on real sound scene recordings. The dataset is available at https://zenodo.org/record/6387880.
Submitted 2 September, 2022; v1 submitted 4 June, 2022;
originally announced June 2022.
-
Distortion Audio Effects: Learning How to Recover the Clean Signal
Authors:
Johannes Imort,
Giorgio Fabbro,
Marco A. Martínez Ramírez,
Stefan Uhlich,
Yuichiro Koyama,
Yuki Mitsufuji
Abstract:
Given the recent advances in music source separation and automatic mixing, removing audio effects in music tracks is a meaningful step toward developing an automated remixing system. This paper focuses on removing distortion audio effects applied to guitar tracks in music production. We explore whether effect removal can be solved by neural networks designed for source separation and audio effect modeling. Our approach proves particularly effective for effects that mix the processed and clean signals. The models achieve better quality and significantly faster inference compared to state-of-the-art solutions based on sparse optimization. We demonstrate that the models are suitable not only for declipping but also for other types of distortion effects. By discussing the results, we stress the usefulness of multiple evaluation metrics to assess different aspects of reconstruction in distortion effect removal.
Submitted 13 September, 2022; v1 submitted 3 February, 2022;
originally announced February 2022.
-
Improving segmentation of calcified and non-calcified plaques on CCTA-CPR scans via masking of the artery wall
Authors:
Antonio Tejero-de-Pablos,
Hiroaki Yamane,
Yusuke Kurose,
Junichi Iho,
Youji Tokunaga,
Makoto Horie,
Keisuke Nishizawa,
Yusaku Hayashi,
Yasushi Koyama,
Tatsuya Harada
Abstract:
The presence of plaques in the coronary arteries is a major risk to patients' lives. In particular, non-calcified plaques pose a great challenge, as they are harder to detect and more likely to rupture than calcified plaques. While current deep learning techniques allow precise segmentation of real-life images, the performance in medical images is still low. This is caused mostly by blurriness and ambiguous voxel intensities of unrelated parts that fall in the same value range. In this paper, we propose a novel methodology for segmenting calcified and non-calcified plaques in CCTA-CPR scans of coronary arteries. The input slices are masked so that only the voxels within the vessel wall are considered for segmentation, thus reducing ambiguity. This mask can be automatically generated via a deep learning-based vessel detector that provides not only the contour of the outer artery wall, but also the inner contour. For evaluation, we utilized a dataset in which each voxel is carefully annotated as one of five classes: background, lumen, artery wall, calcified plaque, or non-calcified plaque. We also provide an exhaustive evaluation by applying different types of masks, in order to validate the potential of vessel masking for plaque segmentation. Our methodology results in a prominent boost in segmentation performance, in both quantitative and qualitative evaluation, achieving accurate plaque shapes even for the challenging non-calcified plaques. Furthermore, when using highly accurate masks, difficult cases such as stenosis become segmentable. We believe our findings can guide future research on high-performance plaque segmentation.
Submitted 10 April, 2023; v1 submitted 25 January, 2022;
originally announced January 2022.
-
Multi-ACCDOA: Localizing and Detecting Overlapping Sounds from the Same Class with Auxiliary Duplicating Permutation Invariant Training
Authors:
Kazuki Shimada,
Yuichiro Koyama,
Shusuke Takahashi,
Naoya Takahashi,
Emiru Tsunoo,
Yuki Mitsufuji
Abstract:
Sound event localization and detection (SELD) involves identifying the direction-of-arrival (DOA) and the event class. The SELD methods with a class-wise output format make the model predict activities of all sound event classes and corresponding locations. The class-wise methods can output activity-coupled Cartesian DOA (ACCDOA) vectors, which enable us to solve a SELD task with a single target using a single network. However, there is still a challenge in detecting the same event class from multiple locations. To overcome this problem while maintaining the advantages of the class-wise format, we extended ACCDOA to a multi-track variant and proposed auxiliary duplicating permutation invariant training (ADPIT). The multi-ACCDOA format (a class- and track-wise output format) enables the model to solve the cases with overlaps from the same class. The class-wise ADPIT scheme enables each track of the multi-ACCDOA format to learn with the same target as the single-ACCDOA format. In evaluations with the DCASE 2021 Task 3 dataset, the model trained with the multi-ACCDOA format and with the class-wise ADPIT detects overlapping events from the same class while maintaining its performance in the other cases. Also, the proposed method performed comparably to state-of-the-art SELD methods with fewer parameters.
Submitted 27 March, 2022; v1 submitted 13 October, 2021;
originally announced October 2021.
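As a rough illustration of the ACCDOA idea named in the abstract above, the sketch below encodes one class's activity and DOA as a single Cartesian vector and decodes a list of track vectors for that class. It assumes the usual reading in which the vector length carries the activity and the vector direction carries the DOA. The function names, the 0.5 threshold, and the azimuth/elevation convention are illustrative assumptions and are not taken from the paper.

```python
import math

def encode_accdoa(active, azimuth_deg, elevation_deg):
    """Return a 3-D vector: unit length pointing at the DOA if active, zero otherwise."""
    if not active:
        return (0.0, 0.0, 0.0)
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (math.cos(el) * math.cos(az),
            math.cos(el) * math.sin(az),
            math.sin(el))

def decode_accdoa(vec, threshold=0.5):
    """Recover (active, azimuth_deg, elevation_deg) from one vector.

    The threshold on the vector norm is an assumed hyperparameter.
    """
    x, y, z = vec
    norm = math.sqrt(x * x + y * y + z * z)
    if norm <= threshold:
        return (False, None, None)
    azimuth = math.degrees(math.atan2(y, x))
    # Clamp to guard against rounding pushing the ratio just outside [-1, 1].
    elevation = math.degrees(math.asin(max(-1.0, min(1.0, z / norm))))
    return (True, azimuth, elevation)

def decode_tracks(track_vectors, threshold=0.5):
    """Decode several track vectors of the same class (multi-track output)."""
    return [decode_accdoa(v, threshold) for v in track_vectors]
```

With several tracks per class, two overlapping events of the same class can each occupy their own vector, which is the situation the multi-track format is meant to handle.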
-
Spatial Data Augmentation with Simulated Room Impulse Responses for Sound Event Localization and Detection
Authors:
Yuichiro Koyama,
Kazuhide Shigemi,
Masafumi Takahashi,
Kazuki Shimada,
Naoya Takahashi,
Emiru Tsunoo,
Shusuke Takahashi,
Yuki Mitsufuji
Abstract:
Recording and annotating real sound events for a sound event localization and detection (SELD) task is time consuming, and data augmentation techniques are often favored when the amount of data is limited. However, how to augment the spatial information in a dataset, including unlabeled directional interference events, remains an open research question. Furthermore, directional interference events make it difficult to accurately extract spatial characteristics from target sound events. To address this problem, we propose an impulse response simulation framework (IRS) that augments spatial characteristics using simulated room impulse responses (RIR). RIRs corresponding to a microphone array assumed to be placed in various rooms are accurately simulated, and the source signals of the target sound events are extracted from a mixture. The simulated RIRs are then convolved with the extracted source signals to obtain an augmented multi-channel training dataset. Evaluation results obtained using the TAU-NIGENS Spatial Sound Events 2021 dataset show that the IRS contributes to improving the overall SELD performance. Additionally, we conducted an ablation study to discuss the contribution and need for each component within the IRS.
Submitted 28 April, 2022; v1 submitted 13 October, 2021;
originally announced October 2021.
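The convolution step described in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `convolve` and `spatialize` and the toy impulse responses are hypothetical, and a real pipeline would use simulated RIRs and FFT-based convolution on full-length audio.

```python
def convolve(signal, impulse_response):
    """Direct (full) discrete convolution of two 1-D sequences."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def spatialize(source, rirs):
    """Convolve one mono source with one RIR per microphone.

    `rirs` is a list of impulse responses, one per channel (hypothetical
    toy values here, standing in for simulated room impulse responses).
    Returns a list of channels, i.e. a multi-channel signal.
    """
    return [convolve(source, rir) for rir in rirs]


# Toy example: a two-sample source and a two-microphone array.
source = [1.0, 0.5]
rirs = [[1.0, 0.0, 0.5], [0.0, 1.0]]
channels = spatialize(source, rirs)
```

Each output channel carries the same source filtered by a different impulse response, which is what gives the augmented signal its spatial characteristics.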
-
Music Source Separation with Deep Equilibrium Models
Authors:
Yuichiro Koyama,
Naoki Murata,
Stefan Uhlich,
Giorgio Fabbro,
Shusuke Takahashi,
Yuki Mitsufuji
Abstract:
While deep neural network-based music source separation (MSS) is very effective and achieves high performance, its model size is often a problem for practical deployment. Deep implicit architectures such as deep equilibrium models (DEQ) were recently proposed, which can achieve higher performance than their explicit counterparts with limited depth while keeping the number of parameters small. This makes DEQ attractive for MSS as well, especially as it was originally applied to sequential modeling tasks in natural language processing and thus should in principle also be suited to MSS. However, an investigation of a good architecture and training scheme for MSS with DEQ is needed, as the characteristics of acoustic signals are different from those of natural language data. Hence, in this paper we propose an architecture and training scheme for MSS with DEQ. Starting with the architecture of Open-Unmix (UMX), we replace its sequence model with DEQ. We refer to our proposed method as DEQ-based UMX (DEQ-UMX). Experimental results show that DEQ-UMX performs better than the original UMX while reducing its number of parameters by 30%.
Submitted 28 April, 2022; v1 submitted 13 October, 2021;
originally announced October 2021.
-
Spatial mixup: Directional loudness modification as data augmentation for sound event localization and detection
Authors:
Ricardo Falcon-Perez,
Kazuki Shimada,
Yuichiro Koyama,
Shusuke Takahashi,
Yuki Mitsufuji
Abstract:
Data augmentation methods have shown great importance in diverse supervised learning problems where labeled data is scarce or costly to obtain. For sound event localization and detection (SELD) tasks, several augmentation methods have been proposed, with most borrowing ideas from other domains such as images, speech, or monophonic audio. However, only a few exploit the spatial properties of a full 3D audio scene. We propose Spatial Mixup, an application of parametric spatial audio effects for data augmentation, which modifies the directional properties of a multi-channel spatial audio signal encoded in the ambisonics domain. Similarly to beamforming, these modifications enhance or suppress signals arriving from certain directions, although the effect is less pronounced, thereby enabling deep learning models to achieve invariance to small spatial perturbations. The method is evaluated with experiments on the DCASE 2021 Task 3 dataset, where spatial mixup increases performance over a non-augmented baseline, and compares to other well-known augmentation methods. Furthermore, combining spatial mixup with other methods greatly improves performance.
Submitted 12 October, 2021;
originally announced October 2021.
-
Autonomous Coordinated Control of the Light Guide for Positioning in Vitreoretinal Surgery
Authors:
Yuki Koyama,
Murilo M. Marinho,
Mamoru Mitsuishi,
Kanako Harada
Abstract:
Vitreoretinal surgery is challenging even for expert surgeons owing to the delicate target tissues and the diminutive workspace in the retina. In addition to improved dexterity and accuracy, robot assistance allows for (partial) task automation. In this work, we propose a strategy to automate the motion of the light guide with respect to the surgical instrument. This automation allows the instrument's shadow to always be inside the microscopic view, which is an important cue for the accurate positioning of the instrument in the retina. We show simulations and experiments demonstrating that the proposed strategy is effective in a 700-point grid in the retina of a surgical phantom. Furthermore, we integrated the proposed strategy with image processing and succeeded in positioning the surgical instrument's tip in the retina, relying on only the robot's geometric information and microscopic images.
Submitted 20 January, 2022; v1 submitted 26 July, 2021;
originally announced July 2021.
-
Ensemble of ACCDOA- and EINV2-based Systems with D3Nets and Impulse Response Simulation for Sound Event Localization and Detection
Authors:
Kazuki Shimada,
Naoya Takahashi,
Yuichiro Koyama,
Shusuke Takahashi,
Emiru Tsunoo,
Masafumi Takahashi,
Yuki Mitsufuji
Abstract:
This report describes our systems submitted to the DCASE2021 challenge task 3: sound event localization and detection (SELD) with directional interference. Our previous system based on activity-coupled Cartesian direction of arrival (ACCDOA) representation enables us to solve a SELD task with a single target. This ACCDOA-based system, with an efficient network architecture called RD3Net and data augmentation techniques, outperformed state-of-the-art SELD systems in terms of localization and location-dependent detection. Using the ACCDOA-based system as a base, we perform model ensembles by averaging outputs of several systems trained with different conditions such as input features, training folds, and model architectures. We also use the event independent network v2 (EINV2)-based system to increase the diversity of the model ensembles. To generalize the models, we further propose impulse response simulation (IRS), which generates simulated multi-channel signals by convolving simulated room impulse responses (RIRs) with source signals extracted from the original dataset. Our systems significantly improved over the baseline system on the development dataset.
Submitted 20 June, 2021;
originally announced June 2021.
-
Tool- and Domain-Agnostic Parameterization of Style Transfer Effects Leveraging Pretrained Perceptual Metrics
Authors:
Hiromu Yakura,
Yuki Koyama,
Masataka Goto
Abstract:
Current deep learning techniques for style transfer would not be optimal for design support since their "one-shot" transfer does not fit exploratory design processes. To overcome this gap, we propose parametric transcription, which transcribes an end-to-end style transfer effect into parameter values of specific transformations available in an existing content editing tool. With this approach, users can imitate the style of a reference sample in the tool that they are familiar with and thus can easily continue further exploration by manipulating the parameters. To enable this, we introduce a framework that utilizes an existing pretrained model for style transfer to calculate a perceptual style distance to the reference sample and uses black-box optimization to find the parameters that minimize this distance. Our experiments with various third-party tools, such as Instagram and Blender, show that our framework can effectively leverage deep learning techniques for computational design support.
Submitted 19 May, 2021;
originally announced May 2021.
-
ACCDOA: Activity-Coupled Cartesian Direction of Arrival Representation for Sound Event Localization and Detection
Authors:
Kazuki Shimada,
Yuichiro Koyama,
Naoya Takahashi,
Shusuke Takahashi,
Yuki Mitsufuji
Abstract:
Neural-network (NN)-based methods show high performance in sound event localization and detection (SELD). Conventional NN-based methods use two branches for a sound event detection (SED) target and a direction-of-arrival (DOA) target. The two-branch representation with a single network has to decide how to balance the two objectives during optimization. Using two networks dedicated to each task increases system complexity and network size. To address these problems, we propose an activity-coupled Cartesian DOA (ACCDOA) representation, which assigns a sound event activity to the length of a corresponding Cartesian DOA vector. The ACCDOA representation enables us to solve a SELD task with a single target and has two advantages: it avoids both the need to balance the objectives and an increase in model size. In experimental evaluations with the DCASE 2020 Task 3 dataset, the ACCDOA representation outperformed the two-branch representation in SELD metrics with a smaller network size. The ACCDOA-based SELD system also performed better than state-of-the-art SELD systems in terms of localization and location-dependent detection.
Submitted 14 February, 2021; v1 submitted 28 October, 2020;
originally announced October 2020.
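The idea of assigning activity to the length of a Cartesian DOA vector, as described in the abstract above, can be sketched as follows. This is a hedged illustration, not the authors' code: the function names, the azimuth/elevation convention, and the 0.5 decision threshold are assumptions made for the example.

```python
import math


def accdoa_encode(active, azimuth_deg, elevation_deg):
    """Return a 3-D target vector: unit DOA vector scaled by activity.

    The vector has length 1 when the event is active and length 0
    otherwise (angle convention is an assumption for this sketch).
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    unit = (
        math.cos(el) * math.cos(az),
        math.cos(el) * math.sin(az),
        math.sin(el),
    )
    activity = 1.0 if active else 0.0
    return tuple(activity * c for c in unit)


def accdoa_decode(vec, threshold=0.5):
    """Recover (active, azimuth_deg, elevation_deg) from a vector.

    The event is treated as active when the vector length exceeds
    `threshold` (an assumed value); the direction is the normalized vector.
    """
    norm = math.sqrt(sum(c * c for c in vec))
    if norm <= threshold:
        return False, None, None
    x, y, z = (c / norm for c in vec)
    azimuth = math.degrees(math.atan2(y, x))
    elevation = math.degrees(math.asin(max(-1.0, min(1.0, z))))
    return True, azimuth, elevation


target = accdoa_encode(True, 30.0, 10.0)
decoded = accdoa_decode(target)
silent = accdoa_decode(accdoa_encode(False, 30.0, 10.0))
```

Because detection and localization share one regression target per class, a single output (and a single loss) covers both, which is the property the abstract highlights.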
-
Generative Melody Composition with Human-in-the-Loop Bayesian Optimization
Authors:
Yijun Zhou,
Yuki Koyama,
Masataka Goto,
Takeo Igarashi
Abstract:
Deep generative models allow even novice composers to generate various melodies by sampling latent vectors. However, finding the desired melody is challenging since the latent space is unintuitive and high-dimensional. In this work, we present an interactive system that supports generative melody composition with human-in-the-loop Bayesian optimization (BO). This system takes a mixed-initiative approach; the system generates candidate melodies to evaluate, and the user evaluates them and provides preferential feedback (i.e., picking the best melody among the candidates) to the system. This process is iteratively performed based on BO techniques until the user finds the desired melody. We conducted a pilot study using our prototype system, suggesting the potential of this approach.
Submitted 7 October, 2020;
originally announced October 2020.
-
Exploring Optimal DNN Architecture for End-to-End Beamformers Based on Time-frequency References
Authors:
Yuichiro Koyama,
Bhiksha Raj
Abstract:
Acoustic beamformers have been widely used to enhance audio signals. Currently, the best methods are the deep neural network (DNN)-powered variants of the generalized eigenvalue and minimum-variance distortionless response beamformers and the DNN-based filter-estimation methods that are used to directly compute beamforming filters. Both approaches are effective; however, they have blind spots in their generalizability. Therefore, we propose a novel approach for combining these two methods into a single framework that attempts to exploit the best features of both. The resulting model, called the W-Net beamformer, includes two components: the first computes time-frequency references that the second uses to estimate beamforming filters. The results on data that include a wide variety of room and noise conditions, including static and mobile noise sources, show that the proposed beamformer outperforms other methods on all tested evaluation metrics, which signifies that the proposed architecture allows for effective computation of the beamforming filters.
Submitted 11 August, 2020; v1 submitted 23 May, 2020;
originally announced May 2020.
-
Efficient Integration of Multi-channel Information for Speaker-independent Speech Separation
Authors:
Yuichiro Koyama,
Oluwafemi Azeez,
Bhiksha Raj
Abstract:
Although deep-learning-based methods have markedly improved the performance of speech separation over the past few years, it remains an open question how to integrate multi-channel signals for speech separation. We propose two methods, namely, early-fusion and late-fusion methods, to integrate multi-channel information based on the time-domain audio separation network, which has been proven effective in single-channel speech separation. We also propose channel-sequential-transfer learning, which is a transfer learning framework that applies the parameters trained for a lower-channel network as the initial values of a higher-channel network. For fair comparison, we evaluated our proposed methods using a spatialized version of the wsj0-2mix dataset, which is open-sourced. It was found that our proposed methods can outperform multi-channel deep clustering and improve the performance proportionally to the number of microphones. It was also proven that the performance of the late-fusion method is consistently higher than that of the single-channel method regardless of the angle difference between speakers.
Submitted 11 August, 2020; v1 submitted 23 May, 2020;
originally announced May 2020.
-
Exploring the Best Loss Function for DNN-Based Low-latency Speech Enhancement with Temporal Convolutional Networks
Authors:
Yuichiro Koyama,
Tyler Vuong,
Stefan Uhlich,
Bhiksha Raj
Abstract:
Recently, deep neural networks (DNNs) have been successfully used for speech enhancement, and DNN-based speech enhancement is becoming an attractive research area. While time-frequency masking based on the short-time Fourier transform (STFT) has been widely used for DNN-based speech enhancement over the last few years, time-domain methods such as the time-domain audio separation network (TasNet) have also been proposed. The most suitable method depends on the scale of the dataset and the type of task. In this paper, we explore the best speech enhancement algorithm on two different datasets. We propose an STFT-based method and a loss function using problem-agnostic speech encoder (PASE) features to improve subjective quality for the smaller dataset. Our proposed methods are effective on the Voice Bank + DEMAND dataset and compare favorably to other state-of-the-art methods. We also implement a low-latency version of TasNet, which we submitted to the DNS Challenge and made public by open-sourcing it. Our model achieves excellent performance on the DNS Challenge dataset.
Submitted 20 August, 2020; v1 submitted 23 May, 2020;
originally announced May 2020.
-
Sequential Gallery for Interactive Visual Design Optimization
Authors:
Yuki Koyama,
Issei Sato,
Masataka Goto
Abstract:
Visual design tasks often involve tuning many design parameters. For example, color grading of a photograph involves many parameters, some of which non-expert users might be unfamiliar with. We propose a novel user-in-the-loop optimization method that allows users to efficiently find an appropriate parameter set by exploring such a high-dimensional design space through much easier two-dimensional search subtasks. This method, called sequential plane search, is based on Bayesian optimization to keep necessary queries to users as few as possible. To help users respond to plane-search queries, we also propose using a gallery-based interface that provides options in the two-dimensional subspace arranged in an adaptive grid view. We call this interactive framework Sequential Gallery since users sequentially select the best option from the options provided by the interface. Our experiment with synthetic functions shows that our sequential plane search can find satisfactory solutions in fewer iterations than baselines. We also conducted a preliminary user study, results of which suggest that novices can effectively complete search tasks with Sequential Gallery in a photo-enhancement scenario.
Submitted 8 May, 2020;
originally announced May 2020.
-
MirrorNet: A Deep Bayesian Approach to Reflective 2D Pose Estimation from Human Images
Authors:
Takayuki Nakatsuka,
Kazuyoshi Yoshii,
Yuki Koyama,
Satoru Fukayama,
Masataka Goto,
Shigeo Morishima
Abstract:
This paper proposes a statistical approach to 2D pose estimation from human images. The main problems with the standard supervised approach, which is based on a deep recognition (image-to-pose) model, are that it often yields anatomically implausible poses, and its performance is limited by the amount of paired data. To solve these problems, we propose a semi-supervised method that can make effective use of images with and without pose annotations. Specifically, we formulate a hierarchical generative model of poses and images by integrating a deep generative model of poses from pose features with that of images from poses and image features. We then introduce a deep recognition model that infers poses from images. Given images as observed data, these models can be trained jointly in a hierarchical variational autoencoding (image-to-pose-to-feature-to-pose-to-image) manner. The results of experiments show that the proposed reflective architecture makes estimated poses anatomically plausible, and that the performance of pose estimation is improved by integrating the recognition and generative models and also by feeding non-annotated images.
Submitted 8 April, 2020;
originally announced April 2020.
-
Coronary Wall Segmentation in CCTA Scans via a Hybrid Net with Contours Regularization
Authors:
Kaikai Huang,
Antonio Tejero-de-Pablos,
Hiroaki Yamane,
Yusuke Kurose,
Junichi Iho,
Youji Tokunaga,
Makoto Horie,
Keisuke Nishizawa,
Yusaku Hayashi,
Yasushi Koyama,
Tatsuya Harada
Abstract:
Providing closed and well-connected boundaries of the coronary arteries is essential to assist cardiologists in the diagnosis of coronary artery disease (CAD). Recently, several deep learning-based methods have been proposed for boundary detection and segmentation in medical images. However, when applied to coronary wall detection, they tend to produce disconnected and inaccurate boundaries. In this paper, we propose a novel boundary detection method for coronary arteries that focuses on the continuity and connectivity of the boundaries. In order to model the spatial continuity of consecutive images, our hybrid architecture takes a volume (i.e., a segment of the coronary artery) as input and detects the boundary of the target slice (i.e., the central slice of the segment). Then, to ensure closed boundaries, we propose a contour-constrained weighted Hausdorff distance loss. We evaluate our method on a dataset of coronary CT angiography scans with curved planar reconstruction (CCTA-CPR) of the arteries (i.e., cross-sections) from 34 patients. Experimental results show that our method can produce smooth, closed boundaries that surpass state-of-the-art accuracy.
Submitted 27 February, 2020;
originally announced February 2020.
-
Computational Design with Crowds
Authors:
Yuki Koyama,
Takeo Igarashi
Abstract:
Computational design is aimed at supporting or automating design processes using computational techniques. However, some classes of design tasks involve criteria that are difficult to handle only with computers. For example, visual design tasks seeking to fulfill aesthetic goals are difficult to handle purely with computers. One promising approach is to leverage human computation; that is, to incorporate human input into the computation process. Crowdsourcing platforms provide a convenient way to integrate such human computation into a working system.
In this chapter, we discuss such computational design with crowds in the domain of parameter tweaking tasks in visual design. Parameter tweaking is often performed to maximize the aesthetic quality of designed objects. Computational design powered by crowds can solve this maximization problem by leveraging human computation. We discuss the opportunities and challenges of computational design with crowds with two illustrative examples: (1) estimating the objective function (specifically, preference learning from crowds' pairwise comparisons) to facilitate interactive design exploration by a designer and (2) directly searching for the optimal parameter setting that maximizes the objective function (specifically, crowds-in-the-loop Bayesian optimization).
Submitted 20 February, 2020;
originally announced February 2020.
-
Recreation of the Periodic Table with an Unsupervised Machine Learning Algorithm
Authors:
Minoru Kusaba,
Chang Liu,
Yukinori Koyama,
Kiyoyuki Terakura,
Ryo Yoshida
Abstract:
In 1869, the first draft of the periodic table was published by Russian chemist Dmitri Mendeleev. In terms of data science, his achievement can be viewed as a successful example of feature embedding based on human cognition: chemical properties of all known elements at that time were compressed onto the two-dimensional grid system for tabular display. In this study, we seek to answer the question of whether machine learning can reproduce or recreate the periodic table by using observed physicochemical properties of the elements. To achieve this goal, we developed a periodic table generator (PTG). The PTG is an unsupervised machine learning algorithm based on the generative topographic mapping (GTM), which can automate the translation of high-dimensional data into a tabular form with varying layouts on demand. The PTG autonomously produced various arrangements of chemical symbols, organized as a two-dimensional array, such as Mendeleev's periodic table, or a three-dimensional spiral table, according to the underlying periodicity in the given data. We further showed what the PTG learned from the element data and how the element features, such as melting point and electronegativity, are compressed to the lower-dimensional latent spaces.
Submitted 28 February, 2021; v1 submitted 23 December, 2019;
originally announced December 2019.
-
W-Net BF: DNN-based Beamformer Using Joint Training Approach
Authors:
Yuichiro Koyama,
Bhiksha Raj
Abstract:
Acoustic beamformers have been widely used to enhance audio signals. The best current methods are DNN-powered variants of the generalized eigenvalue beamformer, and DNN-based filter-estimation methods that directly compute beamforming filters. Both approaches, while effective, have blind spots in their generalizability. We propose a novel approach that combines both approaches into a single framework that attempts to exploit the best features of both. The resulting model, called a W-Net beamformer, includes two components: the first computes a noise-masked reference which the second uses to estimate beamforming filters. Results on data that include a wide variety of room and noise conditions, including static and mobile noise sources, show that the proposed beamformer outperforms other methods in all tested evaluation metrics.
Submitted 29 February, 2020; v1 submitted 31 October, 2019;
originally announced October 2019.
-
Metric Learning with Background Noise Class for Few-shot Detection of Rare Sound Events
Authors:
Kazuki Shimada,
Yuichiro Koyama,
Akira Inoue
Abstract:
Few-shot learning systems for sound event recognition have gained interest since they require only a few examples to adapt to new target classes without fine-tuning. However, such systems have only been applied to chunks of sounds for classification or verification. In this paper, we aim to achieve few-shot detection of rare sound events from query sequences that contain not only the target events but also other events and background noise. Therefore, it is necessary to prevent false positive reactions to both the other events and background noise. We propose metric learning with a background noise class for few-shot detection. The contribution is to present the explicit inclusion of background noise as an independent class, a suitable loss function that emphasizes this additional class, and a corresponding sampling strategy that assists training. It provides a feature space where the event classes and the background noise class are sufficiently separated. Evaluations on few-shot detection tasks, using DCASE 2017 task2 and ESC-50, show that our proposed method outperforms metric learning without considering the background noise class. The few-shot detection performance is also comparable to that of the DCASE 2017 task2 baseline system, which requires a huge amount of annotated audio data.
Submitted 18 February, 2020; v1 submitted 30 October, 2019;
originally announced October 2019.