Search | arXiv e-print repository

Pre-training with Synthetic Patterns for Audio

Authors: Yuchi Ishikawa, Tatsuya Komatsu, Yoshimitsu Aoki

Abstract: In this paper, we propose to pre-train audio encoders using synthetic patterns instead of real audio data. Our proposed framework consists of two key elements. The first one is Masked Autoencoder (MAE), a self-supervised learning framework that learns from reconstructing data from randomly masked counterparts. MAEs tend to focus on low-level information such as visual patterns and regularities wit… ▽ More In this paper, we propose to pre-train audio encoders using synthetic patterns instead of real audio data. Our proposed framework consists of two key elements. The first one is Masked Autoencoder (MAE), a self-supervised learning framework that learns from reconstructing data from randomly masked counterparts. MAEs tend to focus on low-level information such as visual patterns and regularities within data. Therefore, it is unimportant what is portrayed in the input, whether it be images, audio mel-spectrograms, or even synthetic patterns. This leads to the second key element, which is synthetic data. Synthetic data, unlike real audio, is free from privacy and licensing infringement issues. By combining MAEs and synthetic patterns, our framework enables the model to learn generalized feature representations without real data, while addressing the issues related to real audio. To evaluate the efficacy of our framework, we conduct extensive experiments across a total of 13 audio tasks and 17 synthetic datasets. The experiments provide insights into which types of synthetic patterns are effective for audio. Our results demonstrate that our framework achieves performance comparable to models pre-trained on AudioSet-2M and partially outperforms image-based pre-training methods. △ Less

Submitted 1 October, 2024; originally announced October 2024.

Comments: Submitted to ICASSP'25

arXiv:2403.12477 [pdf, other]

Real-time Speech Extraction Using Spatially Regularized Independent Low-rank Matrix Analysis and Rank-constrained Spatial Covariance Matrix Estimation

Authors: Yuto Ishikawa, Kohei Konaka, Tomohiko Nakamura, Norihiro Takamune, Hiroshi Saruwatari

Abstract: Real-time speech extraction is an important challenge with various applications such as speech recognition in a human-like avatar/robot. In this paper, we propose the real-time extension of a speech extraction method based on independent low-rank matrix analysis (ILRMA) and rank-constrained spatial covariance matrix estimation (RCSCME). The RCSCME-based method is a multichannel blind speech extrac… ▽ More Real-time speech extraction is an important challenge with various applications such as speech recognition in a human-like avatar/robot. In this paper, we propose the real-time extension of a speech extraction method based on independent low-rank matrix analysis (ILRMA) and rank-constrained spatial covariance matrix estimation (RCSCME). The RCSCME-based method is a multichannel blind speech extraction method that demonstrates superior speech extraction performance in diffuse noise environments. To improve the performance, we introduce spatial regularization into the ILRMA part of the RCSCME-based speech extraction and design two regularizers. Speech extraction experiments demonstrated that the proposed methods can function in real time and the designed regularizers improve the speech extraction performance. △ Less

Submitted 19 March, 2024; originally announced March 2024.

Comments: 5 pages, 3 figures, accepted at HSCMA 2024

arXiv:2311.09646 [pdf, other]

doi 10.1109/ACCESS.2023.3314340

Reconstructing Continuous Light Field From Single Coded Image

Authors: Yuya Ishikawa, Keita Takahashi, Chihiro Tsutake, Toshiaki Fujii

Abstract: We propose a method for reconstructing a continuous light field of a target scene from a single observed image. Our method takes the best of two worlds: joint aperture-exposure coding for compressive light-field acquisition, and a neural radiance field (NeRF) for view synthesis. Joint aperture-exposure coding implemented in a camera enables effective embedding of 3-D scene information into an obse… ▽ More We propose a method for reconstructing a continuous light field of a target scene from a single observed image. Our method takes the best of two worlds: joint aperture-exposure coding for compressive light-field acquisition, and a neural radiance field (NeRF) for view synthesis. Joint aperture-exposure coding implemented in a camera enables effective embedding of 3-D scene information into an observed image, but in previous works, it was used only for reconstructing discretized light-field views. NeRF-based neural rendering enables high quality view synthesis of a 3-D scene from continuous viewpoints, but when only a single image is given as the input, it struggles to achieve satisfactory quality. Our method integrates these two techniques into an efficient and end-to-end trainable pipeline. Trained on a wide variety of scenes, our method can reconstruct continuous light fields accurately and efficiently without any test time optimization. To our knowledge, this is the first work to bridge two worlds: camera design for efficiently acquiring 3-D information and neural rendering. △ Less

Submitted 16 November, 2023; originally announced November 2023.

Journal ref: IEEE Access, Volume 11, Pages 99387-99396, 2023

arXiv:1904.06851 [pdf]

doi 10.3389/fpsyg.2020.00316.

Proximal binaural sound can induce subjective frisson

Authors: Shiori Honda, Yuri Ishikawa, Rei Konno, Eiko Imai, Natsumi Nomiyama, Kazuki Sakurada, Takuya Koumura, Hirohito M. Kondo, Shigeto Furukawa, Shinya Fujii, Masashi Nakatani

Abstract: Auditory frisson is the experience of feeling of cold or shivering related to sound in the absence of a physical cold stimulus. Multiple examples of frisson-inducing sounds have been reported, but the mechanism of auditory frisson remains elusive. Typical frisson-inducing sounds may contain a looming effect, in which a sound appears to approach the listener's peripersonal space. Previous studies o… ▽ More Auditory frisson is the experience of feeling of cold or shivering related to sound in the absence of a physical cold stimulus. Multiple examples of frisson-inducing sounds have been reported, but the mechanism of auditory frisson remains elusive. Typical frisson-inducing sounds may contain a looming effect, in which a sound appears to approach the listener's peripersonal space. Previous studies on sound in peripersonal space have provided objective measurements of sound-inducing effects, but few have investigated the subjective experience of frisson-inducing sounds. Here we explored whether it is possible to produce subjective feelings of frisson by moving a noise sound (white noise, rolling beads noise, or frictional noise produced by rubbing a plastic bag) stimulus around a listener's head. Our results demonstrated that sound-induced frisson can be experienced stronger when auditory stimuli are rotated around the head (binaural moving sounds) than the one without the rotation (monaural static sounds), regardless of the source of the noise sound. Pearson's correlation analysis showed that several acoustic features of auditory stimuli, such as variance of interaural level difference (ILD), loudness, and sharpness, were correlated with the magnitude of subjective frisson. We had also observed that the subjective feelings of frisson by moving a musical sound had increased comparing with a static musical sound. △ Less

Submitted 8 April, 2020; v1 submitted 15 April, 2019; originally announced April 2019.

Comments: 21 pages, 3 figures, 3 tables, 3 supplemental figures, 3 supplemental tables

Journal ref: Front Psychol. 2020 Mar 3;11:316

Showing 1–4 of 4 results for author: Ishikawa, Y