-
Pre-training with Synthetic Patterns for Audio
Authors:
Yuchi Ishikawa,
Tatsuya Komatsu,
Yoshimitsu Aoki
Abstract:
In this paper, we propose to pre-train audio encoders using synthetic patterns instead of real audio data. Our proposed framework consists of two key elements. The first one is Masked Autoencoder (MAE), a self-supervised learning framework that learns from reconstructing data from randomly masked counterparts. MAEs tend to focus on low-level information such as visual patterns and regularities wit…
▽ More
In this paper, we propose to pre-train audio encoders using synthetic patterns instead of real audio data. Our proposed framework consists of two key elements. The first one is Masked Autoencoder (MAE), a self-supervised learning framework that learns from reconstructing data from randomly masked counterparts. MAEs tend to focus on low-level information such as visual patterns and regularities within data. Therefore, it is unimportant what is portrayed in the input, whether it be images, audio mel-spectrograms, or even synthetic patterns. This leads to the second key element, which is synthetic data. Synthetic data, unlike real audio, is free from privacy and licensing infringement issues. By combining MAEs and synthetic patterns, our framework enables the model to learn generalized feature representations without real data, while addressing the issues related to real audio. To evaluate the efficacy of our framework, we conduct extensive experiments across a total of 13 audio tasks and 17 synthetic datasets. The experiments provide insights into which types of synthetic patterns are effective for audio. Our results demonstrate that our framework achieves performance comparable to models pre-trained on AudioSet-2M and partially outperforms image-based pre-training methods.
△ Less
Submitted 1 October, 2024;
originally announced October 2024.
-
Real-time Speech Extraction Using Spatially Regularized Independent Low-rank Matrix Analysis and Rank-constrained Spatial Covariance Matrix Estimation
Authors:
Yuto Ishikawa,
Kohei Konaka,
Tomohiko Nakamura,
Norihiro Takamune,
Hiroshi Saruwatari
Abstract:
Real-time speech extraction is an important challenge with various applications such as speech recognition in a human-like avatar/robot. In this paper, we propose the real-time extension of a speech extraction method based on independent low-rank matrix analysis (ILRMA) and rank-constrained spatial covariance matrix estimation (RCSCME). The RCSCME-based method is a multichannel blind speech extrac…
▽ More
Real-time speech extraction is an important challenge with various applications such as speech recognition in a human-like avatar/robot. In this paper, we propose the real-time extension of a speech extraction method based on independent low-rank matrix analysis (ILRMA) and rank-constrained spatial covariance matrix estimation (RCSCME). The RCSCME-based method is a multichannel blind speech extraction method that demonstrates superior speech extraction performance in diffuse noise environments. To improve the performance, we introduce spatial regularization into the ILRMA part of the RCSCME-based speech extraction and design two regularizers. Speech extraction experiments demonstrated that the proposed methods can function in real time and the designed regularizers improve the speech extraction performance.
△ Less
Submitted 19 March, 2024;
originally announced March 2024.
-
Reconstructing Continuous Light Field From Single Coded Image
Authors:
Yuya Ishikawa,
Keita Takahashi,
Chihiro Tsutake,
Toshiaki Fujii
Abstract:
We propose a method for reconstructing a continuous light field of a target scene from a single observed image. Our method takes the best of two worlds: joint aperture-exposure coding for compressive light-field acquisition, and a neural radiance field (NeRF) for view synthesis. Joint aperture-exposure coding implemented in a camera enables effective embedding of 3-D scene information into an obse…
▽ More
We propose a method for reconstructing a continuous light field of a target scene from a single observed image. Our method takes the best of two worlds: joint aperture-exposure coding for compressive light-field acquisition, and a neural radiance field (NeRF) for view synthesis. Joint aperture-exposure coding implemented in a camera enables effective embedding of 3-D scene information into an observed image, but in previous works, it was used only for reconstructing discretized light-field views. NeRF-based neural rendering enables high quality view synthesis of a 3-D scene from continuous viewpoints, but when only a single image is given as the input, it struggles to achieve satisfactory quality. Our method integrates these two techniques into an efficient and end-to-end trainable pipeline. Trained on a wide variety of scenes, our method can reconstruct continuous light fields accurately and efficiently without any test time optimization. To our knowledge, this is the first work to bridge two worlds: camera design for efficiently acquiring 3-D information and neural rendering.
△ Less
Submitted 16 November, 2023;
originally announced November 2023.
-
Proximal binaural sound can induce subjective frisson
Authors:
Shiori Honda,
Yuri Ishikawa,
Rei Konno,
Eiko Imai,
Natsumi Nomiyama,
Kazuki Sakurada,
Takuya Koumura,
Hirohito M. Kondo,
Shigeto Furukawa,
Shinya Fujii,
Masashi Nakatani
Abstract:
Auditory frisson is the experience of feeling of cold or shivering related to sound in the absence of a physical cold stimulus. Multiple examples of frisson-inducing sounds have been reported, but the mechanism of auditory frisson remains elusive. Typical frisson-inducing sounds may contain a looming effect, in which a sound appears to approach the listener's peripersonal space. Previous studies o…
▽ More
Auditory frisson is the experience of feeling of cold or shivering related to sound in the absence of a physical cold stimulus. Multiple examples of frisson-inducing sounds have been reported, but the mechanism of auditory frisson remains elusive. Typical frisson-inducing sounds may contain a looming effect, in which a sound appears to approach the listener's peripersonal space. Previous studies on sound in peripersonal space have provided objective measurements of sound-inducing effects, but few have investigated the subjective experience of frisson-inducing sounds. Here we explored whether it is possible to produce subjective feelings of frisson by moving a noise sound (white noise, rolling beads noise, or frictional noise produced by rubbing a plastic bag) stimulus around a listener's head. Our results demonstrated that sound-induced frisson can be experienced stronger when auditory stimuli are rotated around the head (binaural moving sounds) than the one without the rotation (monaural static sounds), regardless of the source of the noise sound. Pearson's correlation analysis showed that several acoustic features of auditory stimuli, such as variance of interaural level difference (ILD), loudness, and sharpness, were correlated with the magnitude of subjective frisson. We had also observed that the subjective feelings of frisson by moving a musical sound had increased comparing with a static musical sound.
△ Less
Submitted 8 April, 2020; v1 submitted 15 April, 2019;
originally announced April 2019.