Search | arXiv e-print repository

SoundSil-DS: Deep Denoising and Segmentation of Sound-field Images with Silhouettes

Authors: Risako Tanigawa, Kenji Ishikawa, Noboru Harada, Yasuhiro Oikawa

Abstract: Development of optical technology has enabled imaging of two-dimensional (2D) sound fields. This acousto-optic sensing enables understanding of the interaction between sound and objects such as reflection and diffraction. Moreover, it is expected to be used an advanced measurement technology for sonars in self-driving vehicles and assistive robots. However, the low sound-pressure sensitivity of th… ▽ More Development of optical technology has enabled imaging of two-dimensional (2D) sound fields. This acousto-optic sensing enables understanding of the interaction between sound and objects such as reflection and diffraction. Moreover, it is expected to be used an advanced measurement technology for sonars in self-driving vehicles and assistive robots. However, the low sound-pressure sensitivity of the acousto-optic sensing results in high intensity of noise on images. Therefore, denoising is an essential task to visualize and analyze the sound fields. In addition to denoising, segmentation of sound and object silhouette is also required to analyze interactions between them. In this paper, we propose sound-field-images-with-object-silhouette denoising and segmentation (SoundSil-DS) that jointly perform denoising and segmentation for sound fields and object silhouettes on a visualized image. We developed a new model based on the current state-of-the-art denoising network. We also created a dataset to train and evaluate the proposed method through acoustic simulation. The proposed method was evaluated using both simulated and measured data. We confirmed that our method can applied to experimentally measured data. These results suggest that the proposed method may improve the post-processing for sound fields, such as physical model-based three-dimensional reconstruction since it can remove unwanted noise and separate sound fields and other object silhouettes. Our code is available at https://github.com/nttcslab/soundsil-ds. △ Less

Submitted 11 November, 2024; originally announced November 2024.

Comments: 13 pages, 12 figures, 5 tables. Accepted by WACV 2025

arXiv:2211.08246 [pdf, other]

doi 10.1109/TASLP.2022.3221041

Online Phase Reconstruction via DNN-based Phase Differences Estimation

Authors: Yoshiki Masuyama, Kohei Yatabe, Kento Nagatomo, Yasuhiro Oikawa

Abstract: This paper presents a two-stage online phase reconstruction framework using causal deep neural networks (DNNs). Phase reconstruction is a task of recovering phase of the short-time Fourier transform (STFT) coefficients only from the corresponding magnitude. However, phase is sensitive to waveform shifts and not easy to estimate from the magnitude even with a DNN. To overcome this problem, we propo… ▽ More This paper presents a two-stage online phase reconstruction framework using causal deep neural networks (DNNs). Phase reconstruction is a task of recovering phase of the short-time Fourier transform (STFT) coefficients only from the corresponding magnitude. However, phase is sensitive to waveform shifts and not easy to estimate from the magnitude even with a DNN. To overcome this problem, we propose to use DNNs for estimating differences of phase between adjacent time-frequency bins. We show that convolutional neural networks are suitable for phase difference estimation, according to the theoretical relation between partial derivatives of STFT phase and magnitude. The estimated phase differences are used for reconstructing phase by solving a weighted least squares problem in a frame-by-frame manner. In contrast to existing DNN-based phase reconstruction methods, the proposed framework is causal and does not require any iterative procedure. The experiments showed that the proposed method outperforms existing online methods and a DNN-based method for phase reconstruction. △ Less

Submitted 12 November, 2022; originally announced November 2022.

Comments: Accepted to IEEE/ACM Trans. Audio, Speech, and Language Processing

arXiv:2202.08458 [pdf, other]

Wearable SELD dataset: Dataset for sound event localization and detection using wearable devices around head

Authors: Kento Nagatomo, Masahiro Yasuda, Kohei Yatabe, Shoichiro Saito, Yasuhiro Oikawa

Abstract: Sound event localization and detection (SELD) is a combined task of identifying the sound event and its direction. Deep neural networks (DNNs) are utilized to associate them with the sound signals observed by a microphone array. Although ambisonic microphones are popular in the literature of SELD, they might limit the range of applications due to their predetermined geometry. Some applications (in… ▽ More Sound event localization and detection (SELD) is a combined task of identifying the sound event and its direction. Deep neural networks (DNNs) are utilized to associate them with the sound signals observed by a microphone array. Although ambisonic microphones are popular in the literature of SELD, they might limit the range of applications due to their predetermined geometry. Some applications (including those for pedestrians that perform SELD while walking) require a wearable microphone array whose geometry can be designed to suit the task. In this paper, for the development of such a wearable SELD, we propose a dataset named Wearable SELD dataset. It consists of data recorded by 24 microphones placed on a head and torso simulators (HATS) with some accessories mimicking wearable devices (glasses, earphones, and headphones). We also provide experimental results of SELD using the proposed dataset and SELDNet to investigate the effect of microphone configuration. △ Less

Submitted 17 February, 2022; originally announced February 2022.

Comments: 5 pages, 6 figures, accepted to IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), 2022

arXiv:2202.08028 [pdf, other]

APPLADE: Adjustable Plug-and-play Audio Declipper Combining DNN with Sparse Optimization

Authors: Tomoro Tanaka, Kohei Yatabe, Masahiro Yasuda, Yasuhiro Oikawa

Abstract: In this paper, we propose an audio declipping method that takes advantages of both sparse optimization and deep learning. Since sparsity-based audio declipping methods have been developed upon constrained optimization, they are adjustable and well-studied in theory. However, they always uniformly promote sparsity and ignore the individual properties of a signal. Deep neural network (DNN)-based met… ▽ More In this paper, we propose an audio declipping method that takes advantages of both sparse optimization and deep learning. Since sparsity-based audio declipping methods have been developed upon constrained optimization, they are adjustable and well-studied in theory. However, they always uniformly promote sparsity and ignore the individual properties of a signal. Deep neural network (DNN)-based methods can learn the properties of target signals and use them for audio declipping. Still, they cannot perform well if the training data have mismatches and/or constraints in the time domain are not imposed. In the proposed method, we use a DNN in an optimization algorithm. It is inspired by an idea called plug-and-play (PnP) and enables us to promote sparsity based on the learned information of data, considering constraints in the time domain. Our experiments confirmed that the proposed method is stable and robust to mismatches between training and test data. △ Less

Submitted 16 February, 2022; originally announced February 2022.

Comments: 5 pages, 7 figures, accepted to IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), 2022

arXiv:2007.13976 [pdf, other]

Self-supervised Neural Audio-Visual Sound Source Localization via Probabilistic Spatial Modeling

Authors: Yoshiki Masuyama, Yoshiaki Bando, Kohei Yatabe, Yoko Sasaki, Masaki Onishi, Yasuhiro Oikawa

Abstract: Detecting sound source objects within visual observation is important for autonomous robots to comprehend surrounding environments. Since sounding objects have a large variety with different appearances in our living environments, labeling all sounding objects is impossible in practice. This calls for self-supervised learning which does not require manual labeling. Most of conventional self-superv… ▽ More Detecting sound source objects within visual observation is important for autonomous robots to comprehend surrounding environments. Since sounding objects have a large variety with different appearances in our living environments, labeling all sounding objects is impossible in practice. This calls for self-supervised learning which does not require manual labeling. Most of conventional self-supervised learning uses monaural audio signals and images and cannot distinguish sound source objects having similar appearances due to poor spatial information in audio signals. To solve this problem, this paper presents a self-supervised training method using 360° images and multichannel audio signals. By incorporating with the spatial information in multichannel audio signals, our method trains deep neural networks (DNNs) to distinguish multiple sound source objects. Our system for localizing sound source objects in the image is composed of audio and visual DNNs. The visual DNN is trained to localize sound source candidates within an input image. The audio DNN verifies whether each candidate actually produces sound or not. These DNNs are jointly trained in a self-supervised manner based on a probabilistic spatial audio model. Experimental results with simulated data showed that the DNNs trained by our method localized multiple speakers. We also demonstrate that the visual DNN detected objects including talking visitors and specific exhibits from real data recorded in a science museum. △ Less

Submitted 27 July, 2020; originally announced July 2020.

Comments: Accepted for publication in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

arXiv:2002.05843 [pdf, other]

Real-time speech enhancement using equilibriated RNN

Authors: Daiki Takeuchi, Kohei Yatabe, Yuma Koizumi, Yasuhiro Oikawa, Noboru Harada

Abstract: We propose a speech enhancement method using a causal deep neural network~(DNN) for real-time applications. DNN has been widely used for estimating a time-frequency~(T-F) mask which enhances a speech signal. One popular DNN structure for that is a recurrent neural network~(RNN) owing to its capability of effectively modelling time-sequential data like speech. In particular, the long short-term mem… ▽ More We propose a speech enhancement method using a causal deep neural network~(DNN) for real-time applications. DNN has been widely used for estimating a time-frequency~(T-F) mask which enhances a speech signal. One popular DNN structure for that is a recurrent neural network~(RNN) owing to its capability of effectively modelling time-sequential data like speech. In particular, the long short-term memory (LSTM) is often used to alleviate the vanishing/exploding gradient problem which makes the training of an RNN difficult. However, the number of parameters of LSTM is increased as the price of mitigating the difficulty of training, which requires more computational resources. For real-time speech enhancement, it is preferable to use a smaller network without losing the performance. In this paper, we propose to use the equilibriated recurrent neural network~(ERNN) for avoiding the vanishing/exploding gradient problem without increasing the number of parameters. The proposed structure is causal, which requires only the information from the past, in order to apply it in real-time. Compared to the uni- and bi-directional LSTM networks, the proposed method achieved the similar performance with much fewer parameters. △ Less

Submitted 13 February, 2020; originally announced February 2020.

Comments: To appear in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2020)

arXiv:2002.05832 [pdf, other]

Phase reconstruction based on recurrent phase unwrapping with deep neural networks

Authors: Yoshiki Masuyama, Kohei Yatabe, Yuma Koizumi, Yasuhiro Oikawa, Noboru Harada

Abstract: Phase reconstruction, which estimates phase from a given amplitude spectrogram, is an active research field in acoustical signal processing with many applications including audio synthesis. To take advantage of rich knowledge from data, several studies presented deep neural network (DNN)--based phase reconstruction methods. However, the training of a DNN for phase reconstruction is not an easy tas… ▽ More Phase reconstruction, which estimates phase from a given amplitude spectrogram, is an active research field in acoustical signal processing with many applications including audio synthesis. To take advantage of rich knowledge from data, several studies presented deep neural network (DNN)--based phase reconstruction methods. However, the training of a DNN for phase reconstruction is not an easy task because phase is sensitive to the shift of a waveform. To overcome this problem, we propose a DNN-based two-stage phase reconstruction method. In the proposed method, DNNs estimate phase derivatives instead of phase itself, which allows us to avoid the sensitivity problem. Then, phase is recursively estimated based on the estimated derivatives, which is named recurrent phase unwrapping (RPU). The experimental results confirm that the proposed method outperformed the direct phase estimation by a DNN. △ Less

Submitted 13 February, 2020; originally announced February 2020.

Comments: To appear at the 45th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2020)

arXiv:1911.10764 [pdf, other]

Invertible DNN-based nonlinear time-frequency transform for speech enhancement

Authors: Daiki Takeuchi, Kohei Yatabe, Yuma Koizumi, Yasuhiro Oikawa, Noboru Harada

Abstract: We propose an end-to-end speech enhancement method with trainable time-frequency~(T-F) transform based on invertible deep neural network~(DNN). The resent development of speech enhancement is brought by using DNN. The ordinary DNN-based speech enhancement employs T-F transform, typically the short-time Fourier transform~(STFT), and estimates a T-F mask using DNN. On the other hand, some methods ha… ▽ More We propose an end-to-end speech enhancement method with trainable time-frequency~(T-F) transform based on invertible deep neural network~(DNN). The resent development of speech enhancement is brought by using DNN. The ordinary DNN-based speech enhancement employs T-F transform, typically the short-time Fourier transform~(STFT), and estimates a T-F mask using DNN. On the other hand, some methods have considered end-to-end networks which directly estimate the enhanced signals without T-F transform. While end-to-end methods have shown promising results, they are black boxes and hard to understand. Therefore, some end-to-end methods used a DNN to learn the linear T-F transform which is much easier to understand. However, the learned transform may not have a property important for ordinary signal processing. In this paper, as the important property of the T-F transform, perfect reconstruction is considered. An invertible nonlinear T-F transform is constructed by DNNs and learned from data so that the obtained transform is perfectly reconstructing filterbank. △ Less

Submitted 13 February, 2020; v1 submitted 25 November, 2019; originally announced November 2019.

Comments: To appear in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2020)

arXiv:1903.08876 [pdf, other]

Data-driven design of perfect reconstruction filterbank for DNN-based sound source enhancement

Authors: Daiki Takeuchi, Kohei Yatabe, Yuma Koizumi, Yasuhiro Oikawa, Noboru Harada

Abstract: We propose a data-driven design method of perfect-reconstruction filterbank (PRFB) for sound-source enhancement (SSE) based on deep neural network (DNN). DNNs have been used to estimate a time-frequency (T-F) mask in the short-time Fourier transform (STFT) domain. Their training is more stable when a simple cost function as mean-squared error (MSE) is utilized comparing to some advanced cost such… ▽ More We propose a data-driven design method of perfect-reconstruction filterbank (PRFB) for sound-source enhancement (SSE) based on deep neural network (DNN). DNNs have been used to estimate a time-frequency (T-F) mask in the short-time Fourier transform (STFT) domain. Their training is more stable when a simple cost function as mean-squared error (MSE) is utilized comparing to some advanced cost such as objective sound quality assessments. However, such a simple cost function inherits strong assumptions on the statistics of the target and/or noise which is often not satisfied, and the mismatch of assumption results in degraded performance. In this paper, we propose to design the frequency scale of PRFB from training data so that the assumption on MSE is satisfied. For designing the frequency scale, the warped filterbank frame (WFBF) is considered as PRFB. The frequency characteristic of learned WFBF was in between STFT and the wavelet transform, and its effectiveness was confirmed by comparison with a standard STFT-based DNN whose input feature is compressed into the mel scale. △ Less

Submitted 21 March, 2019; originally announced March 2019.

Comments: 5 pages, to appear in IEEE ICASSP 2019 (Paper Code: AASP-P8.8, Session: Spatial Audio, Audio Enhancement and Bandwidth Extension)

arXiv:1903.05603 [pdf, other]

Low-rankness of Complex-valued Spectrogram and Its Application to Phase-aware Audio Processing

Authors: Yoshiki Masuyama, Kohei Yatabe, Yasuhiro Oikawa

Abstract: Low-rankness of amplitude spectrograms has been effectively utilized in audio signal processing methods including non-negative matrix factorization. However, such methods have a fundamental limitation owing to their amplitude-only treatment where the phase of the observed signal is utilized for resynthesizing the estimated signal. In order to address this limitation, we directly treat a complex-va… ▽ More Low-rankness of amplitude spectrograms has been effectively utilized in audio signal processing methods including non-negative matrix factorization. However, such methods have a fundamental limitation owing to their amplitude-only treatment where the phase of the observed signal is utilized for resynthesizing the estimated signal. In order to address this limitation, we directly treat a complex-valued spectrogram and show a complex-valued spectrogram of a sum of sinusoids can be approximately low-rank by modifying its phase. For evaluating the applicability of the proposed low-rank representation, we further propose a convex prior emphasizing harmonic signals, and it is applied to audio denoising. △ Less

Submitted 13 March, 2019; originally announced March 2019.

Comments: 5 pages, to appear in IEEE ICASSP 2019 (Paper Code: AASP-P13.9, Session: Acoustic Scene Classification and Music Signal Analysis)

arXiv:1903.05600 [pdf, other]

Phase-aware Harmonic/Percussive Source Separation via Convex Optimization

Authors: Yoshiki Masuyama, Kohei Yatabe, Yasuhiro Oikawa

Abstract: Decomposition of an audio mixture into harmonic and percussive components, namely harmonic/percussive source separation (HPSS), is a useful pre-processing tool for many audio applications. Popular approaches to HPSS exploit the distinctive source-specific structures of power spectrograms. However, such approaches consider only power spectrograms, and the phase remains intact for resynthesizing the… ▽ More Decomposition of an audio mixture into harmonic and percussive components, namely harmonic/percussive source separation (HPSS), is a useful pre-processing tool for many audio applications. Popular approaches to HPSS exploit the distinctive source-specific structures of power spectrograms. However, such approaches consider only power spectrograms, and the phase remains intact for resynthesizing the separated signals. In this paper, we propose a phase-aware HPSS method based on the structure of the phase of harmonic components. It is formulated as a convex optimization problem in the time domain, which enables the simultaneous treatment of both amplitude and phase. The numerical experiment validates the effectiveness of the proposed method. △ Less

Submitted 13 March, 2019; originally announced March 2019.

Comments: 5 pages, to appear in IEEE ICASSP 2019 (Paper Code: AASP-P16.5, Session: Music Signal Analysis, Feedback and Echo Cancellation and Equalization)

arXiv:1903.03971 [pdf, other]

Deep Griffin-Lim Iteration

Authors: Yoshiki Masuyama, Kohei Yatabe, Yuma Koizumi, Yasuhiro Oikawa, Noboru Harada

Abstract: This paper presents a novel phase reconstruction method (only from a given amplitude spectrogram) by combining a signal-processing-based approach and a deep neural network (DNN). To retrieve a time-domain signal from its amplitude spectrogram, the corresponding phase is required. One of the popular phase reconstruction methods is the Griffin-Lim algorithm (GLA), which is based on the redundancy of… ▽ More This paper presents a novel phase reconstruction method (only from a given amplitude spectrogram) by combining a signal-processing-based approach and a deep neural network (DNN). To retrieve a time-domain signal from its amplitude spectrogram, the corresponding phase is required. One of the popular phase reconstruction methods is the Griffin-Lim algorithm (GLA), which is based on the redundancy of the short-time Fourier transform. However, GLA often involves many iterations and produces low-quality signals owing to the lack of prior knowledge of the target signal. In order to address these issues, in this study, we propose an architecture which stacks a sub-block including two GLA-inspired fixed layers and a DNN. The number of stacked sub-blocks is adjustable, and we can trade the performance and computational load based on requirements of applications. The effectiveness of the proposed method is investigated by reconstructing phases from amplitude spectrograms of speeches. △ Less

Submitted 10 March, 2019; originally announced March 2019.

Comments: 5 pages, to appear in IEEE ICASSP 2019 (Paper Code: AASP-L3.1, Session: Source Separation and Speech Enhancement I)

arXiv:1811.08783 [pdf, other]

Designing nearly tight window for improving time-frequency masking

Authors: Tsubasa Kusano, Yoshiki Masuyama, Kohei Yatabe, Yasuhiro Oikawa

Abstract: Many audio signal processing methods are formulated in the time-frequency (T-F) domain which is obtained by the short-time Fourier transform (STFT). The properties of the STFT are fully characterized by window function, number of frequency channels, and time-shift. Thus, designing a better window is important for improving the performance of the processing especially when a less redundant T-F repr… ▽ More Many audio signal processing methods are formulated in the time-frequency (T-F) domain which is obtained by the short-time Fourier transform (STFT). The properties of the STFT are fully characterized by window function, number of frequency channels, and time-shift. Thus, designing a better window is important for improving the performance of the processing especially when a less redundant T-F representation is desirable. While many window functions have been proposed in the literature, they are designed to have a good frequency response for analysis, which may not perform well in terms of signal processing. The window design must take the effect of the reconstruction (from the T-F domain into the time domain) into account for improving the performance. In this paper, an optimization-based design method of a nearly tight window is proposed to obtain a window performing well for the T-F domain signal processing. △ Less

Submitted 4 February, 2019; v1 submitted 17 November, 2018; originally announced November 2018.

Showing 1–13 of 13 results for author: Oikawa, Y