-
Combined assessment of auditory distance perception and externalization
Authors:
Henning Hoppe,
Steven van de Par,
Virginia Flanagin,
Stephan D. Ewert
Abstract:
This study investigates frontal auditory distance perception (ADP) and externalization in virtual audio-visual environments, considering effects of headphone rendering method, room size, reverberation, and visual representation of the room. Either head-related impulse responses from an artificial head or a spherical head model were used for diotic (monophonic) and binaural auralizations with and without real-time head tracking. The visuals were presented through a head-mounted display. Two differently sized rooms as well as an infinitely extending space (echoic and anechoic) were used in which an invisible frontal virtual sound source was located. Additionally, the effect of a freely movable loudspeaker for visually indicating perceived distances was investigated. Both ADP and externalization were significantly affected by room size, but otherwise the two perceptual quantities differed in their outcomes. Room visibility significantly affected ADP, leading to considerable overestimations and more variability in the absence of a visual environment, whereas externalization was not affected. The movable loudspeaker significantly improved distance estimation but did not affect externalization. For reverberation, a (non-significant) trend of improved ADP was observed, whereas externalization was significantly improved. Different headphone renderings did not significantly affect ADP or externalization, although a clear trend was observed for externalization.
Submitted 26 August, 2024;
originally announced August 2024.
-
The effect of self-motion and room familiarity on sound source localization in virtual environments
Authors:
Niklas Isserstedt,
Stephan D. Ewert,
Virginia Flanagin,
Steven van de Par
Abstract:
This paper investigates the influence of lateral horizontal self-motion of participants during signal presentation on distance and azimuth perception for frontal sound sources in a rectangular room. Additionally, the effect of deviating room acoustics for a single sound presentation embedded in a sequence of presentations using a baseline room acoustics for familiarization is analyzed. For this purpose, two experiments were conducted using audiovisual virtual reality technology with dynamic head-tracking and real-time auralization over headphones combined with visual rendering of the room using a head-mounted display. Results show an improved distance perception accuracy when participants moved laterally during signal presentation instead of staying at a fixed position, with only head movements allowed. Adaptation to the room acoustics also improves distance perception accuracy. Azimuth perception seems to be independent of lateral movements during signal presentation and could even be negatively influenced by the familiarity of the used room acoustics.
Submitted 25 August, 2024;
originally announced August 2024.
-
Evaluation of Virtual Acoustic Environments with Different Acoustic Level of Detail
Authors:
Stefan Fichna,
Steven van de Par,
Stephan D. Ewert
Abstract:
Virtual acoustic environments enable the creation and simulation of realistic and ecologically valid daily-life situations with applications in hearing research and audiology. Here, reverberant indoor environments play an important role. For real-time applications, simplifications in the room acoustics simulation are required; however, it remains unclear what acoustic level of detail (ALOD) is necessary to capture all perceptually relevant effects. This study investigates the effect of varying ALOD in the simulation of three different real environments: a living room with a coupled kitchen, a pub, and an underground station. ALOD was varied by generating different numbers of image sources for early reflections, or by excluding geometrical room details specific to each environment. The simulations were perceptually evaluated using headphones in comparison to binaural room impulse responses measured with a dummy head in the corresponding real environments, and partly using loudspeakers. The study assessed the perceived overall difference for a pulse and a speech token. Furthermore, plausibility and externalization were evaluated. The results show that a strong reduction in ALOD is possible while obtaining similar plausibility and externalization as with the dummy head recordings. The number and accuracy of early reflections appear less relevant, provided diffuse late reverberation is appropriately accounted for.
Submitted 10 August, 2023; v1 submitted 29 June, 2023;
originally announced June 2023.
-
On the relevance of acoustic measurements for creating realistic virtual acoustic environments
Authors:
Siegfried Gündert,
Stephan D. Ewert,
Steven van de Par
Abstract:
Geometrical approaches for room acoustics simulation have the advantage of requiring limited computational resources while still achieving a high perceptual plausibility. A common approach is using the image source model for direct and early reflections in connection with further simplified models such as a feedback delay network for the diffuse reverberant tail. When recreating real spaces as virtual acoustic environments using room acoustics simulation, the perceptual relevance of individual parameters in the simulation is unclear. Here we investigate the importance of underlying acoustical measurements and technical evaluation methods to obtain high-quality room acoustics simulations in agreement with dummy-head recordings of a real space. We focus on the role of source directivity. The effect of including measured, modelled, and omnidirectional source directivity in room acoustics simulations was assessed in comparison to the measured reference. Technical evaluation strategies to verify and improve the accuracy of various elements in the simulation processing chain from source, the room properties, to the receiver are presented. Perceptual results from an ABX listening experiment with random speech tokens are shown and compared with technical measures for a ranking of simulation approaches.
Submitted 29 June, 2023;
originally announced June 2023.
-
Computationally-efficient and perceptually-motivated rendering of diffuse reflections in room acoustics simulation
Authors:
Stephan D. Ewert,
Nico Gößling,
Oliver Buttler,
Steven van de Par,
Hongmei Hu
Abstract:
Geometrical acoustics is well suited for simulating room reverberation in interactive real-time applications. While the image source model (ISM) is exceptionally fast, the restriction to specular reflections impacts its perceptual plausibility. To account for diffuse late reverberation, hybrid approaches have been proposed, e.g., using a feedback delay network (FDN) in combination with the ISM. Here, a computationally-efficient, digital-filter approach is suggested to account for effects of non-specular reflections in the ISM and to couple scattered sound into a diffuse reverberation model using a spatially rendered FDN. Depending on the scattering coefficient of a room boundary, energy of each image source is split into a specular and a scattered part which is added to the diffuse sound field. Temporal effects as observed for an infinite ideal diffuse (Lambertian) reflector are simulated using cascaded all-pass filters. Effects of scattering and multiple (inter-) reflections caused by larger geometric disturbances at walls and by objects in the room are accounted for in a highly simplified manner. Using a single parameter to quantify deviations from an empty shoebox room, each reflection is temporally smeared using cascaded all-pass filters. The proposed method was perceptually evaluated against dummy head recordings of real rooms.
Submitted 29 June, 2023;
originally announced June 2023.
-
Contrastive Learning-Based Audio to Lyrics Alignment for Multiple Languages
Authors:
Simon Durand,
Daniel Stoller,
Sebastian Ewert
Abstract:
Lyrics alignment has gained considerable attention in recent years. State-of-the-art systems either re-use established speech recognition toolkits, or design end-to-end solutions involving a Connectionist Temporal Classification (CTC) loss. However, both approaches suffer from specific weaknesses: toolkits are known for their complexity, and CTC systems use a loss designed for transcription which can limit alignment accuracy. In this paper, we instead use a contrastive learning procedure that derives cross-modal embeddings linking the audio and text domains. This way, we obtain a novel system that is simple to train end-to-end, can make use of weakly annotated training data, jointly learns a powerful text model, and is tailored to alignment. The system is not only the first to yield an average absolute error below 0.2 seconds on the standard Jamendo dataset but is also robust to other languages, even when trained on English data only. Finally, we release word-level alignments for the JamendoLyrics Multi-Lang dataset.
Submitted 13 June, 2023;
originally announced June 2023.
-
Towards Robust Unsupervised Disentanglement of Sequential Data -- A Case Study Using Music Audio
Authors:
Yin-Jyun Luo,
Sebastian Ewert,
Simon Dixon
Abstract:
Disentangled sequential autoencoders (DSAEs) represent a class of probabilistic graphical models that describes an observed sequence with dynamic latent variables and a static latent variable. The former encode information at a frame rate identical to the observation, while the latter globally governs the entire sequence. This introduces an inductive bias and facilitates unsupervised disentanglement of the underlying local and global factors. In this paper, we show that the vanilla DSAE suffers from being sensitive to the choice of model architecture and capacity of the dynamic latent variables, and is prone to collapse the static latent variable. As a countermeasure, we propose TS-DSAE, a two-stage training framework that first learns sequence-level prior distributions, which are subsequently employed to regularise the model and facilitate auxiliary objectives to promote disentanglement. The proposed framework is fully unsupervised and robust against the global factor collapse problem across a wide range of model configurations. It also avoids typical solutions such as adversarial training which usually involves laborious parameter tuning, and domain-specific data augmentation. We conduct quantitative and qualitative evaluations to demonstrate its robustness in terms of disentanglement on both artificial and real-world music audio datasets.
Submitted 14 June, 2022; v1 submitted 12 May, 2022;
originally announced May 2022.
-
A Lightweight Instrument-Agnostic Model for Polyphonic Note Transcription and Multipitch Estimation
Authors:
Rachel M. Bittner,
Juan José Bosch,
David Rubinstein,
Gabriel Meseguer-Brocal,
Sebastian Ewert
Abstract:
Automatic Music Transcription (AMT) has been recognized as a key enabling technology with a wide range of applications. Given the task's complexity, best results have typically been reported for systems focusing on specific settings, e.g. instrument-specific systems tend to yield improved results over instrument-agnostic methods. Similarly, higher accuracy can be obtained when only estimating frame-wise $f_0$ values and neglecting the harder note event detection. Despite their high accuracy, such specialized systems often cannot be deployed in the real world. Storage and network constraints prohibit the use of multiple specialized models, while memory and run-time constraints limit their complexity. In this paper, we propose a lightweight neural network for musical instrument transcription, which supports polyphonic outputs and generalizes to a wide variety of instruments (including vocals). Our model is trained to jointly predict frame-wise onsets, multipitch and note activations, and we experimentally show that this multi-output structure improves the resulting frame-level note accuracy. Despite its simplicity, benchmark results show our system's note estimation to be substantially better than a comparable baseline, and its frame-level accuracy to be only marginally below that of specialized state-of-the-art AMT systems. With this work we hope to encourage the community to further investigate low-resource, instrument-agnostic AMT systems.
Submitted 12 May, 2022; v1 submitted 18 March, 2022;
originally announced March 2022.
-
Improving Lyrics Alignment through Joint Pitch Detection
Authors:
Jiawen Huang,
Emmanouil Benetos,
Sebastian Ewert
Abstract:
In recent years, the accuracy of automatic lyrics alignment methods has increased considerably. Yet, many current approaches employ frameworks designed for automatic speech recognition (ASR) and do not exploit properties specific to music. Pitch is one important musical attribute of singing voice but it is often ignored by current systems as the lyrics content is considered independent of the pitch. In practice, however, there is a temporal correlation between the two as note starts often correlate with phoneme starts. At the same time the pitch is usually annotated with high temporal accuracy in ground truth data while the timing of lyrics is often only available at the line (or word) level. In this paper, we propose a multi-task learning approach for lyrics alignment that incorporates pitch and thus can make use of a new source of highly accurate temporal information. Our results show that the accuracy of the alignment result is indeed improved by our approach. As an additional contribution, we show that integrating boundary detection in the forced-alignment algorithm reduces cross-line errors, which improves the accuracy even further.
Submitted 3 February, 2022;
originally announced February 2022.
-
Lower Interaural Coherence in Off-Signal Bands Impairs Binaural Detection
Authors:
Bernhard Eurich,
Jörg Encke,
Stephan D. Ewert,
Mathias Dietz
Abstract:
Differences in interaural phase configuration between a target and a masker can lead to substantial binaural unmasking. This effect is decreased for masking noises with an interaural time difference (ITD). Adding a second noise with an opposing ITD in most cases further reduces binaural unmasking. Thus far, modeling of these detection thresholds required both a mechanism for internal ITD compensation and an increased binaural bandwidth. An alternative explanation for the reduction is that unmasking is impaired by the lower interaural coherence in off-frequency regions caused by the second masker (Marquardt & McAlpine, 2009, JASA pp. EL177 - EL182). Based on this hypothesis, the current work proposes a quantitative multi-channel model using monaurally derived peripheral filter bandwidths and an across-channel incoherence interference mechanism. This mechanism differs from wider filters since it has no effect when the masker coherence is constant across frequency bands. Combined with a monaural energy discrimination pathway, the model predicts the differences between a single delayed noise and two opposingly delayed noises, as well as four other data sets. It helps resolve the inconsistency that explaining some data sets requires wide filters while explaining others requires narrow filters.
Submitted 7 March, 2022; v1 submitted 6 October, 2021;
originally announced October 2021.
-
Prediction of tone detection thresholds in interaurally delayed noise based on interaural phase difference fluctuations
Authors:
Mathias Dietz,
Jörg Encke,
Kristin I. Bracklo,
Stephan D. Ewert
Abstract:
Differences between the interaural phase of a noise and a target tone improve detection thresholds. The maximum masking release is obtained for detecting an antiphasic tone (S$π$) in diotic noise (N0). It has been shown in several studies that this benefit gradually declines as an interaural delay is applied to the N0S$π$ complex. This decline has been attributed to the reduced interaural coherence of the noise. Here, we report detection thresholds for a 500 Hz tone in masking noise with up to 8 ms interaural delay and bandwidths from 25 to 1000 Hz. When reducing the noise bandwidth from 100 to 50 and 25 Hz, the masking release at 8 ms delay increases, as expected for increasing temporal coherence with decreasing bandwidth. For bandwidths of 100 to 1000 Hz, no significant difference was observed and detection thresholds with these noises have a delay dependence that is fully described by the temporal coherence imposed by the typical monaurally determined auditory filter bandwidth. A minimalistic binaural model is suggested based on interaural phase difference fluctuations without the assumption of delay lines.
Submitted 1 July, 2021;
originally announced July 2021.
-
Computationally efficient spatial rendering of late reverberation in virtual acoustic environments
Authors:
Christoph Kirsch,
Josef Poppitz,
Torben Wendt,
Steven van de Par,
Stephan D. Ewert
Abstract:
For 6-DOF (degrees of freedom) interactive virtual acoustic environments (VAEs), the spatial rendering of diffuse late reverberation in addition to early (specular) reflections is important. In the interest of computational efficiency, the acoustic simulation of the late reverberation can be simplified by using a limited number of spatially distributed virtual reverb sources (VRS) each radiating incoherent signals. A sufficient number of VRS is needed to approximate spatially anisotropic late reverberation, e.g., in a room with inhomogeneous distribution of absorption at the boundaries. Here, a highly efficient and perceptually plausible method to generate and spatially render late reverberation is suggested, extending the room acoustics simulator RAZR [Wendt et al., J. Audio Eng. Soc., 62, 11 (2014)]. The room dimensions and frequency-dependent absorption coefficients at the wall boundaries are used to determine the parameters of a physically-based feedback delay network (FDN) to generate the incoherent VRS signals. The VRS are spatially distributed around the listener with weighting factors representing the spatially subsampled distribution of absorption coefficients on the wall boundaries. The minimum number of VRS required to be perceptually distinguishable from the maximum (reference) number of 96 VRS was assessed in a listening test conducted with a spherical loudspeaker array within an anechoic room. For the resulting low numbers of VRS suited for spatial rendering, optimal physically-based parameter choices for the FDN are discussed.
Submitted 30 June, 2021;
originally announced July 2021.
-
Communication conditions in virtual acoustic scenes in an underground station
Authors:
Ľuboš Hládek,
Stephan D. Ewert,
Bernhard U. Seeber
Abstract:
Underground stations are a common communication situation in towns: we talk with friends or colleagues, listen to announcements or shop for titbits while background noise and reverberation challenge communication. Here, we perform an acoustical analysis of two communication scenes in an underground station in Munich and test speech intelligibility. The acoustical conditions were measured in the station and are compared to simulations in the real-time Simulated Open Field Environment (rtSOFE). We compare binaural room impulse responses measured with an artificial head in the station to modeled impulse responses for free-field auralization via 60 loudspeakers in the rtSOFE. We used the image source method to model early reflections and a set of multi-microphone recordings to model late reverberation. The first communication scene consists of 12 equidistant (1.6 m) horizontally spaced source positions around a listener, simulating different direction-dependent spatial unmasking conditions. The second scene mimics an approaching speaker across six radially spaced source positions (from 1 m to 10 m) with varying direct sound level and thus direct-to-reverberant energy. The acoustic parameters of the underground station show a moderate amount of reverberation (T30 in octave bands was between 2.3 s and 0.6 s and early-decay times between 1.46 s and 0.46 s). The binaural and energetic parameters of the auralization closely matched the measurement. Measured speech reception thresholds were within the error of the speech test, allowing us to conclude that the auralized simulation reproduces acoustic and perceptually relevant parameters for speech intelligibility with high accuracy.
Submitted 2 November, 2021; v1 submitted 30 June, 2021;
originally announced June 2021.
-
Effect of acoustic scene complexity and visual scene representation on auditory perception in virtual audio-visual environments
Authors:
Stefan Fichna,
Thomas Biberger,
Bernhard U. Seeber,
Stephan D. Ewert
Abstract:
In daily life, social interaction and acoustic communication often take place in complex acoustic environments (CAE) with a variety of interfering sounds and reverberation. For hearing research and the evaluation of hearing systems, simulated CAEs using virtual reality techniques have gained interest in the context of ecological validity. In the current study, the effect of scene complexity and visual representation of the scene on psychoacoustic measures like sound source location, distance perception, loudness, speech intelligibility, and listening effort in a virtual audio-visual environment was investigated. A 3-dimensional, 86-channel loudspeaker array was used to render the sound field in combination with or without a head-mounted display (HMD) to create an immersive stereoscopic visual representation of the scene. The scene consisted of a ring of eight (virtual) loudspeakers which played a target speech stimulus and nonsense speech interferers in several spatial conditions. Either an anechoic (snowy outdoor scenery) or echoic environment (loft apartment) with a reverberation time (T60) of about 1.5 s was simulated. In addition to varying the number of interferers, scene complexity was varied by assessing the psychoacoustic measures in isolated consecutive measurements or simultaneously. Results showed no significant effect of wearing the HMD on the data. Loudness and distance perception showed significantly different results when they were measured simultaneously instead of consecutively in isolation. The advantage of the suggested setup is that it can be directly transferred to a corresponding real room, enabling a 1:1 comparison and verification of the perception experiments in the real and virtual environment.
Submitted 7 November, 2021; v1 submitted 30 June, 2021;
originally announced June 2021.
-
Spatial resolution of late reverberation in virtual acoustic environments
Authors:
Christoph Kirsch,
Josef Poppitz,
Torben Wendt,
Steven van de Par,
Stephan D. Ewert
Abstract:
Late reverberation involves the superposition of many sound reflections resulting in a diffuse sound field. Since the spatially resolved perception of individual diffuse reflections is impossible, simplifications can potentially be made for modelling late reverberation in room acoustics simulations with reduced spatial resolution. Such simplifications are desired for interactive, real-time virtual acoustic environments with applications in hearing research and for the evaluation of hearing supportive devices. In this context, the number and spatial arrangement of loudspeakers used for playback additionally affect spatial resolution. The current study assessed the minimum number of spatially evenly distributed virtual late reverberation sources required to perceptually approximate spatially highly resolved isotropic and anisotropic late reverberation and to technically approximate a spherically isotropic diffuse sound field. The spatial resolution of the rendering was systematically reduced by using subsets of the loudspeakers of an 86-channel spherical loudspeaker array in an anechoic chamber. It was tested whether listeners can distinguish lower spatial resolutions for the rendering of late reverberation from the highest achievable spatial resolution in different simulated rooms. Rendering of early reflections was kept fixed. The coherence of the sound field across a pair of microphones at ear and behind-the-ear hearing device distance was assessed to separate the effects of number of virtual sources and loudspeaker array geometry. Results show that between 12 and 24 reverberation sources are required.
Submitted 30 June, 2021;
originally announced June 2021.
-
Towards a generalized monaural and binaural auditory model for psychoacoustics and speech intelligibility
Authors:
Thomas Biberger,
Stephan D. Ewert
Abstract:
Auditory perception involves cues in the monaural auditory pathways as well as binaural cues based on differences between the ears. So far auditory models have often focused on either monaural or binaural experiments in isolation. Although binaural models typically build upon stages of (existing) monaural models, only a few attempts have been made to extend a monaural model by a binaural stage using a unified decision stage for monaural and binaural cues. In such approaches, a typical prototype of binaural processing has been the classical equalization-cancelation mechanism, which either involves signal-adaptive delays and provides a single channel output or can be implemented with tapped delays providing a high-dimensional multichannel output. This contribution extends the (monaural) generalized envelope power spectrum model by a non-adaptive binaural stage with only a few, fixed output channels. The binaural stage resembles features of physiologically motivated hemispheric binaural processing, as simplified signal processing stages, yielding a 5-channel monaural and binaural matrix feature "decoder" (BMFD). The back end of the existing monaural model is applied to the 5-channel BMFD output and calculates short-time envelope power and power features. The model is evaluated and discussed for a baseline database of monaural and binaural psychoacoustic experiments from the literature.
Submitted 29 June, 2021;
originally announced June 2021.
-
Seq-U-Net: A One-Dimensional Causal U-Net for Efficient Sequence Modelling
Authors:
Daniel Stoller,
Mi Tian,
Sebastian Ewert,
Simon Dixon
Abstract:
Convolutional neural networks (CNNs) with dilated filters such as the Wavenet or the Temporal Convolutional Network (TCN) have shown good results in a variety of sequence modelling tasks. However, efficiently modelling long-term dependencies in these sequences is still challenging. Although the receptive field of these models grows exponentially with the number of layers, computing the convolutions over very long sequences of features in each layer is time and memory-intensive, prohibiting the use of longer receptive fields in practice. To increase efficiency, we make use of the "slow feature" hypothesis stating that many features of interest are slowly varying over time. For this, we use a U-Net architecture that computes features at multiple time-scales and adapt it to our auto-regressive scenario by making convolutions causal. We apply our model ("Seq-U-Net") to a variety of tasks including language and audio generation. In comparison to TCN and Wavenet, our network consistently saves memory and computation time, with speed-ups for training and inference of over 4x in the audio generation experiment in particular, while achieving a comparable performance in all tasks.
Submitted 14 November, 2019;
originally announced November 2019.
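The Seq-U-Net abstract above notes that the receptive field of dilated convolutional models grows exponentially with the number of layers. A minimal sketch of that arithmetic follows, assuming a dilation that doubles at every layer (1, 2, 4, ...), a schedule common in Wavenet-style models but not stated in the abstract. The function name `receptive_field` and its parameters are hypothetical and chosen for illustration only.

```python
def receptive_field(num_layers: int, kernel_size: int = 2) -> int:
    """Receptive field, in time steps, of a stack of dilated causal convolutions.

    Assumption (not from the abstract): layer l uses dilation 2**l.
    Each layer extends the field by (kernel_size - 1) * dilation steps.
    """
    field = 1  # a single output step always sees at least itself
    for layer in range(num_layers):
        dilation = 2 ** layer
        field += (kernel_size - 1) * dilation
    return field


if __name__ == "__main__":
    # With kernel size 2 the field doubles with every added layer.
    for n in (1, 2, 4, 8, 10):
        print(n, receptive_field(n))
```

Under this assumption, every layer still has to process the full-length sequence, which is the cost the abstract describes as prohibitive for very long receptive fields.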
-
Training Generative Adversarial Networks from Incomplete Observations using Factorised Discriminators
Authors:
Daniel Stoller,
Sebastian Ewert,
Simon Dixon
Abstract:
Generative adversarial networks (GANs) have shown great success in applications such as image generation and inpainting. However, they typically require large datasets, which are often not available, especially in the context of prediction tasks such as image segmentation that require labels. Therefore, methods such as the CycleGAN use more easily available unlabelled data, but do not offer a way to leverage additional labelled data for improved performance. To address this shortcoming, we show how to factorise the joint data distribution into a set of lower-dimensional distributions along with their dependencies. This allows splitting the discriminator in a GAN into multiple "sub-discriminators" that can be independently trained from incomplete observations. Their outputs can be combined to estimate the density ratio between the joint real and the generator distribution, which enables training generators as in the original GAN framework. We apply our method to image generation, image segmentation and audio source separation, and obtain improved performance over a standard GAN when additional incomplete training examples are available. For the Cityscapes segmentation task in particular, our method also improves accuracy by an absolute 14.9% over CycleGAN while using only 25 additional paired examples.
Submitted 30 January, 2020; v1 submitted 29 May, 2019;
originally announced May 2019.
-
End-to-end Lyrics Alignment for Polyphonic Music Using an Audio-to-Character Recognition Model
Authors:
Daniel Stoller,
Simon Durand,
Sebastian Ewert
Abstract:
Time-aligned lyrics can enrich the music listening experience by enabling karaoke, text-based song retrieval and intra-song navigation, and other applications. Compared to text-to-speech alignment, lyrics alignment remains highly challenging, despite many attempts to combine numerous sub-modules including vocal separation and detection in an effort to break down the problem. Furthermore, training required fine-grained annotations to be available in some form. Here, we present a novel system based on a modified Wave-U-Net architecture, which predicts character probabilities directly from raw audio using learnt multi-scale representations of the various signal components. There are no sub-modules whose interdependencies need to be optimized. Our training procedure is designed to work with weak, line-level annotations available in the real world. With a mean alignment error of 0.35s on a standard dataset our system outperforms the state-of-the-art by an order of magnitude.
Submitted 18 February, 2019;
originally announced February 2019.
-
Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation
Authors:
Daniel Stoller,
Sebastian Ewert,
Simon Dixon
Abstract:
Models for audio source separation usually operate on the magnitude spectrum, which ignores phase information and makes separation performance dependent on hyper-parameters for the spectral front-end. Therefore, we investigate end-to-end source separation in the time-domain, which allows modelling phase information and avoids fixed spectral transformations. Due to high sampling rates for audio, employing a long temporal input context on the sample level is difficult, but required for high quality separation results because of long-range temporal correlations. In this context, we propose the Wave-U-Net, an adaptation of the U-Net to the one-dimensional time domain, which repeatedly resamples feature maps to compute and combine features at different time scales. We introduce further architectural improvements, including an output layer that enforces source additivity, an upsampling technique and a context-aware prediction framework to reduce output artifacts. Experiments for singing voice separation indicate that our architecture yields a performance comparable to a state-of-the-art spectrogram-based U-Net architecture, given the same data. Finally, we reveal a problem with outliers in the currently used SDR evaluation metrics and suggest reporting rank-based statistics to alleviate this problem.
Submitted 8 June, 2018;
originally announced June 2018.
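The Wave-U-Net abstract above mentions an output layer that enforces source additivity, meaning the estimated sources sum to the input mixture. One simple way to satisfy such a constraint is sketched below: the model predicts all but one source, and the last source is taken as the residual. This is an illustrative assumption about how the constraint could be met, not a description of the authors' implementation. The function name `enforce_additivity` is hypothetical.

```python
def enforce_additivity(mixture, partial_estimates):
    """Return K source estimates that sum exactly to the mixture.

    mixture: list of samples.
    partial_estimates: list of K-1 estimated sources, each the same length
    as the mixture. The K-th source is computed as the residual
    mixture - sum(partial_estimates), so additivity holds by construction.
    """
    residual = list(mixture)
    for estimate in partial_estimates:
        if len(estimate) != len(mixture):
            raise ValueError("every estimate must match the mixture length")
        residual = [r - e for r, e in zip(residual, estimate)]
    return [list(e) for e in partial_estimates] + [residual]


if __name__ == "__main__":
    mixture = [1.0, 0.5, -0.25]
    vocals = [0.25, 0.5, 0.0]  # hypothetical network output
    vocals_out, accompaniment = enforce_additivity(mixture, [vocals])
    print(accompaniment)
```

With this construction, any error in the predicted sources is absorbed by the residual source, so the sum of all outputs always reproduces the mixture.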
-
Jointly Detecting and Separating Singing Voice: A Multi-Task Approach
Authors:
Daniel Stoller,
Sebastian Ewert,
Simon Dixon
Abstract:
A main challenge in applying deep learning to music processing is the availability of training data. One potential solution is Multi-task Learning, in which the model also learns to solve related auxiliary tasks on additional datasets to exploit their correlation. While intuitive in principle, it can be challenging to identify related tasks and construct the model to optimally share information between tasks. In this paper, we explore vocal activity detection as an additional task to stabilise and improve the performance of vocal separation. Further, we identify problematic biases specific to each dataset that could limit the generalisation capability of separation and detection models, to which our proposed approach is robust. Experiments show improved performance in separation as well as vocal detection compared to single-task baselines. However, we find that the commonly used Signal-to-Distortion Ratio (SDR) metrics did not capture the improvement on non-vocal sections, indicating the need for improved evaluation methodologies.
Submitted 4 April, 2018;
originally announced April 2018.
-
Shift-Invariant Kernel Additive Modelling for Audio Source Separation
Authors:
Delia Fano Yela,
Sebastian Ewert,
Ken O'Hanlon,
Mark B. Sandler
Abstract:
A major goal in blind source separation is to model the inherent characteristics of the sources in order to identify and separate them. While most state-of-the-art approaches are supervised methods trained on large datasets, interest in non-data-driven approaches such as Kernel Additive Modelling (KAM) remains high due to their interpretability and adaptability. KAM performs the separation of a given source by applying robust statistics on the time-frequency bins selected by a source-specific kernel function, commonly the K-NN function. This choice assumes that the source of interest repeats in both time and frequency. In practice, this assumption does not always hold. Therefore, we introduce a shift-invariant kernel function capable of identifying similar spectral content even under frequency shifts. This way, we can considerably increase the amount of suitable sound material available to the robust statistics. While this leads to an increase in separation performance, a basic formulation is computationally expensive. Therefore, we additionally present acceleration techniques that lower the overall computational complexity.
Submitted 16 February, 2018; v1 submitted 1 November, 2017;
originally announced November 2017.
-
Adversarial Semi-Supervised Audio Source Separation applied to Singing Voice Extraction
Authors:
Daniel Stoller,
Sebastian Ewert,
Simon Dixon
Abstract:
The state of the art in music source separation employs neural networks trained in a supervised fashion on multi-track databases to estimate the sources from a given mixture. With only a few datasets available, extensive data augmentation is often used to combat overfitting. Mixing random tracks, however, can even reduce separation performance as instruments in real music are strongly correlated. The key concept in our approach is that source estimates of an optimal separator should be indistinguishable from real source signals. Based on this idea, we drive the separator towards outputs deemed as realistic by discriminator networks that are trained to tell apart real from separator samples. This way, we can also use unpaired source and mixture recordings without the drawbacks of creating unrealistic music mixtures. Our framework is widely applicable as it does not assume a specific network architecture or number of sources. To our knowledge, this is the first adoption of adversarial training for music source separation. In a prototype experiment for singing voice separation, separation performance increases with our approach compared to purely supervised training.
Submitted 6 April, 2018; v1 submitted 31 October, 2017;
originally announced November 2017.
-
An Augmented Lagrangian Method for Piano Transcription using Equal Loudness Thresholding and LSTM-based Decoding
Authors:
Sebastian Ewert,
Mark B. Sandler
Abstract:
A central goal in automatic music transcription is to detect individual note events in music recordings. An important variant is instrument-dependent music transcription where methods can use calibration data for the instruments in use. However, despite the additional information, results rarely exceed an f-measure of 80%. As a potential explanation, the transcription problem can be shown to be badly conditioned and thus relies on appropriate regularization. A recently proposed method employs a mixture of simple, convex regularizers (to stabilize the parameter estimation process) and more complex terms (to encourage more meaningful structure). In this paper, we present two extensions to this method. First, we integrate a computational loudness model to better differentiate real from spurious note detections. Second, we employ (Bidirectional) Long Short Term Memory networks to re-weight the likelihood of detected note constellations. Despite their simplicity, our two extensions lead to a drop of about 35% in note error rate compared to the state-of-the-art.
Submitted 30 July, 2017; v1 submitted 1 July, 2017;
originally announced July 2017.
-
On the Importance of Temporal Context in Proximity Kernels: A Vocal Separation Case Study
Authors:
Delia Fano Yela,
Sebastian Ewert,
Derry FitzGerald,
Mark Sandler
Abstract:
Musical source separation methods exploit source-specific spectral characteristics to facilitate the decomposition process. Kernel Additive Modelling (KAM) models a source by applying robust statistics to time-frequency bins as specified by a source-specific kernel, a function defining similarity between bins. Kernels in existing approaches are typically defined using metrics between single time frames. In the presence of noise and other sound sources, however, information from a single frame turns out to be unreliable, and incorrect frames are often selected as similar. In this paper, we incorporate a temporal context into the kernel to provide additional information stabilizing the similarity search. Evaluated in the context of vocal separation, our simple extension led to a considerable improvement in separation quality compared to previous kernels.
Submitted 11 April, 2017; v1 submitted 7 February, 2017;
originally announced February 2017.
-
Interference Reduction in Music Recordings Combining Kernel Additive Modelling and Non-Negative Matrix Factorization
Authors:
Delia Fano Yela,
Sebastian Ewert,
Derry FitzGerald,
Mark Sandler
Abstract:
In live and studio recordings unexpected sound events often lead to interferences in the signal. For non-stationary interferences, sound source separation techniques can be used to reduce the interference level in the recording. In this context, we present a novel approach combining the strengths of two algorithmic families: NMF and KAM. The recent KAM approach applies robust statistics on frames selected by a source-specific kernel to perform source separation. Based on semi-supervised NMF, we extend this approach in two ways. First, we locate the interference in the recording based on detected NMF activity. Second, we improve the kernel-based frame selection by incorporating an NMF-based estimate of the clean music signal. Further, we introduce a temporal context in the kernel, taking some musical structure into account. Our experiments show improved separation quality for our proposed method over a state-of-the-art approach for interference reduction.
Submitted 8 February, 2017; v1 submitted 20 September, 2016;
originally announced September 2016.
-
Structured Dropout for Weak Label and Multi-Instance Learning and Its Application to Score-Informed Source Separation
Authors:
Sebastian Ewert,
Mark B. Sandler
Abstract:
Many success stories involving deep neural networks are instances of supervised learning, where available labels power gradient-based learning methods. Creating such labels, however, can be expensive and thus there is increasing interest in weak labels which only provide coarse information, with uncertainty regarding time, location or value. Using such labels often leads to considerable challenges for the learning process. Current methods for weak-label training often employ standard supervised approaches that additionally reassign or prune labels during the learning process. The information gain, however, is often limited as only the importance of labels where the network already yields reasonable results is boosted. We propose treating weak-label training as an unsupervised problem and use the labels to guide the representation learning to induce structure. To this end, we propose two autoencoder extensions: class activity penalties and structured dropout. We demonstrate the capabilities of our approach in the context of score-informed source separation of music.
Submitted 26 December, 2016; v1 submitted 15 September, 2016;
originally announced September 2016.
-
Piano Transcription in the Studio Using an Extensible Alternating Directions Framework
Authors:
Sebastian Ewert,
Mark Sandler
Abstract:
Given a musical audio recording, the goal of automatic music transcription is to determine a score-like representation of the piece underlying the recording. Despite significant interest within the research community, several studies have reported on a 'glass ceiling' effect, an apparent limit on the transcription accuracy that current methods seem incapable of overcoming. In this paper, we explore how much this effect can be mitigated by focusing on a specific instrument class and making use of additional information on the recording conditions available in studio or home recording scenarios. In particular, exploiting the availability of single note recordings for the instrument in use we develop a novel signal model employing variable-length spectro-temporal patterns as its central building blocks - tailored for pitched percussive instruments such as the piano. Temporal dependencies between spectral templates are modeled, resembling characteristics of factorial scaled hidden Markov models (FS-HMM) and other methods combining Non-Negative Matrix Factorization with Markov processes. In contrast to FS-HMMs, our parameter estimation is developed in a global, relaxed form within the extensible alternating direction method of multipliers (ADMM) framework, which enables the systematic combination of basic regularizers propagating sparsity and local stationarity in note activity with more complex regularizers imposing temporal semantics. The proposed method achieves an f-measure of 93-95% for note onsets on pieces recorded on a Yamaha Disklavier (MAPS DB).
Submitted 27 July, 2016; v1 submitted 2 June, 2016;
originally announced June 2016.
-
Robust Joint Alignment of Multiple Versions of a Piece of Music
Authors:
Siying Wang,
Sebastian Ewert,
Simon Dixon
Abstract:
Large music content libraries often comprise multiple versions of a piece of music. To establish a link between different versions, automatic music alignment methods map each position in one version to a corresponding position in another version. Due to the leeway in interpreting a piece, any two versions can differ significantly, for example, in terms of local tempo, articulation, or playing style. For a given pair of versions, these differences can be so significant that even state-of-the-art methods fail to identify a correct alignment. In this paper, we present a novel method that increases the robustness for difficult-to-align cases. Instead of aligning only pairs of versions as done in previous methods, our method aligns multiple versions in a joint manner. This way, the alignment can be computed by comparing each version not only with one but with several versions, which stabilizes the comparison and leads to an increase in alignment robustness. Using recordings from the Mazurka Project, the alignment error for our proposed method was 14% lower on average compared to a state-of-the-art method, with significantly fewer outliers (standard deviation 53% lower).
Submitted 28 April, 2016;
originally announced April 2016.
-
Evaluation of spatial audio reproduction schemes for application in hearing aid research
Authors:
Giso Grimm,
Stephan Ewert,
Volker Hohmann
Abstract:
Loudspeaker-based spatial audio reproduction schemes are increasingly used for evaluating hearing aids in complex acoustic conditions. To further establish the feasibility of this approach, this study investigated the interaction between spatial resolution of different reproduction methods and technical and perceptual hearing aid performance measures using computer simulations. Three spatial audio reproduction methods (discrete speakers, vector base amplitude panning and higher order ambisonics) were compared in regular circular loudspeaker arrays with 4 to 72 channels. The influence of reproduction method and array size on performance measures of representative multi-microphone hearing aid algorithm classes with spatially distributed microphones and a representative single-channel noise-reduction algorithm was analyzed. Algorithm classes differed in their way of analyzing and exploiting spatial properties of the sound field, requiring different accuracy of sound field reproduction. Performance measures included beam pattern analysis, signal-to-noise ratio analysis, perceptual localization prediction, and quality modeling. The results show performance differences and interaction effects between reproduction method and algorithm class that may be used for guidance when selecting the appropriate method and number of speakers for specific tasks in hearing aid research.
Submitted 3 August, 2015; v1 submitted 2 March, 2015;
originally announced March 2015.