Search | arXiv e-print repository

CATSE: A Context-Aware Framework for Causal Target Sound Extraction

Authors: Shrishail Baligar, Mikolaj Kegler, Bryce Irvin, Marko Stamenovic, Shawn Newsam

Abstract: Target Sound Extraction (TSE) focuses on the problem of separating sources of interest, indicated by a user's cue, from the input mixture. Most existing solutions operate in an offline fashion and are not suited to the low-latency causal processing constraints imposed by applications in live-streamed content such as augmented hearing. We introduce a family of context-aware low-latency causal TSE m… ▽ More Target Sound Extraction (TSE) focuses on the problem of separating sources of interest, indicated by a user's cue, from the input mixture. Most existing solutions operate in an offline fashion and are not suited to the low-latency causal processing constraints imposed by applications in live-streamed content such as augmented hearing. We introduce a family of context-aware low-latency causal TSE models suitable for real-time processing. First, we explore the utility of context by providing the TSE model with oracle information about what sound classes make up the input mixture, where the objective of the model is to extract one or more sources of interest indicated by the user. Since the practical applications of oracle models are limited due to their assumptions, we introduce a composite multi-task training objective involving separation and classification losses. Our evaluation involving single- and multi-source extraction shows the benefit of using context information in the model either by means of providing full context or via the proposed multi-task training loss without the need for full context information. Specifically, we show that our proposed model outperforms size- and latency-matched Waveformer, a state-of-the-art model for real-time TSE. △ Less

Submitted 21 March, 2024; originally announced March 2024.

Comments: Submitted to EUSIPCO 2024

arXiv:2403.12182 [pdf, other]

Latent CLAP Loss for Better Foley Sound Synthesis

Authors: Tornike Karchkhadze, Hassan Salami Kavaki, Mohammad Rasool Izadi, Bryce Irvin, Mikolaj Kegler, Ari Hertz, Shuo Zhang, Marko Stamenovic

Abstract: Foley sound generation, the art of creating audio for multimedia, has recently seen notable advancements through text-conditioned latent diffusion models. These systems use multimodal text-audio representation models, such as Contrastive Language-Audio Pretraining (CLAP), whose objective is to map corresponding audio and text prompts into a joint embedding space. AudioLDM, a text-to-audio model, w… ▽ More Foley sound generation, the art of creating audio for multimedia, has recently seen notable advancements through text-conditioned latent diffusion models. These systems use multimodal text-audio representation models, such as Contrastive Language-Audio Pretraining (CLAP), whose objective is to map corresponding audio and text prompts into a joint embedding space. AudioLDM, a text-to-audio model, was the winner of the DCASE2023 task 7 Foley sound synthesis challenge. The winning system fine-tuned the model for specific audio classes and applied a post-filtering method using CLAP similarity scores between output audio and input text at inference time, requiring the generation of extra samples, thus reducing data generation efficiency. We introduce a new loss term to enhance Foley sound generation in AudioLDM without post-filtering. This loss term uses a new module based on the CLAP mode-Latent CLAP encode-to align the latent diffusion output with real audio in a shared CLAP embedding space. Our experiments demonstrate that our method effectively reduces the Frechet Audio Distance (FAD) score of the generated audio and eliminates the need for post-filtering, thus enhancing generation efficiency. △ Less

Submitted 18 March, 2024; originally announced March 2024.

Journal ref: EUSIPCO 2024 Proceedings, ISBN: 978-9-4645-9361-7

arXiv:2303.10204 [pdf, other]

ESP32: QEMU Emulation within a Docker Container

Authors: Michael Howard, R. Bruce Irvin

Abstract: The ESP32 is a popular microcontroller from Espressif that can be used in many embedded applications. Robotic joints, smart car chargers, beer vat agitators and automated bread mixers are a few examples where this system-on-a-chip excels. It is cheap to buy and has a number of vendors providing low-cost development board kits that come with the microcontroller and many external connection points w… ▽ More The ESP32 is a popular microcontroller from Espressif that can be used in many embedded applications. Robotic joints, smart car chargers, beer vat agitators and automated bread mixers are a few examples where this system-on-a-chip excels. It is cheap to buy and has a number of vendors providing low-cost development board kits that come with the microcontroller and many external connection points with peripherals. There is a large software ecosystem for the ESP32. Espressif maintains an SDK containing many C-language sample projects providing a starting point for a huge variety of software services and I/O needs. Third party projects provide additional sample code as well as support for other programming languages. For example, MicroPython is a mature project with sample code and officially supported by Espressif. The SDK provides tools to not just build an application but also merge a flash image, flash to the microcontroller and monitor the output. Is it possible to build the ESP32 load and emulate on another host OS? This paper explores the QEMU emulator and its ability to emulate the ethernet interface for the guest OS. Additionally, we look into the concept of containerizing the entire emulator and ESP32 load package such that a microcontroller flash image can successfully run with a one-step deployment of a Docker container. △ Less

Submitted 25 March, 2023; v1 submitted 17 March, 2023; originally announced March 2023.

Comments: 7 pages and 9 figures

ACM Class: F.2.2; I.2.7

arXiv:2211.02542 [pdf, other]

Self-Supervised Learning for Speech Enhancement through Synthesis

Authors: Bryce Irvin, Marko Stamenovic, Mikolaj Kegler, Li-Chia Yang

Abstract: Modern speech enhancement (SE) networks typically implement noise suppression through time-frequency masking, latent representation masking, or discriminative signal prediction. In contrast, some recent works explore SE via generative speech synthesis, where the system's output is synthesized by a neural vocoder after an inherently lossy feature-denoising step. In this paper, we propose a denoisin… ▽ More Modern speech enhancement (SE) networks typically implement noise suppression through time-frequency masking, latent representation masking, or discriminative signal prediction. In contrast, some recent works explore SE via generative speech synthesis, where the system's output is synthesized by a neural vocoder after an inherently lossy feature-denoising step. In this paper, we propose a denoising vocoder (DeVo) approach, where a vocoder accepts noisy representations and learns to directly synthesize clean speech. We leverage rich representations from self-supervised learning (SSL) speech models to discover relevant features. We conduct a candidate search across 15 potential SSL front-ends and subsequently train our vocoder adversarially with the best SSL configuration. Additionally, we demonstrate a causal version capable of running on streaming audio with 10ms latency and minimal performance degradation. Finally, we conduct both objective evaluations and subjective listening studies to show our system improves objective metrics and outperforms an existing state-of-the-art SE model subjectively. △ Less

Submitted 4 November, 2022; originally announced November 2022.

Showing 1–4 of 4 results for author: Irvin, B