Search | arXiv e-print repository

arXiv:2507.02791

Self-Steering Deep Non-Linear Spatially Selective Filters for Efficient Extraction of Moving Speakers under Weak Guidance

Authors: Jakob Kienegger, Alina Mannanova, Huajian Fang, Timo Gerkmann

Abstract: Recent works on deep non-linear spatially selective filters demonstrate exceptional enhancement performance with computationally lightweight architectures for stationary speakers of known directions. However, to maintain this performance in dynamic scenarios, resource-intensive data-driven tracking algorithms become necessary to provide precise spatial guidance conditioned on the initial direction of a target speaker. As this additional computational overhead hinders application in resource-constrained scenarios such as real-time speech enhancement, we present a novel strategy utilizing a low-complexity tracking algorithm in the form of a particle filter instead. Assuming a causal, sequential processing style, we introduce temporal feedback to leverage the enhanced speech signal of the spatially selective filter to compensate for the limited modeling capabilities of the particle filter. Evaluation on a synthetic dataset illustrates how the autoregressive interplay between both algorithms drastically improves tracking accuracy and leads to strong enhancement performance. A listening test with real-world recordings complements these findings by indicating a clear trend towards our proposed self-steering pipeline as preferred choice over comparable methods.

Submitted 5 July, 2025; v1 submitted 3 July, 2025; originally announced July 2025.

Comments: Accepted at IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) 2025. Video demonstration: https://youtu.be/aSKOSh5JZ3o

arXiv:2505.14517

Steering Deep Non-Linear Spatially Selective Filters for Weakly Guided Extraction of Moving Speakers in Dynamic Scenarios

Authors: Jakob Kienegger, Timo Gerkmann

Abstract: Recent speaker extraction methods using deep non-linear spatial filtering perform exceptionally well when the target direction is known and stationary. However, spatially dynamic scenarios are considerably more challenging due to time-varying spatial features and arising ambiguities, e.g. when moving speakers cross. While in a static scenario it may be easy for a user to point to the target's direction, manually tracking a moving speaker is impractical. Instead of relying on accurate time-dependent directional cues, which we refer to as strong guidance, in this paper we propose a weakly guided extraction method solely depending on the target's initial position to cope with spatial dynamic scenarios. By incorporating our own deep tracking algorithm and developing a joint training strategy on a synthetic dataset, we demonstrate the proficiency of our approach in resolving spatial ambiguities and even outperform a mismatched, but strongly guided extraction method.

Submitted 20 May, 2025; originally announced May 2025.

Comments: Accepted at Interspeech 2025

arXiv:2410.19595

Mask-Weighted Spatial Likelihood Coding for Speaker-Independent Joint Localization and Mask Estimation

Authors: Jakob Kienegger, Alina Mannanova, Timo Gerkmann

Abstract: Due to their robustness and flexibility, neural-driven beamformers are a popular choice for speech separation in challenging environments with a varying amount of simultaneous speakers alongside noise and reverberation. Time-frequency masks and relative directions of the speakers regarding a fixed spatial grid can be used to estimate the beamformer's parameters. To some degree, speaker-independence is achieved by ensuring a greater amount of spatial partitions than speech sources. In this work, we analyze how to encode both mask and positioning into such a grid to enable joint estimation of both quantities. We propose mask-weighted spatial likelihood coding and show that it achieves considerable performance in both tasks compared to baseline encodings optimized for either localization or mask estimation. In the same setup, we demonstrate superiority for joint estimation of both quantities. Conclusively, we propose a universal approach which can replace an upstream sound source localization system solely by adapting the training framework, making it highly relevant in performance-critical scenarios.

Submitted 8 January, 2025; v1 submitted 25 October, 2024; originally announced October 2024.

Comments: ©2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

Showing 1–3 of 3 results for author: Kienegger, J