Search | arXiv e-print repository

SONNET: Enhancing Time Delay Estimation by Leveraging Simulated Audio

Authors: Erik Tegler, Magnus Oskarsson, Kalle Åström

Abstract: Time delay estimation, or Time-Difference-Of-Arrival estimation, is a critical component for multiple localization applications such as multilateration, direction of arrival, and self-calibration. The task is to estimate the time difference between a signal arriving at two different sensors. For the audio sensor modality, most current systems are based on classical methods such as the Generalized Cross-Correlation Phase Transform (GCC-PHAT) method. In this paper we demonstrate that learning-based methods can, even based on synthetic data, significantly outperform GCC-PHAT on novel real-world data. To overcome the lack of data with ground truth for the task, we train our model on a simulated dataset that is sufficiently large and varied and that captures the relevant characteristics of the real-world problem. We provide our trained model, SONNET (Simulation Optimized Neural Network Estimator of Timeshifts), which is runnable in real time and works on novel data out of the box for many real data applications, i.e. without re-training. We further demonstrate greatly improved performance on the downstream task of self-calibration when using our model compared to classical methods.

Submitted 20 November, 2024; originally announced November 2024.

arXiv:2408.17166 [pdf, other]

Learning Multi-Target TDOA Features for Sound Event Localization and Detection

Authors: Axel Berg, Johanna Engman, Jens Gulin, Karl Åström, Magnus Oskarsson

Abstract: Sound event localization and detection (SELD) systems using audio recordings from a microphone array rely on spatial cues for determining the location of sound events. As a consequence, the localization performance of such systems is to a large extent determined by the quality of the audio features that are used as inputs to the system. We propose a new feature, based on neural generalized cross-correlations with phase-transform (NGCC-PHAT), that learns audio representations suitable for localization. Using permutation invariant training for the time-difference of arrival (TDOA) estimation problem enables NGCC-PHAT to learn TDOA features for multiple overlapping sound events. These features can be used as a drop-in replacement for GCC-PHAT inputs to a SELD-network. We test our method on the STARSS23 dataset and demonstrate improved localization performance compared to using standard GCC-PHAT or SALSA-Lite input features.

Submitted 30 August, 2024; originally announced August 2024.

Comments: DCASE 2024

arXiv:2408.15771 [pdf, other]

doi 10.1109/IPIN62893.2024.10786105

wav2pos: Sound Source Localization using Masked Autoencoders

Authors: Axel Berg, Jens Gulin, Mark O'Connor, Chuteng Zhou, Karl Åström, Magnus Oskarsson

Abstract: We present a novel approach to the 3D sound source localization task for distributed ad-hoc microphone arrays by formulating it as a set-to-set regression problem. By training a multi-modal masked autoencoder model that operates on audio recordings and microphone coordinates, we show that such a formulation allows for accurate localization of the sound source, by reconstructing coordinates masked in the input. Our approach is flexible in the sense that a single model can be used with an arbitrary number of microphones, even when a subset of audio recordings and microphone coordinates are missing. We test our method on simulated and real-world recordings of music and speech in indoor environments, and demonstrate competitive performance compared to both classical and other learning based localization methods.

Submitted 28 August, 2024; originally announced August 2024.

Comments: IPIN 2024

arXiv:2309.02961 [pdf, other]

doi 10.1109/JISPIN.2024.3429110

LuViRA Dataset Validation and Discussion: Comparing Vision, Radio, and Audio Sensors for Indoor Localization

Authors: Ilayda Yaman, Guoda Tian, Erik Tegler, Jens Gulin, Nikhil Challa, Fredrik Tufvesson, Ove Edfors, Kalle Astrom, Steffen Malkowsky, Liang Liu

Abstract: We present a unique comparative analysis and evaluation of vision, radio, and audio based localization algorithms. We create the first baseline for the aforementioned sensors using the recently published Lund University Vision, Radio, and Audio (LuViRA) dataset, where all the sensors are synchronized and measured in the same environment. Some of the challenges of using each specific sensor for indoor localization tasks are highlighted. Each sensor is paired with a current state-of-the-art localization algorithm and evaluated for different aspects: localization accuracy, reliability and sensitivity to environment changes, calibration requirements, and potential system complexity. Specifically, the evaluation covers the ORB-SLAM3 algorithm for vision-based localization with an RGB-D camera, a machine-learning algorithm for radio-based localization with massive MIMO technology, and the SFS2 algorithm for audio-based localization with distributed microphones. The results can serve as a guideline and basis for further development of robust and high-precision multi-sensory localization systems, e.g., through sensor fusion, context, and environment-aware adaptation.

Submitted 25 April, 2024; v1 submitted 6 September, 2023; originally announced September 2023.

Comments: 10 pages, 11 figures

Journal ref: IEEE Journal of Indoor and Seamless Positioning and Navigation (2024) 1-11

arXiv:2302.05309 [pdf, other]

The LuViRA Dataset: Synchronized Vision, Radio, and Audio Sensors for Indoor Localization

Authors: Ilayda Yaman, Guoda Tian, Martin Larsson, Patrik Persson, Michiel Sandra, Alexander Dürr, Erik Tegler, Nikhil Challa, Henrik Garde, Fredrik Tufvesson, Kalle Åström, Ove Edfors, Steffen Malkowsky, Liang Liu

Abstract: We present a synchronized multisensory dataset for accurate and robust indoor localization: the Lund University Vision, Radio, and Audio (LuViRA) Dataset. The dataset includes color images, corresponding depth maps, inertial measurement unit (IMU) readings, channel response between a 5G massive multiple-input and multiple-output (MIMO) testbed and user equipment, audio recorded by 12 microphones, and accurate six degrees of freedom (6DOF) pose ground truth of 0.5 mm. We synchronize these sensors to ensure that all data is recorded simultaneously. A camera, speaker, and transmit antenna are placed on top of a slowly moving service robot, and 89 trajectories are recorded. Each trajectory includes 20 to 50 seconds of recorded sensor data and ground truth labels. Data from different sensors can be used separately or jointly to perform localization tasks, and data from the motion capture (mocap) system is used to verify the results obtained by the localization algorithms. The main aim of this dataset is to enable research on sensor fusion with the most commonly used sensors for localization tasks. Moreover, the full dataset or some parts of it can also be used for other research areas such as channel estimation, image classification, etc. Our dataset is available at: https://github.com/ilaydayaman/LuViRA_Dataset

Submitted 26 April, 2024; v1 submitted 10 February, 2023; originally announced February 2023.

Comments: 7 pages, 7 figures, Accepted to ICRA 2024

arXiv:2208.04654 [pdf, other]

doi 10.21437/Interspeech.2022-524

Extending GCC-PHAT using Shift Equivariant Neural Networks

Authors: Axel Berg, Mark O'Connor, Kalle Åström, Magnus Oskarsson

Abstract: Speaker localization using microphone arrays depends on accurate time delay estimation techniques. For decades, methods based on the generalized cross correlation with phase transform (GCC-PHAT) have been widely adopted for this purpose. Recently, the GCC-PHAT has also been used to provide input features to neural networks in order to remove the effects of noise and reverberation, but at the cost of losing theoretical guarantees in noise-free conditions. We propose a novel approach to extending the GCC-PHAT, where the received signals are filtered using a shift equivariant neural network that preserves the timing information contained in the signals. By extensive experiments we show that our model consistently reduces the error of the GCC-PHAT in adverse environments, with guarantees of exact time delay recovery in ideal conditions.

Submitted 9 August, 2022; originally announced August 2022.

Comments: Proceedings of INTERSPEECH

Journal ref: Proc. Interspeech 2022, 1791-1795
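
Illustrative note (not part of the arXiv listing): the classical GCC-PHAT baseline named in several of these abstracts can be sketched as below. This is a minimal, dependency-free illustration of the textbook method, not the authors' code and not the neural extension described in the paper. The function names `dft` and `gcc_phat_delay`, the naive O(n^2) transform, and the synthetic noise-free test signal are all assumptions made for the example.

```python
import cmath
import random


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform, used to avoid dependencies."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def gcc_phat_delay(x1, x2, max_shift):
    """Estimate the delay of x2 relative to x1, in samples, with GCC-PHAT."""
    n = len(x1) + len(x2)  # zero-pad so circular correlation behaves linearly
    a = dft(list(x1) + [0.0] * (n - len(x1)))
    b = dft(list(x2) + [0.0] * (n - len(x2)))
    # Cross-power spectrum of the two channels.
    cross = [bk * ak.conjugate() for ak, bk in zip(a, b)]
    # Phase transform: discard the magnitude, keep only the phase.
    phat = [c / abs(c) if abs(c) > 1e-12 else 0j for c in cross]
    cc = [v.real for v in dft(phat, inverse=True)]
    # Lag tau is stored at index tau % n (negative lags wrap to the end).
    lags = range(-max_shift, max_shift + 1)
    return max(lags, key=lambda tau: cc[tau % n])


# Hypothetical noise-free test case: white noise delayed by 5 samples.
random.seed(0)
source = [random.uniform(-1.0, 1.0) for _ in range(24)]
true_delay = 5
x1 = source + [0.0] * 8
x2 = [0.0] * true_delay + source + [0.0] * (8 - true_delay)
estimated_delay = gcc_phat_delay(x1, x2, max_shift=8)
```

In this idealized setting the second channel is an exact shifted copy of the first, so the whitened cross-correlation has a single peak at the true lag. Dividing the estimated lag by the sampling rate converts it to seconds. Real recordings with noise and reverberation are the cases where the abstracts above report that learned methods improve on this baseline.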

arXiv:2205.11299 [pdf, other]

Multiple Offsets Multilateration: a new paradigm for sensor network calibration with unsynchronized reference nodes

Authors: Luca Ferranti, Kalle Åström, Magnus Oskarsson, Jani Boutellier, Juho Kannala

Abstract: Positioning using wave signal measurements is used in several applications, such as GPS systems, structure from sound and Wifi based positioning. Mathematically, such problems require the computation of the positions of receivers and/or transmitters as well as time offsets if the devices are unsynchronized. In this paper, we expand the previous state-of-the-art on positioning formulations by introducing Multiple Offsets Multilateration (MOM), a new mathematical framework to compute the receivers' positions with pseudoranges from unsynchronized reference transmitters at known positions. This could be applied in several scenarios, for example structure from sound and positioning with LEO satellites. We mathematically describe MOM, determining how many receivers and transmitters are needed for the network to be solvable; a study on the number of possible distinct solutions is presented, and stable solvers based on homotopy continuation are derived. The solvers are shown to be efficient and robust to noise both for synthetic and real audio data.

Submitted 23 May, 2022; originally announced May 2022.

Comments: accepted to ICASSP2022

arXiv:2110.01099 [pdf, other]

Quadrotor Control on $SU(2)\times R^3$ with SLAM Integration

Authors: Marcus Greiff, Patrik Persson, Zhiyong Sun, Karl Åström, Anders Robertsson

Abstract: We present a trajectory tracking controller for a quadrotor unmanned aerial vehicle (UAV) configured on $SU(2)\times R^3$, and relate this result to a family of geometric tracking controllers on $SO(3)\times R^3$. The theoretical results are complemented by simulation examples, and the controller is subsequently implemented in practice and integrated with a simultaneous localization and mapping (SLAM) system through an extended Kalman filter (EKF). This facilitates the operation of the UAV without external motion capture systems, and we demonstrate that the proposed control system can be used for inventorying tasks in a supermarket environment without external positioning systems.

Submitted 3 October, 2021; originally announced October 2021.

Comments: 18 pages, 9 figures, extended version of ACC'22 paper

arXiv:2005.10298 [pdf, ps, other]

Sensor Networks TDOA Self-Calibration: 2D Complexity Analysis and Solutions

Authors: Luca Ferranti, Kalle Åström, Magnus Oskarsson, Jani Boutellier, Juho Kannala

Abstract: Given a network of receivers and transmitters, the process of determining their positions from measured pseudoranges is known as network self-calibration. In this paper we consider 2D networks with synchronized receivers but unsynchronized transmitters and the corresponding calibration techniques, known as Time-Difference-Of-Arrival (TDOA) techniques. Despite previous work, TDOA self-calibration is computationally challenging. Iterative algorithms are very sensitive to the initialization, causing convergence issues. In this paper, we present a novel approach, which gives an algebraic solution to two previously unsolved scenarios. We also demonstrate that our solvers produce an excellent initial value for non-linear optimisation algorithms, leading to a full pipeline robust to noise.

Submitted 22 October, 2020; v1 submitted 20 May, 2020; originally announced May 2020.

Showing 1–9 of 9 results for author: Åström, K