Search | arXiv e-print repository

Match to Win: Analysing Sequences Lengths for Efficient Self-supervised Learning in Speech and Audio

Authors: Yan Gao, Javier Fernandez-Marques, Titouan Parcollet, Pedro P. B. de Gusmao, Nicholas D. Lane

Abstract: Self-supervised learning (SSL) has proven vital in speech and audio-related applications. The paradigm trains a general model on unlabeled data that can later be used to solve specific downstream tasks. This type of model is costly to train as it requires manipulating long input sequences that can only be handled by powerful centralised servers. Surprisingly, despite many attempts to increase training efficiency through model compression, the effects of truncating input sequence lengths to reduce computation have not been studied. In this paper, we provide the first empirical study of SSL pre-training for different specified sequence lengths and link this to various downstream tasks. We find that training on short sequences can dramatically reduce resource costs while retaining a satisfactory performance for all tasks. This simple one-line change would promote the migration of SSL training from data centres to user-end edge devices for more realistic and personalised applications.

Submitted 22 November, 2022; v1 submitted 30 September, 2022; originally announced September 2022.

arXiv:2104.14297 [pdf, other]

End-to-End Speech Recognition from Federated Acoustic Models

Authors: Yan Gao, Titouan Parcollet, Salah Zaiem, Javier Fernandez-Marques, Pedro P. B. de Gusmao, Daniel J. Beutel, Nicholas D. Lane

Abstract: Training Automatic Speech Recognition (ASR) models under federated learning (FL) settings has attracted a lot of attention recently. However, the FL scenarios often presented in the literature are artificial and fail to capture the complexity of real FL systems. In this paper, we construct a challenging and realistic ASR federated experimental setup consisting of clients with heterogeneous data distributions using the French and Italian sets of the CommonVoice dataset, a large heterogeneous dataset containing thousands of different speakers, acoustic environments and noises. We present the first empirical study on an attention-based sequence-to-sequence End-to-End (E2E) ASR model with three aggregation weighting strategies -- standard FedAvg, loss-based aggregation and a novel word error rate (WER)-based aggregation, compared in two realistic FL scenarios: cross-silo with 10 clients and cross-device with 2K and 4K clients. Our analysis on E2E ASR from heterogeneous and realistic federated acoustic models provides the foundations for future research and development of realistic FL-based ASR applications.

Submitted 9 July, 2021; v1 submitted 29 April, 2021; originally announced April 2021.

arXiv:1911.09968 [pdf, other]

SelfVIO: Self-Supervised Deep Monocular Visual-Inertial Odometry and Depth Estimation

Authors: Yasin Almalioglu, Mehmet Turan, Alp Eren Sari, Muhamad Risqi U. Saputra, Pedro P. B. de Gusmão, Andrew Markham, Niki Trigoni

Abstract: In the last decade, numerous supervised deep learning approaches requiring large amounts of labeled data have been proposed for visual-inertial odometry (VIO) and depth map estimation. To overcome the data limitation, self-supervised learning has emerged as a promising alternative, exploiting constraints such as geometric and photometric consistency in the scene. In this study, we introduce a novel self-supervised deep learning-based VIO and depth map recovery approach (SelfVIO) using adversarial training and self-adaptive visual-inertial sensor fusion. SelfVIO learns to jointly estimate 6 degrees-of-freedom (6-DoF) ego-motion and a depth map of the scene from unlabeled monocular RGB image sequences and inertial measurement unit (IMU) readings. The proposed approach is able to perform VIO without the need for IMU intrinsic parameters and/or the extrinsic calibration between the IMU and the camera. We provide comprehensive quantitative and qualitative evaluations of the proposed framework comparing its performance with state-of-the-art VIO, VO, and visual simultaneous localization and mapping (VSLAM) approaches on the KITTI, EuRoC and Cityscapes datasets. Detailed comparisons prove that SelfVIO outperforms state-of-the-art VIO approaches in terms of pose estimation and depth recovery, making it a promising approach among existing methods in the literature.

Submitted 23 July, 2020; v1 submitted 22 November, 2019; originally announced November 2019.

Comments: 15 pages, submitted to The IEEE Transactions on Robotics (T-RO) journal, under review

arXiv:1909.08356 [pdf, other]

Sensor Fusion for Magneto-Inductive Navigation

Authors: Johan Wahlström, Manon Kok, Pedro Porto Buarque de Gusmao, Traian E. Abrudan, Niki Trigoni, Andrew Markham

Abstract: Magneto-inductive navigation is an inexpensive and easily deployable solution to many of today's navigation problems. By utilizing very low frequency magnetic fields, magneto-inductive technology circumvents the problems with attenuation and multipath that often plague competing modalities. Using triaxial transmitter and receiver coils, it is possible to compute position and orientation estimates in three dimensions. However, in many situations, additional information is available that constrains the set of possible solutions. For example, the receiver may be known to be coplanar with the transmitter, or orientation information may be available from inertial sensors. We employ a maximum a posteriori estimator to fuse magneto-inductive signals with such complementary information. Further, we derive the Cramer-Rao bound for the position estimates and investigate the problem of detecting distortions caused by ferrous material. The performance of the estimator is compared to the Cramer-Rao bound and a state-of-the-art estimator using both simulations and real-world data. By fusing magneto-inductive signals with accelerometer measurements, the median position error is reduced almost by a factor of two.

Submitted 18 September, 2019; originally announced September 2019.

Showing 1–4 of 4 results for author: de Gusmão, P P B