-
Non-verbal Hands-free Control for Smart Glasses using Teeth Clicks
Authors:
Payal Mohapatra,
Ali Aroudi,
Anurag Kumar,
Morteza Khaleghimeybodi
Abstract:
Smart glasses are emerging as a popular wearable computing platform potentially revolutionizing the next generation of human-computer interaction. The widespread adoption of smart glasses has created a pressing need for discreet and hands-free control methods. Traditional input techniques, such as voice commands or tactile gestures, can be intrusive and non-discreet. Additionally, voice-based control may not function well in noisy acoustic conditions. We propose a novel, discreet, non-verbal, and non-tactile approach to controlling smart glasses through subtle vibrations on the skin induced by teeth clicking. We demonstrate that these vibrations can be sensed by accelerometers embedded in the glasses with a low-footprint predictive model. Our proposed method, called STEALTHsense, utilizes a temporal broadcasting-based neural network architecture with just 88K trainable parameters and 7.14M Multiply and Accumulate (MMAC) per inference unit. We benchmark our proposed STEALTHsense against state-of-the-art deep learning approaches and traditional low-footprint machine learning approaches. We conducted a study across 21 participants to collect representative samples for two distinct teeth-clicking patterns and many non-patterns for robust training of STEALTHsense, achieving an average cross-person accuracy of 0.93. Field testing confirmed its effectiveness, even in noisy conditions, underscoring STEALTHsense's potential for real-world applications, offering a promising solution for smart glasses interaction.
Submitted 21 August, 2024;
originally announced August 2024.
-
FoVNet: Configurable Field-of-View Speech Enhancement with Low Computation and Distortion for Smart Glasses
Authors:
Zhongweiyang Xu,
Ali Aroudi,
Ke Tan,
Ashutosh Pandey,
Jung-Suk Lee,
Buye Xu,
Francesco Nesta
Abstract:
This paper presents a novel multi-channel speech enhancement approach, FoVNet, that enables highly efficient speech enhancement within a configurable field of view (FoV) of a smart-glasses user without needing specific target-talker(s) directions. It advances over prior works by enhancing all speakers within any given FoV, with a hybrid signal processing and deep learning approach designed with high computational efficiency. The neural network component is designed with ultra-low computation (about 50 MMACS). A multi-channel Wiener filter and a post-processing module are further used to improve perceptual quality. We evaluate our algorithm with a microphone array on smart glasses, providing a configurable, efficient solution for augmented hearing on energy-constrained devices. FoVNet excels in both computational efficiency and speech quality across multiple scenarios, making it a promising solution for smart glasses applications.
Submitted 12 August, 2024;
originally announced August 2024.
-
TRUNet: Transformer-Recurrent-U Network for Multi-channel Reverberant Sound Source Separation
Authors:
Ali Aroudi,
Stefan Uhlich,
Marc Ferras Font
Abstract:
In recent years, many deep learning techniques for single-channel sound source separation have been proposed using recurrent, convolutional and transformer networks. When multiple microphones are available, spatial diversity between speakers and background noise in addition to spectro-temporal diversity can be exploited by using multi-channel filters for sound source separation. Aiming at end-to-end multi-channel source separation, in this paper we propose a transformer-recurrent-U network (TRUNet), which directly estimates multi-channel filters from multi-channel input spectra. TRUNet consists of a spatial processing network with an attention mechanism across microphone channels aiming at capturing the spatial diversity, and a spectro-temporal processing network aiming at capturing spectral and temporal diversities. In addition to multi-channel filters, we also consider estimating single-channel filters from multi-channel input spectra using TRUNet. We train the network on a large reverberant dataset using a combined compressed mean-squared error loss function, which further improves the sound separation performance. We evaluate the network on a realistic and challenging reverberant dataset, generated from measured room impulse responses of an actual microphone array. The experimental results on realistic reverberant sound source separation show that the proposed TRUNet outperforms state-of-the-art single-channel and multi-channel source separation methods.
Submitted 22 August, 2022; v1 submitted 8 October, 2021;
originally announced October 2021.
-
DBNET: DOA-driven beamforming network for end-to-end farfield sound source separation
Authors:
Ali Aroudi,
Sebastian Braun
Abstract:
Many deep learning techniques are available to perform source separation and reduce background noise. However, designing an end-to-end multi-channel source separation method using deep learning and conventional acoustic signal processing techniques still remains challenging. In this paper we propose a direction-of-arrival-driven beamforming network (DBnet) consisting of direction-of-arrival (DOA) estimation and beamforming layers for end-to-end source separation. We propose to train DBnet using loss functions that are solely based on the distances between the separated speech signals and the target speech signals, without a need for the ground-truth DOAs of speakers. To improve the source separation performance, we also propose end-to-end extensions of DBnet which incorporate post masking networks. We evaluate the proposed DBnet and its extensions on a very challenging dataset, targeting realistic far-field sound source separation in reverberant and noisy environments. The experimental results show that the proposed extended DBnet using a convolutional-recurrent post masking network outperforms state-of-the-art source separation methods.
Submitted 22 October, 2020;
originally announced October 2020.
-
Cognitive-driven convolutional beamforming using EEG-based auditory attention decoding
Authors:
Ali Aroudi,
Marc Delcroix,
Tomohiro Nakatani,
Keisuke Kinoshita,
Shoko Araki,
Simon Doclo
Abstract:
The performance of speech enhancement algorithms in a multi-speaker scenario depends on correctly identifying the target speaker to be enhanced. Auditory attention decoding (AAD) methods make it possible to identify the target speaker the listener is attending to from single-trial EEG recordings. Aiming at enhancing the target speaker and suppressing interfering speakers, reverberation and ambient noise, in this paper we propose a cognitive-driven multi-microphone speech enhancement system, which combines a neural-network-based mask estimator, weighted minimum power distortionless response convolutional beamformers and AAD. To control the suppression of the interfering speaker, we also propose an extension incorporating an interference suppression constraint. The experimental results show that the proposed system outperforms the state-of-the-art cognitive-driven speech enhancement systems in challenging reverberant and noisy conditions.
Submitted 10 May, 2020;
originally announced May 2020.
-
Improving auditory attention decoding performance of linear and non-linear methods using state-space model
Authors:
Ali Aroudi,
Tobias de Taillez,
Simon Doclo
Abstract:
Identifying the target speaker in hearing aid applications is crucial to improve speech understanding. Recent advances in electroencephalography (EEG) have shown that it is possible to identify the target speaker from single-trial EEG recordings using auditory attention decoding (AAD) methods. AAD methods reconstruct the attended speech envelope from EEG recordings, based on a linear least-squares cost function or non-linear neural networks, and then directly compare the reconstructed envelope with the speech envelopes of speakers to identify the attended speaker using Pearson correlation coefficients. Since these correlation coefficients fluctuate strongly, reliable decoding requires a large correlation window, which causes a large processing delay. In this paper, we investigate a state-space model using correlation coefficients obtained with a small correlation window to improve the decoding performance of the linear and the non-linear AAD methods. The experimental results show that the state-space model significantly improves the decoding performance.
Submitted 2 April, 2020;
originally announced April 2020.
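The correlation-based decision step described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation and does not include the state-space model. It only shows the baseline idea of comparing a reconstructed envelope with each speaker's envelope via Pearson correlation, window by window. The function names (`pearson`, `decode_attention`), the synthetic envelopes, the noise level, and the window lengths are all assumptions made for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def decode_attention(reconstructed, envelopes, window):
    """Return one decision per non-overlapping window.

    Each decision is the index of the speaker envelope that correlates
    most strongly with the reconstructed envelope in that window.
    """
    decisions = []
    for start in range(0, len(reconstructed) - window + 1, window):
        segment = reconstructed[start:start + window]
        scores = [pearson(segment, env[start:start + window]) for env in envelopes]
        decisions.append(max(range(len(scores)), key=scores.__getitem__))
    return decisions


def accuracy(decisions, attended_index):
    """Fraction of windows in which the attended speaker was selected."""
    return sum(d == attended_index for d in decisions) / len(decisions)


# Synthetic data (hypothetical): speaker 0 is attended, and the
# "reconstructed" envelope is that speaker's envelope plus heavy noise.
rng = random.Random(0)
num_samples = 2000
speaker_a = [rng.random() for _ in range(num_samples)]
speaker_b = [rng.random() for _ in range(num_samples)]
reconstructed = [a + rng.gauss(0.0, 1.0) for a in speaker_a]

short_decisions = decode_attention(reconstructed, [speaker_a, speaker_b], window=20)
long_decisions = decode_attention(reconstructed, [speaker_a, speaker_b], window=1000)

print("short-window accuracy:", accuracy(short_decisions, 0))
print("long-window accuracy:", accuracy(long_decisions, 0))
```

On this toy data, the long window yields fewer but more stable decisions, while the short window yields many decisions that fluctuate more. This is the trade-off between reliability and processing delay that the abstract says motivates the state-space model.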