-
Continuous-Time Audiovisual Fusion with Recurrence vs. Attention for In-The-Wild Affect Recognition
Authors:
Vincent Karas,
Mani Kumar Tellamekala,
Adria Mallol-Ragolta,
Michel Valstar,
Björn W. Schuller
Abstract:
In this paper, we present our submission to the 3rd Affective Behavior Analysis in-the-wild (ABAW) challenge. Learning complex interactions among multimodal sequences is critical to recognise dimensional affect from in-the-wild audiovisual data. Recurrence and attention are the two widely used sequence modelling mechanisms in the literature. To clearly understand the performance differences between recurrent and attention models in audiovisual affect recognition, we present a comprehensive evaluation of fusion models based on LSTM-RNNs, self-attention and cross-modal attention, trained for valence and arousal estimation. Particularly, we study the impact of some key design choices: the modelling complexity of CNN backbones that provide features to the temporal models, with and without end-to-end learning. We trained the audiovisual affect recognition models on the in-the-wild ABAW corpus by systematically tuning the hyper-parameters involved in the network architecture design and training optimisation. Our extensive evaluation of the audiovisual fusion models shows that LSTM-RNNs can outperform the attention models when coupled with low-complexity CNN backbones and trained in an end-to-end fashion, implying that attention models may not necessarily be the optimal choice for continuous-time multimodal emotion recognition.
Submitted 29 March, 2022; v1 submitted 24 March, 2022;
originally announced March 2022.
-
EMOPAIN Challenge 2020: Multimodal Pain Evaluation from Facial and Bodily Expressions
Authors:
Joy O. Egede,
Siyang Song,
Temitayo A. Olugbade,
Chongyang Wang,
Amanda Williams,
Hongying Meng,
Min Aung,
Nicholas D. Lane,
Michel Valstar,
Nadia Bianchi-Berthouze
Abstract:
The EmoPain 2020 Challenge is the first international competition aimed at creating a uniform platform for comparing machine learning and multimedia processing methods for automatic chronic pain assessment from human expressive behaviour, and for identifying pain-related behaviours. The objective of the challenge is to promote research in the development of assistive technologies that help improve the quality of life for people with chronic pain via real-time monitoring and feedback to help manage their condition and remain physically active. The challenge also aims to encourage the use of the relatively underutilised, albeit vital, bodily expression signals for automatic pain and pain-related emotion recognition. This paper presents a description of the challenge, competition guidelines, benchmarking dataset, and the baseline systems' architecture and performance on the three sub-tasks: pain estimation from facial expressions, pain recognition from multimodal movement, and protective movement behaviour detection.
Submitted 9 March, 2020; v1 submitted 21 January, 2020;
originally announced January 2020.
-
Noise Invariant Frame Selection: A Simple Method to Address the Background Noise Problem for Text-independent Speaker Verification
Authors:
Siyang Song,
Shuimei Zhang,
Björn Schuller,
Linlin Shen,
Michel Valstar
Abstract:
The performance of speaker-related systems usually degrades heavily in practical applications, largely due to the presence of background noise. To improve the robustness of such systems in unknown noisy environments, this paper proposes a simple pre-processing method called Noise Invariant Frame Selection (NIFS). Based on several noisy constraints, it selects noise invariant frames from utterances to represent speakers. Experiments conducted on the TIMIT database showed that NIFS can significantly improve the performance of Vector Quantization (VQ), Gaussian Mixture Model-Universal Background Model (GMM-UBM) and i-vector-based speaker verification systems in different unknown noisy environments with different SNRs, in comparison to their baselines. Meanwhile, the proposed NIFS-based speaker verification systems achieve similar performance when we change the constraints (hyper-parameters) or features, which indicates that the method is robust and easy to reproduce. Since NIFS is designed as a general algorithm, it could be further applied to other similar tasks.
Submitted 3 May, 2018;
originally announced May 2018.