-
Multi-Modal interpretable automatic video captioning
Authors:
Antoine Hanna-Asaad,
Decky Aspandi,
Titus Zaharia
Abstract:
Video captioning aims to describe video contents using natural language format that involves understanding and interpreting scenes, actions and events that occurs simultaneously on the view. Current approaches have mainly concentrated on visual cues, often neglecting the rich information available from other important modality of audio information, including their inter-dependencies. In this work,…
▽ More
Video captioning aims to describe video contents using natural language format that involves understanding and interpreting scenes, actions and events that occurs simultaneously on the view. Current approaches have mainly concentrated on visual cues, often neglecting the rich information available from other important modality of audio information, including their inter-dependencies. In this work, we introduce a novel video captioning method trained with multi-modal contrastive loss that emphasizes both multi-modal integration and interpretability. Our approach is designed to capture the dependency between these modalities, resulting in more accurate, thus pertinent captions. Furthermore, we highlight the importance of interpretability, employing multiple attention mechanisms that provide explanation into the model's decision-making process. Our experimental results demonstrate that our proposed method performs favorably against the state-of the-art models on commonly used benchmark datasets of MSR-VTT and VATEX.
△ Less
Submitted 11 November, 2024;
originally announced November 2024.
-
Deep self-supervised learning with visualisation for automatic gesture recognition
Authors:
Fabien Allemand,
Alessio Mazzela,
Jun Villette,
Decky Aspandi,
Titus Zaharia
Abstract:
Gesture is an important mean of non-verbal communication, with visual modality allows human to convey information during interaction, facilitating peoples and human-machine interactions. However, it is considered difficult to automatically recognise gestures. In this work, we explore three different means to recognise hand signs using deep learning: supervised learning based methods, self-supervis…
▽ More
Gesture is an important mean of non-verbal communication, with visual modality allows human to convey information during interaction, facilitating peoples and human-machine interactions. However, it is considered difficult to automatically recognise gestures. In this work, we explore three different means to recognise hand signs using deep learning: supervised learning based methods, self-supervised methods and visualisation based techniques applied to 3D moving skeleton data. Self-supervised learning used to train fully connected, CNN and LSTM method. Then, reconstruction method is applied to unlabelled data in simulated settings using CNN as a backbone where we use the learnt features to perform the prediction in the remaining labelled data. Lastly, Grad-CAM is applied to discover the focus of the models. Our experiments results show that supervised learning method is capable to recognise gesture accurately, with self-supervised learning increasing the accuracy in simulated settings. Finally, Grad-CAM visualisation shows that indeed the models focus on relevant skeleton joints on the associated gesture.
△ Less
Submitted 18 June, 2024;
originally announced June 2024.
-
Predicting Eye Gaze Location on Websites
Authors:
Ciheng Zhang,
Decky Aspandi,
Steffen Staab
Abstract:
World-wide-web, with the website and webpage as the main interface, facilitates the dissemination of important information. Hence it is crucial to optimize them for better user interaction, which is primarily done by analyzing users' behavior, especially users' eye-gaze locations. However, gathering these data is still considered to be labor and time intensive. In this work, we enable the developm…
▽ More
World-wide-web, with the website and webpage as the main interface, facilitates the dissemination of important information. Hence it is crucial to optimize them for better user interaction, which is primarily done by analyzing users' behavior, especially users' eye-gaze locations. However, gathering these data is still considered to be labor and time intensive. In this work, we enable the development of automatic eye-gaze estimations given a website screenshots as the input. This is done by the curation of a unified dataset that consists of website screenshots, eye-gaze heatmap and website's layout information in the form of image and text masks. Our pre-processed dataset allows us to propose an effective deep learning-based model that leverages both image and text spatial location, which is combined through attention mechanism for effective eye-gaze prediction. In our experiment, we show the benefit of careful fine-tuning using our unified dataset to improve the accuracy of eye-gaze predictions. We further observe the capability of our model to focus on the targeted areas (images and text) to achieve high accuracy. Finally, the comparison with other alternatives shows the state-of-the-art result of our model establishing the benchmark for the eye-gaze prediction task.
△ Less
Submitted 6 January, 2023; v1 submitted 15 November, 2022;
originally announced November 2022.
-
User Interaction Analysis through Contrasting Websites Experience
Authors:
Decky Aspandi,
Sarah Doosdal,
Victor Ülger,
Lukas Gillich,
Steffen Staab
Abstract:
Current advance of internet allows rapid dissemination of information, accelerating the progress on wide spectrum of society. This has been done mainly through the use of website interface with inherent unique human interactions. In this regards the usability analysis becomes a central part to improve the human interactions. However, This analysis has not yet quantitatively been evaluated through…
▽ More
Current advance of internet allows rapid dissemination of information, accelerating the progress on wide spectrum of society. This has been done mainly through the use of website interface with inherent unique human interactions. In this regards the usability analysis becomes a central part to improve the human interactions. However, This analysis has not yet quantitatively been evaluated through user perception during interaction, especially when dealing wide range of tasks. In this study, we perform the quantitative analysis the usability of websites based on their usage and relevance. We do this by reporting user interactions based user subjective perceptions, eye-tracking data and facial expressions based on the collected data from two different sets of websites. In general, we found that the user interaction parameters are substantially difference across website sets, with a degree of relation with perceived user emotions during interactions.
△ Less
Submitted 12 January, 2022; v1 submitted 10 January, 2022;
originally announced January 2022.
-
Machine Learning-based Lie Detector applied to a Novel Annotated Game Dataset
Authors:
Nuria Rodriguez-Diaz,
Decky Aspandi,
Federico Sukno,
Xavier Binefa
Abstract:
Lie detection is considered a concern for everyone in their day to day life given its impact on human interactions. Thus, people normally pay attention to both what their interlocutors are saying and also to their visual appearances, including faces, to try to find any signs that indicate whether the person is telling the truth or not. While automatic lie detection may help us to understand this l…
▽ More
Lie detection is considered a concern for everyone in their day to day life given its impact on human interactions. Thus, people normally pay attention to both what their interlocutors are saying and also to their visual appearances, including faces, to try to find any signs that indicate whether the person is telling the truth or not. While automatic lie detection may help us to understand this lying characteristics, current systems are still fairly limited, partly due to lack of adequate datasets to evaluate their performance in realistic scenarios. In this work, we have collected an annotated dataset of facial images, comprising both 2D and 3D information of several participants during a card game that encourages players to lie. Using our collected dataset, We evaluated several types of machine learning-based lie detectors in terms of their generalization, person-specific and cross-domain experiments. Our results show that models based on deep learning achieve the best accuracy, reaching up to 57\% for the generalization task and 63\% when dealing with a single participant. Finally, we also highlight the limitation of the deep learning based lie detector when dealing with cross-domain lie detection tasks.
△ Less
Submitted 30 June, 2021; v1 submitted 26 April, 2021;
originally announced April 2021.
-
An Enhanced Adversarial Network with Combined Latent Features for Spatio-Temporal Facial Affect Estimation in the Wild
Authors:
Decky Aspandi,
Federico Sukno,
Björn Schuller,
Xavier Binefa
Abstract:
Affective Computing has recently attracted the attention of the research community, due to its numerous applications in diverse areas. In this context, the emergence of video-based data allows to enrich the widely used spatial features with the inclusion of temporal information. However, such spatio-temporal modelling often results in very high-dimensional feature spaces and large volumes of data,…
▽ More
Affective Computing has recently attracted the attention of the research community, due to its numerous applications in diverse areas. In this context, the emergence of video-based data allows to enrich the widely used spatial features with the inclusion of temporal information. However, such spatio-temporal modelling often results in very high-dimensional feature spaces and large volumes of data, making training difficult and time consuming. This paper addresses these shortcomings by proposing a novel model that efficiently extracts both spatial and temporal features of the data by means of its enhanced temporal modelling based on latent features. Our proposed model consists of three major networks, coined Generator, Discriminator, and Combiner, which are trained in an adversarial setting combined with curriculum learning to enable our adaptive attention modules. In our experiments, we show the effectiveness of our approach by reporting our competitive results on both the AFEW-VA and SEWA datasets, suggesting that temporal modelling improves the affect estimates both in qualitative and quantitative terms. Furthermore, we find that the inclusion of attention mechanisms leads to the highest accuracy improvements, as its weights seem to correlate well with the appearance of facial movements, both in terms of temporal localisation and intensity. Finally, we observe the sequence length of around 160\,ms to be the optimum one for temporal modelling, which is consistent with other relevant findings utilising similar lengths.
△ Less
Submitted 17 February, 2021;
originally announced February 2021.
-
Adversarial-based neural networks for affect estimations in the wild
Authors:
Decky Aspandi,
Adria Mallol-Ragolta,
Björn Schuller,
Xavier Binefa
Abstract:
There is a growing interest in affective computing research nowadays given its crucial role in bridging humans with computers. This progress has been recently accelerated due to the emergence of bigger data. One recent advance in this field is the use of adversarial learning to improve model learning through augmented samples. However, the use of latent features, which is feasible through adversar…
▽ More
There is a growing interest in affective computing research nowadays given its crucial role in bridging humans with computers. This progress has been recently accelerated due to the emergence of bigger data. One recent advance in this field is the use of adversarial learning to improve model learning through augmented samples. However, the use of latent features, which is feasible through adversarial learning, is not largely explored, yet. This technique may also improve the performance of affective models, as analogously demonstrated in related fields, such as computer vision. To expand this analysis, in this work, we explore the use of latent features through our proposed adversarial-based networks for valence and arousal recognition in the wild. Specifically, our models operate by aggregating several modalities to our discriminator, which is further conditioned to the extracted latent features by the generator. Our experiments on the recently released SEWA dataset suggest the progressive improvements of our results. Finally, we show our competitive results on the Affective Behavior Analysis in-the-Wild (ABAW) challenge dataset
△ Less
Submitted 9 February, 2020; v1 submitted 3 February, 2020;
originally announced February 2020.
-
End-to-end facial and physiological model for Affective Computing and applications
Authors:
Joaquim Comas,
Decky Aspandi,
Xavier Binefa
Abstract:
In recent years, Affective Computing and its applications have become a fast-growing research topic. Furthermore, the rise of Deep Learning has introduced significant improvements in the emotion recognition system compared to classical methods. In this work, we propose a multi-modal emotion recognition model based on deep learning techniques using the combination of peripheral physiological signal…
▽ More
In recent years, Affective Computing and its applications have become a fast-growing research topic. Furthermore, the rise of Deep Learning has introduced significant improvements in the emotion recognition system compared to classical methods. In this work, we propose a multi-modal emotion recognition model based on deep learning techniques using the combination of peripheral physiological signals and facial expressions. Moreover, we present an improvement to proposed models by introducing latent features extracted from our internal Bio Auto-Encoder (BAE). Both models are trained and evaluated on AMIGOS datasets reporting valence, arousal, and emotion state classification. Finally, to demonstrate a possible medical application in affective computing using deep learning techniques, we applied the proposed method to the assessment of anxiety therapy. To this purpose, a reduced multi-modal database has been collected by recording facial expressions and peripheral signals such as Electrocardiogram (ECG) and Galvanic Skin Response (GSR) of each patient. Valence and arousal estimation was extracted using the proposed model from the beginning until the end of the therapy, with successful evaluation to the different emotional changes in the temporal domain.
△ Less
Submitted 20 January, 2020; v1 submitted 10 December, 2019;
originally announced December 2019.