-
Benchmarking Cross-Domain Audio-Visual Deception Detection
Authors:
Xiaobao Guo,
Zitong Yu,
Nithish Muthuchamy Selvaraj,
Bingquan Shen,
Adams Wai-Kin Kong,
Alex C. Kot
Abstract:
Automated deception detection is crucial for assisting humans in accurately assessing truthfulness and identifying deceptive behavior. Conventional contact-based techniques, like polygraph devices, rely on physiological signals to determine the authenticity of an individual's statements. Nevertheless, recent developments in automated deception detection have demonstrated that multimodal features derived from both audio and video modalities may outperform human observers on publicly available datasets. Despite these positive findings, the generalizability of existing audio-visual deception detection approaches across different scenarios remains largely unexplored. To close this gap, we present the first cross-domain audio-visual deception detection benchmark, which enables us to assess how well these methods generalize to real-world scenarios. We use widely adopted audio and visual features and different architectures for benchmarking, comparing single-to-single and multi-to-single domain generalization performance. To further explore the impact of using data from multiple source domains for training, we investigate three types of domain sampling strategies, namely domain-simultaneous, domain-alternating, and domain-by-domain, for multi-to-single domain generalization evaluation. We also propose an algorithm, named "MM-IDGM", that enhances generalization performance by maximizing the gradient inner products between modality encoders. Furthermore, we propose the Attention-Mixer fusion method to improve performance. We believe that this new cross-domain benchmark will facilitate future research in audio-visual deception detection.
Submitted 5 October, 2024; v1 submitted 11 May, 2024;
originally announced May 2024.
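The three domain sampling strategies named in the abstract above can be pictured as three ways of ordering training batches. The sketch below is only one plausible reading: the abstract does not define the strategies, so the function names, the round-robin interpretation of "alternating", and the equal-share interpretation of "simultaneous" are all assumptions made for illustration.

```python
from itertools import chain, zip_longest


def batches(samples, size):
    """Split one domain's samples into consecutive batches of `size`."""
    return [samples[i:i + size] for i in range(0, len(samples), size)]


def domain_by_domain(domains, size):
    # Assumed reading: finish every batch of one source domain
    # before starting the next domain.
    return [b for d in domains for b in batches(d, size)]


def domain_alternating(domains, size):
    # Assumed reading: take one batch from each source domain in turn.
    per_domain = [batches(d, size) for d in domains]
    rounds = zip_longest(*per_domain)
    return [b for r in rounds for b in r if b is not None]


def domain_simultaneous(domains, size):
    # Assumed reading: every batch holds an equal share of each domain.
    share = size // len(domains)
    if share == 0:
        raise ValueError("batch size must be at least the number of domains")
    per_domain = [batches(d, share) for d in domains]
    return [list(chain.from_iterable(group)) for group in zip(*per_domain)]


# Two hypothetical source domains with four samples each.
toy_domains = [["a1", "a2", "a3", "a4"], ["b1", "b2", "b3", "b4"]]
for strategy in (domain_by_domain, domain_alternating, domain_simultaneous):
    print(strategy.__name__, strategy(toy_domains, 2))
```

Under these assumptions the three schedules see exactly the same samples and differ only in how the domains are interleaved, which is the variable the benchmark compares.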
-
Improving Concept Alignment in Vision-Language Concept Bottleneck Models
Authors:
Nithish Muthuchamy Selvaraj,
Xiaobao Guo,
Adams Wai-Kin Kong,
Alex Kot
Abstract:
Concept Bottleneck Models (CBMs) map images to human-interpretable concepts before making class predictions. Recent approaches automate CBM construction by prompting Large Language Models (LLMs) to generate text concepts and employing Vision Language Models (VLMs) to score these concepts for CBM training. However, it is desirable to build CBMs with concepts defined by human experts rather than LLM-generated ones to make them more trustworthy. In this work, we closely examine the faithfulness of VLM concept scores for such expert-defined concepts in domains like fine-grained bird species and animal classification. Our investigations reveal that VLMs like CLIP often struggle to correctly associate a concept with the corresponding visual input, despite achieving high classification performance. This misalignment renders the resulting models difficult to interpret and less reliable. To address this issue, we propose a novel Contrastive Semi-Supervised (CSS) learning method that leverages a few labeled concept samples to activate truthful visual concepts and improve concept alignment in the CLIP model. Extensive experiments on three benchmark datasets demonstrate that our method significantly enhances both concept (+29.95) and classification (+3.84) accuracies yet requires only a fraction of human-annotated concept labels. To further improve the classification performance, we introduce a class-level intervention procedure for fine-grained classification problems that identifies the confounding classes and intervenes in their concept space to reduce errors.
Submitted 24 August, 2024; v1 submitted 2 May, 2024;
originally announced May 2024.
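The pipeline described in the abstract above (a VLM scores concepts, then a classifier predicts from those scores alone) can be sketched in a few lines. This is a toy illustration, not the paper's CSS method: the embeddings, the weights, and the helper names `concept_scores` and `predict` are hypothetical, and cosine similarity stands in for whatever scoring a real VLM such as CLIP would provide.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def concept_scores(image_emb, concept_embs):
    # One score per concept: similarity between the image embedding
    # and that concept's text embedding.
    return [cosine(image_emb, c) for c in concept_embs]


def predict(scores, class_weights):
    # The bottleneck: the class is predicted from concept scores only,
    # here with a single linear layer (one weight row per class).
    logits = [sum(w * s for w, s in zip(row, scores)) for row in class_weights]
    return max(range(len(logits)), key=logits.__getitem__)


# Hypothetical 2-D embeddings for one image and two concepts.
image_emb = [1.0, 0.0]
concept_embs = [[1.0, 0.0], [0.0, 1.0]]
class_weights = [[0.0, 1.0], [1.0, 0.0]]

scores = concept_scores(image_emb, concept_embs)
print(scores, predict(scores, class_weights))
```

Because the classifier sees only the concept scores, a score that does not reflect the image content makes the prediction hard to interpret even when the class is correct, which is the misalignment the abstract examines.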
-
Audio-Visual Deception Detection: DOLOS Dataset and Parameter-Efficient Crossmodal Learning
Authors:
Xiaobao Guo,
Nithish Muthuchamy Selvaraj,
Zitong Yu,
Adams Wai-Kin Kong,
Bingquan Shen,
Alex Kot
Abstract:
Deception detection in conversations is a challenging yet important task, with pivotal applications in many fields such as credibility assessment in business, multimedia anti-fraud, and customs security. Despite this, deception detection research is hindered by the lack of high-quality deception datasets, as well as the difficulty of learning multimodal features effectively. To address this issue, we introduce DOLOS (the name comes from Greek mythology), the largest gameshow deception detection dataset with rich deceptive conversations. DOLOS includes 1,675 video clips featuring 213 subjects, and it has been labeled with audio-visual feature annotations. We provide train-test, duration, and gender protocols to investigate the impact of different factors. We benchmark our dataset on previously proposed deception detection approaches. To further improve the performance by fine-tuning fewer parameters, we propose Parameter-Efficient Crossmodal Learning (PECL), where a Uniform Temporal Adapter (UT-Adapter) explores temporal attention in transformer-based architectures, and a crossmodal fusion module, Plug-in Audio-Visual Fusion (PAVF), combines crossmodal information from audio-visual features. Based on the rich fine-grained audio-visual annotations on DOLOS, we also exploit multi-task learning to enhance performance by concurrently predicting deception and audio-visual features. Experimental results demonstrate the desired quality of the DOLOS dataset and the effectiveness of PECL. The DOLOS dataset and the source code are available at https://github.com/NMS05/Audio-Visual-Deception-Detection-DOLOS-Dataset-and-Parameter-Efficient-Crossmodal-Learning/tree/main.
Submitted 3 August, 2023; v1 submitted 9 March, 2023;
originally announced March 2023.
-
Adapter Incremental Continual Learning of Efficient Audio Spectrogram Transformers
Authors:
Nithish Muthuchamy Selvaraj,
Xiaobao Guo,
Adams Kong,
Bingquan Shen,
Alex Kot
Abstract:
Continual learning involves training neural networks incrementally for new tasks while retaining the knowledge of previous tasks. However, efficiently fine-tuning the model for sequential tasks with minimal computational resources remains a challenge. In this paper, we propose Task Incremental Continual Learning (TI-CL) of audio classifiers with both parameter-efficient and compute-efficient Audio Spectrogram Transformers (AST). To reduce the trainable parameters without performance degradation for TI-CL, we compare several Parameter Efficient Transfer (PET) methods and propose AST with Convolutional Adapters for TI-CL, which has fewer than 5% of the trainable parameters of its fully fine-tuned counterparts. To reduce the computational complexity, we introduce a novel Frequency-Time factorized Attention (FTA) method that replaces the traditional self-attention in transformers for audio spectrograms. FTA achieves competitive performance with only a fraction of the computations required by Global Self-Attention (GSA). Finally, we formulate our method for TI-CL, called Adapter Incremental Continual Learning (AI-CL), as a combination of the "parameter-efficient" Convolutional Adapter and the "compute-efficient" FTA. Experiments on ESC-50, SpeechCommandsV2 (SCv2), and Audio-Visual Event (AVE) benchmarks show that our proposed method prevents catastrophic forgetting in TI-CL while maintaining a lower computational budget.
Submitted 2 January, 2024; v1 submitted 28 February, 2023;
originally announced February 2023.
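The compute saving claimed for FTA in the abstract above can be made concrete with a rough count of attention score pairs. The abstract does not spell out how FTA factorizes attention, so this sketch assumes an axial-style scheme in which each spectrogram token attends only to tokens sharing its time index and tokens sharing its frequency index. The function names and the 12 x 101 patch grid are illustrative choices, not values taken from the paper.

```python
def global_attention_pairs(freq, time):
    """Score pairs when every token attends to every token."""
    tokens = freq * time
    return tokens * tokens


def factorized_attention_pairs(freq, time):
    """Score pairs under the assumed frequency/time factorization.

    Each of the freq * time tokens attends to the `freq` tokens in its
    time frame and the `time` tokens in its frequency band.
    """
    return freq * time * (freq + time)


# Illustrative grid of 12 frequency patches by 101 time patches.
freq, time = 12, 101
g = global_attention_pairs(freq, time)
f = factorized_attention_pairs(freq, time)
print(g, f, round(g / f, 1))
```

For this grid the counts are 1,468,944 pairs for global attention against 136,956 under the assumed factorization, a reduction of roughly 10.7x. The ratio is (freq * time) / (freq + time), so the gap widens as the spectrogram gets longer.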
-
Flexible-modal Deception Detection with Audio-Visual Adapter
Authors:
Zhaoxu Li,
Zitong Yu,
Nithish Muthuchamy Selvaraj,
Xiaobao Guo,
Bingquan Shen,
Adams Wai-Kin Kong,
Alex Kot
Abstract:
Detecting deception from human behavior is vital in many fields such as customs security and multimedia anti-fraud. Recently, audio-visual deception detection has attracted more attention because it performs better than using only a single modality. However, in real-world multi-modal settings, the integrity of data can be an issue (e.g., sometimes only some of the modalities are available). A missing modality might lead to a decrease in performance, but the model still learns the features of the missing modality. In this paper, to further improve the performance and overcome the missing modality problem, we propose a novel Transformer-based framework with an Audio-Visual Adapter (AVA) to fuse temporal features across two modalities efficiently. Extensive experiments conducted on two benchmark datasets demonstrate that the proposed method can achieve superior performance compared with other multi-modal fusion methods under flexible-modal (multiple and missing modalities) settings.
Submitted 11 February, 2023;
originally announced February 2023.