
MHS-STMA: Multimodal Hate Speech Detection via Scalable Transformer-Based Multilevel Attention Framework

Authors: Anusha Chhabra, Dinesh Kumar Vishwakarma

Abstract: Social media has a significant impact on people's lives. Hate speech on social media has emerged as one of society's most serious issues in recent years. Text and pictures are two forms of multimodal data that are distributed within articles. Unimodal analysis has been the primary emphasis of earlier approaches. Additionally, when doing multimodal analysis, researchers neglect to preserve the distinctive qualities associated with each modality. To address these shortcomings, the present article suggests a scalable architecture for multimodal hate content detection called transformer-based multilevel attention (STMA). This architecture consists of three main parts: a combined attention-based deep learning mechanism, a vision attention-mechanism encoder, and a caption attention-mechanism encoder. To identify hate content, each component uses various attention processes and handles multimodal data in a unique way. Several studies employing multiple assessment criteria on three hate speech datasets such as Hateful memes, MultiOff, and MMHS150K, validate the suggested architecture's efficacy. The outcomes demonstrate that on all three datasets, the suggested strategy performs better than the baseline approaches.

Submitted 17 September, 2024; v1 submitted 8 September, 2024; originally announced September 2024.

arXiv:2409.05134 [pdf]

Hate Content Detection via Novel Pre-Processing Sequencing and Ensemble Methods

Authors: Anusha Chhabra, Dinesh Kumar Vishwakarma

Abstract: Social media, particularly Twitter, has seen a significant increase in incidents like trolling and hate speech. Thus, identifying hate speech is the need of the hour. This paper introduces a computational framework to curb the hate content on the web. Specifically, this study presents an exhaustive study of pre-processing approaches by studying the impact of changing the sequence of text pre-processing operations for the identification of hate content. The best-performing pre-processing sequence, when implemented with popular classification approaches like Support Vector Machine, Random Forest, Decision Tree, Logistic Regression and K-Neighbor provides a considerable boost in performance. Additionally, the best pre-processing sequence is used in conjunction with different ensemble methods, such as bagging, boosting and stacking to improve the performance further. Three publicly available benchmark datasets (WZ-LS, DT, and FOUNTA), were used to evaluate the proposed approach for hate speech identification. The proposed approach achieves a maximum accuracy of 95.14% highlighting the effectiveness of the unique pre-processing approach along with an ensemble classifier.

Submitted 8 September, 2024; originally announced September 2024.

arXiv:2409.00896 [pdf]

A Noise and Edge extraction-based dual-branch method for Shallowfake and Deepfake Localization

Authors: Deepak Dagar, Dinesh Kumar Vishwakarma

Abstract: The trustworthiness of multimedia is being increasingly evaluated by advanced Image Manipulation Localization (IML) techniques, resulting in the emergence of the IML field. An effective manipulation model necessitates the extraction of non-semantic differential features between manipulated and legitimate sections to utilize artifacts. This requires direct comparisons between the two regions. Current models employ either feature approaches based on handcrafted features, convolutional neural networks (CNNs), or a hybrid approach that combines both. Handcrafted feature approaches presuppose tampering in advance, hence restricting their effectiveness in handling various tampering procedures, but CNNs capture semantic information, which is insufficient for addressing manipulation artifacts. In order to address these constraints, we have developed a dual-branch model that integrates manually designed feature noise with conventional CNN features. This model employs a dual-branch strategy, where one branch integrates noise characteristics and the other branch integrates RGB features using the hierarchical ConvNext Module. In addition, the model utilizes edge supervision loss to acquire boundary manipulation information, resulting in accurate localization at the edges. Furthermore, this architecture utilizes a feature augmentation module to optimize and refine the presentation of attributes.
The shallowfakes dataset (CASIA, COVERAGE, COLUMBIA, NIST16) and deepfake dataset Faceforensics++ (FF++) underwent thorough testing to demonstrate their outstanding ability to extract features and their superior performance compared to other baseline models. The AUC score achieved an astounding 99%. The model is superior in comparison and easily outperforms the existing state-of-the-art (SoTA) models.

Submitted 1 September, 2024; originally announced September 2024.

arXiv:2408.16892 [pdf]

Tex-ViT: A Generalizable, Robust, Texture-based dual-branch cross-attention deepfake detector

Authors: Deepak Dagar, Dinesh Kumar Vishwakarma

Abstract: Deepfakes, which employ GANs to produce highly realistic facial modification, are widely regarded as the prevailing method. Traditional CNNs have been able to identify bogus media, but they struggle to perform well on different datasets and are vulnerable to adversarial attacks due to their lack of robustness. Vision transformers have demonstrated potential in the realm of image classification problems, but they require enough training data. Motivated by these limitations, this publication introduces Tex-ViT (Texture-Vision Transformer), which enhances CNN features by combining ResNet with a vision transformer. The model combines traditional ResNet features with a texture module that operates in parallel on sections of ResNet before each down-sampling operation. The texture module then serves as an input to the dual branch of the cross-attention vision transformer. It specifically focuses on improving the global texture module, which extracts feature map correlation. Empirical analysis reveals that fake images exhibit smooth textures that do not remain consistent over long distances in manipulations. Experiments were performed on different categories of FF++, such as DF, f2f, FS, and NT, together with other types of GAN datasets in cross-domain scenarios. Furthermore, experiments were also conducted on the FF++, DFDCPreview, and Celeb-DF datasets under several post-processing conditions, such as blurring, compression, and noise. The model surpassed the most advanced models in terms of generalization, achieving a 98% accuracy in cross-domain scenarios.
This demonstrates its ability to learn the shared distinguishing textural characteristics in the manipulated samples. These experiments provide evidence that the proposed model is capable of being applied to various situations and is resistant to many post-processing procedures.

Submitted 29 August, 2024; originally announced August 2024.

arXiv:2408.10248 [pdf]

Target-Dependent Multimodal Sentiment Analysis Via Employing Visual-to Emotional-Caption Translation Network using Visual-Caption Pairs

Authors: Ananya Pandey, Dinesh Kumar Vishwakarma

Abstract: The natural language processing and multimedia field has seen a notable surge in interest in multimodal sentiment recognition. Hence, this study aims to employ Target-Dependent Multimodal Sentiment Analysis (TDMSA) to identify the level of sentiment associated with every target (aspect) stated within a multimodal post consisting of a visual-caption pair. Despite the recent advancements in multimodal sentiment recognition, there has been a lack of explicit incorporation of emotional clues from the visual modality, specifically those pertaining to facial expressions. The challenge at hand is to proficiently obtain visual and emotional clues and subsequently synchronise them with the textual content. In light of this fact, this study presents a novel approach called the Visual-to-Emotional-Caption Translation Network (VECTN) technique. The primary objective of this strategy is to effectively acquire visual sentiment clues by analysing facial expressions. Additionally, it effectively aligns and blends the obtained emotional clues with the target attribute of the caption mode. The experimental findings demonstrate that our methodology is capable of producing ground-breaking outcomes when applied to two publicly accessible multimodal Twitter datasets, namely, Twitter-2015 and Twitter-2017. The experimental results show that the suggested model achieves an accuracy of 81.23% and a macro-F1 of 80.61% on the Twitter-15 dataset, while 77.42% and 75.19% on the Twitter-17 dataset, respectively.
The observed improvement in performance reveals that our model is better than others when it comes to collecting target-level sentiment in multimodal data using the expressions of the face.

Submitted 5 August, 2024; originally announced August 2024.

arXiv:2408.10246 [pdf]

VyAnG-Net: A Novel Multi-Modal Sarcasm Recognition Model by Uncovering Visual, Acoustic and Glossary Features

Authors: Ananya Pandey, Dinesh Kumar Vishwakarma

Abstract: Various linguistic and non-linguistic clues, such as excessive emphasis on a word, a shift in the tone of voice, or an awkward expression, frequently convey sarcasm. The computer vision problem of sarcasm recognition in conversation aims to identify hidden sarcastic, criticizing, and metaphorical information embedded in everyday dialogue. Previously, sarcasm recognition has focused mainly on text. Still, it is critical to consider all textual information, audio stream, facial expression, and body position for reliable sarcasm identification. Hence, we propose a novel approach that combines a lightweight depth attention module with a self-regulated ConvNet to concentrate on the most crucial features of visual data and an attentional tokenizer based strategy to extract the most critical context-specific information from the textual data. The following is a list of the key contributions that our experimentation has made in response to performing the task of Multi-modal Sarcasm Recognition: an attentional tokenizer branch to get beneficial features from the glossary content provided by the subtitles; a visual branch for acquiring the most prominent features from the video frames; an utterance-level feature extraction from acoustic content and a multi-headed attention based feature fusion branch to blend features obtained from multiple modalities.
Extensive testing on one of the benchmark video datasets, MUStARD, yielded an accuracy of 79.86% for speaker dependent and 76.94% for speaker independent configuration demonstrating that our approach is superior to the existing methods. We have also conducted a cross-dataset analysis to test the adaptability of VyAnG-Net with unseen samples of another dataset MUStARD++.

Submitted 5 August, 2024; originally announced August 2024.

arXiv:2408.02595 [pdf]

Modelling Visual Semantics via Image Captioning to extract Enhanced Multi-Level Cross-Modal Semantic Incongruity Representation with Attention for Multimodal Sarcasm Detection

Authors: Sajal Aggarwal, Ananya Pandey, Dinesh Kumar Vishwakarma

Abstract: Sarcasm is a type of irony, characterized by an inherent mismatch between the literal interpretation and the intended connotation. Though sarcasm detection in text has been extensively studied, there are situations in which textual input alone might be insufficient to perceive sarcasm. The inclusion of additional contextual cues, such as images, is essential to recognize sarcasm in social media data effectively. This study presents a novel framework for multimodal sarcasm detection that can process input triplets. Two components of these triplets comprise the input text and its associated image, as provided in the datasets. Additionally, a supplementary modality is introduced in the form of descriptive image captions. The motivation behind incorporating this visual semantic representation is to more accurately capture the discrepancies between the textual and visual content, which are fundamental to the sarcasm detection task.
The primary contributions of this study are: (1) a robust textual feature extraction branch that utilizes a cross-lingual language model; (2) a visual feature extraction branch that incorporates a self-regulated residual ConvNet integrated with a lightweight spatially aware attention module; (3) an additional modality in the form of image captions generated using an encoder-decoder architecture capable of reading text embedded in images; (4) distinct attention modules to effectively identify the incongruities between the text and two levels of image representations; (5) multi-level cross-domain semantic incongruity representation achieved through feature fusion. Compared with cutting-edge baselines, the proposed model achieves the best accuracy of 92.89% and 64.48%, respectively, on the Twitter multimodal sarcasm and MultiBully datasets.

Submitted 5 August, 2024; originally announced August 2024.

arXiv:2408.02571 [pdf]

Contrastive Learning-based Multi Modal Architecture for Emoticon Prediction by Employing Image-Text Pairs

Authors: Ananya Pandey, Dinesh Kumar Vishwakarma

Abstract: The emoticons are symbolic representations that generally accompany the textual content to visually enhance or summarize the true intention of a written message. Although widely utilized in the realm of social media, the core semantics of these emoticons have not been extensively explored based on multiple modalities. Incorporating textual and visual information within a single message develops an advanced way of conveying information. Hence, this research aims to analyze the relationship among sentences, visuals, and emoticons. For an orderly exposition, this paper initially provides a detailed examination of the various techniques for extracting multimodal features, emphasizing the pros and cons of each method. Through conducting a comprehensive examination of several multimodal algorithms, with specific emphasis on the fusion approaches, we have proposed a novel contrastive learning based multimodal architecture. The proposed model employs the joint training of dual-branch encoder along with the contrastive learning to accurately map text and images into a common latent space. Our key finding is that integrating the principle of contrastive learning with that of the other two branches yields superior results. The experimental results demonstrate that our suggested methodology surpasses existing multimodal approaches in terms of accuracy and robustness. The proposed model attained an accuracy of 91% and an MCC-score of 90% while assessing emoticons using the Multimodal-Twitter Emoticon dataset acquired from Twitter.
We provide evidence that deep features acquired by contrastive learning are more efficient, suggesting that the proposed fusion technique also possesses strong generalisation capabilities for recognising emoticons across several modes.

Submitted 5 August, 2024; originally announced August 2024.

arXiv:2401.06999 [pdf]

Datasets, Clues and State-of-the-Arts for Multimedia Forensics: An Extensive Review

Authors: Ankit Yadav, Dinesh Kumar Vishwakarma

Abstract: With the large chunks of social media data being created daily and the parallel rise of realistic multimedia tampering methods, detecting and localising tampering in images and videos has become essential. This survey focusses on approaches for tampering detection in multimedia data using deep learning models. Specifically, it presents a detailed analysis of benchmark datasets for malicious manipulation detection that are publicly available. It also offers a comprehensive list of tampering clues and commonly used deep learning architectures. Next, it discusses the current state-of-the-art tampering detection methods, categorizing them into meaningful types such as deepfake detection methods, splice tampering detection methods, copy-move tampering detection methods, etc. and discussing their strengths and weaknesses. Top results achieved on benchmark datasets, comparison of deep learning approaches against traditional methods and critical insights from the recent tampering detection methods are also discussed. Lastly, the research gaps, future direction and conclusion are discussed to provide an in-depth understanding of the tampering detection research arena.

Submitted 13 January, 2024; originally announced January 2024.

arXiv:2401.06998 [pdf]

Towards Effective Image Forensics via A Novel Computationally Efficient Framework and A New Image Splice Dataset

Authors: Ankit Yadav, Dinesh Kumar Vishwakarma

Abstract: Splice detection models are the need of the hour since splice manipulations can be used to mislead, spread rumors and create disharmony in society. However, there is a severe lack of image splicing datasets, which restricts the capabilities of deep learning models to extract discriminative features without overfitting. This manuscript presents two-fold contributions toward splice detection. Firstly, a novel splice detection dataset is proposed having two variants. The two variants include spliced samples generated from code and through manual editing. Spliced images in both variants have corresponding binary masks to aid localization approaches. Secondly, a novel Spatio-Compression Lightweight Splice Detection Framework is proposed for accurate splice detection with minimum computational cost. The proposed dual-branch framework extracts discriminative spatial features from a lightweight spatial branch. It uses original resolution compression data to extract double compression artifacts from the second branch, thereby making it 'information preserving.' Several CNNs are tested in combination with the proposed framework on a composite dataset of images from the proposed dataset and the CASIA v2.0 dataset. The best model accuracy of 0.9382 is achieved and compared with similar state-of-the-art methods, demonstrating the superiority of the proposed framework.

Submitted 13 January, 2024; originally announced January 2024.

arXiv:2401.06995 [pdf]

A Visually Attentive Splice Localization Network with Multi-Domain Feature Extractor and Multi-Receptive Field Upsampler

Authors: Ankit Yadav, Dinesh Kumar Vishwakarma

Abstract: Image splice manipulation presents a severe challenge in today's society. With easy access to image manipulation tools, it is easier than ever to modify images that can mislead individuals, organizations or society. In this work, a novel, "Visually Attentive Splice Localization Network with Multi-Domain Feature Extractor and Multi-Receptive Field Upsampler" has been proposed. It contains a unique "visually attentive multi-domain feature extractor" (VA-MDFE) that extracts attentional features from the RGB, edge and depth domains. Next, a "visually attentive downsampler" (VA-DS) is responsible for fusing and downsampling the multi-domain features. Finally, a novel "visually attentive multi-receptive field upsampler" (VA-MRFU) module employs multiple receptive field-based convolutions to upsample attentional features by focussing on different information scales. Experimental results conducted on the public benchmark dataset CASIA v2.0 prove the potency of the proposed model. It comfortably beats the existing state-of-the-arts by achieving an IoU score of 0.851, pixel F1 score of 0.9195 and pixel AUC score of 0.8989.

Submitted 13 January, 2024; originally announced January 2024.

arXiv:2311.18676 [pdf, other]

DQSSA: A Quantum-Inspired Solution for Maximizing Influence in Online Social Networks (Student Abstract)

Authors: Aryaman Rao, Parth Singh, Dinesh Kumar Vishwakarma, Mukesh Prasad

Abstract: Influence Maximization is the task of selecting optimal nodes maximising the influence spread in social networks. This study proposes a Discretized Quantum-based Salp Swarm Algorithm (DQSSA) for optimizing influence diffusion in social networks. By discretizing meta-heuristic algorithms and infusing them with quantum-inspired enhancements, we address issues like premature convergence and low efficacy. The proposed method, guided by quantum principles, offers a promising solution for Influence Maximisation. Experiments on four real-world datasets reveal DQSSA's superior performance as compared to established cutting-edge algorithms.

Submitted 30 November, 2023; originally announced November 2023.

Comments: AAAI Conference on Artificial Intelligence 2024

arXiv:2301.05220 [pdf, other]

Adversarial Adaptation for French Named Entity Recognition

Authors: Arjun Choudhry, Inder Khatri, Pankaj Gupta, Aaryan Gupta, Maxime Nicol, Marie-Jean Meurs, Dinesh Kumar Vishwakarma

Abstract: Named Entity Recognition (NER) is the task of identifying and classifying named entities in large-scale texts into predefined classes. NER in French and other relatively limited-resource languages cannot always benefit from approaches proposed for languages like English due to a dearth of large, robust datasets. In this paper, we present our work that aims to mitigate the effects of this dearth of large, labeled datasets. We propose a Transformer-based NER approach for French, using adversarial adaptation to similar domain or general corpora to improve feature extraction and enable better generalization. Our approach allows learning better features using large-scale unlabeled corpora from the same domain or mixed domains to introduce more variations during training and reduce overfitting. Experimental results on three labeled datasets show that our adaptation framework outperforms the corresponding non-adaptive models for various combinations of Transformer models, source datasets, and target corpora. We also show that adversarial adaptation to large-scale unlabeled corpora can help mitigate the performance dip incurred on using Transformer models pre-trained on smaller corpora.

Submitted 12 January, 2023; originally announced January 2023.

Comments: Preprint version of short paper accepted for the ECIR 2023 conference

arXiv:2212.03692 [pdf, other]

Transformer-Based Named Entity Recognition for French Using Adversarial Adaptation to Similar Domain Corpora

Authors: Arjun Choudhry, Pankaj Gupta, Inder Khatri, Aaryan Gupta, Maxime Nicol, Marie-Jean Meurs, Dinesh Kumar Vishwakarma

Abstract: Named Entity Recognition (NER) involves the identification and classification of named entities in unstructured text into predefined classes. NER in languages with limited resources, like French, is still an open problem due to the lack of large, robust, labelled datasets. In this paper, we propose a transformer-based NER approach for French using adversarial adaptation to similar domain or general corpora for improved feature extraction and better generalization. We evaluate our approach on three labelled datasets and show that our adaptation framework outperforms the corresponding non-adaptive models for various combinations of transformer models, source datasets and target corpora.

Submitted 5 December, 2022; originally announced December 2022.

Comments: Author version of Student Abstract to appear in AAAI 2023 - Student Abstract and Poster Program

arXiv:2211.17200 [pdf, other]

CKS: A Community-based K-shell Decomposition Approach using Community Bridge Nodes for Influence Maximization

Authors: Inder Khatri, Aaryan Gupta, Arjun Choudhry, Aryan Tyagi, Dinesh Kumar Vishwakarma, Mukesh Prasad

Abstract: Social networks have enabled user-specific advertisements and recommendations on their platforms, which puts a significant focus on Influence Maximisation (IM) for target advertising and related tasks. The aim is to identify nodes in the network which can maximize the spread of information through a diffusion cascade. We propose a community structures-based approach that employs K-Shell algorithm with community structures to generate a score for the connections between seed nodes and communities. Further, our approach employs entropy within communities to ensure the proper spread of information within the communities. We validate our approach on four publicly available networks and show its superiority to four state-of-the-art approaches while still being relatively efficient.

Submitted 26 November, 2022; originally announced November 2022.

Comments: Accepted in the Student Abstract & Poster Presentation Track at AAAI 2023

arXiv:2211.17108 [pdf, other]

An Emotion-guided Approach to Domain Adaptive Fake News Detection using Adversarial Learning

Authors: Arkajyoti Chakraborty, Inder Khatri, Arjun Choudhry, Pankaj Gupta, Dinesh Kumar Vishwakarma, Mukesh Prasad

Abstract: Recent works on fake news detection have shown the efficacy of using emotions as a feature for improved performance. However, the cross-domain impact of emotion-guided features for fake news detection still remains an open problem. In this work, we propose an emotion-guided, domain-adaptive, multi-task approach for cross-domain fake news detection, proving the efficacy of emotion-guided models in cross-domain settings for various datasets.

Submitted 26 November, 2022; originally announced November 2022.

Comments: Accepted in the Student Abstract & Poster Presentation track at AAAI 2023. arXiv admin note: substantial text overlap with arXiv:2211.13718

arXiv:2211.13718 [pdf, other]

Emotion-guided Cross-domain Fake News Detection using Adversarial Domain Adaptation

Authors: Arjun Choudhry, Inder Khatri, Arkajyoti Chakraborty, Dinesh Kumar Vishwakarma, Mukesh Prasad

Abstract: Recent works on fake news detection have shown the efficacy of using emotions as a feature or emotions-based features for improved performance. However, the impact of these emotion-guided features for fake news detection in cross-domain settings, where we face the problem of domain shift, is still largely unexplored. In this work, we evaluate the impact of emotion-guided features for cross-domain fake news detection, and further propose an emotion-guided, domain-adaptive approach using adversarial learning. We prove the efficacy of emotion-guided models in cross-domain settings for various combinations of source and target datasets from FakeNewsAMT, Celeb, Politifact and Gossipcop datasets.

Submitted 24 November, 2022; originally announced November 2022.

Comments: Accepted as a Short Paper in the 19th International Conference on Natural Language Processing (ICON) 2022

arXiv:2211.12374 [pdf, other]

An Emotion-Aware Multi-Task Approach to Fake News and Rumour Detection using Transfer Learning

Authors: Arjun Choudhry, Inder Khatri, Minni Jain, Dinesh Kumar Vishwakarma

Abstract: Social networking sites, blogs, and online articles are instant sources of news for internet users globally. However, in the absence of strict regulations mandating the genuineness of every text on social media, it is probable that some of these texts are fake news or rumours. Their deceptive nature and ability to propagate instantly can have an adverse effect on society. This necessitates more effective detection of fake news and rumours on the web. In this work, we annotate four fake news detection and rumour detection datasets with their emotion class labels using transfer learning. We show the correlation between the legitimacy of a text and its intrinsic emotion for fake news and rumour detection, and prove that even within the same emotion class, fake and real news are often represented differently, which can be used for improved feature extraction. Based on this, we propose a multi-task framework for fake news and rumour detection, predicting both the emotion and legitimacy of the text. We train a variety of deep learning models in single-task and multi-task settings for a more comprehensive comparison. We further analyze the performance of our multi-task approach for fake news detection in cross-domain settings to verify its efficacy for better generalization across datasets, and to verify that emotions act as a domain-independent feature. Experimental results verify that our multi-task models consistently outperform their single-task counterparts in terms of accuracy, precision, recall, and F1 score, both for in-domain and cross-domain settings. We also qualitatively analyze the difference in performance in single-task and multi-task learning models.

Submitted 7 December, 2022; v1 submitted 22 November, 2022; originally announced November 2022.

Comments: Accepted in IEEE Transactions on Computational Social Systems; 18 pages, 5 figures

arXiv:2211.09683 [pdf, other]

Influence Maximization in Social Networks using Discretized Harris Hawks Optimization Algorithm and Neighbour Scout Strategy

Authors: Inder Khatri, Arjun Choudhry, Aryaman Rao, Aryan Tyagi, Dinesh Kumar Vishwakarma, Mukesh Prasad

Abstract: Influence Maximization (IM) is the task of determining k optimal influential nodes in a social network to maximize the influence spread using a propagation model. IM is a prominent problem for viral marketing, and helps significantly in social media advertising. However, developing effective algorithms with minimal time complexity for real-world social networks still remains a challenge. While traditional heuristic approaches have been applied for IM, they often result in minimal performance gains over the computationally expensive Greedy-based and Reverse Influence Sampling-based approaches. In this paper, we propose the discretization of the nature-inspired Harris Hawks Optimisation meta-heuristic algorithm using community structures for optimal selection of seed nodes for influence spread. In addition to Harris Hawks intelligence, we employ a neighbour scout strategy algorithm to avoid blindness and enhance the searching ability of the hawks. Further, we use a candidate nodes-based random population initialization approach, and these candidate nodes aid in accelerating the convergence process for the entire populace. We evaluate the efficacy of our proposed DHHO approach on six social networks using the Independent Cascade model for information diffusion. We observe that DHHO is comparable to or better than competing meta-heuristic approaches for Influence Maximization across five metrics, and performs noticeably better than competing heuristic approaches.

Submitted 17 November, 2022; originally announced November 2022.

Comments: 24 pages, 7 figures
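
Illustrative note: the abstract above evaluates seed sets under the Independent Cascade model. A minimal sketch of that diffusion model follows. It is not the authors' DHHO code; the function names, the toy graph, and the single uniform activation probability `p` are assumptions made for illustration.

```python
import random
from collections import deque


def simulate_ic(graph, seeds, p, rng):
    """Run one Independent Cascade simulation and return the activated nodes.

    graph: dict mapping each node to a list of out-neighbours (hypothetical format).
    p: uniform activation probability (an assumption; real studies may vary it per edge).
    """
    active = set(seeds)
    frontier = deque(seeds)
    while frontier:
        node = frontier.popleft()
        for neighbour in graph.get(node, ()):
            # A newly activated node gets exactly one chance per inactive neighbour.
            if neighbour not in active and rng.random() < p:
                active.add(neighbour)
                frontier.append(neighbour)
    return active


def expected_spread(graph, seeds, p, runs=1000, seed=0):
    """Monte Carlo estimate of the mean number of activated nodes."""
    rng = random.Random(seed)
    total = sum(len(simulate_ic(graph, seeds, p, rng)) for _ in range(runs))
    return total / runs


# Toy directed graph, purely illustrative.
toy_graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": [], "e": ["a"]}
```

With `p = 1.0` the cascade reaches every node reachable from the seeds, and with `p = 0.0` only the seeds stay active, which gives two easy sanity checks for any seed-selection method built on top of it.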

arXiv:2211.09657 [pdf, other]

A Spreader Ranking Algorithm for Extremely Low-budget Influence Maximization in Social Networks using Community Bridge Nodes

Authors: Aaryan Gupta, Inder Khatri, Arjun Choudhry, Pranav Chandhok, Dinesh Kumar Vishwakarma, Mukesh Prasad

Abstract: In recent years, social networking platforms have gained significant popularity among the masses for connecting with people and propagating one's thoughts and opinions. This has opened the door to user-specific advertisements and recommendations on these platforms, bringing along a significant focus on Influence Maximisation (IM) on social networks due to its wide applicability in target advertising, viral marketing, and personalized recommendations. The aim of IM is to identify certain nodes in the network which can help maximize the spread of certain information through a diffusion cascade. While several works have been proposed for IM, most were inefficient in exploiting community structures to their full extent. In this work, we propose a community structures-based approach, which employs a K-Shell algorithm in order to generate a score for the connections between seed nodes and communities for low-budget scenarios. Further, our approach employs entropy within communities to ensure the proper spread of information within the communities. We choose the Independent Cascade (IC) model to simulate information spread and evaluate it on four evaluation metrics. We validate our proposed approach on eight publicly available networks and find that it significantly outperforms the baseline approaches on these metrics, while still being relatively efficient.

Submitted 17 November, 2022; originally announced November 2022.

Comments: 21 pages, 7 figures
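
Illustrative note: the abstract above builds its scores on a K-Shell algorithm. The sketch below shows only the standard k-shell decomposition (iteratively peeling nodes of lowest degree), not the authors' community-based scoring or entropy step. The function name and the adjacency format are assumptions.

```python
def k_shell(adjacency):
    """Return {node: shell index} for an undirected graph.

    adjacency: dict mapping each node to a set of neighbours (hypothetical format);
    every edge must appear in both directions.
    """
    degree = {node: len(neigh) for node, neigh in adjacency.items()}
    remaining = set(adjacency)
    shell = {}
    while remaining:
        # The current shell index is the smallest residual degree still present.
        k = min(degree[n] for n in remaining)
        stack = [n for n in remaining if degree[n] <= k]
        while stack:
            node = stack.pop()
            if node not in remaining:
                continue  # already peeled via an earlier stack entry
            remaining.discard(node)
            shell[node] = k
            for neigh in adjacency[node]:
                if neigh in remaining:
                    degree[neigh] -= 1
                    if degree[neigh] <= k:
                        stack.append(neigh)
    return shell


# Triangle a-b-c with a pendant node d attached to a.
example = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
```

On the example graph the pendant node falls in shell 1 and the triangle in shell 2, which matches the intuition that higher shells mark the denser core of the network.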

arXiv:2112.08611 [pdf]

Clickbait in YouTube Prevention, Detection and Analysis of the Bait using Ensemble Learning

Authors: Peya Mowar, Mini Jain, Ruchika Goel, Dinesh Kumar Vishwakarma

Abstract: Unscrupulous content creators on YouTube employ deceptive techniques such as spam and clickbait to reach a broad audience and trick users into clicking on their videos to increase their advertisement revenue. Clickbait detection on YouTube requires an in-depth examination and analysis of the intricate relationship between the video content and the video descriptors (title and thumbnail). However, the current solutions are mostly centred around the study of video descriptors and other metadata such as likes, tags, comments, etc., and fail to utilize the video content, both video and audio. Therefore, we introduce a novel model to detect clickbait on YouTube that considers the relationship between video content and title or thumbnail. The proposed model consists of a stacking classifier framework composed of six base models (K Nearest Neighbours, Support Vector Machine, XGBoost, Naive Bayes, Logistic Regression, and Multilayer Perceptron) and a meta classifier. The developed clickbait detection model achieved a high accuracy of 92.89% for the novel BollyBAIT dataset and 95.38% for the Misleading Video Dataset. Additionally, the stated classifier does not use meta features or other statistics dependent on user interaction with the video (the number of likes, followers, or comments) for classification, and thus can be used to detect potential clickbait videos before they are uploaded, thereby preventing the nuisance of clickbait altogether and improving the users' streaming experience.

Submitted 15 December, 2021; originally announced December 2021.

Comments: 26 pages, 16 figures

arXiv:2109.13476 [pdf]

Fake News Detection using Semi-Supervised Graph Convolutional Network

Authors: Priyanka Meel, Dinesh Kumar Vishwakarma

Abstract: Social media has become the central way for people to obtain and utilise news, due to the speed and low cost of data distribution. However, these features also make social media platforms a root cause of fake news distribution, causing adverse consequences for both people and culture. Hence, detecting fake news has become a significant research interest for bringing feasible real-time solutions to the problem. Most current techniques for fake news detection are supervised and require considerable time and effort to build a reliably annotated dataset. The proposed framework concentrates on the text-based detection of fake news items while considering that only a limited number of labels are available. Graphs are used extensively in real-world problems because they structure relationships easily. Deep neural networks produce strong results on graph classification tasks. The Graph Convolutional Network (GCN) is a deep learning paradigm that operates on graphs. Because our proposed framework deals with a limited amount of labelled data, we adopt a semi-supervised learning method and propose a semi-supervised fake news detection technique based on GCN. The recommended architecture comprises three basic components: collecting word embeddings from the news articles in datasets utilising GloVe, building a similarity graph using Word Mover's Distance (WMD), and finally applying a GCN for binary classification of news articles in a semi-supervised paradigm. The implemented technique is validated on three different datasets by varying the volume of labelled data, achieving a highest accuracy of 95.27% on the Real or Fake dataset. Comparison with other contemporary techniques also reinforced the superiority of the proposed framework.

Submitted 28 September, 2021; originally announced September 2021.

Comments: 25 pages, 7 figures

arXiv:2109.13063 [pdf]

An Automated Multi-Web Platform Voting Framework to Predict Misleading Information Proliferated during COVID-19 Outbreak using Ensemble Method

Authors: Deepika Varshney, Dinesh Kumar Vishwakarma

Abstract: Spreading of misleading information on social web platforms has fuelled huge panic and confusion among the public regarding the Corona disease, the detection of which is of paramount importance. To address this issue, in this paper, we have developed an automated system that can collect and validate facts from multiple web platforms to decide the credibility of the content. To identify the credibility of the posted claim, probable instances/clues (titles) of news information are first gathered from various web platforms. Later, the crucial set of features is retrieved and fed into the ensemble-based machine learning model to classify the news as misleading or real. The four sets of features based on the content, linguistics/semantic cues, similarity, and sentiments gathered from web platforms and voting are applied to validate the news. Finally, the combined voting decides the support given to a specific claim. In addition to the validation part, a unique source platform is designed for collecting data/facts from three web platforms (Twitter, Facebook, Google) based on certain queries/words. This unique platform can also help researchers build datasets and gather useful/efficient clues from various web platforms. It has been observed that our proposed intelligent strategy gives promising results and is quite effective in predicting misleading information. The proposed work provides practical implications for the policy makers and health practitioners that could be useful in protecting the world from misleading information proliferation during this pandemic.

Submitted 19 September, 2021; originally announced September 2021.

Comments: 22 pages, 6 figures

arXiv:2109.12547 [pdf]

Multi-modal Fusion using Fine-tuned Self-attention and Transfer Learning for Veracity Analysis of Web Information

Authors: Priyanka Meel, Dinesh Kumar Vishwakarma

Abstract: The nuisance of misinformation and fake news has escalated manyfold since the advent of online social networks. Human consciousness and decision-making capabilities are negatively influenced by manipulated, fabricated, biased or unverified news posts. Therefore, there is a high demand for designing veracity analysis systems to detect fake information content in multiple data modalities. In an attempt to find a sophisticated solution to this critical issue, we proposed an architecture to consider both the textual and visual attributes of the data. After the data pre-processing is done, text and image features are extracted from the training data using separate deep learning models. Feature extraction from text is done using BERT and ALBERT language models that leverage the benefits of bidirectional training of transformers using a deep self-attention mechanism. The Inception-ResNet-v2 deep neural network model is employed for image data to perform the task. The proposed framework focused on two independent multi-modal fusion architectures of BERT and Inception-ResNet-v2 as well as ALBERT and Inception-ResNet-v2. Multi-modal fusion of textual and visual branches is extensively experimented and analysed using concatenation of feature vectors and weighted averaging of probabilities, named Early Fusion and Late Fusion respectively. Three publicly available, broadly accepted datasets (All Data, Weibo, and MediaEval 2016), which comprise English news articles, Chinese news articles, and tweets respectively, are used so that our designed framework's outcomes can be properly tested and compared with previous notable work in the domain.

Submitted 26 September, 2021; originally announced September 2021.

Comments: 31 pages, 12 figures
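
Illustrative note: the abstract above names two fusion strategies, concatenation of feature vectors (Early Fusion) and weighted averaging of probabilities (Late Fusion). The sketch below shows both operations on plain Python lists. The function names, the `text_weight` parameter, and the toy numbers are assumptions; the real system applies these steps to BERT/ALBERT and Inception-ResNet-v2 outputs.

```python
def early_fusion(text_features, image_features):
    """Early fusion: concatenate the two feature vectors before classification."""
    return list(text_features) + list(image_features)


def late_fusion(text_probs, image_probs, text_weight=0.5):
    """Late fusion: weighted average of the per-class probabilities of each branch.

    text_weight is a hypothetical parameter in [0, 1]; the image branch gets the rest.
    """
    if len(text_probs) != len(image_probs):
        raise ValueError("both branches must score the same classes")
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must lie in [0, 1]")
    image_weight = 1.0 - text_weight
    return [
        text_weight * t + image_weight * i
        for t, i in zip(text_probs, image_probs)
    ]


# Toy values only: a 3-dimensional text vector and a 2-dimensional image vector.
fused_features = early_fusion([0.1, 0.2, 0.3], [0.9, 0.8])

# Toy class probabilities for (fake, real) from each branch.
fused_probs = late_fusion([0.8, 0.2], [0.4, 0.6], text_weight=0.5)
```

Early fusion lets a single classifier learn cross-modal interactions, while late fusion keeps the branches independent and only needs their output probabilities, so the two are often compared side by side as in the abstract.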

arXiv:2109.09929 [pdf]

A Unified Approach of Detecting Misleading Images via Tracing its Instances on Web and Analysing its Past Context for the Verification of Content

Authors: Deepika Varshney, Dinesh Kumar Vishwakarma

Abstract: The verification of multimedia content over social media is one of the challenging and crucial issues in the current scenario and is gaining prominence in an age where user-generated content and online social web platforms are the leading sources in shaping and propagating news stories. As these sources allow users to share their opinions without restriction, opportunistic users often post misleading/unreliable content on social media such as Twitter, Facebook, etc. At present, to lure users towards the news story, the text is often attached with some multimedia content (images/videos/audios). Verifying these contents to maintain the credibility and reliability of social media information is of paramount importance. Motivated by this, we proposed a generalized system that supports the automatic classification of images into credible or misleading. In this paper, we investigated machine learning-based as well as deep learning-based approaches utilized to verify misleading multimedia content, where the available image traces are used to identify the credibility of the content. The experiment is performed on the real-world dataset (Media-eval-2015 dataset) collected from Twitter. It also demonstrates the efficiency of our proposed approach and features using both machine learning and deep learning (Bi-directional LSTM) models. The experimental results reveal that the Microsoft Bing image search engine is quite effective in retrieving titles and, in our study, performs better than the Google image search engine. They also show that gathering clues from attached multimedia content (image) is more effective than detecting only posted content-based features.

Submitted 20 September, 2021; originally announced September 2021.

Comments: 22 pages, 8 figures

arXiv:2109.06488 [pdf]

Multilevel profiling of situation and dialogue-based deep networks for movie genre classification using movie trailers

Authors: Dinesh Kumar Vishwakarma, Mayank Jindal, Ayush Mittal, Aditya Sharma

Abstract: Automated movie genre classification has emerged as an active and essential area of research and exploration. Short duration movie trailers provide useful insights about the movie as video content consists of the cognitive and the affective level features. Previous approaches were focused upon either cognitive or affective content analysis. In this paper, we propose a novel multi-modal (situation, dialogue, and metadata-based) movie genre classification framework that takes both cognition and affect-based features into consideration. The pre-features fusion-based framework takes into account situation-based features from regular snapshots of a trailer (nouns and verbs that provide a useful affect-based mapping to the corresponding genres), dialogue (speech) based features from audio, and metadata, which together provide the relevant information for cognitive and affect-based video analysis. We also develop the English movie trailer dataset (EMTD), which contains 2000 Hollywood movie trailers belonging to five popular genres: Action, Romance, Comedy, Horror, and Science Fiction, and perform cross-validation on the standard LMTD-9 dataset for validating the proposed framework. The results demonstrate that the proposed methodology for movie genre classification has performed excellently as depicted by the F1 scores, precision, recall, and area under the precision-recall curves.

Submitted 14 September, 2021; originally announced September 2021.

Comments: 21 pages, 7 figures

arXiv:2105.05708 [pdf, other]

Deep and Shallow Covariance Feature Quantization for 3D Facial Expression Recognition

Authors: Walid Hariri, Nadir Farah, Dinesh Kumar Vishwakarma

Abstract: Facial expression recognition (FER) of 3D face scans has received a significant amount of attention in recent years. Most of the facial expression recognition methods have been proposed using mainly 2D images. These methods suffer from several issues like illumination changes and pose variations. Moreover, 2D mapping from 3D images may lack some geometric and topological characteristics of the face. Hence, to overcome this problem, a multi-modal 2D + 3D feature-based method is proposed. We extract shallow features from the 3D images, and deep features using Convolutional Neural Networks (CNN) from the transformed 2D images. Combining these features into a compact representation uses covariance matrices as descriptors for both features instead of single-handedly descriptors. A covariance matrix learning is used as a manifold layer to reduce the deep covariance matrices size and enhance their discrimination power while preserving their manifold structure. We then use the Bag-of-Features (BoF) paradigm to quantize the covariance matrices after flattening. Accordingly, we obtained two codebooks using shallow and deep features. The global codebook is then used to feed an SVM classifier. High classification performances have been achieved on the BU-3DFE and Bosphorus datasets compared to the state-of-the-art methods.

Submitted 12 May, 2021; originally announced May 2021.

arXiv:2012.13318 [pdf]

Person Re-Identification using Deep Learning Networks: A Systematic Review

Authors: Ankit Yadav, Dinesh Kumar Vishwakarma

Abstract: Person re-identification has received a lot of attention from the research community in recent times. Due to its vital role in security-based applications, person re-identification lies at the heart of research relevant to tracking robberies, preventing terrorist attacks and other security-critical events. While the last decade has seen tremendous growth in re-id approaches, very little review literature exists to comprehend and summarize this progress. This review deals with the latest state-of-the-art deep learning based approaches for person re-identification. While the few existing re-id review works have analysed re-id techniques from a singular aspect, this review evaluates numerous re-id techniques from multiple deep learning aspects such as deep architecture types, common Re-Id challenges (variation in pose, lighting, view, scale, partial or complete occlusion, background clutter), multi-modal Re-Id, cross-domain Re-Id challenges, metric learning approaches and video Re-Id contributions. This review also includes several re-id benchmarks collected over the years, describing their characteristics, specifications and top re-id results obtained on them. The inclusion of the latest deep re-id works makes this a significant contribution to the re-id literature. Lastly, the conclusion and future directions are included.

Submitted 24 December, 2020; originally announced December 2020.

Comments: 34 pages, 15 figures

arXiv:2012.08256 [pdf]

doi 10.1145/3517139

A Deep Multi-Level Attentive network for Multimodal Sentiment Analysis

Authors: Ashima Yadav, Dinesh Kumar Vishwakarma

Abstract: Multimodal sentiment analysis has attracted increasing attention with broad application prospects. Existing methods focus on a single modality, which fails to capture social media content that spans multiple modalities. Moreover, in multi-modal learning, most of the works have focused on simply combining the two modalities, without exploring the complicated correlations between them. This resulted in unsatisfactory performance for multimodal sentiment classification. Motivated by the status quo, we propose a Deep Multi-Level Attentive network, which exploits the correlation between image and text modalities to improve multimodal learning. Specifically, we generate the bi-attentive visual map along the spatial and channel dimensions to magnify the CNN's representation power. Then we model the correlation between the image regions and semantics of the word by extracting the textual features related to the bi-attentive visual features by applying semantic attention. Finally, self-attention is employed to automatically fetch the sentiment-rich multimodal features for the classification. We conduct extensive evaluations on four real-world datasets, namely, MVSA-Single, MVSA-Multiple, Flickr, and Getty Images, which verify the superiority of our method.

Submitted 15 December, 2020; originally announced December 2020.

Comments: 11 pages, 7 figures

Journal ref: ACM Transactions on Multimedia Computing, Communications, and Applications, 2022

arXiv:2011.10358 [pdf]

A Deep Language-independent Network to analyze the impact of COVID-19 on the World via Sentiment Analysis

Authors: Ashima Yadav, Dinesh Kumar Vishwakarma

Abstract: Towards the end of 2019, Wuhan experienced an outbreak of novel coronavirus, which soon spread all over the world, resulting in a deadly pandemic that infected millions of people around the globe. The government and public health agencies followed many strategies to counter the fatal virus. However, the virus severely affected the social and economic lives of the people. In this paper, we extract and study the opinion of people from the top five worst affected countries by the virus, namely USA, Brazil, India, Russia, and South Africa. We propose a deep language-independent Multilevel Attention-based Conv-BiGRU network (MACBiG-Net), which includes an embedding layer, word-level encoded attention, and a sentence-level encoded attention mechanism to extract the positive, negative, and neutral sentiments. The embedding layer encodes the sentence sequence into a real-valued vector. The word-level and sentence-level encoding is performed by a 1D Conv-BiGRU based mechanism, followed by word-level and sentence-level attention, respectively. We further develop a COVID-19 Sentiment Dataset by crawling the tweets from Twitter. Extensive experiments on our proposed dataset demonstrate the effectiveness of the proposed MACBiG-Net. Also, attention-weights visualization and in-depth results analysis show that the proposed network has effectively captured the sentiments of the people.

Submitted 20 November, 2020; originally announced November 2020.

arXiv:1912.03632 [pdf]

doi 10.1109/TIP.2020.2965299

View-invariant Deep Architecture for Human Action Recognition using late fusion

Authors: Chhavi Dhiman, Dinesh Kumar Vishwakarma

Abstract: Human action recognition for unknown views is a challenging task. We propose a view-invariant deep human action recognition framework, which is a novel integration of two important action cues: motion and shape temporal dynamics (STD). The motion stream encapsulates the motion content of action as RGB Dynamic Images (RGB-DIs) which are processed by the fine-tuned InceptionV3 model. The STD stream learns long-term view-invariant shape dynamics of action using human pose model (HPM) based view-invariant features mined from structural similarity index matrix (SSIM) based key depth human pose frames. To predict the score of the test sample, three types of late fusion (maximum, average and product) techniques are applied on individual stream scores. To validate the performance of the proposed novel framework, the experiments are performed using both cross-subject and cross-view validation schemes on three publicly available benchmarks: the NUCLA multi-view dataset, the UWA3D-II Activity dataset and the NTU RGB-D Activity dataset. Our algorithm significantly outperforms existing state-of-the-art methods, as reported in terms of accuracy, receiver operating characteristic (ROC) curve and area under the curve (AUC).

Submitted 8 December, 2019; originally announced December 2019.

Comments: 10 pages, 7 figures

Report number: 8960517

Journal ref: 2019
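
The three late-fusion rules named in this abstract (maximum, average, and product) can be sketched as follows. This is a minimal illustration that assumes each stream outputs one score per action class. The function name `late_fusion` and the score values are hypothetical and are not taken from the paper.

```python
import math

def late_fusion(stream_scores, rule):
    """Fuse per-class scores from several streams with one of three rules."""
    n_classes = len(stream_scores[0])
    # Regroup so that per_class[c] holds every stream's score for class c.
    per_class = [[s[c] for s in stream_scores] for c in range(n_classes)]
    if rule == "max":
        return [max(v) for v in per_class]
    if rule == "average":
        return [sum(v) / len(v) for v in per_class]
    if rule == "product":
        return [math.prod(v) for v in per_class]
    raise ValueError(f"unknown fusion rule: {rule}")

def predict(stream_scores, rule):
    """Return the index of the class with the highest fused score."""
    fused = late_fusion(stream_scores, rule)
    return max(range(len(fused)), key=fused.__getitem__)

# Hypothetical scores for three classes from the motion and STD streams.
motion_scores = [0.6, 0.3, 0.1]
std_scores = [0.2, 0.7, 0.1]
for rule in ("max", "average", "product"):
    print(rule, predict([motion_scores, std_scores], rule))
```

Late fusion keeps the two streams independent until the final decision, so either stream can be retrained or replaced without touching the other.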

arXiv:1912.00576 [pdf]

Skeleton based Activity Recognition by Fusing Part-wise Spatio-temporal and Attention Driven Residues

Authors: Chhavi Dhiman, Dinesh Kumar Vishwakarma, Paras Aggarwal

Abstract: There exists a wide range of intra-class variation within the same actions and, at the same time, inter-class similarity among actions, which makes action recognition in videos very challenging. In this paper, we present a novel skeleton-based part-wise Spatiotemporal CNN RIAC Network-based 3D human action recognition framework to visualise the action dynamics in a part-wise manner and utilise each part for action recognition by applying a weighted late fusion mechanism. Part-wise skeleton-based motion dynamics help to highlight local features of the skeleton, which is achieved by partitioning the complete skeleton into five parts: Head to Spine, Left Leg, Right Leg, Left Hand, and Right Hand. The RIAFNet architecture is greatly inspired by the InceptionV4 architecture, which unified the ResNet and Inception based spatio-temporal feature representation concepts and achieved the highest top-1 accuracy to date. To extract and learn salient features for action recognition, attention-driven residues are used, which enhance the performance of residual components for effective 3D skeleton-based spatio-temporal action representation. The robustness of the proposed framework is evaluated by performing extensive experiments on three challenging datasets, UT Kinect Action 3D, Florence 3D Action, and MSR Daily Action3D, which consistently demonstrate the superiority of our method.

Submitted 1 December, 2019; originally announced December 2019.

Comments: 20 pages, 9 figures
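
The weighted late fusion over the five skeleton parts described in this abstract might look roughly like the sketch below. The part names come from the abstract. The weights, the scores, and the function name `weighted_late_fusion` are hypothetical, since the abstract does not say how the weights are chosen.

```python
def weighted_late_fusion(part_scores, part_weights):
    """Combine per-part class scores as a normalized weighted sum."""
    total_weight = sum(part_weights.values())
    n_classes = len(next(iter(part_scores.values())))
    fused = [0.0] * n_classes
    for part, scores in part_scores.items():
        w = part_weights[part] / total_weight  # normalize so weights sum to 1
        for c, score in enumerate(scores):
            fused[c] += w * score
    return fused

# Hypothetical two-class scores from the five part-wise networks.
part_scores = {
    "head_to_spine": [0.5, 0.5],
    "left_leg": [0.4, 0.6],
    "right_leg": [0.4, 0.6],
    "left_hand": [0.9, 0.1],
    "right_hand": [0.8, 0.2],
}
# Hypothetical weights; equal weights reduce this to plain averaging.
part_weights = {part: 1.0 for part in part_scores}
fused_scores = weighted_late_fusion(part_scores, part_weights)
```

Raising the weight of a part lets the parts that carry most of an action's motion dominate the final decision.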

arXiv:1903.04090 [pdf]

A Hybrid Framework for Action Recognition in Low-Quality Video Sequences

Authors: Tej Singh, Dinesh Kumar Vishwakarma

Abstract: Vision-based activity recognition is essential for security, monitoring, and surveillance applications. Further, real-time analysis must cope with low-quality video that contains less information about the surroundings due to poor illumination and occlusions. Therefore, a more robust and integrated model is needed for low-quality and night security operations. In this context, we propose a hybrid model for illumination-invariant human activity recognition based on sub-image histogram equalization enhancement and k-key pose human silhouettes. This feature vector gives good average recognition accuracy on three low-exposure video sequence subsets of the original action video datasets. Finally, the performance of the proposed approach is tested on three manually downgraded low-quality datasets: Weizmann action, KTH, and Ballet Movement. The model outperforms existing techniques on low-exposure videos and achieves classification accuracy comparable to similar state-of-the-art methods.

Submitted 10 March, 2019; originally announced March 2019.

Comments: 13 pages, 9 Figures
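
For readers unfamiliar with the enhancement step named in this abstract, the sketch below shows ordinary global histogram equalization on a small grayscale image. It is a simplification: the paper uses a sub-image variant whose details are not given in the abstract, so this only shows the basic operation that such variants build on. The function name `equalize_histogram` and the pixel values are hypothetical.

```python
def equalize_histogram(image, levels=256):
    """Global histogram equalization of a 2D list of integer gray levels."""
    pixels = [p for row in image for p in row]
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    # Cumulative distribution of gray levels.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    n = len(pixels)
    if n == cdf_min:
        # Constant image: nothing to stretch.
        return [row[:] for row in image]
    # Look-up table mapping each gray level to its equalized value.
    lut = [round(max(c - cdf_min, 0) / (n - cdf_min) * (levels - 1)) for c in cdf]
    return [[lut[p] for p in row] for row in image]

# A dark, low-contrast toy frame (hypothetical values in 0..255).
dark_frame = [[10, 10, 20], [20, 30, 30]]
enhanced = equalize_histogram(dark_frame)
```

The darkest level maps to 0 and the brightest to 255, which stretches the narrow intensity range of an underexposed frame across the full scale.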

arXiv:1805.07720 [pdf]

doi 10.1109/TMSCS.2018.2870592

A Deep Structure of Person Re-Identification using Multi-Level Gaussian Models

Authors: Dinesh Kumar Vishwakarma, Sakshi Upadhyay

Abstract: Person re-identification is widely used in forensic, security, and surveillance systems, but it is a challenging task in real-life scenarios. Hence, in this work, a new feature descriptor model is proposed using a multilayer framework of Gaussian distribution models on pixel features, which include color moments, color space values, and Schmid filter responses. An image of a person usually consists of distinct body regions, usually with differentiable clothing followed by local colors and texture patterns. Thus, the image is evaluated locally by dividing it into overlapping regions. Each region is further fragmented into a set of local Gaussians on small patches. A global Gaussian encodes these local Gaussians for each region, creating a multi-level structure. Hence, the global picture of a person is described by the local-level information present in it, which is often ignored. We also analyze the efficiency of earlier metric learning methods on this descriptor. The performance of the descriptor is evaluated on four publicly available challenging datasets, and the highest accuracies achieved on these datasets are compared with similar state-of-the-art methods, demonstrating superior performance.

Submitted 20 May, 2018; originally announced May 2018.

Comments: 9 pages

Report number: 8469037

Journal ref: IEEE Transactions on Multi-Scale Computing Systems 4 (2018) 513 - 521

arXiv:1611.06683 [pdf]

Covariate conscious approach for Gait recognition based upon Zernike moment invariants

Authors: Himanshu Aggarwal, Dinesh K. Vishwakarma

Abstract: Gait recognition, i.e., identification of an individual from his/her walking pattern, is an emerging field. While existing gait recognition techniques perform satisfactorily in normal walking conditions, their performance tends to suffer drastically with variations in clothing and carrying conditions. In this work, we propose a novel covariate-cognizant framework to deal with the presence of such covariates. We describe gait motion by forming a single 2D spatio-temporal template from a video sequence, called the Average Energy Silhouette image (AESI). Zernike moment invariants (ZMIs) are then computed to screen the parts of the AESI affected by covariates. Following this, features are extracted using Spatial Distribution of Oriented Gradients (SDOGs) and novel Mean of Directional Pixels (MDPs) methods. The obtained features are fused together to form the final feature set. Experimental evaluation of the proposed framework against recently published gait recognition approaches on three publicly available datasets, i.e., CASIA dataset B, OU-ISIR Treadmill dataset B, and the USF Human-ID challenge dataset, demonstrates its superior performance.

Submitted 21 November, 2016; originally announced November 2016.

Comments: 11 pages
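
The single 2D template described in this abstract can be pictured as a per-pixel average of silhouettes over a gait sequence. The sketch below assumes aligned, equally sized binary silhouettes and plain averaging. The abstract does not spell out the exact AESI construction, so this is an assumption, and the function name `average_energy_silhouette` and the toy frames are hypothetical.

```python
def average_energy_silhouette(silhouettes):
    """Average aligned binary silhouettes (lists of 0/1 rows) pixel by pixel."""
    n_frames = len(silhouettes)
    height = len(silhouettes[0])
    width = len(silhouettes[0][0])
    return [
        [sum(frame[r][c] for frame in silhouettes) / n_frames for c in range(width)]
        for r in range(height)
    ]

# Three toy 2x3 silhouette frames (hypothetical values).
frames = [
    [[0, 1, 0], [1, 1, 0]],
    [[0, 1, 0], [0, 1, 1]],
    [[0, 1, 0], [1, 1, 0]],
]
template = average_energy_silhouette(frames)
```

Pixels that are foreground in every frame (the static body) come out as 1.0, while pixels covered only part of the time (swinging limbs) take fractional values, so one image summarizes both shape and motion.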

Showing 1–35 of 35 results for author: Vishwakarma, D K