Search | arXiv e-print repository

doi 10.1109/ICOSEC61587.2024.10722438

Intelligent DoS and DDoS Detection: A Hybrid GRU-NTM Approach to Network Security

Authors: Caroline Panggabean, Chandrasekar Venkatachalam, Priyanka Shah, Sincy John, Renuka Devi P, Shanmugavalli Venkatachalam

Abstract: Detecting Denial of Service (DoS) and Distributed Denial of Service (DDoS) attacks remains a critical challenge in cybersecurity. This research introduces a hybrid deep learning model combining Gated Recurrent Units (GRUs) and a Neural Turing Machine (NTM) for enhanced intrusion detection. Trained on the UNSW-NB15 and BoT-IoT datasets, the model employs GRU layers for sequential data processing and an NTM for long-term pattern recognition. The proposed approach achieves 99% accuracy in distinguishing between normal, DoS, and DDoS traffic. These findings offer promising advancements in real-time threat detection and contribute to improved network security across various domains.

Submitted 10 April, 2025; originally announced April 2025.

Comments: Accepted at the 2024 5th International Conference on Smart Electronics and Communication (ICOSEC). This is the accepted manuscript version. The final version is published by IEEE at https://doi.org/10.1109/ICOSEC61587.2024.10722438

arXiv:2503.04795 [pdf, other]

Cyber for AI at SemEval-2025 Task 4: Forgotten but Not Lost: The Balancing Act of Selective Unlearning in Large Language Models

Authors: Dinesh Srivasthav P, Bala Mallikarjunarao Garlapati

Abstract: Large Language Models (LLMs) face significant challenges in maintaining privacy, ethics, and compliance when sensitive or obsolete data must be selectively removed. Retraining these models from scratch is computationally infeasible, necessitating efficient alternatives. As part of the SemEval 2025 Task 4, this work focuses on the application of selective unlearning in LLMs to address this challenge. In this paper, we present our experiments and findings, primarily leveraging global weight modification to achieve an equilibrium between effectiveness of unlearning, knowledge retention, and target model's post-unlearning utility. We also detail the task-specific evaluation mechanism, results, and challenges. Our algorithms have achieved an aggregate score of 0.409 and 0.389 on the test set for 7B and 1B target models, respectively, demonstrating promising results in verifiable LLM unlearning.

Submitted 2 March, 2025; originally announced March 2025.

arXiv:2502.05032 [pdf, other]

News about Global North considered Truthful! The Geo-political Veracity Gradient in Global South News

Authors: Sujit Mandava, Deepak P, Sahely Bhadra

Abstract: While there has been much research into developing AI techniques for fake news detection aided by various benchmark datasets, it has often been pointed out that fake news in different geo-political regions traces different contours. In this work we uncover, through analytical arguments and empirical evidence, the existence of an important characteristic in news originating from the Global South viz., the geo-political veracity gradient. In particular, we show that Global South news about topics from Global North -- such as news from an Indian news agency on US elections -- tends to be less likely to be fake. Observing through the prism of the political economy of fake news creation, we posit that this pattern could be due to the relative lack of monetarily aligned incentives in producing fake news about a different region than the regional remit of the audience. We provide empirical evidence for this from benchmark datasets. We also empirically analyze the consequences of this effect in applying AI-based fake news detection models for fake news AI trained on one region within another regional context. We locate our work within emerging critical scholarship on geo-political biases within AI in general, particularly with AI usage in fake news identification; we hope our insight into the geo-political veracity gradient could help steer fake news AI scholarship towards positively impacting Global South societies.

Submitted 7 February, 2025; originally announced February 2025.

arXiv:2501.00757 [pdf, ps, other]

Beyond Static Datasets: A Behavior-Driven Entity-Specific Simulation to Overcome Data Scarcity and Train Effective Crypto Anti-Money Laundering Models

Authors: Dinesh Srivasthav P, Manoj Apte

Abstract: For different factors/reasons, ranging from inherent characteristics and features providing decentralization, enhanced privacy, ease of transactions, etc., to implied external hardships in enforcing regulations, contradictions in data sharing policies, etc., cryptocurrencies have been severely abused for carrying out numerous malicious and illicit activities including money laundering, darknet transactions, scams, terrorism financing, and arms trading. However, money laundering is a key crime to be mitigated to also suspend the movement of funds from other illicit activities. Billions of dollars are being laundered annually. It is getting extremely difficult to identify money laundering in crypto transactions owing to the many layering strategies available today and the rapidly evolving tactics and patterns that launderers use to obfuscate the illicit funds. Many detection methods have been proposed, ranging from naive approaches involving complete manual investigation to machine learning models. However, there are very limited datasets available for effectively training machine learning models. Also, the existing datasets are static and class-imbalanced, posing challenges for scalability and suitability to specific scenarios, due to lack of customization to varying requirements. This has been a persistent challenge in literature.
In this paper, we propose behavior embedded entity-specific money laundering-like transaction simulation that helps in generating various transaction types and models the transactions embedding the behavior of several entities observed in this space. The paper discusses the design and architecture of the simulator, a custom dataset we generated using the simulator, and the performance of models trained on this synthetic data in detecting real addresses involved in money laundering.

Submitted 1 January, 2025; originally announced January 2025.

arXiv:2411.08148 [pdf, other]

Adaptive Meta-Learning for Robust Deepfake Detection: A Multi-Agent Framework to Data Drift and Model Generalization

Authors: Dinesh Srivasthav P, Badri Narayan Subudhi

Abstract: Pioneering advancements in artificial intelligence, especially in genAI, have enabled significant possibilities for content creation, but also led to widespread misinformation and false content. The growing sophistication and realism of deepfakes is raising concerns about privacy invasion and identity theft, and has societal and business impacts, including reputational damage and financial loss. Many deepfake detectors have been developed to tackle this problem. Nevertheless, as with every AI model, deepfake detectors suffer from a lack of generalization to unseen scenarios and cross-domain deepfakes. Besides, adversarial robustness is another critical challenge, as detectors drastically underperform under the slightest imperceptible change. Most state-of-the-art detectors are trained on static datasets and lack the ability to adapt to emerging deepfake attack trends. These three crucial challenges, though of paramount importance for reliability in practice, particularly in the deepfake domain, are also problems for any other AI application. This paper proposes an adversarial meta-learning algorithm using task-specific adaptive sample synthesis and consistency regularization, in a refinement phase. By focussing on the classifier's strengths and weaknesses, it boosts both robustness and generalization of the model. Additionally, the paper introduces a hierarchical multi-agent retrieval-augmented generation workflow with a sample synthesis module to dynamically adapt the model to new data trends by generating custom deepfake samples.
The paper further presents a framework integrating the meta-learning algorithm with the hierarchical multi-agent workflow, offering a holistic solution for enhancing generalization, robustness, and adaptability. Experimental results demonstrate the model's consistent performance across various datasets, outperforming the models in comparison.

Submitted 12 November, 2024; originally announced November 2024.

arXiv:2404.16530 [pdf, other]

On the Political Economy of Link-based Web Search

Authors: Deepak P, James Steinhoff, Stanley Simoes

Abstract: Web search engines arguably form the most popular data-driven systems in contemporary society. They wield a considerable power by functioning as gatekeepers of the Web, with most user journeys on the Web beginning with them. Starting from the late 1990s, search engines have been dominated by the paradigm of link-based web search. In this paper, we critically analyze the political economy of the paradigm of link-based web search, drawing upon insights and methodologies from critical political economy. We draw several insights on how link-based web search has led to phenomena that favor capital through long-term structural changes on the Web, and how it has led to accentuating unpaid digital labor and ecologically unsustainable practices, among several others. We show how contemporary observations on the degrading quality of link-based web search can be traced back to the internal contradictions with the paradigm, and how such socio-technical phenomena may lead to a disutility of the link-based web search model. Our contribution is primarily on enhancing the understanding of the political economy of link-based web search, and laying bare the phenomena at work, and implicitly catalyze the search for alternative models.

Submitted 25 April, 2024; originally announced April 2024.

arXiv:2403.17419 [pdf, ps, other]

doi 10.1007/s00146-024-01899-y

AI Safety: Necessary, but insufficient and possibly problematic

Authors: Deepak P

Abstract: This article critically examines the recent hype around AI safety. We first start with noting the nature of the AI safety hype as being dominated by governments and corporations, and contrast it with other avenues within AI research on advancing social good. We consider what 'AI safety' actually means, and outline the dominant concepts that the digital footprint of AI safety aligns with. We posit that AI safety has a nuanced and uneasy relationship with transparency and other allied notions associated with societal good, indicating that it is an insufficient notion if the goal is that of societal good in a broad sense. We note that the AI safety debate has already influenced some regulatory efforts in AI, perhaps in not so desirable directions. We also share our concerns on how AI safety may normalize AI that advances structural harm through providing exploitative and harmful AI with a veneer of safety.

Submitted 26 March, 2024; originally announced March 2024.

Comments: AI & Soc (2024)

arXiv:2307.16157 [pdf, other]

A Simple Robot Selection Criteria After Path Planning Using Wavefront Algorithm

Authors: Rajashekhar V S, Dhaya C, Dinakar Raj C K, Dharshan P, Mukesh Kumar S, Harish B, Ajith R, Kamaleshwaran K

Abstract: In this work we present a technique to select the best robot for accomplishing a task, assuming that the map of the environment is known in advance. To do so, the capabilities of the robots are listed and the environments where they can be used are mapped. Five robots are included for doing the tasks: the robotic lizard, half-humanoid, robotic snake, biped and quadruped. Each of these robots is capable of performing certain activities and also has its own limitations. The process of considering the robot performances and acting based on their limitations is the focus of this work. The wavefront algorithm is used to find the nature of the terrain. Based on the terrain, a suitable robot is selected from the list of five robots by the wavefront algorithm. Using this robot the mission is accomplished.
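
For readers unfamiliar with the wavefront algorithm named in this abstract, the following is a minimal sketch of its textbook form on an occupancy grid: a breadth-first expansion from the goal that labels each free cell with its step distance, followed by a descent from the start cell. It is not taken from the paper; the grid encoding (0 = free, 1 = obstacle) and the function names `wavefront` and `extract_path` are assumptions made for illustration, and the robot-selection step the authors describe is not modelled here.

```python
from collections import deque

# 4-connected neighbourhood (assumption: no diagonal moves).
NEIGHBOURS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def wavefront(grid, goal):
    """Label every reachable free cell with its step distance to `goal`.

    grid: 2D list, 0 = free cell, 1 = obstacle (hypothetical encoding).
    goal: (row, col) of a free cell.
    Returns a 2D list of distances; None marks obstacles and unreachable cells.
    """
    rows, cols = len(grid), len(grid[0])
    dist = [[None] * cols for _ in range(rows)]
    dist[goal[0]][goal[1]] = 0
    queue = deque([goal])
    while queue:
        r, c = queue.popleft()
        for dr, dc in NEIGHBOURS:
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == 0 and dist[nr][nc] is None:
                dist[nr][nc] = dist[r][c] + 1
                queue.append((nr, nc))
    return dist


def extract_path(dist, start):
    """Walk from `start` to the goal by always stepping to a smaller label."""
    if dist[start[0]][start[1]] is None:
        return None  # start is an obstacle or cannot reach the goal
    rows, cols = len(dist), len(dist[0])
    r, c = start
    path = [start]
    while dist[r][c] != 0:
        for dr, dc in NEIGHBOURS:
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and dist[nr][nc] == dist[r][c] - 1:
                r, c = nr, nc
                path.append((r, c))
                break
    return path
```

Calling `wavefront(grid, goal)` once gives a distance map that can be reused for any start cell, which is why the expansion is usually run from the goal rather than from the robot.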

Submitted 30 July, 2023; originally announced July 2023.

Comments: 8 pages, 4 figures

arXiv:2302.03774 [pdf, other]

AI and Core Electoral Processes: Mapping the Horizons

Authors: Deepak P, Stanley Simoes, Muiris MacCarthaigh

Abstract: Significant enthusiasm around AI uptake has been witnessed across societies globally. The electoral process -- the time, place and manner of elections within democratic nations -- has been among those very rare sectors in which AI has not penetrated much. Electoral management bodies in many countries have recently started exploring and deliberating over the use of AI in the electoral process. In this paper, we consider five representative avenues within the core electoral process which have potential for AI usage, and map the challenges involved in using AI within them. These five avenues are: voter list maintenance, determining polling booth locations, polling booth protection processes, voter authentication and video monitoring of elections. Within each of these avenues, we lay down the context, illustrate current or potential usage of AI, and discuss extant or potential ramifications of AI usage, and potential directions for mitigating risks while considering AI usage. We believe that the scant current usage of AI within electoral processes provides a very rare opportunity, that of being able to deliberate on the risks and mitigation possibilities, prior to real and widespread AI deployment. This paper is an attempt to map the horizons of risks and opportunities in using AI within the electoral processes and to help shape the debate around the topic.

Submitted 7 February, 2023; originally announced February 2023.

Comments: 19 pages, 7 figures, to be published in AI Magazine (Fall 2023)

arXiv:2301.09003 [pdf, other]

doi 10.1016/j.nlp.2024.100062

Blacks is to Anger as Whites is to Joy? Understanding Latent Affective Bias in Large Pre-trained Neural Language Models

Authors: Anoop Kadan, Deepak P., Sahely Bhadra, Manjary P. Gangan, Lajish V. L

Abstract: Groundbreaking inventions and highly significant performance improvements in deep learning based Natural Language Processing are witnessed through the development of transformer based large Pre-trained Language Models (PLMs). The wide availability of unlabeled data within human generated data deluge along with self-supervised learning strategy helps to accelerate the success of large PLMs in language generation, language understanding, etc. But at the same time, latent historical bias/unfairness in human minds towards a particular gender, race, etc., encoded unintentionally/intentionally into the corpora harms and questions the utility and efficacy of large PLMs in many real-world applications, particularly for the protected groups. In this paper, we present an extensive investigation towards understanding the existence of "Affective Bias" in large PLMs to unveil any biased association of emotions such as anger, fear, joy, etc., towards a particular gender, race or religion with respect to the downstream task of textual emotion detection. We conduct our exploration of affective bias from the very initial stage of corpus level affective bias analysis by searching for imbalanced distribution of affective words within a domain, in large scale corpora that are used to pre-train and fine-tune PLMs. Later, to quantify affective bias in model predictions, we perform an extensive set of class-based and intensity-based evaluations using various bias evaluation corpora.
Our results show the existence of statistically significant affective bias in the PLM based emotion detection systems, indicating biased association of certain emotions towards a particular gender, race, and religion.

Submitted 21 January, 2023; originally announced January 2023.

arXiv:2301.08995 [pdf, other]

doi 10.1007/s10115-024-02194-4

REDAffectiveLM: Leveraging Affect Enriched Embedding and Transformer-based Neural Language Model for Readers' Emotion Detection

Authors: Anoop Kadan, Deepak P., Manjary P. Gangan, Savitha Sam Abraham, Lajish V. L

Abstract: Technological advancements in web platforms allow people to express and share emotions towards textual write-ups written and shared by others. This brings about different interesting domains for analysis: emotion expressed by the writer and emotion elicited from the readers. In this paper, we propose a novel approach for Readers' Emotion Detection from short-text documents using a deep learning model called REDAffectiveLM. Within state-of-the-art NLP tasks, it is well understood that utilizing context-specific representations from transformer-based pre-trained language models helps achieve improved performance. Within this affective computing task, we explore how incorporating affective information can further enhance performance. Towards this, we leverage context-specific and affect enriched representations by using a transformer-based pre-trained language model in tandem with affect enriched Bi-LSTM+Attention. For empirical evaluation, we procure a new dataset REN-20k, besides using RENh-4k and SemEval-2007. We evaluate the performance of our REDAffectiveLM rigorously across these datasets, against a vast set of state-of-the-art baselines, where our model consistently outperforms baselines and obtains statistically significant results. Our results establish that utilizing affect enriched representation along with context-specific representation within a neural architecture can considerably enhance readers' emotion detection.
Since the impact of affect enrichment specifically in readers' emotion detection isn't well explored, we conduct a detailed analysis over affect enriched Bi-LSTM+Attention using qualitative and quantitative model behavior evaluation techniques. We observe that, compared to conventional semantic embedding, affect enriched embedding increases the ability of the network to effectively identify and assign weightage to key terms responsible for readers' emotion detection.

Submitted 21 January, 2023; originally announced January 2023.

arXiv:2212.14467 [pdf, other]

Cluster-level Group Representativity Fairness in $k$-means Clustering

Authors: Stanley Simoes, Deepak P, Muiris MacCarthaigh

Abstract: There has been much interest recently in developing fair clustering algorithms that seek to do justice to the representation of groups defined along sensitive attributes such as race and gender. We observe that clustering algorithms could generate clusters such that different groups are disadvantaged within different clusters. We develop a clustering algorithm, building upon the centroid clustering paradigm pioneered by classical algorithms such as $k$-means, where we focus on mitigating the unfairness experienced by the most-disadvantaged group within each cluster. Our method uses an iterative optimisation paradigm whereby an initial cluster assignment is modified by reassigning objects to clusters such that the worst-off sensitive group within each cluster is benefitted. We demonstrate the effectiveness of our method through extensive empirical evaluations over a novel evaluation metric on real-world datasets. Specifically, we show that our method is effective in enhancing cluster-level group representativity fairness significantly at low impact on cluster coherence.

Submitted 29 December, 2022; originally announced December 2022.

arXiv:2209.11984 [pdf, ps, other]

Gender Bias in Fake News: An Analysis

Authors: Navya Sahadevan, Deepak P

Abstract: Data science research into fake news has gathered much momentum in recent years, arguably facilitated by the emergence of large public benchmark datasets. While it has been well-established within media studies that gender bias is an issue that pervades news media, there has been very little exploration into the relationship between gender bias and fake news. In this work, we provide the first empirical analysis of gender bias vis-a-vis fake news, leveraging simple and transparent lexicon-based methods over public benchmark datasets. Our analysis establishes the increased prevalence of gender bias in fake news across three facets viz., abundance, affect and proximal words. The insights from our analysis provide a strong argument that gender bias needs to be an important consideration in research into fake news.

Submitted 4 February, 2023; v1 submitted 24 September, 2022; originally announced September 2022.

Comments: Accepted paper in "Integrity in Social Networks and Media 2023" workshop

arXiv:2205.15683 [pdf, ps, other]

Why are NLP Models Fumbling at Elementary Math? A Survey of Deep Learning based Word Problem Solvers

Authors: Sowmya S Sundaram, Sairam Gurajada, Marco Fisichella, Deepak P, Savitha Sam Abraham

Abstract: From the latter half of the last decade, there has been a growing interest in developing algorithms for automatically solving mathematical word problems (MWP). It is a challenging and unique task that demands blending surface level text pattern recognition with mathematical reasoning. In spite of extensive research, we are still miles away from building robust representations of elementary math word problems and effective solutions for the general task. In this paper, we critically examine the various models that have been developed for solving word problems, their pros and cons and the challenges ahead. In the last two years, a lot of deep learning models have recorded competing results on benchmark datasets, making a critical and conceptual analysis of literature highly useful at this juncture. We take a step back and analyse why, in spite of this abundance in scholarly interest, the predominantly used experiment and dataset designs continue to be a stumbling block. From the vantage point of having analyzed the literature closely, we also endeavour to provide a road-map for future math word problem research.

Submitted 31 May, 2022; originally announced May 2022.

arXiv:2205.02052 [pdf, other]

Exploring Rawlsian Fairness for K-Means Clustering

Authors: Stanley Simoes, Deepak P, Muiris MacCarthaigh

Abstract: We conduct an exploratory study that looks at incorporating John Rawls' ideas on fairness into existing unsupervised machine learning algorithms. Our focus is on the task of clustering, specifically the k-means clustering algorithm. To the best of our knowledge, this is the first work that uses Rawlsian ideas in clustering. Towards this, we attempt to develop a postprocessing technique i.e., one that operates on the cluster assignment generated by the standard k-means clustering algorithm. Our technique perturbs this assignment over a number of iterations to make it fairer according to Rawls' difference principle while minimally affecting the overall utility. As the first step, we consider two simple perturbation operators -- $\mathbf{R_1}$ and $\mathbf{R_2}$ -- that reassign examples in a given cluster assignment to new clusters; $\mathbf{R_1}$ assigning a single example to a new cluster, and $\mathbf{R_2}$ a pair of examples to new clusters. Our experiments on a sample of the Adult dataset demonstrate that both operators make meaningful perturbations in the cluster assignment towards incorporating Rawls' difference principle, with $\mathbf{R_2}$ being more efficient than $\mathbf{R_1}$ in terms of the number of iterations. However, we observe that there is still a need to design operators that make significantly better perturbations. Nevertheless, both operators provide good baselines for designing and comparing any future operator, and we hope our findings would aid future work in this direction.

Submitted 4 May, 2022; originally announced May 2022.

Comments: Accepted at ICDSE 2021

arXiv:2204.10365 [pdf, other]

doi 10.1007/978-981-19-4453-6_2

Towards an Enhanced Understanding of Bias in Pre-trained Neural Language Models: A Survey with Special Emphasis on Affective Bias

Authors: Anoop K., Manjary P. Gangan, Deepak P., Lajish V. L

Abstract: The remarkable progress in Natural Language Processing (NLP) brought about by deep learning, particularly with the recent advent of large pre-trained neural language models, is brought into scrutiny as several studies began to discuss and report potential biases in NLP applications. Bias in NLP is found to originate from latent historical biases encoded by humans into textual data, which get perpetuated or even amplified by NLP algorithms. We present a survey to comprehend bias in large pre-trained language models, analyze the stages at which they occur in these models, and various ways in which these biases could be quantified and mitigated. Considering wide applicability of textual affective computing based downstream tasks in real-world systems such as business, healthcare, education, etc., we give a special emphasis on investigating bias in the context of affect (emotion) i.e., Affective Bias, in large pre-trained language models. We present a summary of various bias evaluation corpora that help to aid future research and discuss challenges in the research on bias in pre-trained language models. We believe that our attempt to draw a comprehensive view of bias in pre-trained language models, and especially the exploration of affective bias will be highly beneficial to researchers interested in this evolving field.

Submitted 21 April, 2022; originally announced April 2022.

Comments: Accepted at ICDSE 2021

arXiv:2110.15558 [pdf]

AI-Powered Semantic Segmentation and Fluid Volume Calculation of Lung CT images in Covid-19 Patients

Authors: Sabeerali K. P, Saleena T. S, Dr. Muhamed Ilyas P, Neha Mohan

Abstract: COVID-19 pandemic is a deadly disease spreading very fast. People with the confronted immune system are susceptible to many health conditions. A highly significant condition is pneumonia, which is found to be the cause of death in the majority of patients. The main purpose of this study is to find the volume of GGO and consolidation of a covid-19 patient so that the physicians can prioritize the patients. Here we used transfer learning techniques for segmentation of lung CTs with the latest libraries and techniques which reduces training time and increases the accuracy of the AI Model. This system is trained with DeepLabV3+ network architecture and model Resnet50 with Imagenet weights. We used different augmentation techniques like Gaussian Noise, Horizontal shift, color variation, etc to get to the result. Intersection over Union(IoU) is used as the performance metrics. The IoU of lung masks is predicted as 99.78% and that of infected masks is as 89.01%. Our work effectively measures the volume of infected region by calculating the volume of infected and lung mask region of the patients.

Submitted 29 October, 2021; originally announced October 2021.

Comments: https://www.uietkuk.ac.in/etbs2021/wp-content/uploads/2021/02/Special-Session-Proposal-ETBS-2021.doc

MSC Class: 68T10 (Primary)

arXiv:2110.13424 [pdf, other]

Phish-Defence: Phishing Detection Using Deep Recurrent Neural Networks

Authors: Aman Rangapur, Tarun Kanakam, Dhanvanthini P

Abstract: In the growing world of the internet, the number of ways to obtain crucial data such as passwords and login credentials, as well as sensitive personal information has expanded. Page impersonation, often known as phishing, is one method of obtaining such valuable information. Phishing is one of the most straightforward forms of cyberattack for hackers and one of the simplest for victims to fall for. It can also provide hackers with everything they need to get access to their target's personal and corporate accounts. Such websites do not offer a service, but instead, gather personal information from users. In this paper, we achieved state-of-the-art accuracy in detecting malicious URLs using recurrent neural networks. Unlike previous studies, which looked at online content, URLs, and traffic numbers, we merely look at the text in the URL, which makes it quicker and catches zero-day assaults. The network has been optimised to be utilised on tiny devices like Mobiles, and Raspberry Pi without sacrificing the inference time.

Submitted 6 September, 2022; v1 submitted 26 October, 2021; originally announced October 2021.

Comments: 9 pages, 10 figures, 4 tables

ACM Class: I.2.m

arXiv:2110.11098 [pdf, other]

Index Coded - NOMA in Vehicular Ad Hoc Networks

Authors: Sreelakshmi P., Jesy Pachat, Anjana A. Mahesh, Deepthi P. P., B. Sundar Rajan

Abstract: The demand for multimedia services is growing day by day in vehicular ad-hoc networks (VANETs), resulting in high spectral usage and network congestion. Non-orthogonal multiple access (NOMA) is a promising wireless communication technique to solve the problems related to spectral efficiency effectively. The index coding (IC) is a powerful method to improve spectral utilization, where a sender aims to satisfy the needs of multiple receivers with a minimum number of transmissions. By combining these two approaches, in this work, we propose a novel technique called index coded NOMA (IC-NOMA), where we apply NOMA techniques on index coded data to reduce the number of transmissions further. This work shows that the IC-NOMA system demands a specific design for index codes to reap the advantages of NOMA. We have done the feasibility analysis of the proposed method in a general scenario and proposed an index code design to integrate IC over NOMA for the best efficiency. Through detailed analytical studies it is validated that the proposed transmission system provides improved spectral efficiency and power saving compared to conventional IC systems.

Submitted 21 October, 2021; originally announced October 2021.

Comments: 13 pages, 5 figures and 9 tables

arXiv:2110.08510 [pdf, ps, other]

DFW-PP: Dynamic Feature Weighting based Popularity Prediction for Social Media Content

Authors: Viswanatha Reddy G, Chaitanya B S N V, Prathyush P, Sumanth M, Mrinalini C, Dileep Kumar P, Snehasis Mukherjee

Abstract: The increasing popularity of social media platforms makes it important to study user engagement, which is a crucial aspect of any marketing strategy or business model. The over-saturation of content on social media platforms has persuaded us to identify the important factors that affect content popularity. This comes from the fact that only an iota of the humongous content available online receives the attention of the target audience. Comprehensive research has been done in the area of popularity prediction using several Machine Learning techniques. However, we observe that there is still significant scope for improvement in analyzing the social importance of media content. We propose the DFW-PP framework, to learn the importance of different features that vary over time. Further, the proposed method controls the skewness of the distribution of the features by applying a log-log normalization. The proposed method is experimented with a benchmark dataset, to show promising results. The code will be made publicly available at https://github.com/chaitnayabasava/DFW-PP.

Submitted 16 October, 2021; originally announced October 2021.

arXiv:2106.13271 [pdf, ps, other]

On Fairness and Interpretability

Authors: Deepak P, Sanil V, Joemon M. Jose

Abstract: Ethical AI spans a gamut of considerations. Among these, the most popular ones, fairness and interpretability, have remained largely distinct in technical pursuits. We discuss and elucidate the differences between fairness and interpretability across a variety of dimensions. Further, we develop two principles-based frameworks towards developing ethical AI for the future that embrace aspects of both fairness and interpretability. First, interpretability for fairness proposes instantiating interpretability within the realm of fairness to develop a new breed of ethical AI. Second, fairness and interpretability initiates deliberations on bringing the best aspects of both together. We hope that these two frameworks will contribute to intensifying scholarly discussions on new frontiers of ethical AI that brings together fairness and interpretability.

Submitted 24 June, 2021; originally announced June 2021.

Comments: in IJCAI 2021 Workshop on AI for Social Good, January 2021. [ Ref: https://crcs.seas.harvard.edu/publications/fairness-and-interpretability ]

arXiv:2106.06049 [pdf, other]

doi 10.1007/s10618-022-00887-4

FiSH: Fair Spatial Hotspots

Authors: Deepak P, Sowmya S Sundaram

Abstract: Pervasiveness of tracking devices and enhanced availability of spatially located data has deepened interest in using them for various policy interventions, through computational data analysis tasks such as spatial hot spot detection. In this paper, we consider, for the first time to our best knowledge, fairness in detecting spatial hot spots. We motivate the need for ensuring fairness through statistical parity over the collective population covered across chosen hot spots. We then characterize the task of identifying a diverse set of solutions in the noteworthiness-fairness trade-off spectrum, to empower the user to choose a trade-off justified by the policy domain. Being a novel task formulation, we also develop a suite of evaluation metrics for fair hot spots, motivated by the need to evaluate pertinent aspects of the task. We illustrate the computational infeasibility of identifying fair hot spots using naive and/or direct approaches and devise a method, codenamed FiSH, for efficiently identifying high-quality, fair and diverse sets of spatial hot spots. FiSH traverses the tree-structured search space using heuristics that guide it towards identifying effective and fair sets of spatial hot spots. Through an extensive empirical analysis over a real-world dataset from the domain of human development, we illustrate that FiSH generates high-quality solutions at fast response times.

Submitted 1 June, 2021; originally announced June 2021.

Journal ref: Data Mining and Knowledge Discovery 37, 1374 - 1403 (2023)

arXiv:2012.09118 [pdf]

Exploring Thematic Coherence in Fake News

Authors: Martins Samuel Dogo, Deepak P, Anna Jurek-Loughrey

Abstract: The spread of fake news remains a serious global issue; understanding and curtailing it is paramount. One way of differentiating between deceptive and truthful stories is by analyzing their coherence. This study explores the use of topic models to analyze the coherence of cross-domain news shared online. Experimental results on seven cross-domain datasets demonstrate that fake news shows a greater thematic deviation between its opening sentences and its remainder.

Submitted 16 December, 2020; v1 submitted 16 December, 2020; originally announced December 2020.

Comments: 10 pages, 1 figure, to be published in Proceedings of the 8th International Workshop on News Recommendation and Analytics (INRA 2020)

arXiv:2010.10836 [pdf, ps, other]

ReSCo-CC: Unsupervised Identification of Key Disinformation Sentences

Authors: Soumya Suvra Ghosal, Deepak P, Anna Jurek-Loughrey

Abstract: Disinformation is often presented in long textual articles, especially when it relates to domains such as health, often seen in relation to COVID-19. These articles are typically observed to have a number of trustworthy sentences among which core disinformation sentences are scattered. In this paper, we propose a novel unsupervised task of identifying sentences containing key disinformation within a document that is known to be untrustworthy. We design a three-phase statistical NLP solution for the task which starts with embedding sentences within a bespoke feature space designed for the task. Sentences represented using those features are then clustered, following which the key sentences are identified through proximity scoring. We also curate a new dataset with sentence level disinformation scorings to aid evaluation for this task; the dataset is being made publicly available to facilitate further research. Based on a comprehensive empirical evaluation against techniques from related tasks such as claim detection and summarization, as well as against simplified variants of our proposed approach, we illustrate that our method is able to identify core disinformation effectively.

Submitted 21 October, 2020; originally announced October 2020.

Comments: The 22nd International Conference on Information Integration and Web-based Applications & Services (iiWAS '20), Chiang Mai, Thailand

arXiv:2010.07054 [pdf, other]

doi 10.1145/3394231.3397910

Representativity Fairness in Clustering

Authors: Deepak P, Savitha Sam Abraham

Abstract: Incorporating fairness constructs into machine learning algorithms is a topic of much societal importance and recent interest. Clustering, a fundamental task in unsupervised learning that manifests across a number of web data scenarios, has also been subject of attention within fair ML research. In this paper, we develop a novel notion of fairness in clustering, called representativity fairness. Representativity fairness is motivated by the need to alleviate disparity across objects' proximity to their assigned cluster representatives, to aid fairer decision making. We illustrate the importance of representativity fairness in real-world decision making scenarios involving clustering and provide ways of quantifying objects' representativity and fairness over it. We develop a new clustering formulation, RFKM, that targets to optimize for representativity fairness along with clustering quality. Inspired by the K-Means framework, RFKM incorporates novel loss terms to formulate an objective function. The RFKM objective and optimization approach guides it towards clustering configurations that yield higher representativity fairness. Through an empirical evaluation over a variety of public datasets, we establish the effectiveness of our method. We illustrate that we are able to significantly improve representativity fairness at only marginal impact to clustering quality.

Submitted 11 October, 2020; originally announced October 2020.

Comments: In 12th ACM Web Science Conference (WebSci 2020)

arXiv:2010.05353 [pdf, other]

doi 10.1145/3410566.3410601

Local Connectivity in Centroid Clustering

Authors: Deepak P

Abstract: Clustering is a fundamental task in unsupervised learning, one that targets to group a dataset into clusters of similar objects. There has been recent interest in embedding normative considerations around fairness within clustering formulations. In this paper, we propose 'local connectivity' as a crucial factor in assessing membership desert in centroid clustering. We use local connectivity to refer to the support offered by the local neighborhood of an object towards supporting its membership to the cluster in question. We motivate the need to consider local connectivity of objects in cluster assignment, and provide ways to quantify local connectivity in a given clustering. We then exploit concepts from density-based clustering and devise LOFKM, a clustering method that seeks to deepen local connectivity in clustering outputs, while staying within the framework of centroid clustering. Through an empirical evaluation over real-world datasets, we illustrate that LOFKM achieves notable improvements in local connectivity at reasonable costs to clustering quality, illustrating the effectiveness of the method.

Submitted 11 October, 2020; originally announced October 2020.

Comments: In 24th International Database Engineering & Applications Symposium (IDEAS 2020), August 12--14, 2020, Seoul, Republic of Korea

arXiv:2007.07838 [pdf, ps, other]

Whither Fair Clustering?

Authors: Deepak P

Abstract: Within the relatively busy area of fair machine learning that has been dominated by classification fairness research, fairness in clustering has started to see some recent attention. In this position paper, we assess the existing work in fair clustering and observe that there are several directions that are yet to be explored, and postulate that the state-of-the-art in fair clustering has been quite parochial in outlook. We posit that widening the normative principles to target for, characterizing shortfalls where the target cannot be achieved fully, and making use of knowledge of downstream processes can significantly widen the scope of research in fair clustering research. At a time when clustering and unsupervised learning are being increasingly used to make and influence decisions that matter significantly to human lives, we believe that widening the ambit of fair clustering is of immense significance.

Submitted 8 July, 2020; originally announced July 2020.

Comments: Accepted at the AI for Social Good Workshop, Harvard, July 20-21, 2020

arXiv:2007.00559 [pdf, ps, other]

Index Coding in Vehicle to Vehicle Communication

Authors: Jesy Pachat, Nujoom Sageer Karat, Deepthi P. P., B. Sundar Rajan

Abstract: Vehicle to Vehicle (V2V) communication phase is an integral part of collaborative message dissemination in vehicular ad-hoc networks (VANETs). In this work, we apply index coding techniques to reduce the number of transmissions required for data exchange. The index coding problem has a sender, which tries to meet the demands of several receivers in a minimum number of transmissions. All these receivers have some prior knowledge of the messages, known as the side-information. In this work, we consider a particular case of the index coding problem, where multiple nodes want to share information among them. Under this set up, lower bound on the number of transmissions is established when the cardinality of side-information is the same. An optimal solution to achieve the bound in a special case of VANET scenario is presented. For this special case, we consider the link between the nodes to be error-prone, and in this setting, we construct optimal linear error correcting index codes.

Submitted 1 July, 2020; originally announced July 2020.

Comments: Accepted for publication in IEEE Transactions on Vehicular Technology

arXiv:2006.07580 [pdf, other]

Modeling Implicit Communities using Spatio-Temporal Point Processes from Geo-tagged Event Traces

Authors: Ankita Likhyani, Vinayak Gupta, Srijith P. K., Deepak P., Srikanta Bedathur

Abstract: The location check-ins of users through various location-based services such as Foursquare, Twitter, and Facebook Places, etc., generate large traces of geo-tagged events. These event-traces often manifest in hidden (possibly overlapping) communities of users with similar interests. Inferring these implicit communities is crucial for forming user profiles for improvements in recommendation and prediction tasks. Given only time-stamped geo-tagged traces of users, can we find out these implicit communities, and characteristics of the underlying influence network? Can we use this network to improve the next location prediction task? In this paper, we focus on the problem of community detection as well as capturing the underlying diffusion process and propose a model COLAB based on Spatio-temporal point processes in continuous time but discrete space of locations that simultaneously models the implicit communities of users based on their check-in activities, without making use of their social network connections. COLAB captures the semantic features of the location, user-to-user influence along with spatial and temporal preferences of users. To learn the latent community of users and model parameters, we propose an algorithm based on stochastic variational inference. To the best of our knowledge, this is the first attempt at jointly modeling the diffusion process with activity-driven implicit communities.
We demonstrate COLAB achieves up to 27% improvements in location prediction task over recent deep point-process based methods on geo-tagged event traces collected from Foursquare check-ins.

Submitted 13 June, 2020; originally announced June 2020.

Comments: 17 pages

arXiv:2005.09900 [pdf, ps, other]

Fair Outlier Detection

Authors: Deepak P, Savitha Sam Abraham

Abstract: An outlier detection method may be considered fair over specified sensitive attributes if the results of outlier detection are not skewed towards particular groups defined on such sensitive attributes. In this task, we consider, for the first time to our best knowledge, the task of fair outlier detection. In this work, we consider the task of fair outlier detection over multiple multi-valued sensitive attributes (e.g., gender, race, religion, nationality, marital status etc.). We propose a fair outlier detection method, FairLOF, that is inspired by the popular LOF formulation for neighborhood-based outlier detection. We outline ways in which unfairness could be induced within LOF and develop three heuristic principles to enhance fairness, which form the basis of the FairLOF method. Being a novel task, we develop an evaluation framework for fair outlier detection, and use that to benchmark FairLOF on quality and fairness of results. Through an extensive empirical evaluation over real-world datasets, we illustrate that FairLOF is able to achieve significant improvements in fairness at sometimes marginal degradations on result quality as measured against the fairness-agnostic LOF method.

Submitted 4 August, 2020; v1 submitted 20 May, 2020; originally announced May 2020.

Comments: In Proceedings of The 21st International Conference on Web Information Systems Engineering (WISE 2020), Amsterdam and Leiden, The Netherlands

arXiv:2004.11044 [pdf, ps, other]

BOLD: An Ontology-based Log Debugger for C Programs

Authors: Dileep Kumar P, Rupesh Nasre, Sreenivasa Kumar P

Abstract: The different activities related to debugging such as program instrumentation, representation of execution trace and analysis of trace are not typically performed in a unified framework. We propose BOLD, an Ontology-based Log Debugger to unify and standardize the activities in debugging. The syntactical information of programs can be represented in the form of Resource Description Framework (RDF) triples. Using the BOLD framework, the programs can be automatically instrumented by using declarative specifications over these triples. A salient feature of the framework is to store the execution trace of the program also as RDF triples called trace triples. These triples can be queried to implement the common debug operations. The novelty of the framework is to abstract these triples as spans for high-level reasoning. A span gives a way of examining the values of a particular variable over certain portion of the program execution. The properties of the spans are defined formally as a Web Ontology Language (OWL) ontology called Program Debug (PD) Ontology. Using the span abstraction and PD ontology, end-users can debug a given buggy program in a standard manner. A notable feature of using ontology is that users can accurately debug in some cases of missing information, which can be practically useful. To demonstrate the feasibility of the proposed framework, we have debugged the programs in a standard bug benchmark suite Software-artifact Infrastructure Repository (SIR).
Experiments show that the querying time is almost the same as in gdb. The reasoning time depends on the sub-language of OWL. We find that the expressibility offered by OWL-DL language is sufficient for the bugs in SIR programs; but to achieve scalability in reasoning, a restricted OWL-RL language is required.

Submitted 23 April, 2020; originally announced April 2020.

Comments: 16 pages, 4 tables, 4 figures

arXiv:2002.05527 [pdf, other]

Unsupervised Separation of Native and Loanwords for Malayalam and Telugu

Authors: Sridhama Prakhya, Deepak P

Abstract: Quite often, words from one language are adopted within a different language without translation; these words appear in transliterated form in text written in the latter language. This phenomenon is particularly widespread within Indian languages where many words are loaned from English. In this paper, we address the task of identifying loanwords automatically and in an unsupervised manner, from large datasets of words from agglutinative Dravidian languages. We target two specific languages from the Dravidian family, viz., Malayalam and Telugu. Based on familiarity with the languages, we outline an observation that native words in both these languages tend to be characterized by a much more versatile stem - stem being a shorthand to denote the subword sequence formed by the first few characters of the word - than words that are loaned from other languages. We harness this observation to build an objective function and an iterative optimization formulation to optimize for it, yielding a scoring of each word's nativeness in the process. Through an extensive empirical analysis over real-world datasets from both Malayalam and Telugu, we illustrate the effectiveness of our method in quantifying nativeness effectively over available baselines for the task.

Submitted 11 February, 2020; originally announced February 2020.

Comments: submitted to Natural Language Engineering; 22 pages; 4 figures. This is an extended version of a conference paper (arXiv:1803.09641) that has been enriched with substantive new content, with significant extensions on both the method modeling and the experiments

arXiv:1910.05113 [pdf, other]

Fairness in Clustering with Multiple Sensitive Attributes

Authors: Savitha Sam Abraham, Deepak P, Sowmya S Sundaram

Abstract: A clustering may be considered as fair on pre-specified sensitive attributes if the proportions of sensitive attribute groups in each cluster reflect that in the dataset. In this paper, we consider the task of fair clustering for scenarios involving multiple multi-valued or numeric sensitive attributes. We propose a fair clustering method, FairKM (Fair K-Means), that is inspired by the popular K-Means clustering formulation. We outline a computational notion of fairness which is used along with a cluster coherence objective, to yield the FairKM clustering method. We empirically evaluate our approach, wherein we quantify both the quality and fairness of clusters, over real-world datasets. Our experimental evaluation illustrates that the clusters generated by FairKM fare significantly better on both clustering quality and fair representation of sensitive attribute groups compared to the clusters from a state-of-the-art baseline fair clustering method.

Submitted 24 January, 2020; v1 submitted 11 October, 2019; originally announced October 2019.

Comments: Proceedings of the 23rd International Conference on Extending Database Technology (EDBT 2020), 30th March-2nd April, 2020

arXiv:1906.11126 [pdf, other]

On the Coherence of Fake News Articles

Authors: Iknoor Singh, Deepak P, Anoop K

Abstract: The generation and spread of fake news within new and online media sources is emerging as a phenomenon of high societal significance. Combating them using data-driven analytics has been attracting much recent scholarly interest. In this study, we analyze the textual coherence of fake news articles vis-a-vis legitimate ones. We develop three computational formulations of textual coherence drawing upon the state-of-the-art methods in natural language processing and data science. Two real-world datasets from widely different domains which have fake/legitimate article labellings are then analyzed with respect to textual coherence. We observe apparent differences in textual coherence across fake and legitimate news articles, with fake news articles consistently scoring lower on coherence as compared to legitimate news ones. While the relative coherence shortfall of fake news articles as compared to legitimate ones form the main observation from our study, we analyze several aspects of the differences and outline potential avenues of further inquiry.

Submitted 15 August, 2020; v1 submitted 26 June, 2019; originally announced June 2019.

Comments: 8th International Workshop on News Recommendation and Analytics (INRA 2020) held in conjunction with ECML PKDD 2020 Conference

arXiv:1906.10365 [pdf, other]

Emotion Cognizance Improves Health Fake News Identification

Authors: Anoop K, Deepak P, Lajish V L

Abstract: Identifying misinformation is increasingly being recognized as an important computational task with high potential social impact. Misinformation and fake contents are injected into almost every domain of news including politics, health, science, business, etc., among which, the fakeness in health domain pose serious adverse effects to scare or harm the society. Misinformation contains scientific claims or content from social media exaggerated with strong emotion content to attract eyeballs. In this paper, we consider the utility of the affective character of news articles for fake news identification in the health domain and present evidence that emotion cognizant representations are significantly more suited for the task. We outline a technique to leverage emotion intensity lexicons to develop emotionized text representations, and evaluate the utility of such a representation for identifying fake news relating to health in various supervised and unsupervised scenarios. The consistent and significant empirical gains that we observe over a range of technique types and parameter settings establish the utility of the emotional information in news articles, an often overlooked aspect, for the task of misinformation identification in the health domain.

Submitted 4 August, 2020; v1 submitted 25 June, 2019; originally announced June 2019.

Comments: In Proceedings of 24th International Database Engineering & Applications Symposium (IDEAS 2020), Incheon, Korea

arXiv:1906.05205 [pdf, other]

Warping Resilient Scalable Anomaly Detection in Time Series

Authors: Abilasha S, Sahely Bhadra, Deepak P, Anish Mathew

Abstract: Time series data is ubiquitous in the real-world problems across various domains including healthcare, social media, and crime surveillance. Detecting anomalies, or irregular and rare events, in time series data, can enable us to find abnormal events in any natural phenomena, which may require special treatment. Moreover, labeled instances of anomaly are hard to get in time series data. On the other hand, time series data, due to its nature, often exhibits localized expansions and compressions in the time dimension which is called warping. These two challenges make it hard to detect anomalies in time series as often such warpings could get detected as anomalies erroneously. Our objective is to build an anomaly detection model that is robust to such warping variations. In this paper, we propose a novel unsupervised time series anomaly detection method, WaRTEm-AD, that operates in two stages. Within the key stage of representation learning, we employ data augmentation through bespoke time series operators which are passed through a twin autoencoder architecture to learn warping-robust representations for time series data. Second, adaptations of state-of-the-art anomaly detection methods are employed on the learnt representations to identify anomalies. We will illustrate that WaRTEm-AD is designed to detect two types of time series anomalies: point and sequence anomalies. We compare WaRTEm-AD with the state-of-the-art baselines and establish the effectiveness of our method both in terms of anomaly detection performance and computational efficiency.

Submitted 2 October, 2021; v1 submitted 12 June, 2019; originally announced June 2019.

Comments: October 2021: in communication to ECML PKDD Journal Track

arXiv:1810.12897 [pdf, ps, other]

Topic-Specific Sentiment Analysis Can Help Identify Political Ideology

Authors: Sumit Bhatia, Deepak P

Abstract: Ideological leanings of an individual can often be gauged by the sentiment one expresses about different issues. We propose a simple framework that represents a political ideology as a distribution of sentiment polarities towards a set of topics. This representation can then be used to detect ideological leanings of documents (speeches, news articles, etc.) based on the sentiments expressed towards different topics. Experiments performed using a widely used dataset show the promise of our proposed approach that achieves comparable performance to other methods despite being much simpler and more interpretable.

Submitted 30 October, 2018; originally announced October 2018.

Comments: Presented at EMNLP Workshop on Computational Approaches to Subjectivity, Sentiment & Social Media Analysis, 2018

arXiv:1803.09641 [pdf, ps, other]

Unsupervised Separation of Transliterable and Native Words for Malayalam

Authors: Deepak P

Abstract: Differentiating intrinsic language words from transliterable words is a key step aiding text processing tasks involving different natural languages. We consider the problem of unsupervised separation of transliterable words from native words for text in Malayalam language. Outlining a key observation on the diversity of characters beyond the word stem, we develop an optimization method to score words based on their nativeness. Our method relies on the usage of probability distributions over character n-grams that are refined in step with the nativeness scorings in an iterative optimization formulation. Using an empirical evaluation, we illustrate that our method, DTIM, provides significant improvements in nativeness scoring for Malayalam, establishing DTIM as the preferred method for the task.

Submitted 26 March, 2018; originally announced March 2018.

Comments: 10 pages, Proceedings of 14th International Conference on Natural Language Processing, Kolkata, India. 18-21 December, 2017

ACM Class: I.2.7

arXiv:1701.03855 [pdf, other]

Location Inference from Tweets using Grid-based Classification

Authors: Oluwaseun Ajao, Deepak P, Jun Hong

Abstract: The impact of social media and its growing association with the sharing of ideas and propagation of messages remains vital in everyday communication. Twitter is one effective platform for the dissemination of news and stories about recent events happening around the world. It has a continually growing database currently adopted by over 300 million users. In this paper we propose a novel grid-based approach employing supervised Multinomial Naive Bayes while extracting geographic entities from relevant user descriptions metadata which gives a spatial indication of the user location. To the best of our knowledge our approach is the first to make location inference from tweets using geo-enriched grid-based classification. Our approach performs better than existing baselines achieving more than 57% accuracy at city-level granularity. In addition we present a novel framework for content-based estimation of user locations by specifying levels of granularity required in pre-defined location grids.

Submitted 13 January, 2017; originally announced January 2017.

Comments: Location Inference from Tweets using Grid-based Classification

ACM Class: H.3.3

arXiv:1111.2750 [pdf]

Finite State Machine Based Evaluation Model for Web Service Reliability Analysis

Authors: Thirumaran. M, Dhavachelvan. P, S. Abarna, Lakshmi. P

Abstract: Now-a-days they are very much considering about the changes to be done at shorter time since the reaction time needs are decreasing every moment. Business Logic Evaluation Model (BLEM) are the proposed solution targeting business logic automation and facilitating business experts to write sophisticated business rules and complex calculations without costly custom programming. BLEM is powerful enough to handle service manageability issues by analyzing and evaluating the computability and traceability and other criteria of modified business logic at run time. The web service and QOS grows expensively based on the reliability of the service. Hence the service provider of today things that reliability is the major factor and any problem in the reliability of the service should overcome then and there in order to achieve the expected level of reliability. In our paper we propose business logic evaluation model for web service reliability analysis using Finite State Machine (FSM) where FSM will be extended to analyze the reliability of composed set of service i.e., services under composition, by analyzing reliability of each participating service of composition with its functional work flow process. FSM is exploited to measure the quality parameters. If any change occurs in the business logic the FSM will automatically measure the reliability.

Submitted 7 November, 2011; originally announced November 2011.

Comments: 13 pages,3 figures, WesT-2011

arXiv:1111.1586 [pdf, other]

Evaluation of Computability Criterions for Runtime Web Service Integration

Authors: Thirumaran. M, Dhavachelvan. P, Aranganayagi. G, S. Abarna

Abstract: Today's competitive environment drives the enterprises to extend their focus and collaborate with their business partners to carry out the necessities. Tight coordination among business partners assists to share and integrate the service logic globally. But integrating service logics across diverse enterprises leads to exponential problem which stipulates developers to comprehend the whole service and must resolve suitable method to integrate the services. It is complex and time-consuming task. So the present focus is to have a mechanized system to analyze the Business logics and convey the proper mode to integrate them. There is no standard model to undertake these issues and one such a framework proposed in this paper examines the Business logics individually and suggests proper structure to integrate them. One of the innovative concepts of proposed model is Property Evaluation System which scrutinizes the service logics and generates Business Logic Property Schema (BLPS) for the required services. BLPS holds necessary information to recognize the correct structure for integrating the service logics. At the time of integration, System consumes this BLPS schema and suggests the feasible ways to integrate the service logics. Also if the service logics are attempted to integrate in invalid structure or attempted to violate accessibility levels, system will throw exception with necessary information. This helps developers to ascertain the efficient structure to integrate the services with least effort.

Submitted 7 November, 2011; originally announced November 2011.

Comments: 16 pages

arXiv:1002.2202 [pdf]

Modeling of Human Criminal Behavior using Probabilistic Networks

Authors: Ramesh Kumar Gopala Pillai, Dr. Ramakanth Kumar . P

Abstract: Currently, criminals profile (CP) is obtained from investigators or forensic psychologists interpretation, linking crime scene characteristics and an offenders behavior to his or her characteristics and psychological profile. This paper seeks an efficient and systematic discovery of nonobvious and valuable patterns between variables from a large database of solved cases via a probabilistic network (PN) modeling approach. The PN structure can be used to extract behavioral patterns and to gain insight into what factors influence these behaviors. Thus, when a new case is being investigated and the profile variables are unknown because the offender has yet to be identified, the observed crime scene variables are used to infer the unknown variables based on their connections in the structure and the corresponding numerical (probabilistic) weights. The objective is to produce a more systematic and empirical approach to profiling, and to use the resulting PN model as a decision tool.

Submitted 10 February, 2010; originally announced February 2010.

Comments: IEEE format, International Journal of Computer Science and Information Security, IJCSIS January 2010, ISSN 1947 5500, http://sites.google.com/site/ijcsis/

Report number: Journal of Computer Science, ISSN 19475500

Journal ref: International Journal of Computer Science and Information Security, IJCSIS, Vol. 7, No. 1, pp. 216-219, January 2010, USA

Showing 1–42 of 42 results for author: P., D