-
Temporal Motifs for Financial Networks: A Study on Mercari, JPMC, and Venmo Platforms
Authors:
Penghang Liu,
Rupam Acharyya,
Robert E. Tillman,
Shunya Kimura,
Naoki Masuda,
Ahmet Erdem Sarıyüce
Abstract:
Understanding the dynamics of financial transactions among people is critically important for various applications such as fraud detection. One important aspect of financial transaction networks is temporality. The order and repetition of transactions can offer new insights when considered within the graph structure. Temporal motifs, defined as a set of nodes that interact with each other in a short time period, are a promising tool in this context. In this work, we study three unique temporal financial networks: transactions in Mercari, an online marketplace, payments in a synthetic network generated by J.P. Morgan Chase, and payments and friendships among Venmo users. We consider the fraud detection problem on the Mercari and J.P. Morgan Chase networks, for which the ground truth is available. We show that temporal motifs outperform a previous method that considers simple graph features. For the Venmo network, we investigate the interplay between financial and social relations on three tasks: friendship prediction, vendor identification, and analysis of temporal cycles. For friendship prediction, temporal motifs yield better results than general heuristics, such as Jaccard and Adamic-Adar measures. We are also able to identify vendors with high accuracy and observe interesting patterns in rare motifs, like temporal cycles. We believe that the analysis, datasets, and lessons from this work will be beneficial for future research on financial transaction networks.
Submitted 18 January, 2023;
originally announced January 2023.
-
Fairness in Rating Prediction by Awareness of Verbal and Gesture Quality of Public Speeches
Authors:
Ankani Chattoraj,
Rupam Acharyya,
Shouman Das,
Md. Iftekhar Tanveer,
Ehsan Hoque
Abstract:
The role of verbal and non-verbal cues towards great public speaking has been a topic of exploration for many decades. We identify a commonality across present theories, the element of "variety or heterogeneity" in channels or modes of communication (e.g. resorting to stories, scientific facts, emotional connections, facial expressions etc.) which is essential for effectively communicating information. We use this observation to formalize a novel HEterogeneity Metric, HEM, that quantifies the quality of a talk both in the verbal and non-verbal domain (transcript and facial gestures). We use TED talks as an input repository of public speeches because it consists of speakers from a diverse community besides having a wide outreach. We show that there is an interesting relationship between HEM and the ratings of TED talks given to speakers by viewers. It emphasizes that HEM inherently and successfully represents the quality of a talk based on "variety or heterogeneity". Further, we also discover that HEM successfully captures the prevalent bias in ratings with respect to race and gender, that we call sensitive attributes (because prediction based on these might result in unfair outcome). We incorporate the HEM metric into the loss function of a neural network with the goal to reduce unfairness in rating predictions with respect to race and gender. Our results show that the modified loss function improves fairness in prediction without considerably affecting prediction accuracy of the neural network. Our work ties together a novel metric for public speeches in both verbal and non-verbal domain with the computational power of a neural network to design a fair prediction system for speakers.
Submitted 15 November, 2021; v1 submitted 11 December, 2020;
originally announced December 2020.
-
Detecting Individuals with Depressive Disorder from Personal Google Search and YouTube History Logs
Authors:
Boyu Zhang,
Anis Zaman,
Rupam Acharyya,
Ehsan Hoque,
Vincent Silenzio,
Henry Kautz
Abstract:
Depressive disorder is one of the most prevalent mental illnesses among the global population. However, traditional screening methods require exacting in-person interviews and may fail to provide immediate interventions. In this work, we leverage ubiquitous personal longitudinal Google Search and YouTube engagement logs to detect individuals with depressive disorder. We collected Google Search and YouTube history data and clinical depression evaluation results from $212$ participants ($99$ of them suffered from moderate to severe depression). We then propose a personalized framework for classifying individuals with and without depression symptoms based on a mutually exciting point process that captures both the temporal and semantic aspects of online activities. Our best model achieved an average F1 score of $0.77 \pm 0.04$ and an AUC ROC of $0.81 \pm 0.02$.
Submitted 28 October, 2020;
originally announced October 2020.
-
Statistical Mechanical Analysis of Neural Network Pruning
Authors:
Rupam Acharyya,
Ankani Chattoraj,
Boyu Zhang,
Shouman Das,
Daniel Stefankovic
Abstract:
Deep learning architectures with a huge number of parameters are often compressed using pruning techniques to ensure computational efficiency of inference during deployment. Despite a multitude of empirical advances, there is a lack of theoretical understanding of the effectiveness of different pruning methods. We inspect different pruning techniques under the statistical mechanics formulation of a teacher-student framework and derive their generalization error (GE) bounds. It has been shown that the Determinantal Point Process (DPP) based node pruning method is notably superior to competing approaches when tested on real datasets. Using GE bounds in the aforementioned setup, we provide theoretical guarantees for these empirical observations. Another consistent finding in the literature is that sparse neural networks (edge pruned) generalize better than dense neural networks (node pruned) for a fixed number of parameters. We use our theoretical setup to prove this finding and show that even the baseline random edge pruning method performs better than the DPP node pruning method. We also validate this empirically on real datasets.
Submitted 11 June, 2021; v1 submitted 30 June, 2020;
originally announced June 2020.
-
Detection and Mitigation of Bias in Ted Talk Ratings
Authors:
Rupam Acharyya,
Shouman Das,
Ankani Chattoraj,
Oishani Sengupta,
Md Iftekar Tanveer
Abstract:
Unbiased data collection is essential to guaranteeing fairness in artificial intelligence models. Implicit bias, a form of behavioral conditioning that leads us to attribute predetermined characteristics to members of certain groups, informs the data collection process. This paper quantifies implicit bias in viewer ratings of TEDTalks, a diverse social platform assessing social and professional performance, in order to present the correlations of different kinds of bias across sensitive attributes. Although the viewer ratings of these videos should purely reflect the speaker's competence and skill, our analysis of the ratings demonstrates the presence of overwhelming and predominant implicit bias with respect to race and gender. In our paper, we present strategies to detect and mitigate bias that are critical to removing unfairness in AI.
Submitted 2 March, 2020;
originally announced March 2020.
-
FairyTED: A Fair Rating Predictor for TED Talk Data
Authors:
Rupam Acharyya,
Shouman Das,
Ankani Chattoraj,
Md. Iftekhar Tanveer
Abstract:
With the recent trend of applying machine learning in every aspect of human life, it is important to incorporate fairness into the core of the predictive algorithms. We address the problem of predicting the quality of public speeches while being fair with respect to sensitive attributes of the speakers, e.g. gender and race. We use the TED talks as an input repository of public speeches because it consists of speakers from a diverse community and has a wide outreach. Utilizing the theories of Causal Models, Counterfactual Fairness and state-of-the-art neural language models, we propose a mathematical framework for fair prediction of the public speaking quality. We employ grounded assumptions to construct a causal model capturing how different attributes affect public speaking quality. This causal model contributes in generating counterfactual data to train a fair predictive model. Our framework is general enough to utilize any assumption within the causal model. Experimental results show that while prediction accuracy is comparable to recent work on this dataset, our predictions are counterfactually fair with respect to a novel metric when compared to true data labels. The FairyTED setup not only allows organizers to make informed and diverse selection of speakers from the unobserved counterfactual possibilities but it also ensures that viewers and new users are not influenced by unfair and unbalanced ratings from arbitrary visitors to the www.ted.com website when deciding to view a talk.
Submitted 25 November, 2019;
originally announced November 2019.
-
Infinite-Label Learning with Semantic Output Codes
Authors:
Yang Zhang,
Rupam Acharyya,
Ji Liu,
Boqing Gong
Abstract:
We develop a new statistical machine learning paradigm, named infinite-label learning, to annotate a data point with more than one relevant label from a candidate set, which pools both the finite labels observed at training and a potentially infinite number of previously unseen labels. Infinite-label learning fundamentally expands the scope of conventional multi-label learning, and better models the practical requirements in various real-world applications, such as image tagging, ads-query association, and article categorization. However, how can we learn a labeling function that is capable of assigning to a data point the labels omitted from the training set? To answer the question, we seek some clues from the recent work on zero-shot learning, where the key is to represent a class/label by a vector of semantic codes, as opposed to treating them as atomic labels. We validate infinite-label learning by a PAC bound in theory and some empirical studies on both synthetic and real data.
Submitted 20 October, 2017; v1 submitted 23 August, 2016;
originally announced August 2016.
-
Counting Popular Matchings in House Allocation Problems
Authors:
Rupam Acharyya,
Sourav Chakraborty,
Nitesh Jha
Abstract:
We study the problem of counting the number of popular matchings in a given instance. A popular matching instance consists of agents A and houses H, where each agent ranks a subset of houses according to their preferences. A matching is an assignment of agents to houses. A matching M is more popular than matching M' if the number of agents that prefer M to M' is more than the number of people that prefer M' to M. A matching M is called popular if there exists no matching more popular than M. McDermid and Irving gave a poly-time algorithm for counting the number of popular matchings when the preference lists are strictly ordered.
We first consider the case of ties in preference lists. Nasre proved that the problem of counting the number of popular matchings is #P-hard when there are ties. We give an FPRAS for this problem.
We then consider the popular matching problem where preference lists are strictly ordered but each house has a capacity associated with it. We give a switching graph characterization of popular matchings in this case. Such characterizations were studied earlier for the case of strictly ordered preference lists (McDermid and Irving) and for preference lists with ties (Nasre). We use our characterization to prove that counting popular matchings in the capacitated case is #P-hard.
Submitted 12 December, 2013;
originally announced December 2013.
-
Unit Disk Cover Problem
Authors:
Rashmisnata Acharyya,
Manjanna B.,
Gautam K. Das
Abstract:
Given a set ${\cal D}$ of unit disks in the Euclidean plane, we consider (i) the {\it discrete unit disk cover} (DUDC) problem and (ii) the {\it rectangular region cover} (RRC) problem. In the DUDC problem, for a given set ${\cal P}$ of points the objective is to select a minimum cardinality subset ${\cal D}^* \subseteq {\cal D}$ such that each point in ${\cal P}$ is covered by at least one disk in ${\cal D}^*$. On the other hand, in the RRC problem the objective is to select a minimum cardinality subset ${\cal D}^{**} \subseteq {\cal D}$ such that each point of a given rectangular region ${\cal R}$ is covered by a disk in ${\cal D}^{**}$. For the DUDC problem, we propose a $(9+ε)$-factor ($0 < ε\leq 6$) approximation algorithm. The previous best known approximation factor was 15 \cite{FL12}. For the RRC problem, we propose (i) a $(9 + ε)$-factor ($0 < ε\leq 6$) approximation algorithm, (ii) a 2.25-factor approximation algorithm in the reduced radius setup, improving the previous 4-factor approximation result in the same setup \cite{FKKLS07}.
The solution to the DUDC problem is based on a PTAS for the subproblem LSDUDC, where all the points in ${\cal P}$ are on one side of a line and covered by the disks centered on the other side of that line.
Submitted 13 September, 2012;
originally announced September 2012.