-
DiaLLMs: EHR Enhanced Clinical Conversational System for Clinical Test Recommendation and Diagnosis Prediction
Authors:
Weijieying Ren,
Tianxiang Zhao,
Lei Wang,
Tianchun Wang,
Vasant Honavar
Abstract:
Recent advances in Large Language Models (LLMs) have led to remarkable progresses in medical consultation. However, existing medical LLMs overlook the essential role of Electronic Health Records (EHR) and focus primarily on diagnosis recommendation, limiting their clinical applicability. We propose DiaLLM, the first medical LLM that integrates heterogeneous EHR data into clinically grounded dialog…
▽ More
Recent advances in Large Language Models (LLMs) have led to remarkable progresses in medical consultation. However, existing medical LLMs overlook the essential role of Electronic Health Records (EHR) and focus primarily on diagnosis recommendation, limiting their clinical applicability. We propose DiaLLM, the first medical LLM that integrates heterogeneous EHR data into clinically grounded dialogues, enabling clinical test recommendation, result interpretation, and diagnosis prediction to better align with real-world medical practice. To construct clinically grounded dialogues from EHR, we design a Clinical Test Reference (CTR) strategy that maps each clinical code to its corresponding description and classifies test results as "normal" or "abnormal". Additionally, DiaLLM employs a reinforcement learning framework for evidence acquisition and automated diagnosis. To handle the large action space, we introduce a reject sampling strategy to reduce redundancy and improve exploration efficiency. Furthermore, a confirmation reward and a class-sensitive diagnosis reward are designed to guide accurate diagnosis prediction. Extensive experimental results demonstrate that DiaLLM outperforms baselines in clinical test recommendation and diagnosis prediction.
△ Less
Submitted 24 June, 2025;
originally announced June 2025.
-
Brains vs. Bytes: Evaluating LLM Proficiency in Olympiad Mathematics
Authors:
Hamed Mahdavi,
Alireza Hashemi,
Majid Daliri,
Pegah Mohammadipour,
Alireza Farhadi,
Samira Malek,
Yekta Yazdanifard,
Amir Khasahmadi,
Vasant Honavar
Abstract:
Recent advances in large language models (LLMs) have shown impressive progress in mathematical reasoning tasks. However, current evaluation benchmarks predominantly focus on the accuracy of final answers, often overlooking the crucial logical rigor for mathematical problem solving. The claim that state-of-the-art LLMs can solve Math Olympiad-level problems requires closer examination. To explore t…
▽ More
Recent advances in large language models (LLMs) have shown impressive progress in mathematical reasoning tasks. However, current evaluation benchmarks predominantly focus on the accuracy of final answers, often overlooking the crucial logical rigor for mathematical problem solving. The claim that state-of-the-art LLMs can solve Math Olympiad-level problems requires closer examination. To explore this, we conducted both qualitative and quantitative human evaluations of proofs generated by LLMs, and developed a schema for automatically assessing their reasoning capabilities. Our study reveals that current LLMs fall significantly short of solving challenging Olympiad-level problems and frequently fail to distinguish correct mathematical reasoning from clearly flawed solutions. Our analyses demonstrate that the occasional correct final answers provided by LLMs often result from pattern recognition or heuristic shortcuts rather than genuine mathematical reasoning. These findings underscore the substantial gap between LLM performance and human expertise in advanced mathematical reasoning and highlight the importance of developing benchmarks that prioritize the soundness of the reasoning used to arrive at an answer rather than the mere correctness of the final answers.
△ Less
Submitted 10 April, 2025; v1 submitted 31 March, 2025;
originally announced April 2025.
-
A practical guide to machine learning interatomic potentials -- Status and future
Authors:
Ryan Jacobs,
Dane Morgan,
Siamak Attarian,
Jun Meng,
Chen Shen,
Zhenghao Wu,
Clare Yijia Xie,
Julia H. Yang,
Nongnuch Artrith,
Ben Blaiszik,
Gerbrand Ceder,
Kamal Choudhary,
Gabor Csanyi,
Ekin Dogus Cubuk,
Bowen Deng,
Ralf Drautz,
Xiang Fu,
Jonathan Godwin,
Vasant Honavar,
Olexandr Isayev,
Anders Johansson,
Boris Kozinsky,
Stefano Martiniani,
Shyue Ping Ong,
Igor Poltavsky
, et al. (5 additional authors not shown)
Abstract:
The rapid development and large body of literature on machine learning interatomic potentials (MLIPs) can make it difficult to know how to proceed for researchers who are not experts but wish to use these tools. The spirit of this review is to help such researchers by serving as a practical, accessible guide to the state-of-the-art in MLIPs. This review paper covers a broad range of topics related…
▽ More
The rapid development and large body of literature on machine learning interatomic potentials (MLIPs) can make it difficult to know how to proceed for researchers who are not experts but wish to use these tools. The spirit of this review is to help such researchers by serving as a practical, accessible guide to the state-of-the-art in MLIPs. This review paper covers a broad range of topics related to MLIPs, including (i) central aspects of how and why MLIPs are enablers of many exciting advancements in molecular modeling, (ii) the main underpinnings of different types of MLIPs, including their basic structure and formalism, (iii) the potentially transformative impact of universal MLIPs for both organic and inorganic systems, including an overview of the most recent advances, capabilities, downsides, and potential applications of this nascent class of MLIPs, (iv) a practical guide for estimating and understanding the execution speed of MLIPs, including guidance for users based on hardware availability, type of MLIP used, and prospective simulation size and time, (v) a manual for what MLIP a user should choose for a given application by considering hardware resources, speed requirements, energy and force accuracy requirements, as well as guidance for choosing pre-trained potentials or fitting a new potential from scratch, (vi) discussion around MLIP infrastructure, including sources of training data, pre-trained potentials, and hardware resources for training, (vii) summary of some key limitations of present MLIPs and current approaches to mitigate such limitations, including methods of including long-range interactions, handling magnetic systems, and treatment of excited states, and finally (viii) we finish with some more speculative thoughts on what the future holds for the development and application of MLIPs over the next 3-10+ years.
△ Less
Submitted 12 March, 2025;
originally announced March 2025.
-
On a Connection Between Imitation Learning and RLHF
Authors:
Teng Xiao,
Yige Yuan,
Mingxiao Li,
Zhengyu Chen,
Vasant G Honavar
Abstract:
This work studies the alignment of large language models with preference data from an imitation learning perspective. We establish a close theoretical connection between reinforcement learning from human feedback RLHF and imitation learning (IL), revealing that RLHF implicitly performs imitation learning on the preference data distribution. Building on this connection, we propose DIL, a principled…
▽ More
This work studies the alignment of large language models with preference data from an imitation learning perspective. We establish a close theoretical connection between reinforcement learning from human feedback RLHF and imitation learning (IL), revealing that RLHF implicitly performs imitation learning on the preference data distribution. Building on this connection, we propose DIL, a principled framework that directly optimizes the imitation learning objective. DIL provides a unified imitation learning perspective on alignment, encompassing existing alignment algorithms as special cases while naturally introducing new variants. By bridging IL and RLHF, DIL offers new insights into alignment with RLHF. Extensive experiments demonstrate that DIL outperforms existing methods on various challenging benchmarks.
△ Less
Submitted 6 March, 2025;
originally announced March 2025.
-
SimPER: A Minimalist Approach to Preference Alignment without Hyperparameters
Authors:
Teng Xiao,
Yige Yuan,
Zhengyu Chen,
Mingxiao Li,
Shangsong Liang,
Zhaochun Ren,
Vasant G Honavar
Abstract:
Existing preference optimization objectives for language model alignment require additional hyperparameters that must be extensively tuned to achieve optimal performance, increasing both the complexity and time required for fine-tuning large language models. In this paper, we propose a simple yet effective hyperparameter-free preference optimization algorithm for alignment. We observe that promisi…
▽ More
Existing preference optimization objectives for language model alignment require additional hyperparameters that must be extensively tuned to achieve optimal performance, increasing both the complexity and time required for fine-tuning large language models. In this paper, we propose a simple yet effective hyperparameter-free preference optimization algorithm for alignment. We observe that promising performance can be achieved simply by optimizing inverse perplexity, which is calculated as the inverse of the exponentiated average log-likelihood of the chosen and rejected responses in the preference dataset. The resulting simple learning objective, SimPER, is easy to implement and eliminates the need for expensive hyperparameter tuning and a reference model, making it both computationally and memory efficient. Extensive experiments on widely used real-world benchmarks, including MT-Bench, AlpacaEval 2, and 10 key benchmarks of the Open LLM Leaderboard with 5 base models, demonstrate that SimPER consistently and significantly outperforms existing approaches-even without any hyperparameters or a reference model . For example, despite its simplicity, SimPER outperforms state-of-the-art methods by up to 5.7 points on AlpacaEval 2 and achieves the highest average ranking across 10 benchmarks on the Open LLM Leaderboard. The source code for SimPER is publicly available at: https://github.com/tengxiao1/SimPER.
△ Less
Submitted 20 February, 2025; v1 submitted 2 February, 2025;
originally announced February 2025.
-
C-HDNet: A Fast Hyperdimensional Computing Based Method for Causal Effect Estimation from Networked Observational Data
Authors:
Abhishek Dalvi,
Neil Ashtekar,
Vasant Honavar
Abstract:
We consider the problem of estimating causal effects from observational data in the presence of network confounding. In this context, an individual's treatment assignment and outcomes may be affected by their neighbors within the network. We propose a novel matching technique which leverages hyperdimensional computing to model network information and improve predictive performance. We present resu…
▽ More
We consider the problem of estimating causal effects from observational data in the presence of network confounding. In this context, an individual's treatment assignment and outcomes may be affected by their neighbors within the network. We propose a novel matching technique which leverages hyperdimensional computing to model network information and improve predictive performance. We present results of extensive experiments which show that the proposed method outperforms or is competitive with the state-of-the-art methods for causal effect estimation from network data, including advanced computationally demanding deep learning methods. Further, our technique benefits from simplicity and speed, with roughly an order of magnitude lower runtime compared to state-of-the-art methods, while offering similar causal effect estimation error rates.
△ Less
Submitted 27 January, 2025;
originally announced January 2025.
-
Deep Learning within Tabular Data: Foundations, Challenges, Advances and Future Directions
Authors:
Weijieying Ren,
Tianxiang Zhao,
Yuqing Huang,
Vasant Honavar
Abstract:
Tabular data remains one of the most prevalent data types across a wide range of real-world applications, yet effective representation learning for this domain poses unique challenges due to its irregular patterns, heterogeneous feature distributions, and complex inter-column dependencies. This survey provides a comprehensive review of state-of-the-art techniques in tabular data representation lea…
▽ More
Tabular data remains one of the most prevalent data types across a wide range of real-world applications, yet effective representation learning for this domain poses unique challenges due to its irregular patterns, heterogeneous feature distributions, and complex inter-column dependencies. This survey provides a comprehensive review of state-of-the-art techniques in tabular data representation learning, structured around three foundational design elements: training data, neural architectures, and learning objectives. Unlike prior surveys that focus primarily on either architecture design or learning strategies, we adopt a holistic perspective that emphasizes the universality and robustness of representation learning methods across diverse downstream tasks. We examine recent advances in data augmentation and generation, specialized neural network architectures tailored to tabular data, and innovative learning objectives that enhance representation quality. Additionally, we highlight the growing influence of self-supervised learning and the adaptation of transformer-based foundation models for tabular data. Our review is based on a systematic literature search using rigorous inclusion criteria, encompassing 127 papers published since 2020 in top-tier conferences and journals. Through detailed analysis and comparison, we identify emerging trends, critical gaps, and promising directions for future research, aiming to guide the development of more generalizable and effective tabular data representation methods.
△ Less
Submitted 7 January, 2025;
originally announced January 2025.
-
Cal-DPO: Calibrated Direct Preference Optimization for Language Model Alignment
Authors:
Teng Xiao,
Yige Yuan,
Huaisheng Zhu,
Mingxiao Li,
Vasant G Honavar
Abstract:
We study the problem of aligning large language models (LLMs) with human preference data. Contrastive preference optimization has shown promising results in aligning LLMs with available preference data by optimizing the implicit reward associated with the policy. However, the contrastive objective focuses mainly on the relative values of implicit rewards associated with two responses while ignorin…
▽ More
We study the problem of aligning large language models (LLMs) with human preference data. Contrastive preference optimization has shown promising results in aligning LLMs with available preference data by optimizing the implicit reward associated with the policy. However, the contrastive objective focuses mainly on the relative values of implicit rewards associated with two responses while ignoring their actual values, resulting in suboptimal alignment with human preferences. To address this limitation, we propose calibrated direct preference optimization (Cal-DPO), a simple yet effective algorithm. We show that substantial improvement in alignment with the given preferences can be achieved simply by calibrating the implicit reward to ensure that the learned implicit rewards are comparable in scale to the ground-truth rewards. We demonstrate the theoretical advantages of Cal-DPO over existing approaches. The results of our experiments on a variety of standard benchmarks show that Cal-DPO remarkably improves off-the-shelf methods.
△ Less
Submitted 18 December, 2024;
originally announced December 2024.
-
GeomCLIP: Contrastive Geometry-Text Pre-training for Molecules
Authors:
Teng Xiao,
Chao Cui,
Huaisheng Zhu,
Vasant G. Honavar
Abstract:
Pretraining molecular representations is crucial for drug and material discovery. Recent methods focus on learning representations from geometric structures, effectively capturing 3D position information. Yet, they overlook the rich information in biomedical texts, which detail molecules' properties and substructures. With this in mind, we set up a data collection effort for 200K pairs of ground-s…
▽ More
Pretraining molecular representations is crucial for drug and material discovery. Recent methods focus on learning representations from geometric structures, effectively capturing 3D position information. Yet, they overlook the rich information in biomedical texts, which detail molecules' properties and substructures. With this in mind, we set up a data collection effort for 200K pairs of ground-state geometric structures and biomedical texts, resulting in a PubChem3D dataset. Based on this dataset, we propose the GeomCLIP framework to enhance for multi-modal representation learning from molecular structures and biomedical text. During pre-training, we design two types of tasks, i.e., multimodal representation alignment and unimodal denoising pretraining, to align the 3D geometric encoder with textual information and, at the same time, preserve its original representation power. Experimental results show the effectiveness of GeomCLIP in various tasks such as molecular property prediction, zero-shot text-molecule retrieval, and 3D molecule captioning. Our code and collected dataset are available at \url{https://github.com/xiaocui3737/GeomCLIP}
△ Less
Submitted 16 November, 2024;
originally announced November 2024.
-
How to Leverage Demonstration Data in Alignment for Large Language Model? A Self-Imitation Learning Perspective
Authors:
Teng Xiao,
Mingxiao Li,
Yige Yuan,
Huaisheng Zhu,
Chao Cui,
Vasant G Honavar
Abstract:
This paper introduces a novel generalized self-imitation learning ($\textbf{GSIL}$) framework, which effectively and efficiently aligns large language models with offline demonstration data. We develop $\textbf{GSIL}$ by deriving a surrogate objective of imitation learning with density ratio estimates, facilitating the use of self-generated data and optimizing the imitation learning objective with…
▽ More
This paper introduces a novel generalized self-imitation learning ($\textbf{GSIL}$) framework, which effectively and efficiently aligns large language models with offline demonstration data. We develop $\textbf{GSIL}$ by deriving a surrogate objective of imitation learning with density ratio estimates, facilitating the use of self-generated data and optimizing the imitation learning objective with simple classification losses. $\textbf{GSIL}$ eliminates the need for complex adversarial training in standard imitation learning, achieving lightweight and efficient fine-tuning for large language models. In addition, $\textbf{GSIL}$ encompasses a family of offline losses parameterized by a general class of convex functions for density ratio estimation and enables a unified view for alignment with demonstration data. Extensive experiments show that $\textbf{GSIL}$ consistently and significantly outperforms baselines in many challenging benchmarks, such as coding (HuamnEval), mathematical reasoning (GSM8K) and instruction-following benchmark (MT-Bench).
△ Less
Submitted 13 October, 2024;
originally announced October 2024.
-
Regression with Large Language Models for Materials and Molecular Property Prediction
Authors:
Ryan Jacobs,
Maciej P. Polak,
Lane E. Schultz,
Hamed Mahdavi,
Vasant Honavar,
Dane Morgan
Abstract:
We demonstrate the ability of large language models (LLMs) to perform material and molecular property regression tasks, a significant deviation from the conventional LLM use case. We benchmark the Large Language Model Meta AI (LLaMA) 3 on several molecular properties in the QM9 dataset and 24 materials properties. Only composition-based input strings are used as the model input and we fine tune on…
▽ More
We demonstrate the ability of large language models (LLMs) to perform material and molecular property regression tasks, a significant deviation from the conventional LLM use case. We benchmark the Large Language Model Meta AI (LLaMA) 3 on several molecular properties in the QM9 dataset and 24 materials properties. Only composition-based input strings are used as the model input and we fine tune on only the generative loss. We broadly find that LLaMA 3, when fine-tuned using the SMILES representation of molecules, provides useful regression results which can rival standard materials property prediction models like random forest or fully connected neural networks on the QM9 dataset. Not surprisingly, LLaMA 3 errors are 5-10x higher than those of the state-of-the-art models that were trained using far more granular representation of molecules (e.g., atom types and their coordinates) for the same task. Interestingly, LLaMA 3 provides improved predictions compared to GPT-3.5 and GPT-4o. This work highlights the versatility of LLMs, suggesting that LLM-like generative models can potentially transcend their traditional applications to tackle complex physical phenomena, thus paving the way for future research and applications in chemistry, materials science and other scientific domains.
△ Less
Submitted 9 September, 2024;
originally announced September 2024.
-
Causal Effect Estimation Using Random Hyperplane Tessellations
Authors:
Abhishek Dalvi,
Neil Ashtekar,
Vasant Honavar
Abstract:
Matching is one of the simplest approaches for estimating causal effects from observational data. Matching techniques compare the observed outcomes across pairs of individuals with similar covariate values but different treatment statuses in order to estimate causal effects. However, traditional matching techniques are unreliable given high-dimensional covariates due to the infamous curse of dimen…
▽ More
Matching is one of the simplest approaches for estimating causal effects from observational data. Matching techniques compare the observed outcomes across pairs of individuals with similar covariate values but different treatment statuses in order to estimate causal effects. However, traditional matching techniques are unreliable given high-dimensional covariates due to the infamous curse of dimensionality. To overcome this challenge, we propose a simple, fast, yet highly effective approach to matching using Random Hyperplane Tessellations (RHPT). First, we prove that the RHPT representation is an approximate balancing score -- thus maintaining the strong ignorability assumption -- and provide empirical evidence for this claim. Second, we report results of extensive experiments showing that matching using RHPT outperforms traditional matching techniques and is competitive with state-of-the-art deep learning methods for causal effect estimation. In addition, RHPT avoids the need for computationally expensive training of deep neural networks.
△ Less
Submitted 19 September, 2024; v1 submitted 16 April, 2024;
originally announced April 2024.
-
MolBind: Multimodal Alignment of Language, Molecules, and Proteins
Authors:
Teng Xiao,
Chao Cui,
Huaisheng Zhu,
Vasant G. Honavar
Abstract:
Recent advancements in biology and chemistry have leveraged multi-modal learning, integrating molecules and their natural language descriptions to enhance drug discovery. However, current pre-training frameworks are limited to two modalities, and designing a unified network to process different modalities (e.g., natural language, 2D molecular graphs, 3D molecular conformations, and 3D proteins) re…
▽ More
Recent advancements in biology and chemistry have leveraged multi-modal learning, integrating molecules and their natural language descriptions to enhance drug discovery. However, current pre-training frameworks are limited to two modalities, and designing a unified network to process different modalities (e.g., natural language, 2D molecular graphs, 3D molecular conformations, and 3D proteins) remains challenging due to inherent gaps among them. In this work, we propose MolBind, a framework that trains encoders for multiple modalities through contrastive learning, mapping all modalities to a shared feature space for multi-modal semantic alignment. To facilitate effective pre-training of MolBind on multiple modalities, we also build and collect a high-quality dataset with four modalities, MolBind-M4, including graph-language, conformation-language, graph-conformation, and conformation-protein paired data. MolBind shows superior zero-shot learning performance across a wide range of tasks, demonstrating its strong capability of capturing the underlying semantics of multiple modalities.
△ Less
Submitted 2 April, 2024; v1 submitted 12 March, 2024;
originally announced March 2024.
-
3M-Diffusion: Latent Multi-Modal Diffusion for Language-Guided Molecular Structure Generation
Authors:
Huaisheng Zhu,
Teng Xiao,
Vasant G Honavar
Abstract:
Generating molecular structures with desired properties is a critical task with broad applications in drug discovery and materials design. We propose 3M-Diffusion, a novel multi-modal molecular graph generation method, to generate diverse, ideally novel molecular structures with desired properties. 3M-Diffusion encodes molecular graphs into a graph latent space which it then aligns with the text s…
▽ More
Generating molecular structures with desired properties is a critical task with broad applications in drug discovery and materials design. We propose 3M-Diffusion, a novel multi-modal molecular graph generation method, to generate diverse, ideally novel molecular structures with desired properties. 3M-Diffusion encodes molecular graphs into a graph latent space which it then aligns with the text space learned by encoder-based LLMs from textual descriptions. It then reconstructs the molecular structure and atomic attributes based on the given text descriptions using the molecule decoder. It then learns a probabilistic mapping from the text space to the latent molecular graph space using a diffusion model. The results of our extensive experiments on several datasets demonstrate that 3M-Diffusion can generate high-quality, novel and diverse molecular graphs that semantically match the textual description provided.
△ Less
Submitted 2 October, 2024; v1 submitted 11 March, 2024;
originally announced March 2024.
-
Hyperdimensional Representation Learning for Node Classification and Link Prediction
Authors:
Abhishek Dalvi,
Vasant Honavar
Abstract:
We introduce Hyperdimensional Graph Learner (HDGL), a novel method for node classification and link prediction in graphs. HDGL maps node features into a very high-dimensional space (\textit{hyperdimensional} or HD space for short) using the \emph{injectivity} property of node representations in a family of Graph Neural Networks (GNNs) and then uses HD operators such as \textit{bundling} and \texti…
▽ More
We introduce Hyperdimensional Graph Learner (HDGL), a novel method for node classification and link prediction in graphs. HDGL maps node features into a very high-dimensional space (\textit{hyperdimensional} or HD space for short) using the \emph{injectivity} property of node representations in a family of Graph Neural Networks (GNNs) and then uses HD operators such as \textit{bundling} and \textit{binding} to aggregate information from the local neighborhood of each node yielding latent node representations that can support both node classification and link prediction tasks. HDGL, unlike GNNs that rely on computationally expensive iterative optimization and hyperparameter tuning, requires only a single pass through the data set. We report results of experiments using widely used benchmark datasets which demonstrate that, on the node classification task, HDGL achieves accuracy that is competitive with that of the state-of-the-art GNN methods at substantially reduced computational cost; and on the link prediction task, HDGL matches the performance of DeepWalk and related methods, although it falls short of computationally demanding state-of-the-art GNNs.
△ Less
Submitted 26 February, 2025; v1 submitted 26 February, 2024;
originally announced February 2024.
-
EsaCL: Efficient Continual Learning of Sparse Models
Authors:
Weijieying Ren,
Vasant G Honavar
Abstract:
A key challenge in the continual learning setting is to efficiently learn a sequence of tasks without forgetting how to perform previously learned tasks. Many existing approaches to this problem work by either retraining the model on previous tasks or by expanding the model to accommodate new tasks. However, these approaches typically suffer from increased storage and computational requirements, a…
▽ More
A key challenge in the continual learning setting is to efficiently learn a sequence of tasks without forgetting how to perform previously learned tasks. Many existing approaches to this problem work by either retraining the model on previous tasks or by expanding the model to accommodate new tasks. However, these approaches typically suffer from increased storage and computational requirements, a problem that is worsened in the case of sparse models due to need for expensive re-training after sparsification. To address this challenge, we propose a new method for efficient continual learning of sparse models (EsaCL) that can automatically prune redundant parameters without adversely impacting the model's predictive power, and circumvent the need of retraining. We conduct a theoretical analysis of loss landscapes with parameter pruning, and design a directional pruning (SDP) strategy that is informed by the sharpness of the loss function with respect to the model parameters. SDP ensures model with minimal loss of predictive accuracy, accelerating the learning of sparse models at each stage. To accelerate model update, we introduce an intelligent data selection (IDS) strategy that can identify critical instances for estimating loss landscape, yielding substantially improved data efficiency. The results of our experiments show that EsaCL achieves performance that is competitive with the state-of-the-art methods on three continual learning benchmarks, while using substantially reduced memory and computational resources.
△ Less
Submitted 10 January, 2024;
originally announced January 2024.
-
Representing and Reasoning with Multi-Stakeholder Qualitative Preference Queries
Authors:
Samik Basu,
Vasant Honavar,
Ganesh Ram Santhanam,
Jia Tao
Abstract:
Many decision-making scenarios, e.g., public policy, healthcare, business, and disaster response, require accommodating the preferences of multiple stakeholders. We offer the first formal treatment of reasoning with multi-stakeholder qualitative preferences in a setting where stakeholders express their preferences in a qualitative preference language, e.g., CP-net, CI-net, TCP-net, CP-Theory. We i…
▽ More
Many decision-making scenarios, e.g., public policy, healthcare, business, and disaster response, require accommodating the preferences of multiple stakeholders. We offer the first formal treatment of reasoning with multi-stakeholder qualitative preferences in a setting where stakeholders express their preferences in a qualitative preference language, e.g., CP-net, CI-net, TCP-net, CP-Theory. We introduce a query language for expressing queries against such preferences over sets of outcomes that satisfy specified criteria, e.g., $\mlangpref{ψ_1}{ψ_2}{A}$ (read loosely as the set of outcomes satisfying $ψ_1$ that are preferred over outcomes satisfying $ψ_2$ by a set of stakeholders $A$). Motivated by practical application scenarios, we introduce and analyze several alternative semantics for such queries, and examine their interrelationships. We provide a provably correct algorithm for answering multi-stakeholder qualitative preference queries using model checking in alternation-free $μ$-calculus. We present experimental results that demonstrate the feasibility of our approach.
△ Less
Submitted 30 July, 2023;
originally announced July 2023.
-
Forecasting User Interests Through Topic Tag Predictions in Online Health Communities
Authors:
Amogh Subbakrishna Adishesha,
Lily Jakielaszek,
Fariha Azhar,
Peixuan Zhang,
Vasant Honavar,
Fenglong Ma,
Chandra Belani,
Prasenjit Mitra,
Sharon Xiaolei Huang
Abstract:
The increasing reliance on online communities for healthcare information by patients and caregivers has led to the increase in the spread of misinformation, or subjective, anecdotal and inaccurate or non-specific recommendations, which, if acted on, could cause serious harm to the patients. Hence, there is an urgent need to connect users with accurate and tailored health information in a timely ma…
▽ More
The increasing reliance on online communities for healthcare information by patients and caregivers has led to the increase in the spread of misinformation, or subjective, anecdotal and inaccurate or non-specific recommendations, which, if acted on, could cause serious harm to the patients. Hence, there is an urgent need to connect users with accurate and tailored health information in a timely manner to prevent such harm. This paper proposes an innovative approach to suggesting reliable information to participants in online communities as they move through different stages in their disease or treatment. We hypothesize that patients with similar histories of disease progression or course of treatment would have similar information needs at comparable stages. Specifically, we pose the problem of predicting topic tags or keywords that describe the future information needs of users based on their profiles, traces of their online interactions within the community (past posts, replies) and the profiles and traces of online interactions of other users with similar profiles and similar traces of past interaction with the target users. The result is a variant of the collaborative information filtering or recommendation system tailored to the needs of users of online health communities. We report results of our experiments on an expert curated data set which demonstrate the superiority of the proposed approach over the state of the art baselines with respect to accurate and timely prediction of topic tags (and hence information sources of interest).
△ Less
Submitted 4 November, 2022;
originally announced November 2022.
-
Zooming Into the Darknet: Characterizing Internet Background Radiation and its Structural Changes
Authors:
Michalis Kallitsis,
Vasant Honavar,
Rupesh Prajapati,
Dinghao Wu,
John Yen
Abstract:
Network telescopes or "Darknets" provide a unique window into Internet-wide malicious activities associated with malware propagation, denial of service attacks, scanning performed for network reconnaissance, and others. Analyses of the resulting data can provide actionable insights to security analysts that can be used to prevent or mitigate cyber-threats. Large Darknets, however, observe millions…
▽ More
Network telescopes or "Darknets" provide a unique window into Internet-wide malicious activities associated with malware propagation, denial of service attacks, scanning performed for network reconnaissance, and others. Analyses of the resulting data can provide actionable insights to security analysts that can be used to prevent or mitigate cyber-threats. Large Darknets, however, observe millions of nefarious events on a daily basis which makes the transformation of the captured information into meaningful insights challenging. We present a novel framework for characterizing Darknet behavior and its temporal evolution aiming to address this challenge. The proposed framework: (i) Extracts a high dimensional representation of Darknet events composed of features distilled from Darknet data and other external sources; (ii) Learns, in an unsupervised fashion, an information-preserving low-dimensional representation of these events (using deep representation learning) that is amenable to clustering; (iv) Performs clustering of the scanner data in the resulting representation space and provides interpretable insights using optimal decision trees; and (v) Utilizes the clustering outcomes as "signatures" that can be used to detect structural changes in the Darknet activities. We evaluate the proposed system on a large operational Network Telescope and demonstrate its ability to detect real-world, high-impact cybersecurity incidents.
△ Less
Submitted 5 August, 2021; v1 submitted 28 July, 2021;
originally announced August 2021.
-
Explainable Multivariate Time Series Classification: A Deep Neural Network Which Learns To Attend To Important Variables As Well As Informative Time Intervals
Authors:
Tsung-Yu Hsieh,
Suhang Wang,
Yiwei Sun,
Vasant Honavar
Abstract:
Time series data is prevalent in a wide variety of real-world applications and it calls for trustworthy and explainable models for people to understand and fully trust decisions made by AI solutions. We consider the problem of building explainable classifiers from multi-variate time series data. A key criterion to understand such predictive models involves elucidating and quantifying the contribut…
▽ More
Time series data is prevalent in a wide variety of real-world applications and it calls for trustworthy and explainable models for people to understand and fully trust decisions made by AI solutions. We consider the problem of building explainable classifiers from multi-variate time series data. A key criterion to understand such predictive models involves elucidating and quantifying the contribution of time varying input variables to the classification. Hence, we introduce a novel, modular, convolution-based feature extraction and attention mechanism that simultaneously identifies the variables as well as time intervals which determine the classifier output. We present results of extensive experiments with several benchmark data sets that show that the proposed method outperforms the state-of-the-art baseline methods on multi-variate time series classification task. The results of our case studies demonstrate that the variables and time intervals identified by the proposed method make sense relative to available domain knowledge.
△ Less
Submitted 23 November, 2020;
originally announced November 2020.
-
Commuting Network Spillovers and COVID-19 Deaths Across US Counties
Authors:
Christopher Seto,
Aria Khademi,
Corina Graif,
Vasant G. Honavar
Abstract:
This study explored how population mobility flows form commuting networks across US counties and influence the spread of COVID-19. We utilized 3-level mixed effects negative binomial regression models to estimate the impact of network COVID-19 exposure on county confirmed cases and deaths over time. We also conducted weighting-based analyses to estimate the causal effect of network exposure. Resul…
▽ More
This study explored how population mobility flows form commuting networks across US counties and influence the spread of COVID-19. We utilized 3-level mixed effects negative binomial regression models to estimate the impact of network COVID-19 exposure on county confirmed cases and deaths over time. We also conducted weighting-based analyses to estimate the causal effect of network exposure. Results showed that commuting networks matter for COVID-19 deaths and cases, net of spatial proximity, socioeconomic, and demographic factors. Different local racial and ethnic concentrations are also associated with unequal outcomes. These findings suggest that commuting is an important causal mechanism in the spread of COVID-19 and highlight the significance of interconnected of communities. The results suggest that local level mitigation and prevention efforts are more effective when complemented by similar efforts in the network of connected places. Implications for research on inequality in health and flexible work arrangements are discussed.
△ Less
Submitted 10 February, 2021; v1 submitted 2 October, 2020;
originally announced October 2020.
-
A Causal Lens for Peeking into Black Box Predictive Models: Predictive Model Interpretation via Causal Attribution
Authors:
Aria Khademi,
Vasant Honavar
Abstract:
With the increasing adoption of predictive models trained using machine learning across a wide range of high-stakes applications, e.g., health care, security, criminal justice, finance, and education, there is a growing need for effective techniques for explaining such models and their predictions. We aim to address this problem in settings where the predictive model is a black box; That is, we ca…
▽ More
With the increasing adoption of predictive models trained using machine learning across a wide range of high-stakes applications, e.g., health care, security, criminal justice, finance, and education, there is a growing need for effective techniques for explaining such models and their predictions. We aim to address this problem in settings where the predictive model is a black box; That is, we can only observe the response of the model to various inputs, but have no knowledge about the internal structure of the predictive model, its parameters, the objective function, and the algorithm used to optimize the model. We reduce the problem of interpreting a black box predictive model to that of estimating the causal effects of each of the model inputs on the model output, from observations of the model inputs and the corresponding outputs. We estimate the causal effects of model inputs on model output using variants of the Rubin Neyman potential outcomes framework for estimating causal effects from observational data. We show how the resulting causal attribution of responsibility for model output to the different model inputs can be used to interpret the predictive model and to explain its predictions. We present results of experiments that demonstrate the effectiveness of our approach to the interpretation of black box predictive models via causal attribution in the case of deep neural network models trained on one synthetic data set (where the input variables that impact the output variable are known by design) and two real-world data sets: Handwritten digit classification, and Parkinson's disease severity prediction. Because our approach does not require knowledge about the predictive model algorithm and is free of assumptions regarding the black box predictive model except that its input-output responses be observable, it can be applied, in principle, to any black box predictive model.
△ Less
Submitted 1 August, 2020;
originally announced August 2020.
-
Longitudinal Deep Kernel Gaussian Process Regression
Authors:
Junjie Liang,
Yanting Wu,
Dongkuan Xu,
Vasant Honavar
Abstract:
Gaussian processes offer an attractive framework for predictive modeling from longitudinal data, i.e., irregularly sampled, sparse observations from a set of individuals over time. However, such methods have two key shortcomings: (i) They rely on ad hoc heuristics or expensive trial and error to choose the effective kernels, and (ii) They fail to handle multilevel correlation structure in the data…
▽ More
Gaussian processes offer an attractive framework for predictive modeling from longitudinal data, i.e., irregularly sampled, sparse observations from a set of individuals over time. However, such methods have two key shortcomings: (i) They rely on ad hoc heuristics or expensive trial and error to choose the effective kernels, and (ii) They fail to handle multilevel correlation structure in the data. We introduce Longitudinal deep kernel Gaussian process regression (L-DKGPR), which to the best of our knowledge, is the only method to overcome these limitations by fully automating the discovery of complex multilevel correlation structure from longitudinal data. Specifically, L-DKGPR eliminates the need for ad hoc heuristics or trial and error using a novel adaptation of deep kernel learning that combines the expressive power of deep neural networks with the flexibility of non-parametric kernel methods. L-DKGPR effectively learns the multilevel correlation with a novel addictive kernel that simultaneously accommodates both time-varying and the time-invariant effects. We derive an efficient algorithm to train L-DKGPR using latent space inducing points and variational inference. Results of extensive experiments on several benchmark data sets demonstrate that L-DKGPR significantly outperforms the state-of-the-art longitudinal data analysis (LDA) methods.
△ Less
Submitted 7 December, 2020; v1 submitted 24 May, 2020;
originally announced May 2020.
-
Towards Robust Relational Causal Discovery
Authors:
Sanghack Lee,
Vasant Honavar
Abstract:
We consider the problem of learning causal relationships from relational data. Existing approaches rely on queries to a relational conditional independence (RCI) oracle to establish and orient causal relations in such a setting. In practice, queries to a RCI oracle have to be replaced by reliable tests for RCI against available data. Relational data present several unique challenges in testing for…
▽ More
We consider the problem of learning causal relationships from relational data. Existing approaches rely on queries to a relational conditional independence (RCI) oracle to establish and orient causal relations in such a setting. In practice, queries to a RCI oracle have to be replaced by reliable tests for RCI against available data. Relational data present several unique challenges in testing for RCI. We study the conditions under which traditional iid-based conditional independence (CI) tests yield reliable answers to RCI queries against relational data. We show how to conduct CI tests against relational data to robustly recover the underlying relational causal structure. Results of our experiments demonstrate the effectiveness of our proposed approach.
△ Less
Submitted 5 December, 2019;
originally announced December 2019.
-
Algorithmic Bias in Recidivism Prediction: A Causal Perspective
Authors:
Aria Khademi,
Vasant Honavar
Abstract:
ProPublica's analysis of recidivism predictions produced by Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) software tool for the task, has shown that the predictions were racially biased against African American defendants. We analyze the COMPAS data using a causal reformulation of the underlying algorithmic fairness problem. Specifically, we assess whether COMPAS ex…
▽ More
ProPublica's analysis of recidivism predictions produced by Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) software tool for the task, has shown that the predictions were racially biased against African American defendants. We analyze the COMPAS data using a causal reformulation of the underlying algorithmic fairness problem. Specifically, we assess whether COMPAS exhibits racial bias against African American defendants using FACT, a recently introduced causality grounded measure of algorithmic fairness. We use the Neyman-Rubin potential outcomes framework for causal inference from observational data to estimate FACT from COMPAS data. Our analysis offers strong evidence that COMPAS exhibits racial bias against African American defendants. We further show that the FACT estimates from COMPAS data are robust in the presence of unmeasured confounding.
△ Less
Submitted 24 November, 2019;
originally announced November 2019.
-
LMLFM: Longitudinal Multi-Level Factorization Machine
Authors:
Junjie Liang,
Dongkuan Xu,
Yiwei Sun,
Vasant Honavar
Abstract:
We consider the problem of learning predictive models from longitudinal data, consisting of irregularly repeated, sparse observations from a set of individuals over time. Such data often exhibit {\em longitudinal correlation} (LC) (correlations among observations for each individual over time), {\em cluster correlation} (CC) (correlations among individuals that have similar characteristics), or bo…
▽ More
We consider the problem of learning predictive models from longitudinal data, consisting of irregularly repeated, sparse observations from a set of individuals over time. Such data often exhibit {\em longitudinal correlation} (LC) (correlations among observations for each individual over time), {\em cluster correlation} (CC) (correlations among individuals that have similar characteristics), or both. These correlations are often accounted for using {\em mixed effects models} that include {\em fixed effects} and {\em random effects}, where the fixed effects capture the regression parameters that are shared by all individuals, whereas random effects capture those parameters that vary across individuals. However, the current state-of-the-art methods are unable to select the most predictive fixed effects and random effects from a large number of variables, while accounting for complex correlation structure in the data and non-linear interactions among the variables. We propose Longitudinal Multi-Level Factorization Machine (LMLFM), to the best of our knowledge, the first model to address these challenges in learning predictive models from longitudinal data. We establish the convergence properties, and analyze the computational complexity, of LMLFM. We present results of experiments with both simulated and real-world longitudinal data which show that LMLFM outperforms the state-of-the-art methods in terms of predictive accuracy, variable selection ability, and scalability to data with large number of variables. The code and supplemental material is available at \url{https://github.com/junjieliang672/LMLFM}.
△ Less
Submitted 21 November, 2019; v1 submitted 10 November, 2019;
originally announced November 2019.
-
The Dynamical Gaussian Process Latent Variable Model in the Longitudinal Scenario
Authors:
Thanh Le,
Vasant Honavar
Abstract:
The Dynamical Gaussian Process Latent Variable Models provide an elegant non-parametric framework for learning the low dimensional representations of the high-dimensional time-series. Real world observational studies, however, are often ill-conditioned: the observations can be noisy, not assuming the luxury of relatively complete and equally spaced like those in time series. Such conditions make i…
▽ More
The Dynamical Gaussian Process Latent Variable Models provide an elegant non-parametric framework for learning the low dimensional representations of the high-dimensional time-series. Real world observational studies, however, are often ill-conditioned: the observations can be noisy, not assuming the luxury of relatively complete and equally spaced like those in time series. Such conditions make it difficult to learn reasonable representations in the high dimensional longitudinal data set by way of Gaussian Process Latent Variable Model as well as other dimensionality reduction procedures. In this study, we approach the inference of Gaussian Process Dynamical Systems in Longitudinal scenario by augmenting the bound in the variational approximation to include systematic samples of the unseen observations. We demonstrate the usefulness of this approach on synthetic as well as the human motion capture data set.
△ Less
Submitted 25 September, 2019;
originally announced September 2019.
-
Node Injection Attacks on Graphs via Reinforcement Learning
Authors:
Yiwei Sun,
Suhang Wang,
Xianfeng Tang,
Tsung-Yu Hsieh,
Vasant Honavar
Abstract:
Real-world graph applications, such as advertisements and product recommendations make profits based on accurately classify the label of the nodes. However, in such scenarios, there are high incentives for the adversaries to attack such graph to reduce the node classification performance. Previous work on graph adversarial attacks focus on modifying existing graph structures, which is infeasible i…
▽ More
Real-world graph applications, such as advertisements and product recommendations make profits based on accurately classify the label of the nodes. However, in such scenarios, there are high incentives for the adversaries to attack such graph to reduce the node classification performance. Previous work on graph adversarial attacks focus on modifying existing graph structures, which is infeasible in most real-world applications. In contrast, it is more practical to inject adversarial nodes into existing graphs, which can also potentially reduce the performance of the classifier. In this paper, we study the novel node injection poisoning attacks problem which aims to poison the graph. We describe a reinforcement learning based method, namely NIPA, to sequentially modify the adversarial information of the injected nodes. We report the results of experiments using several benchmark data sets that show the superior performance of the proposed method NIPA, relative to the existing state-of-the-art methods.
△ Less
Submitted 14 September, 2019;
originally announced September 2019.
-
MEGAN: A Generative Adversarial Network for Multi-View Network Embedding
Authors:
Yiwei Sun,
Suhang Wang,
Tsung-Yu Hsieh,
Xianfeng Tang,
Vasant Honavar
Abstract:
Data from many real-world applications can be naturally represented by multi-view networks where the different views encode different types of relationships (e.g., friendship, shared interests in music, etc.) between real-world individuals or entities. There is an urgent need for methods to obtain low-dimensional, information preserving and typically nonlinear embeddings of such multi-view network…
▽ More
Data from many real-world applications can be naturally represented by multi-view networks where the different views encode different types of relationships (e.g., friendship, shared interests in music, etc.) between real-world individuals or entities. There is an urgent need for methods to obtain low-dimensional, information preserving and typically nonlinear embeddings of such multi-view networks. However, most of the work on multi-view learning focuses on data that lack a network structure, and most of the work on network embeddings has focused primarily on single-view networks. Against this background, we consider the multi-view network representation learning problem, i.e., the problem of constructing low-dimensional information preserving embeddings of multi-view networks. Specifically, we investigate a novel Generative Adversarial Network (GAN) framework for Multi-View Network Embedding, namely MEGAN, aimed at preserving the information from the individual network views, while accounting for connectivity across (and hence complementarity of and correlations between) different views. The results of our experiments on two real-world multi-view data sets show that the embeddings obtained using MEGAN outperform the state-of-the-art methods on node classification, link prediction and visualization tasks.
△ Less
Submitted 19 August, 2019;
originally announced September 2019.
-
Fairness in Algorithmic Decision Making: An Excursion Through the Lens of Causality
Authors:
Aria Khademi,
Sanghack Lee,
David Foley,
Vasant Honavar
Abstract:
As virtually all aspects of our lives are increasingly impacted by algorithmic decision making systems, it is incumbent upon us as a society to ensure such systems do not become instruments of unfair discrimination on the basis of gender, race, ethnicity, religion, etc. We consider the problem of determining whether the decisions made by such systems are discriminatory, through the lens of causal…
▽ More
As virtually all aspects of our lives are increasingly impacted by algorithmic decision making systems, it is incumbent upon us as a society to ensure such systems do not become instruments of unfair discrimination on the basis of gender, race, ethnicity, religion, etc. We consider the problem of determining whether the decisions made by such systems are discriminatory, through the lens of causal models. We introduce two definitions of group fairness grounded in causality: fair on average causal effect (FACE), and fair on average causal effect on the treated (FACT). We use the Rubin-Neyman potential outcomes framework for the analysis of cause-effect relationships to robustly estimate FACE and FACT. We demonstrate the effectiveness of our proposed approach on synthetic data. Our analyses of two real-world data sets, the Adult income data set from the UCI repository (with gender as the protected attribute), and the NYC Stop and Frisk data set (with race as the protected attribute), show that the evidence of discrimination obtained by FACE and FACT, or lack thereof, is often in agreement with the findings from other studies. We further show that FACT, being somewhat more nuanced compared to FACE, can yield findings of discrimination that differ from those obtained using FACE.
△ Less
Submitted 27 March, 2019;
originally announced March 2019.
-
Improving Image Captioning by Leveraging Knowledge Graphs
Authors:
Yimin Zhou,
Yiwei Sun,
Vasant Honavar
Abstract:
We explore the use of a knowledge graphs, that capture general or commonsense knowledge, to augment the information extracted from images by the state-of-the-art methods for image captioning. The results of our experiments, on several benchmark data sets such as MS COCO, as measured by CIDEr-D, a performance metric for image captioning, show that the variants of the state-of-the-art methods for im…
▽ More
We explore the use of a knowledge graphs, that capture general or commonsense knowledge, to augment the information extracted from images by the state-of-the-art methods for image captioning. The results of our experiments, on several benchmark data sets such as MS COCO, as measured by CIDEr-D, a performance metric for image captioning, show that the variants of the state-of-the-art methods for image captioning that make use of the information extracted from knowledge graphs can substantially outperform those that rely solely on the information extracted from images.
△ Less
Submitted 25 January, 2019;
originally announced January 2019.
-
Top-N-Rank: A Scalable List-wise Ranking Method for Recommender Systems
Authors:
Junjie Liang,
Jinlong Hu,
Shoubin Dong,
Vasant Honavar
Abstract:
We propose Top-N-Rank, a novel family of list-wise Learning-to-Rank models for reliably recommending the N top-ranked items. The proposed models optimize a variant of the widely used discounted cumulative gain (DCG) objective function which differs from DCG in two important aspects: (i) It limits the evaluation of DCG only on the top N items in the ranked lists, thereby eliminating the impact of l…
▽ More
We propose Top-N-Rank, a novel family of list-wise Learning-to-Rank models for reliably recommending the N top-ranked items. The proposed models optimize a variant of the widely used discounted cumulative gain (DCG) objective function which differs from DCG in two important aspects: (i) It limits the evaluation of DCG only on the top N items in the ranked lists, thereby eliminating the impact of low-ranked items on the learned ranking function; and (ii) it incorporates weights that allow the model to leverage multiple types of implicit feedback with differing levels of reliability or trustworthiness. Because the resulting objective function is non-smooth and hence challenging to optimize, we consider two smooth approximations of the objective function, using the traditional sigmoid function and the rectified linear unit (ReLU). We propose a family of learning-to-rank algorithms (Top-N-Rank) that work with any smooth objective function. Then, a more efficient variant, Top-N-Rank.ReLU, is introduced, which effectively exploits the properties of ReLU function to reduce the computational complexity of Top-N-Rank from quadratic to linear in the average number of items rated by users. The results of our experiments using two widely used benchmarks, namely, the MovieLens data set and the Amazon Video Games data set demonstrate that: (i) The `top-N truncation' of the objective function substantially improves the ranking quality of the top N recommendations; (ii) using the ReLU for smoothing the objective function yields significant improvement in both ranking quality as well as runtime as compared to using the sigmoid; and (iii) Top-N-Rank.ReLU substantially outperforms the well-performing list-wise ranking methods in terms of ranking quality.
△ Less
Submitted 19 December, 2018; v1 submitted 10 December, 2018;
originally announced December 2018.
-
Multi-View Network Embedding Via Graph Factorization Clustering and Co-Regularized Multi-View Agreement
Authors:
Yiwei Sun,
Ngot Bui,
Tsung-Yu Hsieh,
Vasant Honavar
Abstract:
Real-world social networks and digital platforms are comprised of individuals (nodes) that are linked to other individuals or entities through multiple types of relationships (links). Sub-networks of such a network based on each type of link correspond to distinct views of the underlying network. In real-world applications, each node is typically linked to only a small subset of other nodes. Hence…
▽ More
Real-world social networks and digital platforms are comprised of individuals (nodes) that are linked to other individuals or entities through multiple types of relationships (links). Sub-networks of such a network based on each type of link correspond to distinct views of the underlying network. In real-world applications, each node is typically linked to only a small subset of other nodes. Hence, practical approaches to problems such as node labeling have to cope with the resulting sparse networks. While low-dimensional network embeddings offer a promising approach to this problem, most of the current network embedding methods focus primarily on single view networks. We introduce a novel multi-view network embedding (MVNE) algorithm for constructing low-dimensional node embeddings from multi-view networks. MVNE adapts and extends an approach to single view network embedding (SVNE) using graph factorization clustering (GFC) to the multi-view setting using an objective function that maximizes the agreement between views based on both the local and global structure of the underlying multi-view graph. Our experiments with several benchmark real-world single view networks show that GFC-based SVNE yields network embeddings that are competitive with or superior to those produced by the state-of-the-art single view network embedding methods when the embeddings are used for labeling unlabeled nodes in the networks. Our experiments with several multi-view networks show that MVNE substantially outperforms the single view methods on integrated view and the state-of-the-art multi-view methods. We further show that even when the goal is to predict labels of nodes within a single target view, MVNE outperforms its single-view counterpart suggesting that the MVNE is able to extract the information that is useful for labeling nodes in the target view from the all of the views.
△ Less
Submitted 18 February, 2019; v1 submitted 6 November, 2018;
originally announced November 2018.
-
Compositional Stochastic Average Gradient for Machine Learning and Related Applications
Authors:
Tsung-Yu Hsieh,
Yasser EL-Manzalawy,
Yiwei Sun,
Vasant Honavar
Abstract:
Many machine learning, statistical inference, and portfolio optimization problems require minimization of a composition of expected value functions (CEVF). Of particular interest is the finite-sum versions of such compositional optimization problems (FS-CEVF). Compositional stochastic variance reduced gradient (C-SVRG) methods that combine stochastic compositional gradient descent (SCGD) and stoch…
▽ More
Many machine learning, statistical inference, and portfolio optimization problems require minimization of a composition of expected value functions (CEVF). Of particular interest is the finite-sum versions of such compositional optimization problems (FS-CEVF). Compositional stochastic variance reduced gradient (C-SVRG) methods that combine stochastic compositional gradient descent (SCGD) and stochastic variance reduced gradient descent (SVRG) methods are the state-of-the-art methods for FS-CEVF problems. We introduce compositional stochastic average gradient descent (C-SAG) a novel extension of the stochastic average gradient method (SAG) to minimize composition of finite-sum functions. C-SAG, like SAG, estimates gradient by incorporating memory of previous gradient information. We present theoretical analyses of C-SAG which show that C-SAG, like SAG, and C-SVRG, achieves a linear convergence rate when the objective function is strongly convex; However, C-CAG achieves lower oracle query complexity per iteration than C-SVRG. Finally, we present results of experiments showing that C-SAG converges substantially faster than full gradient (FG), as well as C-SVRG.
△ Less
Submitted 7 September, 2018; v1 submitted 4 September, 2018;
originally announced September 2018.
-
Advances in Artificial Intelligence Require Progress Across all of Computer Science
Authors:
Gregory D. Hager,
Randal Bryant,
Eric Horvitz,
Maja Mataric,
Vasant Honavar
Abstract:
Advances in Artificial Intelligence require progress across all of computer science.
Advances in Artificial Intelligence require progress across all of computer science.
△ Less
Submitted 13 July, 2017;
originally announced July 2017.
-
Advanced Cyberinfrastructure for Science, Engineering, and Public Policy
Authors:
Vasant G. Honavar,
Katherine Yelick,
Klara Nahrstedt,
Holly Rushmeier,
Jennifer Rexford,
Mark D. Hill,
Elizabeth Bradley,
Elizabeth Mynatt
Abstract:
Progress in many domains increasingly benefits from our ability to view the systems through a computational lens, i.e., using computational abstractions of the domains; and our ability to acquire, share, integrate, and analyze disparate types of data. These advances would not be possible without the advanced data and computational cyberinfrastructure and tools for data capture, integration, analys…
▽ More
Progress in many domains increasingly benefits from our ability to view the systems through a computational lens, i.e., using computational abstractions of the domains; and our ability to acquire, share, integrate, and analyze disparate types of data. These advances would not be possible without the advanced data and computational cyberinfrastructure and tools for data capture, integration, analysis, modeling, and simulation. However, despite, and perhaps because of, advances in "big data" technologies for data acquisition, management and analytics, the other largely manual, and labor-intensive aspects of the decision making process, e.g., formulating questions, designing studies, organizing, curating, connecting, correlating and integrating crossdomain data, drawing inferences and interpreting results, have become the rate-limiting steps to progress. Advancing the capability and capacity for evidence-based improvements in science, engineering, and public policy requires support for (1) computational abstractions of the relevant domains coupled with computational methods and tools for their analysis, synthesis, simulation, visualization, sharing, and integration; (2) cognitive tools that leverage and extend the reach of human intellect, and partner with humans on all aspects of the activity; (3) nimble and trustworthy data cyber-infrastructures that connect, manage a variety of instruments, multiple interrelated data types and associated metadata, data representations, processes, protocols and workflows; and enforce applicable security and data access and use policies; and (4) organizational and social structures and processes for collaborative and coordinated activity across disciplinary and institutional boundaries.
△ Less
Submitted 30 June, 2017;
originally announced July 2017.
-
Big Data, Data Science, and Civil Rights
Authors:
Solon Barocas,
Elizabeth Bradley,
Vasant Honavar,
Foster Provost
Abstract:
Advances in data analytics bring with them civil rights implications. Data-driven and algorithmic decision making increasingly determine how businesses target advertisements to consumers, how police departments monitor individuals or groups, how banks decide who gets a loan and who does not, how employers hire, how colleges and universities make admissions and financial aid decisions, and much mor…
▽ More
Advances in data analytics bring with them civil rights implications. Data-driven and algorithmic decision making increasingly determine how businesses target advertisements to consumers, how police departments monitor individuals or groups, how banks decide who gets a loan and who does not, how employers hire, how colleges and universities make admissions and financial aid decisions, and much more. As data-driven decisions increasingly affect every corner of our lives, there is an urgent need to ensure they do not become instruments of discrimination, barriers to equality, threats to social justice, and sources of unfairness. In this paper, we argue for a concrete research agenda aimed at addressing these concerns, comprising five areas of emphasis: (i) Determining if models and modeling procedures exhibit objectionable bias; (ii) Building awareness of fairness into machine learning methods; (iii) Improving the transparency and control of data- and model-driven decision making; (iv) Looking beyond the algorithm(s) for sources of bias and unfairness-in the myriad human decisions made during the problem formulation and modeling process; and (v) Supporting the cross-disciplinary scholarship necessary to do all of that well.
△ Less
Submitted 9 June, 2017;
originally announced June 2017.
-
Accelerating Science: A Computing Research Agenda
Authors:
Vasant G. Honavar,
Mark D. Hill,
Katherine Yelick
Abstract:
The emergence of "big data" offers unprecedented opportunities for not only accelerating scientific advances but also enabling new modes of discovery. Scientific progress in many disciplines is increasingly enabled by our ability to examine natural phenomena through the computational lens, i.e., using algorithmic or information processing abstractions of the underlying processes; and our ability t…
▽ More
The emergence of "big data" offers unprecedented opportunities for not only accelerating scientific advances but also enabling new modes of discovery. Scientific progress in many disciplines is increasingly enabled by our ability to examine natural phenomena through the computational lens, i.e., using algorithmic or information processing abstractions of the underlying processes; and our ability to acquire, share, integrate and analyze disparate types of data. However, there is a huge gap between our ability to acquire, store, and process data and our ability to make effective use of the data to advance discovery. Despite successful automation of routine aspects of data management and analytics, most elements of the scientific process currently require considerable human expertise and effort. Accelerating science to keep pace with the rate of data acquisition and data processing calls for the development of algorithmic or information processing abstractions, coupled with formal methods and tools for modeling and simulation of natural processes as well as major innovations in cognitive tools for scientists, i.e., computational tools that leverage and extend the reach of human intellect, and partner with humans on a broad range of tasks in scientific discovery (e.g., identifying, prioritizing formulating questions, designing, prioritizing and executing experiments designed to answer a chosen question, drawing inferences and evaluating the results, and formulating new questions, in a closed-loop fashion). This calls for concerted research agenda aimed at: Development, analysis, integration, sharing, and simulation of algorithmic or information processing abstractions of natural processes, coupled with formal methods and tools for their analyses and simulation; Innovations in cognitive tools that augment and extend human intellect and partner with humans in all aspects of science.
△ Less
Submitted 6 April, 2016;
originally announced April 2016.
-
Lifted Representation of Relational Causal Models Revisited: Implications for Reasoning and Structure Learning
Authors:
Sanghack Lee,
Vasant Honavar
Abstract:
Maier et al. (2010) introduced the relational causal model (RCM) for representing and inferring causal relationships in relational data. A lifted representation, called abstract ground graph (AGG), plays a central role in reasoning with and learning of RCM. The correctness of the algorithm proposed by Maier et al. (2013a) for learning RCM from data relies on the soundness and completeness of AGG f…
▽ More
Maier et al. (2010) introduced the relational causal model (RCM) for representing and inferring causal relationships in relational data. A lifted representation, called abstract ground graph (AGG), plays a central role in reasoning with and learning of RCM. The correctness of the algorithm proposed by Maier et al. (2013a) for learning RCM from data relies on the soundness and completeness of AGG for relational d-separation to reduce the learning of an RCM to learning of an AGG. We revisit the definition of AGG and show that AGG, as defined in Maier et al. (2013b), does not correctly abstract all ground graphs. We revise the definition of AGG to ensure that it correctly abstracts all ground graphs. We further show that AGG representation is not complete for relational d-separation, that is, there can exist conditional independence relations in an RCM that are not entailed by AGG. A careful examination of the relationship between the lack of completeness of AGG for relational d-separation and faithfulness conditions suggests that weaker notions of completeness, namely adjacency faithfulness and orientation faithfulness between an RCM and its AGG, can be used to learn an RCM from data.
△ Less
Submitted 17 August, 2015; v1 submitted 9 August, 2015;
originally announced August 2015.
-
CRISNER: A Practically Efficient Reasoner for Qualitative Preferences
Authors:
Ganesh Ram Santhanam,
Samik Basu,
Vasant Honavar
Abstract:
We present CRISNER (Conditional & Relative Importance Statement Network PrEference Reasoner), a tool that provides practically efficient as well as exact reasoning about qualitative preferences in popular ceteris paribus preference languages such as CP-nets, TCP-nets, CP-theories, etc. The tool uses a model checking engine to translate preference specifications and queries into appropriate Kripke…
▽ More
We present CRISNER (Conditional & Relative Importance Statement Network PrEference Reasoner), a tool that provides practically efficient as well as exact reasoning about qualitative preferences in popular ceteris paribus preference languages such as CP-nets, TCP-nets, CP-theories, etc. The tool uses a model checking engine to translate preference specifications and queries into appropriate Kripke models and verifiable properties over them respectively. The distinguishing features of the tool are: (1) exact and provably correct query answering for testing dominance, consistency with respect to a preference specification, and testing equivalence and subsumption of two sets of preferences; (2) automatic generation of proofs evidencing the correctness of answer produced by CRISNER to any of the above queries; (3) XML inputs and outputs that make it portable and pluggable into other applications. We also describe the extensible architecture of CRISNER, which can be extended to new reference formalisms based on ceteris paribus semantics that may be developed in the future.
△ Less
Submitted 30 July, 2015;
originally announced July 2015.
-
Representing and Reasoning with Qualitative Preferences for Compositional Systems
Authors:
Ganesh Ram Santhanam,
Samik Basu,
Vasant Honavar
Abstract:
Many applications, e.g., Web service composition, complex system design, team formation, etc., rely on methods for identifying collections of objects or entities satisfying some functional requirement. Among the collections that satisfy the functional requirement, it is often necessary to identify one or more collections that are optimal with respect to user preferences over a set of attributes th…
▽ More
Many applications, e.g., Web service composition, complex system design, team formation, etc., rely on methods for identifying collections of objects or entities satisfying some functional requirement. Among the collections that satisfy the functional requirement, it is often necessary to identify one or more collections that are optimal with respect to user preferences over a set of attributes that describe the non-functional properties of the collection.
We develop a formalism that lets users express the relative importance among attributes and qualitative preferences over the valuations of each attribute. We define a dominance relation that allows us to compare collections of objects in terms of preferences over attributes of the objects that make up the collection. We establish some key properties of the dominance relation. In particular, we show that the dominance relation is a strict partial order when the intra-attribute preference relations are strict partial orders and the relative importance preference relation is an interval order.
We provide algorithms that use this dominance relation to identify the set of most preferred collections. We show that under certain conditions, the algorithms are guaranteed to return only (sound), all (complete), or at least one (weakly complete) of the most preferred collections. We present results of simulation experiments comparing the proposed algorithms with respect to (a) the quality of solutions (number of most preferred solutions) produced by the algorithms, and (b) their performance and efficiency. We also explore some interesting conjectures suggested by the results of our experiments that relate the properties of the user preferences, the dominance relation, and the algorithms.
△ Less
Submitted 16 January, 2014;
originally announced January 2014.
-
Efficient Markov Network Structure Discovery Using Independence Tests
Authors:
Facundo Bromberg,
Dimitris Margaritis,
Vasant Honavar
Abstract:
We present two algorithms for learning the structure of a Markov network from data: GSMN* and GSIMN. Both algorithms use statistical independence tests to infer the structure by successively constraining the set of structures consistent with the results of these tests. Until very recently, algorithms for structure learning were based on maximum likelihood estimation, which has been proved to…
▽ More
We present two algorithms for learning the structure of a Markov network from data: GSMN* and GSIMN. Both algorithms use statistical independence tests to infer the structure by successively constraining the set of structures consistent with the results of these tests. Until very recently, algorithms for structure learning were based on maximum likelihood estimation, which has been proved to be NP-hard for Markov networks due to the difficulty of estimating the parameters of the network, needed for the computation of the data likelihood. The independence-based approach does not require the computation of the likelihood, and thus both GSMN* and GSIMN can compute the structure efficiently (as shown in our experiments). GSMN* is an adaptation of the Grow-Shrink algorithm of Margaritis and Thrun for learning the structure of Bayesian networks. GSIMN extends GSMN* by additionally exploiting Pearls well-known properties of the conditional independence relation to infer novel independences from known ones, thus avoiding the performance of statistical tests to estimate them. To accomplish this efficiently GSIMN uses the Triangle theorem, also introduced in this work, which is a simplified version of the set of Markov axioms. Experimental comparisons on artificial and real-world data sets show GSIMN can yield significant savings with respect to GSMN*, while generating a Markov network with comparable or in some cases improved quality. We also compare GSIMN to a forward-chaining implementation, called GSIMN-FCH, that produces all possible conditional independences resulting from repeatedly applying Pearls theorems on the known conditional independence tests. The results of this comparison show that GSIMN, by the sole use of the Triangle theorem, is nearly optimal in terms of the set of independences tests that it infers.
△ Less
Submitted 15 January, 2014;
originally announced January 2014.
-
Causal Transportability of Experiments on Controllable Subsets of Variables: z-Transportability
Authors:
Sanghack Lee,
Vasant Honavar
Abstract:
We introduce z-transportability, the problem of estimating the causal effect of a set of variables X on another set of variables Y in a target domain from experiments on any subset of controllable variables Z where Z is an arbitrary subset of observable variables V in a source domain. z-Transportability generalizes z-identifiability, the problem of estimating in a given domain the causal effect of…
▽ More
We introduce z-transportability, the problem of estimating the causal effect of a set of variables X on another set of variables Y in a target domain from experiments on any subset of controllable variables Z where Z is an arbitrary subset of observable variables V in a source domain. z-Transportability generalizes z-identifiability, the problem of estimating in a given domain the causal effect of X on Y from surrogate experiments on a set of variables Z such that Z is disjoint from X;. z-Transportability also generalizes transportability which requires that the causal effect of X on Y in the target domain be estimable from experiments on any subset of all observable variables in the source domain. We first generalize z-identifiability to allow cases where Z is not necessarily disjoint from X. Then, we establish a necessary and sufficient condition for z-transportability in terms of generalized z-identifiability and transportability. We provide a correct and complete algorithm that determines whether a causal effect is z-transportable; and if it is, produces a transport formula, that is, a recipe for estimating the causal effect of X on Y in the target domain using information elicited from the results of experimental manipulations of Z in the source domain and observational data from the target domain. Our results also show that do-calculus is complete for z-transportability.
△ Less
Submitted 26 September, 2013;
originally announced September 2013.
-
Abstraction Super-structuring Normal Forms: Towards a Theory of Structural Induction
Authors:
Adrian Silvescu,
Vasant Honavar
Abstract:
Induction is the process by which we obtain predictive laws or theories or models of the world. We consider the structural aspect of induction. We answer the question as to whether we can find a finite and minmalistic set of operations on structural elements in terms of which any theory can be expressed. We identify abstraction (grouping similar entities) and super-structuring (combining topologic…
▽ More
Induction is the process by which we obtain predictive laws or theories or models of the world. We consider the structural aspect of induction. We answer the question as to whether we can find a finite and minmalistic set of operations on structural elements in terms of which any theory can be expressed. We identify abstraction (grouping similar entities) and super-structuring (combining topologically e.g., spatio-temporally close entities) as the essential structural operations in the induction process. We show that only two more structural operations, namely, reverse abstraction and reverse super-structuring (the duals of abstraction and super-structuring respectively) suffice in order to exploit the full power of Turing-equivalent generative grammars in induction. We explore the implications of this theorem with respect to the nature of hidden variables, radical positivism and the 2-century old claim of David Hume about the principles of connexion among ideas.
△ Less
Submitted 3 July, 2011;
originally announced July 2011.
-
Identifying Interaction Sites in "Recalcitrant" Proteins: Predicted Protein and Rna Binding Sites in Rev Proteins of Hiv-1 and Eiav Agree with Experimental Data
Authors:
Michael Terribilini,
Jae-Hyung Lee,
Changhui Yan,
Robert L. Jernigan,
Susan Carpenter,
Vasant Honavar,
Drena Dobbs
Abstract:
Protein-protein and protein nucleic acid interactions are vitally important for a wide range of biological processes, including regulation of gene expression, protein synthesis, and replication and assembly of many viruses. We have developed machine learning approaches for predicting which amino acids of a protein participate in its interactions with other proteins and/or nucleic acids, using on…
▽ More
Protein-protein and protein nucleic acid interactions are vitally important for a wide range of biological processes, including regulation of gene expression, protein synthesis, and replication and assembly of many viruses. We have developed machine learning approaches for predicting which amino acids of a protein participate in its interactions with other proteins and/or nucleic acids, using only the protein sequence as input. In this paper, we describe an application of classifiers trained on datasets of well-characterized protein-protein and protein-RNA complexes for which experimental structures are available. We apply these classifiers to the problem of predicting protein and RNA binding sites in the sequence of a clinically important protein for which the structure is not known: the regulatory protein Rev, essential for the replication of HIV-1 and other lentiviruses. We compare our predictions with published biochemical, genetic and partial structural information for HIV-1 and EIAV Rev and with our own published experimental mapping of RNA binding sites in EIAV Rev. The predicted and experimentally determined binding sites are in very good agreement. The ability to predict reliably the residues of a protein that directly contribute to specific binding events - without the requirement for structural information regarding either the protein or complexes in which it participates - can potentially generate new disease intervention strategies.
△ Less
Submitted 20 November, 2005;
originally announced November 2005.