-
Training Video Foundation Models with NVIDIA NeMo
Authors:
Zeeshan Patel,
Ethan He,
Parth Mannan,
Xiaowei Ren,
Ryan Wolf,
Niket Agarwal,
Jacob Huffman,
Zhuoyao Wang,
Carl Wang,
Jack Chang,
Yan Bai,
Tommy Huang,
Linnan Wang,
Sahil Jain,
Shanmugam Ramasamy,
Joseph Jennings,
Ekaterina Sirazitdinova,
Oleg Sudakov,
Mingyuan Ma,
Bobby Chen,
Forrest Lin,
Hao Wang,
Vasanth Rao Naik Sabavat,
Sriharsha Niverty,
Rong Ou
, et al. (4 additional authors not shown)
Abstract:
Video Foundation Models (VFMs) have recently been used to simulate the real world to train physical AI systems and develop creative visual experiences. However, there are significant challenges in training large-scale, high quality VFMs that can generate high-quality videos. We present a scalable, open-source VFM training pipeline with NVIDIA NeMo, providing accelerated video dataset curation, mul…
▽ More
Video Foundation Models (VFMs) have recently been used to simulate the real world to train physical AI systems and develop creative visual experiences. However, there are significant challenges in training large-scale, high quality VFMs that can generate high-quality videos. We present a scalable, open-source VFM training pipeline with NVIDIA NeMo, providing accelerated video dataset curation, multimodal data loading, and parallelized video diffusion model training and inference. We also provide a comprehensive performance analysis highlighting best practices for efficient VFM training and inference.
△ Less
Submitted 17 March, 2025;
originally announced March 2025.
-
Cosmos World Foundation Model Platform for Physical AI
Authors:
NVIDIA,
:,
Niket Agarwal,
Arslan Ali,
Maciej Bala,
Yogesh Balaji,
Erik Barker,
Tiffany Cai,
Prithvijit Chattopadhyay,
Yongxin Chen,
Yin Cui,
Yifan Ding,
Daniel Dworakowski,
Jiaojiao Fan,
Michele Fenzi,
Francesco Ferroni,
Sanja Fidler,
Dieter Fox,
Songwei Ge,
Yunhao Ge,
Jinwei Gu,
Siddharth Gururani,
Ethan He,
Jiahui Huang,
Jacob Huffman
, et al. (54 additional authors not shown)
Abstract:
Physical AI needs to be trained digitally first. It needs a digital twin of itself, the policy model, and a digital twin of the world, the world model. In this paper, we present the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups. We position a world foundation model as a general-purpose world model that can be fine-tuned into cu…
▽ More
Physical AI needs to be trained digitally first. It needs a digital twin of itself, the policy model, and a digital twin of the world, the world model. In this paper, we present the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups. We position a world foundation model as a general-purpose world model that can be fine-tuned into customized world models for downstream applications. Our platform covers a video curation pipeline, pre-trained world foundation models, examples of post-training of pre-trained world foundation models, and video tokenizers. To help Physical AI builders solve the most critical problems of our society, we make Cosmos open-source and our models open-weight with permissive licenses available via https://github.com/nvidia-cosmos/cosmos-predict1.
△ Less
Submitted 18 March, 2025; v1 submitted 7 January, 2025;
originally announced January 2025.
-
MALADE: Orchestration of LLM-powered Agents with Retrieval Augmented Generation for Pharmacovigilance
Authors:
Jihye Choi,
Nils Palumbo,
Prasad Chalasani,
Matthew M. Engelhard,
Somesh Jha,
Anivarya Kumar,
David Page
Abstract:
In the era of Large Language Models (LLMs), given their remarkable text understanding and generation abilities, there is an unprecedented opportunity to develop new, LLM-based methods for trustworthy medical knowledge synthesis, extraction and summarization. This paper focuses on the problem of Pharmacovigilance (PhV), where the significance and challenges lie in identifying Adverse Drug Events (A…
▽ More
In the era of Large Language Models (LLMs), given their remarkable text understanding and generation abilities, there is an unprecedented opportunity to develop new, LLM-based methods for trustworthy medical knowledge synthesis, extraction and summarization. This paper focuses on the problem of Pharmacovigilance (PhV), where the significance and challenges lie in identifying Adverse Drug Events (ADEs) from diverse text sources, such as medical literature, clinical notes, and drug labels. Unfortunately, this task is hindered by factors including variations in the terminologies of drugs and outcomes, and ADE descriptions often being buried in large amounts of narrative text. We present MALADE, the first effective collaborative multi-agent system powered by LLM with Retrieval Augmented Generation for ADE extraction from drug label data. This technique involves augmenting a query to an LLM with relevant information extracted from text resources, and instructing the LLM to compose a response consistent with the augmented data. MALADE is a general LLM-agnostic architecture, and its unique capabilities are: (1) leveraging a variety of external sources, such as medical literature, drug labels, and FDA tools (e.g., OpenFDA drug information API), (2) extracting drug-outcome association in a structured format along with the strength of the association, and (3) providing explanations for established associations. Instantiated with GPT-4 Turbo or GPT-4o, and FDA drug label data, MALADE demonstrates its efficacy with an Area Under ROC Curve of 0.90 against the OMOP Ground Truth table of ADEs. Our implementation leverages the Langroid multi-agent LLM framework and can be found at https://github.com/jihyechoi77/malade.
△ Less
Submitted 3 August, 2024;
originally announced August 2024.
-
Super Mario in the Pernicious Kingdoms: Classifying glitches in old games
Authors:
Llewellyn Forward,
Io Limmer,
Joseph Hallett,
Dan Page
Abstract:
In a case study spanning four classic Super Mario games and the analysis of 237 known glitches within them, we classify a variety of weaknesses that are exploited by speedrunners to enable them to beat games quickly and in surprising ways. Using the Seven Pernicious Kingdoms software defect taxonomy and the Common Weakness Enumeration, we categorize the glitches by the weaknesses that enable them.…
▽ More
In a case study spanning four classic Super Mario games and the analysis of 237 known glitches within them, we classify a variety of weaknesses that are exploited by speedrunners to enable them to beat games quickly and in surprising ways. Using the Seven Pernicious Kingdoms software defect taxonomy and the Common Weakness Enumeration, we categorize the glitches by the weaknesses that enable them. We identify 7 new weaknesses that appear specific to games and which are not covered by current software weakness taxonomies.
△ Less
Submitted 23 April, 2024;
originally announced April 2024.
-
A multi-channel cycleGAN for CBCT to CT synthesis
Authors:
Chelsea A. H. Sargeant,
Edward G. A. Henderson,
Dónal M. McSweeney,
Aaron G. Rankin,
Denis Page
Abstract:
Image synthesis is used to generate synthetic CTs (sCTs) from on-treatment cone-beam CTs (CBCTs) with a view to improving image quality and enabling accurate dose computation to facilitate a CBCT-based adaptive radiotherapy workflow. As this area of research gains momentum, developments in sCT generation methods are difficult to compare due to the lack of large public datasets and sizeable variati…
▽ More
Image synthesis is used to generate synthetic CTs (sCTs) from on-treatment cone-beam CTs (CBCTs) with a view to improving image quality and enabling accurate dose computation to facilitate a CBCT-based adaptive radiotherapy workflow. As this area of research gains momentum, developments in sCT generation methods are difficult to compare due to the lack of large public datasets and sizeable variation in training procedures. To compare and assess the latest advancements in sCT generation, the SynthRAD2023 challenge provides a public dataset and evaluation framework for both MR and CBCT to sCT synthesis. Our contribution focuses on the second task, CBCT-to-sCT synthesis. By leveraging a multi-channel input to emphasize specific image features, our approach effectively addresses some of the challenges inherent in CBCT imaging, whilst restoring the contrast necessary for accurate visualisation of patients' anatomy. Additionally, we introduce an auxiliary fusion network to further enhance the fidelity of generated sCT images.
△ Less
Submitted 4 December, 2023;
originally announced December 2023.
-
Neural Markov Prolog
Authors:
Alexander Thomson,
David Page
Abstract:
The recent rapid advance of AI has been driven largely by innovations in neural network architectures. A concomitant concern is how to understand these resulting systems. In this paper, we propose a tool to assist in both the design of further innovative architectures and the simple yet precise communication of their structure. We propose the language Neural Markov Prolog (NMP), based on both Mark…
▽ More
The recent rapid advance of AI has been driven largely by innovations in neural network architectures. A concomitant concern is how to understand these resulting systems. In this paper, we propose a tool to assist in both the design of further innovative architectures and the simple yet precise communication of their structure. We propose the language Neural Markov Prolog (NMP), based on both Markov logic and Prolog, as a means to both bridge first order logic and neural network design and to allow for the easy generation and presentation of architectures for images, text, relational databases, or other target data types or their mixtures.
△ Less
Submitted 27 November, 2023;
originally announced December 2023.
-
Differentially Private Multi-Site Treatment Effect Estimation
Authors:
Tatsuki Koga,
Kamalika Chaudhuri,
David Page
Abstract:
Patient privacy is a major barrier to healthcare AI. For confidentiality reasons, most patient data remains in silo in separate hospitals, preventing the design of data-driven healthcare AI systems that need large volumes of patient data to make effective decisions. A solution to this is collective learning across multiple sites through federated learning with differential privacy. However, litera…
▽ More
Patient privacy is a major barrier to healthcare AI. For confidentiality reasons, most patient data remains in silo in separate hospitals, preventing the design of data-driven healthcare AI systems that need large volumes of patient data to make effective decisions. A solution to this is collective learning across multiple sites through federated learning with differential privacy. However, literature in this space typically focuses on differentially private statistical estimation and machine learning, which is different from the causal inference-related problems that arise in healthcare. In this work, we take a fresh look at federated learning with a focus on causal inference; specifically, we look at estimating the average treatment effect (ATE), an important task in causal inference for healthcare applications, and provide a federated analytics approach to enable ATE estimation across multiple sites along with differential privacy (DP) guarantees at each site. The main challenge comes from site heterogeneity -- different sites have different sample sizes and privacy budgets. We address this through a class of per-site estimation algorithms that reports the ATE estimate and its variance as a quality measure, and an aggregation algorithm on the server side that minimizes the overall variance of the final ATE estimate. Our experiments on real and synthetic data show that our method reliably aggregates private statistics across sites and provides better privacy-utility tradeoff under site heterogeneity than baselines.
△ Less
Submitted 9 October, 2023;
originally announced October 2023.
-
On Neural Networks as Infinite Tree-Structured Probabilistic Graphical Models
Authors:
Boyao Li,
Alexander J. Thomson,
Houssam Nassif,
Matthew M. Engelhard,
David Page
Abstract:
Deep neural networks (DNNs) lack the precise semantics and definitive probabilistic interpretation of probabilistic graphical models (PGMs). In this paper, we propose an innovative solution by constructing infinite tree-structured PGMs that correspond exactly to neural networks. Our research reveals that DNNs, during forward propagation, indeed perform approximations of PGM inference that are prec…
▽ More
Deep neural networks (DNNs) lack the precise semantics and definitive probabilistic interpretation of probabilistic graphical models (PGMs). In this paper, we propose an innovative solution by constructing infinite tree-structured PGMs that correspond exactly to neural networks. Our research reveals that DNNs, during forward propagation, indeed perform approximations of PGM inference that are precise in this alternative PGM structure. Not only does our research complement existing studies that describe neural networks as kernel machines or infinite-sized Gaussian processes, it also elucidates a more direct approximation that DNNs make to exact inference in PGMs. Potential benefits include improved pedagogy and interpretation of DNNs, and algorithms that can merge the strengths of PGMs and DNNs.
△ Less
Submitted 17 January, 2025; v1 submitted 27 May, 2023;
originally announced May 2023.
-
Variable Importance Matching for Causal Inference
Authors:
Quinn Lanners,
Harsh Parikh,
Alexander Volfovsky,
Cynthia Rudin,
David Page
Abstract:
Our goal is to produce methods for observational causal inference that are auditable, easy to troubleshoot, accurate for treatment effect estimation, and scalable to high-dimensional data. We describe a general framework called Model-to-Match that achieves these goals by (i) learning a distance metric via outcome modeling, (ii) creating matched groups using the distance metric, and (iii) using the…
▽ More
Our goal is to produce methods for observational causal inference that are auditable, easy to troubleshoot, accurate for treatment effect estimation, and scalable to high-dimensional data. We describe a general framework called Model-to-Match that achieves these goals by (i) learning a distance metric via outcome modeling, (ii) creating matched groups using the distance metric, and (iii) using the matched groups to estimate treatment effects. Model-to-Match uses variable importance measurements to construct a distance metric, making it a flexible framework that can be adapted to various applications. Concentrating on the scalability of the problem in the number of potential confounders, we operationalize the Model-to-Match framework with LASSO. We derive performance guarantees for settings where LASSO outcome modeling consistently identifies all confounders (importantly without requiring the linear model to be correctly specified). We also provide experimental results demonstrating the method's auditability, accuracy, and scalability as well as extensions to more general nonparametric outcome modeling.
△ Less
Submitted 28 June, 2023; v1 submitted 22 February, 2023;
originally announced February 2023.
-
Predicting Drug-Drug Interactions from Heterogeneous Data: An Embedding Approach
Authors:
Devendra Singh Dhami,
Siwen Yan,
Gautam Kunapuli,
David Page,
Sriraam Natarajan
Abstract:
Predicting and discovering drug-drug interactions (DDIs) using machine learning has been studied extensively. However, most of the approaches have focused on text data or textual representation of the drug structures. We present the first work that uses multiple data sources such as drug structure images, drug structure string representation and relational representation of drug relationships as t…
▽ More
Predicting and discovering drug-drug interactions (DDIs) using machine learning has been studied extensively. However, most of the approaches have focused on text data or textual representation of the drug structures. We present the first work that uses multiple data sources such as drug structure images, drug structure string representation and relational representation of drug relationships as the input. To this effect, we exploit the recent advances in deep networks to integrate these varied sources of inputs in predicting DDIs. Our empirical evaluation against several state-of-the-art methods using standalone different data types for drugs clearly demonstrate the efficacy of combining heterogeneous data in predicting DDIs.
△ Less
Submitted 19 March, 2021;
originally announced March 2021.
-
High-Throughput Approach to Modeling Healthcare Costs Using Electronic Healthcare Records
Authors:
Alex Taylor,
Ross Kleiman,
Scott Hebbring,
Peggy Peissig,
David Page
Abstract:
Accurate estimation of healthcare costs is crucial for healthcare systems to plan and effectively negotiate with insurance companies regarding the coverage of patient-care costs. Greater accuracy in estimating healthcare costs would provide mutual benefit for both health systems and the insurers that support these systems by better aligning payment models with patient-care costs. This study presen…
▽ More
Accurate estimation of healthcare costs is crucial for healthcare systems to plan and effectively negotiate with insurance companies regarding the coverage of patient-care costs. Greater accuracy in estimating healthcare costs would provide mutual benefit for both health systems and the insurers that support these systems by better aligning payment models with patient-care costs. This study presents the results of a generalizable machine learning approach to predicting medical events built from 40 years of data from >860,000 patients pertaining to >6,700 prescription medications, courtesy of Marshfield Clinic in Wisconsin. It was found that models built using this approach performed well when compared to similar studies predicting physician prescriptions of individual medications. In addition to providing a comprehensive predictive model for all drugs in a large healthcare system, the approach taken in this research benefits from potential applicability to a wide variety of other medical events.
△ Less
Submitted 1 June, 2022; v1 submitted 18 November, 2020;
originally announced November 2020.
-
Temporal Poisson Square Root Graphical Models
Authors:
Sinong Geng,
Zhaobin Kuang,
Peggy Peissig,
David Page
Abstract:
We propose temporal Poisson square root graphical models (TPSQRs), a generalization of Poisson square root graphical models (PSQRs) specifically designed for modeling longitudinal event data. By estimating the temporal relationships for all possible pairs of event types, TPSQRs can offer a holistic perspective about whether the occurrences of any given event type could excite or inhibit any other…
▽ More
We propose temporal Poisson square root graphical models (TPSQRs), a generalization of Poisson square root graphical models (PSQRs) specifically designed for modeling longitudinal event data. By estimating the temporal relationships for all possible pairs of event types, TPSQRs can offer a holistic perspective about whether the occurrences of any given event type could excite or inhibit any other type. A TPSQR is learned by estimating a collection of interrelated PSQRs that share the same template parameterization. These PSQRs are estimated jointly in a pseudo-likelihood fashion, where Poisson pseudo-likelihood is used to approximate the original more computationally-intensive pseudo-likelihood problem stemming from PSQRs. Theoretically, we demonstrate that under mild assumptions, the Poisson pseudo-likelihood approximation is sparsistent for recovering the underlying PSQR. Empirically, we learn TPSQRs from Marshfield Clinic electronic health records (EHRs) with millions of drug prescription and condition diagnosis events, for adverse drug reaction (ADR) detection. Experimental results demonstrate that the learned TPSQRs can recover ADR signals from the EHR effectively and efficiently.
△ Less
Submitted 12 May, 2020;
originally announced May 2020.
-
Stochastic Learning for Sparse Discrete Markov Random Fields with Controlled Gradient Approximation Error
Authors:
Sinong Geng,
Zhaobin Kuang,
Jie Liu,
Stephen Wright,
David Page
Abstract:
We study the $L_1$-regularized maximum likelihood estimator/estimation (MLE) problem for discrete Markov random fields (MRFs), where efficient and scalable learning requires both sparse regularization and approximate inference. To address these challenges, we consider a stochastic learning framework called stochastic proximal gradient (SPG; Honorio 2012a, Atchade et al. 2014,Miasojedow and Rejchel…
▽ More
We study the $L_1$-regularized maximum likelihood estimator/estimation (MLE) problem for discrete Markov random fields (MRFs), where efficient and scalable learning requires both sparse regularization and approximate inference. To address these challenges, we consider a stochastic learning framework called stochastic proximal gradient (SPG; Honorio 2012a, Atchade et al. 2014,Miasojedow and Rejchel 2016). SPG is an inexact proximal gradient algorithm [Schmidtet al., 2011], whose inexactness stems from the stochastic oracle (Gibbs sampling) for gradient approximation - exact gradient evaluation is infeasible in general due to the NP-hard inference problem for discrete MRFs [Koller and Friedman, 2009]. Theoretically, we provide novel verifiable bounds to inspect and control the quality of gradient approximation. Empirically, we propose the tighten asymptotically (TAY) learning strategy based on the verifiable bounds to boost the performance of SPG.
△ Less
Submitted 12 May, 2020;
originally announced May 2020.
-
Discrimination Among Multiple Cutaneous and Proprioceptive Hand Percepts Evoked by Nerve Stimulation with Utah Slanted Electrode Arrays in Human Amputees
Authors:
David M. Page,
Suzanne M. Wendelken,
Tyler S. Davis,
David T. Kluger,
Douglas T. Hutchinson,
Jacob A. George,
Gregory A. Clark
Abstract:
Objective: This paper aims to demonstrate functional discriminability among restored hand sensations with different locations, qualities, and intensities that are evoked by microelectrode stimulation of residual afferent fibers in human amputees. Methods: We implanted a Utah Slanted Electrode Array (USEA) in the median and ulnar residual arm nerves of three transradial amputees and delivered stimu…
▽ More
Objective: This paper aims to demonstrate functional discriminability among restored hand sensations with different locations, qualities, and intensities that are evoked by microelectrode stimulation of residual afferent fibers in human amputees. Methods: We implanted a Utah Slanted Electrode Array (USEA) in the median and ulnar residual arm nerves of three transradial amputees and delivered stimulation via different electrodes and at different frequencies to produce various locations, qualities, and intensities of sensation on the missing hand. Blind discrimination trials were performed to determine how well subjects could discriminate among these restored sensations. Results: Subjects discriminated among restored sensory percepts with varying cutaneous and proprioceptive locations, qualities, and intensities in blind trials, including discrimination among up to 10 different location-intensity combinations (15/30 successes, p < 0.0005). Variations in the site of stimulation within the nerve, via electrode selection, enabled discrimination among up to 5 locations and qualities (35/35 successes, p < 0.0001). Variations in the stimulation frequency enabled discrimination among 4 different intensities at the same location (13/20 successes, p < 0.005). One subject discriminated among simultaneous, alternating, and isolated stimulation of two different USEA electrodes, as may be desired during multi-sensor closed-loop prosthesis use (20/25 successes, p < 0.001). Conclusion: USEA stimulation enables encoding of a diversity of functionally discriminable sensations with different locations, qualities, and intensities. Significance: These percepts provide a potentially rich source of sensory feedback that may enhance performance and embodiment during multi-sensor, closed-loop prosthesis use.
△ Less
Submitted 7 March, 2020;
originally announced March 2020.
-
CAUSE: Learning Granger Causality from Event Sequences using Attribution Methods
Authors:
Wei Zhang,
Thomas Kobber Panum,
Somesh Jha,
Prasad Chalasani,
David Page
Abstract:
We study the problem of learning Granger causality between event types from asynchronous, interdependent, multi-type event sequences. Existing work suffers from either limited model flexibility or poor model explainability and thus fails to uncover Granger causality across a wide variety of event sequences with diverse event interdependency. To address these weaknesses, we propose CAUSE (Causality…
▽ More
We study the problem of learning Granger causality between event types from asynchronous, interdependent, multi-type event sequences. Existing work suffers from either limited model flexibility or poor model explainability and thus fails to uncover Granger causality across a wide variety of event sequences with diverse event interdependency. To address these weaknesses, we propose CAUSE (Causality from AttribUtions on Sequence of Events), a novel framework for the studied task. The key idea of CAUSE is to first implicitly capture the underlying event interdependency by fitting a neural point process, and then extract from the process a Granger causality statistic using an axiomatic attribution method. Across multiple datasets riddled with diverse event interdependency, we demonstrate that CAUSE achieves superior performance on correctly inferring the inter-type Granger causality over a range of state-of-the-art methods.
△ Less
Submitted 18 February, 2020;
originally announced February 2020.
-
AutoBlock: A Hands-off Blocking Framework for Entity Matching
Authors:
Wei Zhang,
Hao Wei,
Bunyamin Sisman,
Xin Luna Dong,
Christos Faloutsos,
David Page
Abstract:
Entity matching seeks to identify data records over one or multiple data sources that refer to the same real-world entity. Virtually every entity matching task on large datasets requires blocking, a step that reduces the number of record pairs to be matched. However, most of the traditional blocking methods are learning-free and key-based, and their successes are largely built on laborious human e…
▽ More
Entity matching seeks to identify data records over one or multiple data sources that refer to the same real-world entity. Virtually every entity matching task on large datasets requires blocking, a step that reduces the number of record pairs to be matched. However, most of the traditional blocking methods are learning-free and key-based, and their successes are largely built on laborious human effort in cleaning data and designing blocking keys.
In this paper, we propose AutoBlock, a novel hands-off blocking framework for entity matching, based on similarity-preserving representation learning and nearest neighbor search. Our contributions include: (a) Automation: AutoBlock frees users from laborious data cleaning and blocking key tuning. (b) Scalability: AutoBlock has a sub-quadratic total time complexity and can be easily deployed for millions of records. (c) Effectiveness: AutoBlock outperforms a wide range of competitive baselines on multiple large-scale, real-world datasets, especially when datasets are dirty and/or unstructured.
△ Less
Submitted 6 December, 2019;
originally announced December 2019.
-
Beyond Textual Data: Predicting Drug-Drug Interactions from Molecular Structure Images using Siamese Neural Networks
Authors:
Devendra Singh Dhami,
Siwen Yan,
Gautam Kunapuli,
David Page,
Sriraam Natarajan
Abstract:
Predicting and discovering drug-drug interactions (DDIs) is an important problem and has been studied extensively both from medical and machine learning point of view. Almost all of the machine learning approaches have focused on text data or textual representation of the structural data of drugs. We present the first work that uses drug structure images as the input and utilizes a Siamese convolu…
▽ More
Predicting and discovering drug-drug interactions (DDIs) is an important problem and has been studied extensively both from medical and machine learning point of view. Almost all of the machine learning approaches have focused on text data or textual representation of the structural data of drugs. We present the first work that uses drug structure images as the input and utilizes a Siamese convolutional network architecture to predict DDIs.
△ Less
Submitted 29 June, 2020; v1 submitted 14 November, 2019;
originally announced November 2019.
-
High-Throughput Machine Learning from Electronic Health Records
Authors:
Ross S. Kleiman,
Paul S. Bennett,
Peggy L. Peissig,
Richard L. Berg,
Zhaobin Kuang,
Scott J. Hebbring,
Michael D. Caldwell,
David Page
Abstract:
The widespread digitization of patient data via electronic health records (EHRs) has created an unprecedented opportunity to use machine learning algorithms to better predict disease risk at the patient level. Although predictive models have previously been constructed for a few important diseases, such as breast cancer and myocardial infarction, we currently know very little about how accurately…
▽ More
The widespread digitization of patient data via electronic health records (EHRs) has created an unprecedented opportunity to use machine learning algorithms to better predict disease risk at the patient level. Although predictive models have previously been constructed for a few important diseases, such as breast cancer and myocardial infarction, we currently know very little about how accurately the risk for most diseases or events can be predicted, and how far in advance. Machine learning algorithms use training data rather than preprogrammed rules to make predictions and are well suited for the complex task of disease prediction. Although there are thousands of conditions and illnesses patients can encounter, no prior research simultaneously predicts risks for thousands of diagnosis codes and thereby establishes a comprehensive patient risk profile. Here we show that such pandiagnostic prediction is possible with a high level of performance across diagnosis codes. For the tasks of predicting diagnosis risks both 1 and 6 months in advance, we achieve average areas under the receiver operating characteristic curve (AUCs) of 0.803 and 0.758, respectively, across thousands of prediction tasks. Finally, our research contributes a new clinical prediction dataset in which researchers can explore how well a diagnosis can be predicted and what health factors are most useful for prediction. For the first time, we can get a much more complete picture of how well risks for thousands of different diagnosis codes can be predicted.
△ Less
Submitted 3 July, 2019;
originally announced July 2019.
-
A Simple Text Mining Approach for Ranking Pairwise Associations in Biomedical Applications
Authors:
Finn Kuusisto,
John Steill,
Zhaobin Kuang,
James Thomson,
David Page,
Ron Stewart
Abstract:
We present a simple text mining method that is easy to implement, requires minimal data collection and preparation, and is easy to use for proposing ranked associations between a list of target terms and a key phrase. We call this method KinderMiner, and apply it to two biomedical applications. The first application is to identify relevant transcription factors for cell reprogramming, and the seco…
▽ More
We present a simple text mining method that is easy to implement, requires minimal data collection and preparation, and is easy to use for proposing ranked associations between a list of target terms and a key phrase. We call this method KinderMiner, and apply it to two biomedical applications. The first application is to identify relevant transcription factors for cell reprogramming, and the second is to identify potential drugs for investigation in drug repositioning. We compare the results from our algorithm to existing data and state-of-the-art algorithms, demonstrating compelling results for both application areas. While we apply the algorithm here for biomedical applications, we argue that the method is generalizable to any available corpus of sufficient size.
△ Less
Submitted 12 June, 2019;
originally announced June 2019.
-
Machine Learning to Predict Developmental Neurotoxicity with High-throughput Data from 2D Bio-engineered Tissues
Authors:
Finn Kuusisto,
Vitor Santos Costa,
Zhonggang Hou,
James Thomson,
David Page,
Ron Stewart
Abstract:
There is a growing need for fast and accurate methods for testing developmental neurotoxicity across several chemical exposure sources. Current approaches, such as in vivo animal studies, and assays of animal and human primary cell cultures, suffer from challenges related to time, cost, and applicability to human physiology. We previously demonstrated success employing machine learning to predict…
▽ More
There is a growing need for fast and accurate methods for testing developmental neurotoxicity across several chemical exposure sources. Current approaches, such as in vivo animal studies, and assays of animal and human primary cell cultures, suffer from challenges related to time, cost, and applicability to human physiology. We previously demonstrated success employing machine learning to predict developmental neurotoxicity using gene expression data collected from human 3D tissue models exposed to various compounds. The 3D model is biologically similar to developing neural structures, but its complexity necessitates extensive expertise and effort to employ. By instead focusing solely on constructing an assay of developmental neurotoxicity, we propose that a simpler 2D tissue model may prove sufficient. We thus compare the accuracy of predictive models trained on data from a 2D tissue model with those trained on data from a 3D tissue model, and find the 2D model to be substantially more accurate. Furthermore, we find the 2D model to be more robust under stringent gene set selection, whereas the 3D model suffers substantial accuracy degradation. While both approaches have advantages and disadvantages, we propose that our described 2D approach could be a valuable tool for decision makers when prioritizing neurotoxicity screening.
△ Less
Submitted 6 May, 2019;
originally announced May 2019.
-
Privacy-Preserving Collaborative Prediction using Random Forests
Authors:
Irene Giacomelli,
Somesh Jha,
Ross Kleiman,
David Page,
Kyonghwan Yoon
Abstract:
We study the problem of privacy-preserving machine learning (PPML) for ensemble methods, focusing our effort on random forests. In collaborative analysis, PPML attempts to solve the conflict between the need for data sharing and privacy. This is especially important in privacy sensitive applications such as learning predictive models for clinical decision support from EHR data from different clini…
▽ More
We study the problem of privacy-preserving machine learning (PPML) for ensemble methods, focusing our effort on random forests. In collaborative analysis, PPML attempts to solve the conflict between the need for data sharing and privacy. This is especially important in privacy sensitive applications such as learning predictive models for clinical decision support from EHR data from different clinics, where each clinic has a responsibility for its patients' privacy. We propose a new approach for ensemble methods: each entity learns a model, from its own data, and then when a client asks the prediction for a new private instance, the answers from all the locally trained models are used to compute the prediction in such a way that no extra information is revealed. We implement this approach for random forests and we demonstrate its high efficiency and potential accuracy benefit via experiments on real-world datasets, including actual EHR data.
△ Less
Submitted 21 November, 2018;
originally announced November 2018.
-
An Inductive Logic Programming Approach to Validate Hexose Binding Biochemical Knowledge
Authors:
Houssam Nassif,
Hassan Al-Ali,
Sawsan Khuri,
Walid Keirouz,
David Page
Abstract:
Hexoses are simple sugars that play a key role in many cellular pathways, and in the regulation of development and disease mechanisms. Current protein-sugar computational models are based, at least partially, on prior biochemical findings and knowledge. They incorporate different parts of these findings in predictive black-box models. We investigate the empirical support for biochemical findings b…
▽ More
Hexoses are simple sugars that play a key role in many cellular pathways, and in the regulation of development and disease mechanisms. Current protein-sugar computational models are based, at least partially, on prior biochemical findings and knowledge. They incorporate different parts of these findings in predictive black-box models. We investigate the empirical support for biochemical findings by comparing Inductive Logic Programming (ILP) induced rules to actual biochemical results. We mine the Protein Data Bank for a representative data set of hexose binding sites, non-hexose binding sites and surface grooves. We build an ILP model of hexose-binding sites and evaluate our results against several baseline machine learning classifiers. Our method achieves an accuracy similar to that of other black-box classifiers while providing insight into the discriminating process. In addition, it confirms wet-lab findings and reveals a previously unreported Trp-Glu amino acids dependency.
△ Less
Submitted 2 October, 2018;
originally announced October 2018.
-
CLP(BN): Constraint Logic Programming for Probabilistic Knowledge
Authors:
Vitor Santos Costa,
David Page,
Maleeha Qazi,
James Cussens
Abstract:
We present CLP(BN), a novel approach that aims at expressing Bayesian networks through the constraint logic programming framework. Arguably, an important limitation of traditional Bayesian networks is that they are propositional, and thus cannot represent relations between multiple similar objects in multiple contexts. Several researchers have thus proposed first-order language…
▽ More
We present CLP(BN), a novel approach that aims at expressing Bayesian networks through the constraint logic programming framework. Arguably, an important limitation of traditional Bayesian networks is that they are propositional, and thus cannot represent relations between multiple similar objects in multiple contexts. Several researchers have thus proposed first-order languages to describe such networks. Namely, one very successful example of this approach are the Probabilistic Relational Models (PRMs), that combine Bayesian networks with relational database technology. The key difficulty that we had to address when designing CLP(cal{BN}) is that logic based representations use ground terms to denote objects. With probabilitic data, we need to be able to uniquely represent an object whose value we are not sure about. We use {sl Skolem functions} as unique new symbols that uniquely represent objects with unknown value. The semantics of CLP(cal{BN}) programs then naturally follow from the general framework of constraint logic programming, as applied to a specific domain where we have probabilistic data. This paper introduces and defines CLP(cal{BN}), and it describes an implementation and initial experiments. The paper also shows how CLP(cal{BN}) relates to Probabilistic Relational Models (PRMs), Ngo and Haddawys Probabilistic Logic Programs, AND Kersting AND De Raedts Bayesian Logic Programs.
△ Less
Submitted 19 October, 2012;
originally announced December 2012.
-
Graphical-model Based Multiple Testing under Dependence, with Applications to Genome-wide Association Studies
Authors:
Jie Liu,
Chunming Zhang,
Catherine McCarty,
Peggy Peissig,
Elizabeth Burnside,
David Page
Abstract:
Large-scale multiple testing tasks often exhibit dependence, and leveraging the dependence between individual tests is still one challenging and important problem in statistics. With recent advances in graphical models, it is feasible to use them to perform multiple testing under dependence. We propose a multiple testing procedure which is based on a Markov-random-field-coupled mixture model. The…
▽ More
Large-scale multiple testing tasks often exhibit dependence, and leveraging the dependence between individual tests is still one challenging and important problem in statistics. With recent advances in graphical models, it is feasible to use them to perform multiple testing under dependence. We propose a multiple testing procedure which is based on a Markov-random-field-coupled mixture model. The ground truth of hypotheses is represented by a latent binary Markov random field, and the observed test statistics appear as the coupled mixture variables. The parameters in our model can be automatically learned by a novel EM algorithm. We use an MCMC algorithm to infer the posterior probability that each hypothesis is null (termed local index of significance), and the false discovery rate can be controlled accordingly. Simulations show that the numerical performance of multiple testing can be improved substantially by using our procedure. We apply the procedure to a real-world genome-wide association study on breast cancer, and we identify several SNPs with strong association evidence.
△ Less
Submitted 16 October, 2012;
originally announced October 2012.
-
Demand-Driven Clustering in Relational Domains for Predicting Adverse Drug Events
Authors:
Jesse Davis,
Vitor Santos Costa,
Peggy Peissig,
Michael Caldwell,
Elizabeth Berg,
David Page
Abstract:
Learning from electronic medical records (EMR) is challenging due to their relational nature and the uncertain dependence between a patient's past and future health status. Statistical relational learning is a natural fit for analyzing EMRs but is less adept at handling their inherent latent structure, such as connections between related medications or diseases. One way to capture the latent struc…
▽ More
Learning from electronic medical records (EMR) is challenging due to their relational nature and the uncertain dependence between a patient's past and future health status. Statistical relational learning is a natural fit for analyzing EMRs but is less adept at handling their inherent latent structure, such as connections between related medications or diseases. One way to capture the latent structure is via a relational clustering of objects. We propose a novel approach that, instead of pre-clustering the objects, performs a demand-driven clustering during learning. We evaluate our algorithm on three real-world tasks where the goal is to use EMRs to predict whether a patient will have an adverse reaction to a medication. We find that our approach is more accurate than performing no clustering, pre-clustering, and using expert-constructed medical heterarchies.
△ Less
Submitted 27 June, 2012;
originally announced June 2012.
-
Learning Bayesian Network Structure from Correlation-Immune Data
Authors:
Eric Lantz,
Soumya Ray,
David Page
Abstract:
Searching the complete space of possible Bayesian networks is intractable for problems of interesting size, so Bayesian network structure learning algorithms, such as the commonly used Sparse Candidate algorithm, employ heuristics. However, these heuristics also restrict the types of relationships that can be learned exclusively from data. They are unable to learn relationships that exhibit "corre…
▽ More
Searching the complete space of possible Bayesian networks is intractable for problems of interesting size, so Bayesian network structure learning algorithms, such as the commonly used Sparse Candidate algorithm, employ heuristics. However, these heuristics also restrict the types of relationships that can be learned exclusively from data. They are unable to learn relationships that exhibit "correlation-immunity", such as parity. To learn Bayesian networks in the presence of correlation-immune relationships, we extend the Sparse Candidate algorithm with a technique called "skewing". This technique uses the observation that relationships that are correlation-immune under a specific input distribution may not be correlation-immune under another, sufficiently different distribution. We show that by extending Sparse Candidate with this technique we are able to discover relationships between random variables that are approximately correlation-immune, with a significantly lower computational cost than the alternative of considering multiple parents of a node at a time.
△ Less
Submitted 20 June, 2012;
originally announced June 2012.
-
Unachievable Region in Precision-Recall Space and Its Effect on Empirical Evaluation
Authors:
Kendrick Boyd,
Vitor Santos Costa,
Jesse Davis,
David Page
Abstract:
Precision-recall (PR) curves and the areas under them are widely used to summarize machine learning results, especially for data sets exhibiting class skew. They are often used analogously to ROC curves and the area under ROC curves. It is known that PR curves vary as class skew changes. What was not recognized before this paper is that there is a region of PR space that is completely unachievable…
▽ More
Precision-recall (PR) curves and the areas under them are widely used to summarize machine learning results, especially for data sets exhibiting class skew. They are often used analogously to ROC curves and the area under ROC curves. It is known that PR curves vary as class skew changes. What was not recognized before this paper is that there is a region of PR space that is completely unachievable, and the size of this region depends only on the skew. This paper precisely characterizes the size of that region and discusses its implications for empirical evaluation methodology in machine learning.
△ Less
Submitted 18 July, 2012; v1 submitted 18 June, 2012;
originally announced June 2012.