-
lintsampler: Easy random sampling via linear interpolation
Authors:
Aneesh P. Naik,
Michael S. Petersen
Abstract:
'lintsampler' provides a Python implementation of a technique we term 'linear interpolant sampling': an algorithm to efficiently draw pseudo-random samples from an arbitrary probability density function (PDF). First, the PDF is evaluated on a grid-like structure. Then, it is assumed that the PDF can be approximated between grid vertices by the (multidimensional) linear interpolant. With this assumption, random samples can be efficiently drawn via inverse transform sampling. lintsampler is primarily written with 'numpy', drawing some additional functionality from 'scipy'. Under the most basic usage of lintsampler, the user provides a Python function defining the target PDF and some parameters describing a grid-like structure to the 'LintSampler' class, and is then able to draw samples via the 'sample' method. Additionally, there is functionality for the user to set the random seed, employ quasi-Monte Carlo sampling, or sample within a premade grid ('DensityGrid') or tree ('DensityTree') structure.
Submitted 8 October, 2024;
originally announced October 2024.
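The procedure described in this abstract (evaluate the PDF on a grid, treat it as linear between vertices, then draw by inverse transform sampling) can be illustrated with a minimal one-dimensional sketch. This is not the lintsampler API: the function name `lint_sample_1d` and its parameters are hypothetical, and only numpy is assumed.

```python
import numpy as np

def lint_sample_1d(pdf, edges, n, rng=None):
    """Hypothetical 1D sketch of linear interpolant sampling (not lintsampler's API)."""
    rng = np.random.default_rng(rng)
    edges = np.asarray(edges, dtype=float)
    # Evaluate the (possibly unnormalised) PDF at the grid vertices.
    f = np.asarray(pdf(edges), dtype=float)
    f0, f1 = f[:-1], f[1:]
    widths = np.diff(edges)
    # Mass of each cell under the linear interpolant (trapezoid rule).
    masses = 0.5 * (f0 + f1) * widths
    # Step 1: choose a cell with probability proportional to its mass.
    cells = rng.choice(len(widths), size=n, p=masses / masses.sum())
    # Step 2: invert the quadratic CDF of the linear density inside the cell.
    u = 1.0 - rng.random(n)  # uniform on (0, 1], avoids 0/0 when f0 == 0
    a, b = f0[cells], f1[cells]
    t = u * (a + b) / (a + np.sqrt(a**2 + u * (b**2 - a**2)))
    return edges[cells] + t * widths[cells]

# Usage: draw from an unnormalised standard normal on [-5, 5].
samples = lint_sample_1d(lambda x: np.exp(-0.5 * x**2),
                         np.linspace(-5.0, 5.0, 201), 20000, rng=0)
```

The within-cell step uses the rationalised form of the quadratic root, which reduces to `t = u` when the two vertex densities are equal and so avoids dividing by their difference.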
-
Applying data technologies to combat AMR: current status, challenges, and opportunities on the way forward
Authors:
Leonid Chindelevitch,
Elita Jauneikaite,
Nicole E. Wheeler,
Kasim Allel,
Bede Yaw Ansiri-Asafoakaa,
Wireko A. Awuah,
Denis C. Bauer,
Stephan Beisken,
Kara Fan,
Gary Grant,
Michael Graz,
Yara Khalaf,
Veranja Liyanapathirana,
Carlos Montefusco-Pereira,
Lawrence Mugisha,
Atharv Naik,
Sylvia Nanono,
Anthony Nguyen,
Timothy Rawson,
Kessendri Reddy,
Juliana M. Ruzante,
Anneke Schmider,
Roman Stocker,
Leonhardt Unruh,
Daniel Waruingi
, et al. (2 additional authors not shown)
Abstract:
Antimicrobial resistance (AMR) is a growing public health threat, estimated to cause over 10 million deaths per year and cost the global economy 100 trillion USD by 2050 under status quo projections. These losses would mainly result from an increase in the morbidity and mortality from treatment failure, AMR infections during medical procedures, and a loss of quality of life attributed to AMR. Numerous interventions have been proposed to control the development of AMR and mitigate the risks posed by its spread. This paper reviews key aspects of bacterial AMR management and control which make essential use of data technologies such as artificial intelligence, machine learning, and mathematical and statistical modelling, fields that have seen rapid developments in this century. Although data technologies have become an integral part of biomedical research, their impact on AMR management has remained modest. We outline the use of data technologies to combat AMR, detailing recent advancements in four complementary categories: surveillance, prevention, diagnosis, and treatment. We provide an overview of current AMR control approaches using data technologies within biomedical research, clinical practice, and in the "One Health" context. We discuss the potential impact of, and the challenges facing, wider implementation of data technologies in high-income as well as in low- and middle-income countries, and recommend concrete actions needed to allow these technologies to be more readily integrated within the healthcare and public health sectors.
Submitted 11 August, 2022; v1 submitted 5 July, 2022;
originally announced August 2022.
-
Composite Scores for Transplant Center Evaluation: A New Individualized Empirical Null Method
Authors:
Nicholas Hartman,
Joseph M. Messana,
Jian Kang,
Abhijit S. Naik,
Tempie H. Shearon,
Kevin He
Abstract:
Risk-adjusted quality measures are used to evaluate healthcare providers while controlling for factors beyond their control. Existing healthcare provider profiling approaches typically assume that the risk adjustment is perfect and the between-provider variation in quality measures is entirely due to the quality of care. However, in practice, even with very good models for risk adjustment, some between-provider variation will be due to incomplete risk adjustment, which should be recognized in assessing and monitoring providers. Otherwise, conventional methods disproportionately identify larger providers as outliers, even though their provider effects need not be "extreme." Motivated by efforts to evaluate the quality of care provided by transplant centers, we develop a composite evaluation score based on a novel individualized empirical null method, which robustly accounts for overdispersion due to unobserved risk factors, models the marginal variance of standardized scores as a function of the effective center size, and only requires the use of publicly available center-level statistics. The evaluations of United States kidney transplant centers based on the proposed composite score are substantially different from those based on conventional methods. Simulations show that the proposed empirical null approach more accurately classifies centers in terms of quality of care, compared to existing methods.
Submitted 23 July, 2022; v1 submitted 15 July, 2022;
originally announced July 2022.
-
Classifying Documents within Multiple Hierarchical Datasets using Multi-Task Learning
Authors:
Azad Naik,
Anveshi Charuvaka,
Huzefa Rangwala
Abstract:
Multi-task learning (MTL) is a supervised learning paradigm in which the prediction models for several related tasks are learned jointly to achieve better generalization performance. When there are only a few training examples per task, MTL considerably outperforms traditional single-task learning (STL) in terms of prediction accuracy. In this work we develop an MTL-based approach for classifying documents that are archived within dual concept hierarchies, namely, DMOZ and Wikipedia. We solve the multi-class classification problem by defining one-versus-rest binary classification tasks for each of the different classes across the two hierarchical datasets. Instead of learning a linear discriminant for each of the different tasks independently, we use an MTL approach with relationships between the different tasks across the datasets established using the non-parametric, lazy, nearest neighbor approach. We also develop and evaluate a transfer learning (TL) approach and compare the MTL (and TL) methods against the standard single-task learning and semi-supervised learning approaches. Our empirical results demonstrate the strength of our developed methods, which show an improvement especially when there are fewer training examples per classification task.
Submitted 5 June, 2017;
originally announced June 2017.
-
Embedding Feature Selection for Large-scale Hierarchical Classification
Authors:
Azad Naik,
Huzefa Rangwala
Abstract:
Large-scale Hierarchical Classification (HC) involves datasets consisting of thousands of classes and millions of training instances with high-dimensional features, posing several big data challenges. Feature selection, which aims to select the subset of discriminant features, is an effective strategy for dealing with the large-scale HC problem. It speeds up the training process, reduces the prediction time and minimizes the memory requirements by compressing the total size of the learned model weight vectors. The majority of studies have also shown feature selection to be competent and successful in improving the classification accuracy by removing irrelevant features. In this work, we investigate various filter-based feature selection methods for dimensionality reduction to solve the large-scale HC problem. Our experimental evaluation on text and image datasets with varying distributions of features, classes and instances shows up to 3x order of speed-up on massive datasets and up to 45% lower memory requirements for storing the weight vectors of the learned model, without any significant loss (and with improvement for some datasets) in classification accuracy. Source Code: https://cs.gmu.edu/~mlbio/featureselection.
Submitted 5 June, 2017;
originally announced June 2017.
-
Inconsistent Node Flattening for Improving Top-down Hierarchical Classification
Authors:
Azad Naik,
Huzefa Rangwala
Abstract:
Large-scale classification of data where classes are structurally organized in a hierarchy is an important area of research. Top-down approaches that exploit the hierarchy during the learning and prediction phases are efficient for large-scale hierarchical classification. However, the accuracy of top-down approaches is poor due to error propagation, i.e., prediction errors made at higher levels in the hierarchy cannot be corrected at lower levels. One of the main reasons behind errors at the higher levels is the presence of inconsistent nodes that are introduced due to the arbitrary process of creating these hierarchies by domain experts. In this paper, we propose two different data-driven approaches (local and global) for hierarchical structure modification that identify and flatten inconsistent nodes present within the hierarchy. Our extensive empirical evaluation of the proposed approaches on several image and text datasets with varying distributions of features, classes and training instances per class shows improved classification performance over competing hierarchical modification approaches. Specifically, we see an improvement of up to 7% in Macro-F1 score with our approach over the best TD baseline. Source Code: http://www.cs.gmu.edu/~mlbio/InconsistentNodeFlattening
Submitted 5 June, 2017;
originally announced June 2017.
-
Deciding when to stop: Efficient stopping of active learning guided drug-target prediction
Authors:
Maja Temerinac-Ott,
Armaghan W. Naik,
Robert F. Murphy
Abstract:
Active learning has been shown to reduce the number of experiments needed to obtain high-confidence drug-target predictions. However, in order to actually save experiments using active learning, it is crucial to have a method to evaluate the quality of the current prediction and decide when to stop the experimentation process. Only by applying reliable stopping criteria to active learning can time and costs in the experimental process actually be saved. We compute active learning traces on simulated drug-target matrices in order to learn a regression model for the accuracy of the active learner. By analyzing the performance of the regression model on simulated data, we design stopping criteria for previously unseen experimental matrices. We demonstrate on four previously characterized drug effect data sets that applying the stopping criteria can result in up to 40% savings of the total experiments for highly accurate predictions.
Submitted 9 April, 2015;
originally announced April 2015.