Search | arXiv e-print repository

arXiv:2503.14356 [pdf, other]

Benchmarking community drug response prediction models: datasets, models, tools, and metrics for cross-dataset generalization analysis

Authors: Alexander Partin, Priyanka Vasanthakumari, Oleksandr Narykov, Andreas Wilke, Natasha Koussa, Sara E. Jones, Yitan Zhu, Jamie C. Overbeek, Rajeev Jain, Gayara Demini Fernando, Cesar Sanchez-Villalobos, Cristina Garcia-Cardona, Jamaludin Mohd-Yusof, Nicholas Chia, Justin M. Wozniak, Souparno Ghosh, Ranadip Pal, Thomas S. Brettin, M. Ryan Weil, Rick L. Stevens

Abstract: Deep learning (DL) and machine learning (ML) models have shown promise in drug response prediction (DRP), yet their ability to generalize across datasets remains an open question, raising concerns about their real-world applicability. Due to the lack of standardized benchmarking approaches, model evaluations and comparisons often rely on inconsistent datasets and evaluation criteria, making it difficult to assess true predictive capabilities. In this work, we introduce a benchmarking framework for evaluating cross-dataset prediction generalization in DRP models. Our framework incorporates five publicly available drug screening datasets, six standardized DRP models, and a scalable workflow for systematic evaluation. To assess model generalization, we introduce a set of evaluation metrics that quantify both absolute performance (e.g., predictive accuracy across datasets) and relative performance (e.g., performance drop compared to within-dataset results), enabling a more comprehensive assessment of model transferability. Our results reveal substantial performance drops when models are tested on unseen datasets, underscoring the importance of rigorous generalization assessments. While several models demonstrate relatively strong cross-dataset generalization, no single model consistently outperforms across all datasets. Furthermore, we identify CTRPv2 as the most effective source dataset for training, yielding higher generalization scores across target datasets. By sharing this standardized evaluation framework with the community, our study aims to establish a rigorous foundation for model comparison, and accelerate the development of robust DRP models for real-world applications.

Submitted 18 March, 2025; originally announced March 2025.

Comments: 18 pages, 9 figures

arXiv:2409.12215 [pdf, other]

Assessing Reusability of Deep Learning-Based Monotherapy Drug Response Prediction Models Trained with Omics Data

Authors: Jamie C. Overbeek, Alexander Partin, Thomas S. Brettin, Nicholas Chia, Oleksandr Narykov, Priyanka Vasanthakumari, Andreas Wilke, Yitan Zhu, Austin Clyde, Sara Jones, Rohan Gnanaolivu, Yuanhang Liu, Jun Jiang, Chen Wang, Carter Knutson, Andrew McNaughton, Neeraj Kumar, Gayara Demini Fernando, Souparno Ghosh, Cesar Sanchez-Villalobos, Ruibo Zhang, Ranadip Pal, M. Ryan Weil, Rick L. Stevens

Abstract: Cancer drug response prediction (DRP) models present a promising approach towards precision oncology, tailoring treatments to individual patient profiles. While deep learning (DL) methods have shown great potential in this area, models that can be successfully translated into clinical practice and shed light on the molecular mechanisms underlying treatment response will likely emerge from collaborative research efforts. This highlights the need for reusable and adaptable models that can be improved and tested by the wider scientific community. In this study, we present a scoring system for assessing the reusability of prediction DRP models, and apply it to 17 peer-reviewed DL-based DRP models. As part of the IMPROVE (Innovative Methodologies and New Data for Predictive Oncology Model Evaluation) project, which aims to develop methods for systematic evaluation and comparison of DL models across scientific domains, we analyzed these 17 DRP models focusing on three key categories: software environment, code modularity, and data availability and preprocessing. While not the primary focus, we also attempted to reproduce key performance metrics to verify model behavior and adaptability. Our assessment of 17 DRP models reveals both strengths and shortcomings in model reusability. To promote rigorous practices and open-source sharing, we offer recommendations for developing and sharing prediction models. Following these recommendations can address many of the issues identified in this study, improving model reusability without adding significant burdens on researchers. This work offers the first comprehensive assessment of reusability and reproducibility across diverse DRP models, providing insights into current model sharing practices and promoting standards within the DRP and broader AI-enabled scientific research community.

Submitted 18 September, 2024; originally announced September 2024.

Comments: 12 pages, 2 figures

arXiv:2407.04486 [pdf, other]

Variational and Explanatory Neural Networks for Encoding Cancer Profiles and Predicting Drug Responses

Authors: Tianshu Feng, Rohan Gnanaolivu, Abolfazl Safikhani, Yuanhang Liu, Jun Jiang, Nicholas Chia, Alexander Partin, Priyanka Vasanthakumari, Yitan Zhu, Chen Wang

Abstract: Human cancers present a significant public health challenge and require the discovery of novel drugs through translational research. Transcriptomics profiling data that describe molecular activities in tumors and cancer cell lines are widely utilized for predicting anti-cancer drug responses. However, existing AI models face challenges due to noise in transcriptomics data and lack of biological interpretability. To overcome these limitations, we introduce VETE (Variational and Explanatory Transcriptomics Encoder), a novel neural network framework that incorporates a variational component to mitigate noise effects and integrates traceable gene ontology into the neural network architecture for encoding cancer transcriptomics data. Key innovations include a local interpretability-guided method for identifying ontology paths, a visualization tool to elucidate biological mechanisms of drug responses, and the application of centralized large-scale hyperparameter optimization. VETE demonstrated robust accuracy in cancer cell line classification and drug response prediction. Additionally, it provided traceable biological explanations for both tasks and offers insights into the mechanisms underlying its predictions. VETE bridges the gap between AI-driven predictions and biologically meaningful insights in cancer research, which represents a promising advancement in the field.

Submitted 5 July, 2024; originally announced July 2024.

arXiv:2210.03198 [pdf, other]

Metabolic Model-based Ecological Modeling for Probiotic Design

Authors: James D. Brunner, Nicholas Chia

Abstract: The microbial community composition in the human gut has a profound effect on human health. This observation has led to extensive use of microbiome therapies, including over-the-counter "probiotic" treatments intended to alter the composition of the microbiome. Despite so much promise and commercial interest, the factors that contribute to the success or failure of microbiome-targeted treatments remain unclear. We investigate the biotic interactions that lead to successful engraftment of a novel bacterial strain introduced to the microbiome as in probiotic treatments. We use pairwise genome-scale metabolic modeling with a generalized resource allocation constraint to build a network of interactions between 818 species with well-developed models available in the AGORA database. We create induced sub-graphs using the taxa present in samples from three experimental engraftment studies and assess the likelihood of invader engraftment based on network structure. To do so, we use a set of dynamical models designed to connect network topology to growth dynamics. We show that a generalized Lotka-Volterra model has strong ability to predict if a particular invader or probiotic will successfully engraft into an individual's microbiome. Furthermore, we show that the mechanistic nature of the model is useful for revealing which microbe-microbe interactions potentially drive engraftment.

Submitted 6 October, 2022; originally announced October 2022.

Comments: 18 pages, 6 figures

arXiv:2006.02961 [pdf, other]

Confidence in the dynamic spread of epidemics under biased sampling conditions

Authors: James D. Brunner, Nicholas Chia

Abstract: The interpretation of sampling data plays a crucial role in policy response to the spread of a disease during an epidemic, such as the COVID-19 epidemic of 2020. However, this is a non-trivial endeavor due to the complexity of real world conditions and limits to the availability of diagnostic tests, which necessitate a bias in testing favoring symptomatic individuals. A thorough understanding of sampling confidence and bias is necessary in order to make accurate conclusions. In this manuscript, we provide a stochastic model of sampling for assessing confidence in disease metrics such as trend detection, peak detection, and disease spread estimation. Our model simulates testing for a disease in an epidemic with known dynamics, allowing us to use Monte-Carlo sampling to assess metric confidence. This model can provide realistic simulated data which can be used in the design and calibration of data analysis and prediction methods. As an example, we use this method to show that trends in the disease may be identified using under $10000$ biased samples each day, and an estimate of disease spread can be made with an additional $1000-2000$ unbiased samples each day. We also demonstrate that the model can be used to assess more advanced metrics by finding the precision and recall of a strategy for finding peaks in the dynamics.

Submitted 28 July, 2020; v1 submitted 4 June, 2020; originally announced June 2020.

Comments: 11 figures, 2 tables, 15 pages

MSC Class: 92-10; 62D05

arXiv:2003.03638 [pdf, other]

doi 10.1371/journal.pcbi.1007786

Minimizing the number of optimizations for efficient community dynamic flux balance analysis

Authors: James D. Brunner, Nicholas Chia

Abstract: Dynamic flux balance analysis uses a quasi-steady state assumption to calculate an organism's metabolic activity at each time-step of a dynamic simulation, using the well-known technique of flux balance analysis. For microbial communities, this calculation is especially costly and involves solving a linear constrained optimization problem for each member of the community at each time step. However, this is unnecessary and inefficient, as prior solutions can be used to inform future time steps. Here, we show that a basis for the space of internal fluxes can be chosen for each microbe in a community and this basis can be used to simulate forward by solving a relatively inexpensive system of linear equations at most time steps. We can use this solution as long as the resulting metabolic activity remains within the optimization problem's constraints (i.e. the solution to the linear system of equations remains feasible for the linear program). As the solution becomes infeasible, it first becomes a feasible but degenerate solution to the optimization problem, and we can solve a different but related optimization problem to choose an appropriate basis to continue forward simulation. We demonstrate the efficiency and robustness of our method by comparing with currently used methods on a four species community, and show that our method requires at least $91\%$ fewer optimizations to be solved. For reproducibility, we prototyped the method using Python. Source code is available at https://github.com/jdbrunner/surfin_fba.

Submitted 28 July, 2020; v1 submitted 7 March, 2020; originally announced March 2020.

Comments: 9 figures

MSC Class: 92-08; 92D25

arXiv:1907.04436 [pdf, other]

doi 10.1098/rsif.2019.0423

Metabolite mediated modeling of microbial community dynamics captures emergent behavior more effectively than species-species modeling

Authors: James D. Brunner, Nicholas Chia

Abstract: Personalized models of the gut microbiome are valuable for disease prevention and treatment. For this, one requires a mathematical model that predicts microbial community composition and the emergent behavior of microbial communities. We seek a modeling strategy that can capture emergent behavior when built from sets of universal individual interactions. Our investigation reveals that species-metabolite interaction modeling is better able to capture emergent behavior in community composition dynamics than direct species-species modeling. Using publicly available data, we examine the ability of species-species models and species-metabolite models to predict trio growth experiments from the outcomes of pair growth experiments. We compare quadratic species-species interaction models and quadratic species-metabolite interaction models, and conclude that only species-metabolite models have the necessary complexity to explain a wide variety of interdependent growth outcomes. We also show that general species-species interaction models cannot match patterns observed in community growth dynamics, whereas species-metabolite models can. We conclude that species-metabolite modeling will be important in the development of accurate, clinically useful models of microbial communities.

Submitted 19 August, 2019; v1 submitted 9 July, 2019; originally announced July 2019.

Comments: 23 pages, 8 Figures

MSC Class: 92D25

arXiv:1807.09400 [pdf, other]

doi 10.1103/PhysRevE.99.032413

Extreme value analysis of gut microbial alterations in colorectal cancer

Authors: Stephanie Danni Song, Patricio Jeraldo, Jun Chen, Nicholas Chia

Abstract: Gut microbes play a key role in colorectal carcinogenesis, yet reaching a consensus on microbial signatures remains a challenge. This is in part due to a reliance on mean value estimates. We present an extreme value analysis for overcoming these limitations. By characterizing a power law fit to the relative abundances of microbes, we capture the same microbial signatures as more complex meta-analyses. Importantly, we show that our method is robust to the variations inherent in microbial community profiling and point to future directions for developing sensitive, reliable analytical methods.

Submitted 13 February, 2019; v1 submitted 24 July, 2018; originally announced July 2018.

Journal ref: Phys. Rev. E 99, 032413 (2019)

arXiv:1706.01787 [pdf]

doi 10.1038/ncomms15393

Global metabolic interaction network of the human gut microbiota for context-specific community-scale analysis

Authors: Jaeyun Sung, Seunghyeon Kim, Josephine Jill T. Cabatbat, Sungho Jang, Yong-Su Jin, Gyoo Yeol Jung, Nicholas Chia, Pan-Jun Kim

Abstract: A system-level framework of complex microbe-microbe and host-microbe chemical cross-talk would help elucidate the role of our gut microbiota in health and disease. Here we report a literature-curated interspecies network of the human gut microbiota, called NJS16. This is an extensive data resource composed of ~570 microbial species and 3 human cell types metabolically interacting through >4,400 small-molecule transport and macromolecule degradation events. Based on the contents of our network, we develop a mathematical approach to elucidate representative microbial and metabolic features of the gut microbial community in a given population, such as a disease cohort. Applying this strategy to microbiome data from type 2 diabetes patients reveals a context-specific infrastructure of the gut microbial ecosystem, core microbial entities with large metabolic influence, and frequently-produced metabolic compounds that might indicate relevant community metabolic processes. Our network presents a foundation towards integrative investigations of community-scale microbial activities within the human gut.

Submitted 6 June, 2017; originally announced June 2017.

Comments: Supplementary material is available at the journal website

Journal ref: Nat. Commun. 8, 15393 (2017)

arXiv:1012.2166 [pdf, other]

doi 10.1007/s10955-010-0112-8

Statistical Mechanics of Horizontal Gene Transfer in Evolutionary Ecology

Authors: Nicholas Chia, Nigel Goldenfeld

Abstract: The biological world, especially its majority microbial component, is strongly interacting and may be dominated by collective effects. In this review, we provide a brief introduction for statistical physicists of the way in which living cells communicate genetically through transferred genes, as well as the ways in which they can reorganize their genomes in response to environmental pressure. We discuss how genome evolution can be thought of as related to the physical phenomenon of annealing, and describe the sense in which genomes can be said to exhibit an analogue of information entropy. As a direct application of these ideas, we analyze the variation with ocean depth of transposons in marine microbial genomes, predicting trends that are consistent with recent observations using metagenomic surveys.

Submitted 9 December, 2010; originally announced December 2010.

Comments: Accepted by Journal of Statistical Physics

arXiv:1005.3349 [pdf, other]

doi 10.1103/PhysRevE.83.021906

The dynamics of gene duplication and transposons in microbial genomes following a sudden environmental change

Authors: Nicholas Chia, Nigel Goldenfeld

Abstract: A variety of genome transformations can occur as a microbial population adapts to a large environmental change. In particular, genomic surveys indicate that, following the transition to an obligate, host-dependent symbiont, the density of transposons first rises, then subsequently declines over evolutionary time. Here, we show that these observations can be accounted for by a class of generic stochastic models for the evolution of genomes in the presence of continuous selection and gene duplication. The models use a fitness function that allows for partial contributions from multiple gene copies, is an increasing but bounded function of copy number, and is optimal for one fully adapted gene copy. We use Monte Carlo simulation to show that the dynamics result in an initial rise in gene copy number followed by a subsequent fall due to adaptation to the new environmental parameters. These results are robust for reasonable gene duplication and mutation parameters when adapting to a novel target sequence. Our model provides a generic explanation for the dynamics of microbial transposon density following a large environmental change such as host restriction.

Submitted 19 January, 2011; v1 submitted 18 May, 2010; originally announced May 2010.

arXiv:0811.3407 [pdf, ps, other]

doi 10.1103/PhysRevE.80.030901

Lambda-prophage induction modeled as a cooperative failure mode of lytic repression

Authors: Nicholas Chia, Ido Golding, Nigel Goldenfeld

Abstract: We analyze a system-level model for lytic repression of lambda-phage in E. coli using reliability theory, showing that the repressor circuit comprises 4 redundant components whose failure mode is prophage induction. Our model reflects the specific biochemical mechanisms involved in regulation, including long-range cooperative binding, and its detailed predictions for prophage induction in E. coli under ultra-violet radiation are in good agreement with experimental data.

Submitted 3 December, 2008; v1 submitted 20 November, 2008; originally announced November 2008.

Comments: added reference

arXiv:q-bio/0406009 [pdf, ps, other]

Finite Width Model Sequence Comparison

Authors: Ralf Bundschuh, Nicholas Chia

Abstract: Sequence comparison is a widely used computational technique in modern molecular biology. In spite of the frequent use of sequence comparisons, the important problem of assigning statistical significance to a given degree of similarity is still outstanding. Analytical approaches to filling this gap usually make use of an approximation that neglects certain correlations in the disorder underlying the sequence comparison algorithm. Here, we use the longest common subsequence problem, a prototype sequence comparison problem, to analytically establish that this approximation does make a difference to certain sequence comparison statistics. In the course of establishing this difference we develop a method that can systematically deal with these disorder correlations.

Submitted 3 June, 2004; originally announced June 2004.

Showing 1–13 of 13 results for author: Chia, N