-
Enhancing generalizability of model discovery across parameter space with multi-experiment equation learning (ME-EQL)
Authors:
Maria-Veronica Ciocanel,
John T. Nardini,
Kevin B. Flores,
Erica M. Rutter,
Suzanne S. Sindi,
Alexandria Volkening
Abstract:
Agent-based modeling (ABM) is a powerful tool for understanding self-organizing biological systems, but it is computationally intensive and often not analytically tractable. Equation learning (EQL) methods can derive continuum models from ABM data, but they typically require extensive simulations for each parameter set, raising concerns about generalizability. In this work, we extend EQL to Multi-experiment equation learning (ME-EQL) by introducing two methods: one-at-a-time ME-EQL (OAT ME-EQL), which learns individual models for each parameter set and connects them via interpolation, and embedded structure ME-EQL (ES ME-EQL), which builds a unified model library across parameters. We demonstrate these methods using a birth-death mean-field model and an on-lattice agent-based model of birth, death, and migration with spatial structure. Our results show that both methods significantly reduce the relative error in recovering parameters from agent-based simulations, with OAT ME-EQL offering better generalizability across parameter space. Our findings highlight the potential of equation learning from multiple experiments to enhance the generalizability and interpretability of learned models for complex biological systems.
Submitted 10 June, 2025;
originally announced June 2025.
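Illustration (not from the paper): the one-at-a-time idea in the ME-EQL abstract above, learning one model per parameter set and connecting the models by interpolation, can be sketched roughly as follows. Everything here is an assumption made for illustration: the birth-death mean-field form dC/dt = (P_b - P_d) C - P_b C^2, the use of noise-free closed-form data in place of agent-based simulations, plain least squares in place of sparse regression, linear interpolation between parameter values, and all function and variable names.

```python
import numpy as np

def simulate_mean_field(p_birth, p_death, c0=0.05, t_end=20.0, n=401):
    """Closed-form logistic solution of dC/dt = (p_birth - p_death)*C - p_birth*C**2.

    Assumed stand-in for averaged agent-based data; requires p_birth > p_death.
    """
    t = np.linspace(0.0, t_end, n)
    r = p_birth - p_death          # net growth rate
    k = r / p_birth                # carrying capacity of the logistic form
    growth = np.exp(r * t)
    c = k * c0 * growth / (k + c0 * (growth - 1.0))
    return t, c

def learn_coefficients(t, c):
    """Fit dC/dt ~ xi[0]*C + xi[1]*C**2 by ordinary least squares."""
    dcdt = np.gradient(c, t)                   # finite-difference derivative
    library = np.column_stack([c, c**2])       # candidate terms: C and C^2
    # Drop the two endpoints, where np.gradient is only first-order accurate.
    xi, *_ = np.linalg.lstsq(library[1:-1], dcdt[1:-1], rcond=None)
    return xi

# One learned model per training parameter value (the "one-at-a-time" step).
p_birth = 1.0
train_p_death = np.array([0.1, 0.3, 0.5])
learned = np.array([
    learn_coefficients(*simulate_mean_field(p_birth, pd)) for pd in train_p_death
])

def interpolate_model(p_death):
    """Linearly interpolate each learned coefficient across the death rate."""
    return np.array([
        np.interp(p_death, train_p_death, learned[:, j])
        for j in range(learned.shape[1])
    ])

# Coefficients for an unseen death rate; under the assumed mean-field form
# they should be close to (p_birth - p_death, -p_birth).
xi_new = interpolate_model(0.2)
print(xi_new)
```

With noisy agent-based data, the finite-difference derivative and the plain least-squares fit would likely need to be replaced by denoising and a sparsity-promoting regression.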
-
Prototype-Guided Diffusion for Digital Pathology: Achieving Foundation Model Performance with Minimal Clinical Data
Authors:
Ekaterina Redekop,
Mara Pleasure,
Vedrana Ivezic,
Zichen Wang,
Kimberly Flores,
Anthony Sisk,
William Speier,
Corey Arnold
Abstract:
Foundation models in digital pathology use massive datasets to learn useful compact feature representations of complex histology images. However, there is limited transparency into what drives the correlation between dataset size and performance, raising the question of whether simply adding more data to increase performance is always necessary. In this study, we propose a prototype-guided diffusion model to generate high-fidelity synthetic pathology data at scale, enabling large-scale self-supervised learning and reducing reliance on real patient samples while preserving downstream performance. Using guidance from histological prototypes during sampling, our approach ensures biologically and diagnostically meaningful variations in the generated data. We demonstrate that self-supervised features trained on our synthetic dataset achieve competitive performance despite using ~60x-760x less data than models trained on large real-world datasets. Notably, models trained using our synthetic data showed statistically comparable or better performance across multiple evaluation metrics and tasks, even when compared to models trained on orders of magnitude larger datasets. Our hybrid approach, combining synthetic and real data, further enhanced performance, achieving top results in several evaluations. These findings underscore the potential of generative AI to create compelling training data for digital pathology, significantly reducing the reliance on extensive clinical datasets and highlighting the efficiency of our approach.
Submitted 15 April, 2025;
originally announced April 2025.
-
Digital Volumetric Biopsy Cores Improve Gleason Grading of Prostate Cancer Using Deep Learning
Authors:
Ekaterina Redekop,
Mara Pleasure,
Zichen Wang,
Anthony Sisk,
Yang Zong,
Kimberly Flores,
William Speier,
Corey W. Arnold
Abstract:
Prostate cancer (PCa) was the most frequently diagnosed cancer among American men in 2023. The histological grading of biopsies is essential for diagnosis, and various deep learning-based solutions have been developed to assist with this task. Existing deep learning frameworks are typically applied to individual 2D cross-sections sliced from 3D biopsy tissue specimens. This process impedes the analysis of complex tissue structures such as glands, which can vary depending on the tissue slice examined. We propose a novel digital pathology data source called a "volumetric core," obtained via the extraction and co-alignment of serially sectioned tissue sections using a novel morphology-preserving alignment framework. We trained an attention-based multiple-instance learning (ABMIL) framework on deep features extracted from volumetric patches to automatically classify the Gleason Grade Group (GGG). To handle volumetric patches, we used a modified video transformer with a deep feature extractor pretrained using self-supervised learning. We ran our morphology-preserving alignment framework to construct 10,210 volumetric cores, leaving out 30% for pretraining. The rest of the dataset was used to train ABMIL, which resulted in a 0.958 macro-average AUC, 0.671 F1 score, 0.661 precision, and 0.695 recall averaged across all five GGGs, significantly outperforming the 2D baselines.
Submitted 12 September, 2024;
originally announced September 2024.
-
Estimation of Parameter Distributions for Reaction-Diffusion Equations with Competition using Aggregate Spatiotemporal Data
Authors:
Kyle Nguyen,
Erica M. Rutter,
Kevin Flores
Abstract:
Reaction-diffusion equations have been used to model a wide range of biological phenomena related to population spread and proliferation, from ecology to cancer. It is commonly assumed that individuals in a population have homogeneous diffusion and growth rates; however, this assumption can be inaccurate when the population is intrinsically divided into many distinct subpopulations that compete with each other. In previous work, the task of inferring the degree of phenotypic heterogeneity between subpopulations from total population density has been performed within a framework that combines parameter distribution estimation with reaction-diffusion models. Here, we extend this approach so that it is compatible with reaction-diffusion models that include competition between subpopulations. We use a reaction-diffusion model of Glioblastoma multiforme, an aggressive type of brain cancer, to test our approach on simulated data that are similar to measurements that could be collected in practice. We use the Prokhorov metric framework and convert the reaction-diffusion model to a random differential equation model to estimate joint distributions of diffusion and growth rates among heterogeneous subpopulations. We then compare the new random differential equation model performance against other partial differential equation models' performance. We find that the random differential equation is more capable of predicting the cell density compared to other models while being more time efficient. Finally, we use $k$-means clustering to predict the number of subpopulations based on the recovered distributions.
Submitted 13 April, 2023;
originally announced April 2023.
-
Few-Shot Learning Enables Population-Scale Analysis of Leaf Traits in Populus trichocarpa
Authors:
John Lagergren,
Mirko Pavicic,
Hari B. Chhetri,
Larry M. York,
P. Doug Hyatt,
David Kainer,
Erica M. Rutter,
Kevin Flores,
Jack Bailey-Bale,
Marie Klein,
Gail Taylor,
Daniel Jacobson,
Jared Streich
Abstract:
Plant phenotyping is typically a time-consuming and expensive endeavor, requiring large groups of researchers to meticulously measure biologically relevant plant traits, and is the main bottleneck in understanding plant adaptation and the genetic architecture underlying complex traits at population scale. In this work, we address these challenges by leveraging few-shot learning with convolutional neural networks (CNNs) to segment the leaf body and visible venation of 2,906 P. trichocarpa leaf images obtained in the field. In contrast to previous methods, our approach (i) does not require experimental or image pre-processing, (ii) uses the raw RGB images at full resolution, and (iii) requires very few samples for training (e.g., just eight images for vein segmentation). Traits relating to leaf morphology and vein topology are extracted from the resulting segmentations using traditional open-source image-processing tools, validated using real-world physical measurements, and used to conduct a genome-wide association study to identify genes controlling the traits. In this way, the current work is designed to provide the plant phenotyping community with (i) methods for fast and accurate image-based feature extraction that require minimal training data, and (ii) a new population-scale data set, including 68 different leaf phenotypes, for domain scientists and machine learning researchers. All of the few-shot learning code, data, and results are made publicly available.
Submitted 18 May, 2023; v1 submitted 24 January, 2023;
originally announced January 2023.
-
Topological data analysis distinguishes parameter regimes in the Anderson-Chaplain model of angiogenesis
Authors:
John T. Nardini,
Bernadette J. Stolz,
Kevin B. Flores,
Heather A. Harrington,
Helen M. Byrne
Abstract:
Angiogenesis is the process by which blood vessels form from pre-existing vessels. It plays a key role in many biological processes, including embryonic development and wound healing, and contributes to many diseases including cancer and rheumatoid arthritis. The structure of the resulting vessel networks determines their ability to deliver nutrients and remove waste products from biological tissues. Here we simulate the Anderson-Chaplain model of angiogenesis at different parameter values and quantify the vessel architectures of the resulting synthetic data. Specifically, we propose a topological data analysis (TDA) pipeline for systematic analysis of the model. TDA is a vibrant and relatively new field of computational mathematics for studying the shape of data. We compute topological and standard descriptors of model simulations generated by different parameter values. We show that TDA of model simulation data stratifies parameter space into regions with similar vessel morphology. The methodologies proposed here are widely applicable to other synthetic and experimental data including wound healing, development, and plant biology.
Submitted 22 April, 2021; v1 submitted 2 January, 2021;
originally announced January 2021.
-
Biologically-informed neural networks guide mechanistic modeling from sparse experimental data
Authors:
John H. Lagergren,
John T. Nardini,
Ruth E. Baker,
Matthew J. Simpson,
Kevin B. Flores
Abstract:
Biologically-informed neural networks (BINNs), an extension of physics-informed neural networks [1], are introduced and used to discover the underlying dynamics of biological systems from sparse experimental data. In the present work, BINNs are trained in a supervised learning framework to approximate in vitro cell biology assay experiments while respecting a generalized form of the governing reaction-diffusion partial differential equation (PDE). By allowing the diffusion and reaction terms to be multilayer perceptrons (MLPs), the nonlinear forms of these terms can be learned while simultaneously converging to the solution of the governing PDE. Further, the trained MLPs are used to guide the selection of biologically interpretable mechanistic forms of the PDE terms, which provides new insights into the biological and physical mechanisms that govern the dynamics of the observed system. The method is evaluated on sparse real-world data from wound healing assays with varying initial cell densities [2].
Submitted 26 May, 2020;
originally announced May 2020.
-
Learning Equations from Biological Data with Limited Time Samples
Authors:
John T. Nardini,
John H. Lagergren,
Andrea Hawkins-Daarud,
Lee Curtin,
Bethan Morris,
Erica M. Rutter,
Kristin R. Swanson,
Kevin B. Flores
Abstract:
Equation learning methods present a promising tool to aid scientists in the modeling process for biological data. Previous equation learning studies have demonstrated that these methods can infer models from rich datasets; however, the performance of these methods in the presence of common challenges from biological data has not been thoroughly explored. We present an equation learning methodology composed of data denoising, equation learning, model selection, and post-processing steps that infers a dynamical systems model from noisy spatiotemporal data. The performance of this methodology is thoroughly investigated in the face of several common challenges presented by biological data, namely, sparse data sampling, large noise levels, and heterogeneity between datasets. We find that this methodology can accurately infer the correct underlying equation and predict unobserved system dynamics from a small number of time samples when the data are sampled over a time interval exhibiting both linear and nonlinear dynamics. Our findings suggest that equation learning methods can be used for model discovery and selection in many areas of biology when an informative dataset is used. We focus on glioblastoma multiforme modeling as a case study in this work to highlight how these results are informative for data-driven modeling-based tumor invasion predictions.
Submitted 19 May, 2020;
originally announced May 2020.