-
Complex multiannual cycles of Mycoplasma pneumoniae: persistence and the role of stochasticity
Authors:
Bjarke Frost Nielsen,
Sang Woo Park,
Emily Howerton,
Olivia Frost Lorentzen,
Mogens H. Jensen,
Bryan T. Grenfell
Abstract:
The epidemiological dynamics of Mycoplasma pneumoniae are characterized by complex and poorly understood multiannual cycles, posing challenges for forecasting. Using Bayesian methods to fit a seasonally forced transmission model to long-term surveillance data from Denmark (1958-1995, 2010-2025), we investigate the mechanisms driving recurrent outbreaks of M. pneumoniae. The period of the multiannual cycles (predominantly approx. 5 years in Denmark) is explained as a consequence of the interaction of two time-scales in the system, one intrinsic and one extrinsic (seasonal). While it provides an excellent fit to shorter time series (a few decades), we find that the deterministic model eventually settles into an annual cycle, failing to reproduce the observed 4-5-year periodicity long-term. Upon further analysis, the system is found to exhibit transient chaos and thus high sensitivity to stochasticity. We show that environmental (but not purely demographic) stochasticity can sustain the multi-year cycles via stochastic resonance. The disruptive effects of COVID-19 non-pharmaceutical interventions (NPIs) on M. pneumoniae circulation constitute a natural experiment on the effects of large perturbations. Consequently, the effects of NPIs are included in the model and medium-term predictions are explored. Our findings highlight the intrinsic sensitivity of M. pneumoniae dynamics to perturbations and interventions, underscoring the limitations of deterministic epidemic models for long-term prediction. More generally, our results emphasize the potential role of stochasticity as a driver of complex cycles across endemic and recurring pathogens.
Submitted 16 April, 2025; v1 submitted 15 April, 2025;
originally announced April 2025.
-
Reliable algorithm selection for machine learning-guided design
Authors:
Clara Fannjiang,
Ji Won Park
Abstract:
Algorithms for machine learning-guided design, or design algorithms, use machine learning-based predictions to propose novel objects with desired property values. Given a new design task -- for example, to design novel proteins with high binding affinity to a therapeutic target -- one must choose a design algorithm and specify any hyperparameters and predictive and/or generative models involved. How can these decisions be made such that the resulting designs are successful? This paper proposes a method for design algorithm selection, which aims to select design algorithms that will produce a distribution of design labels satisfying a user-specified success criterion -- for example, that at least ten percent of designs' labels exceed a threshold. It does so by combining designs' predicted property values with held-out labeled data to reliably forecast characteristics of the label distributions produced by different design algorithms, building upon techniques from prediction-powered inference. The method is guaranteed with high probability to return design algorithms that yield successful label distributions (or the null set if none exist), if the density ratios between the design and labeled data distributions are known. We demonstrate the method's effectiveness in simulated protein and RNA design tasks, in settings with either known or estimated density ratios.
Submitted 26 March, 2025;
originally announced March 2025.
-
ZAPBench: A Benchmark for Whole-Brain Activity Prediction in Zebrafish
Authors:
Jan-Matthis Lueckmann,
Alexander Immer,
Alex Bo-Yuan Chen,
Peter H. Li,
Mariela D. Petkova,
Nirmala A. Iyer,
Luuk Willem Hesselink,
Aparna Dev,
Gudrun Ihrke,
Woohyun Park,
Alyson Petruncio,
Aubrey Weigel,
Wyatt Korff,
Florian Engert,
Jeff W. Lichtman,
Misha B. Ahrens,
Michał Januszewski,
Viren Jain
Abstract:
Data-driven benchmarks have led to significant progress in key scientific modeling domains including weather and structural biology. Here, we introduce the Zebrafish Activity Prediction Benchmark (ZAPBench) to measure progress on the problem of predicting cellular-resolution neural activity throughout an entire vertebrate brain. The benchmark is based on a novel dataset containing 4d light-sheet microscopy recordings of over 70,000 neurons in a larval zebrafish brain, along with motion stabilized and voxel-level cell segmentations of these data that facilitate development of a variety of forecasting methods. Initial results from a selection of time series and volumetric video modeling approaches achieve better performance than naive baseline methods, but also show room for further improvement. The specific brain used in the activity recording is also undergoing synaptic-level anatomical mapping, which will enable future integration of detailed structural information into forecasting methods.
Submitted 4 March, 2025;
originally announced March 2025.
-
Forecasting Whole-Brain Neuronal Activity from Volumetric Video
Authors:
Alexander Immer,
Jan-Matthis Lueckmann,
Alex Bo-Yuan Chen,
Peter H. Li,
Mariela D. Petkova,
Nirmala A. Iyer,
Aparna Dev,
Gudrun Ihrke,
Woohyun Park,
Alyson Petruncio,
Aubrey Weigel,
Wyatt Korff,
Florian Engert,
Jeff W. Lichtman,
Misha B. Ahrens,
Viren Jain,
Michał Januszewski
Abstract:
Large-scale neuronal activity recordings with fluorescent calcium indicators are increasingly common, yielding high-resolution 2D or 3D videos. Traditional analysis pipelines reduce this data to 1D traces by segmenting regions of interest, leading to inevitable information loss. Inspired by the success of deep learning on minimally processed data in other domains, we investigate the potential of forecasting neuronal activity directly from volumetric videos. To capture long-range dependencies in high-resolution volumetric whole-brain recordings, we design a model with large receptive fields, which allow it to integrate information from distant regions within the brain. We explore the effects of pre-training and perform extensive model selection, analyzing spatio-temporal trade-offs for generating accurate forecasts. Our model outperforms trace-based forecasting approaches on ZAPBench, a recently proposed benchmark on whole-brain activity prediction in zebrafish, demonstrating the advantages of preserving the spatial structure of neuronal activity.
Submitted 27 February, 2025;
originally announced March 2025.
-
Antibody DomainBed: Out-of-Distribution Generalization in Therapeutic Protein Design
Authors:
Nataša Tagasovska,
Ji Won Park,
Matthieu Kirchmeyer,
Nathan C. Frey,
Andrew Martin Watkins,
Aya Abdelsalam Ismail,
Arian Rokkum Jamasb,
Edith Lee,
Tyler Bryson,
Stephen Ra,
Kyunghyun Cho
Abstract:
Machine learning (ML) has demonstrated significant promise in accelerating drug design. Active ML-guided optimization of therapeutic molecules typically relies on a surrogate model predicting the target property of interest. The model predictions are used to determine which designs to evaluate in the lab, and the model is updated on the new measurements to inform the next cycle of decisions. A key challenge is that the experimental feedback from each cycle inspires changes in the candidate proposal or experimental protocol for the next cycle, which lead to distribution shifts. To promote robustness to these shifts, we must account for them explicitly in the model training. We apply domain generalization (DG) methods to classify the stability of interactions between an antibody and antigen across five domains defined by design cycles. Our results suggest that foundational models and ensembling improve predictive performance on out-of-distribution domains. We publicly release our codebase extending the DG benchmark "DomainBed," and the associated dataset of antibody sequences and structures emulating distribution shifts across design cycles.
Submitted 15 July, 2024;
originally announced July 2024.
-
Blind Biological Sequence Denoising with Self-Supervised Set Learning
Authors:
Nathan Ng,
Ji Won Park,
Jae Hyeon Lee,
Ryan Lewis Kelly,
Stephen Ra,
Kyunghyun Cho
Abstract:
Biological sequence analysis relies on the ability to denoise the imprecise output of sequencing platforms. We consider a common setting where a short sequence is read out repeatedly using a high-throughput long-read platform to generate multiple subreads, or noisy observations of the same sequence. Denoising these subreads with alignment-based approaches often fails when too few subreads are available or error rates are too high. In this paper, we propose a novel method for blindly denoising sets of sequences without directly observing clean source sequence labels. Our method, Self-Supervised Set Learning (SSSL), gathers subreads together in an embedding space and estimates a single set embedding as the midpoint of the subreads in both the latent and sequence spaces. This set embedding represents the "average" of the subreads and can be decoded into a prediction of the clean sequence. In experiments on simulated long-read DNA data, SSSL methods denoise small reads of $\leq 6$ subreads with 17% fewer errors and large reads of $>6$ subreads with 8% fewer errors compared to the best baseline. On a real dataset of antibody sequences, SSSL improves over baselines on two self-supervised metrics, with a significant improvement on difficult small reads that comprise over 60% of the test set. By accurately denoising these reads, SSSL promises to better realize the potential of high-throughput DNA sequencing data for downstream scientific applications.
Submitted 4 September, 2023;
originally announced September 2023.
-
PropertyDAG: Multi-objective Bayesian optimization of partially ordered, mixed-variable properties for biological sequence design
Authors:
Ji Won Park,
Samuel Stanton,
Saeed Saremi,
Andrew Watkins,
Henri Dwyer,
Vladimir Gligorijevic,
Richard Bonneau,
Stephen Ra,
Kyunghyun Cho
Abstract:
Bayesian optimization offers a sample-efficient framework for navigating the exploration-exploitation trade-off in the vast design space of biological sequences. Whereas it is possible to optimize the various properties of interest jointly using a multi-objective acquisition function, such as the expected hypervolume improvement (EHVI), this approach does not account for objectives with a hierarchical dependency structure. We consider a common use case where some regions of the Pareto frontier are prioritized over others according to a specified $\textit{partial ordering}$ in the objectives. For instance, when designing antibodies, we would like to maximize the binding affinity to a target antigen only if it can be expressed in live cell culture -- modeling the experimental dependency in which affinity can only be measured for antibodies that can be expressed and thus produced in viable quantities. In general, we may want to confer a partial ordering to the properties such that each property is optimized conditioned on its parent properties satisfying some feasibility condition. To this end, we present PropertyDAG, a framework that operates on top of the traditional multi-objective BO to impose this desired ordering on the objectives, e.g. expression $\rightarrow$ affinity. We demonstrate its performance over multiple simulated active learning iterations on a penicillin production task, toy numerical problem, and a real-world antibody design task.
Submitted 8 October, 2022;
originally announced October 2022.
-
Multi-segment preserving sampling for deep manifold sampler
Authors:
Daniel Berenberg,
Jae Hyeon Lee,
Simon Kelow,
Ji Won Park,
Andrew Watkins,
Vladimir Gligorijević,
Richard Bonneau,
Stephen Ra,
Kyunghyun Cho
Abstract:
Deep generative modeling for biological sequences presents a unique challenge in reconciling the bias-variance trade-off between explicit biological insight and model flexibility. The deep manifold sampler was recently proposed as a means to iteratively sample variable-length protein sequences by exploiting the gradients from a function predictor. We introduce an alternative approach to this guided sampling procedure, multi-segment preserving sampling, that enables the direct inclusion of domain-specific knowledge by designating preserved and non-preserved segments along the input sequence, thereby restricting variation to only select regions. We present its effectiveness in the context of antibody design by training two models: a deep manifold sampler and a GPT-2 language model on nearly six million heavy chain sequences annotated with the IGHV1-18 gene. During sampling, we restrict variation to only the complementarity-determining region 3 (CDR3) of the input. We obtain log probability scores from a GPT-2 model for each sampled CDR3 and demonstrate that multi-segment preserving sampling generates reasonable designs while maintaining the desired, preserved regions.
Submitted 9 May, 2022;
originally announced May 2022.
-
Network reinforcement driven drug repurposing for COVID-19 by exploiting disease-gene-drug associations
Authors:
Yonghyun Nam,
Jae-Seung Yun,
Seung Mi Lee,
Ji Won Park,
Ziqi Chen,
Brian Lee,
Anurag Verma,
Xia Ning,
Li Shen,
Dokyoon Kim
Abstract:
Currently, the number of patients with COVID-19 has significantly increased. Thus, there is an urgent need for developing treatments for COVID-19. Drug repurposing, which is the process of reusing already-approved drugs for new medical conditions, can be a good way to solve this problem quickly and broadly. Many clinical trials for COVID-19 patients using treatments for other diseases have already been in place or will be performed at clinical sites in the near future. Additionally, patients with comorbidities such as diabetes mellitus, obesity, liver cirrhosis, kidney diseases, hypertension, and asthma are at higher risk for severe illness from COVID-19. Thus, the relationship of comorbidity disease with COVID-19 may help to find repurposable drugs. To reduce trial and error in finding treatments for COVID-19, we propose building a network-based drug repurposing framework to prioritize repurposable drugs. First, we utilized knowledge of COVID-19 to construct a disease-gene-drug network (DGDr-Net) representing a COVID-19-centric interactome with components for diseases, genes, and drugs. DGDr-Net consisted of 592 diseases, 26,681 human genes and 2,173 drugs, and medical information for 18 common comorbidities. The DGDr-Net recommended candidate repurposable drugs for COVID-19 through network reinforcement driven scoring algorithms. The scoring algorithms determined the priority of recommendations by utilizing graph-based semi-supervised learning. From the predicted scores, we recommended 30 drugs, including dexamethasone, resveratrol, methotrexate, indomethacin, quercetin, etc., as repurposable drugs for COVID-19, and the results were verified with drugs that have been under clinical trials. The list of drugs via a data-driven computational approach could help reduce trial-and-error in finding treatment for COVID-19.
Submitted 12 August, 2020;
originally announced August 2020.
-
A note on observation processes in epidemic models
Authors:
Sang Woo Park,
Benjamin M. Bolker
Abstract:
Many disease models focus on characterizing the underlying transmission mechanism but make simple, possibly naive assumptions about how infections are reported. In this note, we use a simple deterministic Susceptible-Infected-Removed (SIR) model to compare two common assumptions about disease incidence reports: individuals can report their infection as soon as they become infected or as soon as they recover. We show that incorrect assumptions about the underlying observation processes can bias estimates of the basic reproduction number and lead to overly narrow confidence intervals.
Submitted 26 November, 2019;
originally announced November 2019.
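The two reporting assumptions compared in the abstract above can be illustrated with a minimal sketch. This is not the authors' code: the parameter values (`beta`, `gamma`, `i0`), the Euler step size, and all function and variable names are assumptions chosen for illustration. In a deterministic SIR model with proportions S and I, reports at infection follow the rate beta*S*I, while reports at recovery follow gamma*I.

```python
def simulate_sir(beta=0.5, gamma=0.25, i0=1e-4, dt=0.01, t_max=150.0):
    """Euler integration of dS/dt = -beta*S*I, dI/dt = beta*S*I - gamma*I.

    All parameter values are illustrative assumptions, not taken from the paper.
    Returns the time grid and the two candidate report series.
    """
    s, i = 1.0 - i0, i0
    times, reports_at_infection, reports_at_recovery = [], [], []
    for k in range(int(t_max / dt)):
        new_infections = beta * s * i   # rate at which people become infected
        new_recoveries = gamma * i      # rate at which people recover
        times.append(k * dt)
        reports_at_infection.append(new_infections)
        reports_at_recovery.append(new_recoveries)
        s -= new_infections * dt
        i += (new_infections - new_recoveries) * dt
    return times, reports_at_infection, reports_at_recovery


def peak_time(times, series):
    """Time at which a series reaches its maximum."""
    return times[max(range(len(series)), key=series.__getitem__)]


times, at_infection, at_recovery = simulate_sir()
peak_at_infection = peak_time(times, at_infection)
peak_at_recovery = peak_time(times, at_recovery)
print(peak_at_infection, peak_at_recovery)
```

Under these assumptions the recovery-based report curve peaks later than the infection-based one, so fitting a model with the wrong observation process would misalign the model and the data.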
-
Ebola cases and health system demand in Liberia
Authors:
John M. Drake,
RajReni B. Kaul,
Laura Alexander,
Suzanne M. O'Regan,
Andrew M. Kramer,
J. Tomlin Pulliam,
Matthew J. Ferrari,
Andrew W. Park
Abstract:
In 2014, a major epidemic of human Ebola virus disease emerged in West Africa, where human-to-human transmission has now been sustained for more than 10 months. In the summer of 2014, there was great uncertainty about the answers to several key policy questions concerning the path to containment. In recent years, epidemic models have been used to guide public health interventions. However, model-based policy relies on a high-quality causal understanding of transmission, including the availability of appropriate dynamic transmission models and reliable reporting about the sequence of case incidence for model fitting, which were lacking for this epidemic. To investigate the range of potential transmission scenarios, we developed a multi-type branching process model that incorporates key heterogeneities and time-varying parameters to reflect changing human behavior and deliberate interventions. Ensembles of this model were evaluated at a set of parameters that were both epidemiologically plausible and capable of reproducing the observed trajectory. Results suggest that epidemic outcome depends on both hospital capacity and individual behavior. The model predicts that if hospital capacity is not increased soon, then transmission may outpace the rate of isolation and the ability to provide care for the ill, infectious, and dying. Similarly, containment will probably require individuals to adopt behaviors that increase the rates of case identification and isolation and secure burial of the deceased. Given current knowledge, it is uncertain that this epidemic will be contained even with 99% hospitalization rate at the currently projected hospital capacity.
Submitted 30 October, 2014;
originally announced October 2014.
-
A Modified Cross Correlation Algorithm for Reference-free Image Alignment of Non-Circular Projections in Single-Particle Electron Microscopy
Authors:
Wooram Park,
Gregory S. Chirikjian
Abstract:
In this paper we propose a modified cross correlation method to align images from the same class in single-particle electron microscopy of highly non-spherical structures. In this new method, we first coarsely align projection images, and then re-align the resulting images using the cross correlation (CC) method. The coarse alignment is obtained by matching the centers of mass and the principal axes of the images. The distribution of misalignment in this coarse alignment can be quantified based on the statistical properties of the additive background noise. As a consequence, the search space for re-alignment in the cross correlation method can be reduced to achieve better alignment. In order to overcome problems associated with false peaks in the cross correlation function, we use artificially blurred images for the early stage of the iterative cross correlation method and segment the intermediate class average from every iteration step. These two additional manipulations combined with the reduced search space size in the cross correlation method yield better alignments for low signal-to-noise ratio images than both classical cross correlation and maximum likelihood (ML) methods.
Submitted 6 May, 2011;
originally announced May 2011.
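The coarse alignment step described in the abstract above (matching centers of mass and principal axes) can be sketched with standard image moments. This is an illustrative sketch only, not the authors' implementation: the function name `centroid_and_axis`, the list-of-lists image representation, and the test images are assumptions. Two images would be coarsely aligned by translating each so its centroid sits at a common point and rotating by the difference of the returned angles.

```python
import math


def centroid_and_axis(image):
    """Intensity-weighted center of mass and principal-axis angle of a 2D image.

    `image` is a list of rows of non-negative intensities (hypothetical format).
    Returns ((row, col), angle), with angle in radians in (-pi/2, pi/2],
    measured from the column (x) axis.
    """
    total = sum(sum(row) for row in image)
    r0 = sum(r * v for r, row in enumerate(image) for v in row) / total
    c0 = sum(c * v for row in image for c, v in enumerate(row)) / total

    # Second-order central moments about the center of mass.
    mu20 = mu02 = mu11 = 0.0
    for r, row in enumerate(image):
        for c, v in enumerate(row):
            dx, dy = c - c0, r - r0
            mu20 += v * dx * dx
            mu02 += v * dy * dy
            mu11 += v * dx * dy

    # Orientation of the axis of largest spread (standard image-moment formula).
    angle = 0.5 * math.atan2(2.0 * mu11, mu20 - mu02)
    return (r0, c0), angle


# Synthetic elongated "particles": a horizontal and a vertical bar.
horizontal = [[1.0 if r == 10 and 3 <= c < 18 else 0.0 for c in range(21)]
              for r in range(21)]
vertical = [[1.0 if c == 10 and 3 <= r < 18 else 0.0 for c in range(21)]
            for r in range(21)]

center_h, angle_h = centroid_and_axis(horizontal)
center_v, angle_v = centroid_and_axis(vertical)
print(center_h, angle_h, center_v, angle_v)
```

A principal axis is undirected, so this estimate leaves a 180-degree ambiguity; a subsequent cross correlation search would need to resolve it.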