-
Steering Generative Models with Experimental Data for Protein Fitness Optimization
Authors:
Jason Yang,
Wenda Chu,
Daniel Khalil,
Raul Astudillo,
Bruce J. Wittmann,
Frances H. Arnold,
Yisong Yue
Abstract:
Protein fitness optimization involves finding a protein sequence that maximizes desired quantitative properties in a combinatorially large design space of possible sequences. Recent developments in steering protein generative models (e.g., diffusion models, language models) offer a promising approach. However, by and large, past studies have optimized surrogate rewards and/or utilized large amounts of labeled data for steering, making it unclear how well existing methods perform and compare to each other in real-world optimization campaigns where fitness is measured by low-throughput wet-lab assays. In this study, we explore fitness optimization using small amounts (hundreds) of labeled sequence-fitness pairs and comprehensively evaluate strategies such as classifier guidance and posterior sampling for guiding generation from different discrete diffusion models of protein sequences. We also demonstrate how guidance can be integrated into adaptive sequence selection akin to Thompson sampling in Bayesian optimization, showing that plug-and-play guidance strategies offer advantages compared to alternatives such as reinforcement learning with protein language models.
Submitted 21 May, 2025;
originally announced May 2025.
-
An epidemical model with nonlocal spatial infections
Authors:
Su Yang,
Weiqi Chu,
Panayotis Kevrekidis
Abstract:
The SIR model is one of the most prototypical compartmental models in epidemiology. Generalizing this ordinary differential equation (ODE) framework into a spatially distributed partial differential equation (PDE) model is a considerable challenge. In the present work, we extend a model recently proposed by one of the authors, based on nearest-neighbor spatial interactions \cite{vaziry2022modelling}, towards a nonlocal, nonlinear PDE variant of the SIR prototype. We then seek to develop a set of tools that provide insights for this PDE framework. Stationary states and their stability analysis offer a perspective on the early spatial growth of the infection. Evolutionary computational dynamics enable visualization of the spatio-temporal progression of infection and recovery, allowing for an appreciation of the effect of varying parameters of the nonlocal kernel, such as its width parameter. These features are explored in both one- and two-dimensional settings. At a model-reduction level, we develop a sequence of interpretable moment-based diagnostics to observe how these reflect the total number of infections, the epidemic's epicenter, and its spread. Finally, we propose a data-driven methodology based on the sparse identification of nonlinear dynamics (SINDy) to identify approximate closed-form dynamical equations for such quantities. These approaches may pave the way for further spatio-temporal studies, enabling the quantification of epidemics.
Submitted 10 July, 2024;
originally announced July 2024.
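For orientation, the classical (non-spatial) SIR ODE system that the abstract above takes as its starting point can be sketched as follows. This is only the textbook baseline, not the nonlocal PDE model of the paper. The function name `simulate_sir`, the parameter values, and the fixed-step RK4 integrator are illustrative assumptions, and the state variables are taken to be population fractions.

```python
def simulate_sir(beta, gamma, s0, i0, r0=0.0, dt=0.01, t_end=200.0):
    """Integrate the classical SIR ODEs with a fixed-step RK4 scheme.

    dS/dt = -beta*S*I,  dI/dt = beta*S*I - gamma*I,  dR/dt = gamma*I
    S, I, R are fractions of the population (assumption for this sketch).
    Returns a list of (t, S, I, R) tuples.
    """
    def rhs(s, i, r):
        new_infections = beta * s * i
        recoveries = gamma * i
        return (-new_infections, new_infections - recoveries, recoveries)

    n_steps = int(round(t_end / dt))
    state = (s0, i0, r0)
    trajectory = [(0.0,) + state]
    for k in range(1, n_steps + 1):
        k1 = rhs(*state)
        k2 = rhs(*(x + 0.5 * dt * d for x, d in zip(state, k1)))
        k3 = rhs(*(x + 0.5 * dt * d for x, d in zip(state, k2)))
        k4 = rhs(*(x + dt * d for x, d in zip(state, k3)))
        # Standard RK4 weighted average of the four slope estimates.
        state = tuple(
            x + dt * (a + 2.0 * b + 2.0 * c + d) / 6.0
            for x, a, b, c, d in zip(state, k1, k2, k3, k4)
        )
        trajectory.append((k * dt,) + state)
    return trajectory


if __name__ == "__main__":
    # Hypothetical parameters: basic reproduction number beta/gamma = 5.
    traj = simulate_sir(beta=0.5, gamma=0.1, s0=0.99, i0=0.01)
    t_peak, _, i_peak, _ = max(traj, key=lambda row: row[2])
    print(f"peak infected fraction {i_peak:.3f} at t = {t_peak:.1f}")
    print(f"final recovered fraction {traj[-1][3]:.3f}")
```

Because the three right-hand sides sum to zero, S + I + R is conserved, which gives a quick sanity check on any integrator used for this baseline.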
-
Modeling, Inference, and Prediction in Mobility-Based Compartmental Models for Epidemiology
Authors:
Ning Jiang,
Weiqi Chu,
Yao Li
Abstract:
Classical compartmental models in epidemiology often assume a homogeneous population for simplicity, which neglects the inherent heterogeneity among individuals. This assumption frequently leads to inaccurate predictions when applied to real-world data. For example, evidence has shown that classical models overestimate the final pandemic size in the H1N1-2009 and COVID-19 outbreaks. To address this issue, we introduce individual mobility as a key factor in disease transmission and control. We characterize disease dynamics using mobility distribution functions for each compartment and propose a mobility-based compartmental model that incorporates population heterogeneity. Our results demonstrate that, for the same basic reproduction number, our mobility-based model predicts a smaller final pandemic size compared to the classical models, effectively addressing the common overestimation problem. Additionally, we infer mobility distributions from the time series of the infected population. We provide sufficient conditions for uniquely identifying the mobility distribution from a dataset and propose a machine-learning-based approach to learn mobility from both synthesized and real-world data.
Submitted 6 September, 2024; v1 submitted 17 June, 2024;
originally announced June 2024.
-
Integrating Deep Learning and Synthetic Biology: A Co-Design Approach for Enhancing Gene Expression via N-terminal Coding Sequences
Authors:
Zhanglu Yan,
Weiran Chu,
Yuhua Sheng,
Kaiwen Tang,
Shida Wang,
Yanfeng Liu,
Weng-Fai Wong
Abstract:
The N-terminal coding sequence (NCS) influences gene expression by impacting the translation initiation rate. The NCS optimization problem is to find an NCS that maximizes gene expression. The problem is important in genetic engineering. However, current methods for NCS optimization, such as rational design and statistics-guided approaches, are labor-intensive and yield only relatively small improvements. This paper introduces a deep learning/synthetic biology co-designed few-shot training workflow for NCS optimization. Our method utilizes k-nearest encoding followed by word2vec to encode the NCS, then performs feature extraction using attention mechanisms, before constructing a time-series network for predicting gene expression intensity; finally, a direct search algorithm identifies the optimal NCS with limited training data. We took green fluorescent protein (GFP) expressed by Bacillus subtilis as a reporter protein for NCSs, and employed the fluorescence enhancement factor as the metric of NCS optimization. Within just six iterative experiments, our model generated an NCS (MLD62) that increased average GFP expression by 5.41-fold, outperforming the state-of-the-art NCS designs. Extending our findings beyond GFP, we showed that our engineered NCS (MLD62) can effectively boost the production of N-acetylneuraminic acid by enhancing the expression of the crucial rate-limiting GNA1 gene, demonstrating its practical utility. We have open-sourced our NCS expression database and experimental procedures for public use.
Submitted 20 February, 2024;
originally announced February 2024.
-
Uncertainty-Aware Self-supervised Neural Network for Liver $T_{1ρ}$ Mapping with Relaxation Constraint
Authors:
Chaoxing Huang,
Yurui Qian,
Simon Chun Ho Yu,
Jian Hou,
Baiyan Jiang,
Queenie Chan,
Vincent Wai-Sun Wong,
Winnie Chiu-Wing Chu,
Weitian Chen
Abstract:
$T_{1ρ}$ mapping is a promising quantitative MRI technique for the non-invasive assessment of tissue properties. Learning-based approaches can map $T_{1ρ}$ from a reduced number of $T_{1ρ}$-weighted images, but require significant amounts of high-quality training data. Moreover, existing methods do not provide the confidence level of the $T_{1ρ}$ estimation. To address these problems, we proposed a self-supervised learning neural network that learns a $T_{1ρ}$ mapping using the relaxation constraint in the learning process. Epistemic uncertainty and aleatoric uncertainty are modelled for the $T_{1ρ}$ quantification network to provide a Bayesian confidence estimation of the $T_{1ρ}$ mapping. The uncertainty estimation can also regularize the model to prevent it from learning imperfect data. We conducted experiments on $T_{1ρ}$ data collected from 52 patients with non-alcoholic fatty liver disease. The results showed that our method outperformed the existing methods for $T_{1ρ}$ quantification of the liver using as few as two $T_{1ρ}$-weighted images. Our uncertainty estimation provided a feasible way of modelling the confidence of the self-supervised learning based $T_{1ρ}$ estimation, which is consistent with the reality in liver $T_{1ρ}$ imaging.
Submitted 25 October, 2022; v1 submitted 7 July, 2022;
originally announced July 2022.
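To illustrate why two $T_{1ρ}$-weighted images can in principle suffice, here is a minimal sketch of a closed-form two-point estimate. It assumes the standard mono-exponential relaxation model S(TSL) = S0 * exp(-TSL / T1rho); the abstract above does not spell out its relaxation constraint, so this model, the function name `t1rho_two_point`, and the example values are assumptions. It is not the self-supervised network described in the paper.

```python
import math


def t1rho_two_point(signal_1, signal_2, tsl_1, tsl_2):
    """Estimate T1rho from two spin-lock-weighted signals.

    Assumes S(TSL) = S0 * exp(-TSL / T1rho), so that
    T1rho = (tsl_2 - tsl_1) / ln(signal_1 / signal_2).
    The result is in the same time units as the spin-lock times (TSL).
    """
    if signal_1 <= 0 or signal_2 <= 0:
        raise ValueError("signals must be positive")
    if tsl_1 == tsl_2:
        raise ValueError("spin-lock times must differ")
    log_ratio = math.log(signal_1 / signal_2)
    if log_ratio == 0.0:
        raise ValueError("signals are equal; no decay to fit")
    return (tsl_2 - tsl_1) / log_ratio


if __name__ == "__main__":
    # Synthetic, noise-free example with hypothetical values (milliseconds).
    true_t1rho, s0 = 40.0, 1000.0
    tsl_a, tsl_b = 0.0, 50.0
    sig_a = s0 * math.exp(-tsl_a / true_t1rho)
    sig_b = s0 * math.exp(-tsl_b / true_t1rho)
    print(t1rho_two_point(sig_a, sig_b, tsl_a, tsl_b))
```

A per-voxel estimate like this is sensitive to noise, which is one motivation for learned estimators that also report uncertainty.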
-
Investigations of the Underlying Mechanisms of HIF-1α and CITED2 Binding to TAZ1
Authors:
Wen-Ting Chu,
Xiakun Chu,
Jin Wang
Abstract:
The TAZ1 domain of CREB binding protein is crucial for transcriptional regulation and recognizes multiple targets. The interactions between TAZ1 and its specific targets are related to the cellular hypoxic negative feedback regulation. Previous experiments reported that one of the TAZ1 targets, CITED2, is an efficient competitor of another target, HIF-1α. Here, by developing structure-based models of TAZ1 complexes, we have uncovered the underlying mechanisms of the competition between HIF-1α and CITED2 for binding to TAZ1. Our results are consistent with the experimental hypothesis on the competition mechanisms and the apparent affinity. In addition, the simulations show that formation of the TAZ1-CITED2 complex is dominant in both thermodynamics and kinetics. For thermodynamics, TAZ1-CITED2 is the lowest basin on the free energy surface of binding in the ternary system. For kinetics, the results suggest that CITED2 binds to TAZ1 faster than HIF-1α. In addition, the analysis of the contact map and φ values in this study will be helpful for further experiments on TAZ1 systems.
Submitted 2 September, 2019;
originally announced September 2019.