-
Improving Protein Sequence Design through Designability Preference Optimization
Authors:
Fanglei Xue,
Andrew Kubaney,
Zhichun Guo,
Joseph K. Min,
Ge Liu,
Yi Yang,
David Baker
Abstract:
Protein sequence design methods have demonstrated strong performance in sequence generation for de novo protein design. However, because their training objective is sequence recovery, they do not guarantee designability--the likelihood that a designed sequence folds into the desired structure. To bridge this gap, we redefine the training objective by steering sequence generation toward high designability. To do this, we integrate Direct Preference Optimization (DPO), using AlphaFold pLDDT scores as the preference signal, which significantly improves the in silico design success rate. To further refine sequence generation at a finer, residue-level granularity, we introduce Residue-level Designability Preference Optimization (ResiDPO), which applies residue-level structural rewards and decouples optimization across residues. This enables direct improvement in designability while preserving regions that already perform well. Using a curated dataset with residue-level annotations, we fine-tune LigandMPNN with ResiDPO to obtain EnhancedMPNN, which achieves a nearly 3-fold increase in in silico design success rate (from 6.56% to 17.57%) on a challenging enzyme design benchmark.
Submitted 30 May, 2025;
originally announced June 2025.
-
A transport approach to relate asymmetric protein segregation and population growth
Authors:
Jiseon Min,
Ariel Amir
Abstract:
Many unicellular organisms allocate their key proteins asymmetrically between the mother and daughter cells, especially in a stressed environment. A recent theoretical model is able to predict when the asymmetry in segregation of key proteins enhances the population fitness, extrapolating the solution at two limits: when the segregation is perfectly asymmetric (asymmetry $a$ = 1) and when the asymmetry is small ($0 \leq a \ll 1$). We generalize the model by introducing stochasticity and use a transport equation to obtain a self-consistent equation for the population growth rate and the distribution of the amount of key proteins. We provide two ways of solving the self-consistent equation: numerically, by updating the solution iteratively, and analytically, by expanding moments of the distribution. With these more powerful tools, we can extend the previous model by Lin et al. to include stochasticity in the segregation asymmetry. We show the stochastic model is equivalent to the deterministic one with a modified effective asymmetry parameter ($a_{\rm eff}$). We discuss the biological implications of our models and compare them with other theoretical models.
Submitted 1 May, 2021; v1 submitted 24 December, 2020;
originally announced December 2020.
-
Non-genetic variability: survival strategy or nuisance?
Authors:
Ethan Levien,
Jiseon Min,
Jane Kondev,
Ariel Amir
Abstract:
The observation that phenotypic variability is ubiquitous in isogenic populations has led to a multitude of experimental and theoretical studies seeking to probe the causes and consequences of this variability. Whether it be in the context of antibiotic treatments or exponential growth in constant environments, non-genetic variability has been shown to have significant effects on population dynamics. Here, we review research that elucidates the relationship between cell-to-cell variability and population dynamics. After summarizing the relevant experimental observations, we discuss models of bet-hedging and phenotypic switching. In the context of these models, we discuss how switching between phenotypes at the single-cell level can help populations survive in uncertain environments. Next, we review more fine-grained models of phenotypic variability where the relationship between single-cell growth rates, generation times and cell sizes is explicitly considered. Variability in these traits can have significant effects on the population dynamics, even in a constant environment. We show how these effects can be highly sensitive to the underlying model assumptions. We close by discussing a number of open questions, such as how environmental and intrinsic variability interact and what the role of non-genetic variability in evolutionary dynamics is.
Submitted 12 October, 2020;
originally announced October 2020.
-
Inference of a Multi-Domain Machine Learning Model to Predict Mortality in Hospital Stays for Patients with Cancer upon Febrile Neutropenia Onset
Authors:
Xinsong Du,
Jae Min,
Mattia Prosperi,
Rohit Bishnoi,
Dominick J. Lemas,
Chintan P. Shah
Abstract:
Febrile neutropenia (FN) has been associated with high mortality, especially among adults with cancer. Understanding the patient- and provider-level heterogeneity in FN hospital admissions has the potential to inform personalized interventions focused on increasing survival of individuals with FN. We leverage machine learning techniques to disentangle the complex interactions among multi-domain risk factors in a population with FN. Data from the Healthcare Cost and Utilization Project (HCUP) National Inpatient Sample and Nationwide Inpatient Sample (NIS) were used to build machine-learning-based models of mortality for adult cancer patients who were diagnosed with FN during a hospital admission. In particular, the importance of risk factors from different domains (including demographic, clinical, and hospital-associated information) was studied. A set of more interpretable (decision tree, logistic regression) as well as more black-box (random forest, gradient boosting, neural networks) models were analyzed and compared via multiple cross-validation. Our results demonstrate that a linear prediction score of FN mortality among adults with cancer, based on admission information, is effective in classifying high-risk patients; clinical diagnoses form the domain with the highest predictive power. A number of the risk variables (e.g. sepsis, kidney failure, etc.) identified in this study are clinically actionable and may inform future studies; analyses of patients' prior medical history are warranted.
Submitted 27 May, 2019; v1 submitted 20 February, 2019;
originally announced February 2019.
-
Optimal segregation of proteins: phase transitions and symmetry breaking
Authors:
Jie Lin,
Jiseon Min,
Ariel Amir
Abstract:
Asymmetric segregation of key proteins at cell division -- be it a beneficial or deleterious protein -- is ubiquitous in unicellular organisms and often considered an evolved trait to increase fitness in a stressed environment. Here, we provide a general framework to describe the evolutionary origin of this asymmetric segregation. We compute the population fitness as a function of the protein segregation asymmetry $a$, and show that the value of $a$ which optimizes the population growth manifests a phase transition between symmetric and asymmetric partitioning phases. Surprisingly, the nature of the phase transition is different for the case of beneficial proteins as opposed to proteins which decrease the single-cell growth rate. Our study elucidates the optimization problem faced by evolution in the context of protein segregation, and motivates further investigation of asymmetric protein segregation in biological systems.
Submitted 24 April, 2018;
originally announced April 2018.
-
Exploring genetic variation in the tomato (Solanum section Lycopersicon) clade by whole-genome sequencing
Authors:
Saulo A. Aflitos,
Elio Schijlen,
Richard Finkers,
Sandra Smit,
Jun Wang,
Gengyun Zhang,
Ning Li,
Likai Mao,
Hans de Jong,
Freek Bakker,
Barbara Gravendeel,
Timo Breit,
Rob Dirks,
Henk Huits,
Darush Struss,
Ruth Wagner,
Hans van Leeuwen,
Roeland van Ham,
Laia Fito,
Laëtitia Guigner,
Myrna Sevilla,
Philippe Ellul,
Eric W. Ganko,
Arvind Kapur,
Emmanuel Reclus
, et al. (32 additional authors not shown)
Abstract:
Genetic variation in the tomato clade was explored by sequencing a selection of 84 tomato accessions and related wild species representative of the Lycopersicon, Arcanum, Eriopersicon, and Neolycopersicon groups. We present a reconstruction of three new reference genomes in support of our comparative genome analyses. Sequence diversity in commercial breeding lines appears extremely low, indicating the dramatic genetic erosion of crop tomatoes. This is reflected by the SNP count in wild species, which can exceed 10 million, i.e. 20-fold higher than in crop accessions. Comparative sequence alignment reveals group-, species-, and accession-specific polymorphisms, which explain characteristic fruit traits and growth habits in tomato accessions. Using gene models from the annotated Heinz reference genome, we observe a bias in dN/dS ratio in fruit and growth diversification genes compared to a random set of genes, which is probably the result of positive selection. We detected highly divergent segments in wild S. lycopersicum species, and footprints of introgressions in crop accessions originating from a common donor accession. Phylogenetic relationships of fruit diversification and growth-specific genes from crop accessions show incomplete resolution and are dependent on the introgression donor. In contrast, whole-genome SNP information has sufficient power to resolve the phylogenetic placement of each accession in the four main groups in the Lycopersicon clade using Maximum Likelihood analyses. Phylogenetic relationships appear correlated with habitat and mating type and point to the occurrence of geographical races within these groups and thus are of practical importance for introgressive hybridization breeding. Our study illustrates the need for multiple reference genomes in support of tomato comparative genomics and Solanum genome evolution studies.
Submitted 21 April, 2015;
originally announced April 2015.