-
Training Flexible Models of Genetic Variant Effects from Functional Annotations using Accelerated Linear Algebra
Authors:
Alan N. Amin,
Andres Potapczynski,
Andrew Gordon Wilson
Abstract:
To understand how genetic variants in human genomes manifest in phenotypes -- traits like height or diseases like asthma -- geneticists have sequenced and measured hundreds of thousands of individuals. Geneticists use this data to build models that predict how a genetic variant impacts phenotype given genomic features of the variant, like DNA accessibility or the presence of nearby DNA-bound prote…
▽ More
To understand how genetic variants in human genomes manifest in phenotypes -- traits like height or diseases like asthma -- geneticists have sequenced and measured hundreds of thousands of individuals. Geneticists use this data to build models that predict how a genetic variant impacts phenotype given genomic features of the variant, like DNA accessibility or the presence of nearby DNA-bound proteins. As more data and features become available, one might expect predictive models to improve. Unfortunately, training these models is bottlenecked by the need to solve expensive linear algebra problems because variants in the genome are correlated with nearby variants, requiring inversion of large matrices. Previous methods have therefore been restricted to fitting small models, and fitting simplified summary statistics, rather than the full likelihood of the statistical model. In this paper, we leverage modern fast linear algebra techniques to develop DeepWAS (Deep genome Wide Association Studies), a method to train large and flexible neural network predictive models to optimize likelihood. Notably, we find that larger models only improve performance when using our full likelihood approach; when trained by fitting traditional summary statistics, larger models perform no better than small ones. We find larger models trained on more features make better predictions, potentially improving disease predictions and therapeutic target identification.
△ Less
Submitted 28 June, 2025; v1 submitted 24 June, 2025;
originally announced June 2025.
-
Bayesian Optimization of Antibodies Informed by a Generative Model of Evolving Sequences
Authors:
Alan Nawzad Amin,
Nate Gruver,
Yilun Kuang,
Lily Li,
Hunter Elliott,
Calvin McCarter,
Aniruddh Raghu,
Peyton Greenside,
Andrew Gordon Wilson
Abstract:
To build effective therapeutics, biologists iteratively mutate antibody sequences to improve binding and stability. Proposed mutations can be informed by previous measurements or by learning from large antibody databases to predict only typical antibodies. Unfortunately, the space of typical antibodies is enormous to search, and experiments often fail to find suitable antibodies on a budget. We in…
▽ More
To build effective therapeutics, biologists iteratively mutate antibody sequences to improve binding and stability. Proposed mutations can be informed by previous measurements or by learning from large antibody databases to predict only typical antibodies. Unfortunately, the space of typical antibodies is enormous to search, and experiments often fail to find suitable antibodies on a budget. We introduce Clone-informed Bayesian Optimization (CloneBO), a Bayesian optimization procedure that efficiently optimizes antibodies in the lab by teaching a generative model how our immune system optimizes antibodies. Our immune system makes antibodies by iteratively evolving specific portions of their sequences to bind their target strongly and stably, resulting in a set of related, evolving sequences known as a clonal family. We train a large language model, CloneLM, on hundreds of thousands of clonal families and use it to design sequences with mutations that are most likely to optimize an antibody within the human immune system. We propose to guide our designs to fit previous measurements with a twisted sequential Monte Carlo procedure. We show that CloneBO optimizes antibodies substantially more efficiently than previous methods in realistic in silico experiments and designs stronger and more stable binders in in vitro wet lab experiments.
△ Less
Submitted 10 December, 2024;
originally announced December 2024.
-
Biological Sequence Kernels with Guaranteed Flexibility
Authors:
Alan Nawzad Amin,
Eli Nathan Weinstein,
Debora Susan Marks
Abstract:
Applying machine learning to biological sequences - DNA, RNA and protein - has enormous potential to advance human health, environmental sustainability, and fundamental biological understanding. However, many existing machine learning methods are ineffective or unreliable in this problem domain. We study these challenges theoretically, through the lens of kernels. Methods based on kernels are ubiq…
▽ More
Applying machine learning to biological sequences - DNA, RNA and protein - has enormous potential to advance human health, environmental sustainability, and fundamental biological understanding. However, many existing machine learning methods are ineffective or unreliable in this problem domain. We study these challenges theoretically, through the lens of kernels. Methods based on kernels are ubiquitous: they are used to predict molecular phenotypes, design novel proteins, compare sequence distributions, and more. Many methods that do not use kernels explicitly still rely on them implicitly, including a wide variety of both deep learning and physics-based techniques. While kernels for other types of data are well-studied theoretically, the structure of biological sequence space (discrete, variable length sequences), as well as biological notions of sequence similarity, present unique mathematical challenges. We formally analyze how well kernels for biological sequences can approximate arbitrary functions on sequence space and how well they can distinguish different sequence distributions. In particular, we establish conditions under which biological sequence kernels are universal, characteristic and metrize the space of distributions. We show that a large number of existing kernel-based machine learning methods for biological sequences fail to meet our conditions and can as a consequence fail severely. We develop straightforward and computationally tractable ways of modifying existing kernels to satisfy our conditions, imbuing them with strong guarantees on accuracy and reliability. Our proof techniques build on and extend the theory of kernels with discrete masses. We illustrate our theoretical results in simulation and on real biological data sets.
△ Less
Submitted 6 April, 2023;
originally announced April 2023.
-
Analytical Theory for Sequence-Specific Binary Fuzzy Complexes of Charged Intrinsically Disordered Proteins
Authors:
Alan N. Amin,
Yi-Hsuan Lin,
Suman Das,
Hue Sun Chan
Abstract:
Intrinsically disordered proteins (IDPs) are important for biological functions. In contrast to folded proteins, molecular recognition among certain IDPs is "fuzzy" in that their binding and/or phase separation are stochastically governed by the interacting IDPs' amino acid sequences while their assembled conformations remain largely disordered. To help elucidate a basic aspect of this fascinating…
▽ More
Intrinsically disordered proteins (IDPs) are important for biological functions. In contrast to folded proteins, molecular recognition among certain IDPs is "fuzzy" in that their binding and/or phase separation are stochastically governed by the interacting IDPs' amino acid sequences while their assembled conformations remain largely disordered. To help elucidate a basic aspect of this fascinating yet poorly understood phenomenon, the binding of a homo- or hetero-dimeric pair of polyampholytic IDPs is modeled statistical mechanically using cluster expansion. We find that the binding affinities of binary fuzzy complexes in the model correlate strongly with a newly derived simple "jSCD" parameter readily calculable from the pair of IDPs' sequence charge patterns. Predictions by our analytical theory are in essential agreement with coarse-grained explicit-chain simulations. This computationally efficient theoretical framework is expected to be broadly applicable to rationalizing and predicting sequence-specific IDP-IDP polyelectrostatic interactions.
△ Less
Submitted 7 July, 2020; v1 submitted 24 October, 2019;
originally announced October 2019.