-
Training Flexible Models of Genetic Variant Effects from Functional Annotations using Accelerated Linear Algebra
Authors:
Alan N. Amin,
Andres Potapczynski,
Andrew Gordon Wilson
Abstract:
To understand how genetic variants in human genomes manifest in phenotypes -- traits like height or diseases like asthma -- geneticists have sequenced and measured hundreds of thousands of individuals. Geneticists use this data to build models that predict how a genetic variant impacts phenotype given genomic features of the variant, like DNA accessibility or the presence of nearby DNA-bound proteins. As more data and features become available, one might expect predictive models to improve. Unfortunately, training these models is bottlenecked by the need to solve expensive linear algebra problems because variants in the genome are correlated with nearby variants, requiring inversion of large matrices. Previous methods have therefore been restricted to fitting small models, and fitting simplified summary statistics, rather than the full likelihood of the statistical model. In this paper, we leverage modern fast linear algebra techniques to develop DeepWAS (Deep genome Wide Association Studies), a method to train large and flexible neural network predictive models to optimize likelihood. Notably, we find that larger models only improve performance when using our full likelihood approach; when trained by fitting traditional summary statistics, larger models perform no better than small ones. We find larger models trained on more features make better predictions, potentially improving disease predictions and therapeutic target identification.
Submitted 28 June, 2025; v1 submitted 24 June, 2025;
originally announced June 2025.
-
Bayesian Optimization of Antibodies Informed by a Generative Model of Evolving Sequences
Authors:
Alan Nawzad Amin,
Nate Gruver,
Yilun Kuang,
Lily Li,
Hunter Elliott,
Calvin McCarter,
Aniruddh Raghu,
Peyton Greenside,
Andrew Gordon Wilson
Abstract:
To build effective therapeutics, biologists iteratively mutate antibody sequences to improve binding and stability. Proposed mutations can be informed by previous measurements or by learning from large antibody databases to predict only typical antibodies. Unfortunately, the space of typical antibodies is enormous to search, and experiments often fail to find suitable antibodies on a budget. We introduce Clone-informed Bayesian Optimization (CloneBO), a Bayesian optimization procedure that efficiently optimizes antibodies in the lab by teaching a generative model how our immune system optimizes antibodies. Our immune system makes antibodies by iteratively evolving specific portions of their sequences to bind their target strongly and stably, resulting in a set of related, evolving sequences known as a clonal family. We train a large language model, CloneLM, on hundreds of thousands of clonal families and use it to design sequences with mutations that are most likely to optimize an antibody within the human immune system. We propose to guide our designs to fit previous measurements with a twisted sequential Monte Carlo procedure. We show that CloneBO optimizes antibodies substantially more efficiently than previous methods in realistic in silico experiments and designs stronger and more stable binders in in vitro wet lab experiments.
Submitted 10 December, 2024;
originally announced December 2024.
-
Plug-and-Play Stability for Intracortical Brain-Computer Interfaces: A One-Year Demonstration of Seamless Brain-to-Text Communication
Authors:
Chaofei Fan,
Nick Hahn,
Foram Kamdar,
Donald Avansino,
Guy H. Wilson,
Leigh Hochberg,
Krishna V. Shenoy,
Jaimie M. Henderson,
Francis R. Willett
Abstract:
Intracortical brain-computer interfaces (iBCIs) have shown promise for restoring rapid communication to people with neurological disorders such as amyotrophic lateral sclerosis (ALS). However, to maintain high performance over time, iBCIs typically need frequent recalibration to combat changes in the neural recordings that accrue over days. This requires iBCI users to stop using the iBCI and engage in supervised data collection, making the iBCI system hard to use. In this paper, we propose a method that enables self-recalibration of communication iBCIs without interrupting the user. Our method leverages large language models (LMs) to automatically correct errors in iBCI outputs. The self-recalibration process uses these corrected outputs ("pseudo-labels") to continually update the iBCI decoder online. Over a period of more than one year (403 days), we evaluated our Continual Online Recalibration with Pseudo-labels (CORP) framework with one clinical trial participant. CORP achieved a stable decoding accuracy of 93.84% in an online handwriting iBCI task, significantly outperforming other baseline methods. Notably, this is the longest-running iBCI stability demonstration involving a human participant. Our results provide the first evidence for long-term stabilization of a plug-and-play, high-performance communication iBCI, addressing a major barrier for the clinical translation of iBCIs.
Submitted 6 November, 2023;
originally announced November 2023.
-
The carnivorous plant Genlisea harnesses active particle dynamics to prey on microfauna
Authors:
José Martín-Roca,
C. Miguel Barriuso-Gutiérrez,
Raúl Martínez Fernández,
Camila Betterelli Giuliano,
Rongjing Zhang,
Chantal Valeriani,
Laurence G. Wilson
Abstract:
Carnivory in plants is an unusual trait that has arisen multiple times, independently, throughout evolutionary history. Plants in the genus Genlisea are carnivorous, and feed on microorganisms that live in soil using modified subterranean leaf structures (rhizophylls). A surprisingly broad array of microfauna has been observed in the plants' digestive chambers, including ciliates, amoebae and soil mites. Here we show, through experiments and simulations, that Genlisea exploit active matter physics to 'rectify' bacterial swimming and establish a local flux of bacteria through the structured environment of the rhizophyll towards the plant's digestion vesicle. In contrast, macromolecular digestion products are free to diffuse away from the digestion vesicle and establish a concentration gradient of carbon sources to draw larger microorganisms further inside the plant. Our experiments and simulations show that this mechanism is likely to be a localised one, and that no large-scale efflux of digested matter is present.
Submitted 12 October, 2023;
originally announced October 2023.
-
Protein Design with Guided Discrete Diffusion
Authors:
Nate Gruver,
Samuel Stanton,
Nathan C. Frey,
Tim G. J. Rudner,
Isidro Hotzel,
Julien Lafrance-Vanasse,
Arvind Rajpal,
Kyunghyun Cho,
Andrew Gordon Wilson
Abstract:
A popular approach to protein design is to combine a generative model with a discriminative model for conditional sampling. The generative model samples plausible sequences while the discriminative model guides a search for sequences with high fitness. Given its broad success in conditional sampling, classifier-guided diffusion modeling is a promising foundation for protein design, leading many to develop guided diffusion models for structure with inverse folding to recover sequences. In this work, we propose diffusioN Optimized Sampling (NOS), a guidance method for discrete diffusion models that follows gradients in the hidden states of the denoising network. NOS makes it possible to perform design directly in sequence space, circumventing significant limitations of structure-based methods, including scarce data and challenging inverse design. Moreover, we use NOS to generalize LaMBO, a Bayesian optimization procedure for sequence design that facilitates multiple objectives and edit-based constraints. The resulting method, LaMBO-2, enables discrete diffusions and stronger performance with limited edits through a novel application of saliency maps. We apply LaMBO-2 to a real-world protein design task, optimizing antibodies for higher expression yield and binding affinity to several therapeutic targets under locality and developability constraints, attaining a 99% expression rate and 40% binding rate in exploratory in vitro experiments.
Submitted 12 December, 2023; v1 submitted 31 May, 2023;
originally announced May 2023.
-
Accelerating Bayesian Optimization for Biological Sequence Design with Denoising Autoencoders
Authors:
Samuel Stanton,
Wesley Maddox,
Nate Gruver,
Phillip Maffettone,
Emily Delaney,
Peyton Greenside,
Andrew Gordon Wilson
Abstract:
Bayesian optimization (BayesOpt) is a gold standard for query-efficient continuous optimization. However, its adoption for drug design has been hindered by the discrete, high-dimensional nature of the decision variables. We develop a new approach (LaMBO) which jointly trains a denoising autoencoder with a discriminative multi-task Gaussian process head, allowing gradient-based optimization of multi-objective acquisition functions in the latent space of the autoencoder. These acquisition functions allow LaMBO to balance the explore-exploit tradeoff over multiple design rounds, and to balance objective tradeoffs by optimizing sequences at many different points on the Pareto frontier. We evaluate LaMBO on two small-molecule design tasks, and introduce new tasks optimizing in silico and in vitro properties of large-molecule fluorescent proteins. In our experiments LaMBO outperforms genetic optimizers and does not require a large pretraining corpus, demonstrating that BayesOpt is practical and effective for biological sequence design.
Submitted 12 July, 2022; v1 submitted 23 March, 2022;
originally announced March 2022.
-
Quantification of tumour evolution and heterogeneity via Bayesian epiallele detection
Authors:
James E. Barrett,
Andrew Feber,
Javier Herrero,
Miljana Tanic,
Gareth Wilson,
Charles Swanton,
Stephan Beck
Abstract:
Motivation: Epigenetic heterogeneity within a tumour can play an important role in tumour evolution and the emergence of resistance to treatment. It is increasingly recognised that the study of DNA methylation (DNAm) patterns along the genome -- so-called 'epialleles' -- offers greater insight into epigenetic dynamics than conventional analyses which examine DNAm marks individually.
Results: We have developed a Bayesian model to infer which epialleles are present in multiple regions of the same tumour. We apply our method to reduced representation bisulfite sequencing (RRBS) data from multiple regions of one lung cancer tumour and a matched normal sample. The model borrows information from all tumour regions to leverage greater statistical power. The total number of epialleles, the epiallele DNAm patterns, and a noise hyperparameter are all automatically inferred from the data. Uncertainty as to which epiallele an observed sequencing read originated from is explicitly incorporated by marginalising over the appropriate posterior densities. The degree to which tumour samples are contaminated with normal tissue can be estimated and corrected for. By tracing the distribution of epialleles throughout the tumour we can infer the phylogenetic history of the tumour, identify epialleles that differ between normal and cancer tissue, and define a measure of global epigenetic disorder.
Submitted 20 February, 2017; v1 submitted 2 February, 2017;
originally announced February 2017.
-
Differential Dynamic Microscopy of Bacterial Motility
Authors:
Laurence G. Wilson,
Vincent A. Martinez,
Jana Schwarz-Linek,
J. Tailleur,
Peter N. Pusey,
Gary Bryant,
Wilson C. K. Poon
Abstract:
We demonstrate 'differential dynamic microscopy' (DDM) for the fast, high throughput characterization of the dynamics of active particles. Specifically, we characterize the swimming speed distribution and the fraction of motile cells in suspensions of Escherichia coli bacteria. By averaging over ~10^4 cells, our results are highly accurate compared to conventional tracking. The diffusivity of non-motile cells is enhanced by an amount proportional to the concentration of motile cells.
Submitted 1 October, 2010; v1 submitted 27 April, 2010;
originally announced April 2010.
-
Atmospheric Consequences of Cosmic Ray Variability in the Extragalactic Shock Model II: Revised ionization levels and their consequences
Authors:
A. L. Melott,
D. Atri,
B. C. Thomas,
M. V. Medvedev,
G. W. Wilson,
M. J. Murray
Abstract:
It has been suggested that galactic shock asymmetry induced by our galaxy's infall toward the Virgo Cluster may be a source of periodicity in cosmic ray exposure as the solar system oscillates perpendicular to the galactic plane. Here we investigate a mechanism by which cosmic rays might affect terrestrial biodiversity: ionization and dissociation in the atmosphere, leading to depletion of ozone and a resulting increase in the dangerous solar UVB flux on the ground. We use an improved ionization background computation averaged over a massive ensemble (about 7 x 10^5) of shower simulations. We study minimal and full exposure to the postulated extragalactic background. The atmospheric effects are greater than with our earlier, simplified ionization model. At the lower end of the range, effects are too small to be of serious consequence. At the upper end of the range, the ~6% global average loss of ozone column density exceeds that currently experienced due to effects such as accumulated chlorofluorocarbons. The intensity is less than that of a nearby supernova or galactic gamma-ray burst, but the duration would be about 10^6 times longer. Present UVB enhancement from current ozone depletion (~3%) is a documented stress on the biosphere, but a depletion of the magnitude found at the upper end of our range would double the global average UVB flux. For estimates at the upper end of the range of the cosmic ray variability over geologic time, the mechanism of atmospheric ozone depletion may provide a major biological stress, which could easily bring about major loss of biodiversity. Future high-energy astrophysical observations will resolve the question of whether such depletion is likely.
Submitted 5 March, 2010; v1 submitted 6 August, 2008;
originally announced August 2008.