-
ScaleFold: Reducing AlphaFold Initial Training Time to 10 Hours
Authors:
Feiwen Zhu,
Arkadiusz Nowaczynski,
Rundong Li,
Jie Xin,
Yifei Song,
Michal Marcinkiewicz,
Sukru Burc Eryilmaz,
Jun Yang,
Michael Andersch
Abstract:
AlphaFold2 has been hailed as a breakthrough in protein folding. It can rapidly predict protein structures with lab-grade accuracy. However, its implementation does not include the necessary training code. OpenFold is the first trainable public reimplementation of AlphaFold. The AlphaFold training procedure is prohibitively time-consuming and yields diminishing benefits when scaled to more compute resources. In this work, we conducted a comprehensive analysis of the AlphaFold training procedure based on OpenFold and identified inefficient communications and overhead-dominated computations as the key factors that prevented AlphaFold training from scaling effectively. We introduced ScaleFold, a systematic training method that incorporated optimizations specifically for these factors. ScaleFold successfully scaled AlphaFold training to 2080 NVIDIA H100 GPUs with high resource utilization. In the MLPerf HPC v3.0 benchmark, ScaleFold finished the OpenFold benchmark in 7.51 minutes, a speedup of over $6\times$ relative to the baseline. For training the AlphaFold model from scratch, ScaleFold completed the pretraining in 10 hours, a significant improvement over the seven days required by the original AlphaFold pretraining baseline.
Submitted 17 April, 2024;
originally announced April 2024.
-
Pro-PRIME: A general Temperature-Guided Language model to engineer enhanced Stability and Activity in Proteins
Authors:
Fan Jiang,
Mingchen Li,
Jiajun Dong,
Yuanxi Yu,
Xinyu Sun,
Banghao Wu,
Jin Huang,
Liqi Kang,
Yufeng Pei,
Liang Zhang,
Shaojie Wang,
Wenxue Xu,
Jingyao Xin,
Wanli Ouyang,
Guisheng Fan,
Lirong Zheng,
Yang Tan,
Zhiqiang Hu,
Yi Xiong,
Yan Feng,
Guangyu Yang,
Qian Liu,
Jie Song,
Jia Liu,
Liang Hong
, et al. (1 additional author not shown)
Abstract:
Designing protein mutants of both high stability and activity is a critical yet challenging task in protein engineering. Here, we introduce PRIME, a deep learning model that can suggest protein mutants of improved stability and activity without any prior experimental mutagenesis data for the specified protein. Leveraging temperature-aware language modeling, PRIME demonstrated superior predictive power compared to current state-of-the-art models on the public mutagenesis dataset spanning 283 protein assays. Furthermore, we validated PRIME's predictions on five proteins, examining the impact of the top 30-45 single-site mutations on various protein properties, including thermal stability, antigen-antibody binding affinity, and the ability to polymerize non-natural nucleic acids or resilience to extreme alkaline conditions. Remarkably, over 30% of the AI-recommended mutants exhibited superior performance compared to their pre-mutation counterparts across all proteins and desired properties. Moreover, we have developed an efficient and successful method based on PRIME to rapidly obtain multi-site mutants with enhanced activity and stability. Hence, PRIME demonstrates general applicability in protein engineering.
Submitted 27 October, 2024; v1 submitted 24 July, 2023;
originally announced July 2023.
-
MIST-CF: Chemical formula inference from tandem mass spectra
Authors:
Samuel Goldman,
Jiayi Xin,
Joules Provenzano,
Connor W. Coley
Abstract:
Chemical formula annotation for tandem mass spectrometry (MS/MS) data is the first step toward structurally elucidating unknown metabolites. While great strides have been made toward solving this problem, the current state-of-the-art method depends on time-intensive, proprietary, and expert-parameterized fragmentation tree construction and scoring. In this work we extend our previous spectrum Transformer methodology into an energy-based modeling framework, MIST-CF, for learning to rank chemical formula and adduct assignments given an unannotated MS/MS spectrum. Importantly, MIST-CF learns in a data-dependent fashion using a Formula Transformer neural network architecture and circumvents the need for fragmentation tree construction. We train and evaluate our model on a large open-access database, showing an absolute improvement of 10% in top-1 accuracy over other neural network architectures. We further validate our approach on the CASMI2022 challenge dataset, achieving nearly equivalent performance to the winning entry within the positive mode category without any manual curation or post-processing of our results. These results demonstrate an exciting strategy to more powerfully leverage MS2 fragment peaks for predicting MS1 precursor chemical formulae with data-driven learning.
Submitted 17 July, 2023;
originally announced July 2023.
-
BioThings Explorer: a query engine for a federated knowledge graph of biomedical APIs
Authors:
Jackson Callaghan,
Colleen H. Xu,
Jiwen Xin,
Marco Alvarado Cano,
Anders Riutta,
Eric Zhou,
Rohan Juneja,
Yao Yao,
Madhumita Narayan,
Kristina Hanspers,
Ayushi Agrawal,
Alexander R. Pico,
Chunlei Wu,
Andrew I. Su
Abstract:
Knowledge graphs are an increasingly common data structure for representing biomedical information. These knowledge graphs can easily represent heterogeneous types of information, and many algorithms and tools exist for querying and analyzing graphs. Biomedical knowledge graphs have been used in a variety of applications, including drug repurposing, identification of drug targets, prediction of drug side effects, and clinical decision support. Typically, knowledge graphs are constructed by centralization and integration of data from multiple disparate sources. Here, we describe BioThings Explorer, an application that can query a virtual, federated knowledge graph derived from the aggregated information in a network of biomedical web services. BioThings Explorer leverages semantically precise annotations of the inputs and outputs for each resource, and automates the chaining of web service calls to execute multi-step graph queries. Because there is no large, centralized knowledge graph to maintain, BioThings Explorer is distributed as a lightweight application that dynamically retrieves information at query time. More information can be found at https://explorer.biothings.io, and code is available at https://github.com/biothings/biothings_explorer.
Submitted 18 April, 2023;
originally announced April 2023.
-
Prefix-Tree Decoding for Predicting Mass Spectra from Molecules
Authors:
Samuel Goldman,
John Bradshaw,
Jiayi Xin,
Connor W. Coley
Abstract:
Computational predictions of mass spectra from molecules have enabled the discovery of clinically relevant metabolites. However, such predictive tools are still limited as they occupy one of two extremes, either operating (a) by fragmenting molecules combinatorially with overly rigid constraints on potential rearrangements and poor time complexity or (b) by decoding lossy and nonphysical discretized spectra vectors. In this work, we use a new intermediate strategy for predicting mass spectra from molecules by treating mass spectra as sets of molecular formulae, which are themselves multisets of atoms. After first encoding an input molecular graph, we decode a set of molecular subformulae, each of which specifies a predicted peak in the mass spectrum, the intensities of which are predicted by a second model. Our key insight is to overcome the combinatorial possibilities for molecular subformulae by decoding the formula set using a prefix tree structure, atom-type by atom-type, representing a general method for ordered multiset decoding. We show promising empirical results on mass spectra prediction tasks.
Submitted 3 December, 2023; v1 submitted 11 March, 2023;
originally announced March 2023.
-
Retrieved Sequence Augmentation for Protein Representation Learning
Authors:
Chang Ma,
Haiteng Zhao,
Lin Zheng,
Jiayi Xin,
Qintong Li,
Lijun Wu,
Zhihong Deng,
Yang Lu,
Qi Liu,
Lingpeng Kong
Abstract:
Protein language models have excelled in a variety of tasks, ranging from structure prediction to protein engineering. However, proteins are highly diverse in functions and structures, and current state-of-the-art models including the latest version of AlphaFold rely on Multiple Sequence Alignments (MSA) to feed in the evolutionary knowledge. Despite their success, heavy computational overheads, as well as de novo and orphan proteins, remain great challenges in protein representation learning. In this work, we show that MSA-augmented models inherently belong to retrieval-augmented methods. Motivated by this finding, we introduce Retrieved Sequence Augmentation (RSA) for protein representation learning without additional alignment or pre-processing. RSA links query protein sequences to a set of sequences with similar structures or properties in the database and combines these sequences for downstream prediction. We show that protein language models benefit from the retrieval enhancement on both structure prediction and property prediction tasks, with a 5% improvement on MSA Transformer on average while being 373 times faster. In addition, we show that our model can transfer to new protein domains better and outperforms MSA Transformer on de novo protein prediction. Our study fills a much-encountered gap in protein prediction and brings us a step closer to demystifying the domain knowledge needed to understand protein sequences. Code is available on https://github.com/HKUNLP/RSA.
Submitted 24 February, 2023;
originally announced February 2023.
-
An integrated recurrent neural network and regression model with spatial and climatic couplings for vector-borne disease dynamics
Authors:
Zhijian Li,
Jack Xin,
Guofa Zhou
Abstract:
We developed an integrated recurrent neural network and nonlinear regression spatio-temporal model for vector-borne disease evolution. We take into account climate data and seasonality as external factors that correlate with disease-transmitting insects (e.g. flies), as well as spill-over infections from neighboring regions surrounding a region of interest. The climate data is encoded into the model through a quadratic embedding scheme motivated by recommendation systems. The neighboring regions' influence is modeled by a long short-term memory neural network. The integrated model is trained by stochastic gradient descent and tested on leishmaniasis data in Sri Lanka from 2013-2018, where infection outbreaks occurred. Our model outperformed ARIMA models across a number of regions with high infections, and an associated ablation study lends support to our modeling hypothesis and ideas.
Submitted 23 January, 2022;
originally announced January 2022.
-
A Recurrent Neural Network and Differential Equation Based Spatiotemporal Infectious Disease Model with Application to COVID-19
Authors:
Zhijian Li,
Yunling Zheng,
Jack Xin,
Guofa Zhou
Abstract:
The outbreaks of Coronavirus Disease 2019 (COVID-19) have impacted the world significantly. Modeling the trend of infection and real-time forecasting of cases can help decision making and control of the disease spread. However, data-driven methods such as recurrent neural networks (RNN) can perform poorly due to limited daily samples in time. In this work, we develop an integrated spatiotemporal model based on the epidemic differential equations (SIR) and RNN. The former, after simplification and discretization, is a compact model of the temporal infection trend of a region, while the latter models the effect of nearest neighboring regions. The latter captures latent spatial information. We trained and tested our model on COVID-19 data in Italy, and show that it outperforms existing temporal models (fully connected NN, SIR, ARIMA) in 1-day, 3-day, and 1-week-ahead forecasting, especially in the regime of limited training data.
Submitted 17 September, 2020; v1 submitted 14 July, 2020;
originally announced July 2020.
-
Signal processing of acoustic signals in the time domain with an active nonlinear nonlocal cochlear model
Authors:
M. Drew LaMar,
J. Xin,
Y. Qi
Abstract:
A two-space-dimensional active nonlinear nonlocal cochlear model is formulated in the time domain to capture nonlinear hearing effects such as compression, multi-tone suppression and difference tones. The micromechanics of the basilar membrane (BM) are incorporated to model active cochlear properties. An active gain parameter is constructed in the form of a nonlinear nonlocal functional of BM displacement. The model is discretized with a boundary integral method and numerically solved using an iterative second-order accurate finite difference scheme. A block matrix structure of the discrete system is exploited to simplify the numerics with no loss of accuracy. Model responses to multiple frequency stimuli are shown to be in agreement with hearing experiments. A nonlinear spectrum is computed from the model and compared with the FFT spectrum for noisy tonal inputs. The discretized model is efficient and accurate, and can serve as a useful auditory signal processing tool.
Submitted 6 July, 2010; v1 submitted 15 November, 2004;
originally announced November 2004.