Sequence-Augmented SE(3)-Flow Matching For Conditional Protein Backbone Generation

Huguet, Guillaume; Vuckovic, James; Fatras, Kilian; Thibodeau-Laufer, Eric; Lemos, Pablo; Islam, Riashat; Liu, Cheng-Hao; Rector-Brooks, Jarrid; Akhound-Sadegh, Tara; Bronstein, Michael; Tong, Alexander; Bose, Avishek Joey

arXiv:2405.20313 (cs)

[Submitted on 30 May 2024 (v1), last revised 11 Dec 2024 (this version, v2)]

Abstract: Proteins are essential for almost all biological processes and derive their diverse functions from complex 3D structures, which are in turn determined by their amino acid sequences. In this paper, we exploit the rich biological inductive bias of amino acid sequences and introduce FoldFlow-2, a novel sequence-conditioned SE(3)-equivariant flow matching model for protein structure generation. FoldFlow-2 adds substantial new architectural features over the previous FoldFlow family of models, including a protein large language model to encode sequence, a new multi-modal fusion trunk that combines structure and sequence representations, and a geometric transformer-based decoder. To increase the diversity and novelty of generated samples, which is crucial for de novo drug design, we train FoldFlow-2 at scale on a new dataset that is an order of magnitude larger than the PDB datasets of prior works, containing both known proteins in PDB and high-quality synthetic structures obtained through filtering. We further demonstrate the ability to align FoldFlow-2 to arbitrary rewards, e.g. increasing secondary structure diversity, by introducing a Reinforced Finetuning (ReFT) objective. We empirically observe that FoldFlow-2 outperforms previous state-of-the-art protein structure-based generative models, improving over RFDiffusion on unconditional generation across all metrics (designability, diversity, and novelty) and all protein lengths, as well as exhibiting generalization on the task of equilibrium conformation sampling. Finally, we demonstrate that a fine-tuned FoldFlow-2 makes progress on challenging conditional design tasks such as designing scaffolds for the VHH nanobody.

Comments:	Presented at NeurIPS 2024
Subjects:	Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Cite as:	arXiv:2405.20313 [cs.LG]
	(or arXiv:2405.20313v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2405.20313

Submission history

From: Guillaume Huguet
[v1] Thu, 30 May 2024 17:53:50 UTC (13,683 KB)
[v2] Wed, 11 Dec 2024 15:42:13 UTC (13,740 KB)
