Large-Scale Multi-omic Biosequence Transformers for Modeling Protein-Nucleic Acid Interactions

Chen, Sully F.; Steele, Robert J.; Hocky, Glen M.; Lemeneh, Beakal; Lad, Shivanand P.; Oermann, Eric K.

arXiv:2408.16245v3 (cs)

[Submitted on 29 Aug 2024 (v1), revised 1 Apr 2025 (this version, v3), latest version 18 Jun 2025 (v5)]

Abstract: The transformer architecture has revolutionized bioinformatics and driven progress in the understanding and prediction of the properties of biomolecules. Almost all research on large-scale biosequence transformers has focused on one domain at a time (single-omic), usually DNA/RNA or proteins. These models have seen incredible success in downstream tasks in each domain, and have achieved particularly noteworthy breakthroughs in sequence modeling and structural modeling. However, these single-omic models are naturally incapable of efficiently modeling multi-omic tasks, one of the most biologically critical being protein-nucleic acid interactions. We present our work training the largest open-source multi-omic foundation model to date. We show that these multi-omic models (MOMs) can learn joint representations between various single-omic distributions that are emergently consistent with the Central Dogma of molecular biology despite only being trained on unlabeled biosequences. We further demonstrate that MOMs can be fine-tuned to achieve state-of-the-art results on protein-nucleic acid interaction tasks, namely predicting the change in Gibbs free energy ($\Delta G$) of the binding interaction between a given nucleic acid and protein. Remarkably, we show that multi-omic biosequence transformers emergently learn useful structural information without any \textit{a priori} structural training, allowing us to predict which protein residues are most involved in the protein-nucleic acid binding interaction. Lastly, we provide evidence that multi-omic biosequence models are in many cases superior to foundation models trained on single-omic distributions, both in performance-per-FLOP and absolute performance, suggesting a more generalized or foundational approach to building these models for biology.

Comments:	39 pages, 5 figures
Subjects:	Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Cite as:	arXiv:2408.16245 [cs.LG]
	(or arXiv:2408.16245v3 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2408.16245

Submission history

From: Sully Chen
[v1] Thu, 29 Aug 2024 03:56:40 UTC (2,165 KB)
[v2] Fri, 27 Sep 2024 06:09:41 UTC (2,188 KB)
[v3] Tue, 1 Apr 2025 17:10:17 UTC (4,145 KB)
[v4] Tue, 3 Jun 2025 07:17:19 UTC (3,793 KB)
[v5] Wed, 18 Jun 2025 06:10:32 UTC (4,229 KB)
