S$^2$ALM: Sequence-Structure Pre-trained Large Language Model for Comprehensive Antibody Representation Learning

Yin, Mingze; Zhou, Hanjing; Wu, Jialu; Zhu, Yiheng; Zhan, Yuxuan; Kong, Zitai; Xu, Hongxia; Hsieh, Chang-Yu; Chen, Jintai; Hou, Tingjun; Wu, Jian

Computer Science > Machine Learning

arXiv:2411.15215 (cs)

COVID-19 e-print

Important: e-prints posted on arXiv are not peer-reviewed by arXiv; they should not be relied upon without context to guide clinical practice or health-related behavior and should not be reported in news media as established information without consulting multiple experts in the field.

[Submitted on 20 Nov 2024]

Title:S$^2$ALM: Sequence-Structure Pre-trained Large Language Model for Comprehensive Antibody Representation Learning

Authors:Mingze Yin, Hanjing Zhou, Jialu Wu, Yiheng Zhu, Yuxuan Zhan, Zitai Kong, Hongxia Xu, Chang-Yu Hsieh, Jintai Chen, Tingjun Hou, Jian Wu

View PDF HTML (experimental)

Abstract:Antibodies safeguard our health through their precise and potent binding to specific antigens, demonstrating promising therapeutic efficacy in the treatment of numerous diseases, including COVID-19. Recent advancements in biomedical language models have shown the great potential to interpret complex biological structures and functions. However, existing antibody specific models have a notable limitation that they lack explicit consideration for antibody structural information, despite the fact that both 1D sequence and 3D structure carry unique and complementary insights into antibody behavior and functionality. This paper proposes Sequence-Structure multi-level pre-trained Antibody Language Model (S$^2$ALM), combining holistic sequential and structural information in one unified, generic antibody foundation model. We construct a hierarchical pre-training paradigm incorporated with two customized multi-level training objectives to facilitate the modeling of comprehensive antibody representations. S$^2$ALM's representation space uncovers inherent functional binding mechanisms, biological evolution properties and structural interaction patterns. Pre-trained over 75 million sequences and 11.7 million structures, S$^2$ALM can be adopted for diverse downstream tasks: accurately predicting antigen-antibody binding affinities, precisely distinguishing B cell maturation stages, identifying antibody crucial binding positions, and specifically designing novel coronavirus-binding antibodies. Remarkably, S$^2$ALM outperforms well-established and renowned baselines and sets new state-of-the-art performance across extensive antibody specific understanding and generation tasks. S$^2$ALM's ability to model comprehensive and generalized representations further positions its potential to advance real-world therapeutic antibody development, potentially addressing unmet academic, industrial, and clinical needs.

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
Cite as:	arXiv:2411.15215 [cs.LG]
	(or arXiv:2411.15215v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2411.15215

Submission history

From: Hanjing Zhou [view email]
[v1] Wed, 20 Nov 2024 14:24:26 UTC (5,382 KB)

Computer Science > Machine Learning

Title:S$^2$ALM: Sequence-Structure Pre-trained Large Language Model for Comprehensive Antibody Representation Learning

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Machine Learning

Title:S$^2$ALM: Sequence-Structure Pre-trained Large Language Model for Comprehensive Antibody Representation Learning

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators