Search | arXiv e-print repository

Aligning Proteins and Language: A Foundation Model for Protein Retrieval

Authors: Qifeng Wu, Zhengzhe Liu, Han Zhu, Yizhou Zhao, Daisuke Kihara, Min Xu

Abstract: This paper aims to retrieve proteins with similar structures and semantics from a large-scale protein dataset, facilitating the functional interpretation of protein structures derived by structural determination methods like cryo-Electron Microscopy (cryo-EM). Motivated by the recent progress of vision-language models (VLMs), we propose a CLIP-style framework for aligning 3D protein structures with functional annotations using contrastive learning. For model training, we propose a large-scale dataset of approximately 200,000 protein-caption pairs with rich functional descriptors. We evaluate our model in both in-domain and more challenging cross-database retrieval on the Protein Data Bank (PDB) and Electron Microscopy Data Bank (EMDB) datasets, respectively. In both cases, our approach demonstrates promising zero-shot retrieval performance, highlighting the potential of multimodal foundation models for structure-function understanding in protein biology.

Submitted 27 May, 2025; originally announced June 2025.

Comments: 4 pages for body, 3 pages for appendix, 11 figures. Accepted to CVPR 2025 Workshop on Multimodal Foundation Models for Biomedicine: Challenges and Opportunities (MMFM-BIOMED)

arXiv:2206.06035 [pdf, other]

doi 10.1016/j.cag.2022.07.005

SHREC 2022: Protein-ligand binding site recognition

Authors: Luca Gagliardi, Andrea Raffo, Ulderico Fugacci, Silvia Biasotti, Walter Rocchia, Hao Huang, Boulbaba Ben Amor, Yi Fang, Yuanyuan Zhang, Xiao Wang, Charles Christoffer, Daisuke Kihara, Apostolos Axenopoulos, Stelios Mylonas, Petros Daras

Abstract: This paper presents the methods that have participated in the SHREC 2022 contest on protein-ligand binding site recognition. The prediction of protein-ligand binding regions is an active research domain in computational biophysics and structural biology and plays a relevant role in molecular docking and drug design. The goal of the contest is to assess the effectiveness of computational methods in recognizing ligand binding sites in a protein based on its geometrical structure. Performances of the segmentation algorithms are analyzed according to two evaluation scores describing the capacity of a putative pocket to contact a ligand and to pinpoint the correct binding region. Although some methods perform remarkably well, we show that simple non-machine-learning approaches remain very competitive against data-driven algorithms. In general, the task of pocket detection remains a challenging learning problem which suffers from intrinsic difficulties due to the lack of negative examples (data imbalance problem).

Submitted 24 August, 2022; v1 submitted 13 June, 2022; originally announced June 2022.

Journal ref: Computers & Graphics 107 (2022) 20-31

arXiv:2110.12040 [pdf]

Using Steered Molecular Dynamic Tension for Assessing Quality of Computational Protein Structure Models

Authors: Lyman Monroe, Daisuke Kihara

Abstract: The native structures of proteins, except for notable exceptions of intrinsically disordered proteins, in general take their most stable conformation in the physiological condition to maintain their structural framework so that their biological function can be properly carried out. Experimentally, the stability of a protein can be measured by several means, among which the pulling experiment using the atomic force microscope (AFM) stands as a unique method. AFM directly measures the resistance from unfolding, which can be quantified from the observed force-extension profile. It has been shown that key features observed in an AFM pulling experiment can be well reproduced by computational molecular dynamics simulations. Here, we applied computational pulling for estimating the accuracy of computational protein structure models under the hypothesis that the structural stability would be positively correlated with the accuracy, i.e., the closeness to the native, of a model. We used in total 4,929 structure models for 24 target proteins from the Critical Assessment of Techniques of Structure Prediction (CASP) and investigated if the magnitude of the break force, i.e., the force required to rearrange the model structure, from the force profile was sufficient information for selecting near-native models. We found that near-native models can be successfully selected by examining their break forces, suggesting that high break force indeed indicates high stability of models. On the other hand, there were also near-native models that had relatively low peak forces. The mechanisms of the stability exhibited by the break forces were explored and discussed.

Submitted 22 October, 2021; originally announced October 2021.

arXiv:2110.07609 [pdf]

Application of Sequence Embedding in Protein Sequence-Based Predictions

Authors: Nabil Ibtehaz, Daisuke Kihara

Abstract: In sequence-based predictions, conventionally an input sequence is represented by a multiple sequence alignment (MSA) or a representation derived from MSA, such as a position-specific scoring matrix. Recently, inspired by the development in natural language processing, several applications of sequence embedding have been observed. Here, we review different approaches to protein sequence embeddings and their applications, including protein contact prediction, secondary structure prediction, and function prediction.

Submitted 14 October, 2021; originally announced October 2021.

arXiv:2105.05221 [pdf, other]

doi 10.1016/j.cag.2021.06.010

SHREC 2021: Retrieval and classification of protein surfaces equipped with physical and chemical properties

Authors: Andrea Raffo, Ulderico Fugacci, Silvia Biasotti, Walter Rocchia, Yonghuai Liu, Ekpo Otu, Reyer Zwiggelaar, David Hunter, Evangelia I. Zacharaki, Eleftheria Psatha, Dimitrios Laskos, Gerasimos Arvanitis, Konstantinos Moustakas, Tunde Aderinwale, Charles Christoffer, Woong-Hee Shin, Daisuke Kihara, Andrea Giachetti, Huu-Nghia Nguyen, Tuan-Duy Nguyen, Vinh-Thuyen Nguyen-Truong, Danh Le-Thanh, Hai-Dang Nguyen, Minh-Triet Tran

Abstract: This paper presents the methods that have participated in the SHREC 2021 contest on retrieval and classification of protein surfaces on the basis of their geometry and physicochemical properties. The goal of the contest is to assess the capability of different computational approaches to identify different conformations of the same protein, or the presence of common sub-parts, starting from a set of molecular surfaces. We addressed two problems: defining the similarity solely based on the surface geometry or with the inclusion of physicochemical information, such as electrostatic potential, amino acid hydrophobicity, and the presence of hydrogen bond donors and acceptors. Retrieval and classification performances, with respect to the single protein or the existence of common sub-sequences, are analysed according to a number of information retrieval indicators.

Submitted 17 October, 2021; v1 submitted 11 May, 2021; originally announced May 2021.

ACM Class: I.3.8; I.6.3; J.3

Journal ref: Computers & Graphics 99 (2021) 1-21

arXiv:1812.10841 [pdf, other]

Three-Dimensional Krawtchouk Descriptors for Protein Local Surface Shape Comparison

Authors: Atilla Sit, Daisuke Kihara

Abstract: Direct comparison of three-dimensional (3D) objects is computationally expensive due to the need for translation, rotation, and scaling of the objects to evaluate their similarity. In applications of 3D object comparison, often identifying specific local regions of objects is of particular interest. We have recently developed a set of 2D moment invariants based on discrete orthogonal Krawtchouk polynomials for comparison of local image patches. In this work, we extend them to 3D and construct 3D Krawtchouk descriptors (3DKD) that are invariant under translation, rotation, and scaling. The new descriptors have the ability to extract local features of a 3D surface from any region-of-interest. This property enables comparison of two arbitrary local surface regions from different 3D objects. We present the new formulation of 3DKD and apply it to the local shape comparison of protein surfaces in order to predict ligand molecules that bind to query proteins.

Submitted 27 December, 2018; originally announced December 2018.

arXiv:1601.00891 [pdf, other]

doi 10.1186/s13059-016-1037-6

An expanded evaluation of protein function prediction methods shows an improvement in accuracy

Authors: Yuxiang Jiang, Tal Ronnen Oron, Wyatt T Clark, Asma R Bankapur, Daniel D'Andrea, Rosalba Lepore, Christopher S Funk, Indika Kahanda, Karin M Verspoor, Asa Ben-Hur, Emily Koo, Duncan Penfold-Brown, Dennis Shasha, Noah Youngs, Richard Bonneau, Alexandra Lin, Sayed ME Sahraeian, Pier Luigi Martelli, Giuseppe Profiti, Rita Casadio, Renzhi Cao, Zhaolong Zhong, Jianlin Cheng, Adrian Altenhoff, Nives Skunca, et al. (122 additional authors not shown)

Abstract: Background: The increasing volume and variety of genotypic and phenotypic data is a major defining characteristic of modern biomedical sciences. At the same time, the limitations in technology for generating data and the inherently stochastic nature of biomolecular events have led to the discrepancy between the volume of data and the amount of knowledge gleaned from it. A major bottleneck in our ability to understand the molecular underpinnings of life is the assignment of function to biological macromolecules, especially proteins. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted purview have led to an increasing role for computational function prediction. However, accurately assessing methods for protein function prediction and tracking progress in the field remain challenging. Methodology: We have conducted the second Critical Assessment of Functional Annotation (CAFA), a timed challenge to assess computational methods that automatically assign protein function. One hundred twenty-six methods from 56 research groups were evaluated for their ability to predict biological functions using the Gene Ontology and gene-disease associations using the Human Phenotype Ontology on a set of 3,681 proteins from 18 species. CAFA2 featured significantly expanded analysis compared with CAFA1, with regard to data set size, variety, and assessment metrics. To review progress in the field, the analysis also compared the best methods participating in CAFA1 to those of CAFA2.
Conclusions: The top performing methods in CAFA2 outperformed the best methods from CAFA1, demonstrating that computational function prediction is improving. This increased accuracy can be attributed to the combined effect of the growing number of experimental annotations and improved methods for function prediction.

Submitted 2 January, 2016; originally announced January 2016.

Comments: Submitted to Genome Biology

Showing 1–7 of 7 results for author: Kihara, D