MolLangBench: A Comprehensive Benchmark for Language-Prompted Molecular Structure Recognition, Editing, and Generation

Authors: Feiyang Cai, Jiahui Bai, Tao Tang, Joshua Luo, Tianyu Zhu, Ling Liu, Feng Luo

Abstract: Precise recognition, editing, and generation of molecules are essential prerequisites for both chemists and AI systems tackling various chemical tasks. We present MolLangBench, a comprehensive benchmark designed to evaluate fundamental molecule-language interface tasks: language-prompted molecular structure recognition, editing, and generation. To ensure high-quality, unambiguous, and deterministic outputs, we construct the recognition tasks using automated cheminformatics tools, and curate editing and generation tasks through rigorous expert annotation and validation. MolLangBench supports the evaluation of models that interface language with different molecular representations, including linear strings, molecular images, and molecular graphs. Evaluations of state-of-the-art models reveal significant limitations: the strongest model (o3) achieves $79.2\%$ and $78.5\%$ accuracy on recognition and editing tasks, which are intuitively simple for humans, and performs even worse on the generation task, reaching only $29.0\%$ accuracy. These results highlight the shortcomings of current AI systems in handling even preliminary molecular recognition and manipulation tasks. We hope MolLangBench will catalyze further research toward more effective and reliable AI systems for chemical applications.

Submitted 20 May, 2025; originally announced May 2025.

arXiv:2410.21335 [pdf, other]

E(3)-invariant diffusion model for pocket-aware peptide generation

Authors: Po-Yu Liang, Jun Bai

Abstract: Biologists frequently desire protein inhibitors for a variety of reasons, including use as research tools for understanding biological processes and application to societal problems in agriculture, healthcare, etc. Immunotherapy, for instance, relies on immune checkpoint inhibitors to block checkpoint proteins, preventing their binding with partner proteins and boosting immune cell function against abnormal cells. Inhibitor discovery has long been a tedious process, which in recent years has been accelerated by computational approaches. Advances in artificial intelligence now provide an opportunity to make inhibitor discovery smarter than ever before. While extensive research has been conducted on computer-aided inhibitor discovery, it has mainly focused on either sequence-to-structure mapping, reverse mapping, or bio-activity prediction, making it unrealistic for biologists to utilize such tools. Instead, our work proposes a new method of computer-assisted inhibitor discovery: a de novo pocket-aware peptide structure and sequence generation network. Our approach consists of two sequential diffusion models for end-to-end structure generation and sequence prediction. By leveraging angle and dihedral relationships between backbone atoms, we ensure an E(3)-invariant representation of peptide structures. Our results demonstrate that our method achieves comparable performance to state-of-the-art models, highlighting its potential in pocket-aware peptide design. This work offers a new approach for precise drug discovery using receptor-specific peptide generation.

Submitted 31 October, 2024; v1 submitted 27 October, 2024; originally announced October 2024.

arXiv:2410.11499 [pdf, other]

BSM: Small but Powerful Biological Sequence Model for Genes and Proteins

Authors: Weixi Xiang, Xueting Han, Xiujuan Chai, Jing Bai

Abstract: Modeling biological sequences such as DNA, RNA, and proteins is crucial for understanding complex processes like gene regulation and protein synthesis. However, most current models either focus on a single type or treat multiple types of data separately, limiting their ability to capture cross-modal relationships. We propose that by learning the relationships between these modalities, the model can enhance its understanding of each type. To address this, we introduce BSM, a small but powerful mixed-modal biological sequence foundation model, trained on three types of data: RefSeq, Gene Related Sequences, and interleaved biological sequences from the web. These datasets capture the genetic flow, gene-protein relationships, and the natural co-occurrence of diverse biological data, respectively. By training on mixed-modal data, BSM significantly enhances learning efficiency and cross-modal representation, outperforming models trained solely on unimodal data. With only 110M parameters, BSM achieves performance comparable to much larger models across both single-modal and mixed-modal tasks, and uniquely demonstrates in-context learning capability for mixed-modal tasks, which is absent in existing models. Further scaling to 270M parameters demonstrates even greater performance gains, highlighting the potential of BSM as a significant advancement in multimodal biological sequence modeling.

Submitted 15 October, 2024; originally announced October 2024.

arXiv:2409.10370 [pdf, other]

Uncovering the Mechanism of Hepatotoxicity of PFAS Targeting L-FABP Using GCN and Computational Modeling

Authors: Lucas Jividen, Tibo Duran, Xi-Zhi Niu, Jun Bai

Abstract: Per- and polyfluoroalkyl substances (PFAS) are persistent environmental pollutants with known toxicity and bioaccumulation issues. Their widespread industrial use and resistance to degradation have led to global environmental contamination and significant health concerns. While a minority of PFAS have been extensively studied, the toxicity of many PFAS remains poorly understood due to limited direct toxicological data. This study advances the predictive modeling of PFAS toxicity by combining semi-supervised graph convolutional networks (GCNs) with molecular descriptors and fingerprints. We propose a novel approach to enhance the prediction of PFAS binding affinities by isolating molecular fingerprints to construct graphs in which the descriptors are then set as the node features. This approach specifically captures the structural, physicochemical, and topological features of PFAS without overfitting due to an abundance of features. Unsupervised clustering then identifies representative compounds for detailed binding studies. Our results enable more accurate estimation of PFAS hepatotoxicity, providing guidance for the chemical discovery of new PFAS and the development of new safety regulations.

Submitted 16 September, 2024; originally announced September 2024.

Comments: 8 pages, 9 figures, submitted to IEEE BIBM 2024

arXiv:2408.08341 [pdf, other]

Exploring Latent Space for Generating Peptide Analogs Using Protein Language Models

Authors: Po-Yu Liang, Xueting Huang, Tibo Duran, Andrew J. Wiemer, Jun Bai

Abstract: Generating peptides with desired properties is crucial for drug discovery and biotechnology. Traditional sequence-based and structure-based methods often require extensive datasets, which limits their effectiveness. In this study, we propose a novel method that uses autoencoder-shaped models to explore the protein embedding space and generate novel peptide analogs by leveraging protein language models. The proposed method requires only a single sequence of interest, avoiding the need for large datasets. Our results show significant improvements over baseline models in similarity indicators of peptide structures, descriptors, and bioactivities. Validation through molecular dynamics simulations on TIGIT inhibitors demonstrates that our method produces peptide analogs with similar yet distinct properties, highlighting its potential to enhance peptide screening processes.

Submitted 15 August, 2024; originally announced August 2024.

arXiv:2301.10185 [pdf]

Flow cytometry with anti-diffraction light sheet (ADLS) by spatial light modulation

Authors: Yanyan Gong, Ming Zeng, Yueqiang Zhu, Shangyu Li, Wei Zhao, Ce Zhang, Tianyun Zhao, Kaige Wang, Jiangcun Yang, Jintao Bai

Abstract: Flow cytometry is a widespread and powerful technique, whose resolution is determined by its capacity to accurately distinguish fluorescently positive populations from negative ones. However, most informative results are discarded while performing the measurements of conventional flow cytometry, e.g., the cell size, shape, morphology, and distribution or location of labeled exosomes within unpurified biological samples. We herein propose a novel approach using an anti-diffraction light sheet with an anisotropic feature to excite fluorescent tags. Constituted by an anti-diffraction Bessel-Gaussian beam array, the light sheet is 12 $\mu$m wide and 12 $\mu$m high, with a thickness of $\sim 0.8$ $\mu$m. The intensity profile of the excited fluorescent signal can, therefore, reflect the size and allow samples in the range from O(100 nm) to 10 $\mu$m (e.g., blood cells) to be transported via hydrodynamic focusing in a microfluidic chip. The sampling rate of 500 kHz provides high throughput without sacrificing spatial resolution. Consequently, the proposed anti-diffraction light-sheet flow cytometry (ADLSFC) can obtain more informative results than conventional methodologies, and is able to provide multiple characteristics (e.g., the size and distribution of the fluorescent signal) that help distinguish the target samples from complex backgrounds.

Submitted 23 January, 2023; originally announced January 2023.

arXiv:1411.1903 [pdf]

An embryo of protocell membrane: The capsule of graphene oxide

Authors: Zhan Li, Chunmei Wang, Longlong Tian, Jing Bai, Yang Zhao, Xin Zhang, Shiwei Cao, Wei Qi, Hongdeng Qiu, Suomin Wang, Keliang Shi, Youwen Xu, Zhang Mingliang, Bo Liu, Huijun Yao, Jie Liu, Wangsuo Wu, Xiaoli Wang

Abstract: Many signs indicate that graphene could have occurred widely on the early Earth. Here, we report a new theory that graphene might be an embryo of the protocell membrane, and present several lines of evidence. Firstly, graphene oxide and phospholipid-graphene oxide composite curl into capsules in a strongly acidic saturated solution of Pb(NO3)2 at low temperature, providing a protective space for biochemical reactions. Secondly, L-amino acids exhibit higher reactivity than D-amino acids toward graphene oxides, favoring the formation of left-handed proteins. Thirdly, monolayer graphene with nanopores prepared by unfocused 84Kr25+ has high selectivity for permeation of monovalent metal ions (Rb+ > K+ > Cs+ > Na+ > Li+), but does not allow Cl- through, which could be attributed to the ion exchange of oxygen-containing groups on the rim of the nanopores. This is similar to K+ channels, and would cause efflux of some ions from the graphene oxide capsule as the pH of the primitive ocean decreased, creating a suitable inner condition for the origin of life. Consequently, the strong acidity, high salinity, strong radiation, and temperature changes of the early Earth, usually regarded as negative factors, would have been indispensable for the origin of the protocell. In short, graphene bred life, but was gradually digested by evolution.

Submitted 12 November, 2014; v1 submitted 7 November, 2014; originally announced November 2014.

Comments: 1411.1903

Showing 1–7 of 7 results for author: Bai, J