Search | arXiv e-print repository

ProtTeX: Structure-In-Context Reasoning and Editing of Proteins with Large Language Models

Authors: Zicheng Ma, Chuanliu Fan, Zhicong Wang, Zhenyu Chen, Xiaohan Lin, Yanheng Li, Shihao Feng, Jun Zhang, Ziqiang Cao, Yi Qin Gao

Abstract: Large language models have made remarkable progress in the field of molecular science, particularly in understanding and generating functional small molecules. This success is largely attributed to the effectiveness of molecular tokenization strategies. In protein science, the amino acid sequence serves as the sole tokenizer for LLMs. However, many fundamental challenges in protein science are inherently structure-dependent. The absence of structure-aware tokens significantly limits the capabilities of LLMs for comprehensive biomolecular comprehension and multimodal generation. To address these challenges, we introduce a novel framework, ProtTeX, which tokenizes the protein sequences, structures, and textual information into a unified discrete space. This innovative approach enables joint training of the LLM exclusively through the Next-Token Prediction paradigm, facilitating multimodal protein reasoning and generation. ProtTeX enables general LLMs to perceive and process protein structures through sequential text input, leverage structural information as intermediate reasoning components, and generate or manipulate structures via sequential text output. Experiments demonstrate that our model achieves significant improvements in protein function prediction, outperforming the state-of-the-art domain expert model with a twofold increase in accuracy. Our framework enables high-quality conformational generation and customizable protein design. For the first time, we demonstrate that by adopting the standard training and inference pipelines from the LLM domain, ProtTeX empowers decoder-only LLMs to effectively address a diverse spectrum of protein-related tasks.

Submitted 13 March, 2025; v1 submitted 11 March, 2025; originally announced March 2025.

Comments: 26 pages, 9 figures

arXiv:2502.15597 [pdf, other]

From FAIR to CURE: Guidelines for Computational Models of Biological Systems

Authors: Herbert M. Sauro, Eran Agmon, Michael L. Blinov, John H. Gennari, Joe Hellerstein, Adel Heydarabadipour, Peter Hunter, Bartholomew E. Jardine, Elebeoba May, David P. Nickerson, Lucian P. Smith, Gary D Bader, Frank Bergmann, Patrick M. Boyle, Andreas Drager, James R. Faeder, Song Feng, Juliana Freire, Fabian Frohlich, James A. Glazier, Thomas E. Gorochowski, Tomas Helikar, Stefan Hoops, Princess Imoukhuede, Sarah M. Keating, et al. (26 additional authors not shown)

Abstract: Guidelines for managing scientific data have been established under the FAIR principles requiring that data be Findable, Accessible, Interoperable, and Reusable. In many scientific disciplines, especially computational biology, both data and models are key to progress. For this reason, and recognizing that such models are a very special type of 'data', we argue that computational models, especially mechanistic models prevalent in medicine, physiology and systems biology, deserve a complementary set of guidelines. We propose the CURE principles, emphasizing that models should be Credible, Understandable, Reproducible, and Extensible. We delve into each principle, discussing verification, validation, and uncertainty quantification for model credibility; the clarity of model descriptions and annotations for understandability; adherence to standards and open science practices for reproducibility; and the use of open standards and modular code for extensibility and reuse. We outline recommended and baseline requirements for each aspect of CURE, aiming to enhance the impact and trustworthiness of computational models, particularly in biomedical applications where credibility is paramount. Our perspective underscores the need for a more disciplined approach to modeling, aligning with emerging trends such as Digital Twins and emphasizing the importance of data and modeling standards for interoperability and reuse. Finally, we emphasize that given the non-trivial effort required to implement the guidelines, the community moves to automate as many of the guidelines as possible.

Submitted 21 February, 2025; originally announced February 2025.

arXiv:2501.16386 [pdf]

ILETIA: An AI-enhanced method for individualized trigger-oocyte pickup interval estimation of progestin-primed ovarian stimulation protocol

Authors: Binjian Wu, Qian Li, Zhe Kuang, Hongyuan Gao, Xinyi Liu, Haiyan Guo, Qiuju Chen, Xinyi Liu, Yangruizhe Jiang, Yuqi Zhang, Jinyin Zha, Mingyu Li, Qiuhan Ren, Sishuo Feng, Haicang Zhang, Xuefeng Lu, Jian Zhang

Abstract: In vitro fertilization-embryo transfer (IVF-ET) stands as one of the most prevalent treatments for infertility. During an IVF-ET cycle, the time interval between trigger shot and oocyte pickup (OPU) is a pivotal period for follicular maturation, which determines mature oocyte yields and impacts the success of subsequent procedures. However, accurately predicting this interval is severely hindered by the variability of clinicians' experience, which often leads to a suboptimal oocyte retrieval rate. To address this challenge, we propose ILETIA, the first machine learning-based method that could predict the optimal trigger-OPU interval for patients receiving the progestin-primed ovarian stimulation (PPOS) protocol. Specifically, ILETIA leverages a Transformer to learn representations from clinical tabular data, and then employs gradient-boosted trees for interval prediction. For model training and evaluation, we compiled a dataset, PPOS-DS, of nearly ten thousand patients receiving the PPOS protocol, the largest such dataset to our knowledge. Experimental results demonstrate that our method achieves strong performance (AUROC = 0.889), outperforming both clinicians and other widely used computational models. Moreover, ILETIA also supports premature ovulation risk prediction in a specific OPU time (AUROC = 0.838). Collectively, by enabling more precise and individualized decisions, ILETIA has the potential to improve clinical outcomes and lay the foundation for future IVF-ET research.

Submitted 25 January, 2025; originally announced January 2025.

arXiv:2410.10516 [pdf, other]

UniGEM: A Unified Approach to Generation and Property Prediction for Molecules

Authors: Shikun Feng, Yuyan Ni, Yan Lu, Zhi-Ming Ma, Wei-Ying Ma, Yanyan Lan

Abstract: Molecular generation and molecular property prediction are both crucial for drug discovery, but they are often developed independently. Inspired by recent studies, which demonstrate that diffusion models, a prominent generative approach, can learn meaningful data representations that enhance predictive tasks, we explore the potential for developing a unified generative model in the molecular domain that effectively addresses both molecular generation and property prediction tasks. However, the integration of these tasks is challenging due to inherent inconsistencies, making simple multi-task learning ineffective. To address this, we propose UniGEM, the first unified model to successfully integrate molecular generation and property prediction, delivering superior performance in both tasks. Our key innovation lies in a novel two-phase generative process, where predictive tasks are activated in the later stages, after the molecular scaffold is formed. We further enhance task balance through innovative training strategies. Rigorous theoretical analysis and comprehensive experiments demonstrate our significant improvements in both tasks. The principles behind UniGEM hold promise for broader applications, including natural language processing and computer vision.

Submitted 4 April, 2025; v1 submitted 14 October, 2024; originally announced October 2024.

Comments: 11 pages, 5 figures

arXiv:2410.09346 [pdf]

Transcriptome and Redox Proteome Reveal Temporal Scales of Carbon Metabolism Regulation in Model Cyanobacteria Under Light Disturbance

Authors: Connah G. M. Johnson, Zachary Johnson, Liam S. Mackey, Xiaolu Li, Natalie C. Sadler, Tong Zhang, Wei-Jun Qian, Pavlo Bohutskyi, Song Feng, Margaret S. Cheung

Abstract: We develop a systems approach based on an energy-landscape concept to differentiate interactions involving redox activities and conformational changes of proteins and nucleic acids interactions in multi-layered protein-DNA regulatory networks under light disturbance. Our approach is a data-driven modeling workflow using a physics-informed machine learning algorithm to train a non-linear mathematical model for interpreting gene expression dynamics and to lead discovery for protein regulators using redox proteome analysis. We distinguish light-responsive elements within central carbon metabolism pathways from independent variables like circadian time using the publicly available transcriptome datasets of Synechococcus elongatus over diel cycles responding to light perturbations. Our approach provides interpretable de novo models for elucidating events of reactions in complex regulatory pathways in response to stressful disturbance from the environment. We discovered protein regulators in response to light disturbance in the proteome analysis involving shifts in protein abundance as well as cysteine redox states under constant illumination and after two hours of darkness. We discovered significant shifts in cysteine redox states in regulatory proteins such as transcription sigma factors and metabolic enzymes in the oxidative pentose phosphate pathway and the Calvin-Benson cycle, while the changes in their protein abundance were minimal. These results indicate that regulatory dynamics in reductant generation link photo-induced electron transport pathways and redox metabolic pathways with circadian rhythms through fast redox-induced conformational changes or slow expression regulations across networks.

Submitted 2 June, 2025; v1 submitted 11 October, 2024; originally announced October 2024.

arXiv:2406.11568 [pdf, other]

doi 10.21437/Interspeech.2024-382

Towards an End-to-End Framework for Invasive Brain Signal Decoding with Large Language Models

Authors: Sheng Feng, Heyang Liu, Yu Wang, Yanfeng Wang

Abstract: In this paper, we introduce a groundbreaking end-to-end (E2E) framework for decoding invasive brain signals, marking a significant advancement in the field of speech neuroprosthesis. Our methodology leverages the comprehensive reasoning abilities of large language models (LLMs) to facilitate direct decoding. By fully integrating LLMs, we achieve results comparable to the state-of-the-art cascade models. Our findings underscore the immense potential of E2E frameworks in speech neuroprosthesis, particularly as the technology behind brain-computer interfaces (BCIs) and the availability of relevant datasets continue to evolve. This work not only showcases the efficacy of combining LLMs with E2E decoding for enhancing speech neuroprosthesis but also sets a new direction for future research in BCI applications, underscoring the impact of LLMs in decoding complex neural signals for communication restoration. Code will be made available at https://github.com/FsFrancis15/BrainLLM.

Submitted 17 June, 2024; originally announced June 2024.

Journal ref: Proceedings of Interspeech 2024

arXiv:2405.10343 [pdf, other]

UniCorn: A Unified Contrastive Learning Approach for Multi-view Molecular Representation Learning

Authors: Shikun Feng, Yuyan Ni, Minghao Li, Yanwen Huang, Zhi-Ming Ma, Wei-Ying Ma, Yanyan Lan

Abstract: Recently, a noticeable trend has emerged in developing pre-trained foundation models in the domains of CV and NLP. However, molecular pre-training lacks a universal model that applies effectively to various categories of molecular tasks, since existing prevalent pre-training methods are effective only for specific types of downstream tasks. Furthermore, the lack of profound understanding of existing pre-training methods, including 2D graph masking, 2D-3D contrastive learning, and 3D denoising, hampers the advancement of molecular foundation models. In this work, we provide a unified comprehension of existing pre-training methods through the lens of contrastive learning. Thus their distinctions lie in clustering different views of molecules, which is shown to be beneficial to specific downstream tasks. To achieve a complete and general-purpose molecular representation, we propose a novel pre-training framework, named UniCorn, that inherits the merits of the three methods, depicting molecular views in three different levels. SOTA performance across quantum, physicochemical, and biological tasks, along with a comprehensive ablation study, validates the universality and effectiveness of UniCorn.

Submitted 15 May, 2024; originally announced May 2024.

arXiv:2402.13779 [pdf, other]

Contextual Molecule Representation Learning from Chemical Reaction Knowledge

Authors: Han Tang, Shikun Feng, Bicheng Lin, Yuyan Ni, Jingjing Liu, Wei-Ying Ma, Yanyan Lan

Abstract: In recent years, self-supervised learning has emerged as a powerful tool to harness abundant unlabelled data for representation learning and has been broadly adopted in diverse areas. However, when applied to molecular representation learning (MRL), prevailing techniques such as masked sub-unit reconstruction often fall short, due to the high degree of freedom in the possible combinations of atoms within molecules, which brings insurmountable complexity to the masking-reconstruction paradigm. To tackle this challenge, we introduce REMO, a self-supervised learning framework that takes advantage of well-defined atom-combination rules in common chemistry. Specifically, REMO pre-trains graph/Transformer encoders on 1.7 million known chemical reactions in the literature. We propose two pre-training objectives: Masked Reaction Centre Reconstruction (MRCR) and Reaction Centre Identification (RCI). REMO offers a novel solution to MRL by exploiting the underlying shared patterns in chemical reactions as context for pre-training, which effectively infers meaningful representations of common chemistry knowledge. Such contextual representations can then be utilized to support diverse downstream molecular tasks with minimum finetuning, such as affinity prediction and drug-drug interaction prediction. Extensive experimental results on MoleculeACE, ACNet, drug-drug interaction (DDI), and reaction type classification show that across all tested downstream tasks, REMO outperforms the standard baseline of single-molecule masked modeling used in current MRL. Remarkably, REMO is the pioneering deep learning model surpassing fingerprint-based methods in activity cliff benchmarks.

Submitted 21 February, 2024; originally announced February 2024.

Comments: Preprint. Under Review

arXiv:2311.16160 [pdf, other]

Protein-ligand binding representation learning from fine-grained interactions

Authors: Shikun Feng, Minghao Li, Yinjun Jia, Weiying Ma, Yanyan Lan

Abstract: The binding between proteins and ligands plays a crucial role in the realm of drug discovery. Previous deep learning approaches have shown promising results over traditional computationally intensive methods, but generalize poorly due to limited supervised data. In this paper, we propose to learn protein-ligand binding representation in a self-supervised learning manner. Different from existing pre-training approaches which treat proteins and ligands individually, we emphasize discerning the intricate binding patterns from fine-grained interactions. Specifically, this self-supervised learning problem is formulated as a prediction of the conclusive binding complex structure given a pocket and ligand with a Transformer-based interaction module, which naturally emulates the binding process. To ensure the representation of rich binding information, we introduce two pre-training tasks, i.e. atomic pairwise distance map prediction and mask ligand reconstruction, which comprehensively model the fine-grained interactions from both structure and feature space. Extensive experiments have demonstrated the superiority of our method across various binding tasks, including protein-ligand affinity prediction, virtual screening and protein-ligand docking.

Submitted 8 November, 2023; originally announced November 2023.

arXiv:2311.02124 [pdf, other]

Sliced Denoising: A Physics-Informed Molecular Pre-Training Method

Authors: Yuyan Ni, Shikun Feng, Wei-Ying Ma, Zhi-Ming Ma, Yanyan Lan

Abstract: While molecular pre-training has shown great potential in enhancing drug discovery, the lack of a solid physical interpretation in current methods raises concerns about whether the learned representation truly captures the underlying explanatory factors in observed data, ultimately resulting in limited generalization and robustness. Although denoising methods offer a physical interpretation, their accuracy is often compromised by ad-hoc noise design, leading to inaccurate learned force fields. To address this limitation, this paper proposes a new method for molecular pre-training, called sliced denoising (SliDe), which is based on the classical mechanical intramolecular potential theory. SliDe utilizes a novel noise strategy that perturbs bond lengths, angles, and torsion angles to achieve better sampling over conformations. Additionally, it introduces a random slicing approach that circumvents the computationally expensive calculation of the Jacobian matrix, which is otherwise essential for estimating the force field. By aligning with physical principles, SliDe shows a 42% improvement in the accuracy of estimated force fields compared to current state-of-the-art denoising methods, and thus outperforms traditional baselines on various molecular property prediction tasks.

Submitted 3 November, 2023; originally announced November 2023.

arXiv:2310.14216 [pdf, other]

UniMAP: Universal SMILES-Graph Representation Learning

Authors: Shikun Feng, Lixin Yang, Yanwen Huang, Yuyan Ni, Weiying Ma, Yanyan Lan

Abstract: Molecular representation learning is fundamental for many drug-related applications. Most existing molecular pre-training models are limited to a single molecular modality, either SMILES or graph representation. To effectively leverage both modalities, we argue that it is critical to capture the fine-grained 'semantics' between SMILES and graph, because subtle sequence/graph differences may lead to contrary molecular properties. In this paper, we propose a universal SMILES-graph representation learning model, namely UniMAP. Firstly, an embedding layer is employed to obtain the token and node/edge representation in SMILES and graph, respectively. A multi-layer Transformer is then utilized to conduct deep cross-modality fusion. Specifically, four kinds of pre-training tasks are designed for UniMAP, including Multi-Level Cross-Modality Masking (CMM), SMILES-Graph Matching (SGM), Fragment-Level Alignment (FLA), and Domain Knowledge Learning (DKL). In this way, both global (i.e. SGM and DKL) and local (i.e. CMM and FLA) alignments are integrated to achieve comprehensive cross-modality fusion. We evaluate UniMAP on various downstream tasks, i.e. molecular property prediction, drug-target affinity prediction and drug-drug interaction. Experimental results show that UniMAP outperforms current state-of-the-art pre-training methods. We also visualize the learned representations to demonstrate the effect of multi-modality integration.

Submitted 4 November, 2024; v1 submitted 22 October, 2023; originally announced October 2023.

arXiv:2307.10683 [pdf, other]

Fractional Denoising for 3D Molecular Pre-training

Authors: Shikun Feng, Yuyan Ni, Yanyan Lan, Zhi-Ming Ma, Wei-Ying Ma

Abstract: Coordinate denoising is a promising 3D molecular pre-training method, which has achieved remarkable performance in various downstream drug discovery tasks. Theoretically, the objective is equivalent to learning the force field, which has been shown to be helpful for downstream tasks. Nevertheless, there are two challenges for coordinate denoising to learn an effective force field, i.e. low coverage samples and isotropic force field. The underlying reason is that molecular distributions assumed by existing denoising methods fail to capture the anisotropic characteristic of molecules. To tackle these challenges, we propose a novel hybrid noise strategy, including noises on both dihedral angle and coordinate. However, denoising such hybrid noise in a traditional way is no longer equivalent to learning the force field. Through theoretical deductions, we find that the problem is caused by the dependency of the input conformation for covariance. To this end, we propose to decouple the two types of noise and design a novel fractional denoising method (Frad), which only denoises the latter coordinate part. In this way, Frad enjoys both the merits of sampling more low-energy structures and the force field equivalence. Extensive experiments show the effectiveness of Frad in molecular representation, with a new state-of-the-art on 9 out of 12 tasks of QM9 and on 7 out of 8 targets of MD17.

Submitted 26 February, 2024; v1 submitted 20 July, 2023; originally announced July 2023.

arXiv:2307.06235 [pdf, other]

Multimodal Molecular Pretraining via Modality Blending

Authors: Qiying Yu, Yudi Zhang, Yuyan Ni, Shikun Feng, Yanyan Lan, Hao Zhou, Jingjing Liu

Abstract: Self-supervised learning has recently gained growing interest in molecular modeling for scientific tasks such as AI-assisted drug discovery. Current studies consider leveraging both 2D and 3D molecular structures for representation learning. However, relying on straightforward alignment strategies that treat each modality separately, these methods fail to exploit the intrinsic correlation between 2D and 3D representations that reflect the underlying structural characteristics of molecules, and only perform coarse-grained molecule-level alignment. To derive fine-grained alignment and promote structural molecule understanding, we introduce an atomic-relation level "blend-then-predict" self-supervised learning approach, MoleBLEND, which first blends atom relations represented by different modalities into one unified relation matrix for joint encoding, then recovers modality-specific information for 2D and 3D structures individually. By treating atom relationships as anchors, MoleBLEND organically aligns and integrates visually dissimilar 2D and 3D modalities of the same molecule at fine-grained atomic level, painting a more comprehensive depiction of each molecule. Extensive experiments show that MoleBLEND achieves state-of-the-art performance across major 2D/3D molecular benchmarks. We further provide theoretical insights from the perspective of mutual-information maximization, demonstrating that our method unifies contrastive, generative (cross-modality prediction) and mask-then-predict (single-modality prediction) objectives into one single cohesive framework.

Submitted 8 October, 2023; v1 submitted 12 July, 2023; originally announced July 2023.

arXiv:2305.06488 [pdf]

A Platform for the Biomedical Application of Large Language Models

Authors: Sebastian Lobentanzer, Shaohong Feng, The BioChatter Consortium, Andreas Maier, Cankun Wang, Jan Baumbach, Nils Krehl, Qin Ma, Julio Saez-Rodriguez

Abstract: Current-generation Large Language Models (LLMs) have stirred enormous interest in recent months, yielding great potential for accessibility and automation, while simultaneously posing significant challenges and risk of misuse. To facilitate interfacing with LLMs in the biomedical space, while at the same time safeguarding their functionalities through sensible constraints, we propose a dedicated, open-source framework: BioChatter. Based on open-source software packages, we synergise the many functionalities that are currently developing around LLMs, such as knowledge integration / retrieval-augmented generation, model chaining, and benchmarking, resulting in an easy-to-use and inclusive framework for application in many use cases of biomedicine. We focus on robust and user-friendly implementation, including ways to deploy privacy-preserving local open-source LLMs. We demonstrate use cases via two multi-purpose web apps (https://chat.biocypher.org), and provide documentation, support, and an open community.

Submitted 17 February, 2024; v1 submitted 10 May, 2023; originally announced May 2023.

Comments: 31 pages, 3 figures

arXiv:2202.13004 [pdf]

SBbadger: Biochemical Reaction Networks with Definable Degree Distributions

Authors: Michael A. Kochen, H. Steven Wiley, Song Feng, Herbert M. Sauro

Abstract: Motivation: An essential step in developing computational tools for the inference, optimization, and simulation of biochemical reaction networks is gauging tool performance against earlier efforts using an appropriate set of benchmarks. General strategies for the assembly of benchmark models include collection from the literature, creation via subnetwork extraction and de novo generation. However, with respect to biochemical reaction networks, these approaches and their associated tools are either poorly suited to generate models that reflect the wide range of properties found in natural biochemical networks or to do so in numbers that enable rigorous statistical analysis. Results: In this work we present SBbadger, a Python-based software tool for the generation of synthetic biochemical reaction or metabolic networks with user-defined degree distributions, multiple available kinetic formalisms, and a host of other definable properties. SBbadger thus enables the creation of benchmark model sets that reflect properties of biological systems and generate the kinetics and model structures typically targeted by computational analysis and inference software. Here we detail the computational and algorithmic workflow of SBbadger, demonstrate its performance under various settings, provide sample outputs, and compare it to currently available biochemical reaction network generation software.

Submitted 12 September, 2022; v1 submitted 25 February, 2022; originally announced February 2022.

arXiv:2106.06929 [pdf, other]

Dynamics and Sensitivity of Signaling Pathways

Authors: Michael A. Kochen, Steven S. Andrews, H. Steven Wiley, Song Feng, Herbert M. Sauro

Abstract: Signaling pathways serve to communicate information about extracellular conditions into the cell, to both the nucleus and cytoplasmic processes to control cell responses. Genetic mutations in signaling network components are frequently associated with cancer and can result in cells acquiring an ability to divide and grow uncontrollably. Because signaling pathways play such a significant role in cancer initiation and advancement, their constituent proteins are attractive therapeutic targets. In this review, we discuss how signaling pathway modeling can assist with identifying effective drugs for treating diseases, such as cancer. An achievement that would facilitate the use of such models is their ability to identify controlling biochemical parameters in signaling pathways, such as molecular abundances and chemical reaction rates, because this would help determine effective points of attack by therapeutics.

Submitted 13 June, 2021; originally announced June 2021.

arXiv:2010.03068 [pdf, other]

Hypergraph Models of Biological Networks to Identify Genes Critical to Pathogenic Viral Response

Authors: Song Feng, Emily Heath, Brett Jefferson, Cliff Joslyn, Henry Kvinge, Hugh D. Mitchell, Brenda Praggastis, Amie J. Eisfeld, Amy C. Sims, Larissa B. Thackray, Shufang Fan, Kevin B. Walters, Peter J. Halfmann, Danielle Westhoff-Smith, Qing Tan, Vineet D. Menachery, Timothy P. Sheahan, Adam S. Cockrell, Jacob F. Kocher, Kelly G. Stratton, Natalie C. Heller, Lisa M. Bramer, Michael S. Diamond, Ralph S. Baric, Katrina M. Waters, et al. (3 additional authors not shown)

Abstract: Background: Representing biological networks as graphs is a powerful approach to reveal underlying patterns, signatures, and critical components from high-throughput biomolecular data. However, graphs do not natively capture the multi-way relationships present among genes and proteins in biological systems. Hypergraphs are generalizations of graphs that naturally model multi-way relationships and have shown promise in modeling systems such as protein complexes and metabolic reactions. In this paper we seek to understand how hypergraphs can more faithfully identify, and potentially predict, important genes based on complex relationships inferred from genomic expression data sets. Results: We compiled a novel data set of transcriptional host response to pathogenic viral infections and formulated relationships between genes as a hypergraph where hyperedges represent significantly perturbed genes, and vertices represent individual biological samples with specific experimental conditions. We find that hypergraph betweenness centrality is a superior method for identification of genes important to viral response when compared with graph centrality. Conclusions: Our results demonstrate the utility of using hypergraphs to represent complex biological systems and highlight central important responses in common to a variety of highly pathogenic viruses.

Submitted 6 October, 2020; originally announced October 2020.

MSC Class: 92C42; 92-08; 05C65

arXiv:2009.11241 [pdf, other]

Deep learning for peptide identification from metaproteomics datasets

Authors: Xuan Guo, Shichao Feng

Abstract: Metaproteomics are becoming widely used in microbiome research for gaining insights into the functional state of the microbial community. Current metaproteomics studies are generally based on high-throughput tandem mass spectrometry (MS/MS) coupled with liquid chromatography. The identification of peptides and proteins from MS data involves the computational procedure of searching MS/MS spectra against a predefined protein sequence database and assigning top-scored peptides to spectra. Existing computational tools are still far from being able to extract all the information out of large MS/MS datasets acquired from metaproteome samples. In this paper, we proposed a deep-learning-based algorithm, called DeepFilter, for improving the rate of confident peptide identifications from a collection of tandem mass spectra. Compared with other post-processing tools, including Percolator, Q-ranker, PeptideProphet, and Iprophet, DeepFilter identified 20% and 10% more peptide-spectrum-matches and proteins, respectively, on marine microbial and soil microbial metaproteome samples with false discovery rate at 1%.

Submitted 23 September, 2020; originally announced September 2020.

arXiv:1903.08615 [pdf, other]

doi 10.1063/1.5096774

Scaling methods for accelerating kinetic Monte Carlo simulations of chemical reaction networks

Authors: Yen Ting Lin, Song Feng, William S. Hlavacek

Abstract: Various kinetic Monte Carlo algorithms become inefficient when some of the population sizes in a system are large, which gives rise to a large number of reaction events per unit time. Here, we present a new acceleration algorithm based on adaptive and heterogeneous scaling of reaction rates and stoichiometric coefficients. The algorithm is conceptually related to the commonly used idea of accelerating a stochastic simulation by considering a sub-volume $λΩ$ ($0<λ<1$) within a system of interest, which reduces the number of reaction events per unit time occurring in a simulation by a factor $1/λ$ at the cost of greater error in unbiased estimates of first moments and biased overestimates of second moments. Our new approach offers two unique benefits. First, scaling is adaptive and heterogeneous, which eliminates the pitfall of overaggressive scaling. Second, there is no need for an a priori classification of populations as discrete or continuous (as in a hybrid method), which is problematic when discreteness of a chemical species changes during a simulation. The method requires specification of only a single algorithmic parameter, $N_c$, a global critical population size above which populations are effectively scaled down to increase simulation efficiency. The method, which we term partial scaling, is implemented in the open-source BioNetGen software package. We demonstrate that partial scaling can significantly accelerate simulations without significant loss of accuracy for several published models of biological systems. These models characterize activation of the mitogen-activated protein kinase ERK, prion protein aggregation, and T-cell receptor signaling.

Submitted 10 May, 2019; v1 submitted 20 March, 2019; originally announced March 2019.

Comments: 18 pages, 7 figures, 1 table

arXiv:1811.09366 [pdf]

Prediction of Cytochrome P450-Mediated Metabolism Using a Combination of QSAR Derived Reactivity and Induced Fit Docking

Authors: Shulu Feng, Richard A. Friesner

Abstract: Prediction of metabolism in cytochrome P450s remains a crucial yet challenging topic in discovering and designing drugs, agrochemicals and nutritional supplements. The problem is challenging because the rate of P450 metabolism depends upon both the intrinsic chemical reactivity of the site and the protein-ligand geometry that is energetically accessible in the active site of a given P450 isozyme. We have addressed this problem using a two-level screening system. The first level implements an empirical QSAR-based scoring function employing the local chemical motifs to characterize the intrinsic reactivity. The second level uses molecular docking and molecular mechanics to account for the geometrical effects, including induced-fit effects in the protein which can be very important in P450 interactions with ligands. This approach has achieved high accuracy for both the P450 3A4 and 2D6 isoforms. In identifying at least one metabolic site in the top two ranked positions, the prediction rate can reach as high as 92.7% for the test set of isoform 3A4. For the 2D6 isoform, 100% accuracy is achieved on this basic evaluation metric, and, because this active site is considerably smaller and more selective than 3A4, very high precision is attained for full prediction of all metabolic sites. The method also requires considerably less CPU time than our previous efforts, which involved a large number of expensive simulations for each ligand to be evaluated. After screening using the empirical score function, only a few best candidates are left for each ligand, making the number of necessary estimations in the second level very small, which significantly reduces the computation time.

Submitted 23 November, 2018; originally announced November 2018.

arXiv:1802.00462 [pdf]

In silico evolution of signaling networks using rule-based models: bistable response dynamics

Authors: Song Feng, Orkun S. Soyer

Abstract: One of the ultimate goals in biology is to understand the design principles of biological systems. Such principles, if they exist, can help us better understand complex, natural biological systems and guide the engineering of de novo ones. Towards deciphering design principles, in silico evolution of biological systems with proper abstraction is a promising approach. Here, we demonstrate the application of in silico evolution combined with rule-based modelling for exploring design principles of cellular signaling networks. This application is based on a computational platform, called BioJazz, which allows in silico evolution of signaling networks with unbounded complexity. We provide a detailed introduction to BioJazz architecture and implementation and describe how it can be used to evolve and/or design signaling networks with defined dynamics. For the latter, we evolve signaling networks with switch-like response dynamics and demonstrate how BioJazz can result in new biological insights on network structures that can endow bistable response dynamics. This example also demonstrated both the power of BioJazz in evolving and designing signaling networks and its limitations at the current stage of development.

Submitted 6 February, 2018; v1 submitted 1 February, 2018; originally announced February 2018.

Comments: 24 pages, 7 figures

arXiv:1801.10227 [pdf, other]

Generalizing Gillespie's direct method to enable network-free simulations

Authors: Ryan Suderman, Eshan D. Mitra, Yen Ting Lin, Keesha E. Erickson, Song Feng, William S. Hlavacek

Abstract: Gillespie's direct method for stochastic simulation of chemical kinetics is a staple of computational systems biology research. However, the algorithm requires explicit enumeration of all reactions and all chemical species that may arise in the system. In many cases, this is not feasible due to the combinatorial explosion of reactions and species in biological networks. Rule-based modeling frameworks provide a way to exactly represent networks containing such combinatorial complexity, and generalizations of Gillespie's direct method have been developed as simulation engines for rule-based modeling languages. Here, we provide both a high-level description of the algorithms underlying the simulation engines, termed network-free simulation algorithms, and how they have been applied in systems biology research. We also define a generic rule-based modeling framework and describe a number of technical details required for adapting Gillespie's direct method for network-free simulation. Finally, we briefly discuss potential avenues for advancing network-free simulation and the role they continue to play in modeling dynamical systems in biology.

Submitted 30 January, 2018; originally announced January 2018.

Comments: 27 pages, 6 figures

arXiv:1601.00891 [pdf, other]

doi 10.1186/s13059-016-1037-6

An expanded evaluation of protein function prediction methods shows an improvement in accuracy

Authors: Yuxiang Jiang, Tal Ronnen Oron, Wyatt T Clark, Asma R Bankapur, Daniel D'Andrea, Rosalba Lepore, Christopher S Funk, Indika Kahanda, Karin M Verspoor, Asa Ben-Hur, Emily Koo, Duncan Penfold-Brown, Dennis Shasha, Noah Youngs, Richard Bonneau, Alexandra Lin, Sayed ME Sahraeian, Pier Luigi Martelli, Giuseppe Profiti, Rita Casadio, Renzhi Cao, Zhaolong Zhong, Jianlin Cheng, Adrian Altenhoff, Nives Skunca, et al. (122 additional authors not shown)

Abstract: Background: The increasing volume and variety of genotypic and phenotypic data is a major defining characteristic of modern biomedical sciences. At the same time, the limitations in technology for generating data and the inherently stochastic nature of biomolecular events have led to the discrepancy between the volume of data and the amount of knowledge gleaned from it. A major bottleneck in our ability to understand the molecular underpinnings of life is the assignment of function to biological macromolecules, especially proteins. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted purview have led to an increasing role for computational function prediction. However, accurately assessing methods for protein function prediction and tracking progress in the field remain challenging. Methodology: We have conducted the second Critical Assessment of Functional Annotation (CAFA), a timed challenge to assess computational methods that automatically assign protein function. One hundred twenty-six methods from 56 research groups were evaluated for their ability to predict biological functions using the Gene Ontology and gene-disease associations using the Human Phenotype Ontology on a set of 3,681 proteins from 18 species. CAFA2 featured significantly expanded analysis compared with CAFA1, with regards to data set size, variety, and assessment metrics. To review progress in the field, the analysis also compared the best methods participating in CAFA1 to those of CAFA2. Conclusions: The top performing methods in CAFA2 outperformed the best methods from CAFA1, demonstrating that computational function prediction is improving. This increased accuracy can be attributed to the combined effect of the growing number of experimental annotations and improved methods for function prediction.

Submitted 2 January, 2016; originally announced January 2016.

Comments: Submitted to Genome Biology

arXiv:1508.03373 [pdf, other]

A martingale analysis of first passage times of time-dependent Wiener diffusion models

Authors: Vaibhav Srivastava, Samuel F. Feng, Jonathan D. Cohen, Naomi Ehrich Leonard, Amitai Shenhav

Abstract: Research in psychology and neuroscience has successfully modeled decision making as a process of noisy evidence accumulation to a decision bound. While there are several variants and implementations of this idea, the majority of these models make use of a noisy accumulation between two absorbing boundaries. A common assumption of these models is that decision parameters, e.g., the rate of accumulation (drift rate), remain fixed over the course of a decision, allowing the derivation of analytic formulas for the probabilities of hitting the upper or lower decision threshold, and the mean decision time. There is reason to believe, however, that many types of behavior would be better described by a model in which the parameters were allowed to vary over the course of the decision process. In this paper, we use martingale theory to derive formulas for the mean decision time, hitting probabilities, and first passage time (FPT) densities of a Wiener process with time-varying drift between two time-varying absorbing boundaries. This model was first studied by Ratcliff (1980) in the two-stage form, and here we consider the same model for an arbitrary number of stages (i.e. intervals of time during which parameters are constant). Our calculations enable direct computation of mean decision times and hitting probabilities for the associated multistage process. We also provide a review of how martingale theory may be used to analyze similar models employing Wiener processes by re-deriving some classical results. In concert with a variety of numerical tools already available, the current derivations should encourage mathematical analysis of more complex models of decision making with time-varying evidence.

Submitted 30 September, 2016; v1 submitted 13 August, 2015; originally announced August 2015.

arXiv:math/9809203 [pdf, ps]

Large deviations for the Fleming-Viot process with neutral mutation and selection

Authors: Donald Dawson, Shui Feng

Abstract: Large deviation principles are established for the Fleming-Viot processes with neutral mutation and selection, and the corresponding equilibrium measures as the sampling rate goes to 0. All results are first proved for the finite allele model, and then generalized, through the projective limit technique, to the infinite allele model. Explicit expressions are obtained for the rate functions.

Submitted 16 September, 1998; originally announced September 1998.

Report number: FI-NP1998-005

Showing 1–25 of 25 results for author: Feng, S