-
Deciphering the unique dynamic activation pathway in a G protein-coupled receptor enables unveiling biased signaling and identifying cryptic allosteric sites in conformational intermediates
Authors:
Jigang Fan,
Chunhao Zhu,
Xiaobing Lan,
Haiming Zhuang,
Mingyu Li,
Jian Zhang,
Shaoyong Lu
Abstract:
Neurotensin receptor 1 (NTSR1), a member of the Class A G protein-coupled receptor superfamily, plays an important role in modulating dopaminergic neuronal activity and eliciting opioid-independent analgesia. Recent studies suggest that promoting \{beta}-arrestin-biased signaling in NTSR1 may diminish drugs of abuse, such as psychostimulants, thereby offering a potential avenue for treating human…
▽ More
Neurotensin receptor 1 (NTSR1), a member of the Class A G protein-coupled receptor superfamily, plays an important role in modulating dopaminergic neuronal activity and eliciting opioid-independent analgesia. Recent studies suggest that promoting \{beta}-arrestin-biased signaling in NTSR1 may diminish drugs of abuse, such as psychostimulants, thereby offering a potential avenue for treating human addiction-related disorders. In this study, we utilized a novel computational and experimental approach that combined nudged elastic band-based molecular dynamics simulations, Markov state models, temporal communication network analysis, site-directed mutagenesis, and conformational biosensors, to explore the intricate mechanisms underlying NTSR1 activation and biased signaling. Our study reveals a dynamic stepwise transition mechanism and activated transmission network associated with NTSR1 activation. It also yields valuable insights into the complex interplay between the unique polar network, non-conserved ion locks, and aromatic clusters in NTSR1 signaling. Moreover, we identified a cryptic allosteric site located in the intracellular region of the receptor that exists in an intermediate state within the activation pathway. Collectively, these findings contribute to a more profound understanding of NTSR1 activation and biased signaling at the atomic level, thereby providing a potential strategy for the development of NTSR1 allosteric modulators in the realm of G protein-coupled receptor biology, biophysics, and medicine.
△ Less
Submitted 24 April, 2025;
originally announced April 2025.
-
Uni-3DAR: Unified 3D Generation and Understanding via Autoregression on Compressed Spatial Tokens
Authors:
Shuqi Lu,
Haowei Lin,
Lin Yao,
Zhifeng Gao,
Xiaohong Ji,
Weinan E,
Linfeng Zhang,
Guolin Ke
Abstract:
Recent advancements in large language models and their multi-modal extensions have demonstrated the effectiveness of unifying generation and understanding through autoregressive next-token prediction. However, despite the critical role of 3D structural generation and understanding (3D GU) in AI for science, these tasks have largely evolved independently, with autoregressive methods remaining under…
▽ More
Recent advancements in large language models and their multi-modal extensions have demonstrated the effectiveness of unifying generation and understanding through autoregressive next-token prediction. However, despite the critical role of 3D structural generation and understanding (3D GU) in AI for science, these tasks have largely evolved independently, with autoregressive methods remaining underexplored. To bridge this gap, we introduce Uni-3DAR, a unified framework that seamlessly integrates 3D GU tasks via autoregressive prediction. At its core, Uni-3DAR employs a novel hierarchical tokenization that compresses 3D space using an octree, leveraging the inherent sparsity of 3D structures. It then applies an additional tokenization for fine-grained structural details, capturing key attributes such as atom types and precise spatial coordinates in microscopic 3D structures. We further propose two optimizations to enhance efficiency and effectiveness. The first is a two-level subtree compression strategy, which reduces the octree token sequence by up to 8x. The second is a masked next-token prediction mechanism tailored for dynamically varying token positions, significantly boosting model performance. By combining these strategies, Uni-3DAR successfully unifies diverse 3D GU tasks within a single autoregressive framework. Extensive experiments across multiple microscopic 3D GU tasks, including molecules, proteins, polymers, and crystals, validate its effectiveness and versatility. Notably, Uni-3DAR surpasses previous state-of-the-art diffusion models by a substantial margin, achieving up to 256\% relative improvement while delivering inference speeds up to 21.8x faster. The code is publicly available at https://github.com/dptech-corp/Uni-3DAR.
△ Less
Submitted 21 March, 2025; v1 submitted 20 March, 2025;
originally announced March 2025.
-
Beyond Atoms: Enhancing Molecular Pretrained Representations with 3D Space Modeling
Authors:
Shuqi Lu,
Xiaohong Ji,
Bohang Zhang,
Lin Yao,
Siyuan Liu,
Zhifeng Gao,
Linfeng Zhang,
Guolin Ke
Abstract:
Molecular pretrained representations (MPR) has emerged as a powerful approach for addressing the challenge of limited supervised data in applications such as drug discovery and material design. While early MPR methods relied on 1D sequences and 2D graphs, recent advancements have incorporated 3D conformational information to capture rich atomic interactions. However, these prior models treat molec…
▽ More
Molecular pretrained representations (MPR) has emerged as a powerful approach for addressing the challenge of limited supervised data in applications such as drug discovery and material design. While early MPR methods relied on 1D sequences and 2D graphs, recent advancements have incorporated 3D conformational information to capture rich atomic interactions. However, these prior models treat molecules merely as discrete atom sets, overlooking the space surrounding them. We argue from a physical perspective that only modeling these discrete points is insufficient. We first present a simple yet insightful observation: naively adding randomly sampled virtual points beyond atoms can surprisingly enhance MPR performance. In light of this, we propose a principled framework that incorporates the entire 3D space spanned by molecules. We implement the framework via a novel Transformer-based architecture, dubbed SpaceFormer, with three key components: (1) grid-based space discretization; (2) grid sampling/merging; and (3) efficient 3D positional encoding. Extensive experiments show that SpaceFormer significantly outperforms previous 3D MPR models across various downstream tasks with limited data, validating the benefit of leveraging the additional 3D space beyond atoms in MPR models.
△ Less
Submitted 18 March, 2025; v1 submitted 13 March, 2025;
originally announced March 2025.
-
Structure Language Models for Protein Conformation Generation
Authors:
Jiarui Lu,
Xiaoyin Chen,
Stephen Zhewen Lu,
Chence Shi,
Hongyu Guo,
Yoshua Bengio,
Jian Tang
Abstract:
Proteins adopt multiple structural conformations to perform their diverse biological functions, and understanding these conformations is crucial for advancing drug discovery. Traditional physics-based simulation methods often struggle with sampling equilibrium conformations and are computationally expensive. Recently, deep generative models have shown promise in generating protein conformations as…
▽ More
Proteins adopt multiple structural conformations to perform their diverse biological functions, and understanding these conformations is crucial for advancing drug discovery. Traditional physics-based simulation methods often struggle with sampling equilibrium conformations and are computationally expensive. Recently, deep generative models have shown promise in generating protein conformations as a more efficient alternative. However, these methods predominantly rely on the diffusion process within a 3D geometric space, which typically centers around the vicinity of metastable states and is often inefficient in terms of runtime. In this paper, we introduce Structure Language Modeling (SLM) as a novel framework for efficient protein conformation generation. Specifically, the protein structures are first encoded into a compact latent space using a discrete variational auto-encoder, followed by conditional language modeling that effectively captures sequence-specific conformation distributions. This enables a more efficient and interpretable exploration of diverse ensemble modes compared to existing methods. Based on this general framework, we instantiate SLM with various popular LM architectures as well as proposing the ESMDiff, a novel BERT-like structure language model fine-tuned from ESM3 with masked diffusion. We verify our approach in various scenarios, including the equilibrium dynamics of BPTI, conformational change pairs, and intrinsically disordered proteins. SLM provides a highly efficient solution, offering a 20-100x speedup than existing methods in generating diverse conformations, shedding light on promising avenues for future research.
△ Less
Submitted 12 March, 2025; v1 submitted 23 October, 2024;
originally announced October 2024.
-
Cell Morphology-Guided Small Molecule Generation with GFlowNets
Authors:
Stephen Zhewen Lu,
Ziqing Lu,
Ehsan Hajiramezanali,
Tommaso Biancalani,
Yoshua Bengio,
Gabriele Scalia,
MichaĆ Koziarski
Abstract:
High-content phenotypic screening, including high-content imaging (HCI), has gained popularity in the last few years for its ability to characterize novel therapeutics without prior knowledge of the protein target. When combined with deep learning techniques to predict and represent molecular-phenotype interactions, these advancements hold the potential to significantly accelerate and enhance drug…
▽ More
High-content phenotypic screening, including high-content imaging (HCI), has gained popularity in the last few years for its ability to characterize novel therapeutics without prior knowledge of the protein target. When combined with deep learning techniques to predict and represent molecular-phenotype interactions, these advancements hold the potential to significantly accelerate and enhance drug discovery applications. This work focuses on the novel task of HCI-guided molecular design. Generative models for molecule design could be guided by HCI data, for example with a supervised model that links molecules to phenotypes of interest as a reward function. However, limited labeled data, combined with the high-dimensional readouts, can make training these methods challenging and impractical. We consider an alternative approach in which we leverage an unsupervised multimodal joint embedding to define a latent similarity as a reward for GFlowNets. The proposed model learns to generate new molecules that could produce phenotypic effects similar to those of the given image target, without relying on pre-annotated phenotypic labels. We demonstrate that the proposed method generates molecules with high morphological and structural similarity to the target, increasing the likelihood of similar biological activity, as confirmed by an independent oracle model.
△ Less
Submitted 9 August, 2024;
originally announced August 2024.
-
In silico bioactivity prediction of proteins interacting with graphene-based nanomaterials guides rational design of biosensor
Authors:
Jing Ye,
Minzhi Fan,
Xiaoyu Zhang,
Shasha Lu,
Mengyao Chai,
Yunshan Zhang,
Xiaoyu Zhao,
Shuang Li,
Diming Zhang
Abstract:
Graphene based nanomaterials have attracted significant attention for their potentials in biomedical and biotechnology applications in recent years, owing to the outstanding physical and chemical properties. However, the interaction mechanism and impact on biological activity of macro and micro biomolecules still require more concerns and further research in order to enhance their applicability in…
▽ More
Graphene based nanomaterials have attracted significant attention for their potentials in biomedical and biotechnology applications in recent years, owing to the outstanding physical and chemical properties. However, the interaction mechanism and impact on biological activity of macro and micro biomolecules still require more concerns and further research in order to enhance their applicability in biosensors, etc. Herein, an integrated method has been developed to predict the protein bioactivity performance when interacting with nanomaterials for protein based biosensor. Molecular dynamics simulation and molecular docking technique were consolidated to investigate several nanomaterials C60 fullerene, single walled carbon nanotube, pristine graphene and graphene oxide, and their effect when interacting with protein. The adsorption behavior, secondary structure changes and protein bioactivity changes were simulated, and the results of protein activity simulation were verified in combination with atomic force spectrum, circular dichroism spectrum fluorescence and electrochemical experiments. The best quantification alignment between bioactivity obtained by simulation and experiment measurements was further explored. The two proteins, RNase A and Exonuclease III, were regarded as analysis model for the proof of concept, and the prediction accuracy of protein bioactivty could reach up to 0.98.
△ Less
Submitted 8 April, 2024;
originally announced April 2024.
-
Pre-Training Protein Bi-level Representation Through Span Mask Strategy On 3D Protein Chains
Authors:
Jiale Zhao,
Wanru Zhuang,
Jia Song,
Yaqi Li,
Shuqi Lu
Abstract:
In recent years, there has been a surge in the development of 3D structure-based pre-trained protein models, representing a significant advancement over pre-trained protein language models in various downstream tasks. However, most existing structure-based pre-trained models primarily focus on the residue level, i.e., alpha carbon atoms, while ignoring other atoms like side chain atoms. We argue t…
▽ More
In recent years, there has been a surge in the development of 3D structure-based pre-trained protein models, representing a significant advancement over pre-trained protein language models in various downstream tasks. However, most existing structure-based pre-trained models primarily focus on the residue level, i.e., alpha carbon atoms, while ignoring other atoms like side chain atoms. We argue that modeling proteins at both residue and atom levels is important since the side chain atoms can also be crucial for numerous downstream tasks, for example, molecular docking. Nevertheless, we find that naively combining residue and atom information during pre-training typically fails. We identify a key reason is the information leakage caused by the inclusion of atom structure in the input, which renders residue-level pre-training tasks trivial and results in insufficiently expressive residue representations. To address this issue, we introduce a span mask pre-training strategy on 3D protein chains to learn meaningful representations of both residues and atoms. This leads to a simple yet effective approach to learning protein representation suitable for diverse downstream tasks. Extensive experimental results on binding site prediction and function prediction tasks demonstrate our proposed pre-training approach significantly outperforms other methods. Our code will be made public.
△ Less
Submitted 2 June, 2024; v1 submitted 2 February, 2024;
originally announced February 2024.
-
MultiModal-Learning for Predicting Molecular Properties: A Framework Based on Image and Graph Structures
Authors:
Zhuoyuan Wang,
Jiacong Mi,
Shan Lu,
Jieyue He
Abstract:
The quest for accurate prediction of drug molecule properties poses a fundamental challenge in the realm of Artificial Intelligence Drug Discovery (AIDD). An effective representation of drug molecules emerges as a pivotal component in this pursuit. Contemporary leading-edge research predominantly resorts to self-supervised learning (SSL) techniques to extract meaningful structural representations…
▽ More
The quest for accurate prediction of drug molecule properties poses a fundamental challenge in the realm of Artificial Intelligence Drug Discovery (AIDD). An effective representation of drug molecules emerges as a pivotal component in this pursuit. Contemporary leading-edge research predominantly resorts to self-supervised learning (SSL) techniques to extract meaningful structural representations from large-scale, unlabeled molecular data, subsequently fine-tuning these representations for an array of downstream tasks. However, an inherent shortcoming of these studies lies in their singular reliance on one modality of molecular information, such as molecule image or SMILES representations, thus neglecting the potential complementarity of various molecular modalities. In response to this limitation, we propose MolIG, a novel MultiModaL molecular pre-training framework for predicting molecular properties based on Image and Graph structures. MolIG model innovatively leverages the coherence and correlation between molecule graph and molecule image to execute self-supervised tasks, effectively amalgamating the strengths of both molecular representation forms. This holistic approach allows for the capture of pivotal molecular structural characteristics and high-level semantic information. Upon completion of pre-training, Graph Neural Network (GNN) Encoder is used for the prediction of downstream tasks. In comparison to advanced baseline models, MolIG exhibits enhanced performance in downstream tasks pertaining to molecular property prediction within benchmark groups such as MoleculeNet Benchmark Group and ADMET Benchmark Group.
△ Less
Submitted 19 April, 2024; v1 submitted 28 November, 2023;
originally announced November 2023.
-
Do Deep Learning Models Really Outperform Traditional Approaches in Molecular Docking?
Authors:
Yuejiang Yu,
Shuqi Lu,
Zhifeng Gao,
Hang Zheng,
Guolin Ke
Abstract:
Molecular docking, given a ligand molecule and a ligand binding site (called ``pocket'') on a protein, predicting the binding mode of the protein-ligand complex, is a widely used technique in drug design. Many deep learning models have been developed for molecular docking, while most existing deep learning models perform docking on the whole protein, rather than on a given pocket as the traditiona…
▽ More
Molecular docking, given a ligand molecule and a ligand binding site (called ``pocket'') on a protein, predicting the binding mode of the protein-ligand complex, is a widely used technique in drug design. Many deep learning models have been developed for molecular docking, while most existing deep learning models perform docking on the whole protein, rather than on a given pocket as the traditional molecular docking approaches, which does not match common needs. What's more, they claim to perform better than traditional molecular docking, but the approach of comparison is not fair, since traditional methods are not designed for docking on the whole protein without a given pocket. In this paper, we design a series of experiments to examine the actual performance of these deep learning models and traditional methods. For a fair comparison, we decompose the docking on the whole protein into two steps, pocket searching and docking on a given pocket, and build pipelines to evaluate traditional methods and deep learning methods respectively. We find that deep learning models are actually good at pocket searching, but traditional methods are better than deep learning models at docking on given pockets. Overall, our work explicitly reveals some potential problems in current deep learning models for molecular docking and provides several suggestions for future works.
△ Less
Submitted 23 February, 2023; v1 submitted 14 February, 2023;
originally announced February 2023.
-
3D Molecular Generation via Virtual Dynamics
Authors:
Shuqi Lu,
Lin Yao,
Xi Chen,
Hang Zheng,
Di He,
Guolin Ke
Abstract:
Structure-based drug design, i.e., finding molecules with high affinities to the target protein pocket, is one of the most critical tasks in drug discovery. Traditional solutions, like virtual screening, require exhaustively searching on a large molecular database, which are inefficient and cannot return novel molecules beyond the database. The pocket-based 3D molecular generation model, i.e., dir…
▽ More
Structure-based drug design, i.e., finding molecules with high affinities to the target protein pocket, is one of the most critical tasks in drug discovery. Traditional solutions, like virtual screening, require exhaustively searching on a large molecular database, which are inefficient and cannot return novel molecules beyond the database. The pocket-based 3D molecular generation model, i.e., directly generating a molecule with a 3D structure and binding position in the pocket, is a new promising way to address this issue. Herein, we propose VD-Gen, a novel pocket-based 3D molecular generation pipeline. VD-Gen consists of several carefully designed stages to generate fine-grained 3D molecules with binding positions in the pocket cavity end-to-end. Rather than directly generating or sampling atoms with 3D positions in the pocket like in early attempts, in VD-Gen, we first randomly initialize many virtual particles in the pocket; then iteratively move these virtual particles, making the distribution of virtual particles approximate the distribution of molecular atoms. After virtual particles are stabilized in 3D space, we extract a 3D molecule from them. Finally, we further refine atoms in the extracted molecule by iterative movement again, to get a high-quality 3D molecule, and predict a confidence score for it. Extensive experiment results on pocket-based molecular generation demonstrate that VD-Gen can generate novel 3D molecules to fill the target pocket cavity with high binding affinities, significantly outperforming previous baselines.
△ Less
Submitted 11 February, 2023;
originally announced February 2023.
-
An Early Warning Approach to Monitor COVID-19 Activity with Multiple Digital Traces in Near Real-Time
Authors:
Nicole E. Kogan,
Leonardo Clemente,
Parker Liautaud,
Justin Kaashoek,
Nicholas B. Link,
Andre T. Nguyen,
Fred S. Lu,
Peter Huybers,
Bernd Resch,
Clemens Havas,
Andreas Petutschnig,
Jessica Davis,
Matteo Chinazzi,
Backtosch Mustafa,
William P. Hanage,
Alessandro Vespignani,
Mauricio Santillana
Abstract:
Non-pharmaceutical interventions (NPIs) have been crucial in curbing COVID-19 in the United States (US). Consequently, relaxing NPIs through a phased re-opening of the US amid still-high levels of COVID-19 susceptibility could lead to new epidemic waves. This calls for a COVID-19 early warning system. Here we evaluate multiple digital data streams as early warning indicators of increasing or decre…
▽ More
Non-pharmaceutical interventions (NPIs) have been crucial in curbing COVID-19 in the United States (US). Consequently, relaxing NPIs through a phased re-opening of the US amid still-high levels of COVID-19 susceptibility could lead to new epidemic waves. This calls for a COVID-19 early warning system. Here we evaluate multiple digital data streams as early warning indicators of increasing or decreasing state-level US COVID-19 activity between January and June 2020. We estimate the timing of sharp changes in each data stream using a simple Bayesian model that calculates in near real-time the probability of exponential growth or decay. Analysis of COVID-19-related activity on social network microblogs, Internet searches, point-of-care medical software, and a metapopulation mechanistic model, as well as fever anomalies captured by smart thermometer networks, shows exponential growth roughly 2-3 weeks prior to comparable growth in confirmed COVID-19 cases and 3-4 weeks prior to comparable growth in COVID-19 deaths across the US over the last 6 months. We further observe exponential decay in confirmed cases and deaths 5-6 weeks after implementation of NPIs, as measured by anonymized and aggregated human mobility data from mobile phones. Finally, we propose a combined indicator for exponential growth in multiple data streams that may aid in developing an early warning system for future COVID-19 outbreaks. These efforts represent an initial exploratory framework, and both continued study of the predictive power of digital indicators as well as further development of the statistical approach are needed.
△ Less
Submitted 3 July, 2020; v1 submitted 1 July, 2020;
originally announced July 2020.
-
From data towards knowledge: Revealing the architecture of signaling systems by unifying knowledge mining and data mining of systematic perturbation data
Authors:
Songjian Lu,
Bo Jin,
Ashley Cowart,
Xinghua Lu
Abstract:
Genetic and pharmacological perturbation experiments, such as deleting a gene and monitoring gene expression responses, are powerful tools for studying cellular signal transduction pathways. However, it remains a challenge to automatically derive knowledge of a cellular signaling system at a conceptual level from systematic perturbation-response data. In this study, we explored a framework that un…
▽ More
Genetic and pharmacological perturbation experiments, such as deleting a gene and monitoring gene expression responses, are powerful tools for studying cellular signal transduction pathways. However, it remains a challenge to automatically derive knowledge of a cellular signaling system at a conceptual level from systematic perturbation-response data. In this study, we explored a framework that unifies knowledge mining and data mining approaches towards the goal. The framework consists of the following automated processes: 1) applying an ontology-driven knowledge mining approach to identify functional modules among the genes responding to a perturbation in order to reveal potential signals affected by the perturbation; 2) applying a graph-based data mining approach to search for perturbations that affect a common signal with respect to a functional module, and 3) revealing the architecture of a signaling system organize signaling units into a hierarchy based on their relationships. Applying this framework to a compendium of yeast perturbation-response data, we have successfully recovered many well-known signal transduction pathways; in addition, our analysis have led to many hypotheses regarding the yeast signal transduction system; finally, our analysis automatically organized perturbed genes as a graph reflecting the architect of the yeast signaling system. Importantly, this framework transformed molecular findings from a gene level to a conceptual level, which readily can be translated into computable knowledge in the form of rules regarding the yeast signaling system, such as "if genes involved in MAPK signaling are perturbed, genes involved in pheromone responses will be differentially expressed".
△ Less
Submitted 22 February, 2013; v1 submitted 21 February, 2013;
originally announced February 2013.
-
Drug hypersensitivity caused by alteration of the MHC-presented self-peptide repertoire
Authors:
David A. Ostrov,
Barry J. Grant,
Yuri A. Pompeu,
John Sidney,
Mikkel Harndahl,
Scott Southwood,
Carla Oseroff,
Shun Lu,
Jean Jakoncic,
Cesar Augusto. F. de Oliveira,
Lun Yang,
Hu Mei,
Leming Shi,
Jeffrey Shabanowitz,
A. Michelle English,
Amanda Wriston,
Andrew Lucas,
Elizabeth Phillips,
Simon Mallal,
Howard Grey,
Alessandro Sette,
Donald F. Hunt,
Soren Buus,
Bjoern Peters
Abstract:
Idiosyncratic adverse drug reactions are unpredictable, dose independent and potentially life threatening; this makes them a major factor contributing to the cost and uncertainty of drug development. Clinical data suggest that many such reactions involve immune mechanisms, and genetic association studies have identified strong linkage between drug hypersensitivity reactions to several drugs and sp…
▽ More
Idiosyncratic adverse drug reactions are unpredictable, dose independent and potentially life threatening; this makes them a major factor contributing to the cost and uncertainty of drug development. Clinical data suggest that many such reactions involve immune mechanisms, and genetic association studies have identified strong linkage between drug hypersensitivity reactions to several drugs and specific HLA alleles. One of the strongest such genetic associations found has been for the antiviral drug abacavir, which causes severe adverse reactions exclusively in patients expressing the HLA molecular variant B*57:01. Abacavir adverse reactions were recently shown to be driven by drug-specific activation of cytokine-producing, cytotoxic CD8+ T cells that required HLA-B*57:01 molecules for their function. However, the mechanism by which abacavir induces this pathologic T cell response remains unclear. Here we show that abacavir can bind within the F-pocket of the peptide-binding groove of HLA-B*57:01 thereby altering its specificity. This supports a novel explanation for HLA-linked idiosyncratic adverse drug reactions; namely that drugs can alter the repertoire of self-peptides presented to T cells thus causing the equivalent of an alloreactive T cell response. Indeed, we identified specific self-peptides that are presented only in the presence of abacavir, and that were recognized by T cells of hypersensitive patients. The assays we have established can be applied to test additional compounds with suspected HLA linked hypersensitivities in vitro. Where successful, these assays could speed up the discovery and mechanistic understanding of HLA linked hypersensitivities as well as guide the development of safer drugs.
△ Less
Submitted 21 May, 2012;
originally announced May 2012.