-
Structure-Aware Contrastive Learning with Fine-Grained Binding Representations for Drug Discovery
Authors:
Jing Lan,
Hexiao Ding,
Hongzhao Chen,
Yufeng Jiang,
Nga-Chun Ng,
Gwing Kei Yip,
Gerald W. Y. Cheng,
Yunlin Mao,
Jing Cai,
Liang-ting Lin,
Jung Sun Yoo
Abstract:
Accurate identification of drug-target interactions (DTI) remains a central challenge in computational pharmacology, where sequence-based methods offer scalability. This work introduces a sequence-based drug-target interaction framework that integrates structural priors into protein representations while maintaining high-throughput screening capability. Evaluated across multiple benchmarks, the mo…
▽ More
Accurate identification of drug-target interactions (DTI) remains a central challenge in computational pharmacology, where sequence-based methods offer scalability. This work introduces a sequence-based drug-target interaction framework that integrates structural priors into protein representations while maintaining high-throughput screening capability. Evaluated across multiple benchmarks, the model achieves state-of-the-art performance on Human and BioSNAP datasets and remains competitive on BindingDB. In virtual screening tasks, it surpasses prior methods on LIT-PCBA, yielding substantial gains in AUROC and BEDROC. Ablation studies confirm the critical role of learned aggregation, bilinear attention, and contrastive alignment in enhancing predictive robustness. Embedding visualizations reveal improved spatial correspondence with known binding pockets and highlight interpretable attention patterns over ligand-residue contacts. These results validate the framework's utility for scalable and structure-aware DTI prediction.
△ Less
Submitted 18 September, 2025;
originally announced September 2025.
-
A Multimodal Foundation Model to Enhance Generalizability and Data Efficiency for Pan-cancer Prognosis Prediction
Authors:
Huajun Zhou,
Fengtao Zhou,
Jiabo Ma,
Yingxue Xu,
Xi Wang,
Xiuming Zhang,
Li Liang,
Zhenhui Li,
Hao Chen
Abstract:
Multimodal data provides heterogeneous information for a holistic understanding of the tumor microenvironment. However, existing AI models often struggle to harness the rich information within multimodal data and extract poorly generalizable representations. Here we present MICE (Multimodal data Integration via Collaborative Experts), a multimodal foundation model that effectively integrates patho…
▽ More
Multimodal data provides heterogeneous information for a holistic understanding of the tumor microenvironment. However, existing AI models often struggle to harness the rich information within multimodal data and extract poorly generalizable representations. Here we present MICE (Multimodal data Integration via Collaborative Experts), a multimodal foundation model that effectively integrates pathology images, clinical reports, and genomics data for precise pan-cancer prognosis prediction. Instead of conventional multi-expert modules, MICE employs multiple functionally diverse experts to comprehensively capture both cross-cancer and cancer-specific insights. Leveraging data from 11,799 patients across 30 cancer types, we enhanced MICE's generalizability by coupling contrastive and supervised learning. MICE outperformed both unimodal and state-of-the-art multi-expert-based multimodal models, demonstrating substantial improvements in C-index ranging from 3.8% to 11.2% on internal cohorts and 5.8% to 8.8% on independent cohorts, respectively. Moreover, it exhibited remarkable data efficiency across diverse clinical scenarios. With its enhanced generalizability and data efficiency, MICE establishes an effective and scalable foundation for pan-cancer prognosis prediction, holding strong potential to personalize tailored therapies and improve treatment outcomes.
△ Less
Submitted 15 September, 2025;
originally announced September 2025.
-
Contrastive Multi-Task Learning with Solvent-Aware Augmentation for Drug Discovery
Authors:
Jing Lan,
Hexiao Ding,
Hongzhao Chen,
Yufeng Jiang,
Nga-Chun Ng,
Gerald W. Y. Cheng,
Zongxi Li,
Jing Cai,
Liang-ting Lin,
Jung Sun Yoo
Abstract:
Accurate prediction of protein-ligand interactions is essential for computer-aided drug discovery. However, existing methods often fail to capture solvent-dependent conformational changes and lack the ability to jointly learn multiple related tasks. To address these limitations, we introduce a pre-training method that incorporates ligand conformational ensembles generated under diverse solvent con…
▽ More
Accurate prediction of protein-ligand interactions is essential for computer-aided drug discovery. However, existing methods often fail to capture solvent-dependent conformational changes and lack the ability to jointly learn multiple related tasks. To address these limitations, we introduce a pre-training method that incorporates ligand conformational ensembles generated under diverse solvent conditions as augmented input. This design enables the model to learn both structural flexibility and environmental context in a unified manner. The training process integrates molecular reconstruction to capture local geometry, interatomic distance prediction to model spatial relationships, and contrastive learning to build solvent-invariant molecular representations. Together, these components lead to significant improvements, including a 3.7% gain in binding affinity prediction, an 82% success rate on the PoseBusters Astex docking benchmarks, and an area under the curve of 97.1% in virtual screening. The framework supports solvent-aware, multi-task modeling and produces consistent results across benchmarks. A case study further demonstrates sub-angstrom docking accuracy with a root-mean-square deviation of 0.157 angstroms, offering atomic-level insight into binding mechanisms and advancing structure-based drug design.
△ Less
Submitted 27 August, 2025; v1 submitted 3 August, 2025;
originally announced August 2025.
-
An AI-native experimental laboratory for autonomous biomolecular engineering
Authors:
Mingyu Wu,
Zhaoguo Wang,
Jiabin Wang,
Zhiyuan Dong,
Jingkai Yang,
Qingting Li,
Tianyu Huang,
Lei Zhao,
Mingqiang Li,
Fei Wang,
Chunhai Fan,
Haibo Chen
Abstract:
Autonomous scientific research, capable of independently conducting complex experiments and serving non-specialists, represents a long-held aspiration. Achieving it requires a fundamental paradigm shift driven by artificial intelligence (AI). While autonomous experimental systems are emerging, they remain confined to areas featuring singular objectives and well-defined, simple experimental workflo…
▽ More
Autonomous scientific research, capable of independently conducting complex experiments and serving non-specialists, represents a long-held aspiration. Achieving it requires a fundamental paradigm shift driven by artificial intelligence (AI). While autonomous experimental systems are emerging, they remain confined to areas featuring singular objectives and well-defined, simple experimental workflows, such as chemical synthesis and catalysis. We present an AI-native autonomous laboratory, targeting highly complex scientific experiments for applications like autonomous biomolecular engineering. This system autonomously manages instrumentation, formulates experiment-specific procedures and optimization heuristics, and concurrently serves multiple user requests. Founded on a co-design philosophy of models, experiments, and instruments, the platform supports the co-evolution of AI models and the automation system. This establishes an end-to-end, multi-user autonomous laboratory that handles complex, multi-objective experiments across diverse instrumentation. Our autonomous laboratory supports fundamental nucleic acid functions-including synthesis, transcription, amplification, and sequencing. It also enables applications in fields such as disease diagnostics, drug development, and information storage. Without human intervention, it autonomously optimizes experimental performance to match state-of-the-art results achieved by human scientists. In multi-user scenarios, the platform significantly improves instrument utilization and experimental efficiency. This platform paves the way for advanced biomaterials research to overcome dependencies on experts and resource barriers, establishing a blueprint for science-as-a-service at scale.
△ Less
Submitted 3 July, 2025;
originally announced July 2025.
-
From Sentences to Sequences: Rethinking Languages in Biological System
Authors:
Ke Liu,
Shuaike Shen,
Hao Chen
Abstract:
The paradigm of large language models in natural language processing (NLP) has also shown promise in modeling biological languages, including proteins, RNA, and DNA. Both the auto-regressive generation paradigm and evaluation metrics have been transferred from NLP to biological sequence modeling. However, the intrinsic structural correlations in natural and biological languages differ fundamentall…
▽ More
The paradigm of large language models in natural language processing (NLP) has also shown promise in modeling biological languages, including proteins, RNA, and DNA. Both the auto-regressive generation paradigm and evaluation metrics have been transferred from NLP to biological sequence modeling. However, the intrinsic structural correlations in natural and biological languages differ fundamentally. Therefore, we revisit the notion of language in biological systems to better understand how NLP successes can be effectively translated to biological domains. By treating the 3D structure of biomolecules as the semantic content of a sentence and accounting for the strong correlations between residues or bases, we highlight the importance of structural evaluation and demonstrate the applicability of the auto-regressive paradigm in biological language modeling. Code can be found at \href{https://github.com/zjuKeLiu/RiFold}{github.com/zjuKeLiu/RiFold}
△ Less
Submitted 3 July, 2025; v1 submitted 1 July, 2025;
originally announced July 2025.
-
Recent Developments in GNNs for Drug Discovery
Authors:
Zhengyu Fang,
Xiaoge Zhang,
Anyin Zhao,
Xiao Li,
Huiyuan Chen,
Jing Li
Abstract:
In this paper, we review recent developments and the role of Graph Neural Networks (GNNs) in computational drug discovery, including molecule generation, molecular property prediction, and drug-drug interaction prediction. By summarizing the most recent developments in this area, we underscore the capabilities of GNNs to comprehend intricate molecular patterns, while exploring both their current a…
▽ More
In this paper, we review recent developments and the role of Graph Neural Networks (GNNs) in computational drug discovery, including molecule generation, molecular property prediction, and drug-drug interaction prediction. By summarizing the most recent developments in this area, we underscore the capabilities of GNNs to comprehend intricate molecular patterns, while exploring both their current and prospective applications. We initiate our discussion by examining various molecular representations, followed by detailed discussions and categorization of existing GNN models based on their input types and downstream application tasks. We also collect a list of commonly used benchmark datasets for a variety of applications. We conclude the paper with brief discussions and summarize common trends in this important research area.
△ Less
Submitted 2 June, 2025;
originally announced June 2025.
-
Antibiotic Resistance Microbiology Dataset (ARMD): A Resource for Antimicrobial Resistance from EHRs
Authors:
Fateme Nateghi Haredasht,
Fatemeh Amrollahi,
Manoj Maddali,
Nicholas Marshall,
Stephen P. Ma,
Lauren N. Cooper,
Andrew O. Johnson,
Ziming Wei,
Richard J. Medford,
Sanjat Kanjilal,
Niaz Banaei,
Stanley Deresinski,
Mary K. Goldstein,
Steven M. Asch,
Amy Chang,
Jonathan H. Chen
Abstract:
The Antibiotic Resistance Microbiology Dataset (ARMD) is a de-identified resource derived from electronic health records (EHR) that facilitates research in antimicrobial resistance (AMR). ARMD encompasses big data from adult patients collected from over 15 years at two academic-affiliated hospitals, focusing on microbiological cultures, antibiotic susceptibilities, and associated clinical and demo…
▽ More
The Antibiotic Resistance Microbiology Dataset (ARMD) is a de-identified resource derived from electronic health records (EHR) that facilitates research in antimicrobial resistance (AMR). ARMD encompasses big data from adult patients collected from over 15 years at two academic-affiliated hospitals, focusing on microbiological cultures, antibiotic susceptibilities, and associated clinical and demographic features. Key attributes include organism identification, susceptibility patterns for 55 antibiotics, implied susceptibility rules, and de-identified patient information. This dataset supports studies on antimicrobial stewardship, causal inference, and clinical decision-making. ARMD is designed to be reusable and interoperable, promoting collaboration and innovation in combating AMR. This paper describes the dataset's acquisition, structure, and utility while detailing its de-identification process.
△ Less
Submitted 21 July, 2025; v1 submitted 8 March, 2025;
originally announced March 2025.
-
MotifBench: A standardized protein design benchmark for motif-scaffolding problems
Authors:
Zhuoqi Zheng,
Bo Zhang,
Kieran Didi,
Kevin K. Yang,
Jason Yim,
Joseph L. Watson,
Hai-Feng Chen,
Brian L. Trippe
Abstract:
The motif-scaffolding problem is a central task in computational protein design: Given the coordinates of atoms in a geometry chosen to confer a desired biochemical function (a motif), the task is to identify diverse protein structures (scaffolds) that include the motif and maintain its geometry. Significant recent progress on motif-scaffolding has been made due to computational evaluation with re…
▽ More
The motif-scaffolding problem is a central task in computational protein design: Given the coordinates of atoms in a geometry chosen to confer a desired biochemical function (a motif), the task is to identify diverse protein structures (scaffolds) that include the motif and maintain its geometry. Significant recent progress on motif-scaffolding has been made due to computational evaluation with reliable protein structure prediction and fixed-backbone sequence design methods. However, significant variability in evaluation strategies across publications has hindered comparability of results, challenged reproducibility, and impeded robust progress. In response we introduce MotifBench, comprising (1) a precisely specified pipeline and evaluation metrics, (2) a collection of 30 benchmark problems, and (3) an implementation of this benchmark and leaderboard at github.com/blt2114/MotifBench. The MotifBench test cases are more difficult compared to earlier benchmarks, and include protein design problems for which solutions are known but on which, to the best of our knowledge, state-of-the-art methods fail to identify any solution.
△ Less
Submitted 19 February, 2025; v1 submitted 17 February, 2025;
originally announced February 2025.
-
UniMatch: Universal Matching from Atom to Task for Few-Shot Drug Discovery
Authors:
Ruifeng Li,
Mingqian Li,
Wei Liu,
Yuhua Zhou,
Xiangxin Zhou,
Yuan Yao,
Qiang Zhang,
Hongyang Chen
Abstract:
Drug discovery is crucial for identifying candidate drugs for various diseases.However, its low success rate often results in a scarcity of annotations, posing a few-shot learning problem. Existing methods primarily focus on single-scale features, overlooking the hierarchical molecular structures that determine different molecular properties. To address these issues, we introduce Universal Matchin…
▽ More
Drug discovery is crucial for identifying candidate drugs for various diseases.However, its low success rate often results in a scarcity of annotations, posing a few-shot learning problem. Existing methods primarily focus on single-scale features, overlooking the hierarchical molecular structures that determine different molecular properties. To address these issues, we introduce Universal Matching Networks (UniMatch), a dual matching framework that integrates explicit hierarchical molecular matching with implicit task-level matching via meta-learning, bridging multi-level molecular representations and task-level generalization. Specifically, our approach explicitly captures structural features across multiple levels, such as atoms, substructures, and molecules, via hierarchical pooling and matching, facilitating precise molecular representation and comparison. Additionally, we employ a meta-learning strategy for implicit task-level matching, allowing the model to capture shared patterns across tasks and quickly adapt to new ones. This unified matching framework ensures effective molecular alignment while leveraging shared meta-knowledge for fast adaptation. Our experimental results demonstrate that UniMatch outperforms state-of-the-art methods on the MoleculeNet and FS-Mol benchmarks, achieving improvements of 2.87% in AUROC and 6.52% in delta AUPRC. UniMatch also shows excellent generalization ability on the Meta-MolNet benchmark.
△ Less
Submitted 17 February, 2025;
originally announced February 2025.
-
Generalizable Cervical Cancer Screening via Large-scale Pretraining and Test-Time Adaptation
Authors:
Hao Jiang,
Cheng Jin,
Huangjing Lin,
Yanning Zhou,
Xi Wang,
Jiabo Ma,
Li Ding,
Jun Hou,
Runsheng Liu,
Zhizhong Chai,
Luyang Luo,
Huijuan Shi,
Yinling Qian,
Qiong Wang,
Changzhong Li,
Anjia Han,
Ronald Cheong Kin Chan,
Hao Chen
Abstract:
Cervical cancer is a leading malignancy in female reproductive system. While AI-assisted cytology offers a cost-effective and non-invasive screening solution, current systems struggle with generalizability in complex clinical scenarios. To address this issue, we introduced Smart-CCS, a generalizable Cervical Cancer Screening paradigm based on pretraining and adaptation to create robust and general…
▽ More
Cervical cancer is a leading malignancy in female reproductive system. While AI-assisted cytology offers a cost-effective and non-invasive screening solution, current systems struggle with generalizability in complex clinical scenarios. To address this issue, we introduced Smart-CCS, a generalizable Cervical Cancer Screening paradigm based on pretraining and adaptation to create robust and generalizable screening systems. To develop and validate Smart-CCS, we first curated a large-scale, multi-center dataset named CCS-127K, which comprises a total of 127,471 cervical cytology whole-slide images collected from 48 medical centers. By leveraging large-scale self-supervised pretraining, our CCS models are equipped with strong generalization capability, potentially generalizing across diverse scenarios. Then, we incorporated test-time adaptation to specifically optimize the trained CCS model for complex clinical settings, which adapts and refines predictions, improving real-world applicability. We conducted large-scale system evaluation among various cohorts. In retrospective cohorts, Smart-CCS achieved an overall area under the curve (AUC) value of 0.965 and sensitivity of 0.913 for cancer screening on 11 internal test datasets. In external testing, system performance maintained high at 0.950 AUC across 6 independent test datasets. In prospective cohorts, our Smart-CCS achieved AUCs of 0.947, 0.924, and 0.986 in three prospective centers, respectively. Moreover, the system demonstrated superior sensitivity in diagnosing cervical cancer, confirming the accuracy of our cancer screening results by using histology findings for validation. Interpretability analysis with cell and slide predictions further indicated that the system's decision-making aligns with clinical practice. Smart-CCS represents a significant advancement in cancer screening across diverse clinical contexts.
△ Less
Submitted 12 February, 2025;
originally announced February 2025.
-
Language modulates vision: Evidence from neural networks and human brain-lesion models
Authors:
Haoyang Chen,
Bo Liu,
Shuyue Wang,
Xiaosha Wang,
Wenjuan Han,
Yixin Zhu,
Xiaochun Wang,
Yanchao Bi
Abstract:
Comparing information structures in between deep neural networks (DNNs) and the human brain has become a key method for exploring their similarities and differences. Recent research has shown better alignment of vision-language DNN models, such as CLIP, with the activity of the human ventral occipitotemporal cortex (VOTC) than earlier vision models, supporting the idea that language modulates huma…
▽ More
Comparing information structures in between deep neural networks (DNNs) and the human brain has become a key method for exploring their similarities and differences. Recent research has shown better alignment of vision-language DNN models, such as CLIP, with the activity of the human ventral occipitotemporal cortex (VOTC) than earlier vision models, supporting the idea that language modulates human visual perception. However, interpreting the results from such comparisons is inherently limited due to the "black box" nature of DNNs. To address this, we combined model-brain fitness analyses with human brain lesion data to examine how disrupting the communication pathway between the visual and language systems causally affects the ability of vision-language DNNs to explain the activity of the VOTC. Across four diverse datasets, CLIP consistently captured unique variance in VOTC neural representations, relative to both label-supervised (ResNet) and unsupervised (MoCo) models. This advantage tended to be left-lateralized at the group level, aligning with the human language network. Analyses of 33 stroke patients revealed that reduced white matter integrity between the VOTC and the language region in the left angular gyrus was correlated with decreased CLIP-brain correspondence and increased MoCo-brain correspondence, indicating a dynamic influence of language processing on the activity of the VOTC. These findings support the integration of language modulation in neurocognitive models of human vision, reinforcing concepts from vision-language DNN models. The sensitivity of model-brain similarity to specific brain lesions demonstrates that leveraging manipulation of the human brain is a promising framework for evaluating and developing brain-like computer models.
△ Less
Submitted 17 September, 2025; v1 submitted 23 January, 2025;
originally announced January 2025.
-
A Multi-Modal AI Copilot for Single-Cell Analysis with Instruction Following
Authors:
Yin Fang,
Xinle Deng,
Kangwei Liu,
Ningyu Zhang,
Jingyang Qian,
Penghui Yang,
Xiaohui Fan,
Huajun Chen
Abstract:
Large language models excel at interpreting complex natural language instructions, enabling them to perform a wide range of tasks. In the life sciences, single-cell RNA sequencing (scRNA-seq) data serves as the "language of cellular biology", capturing intricate gene expression patterns at the single-cell level. However, interacting with this "language" through conventional tools is often ineffici…
▽ More
Large language models excel at interpreting complex natural language instructions, enabling them to perform a wide range of tasks. In the life sciences, single-cell RNA sequencing (scRNA-seq) data serves as the "language of cellular biology", capturing intricate gene expression patterns at the single-cell level. However, interacting with this "language" through conventional tools is often inefficient and unintuitive, posing challenges for researchers. To address these limitations, we present InstructCell, a multi-modal AI copilot that leverages natural language as a medium for more direct and flexible single-cell analysis. We construct a comprehensive multi-modal instruction dataset that pairs text-based instructions with scRNA-seq profiles from diverse tissues and species. Building on this, we develop a multi-modal cell language architecture capable of simultaneously interpreting and processing both modalities. InstructCell empowers researchers to accomplish critical tasks-such as cell type annotation, conditional pseudo-cell generation, and drug sensitivity prediction-using straightforward natural language commands. Extensive evaluations demonstrate that InstructCell consistently meets or exceeds the performance of existing single-cell foundation models, while adapting to diverse experimental conditions. More importantly, InstructCell provides an accessible and intuitive tool for exploring complex single-cell data, lowering technical barriers and enabling deeper biological insights.
△ Less
Submitted 14 January, 2025; v1 submitted 14 January, 2025;
originally announced January 2025.
-
FOCUS: Knowledge-enhanced Adaptive Visual Compression for Few-shot Whole Slide Image Classification
Authors:
Zhengrui Guo,
Conghao Xiong,
Jiabo Ma,
Qichen Sun,
Lishuang Feng,
Jinzhuo Wang,
Hao Chen
Abstract:
Few-shot learning presents a critical solution for cancer diagnosis in computational pathology (CPath), addressing fundamental limitations in data availability, particularly the scarcity of expert annotations and patient privacy constraints. A key challenge in this paradigm stems from the inherent disparity between the limited training set of whole slide images (WSIs) and the enormous number of co…
▽ More
Few-shot learning presents a critical solution for cancer diagnosis in computational pathology (CPath), addressing fundamental limitations in data availability, particularly the scarcity of expert annotations and patient privacy constraints. A key challenge in this paradigm stems from the inherent disparity between the limited training set of whole slide images (WSIs) and the enormous number of contained patches, where a significant portion of these patches lacks diagnostically relevant information, potentially diluting the model's ability to learn and focus on critical diagnostic features. While recent works attempt to address this by incorporating additional knowledge, several crucial gaps hinder further progress: (1) despite the emergence of powerful pathology foundation models (FMs), their potential remains largely untapped, with most approaches limiting their use to basic feature extraction; (2) current language guidance mechanisms attempt to align text prompts with vast numbers of WSI patches all at once, struggling to leverage rich pathological semantic information. To this end, we introduce the knowledge-enhanced adaptive visual compression framework, dubbed FOCUS, which uniquely combines pathology FMs with language prior knowledge to enable a focused analysis of diagnostically relevant regions by prioritizing discriminative WSI patches. Our approach implements a progressive three-stage compression strategy: we first leverage FMs for global visual redundancy elimination, and integrate compressed features with language prompts for semantic relevance assessment, then perform neighbor-aware visual token filtering while preserving spatial coherence. Extensive experiments on pathological datasets spanning breast, lung, and ovarian cancers demonstrate its superior performance in few-shot pathology diagnosis. Codes are available at https://github.com/dddavid4real/FOCUS.
△ Less
Submitted 20 March, 2025; v1 submitted 22 November, 2024;
originally announced November 2024.
-
Contextual Representation Anchor Network to Alleviate Selection Bias in Few-Shot Drug Discovery
Authors:
Ruifeng Li,
Wei Liu,
Xiangxin Zhou,
Mingqian Li,
Qiang Zhang,
Hongyang Chen,
Xuemin Lin
Abstract:
In the drug discovery process, the low success rate of drug candidate screening often leads to insufficient labeled data, causing the few-shot learning problem in molecular property prediction. Existing methods for few-shot molecular property prediction overlook the sample selection bias, which arises from non-random sample selection in chemical experiments. This bias in data representativeness le…
▽ More
In the drug discovery process, the low success rate of drug candidate screening often leads to insufficient labeled data, causing the few-shot learning problem in molecular property prediction. Existing methods for few-shot molecular property prediction overlook the sample selection bias, which arises from non-random sample selection in chemical experiments. This bias in data representativeness leads to suboptimal performance. To overcome this challenge, we present a novel method named contextual representation anchor Network (CRA), where an anchor refers to a cluster center of the representations of molecules and serves as a bridge to transfer enriched contextual knowledge into molecular representations and enhance their expressiveness. CRA introduces a dual-augmentation mechanism that includes context augmentation, which dynamically retrieves analogous unlabeled molecules and captures their task-specific contextual knowledge to enhance the anchors, and anchor augmentation, which leverages the anchors to augment the molecular representations. We evaluate our approach on the MoleculeNet and FS-Mol benchmarks, as well as in domain transfer experiments. The results demonstrate that CRA outperforms the state-of-the-art by 2.60% and 3.28% in AUC and $Δ$AUC-PR metrics, respectively, and exhibits superior generalization capabilities.
△ Less
Submitted 29 October, 2024; v1 submitted 27 October, 2024;
originally announced October 2024.
-
BLEND: Behavior-guided Neural Population Dynamics Modeling via Privileged Knowledge Distillation
Authors:
Zhengrui Guo,
Fangxu Zhou,
Wei Wu,
Qichen Sun,
Lishuang Feng,
Jinzhuo Wang,
Hao Chen
Abstract:
Modeling the nonlinear dynamics of neuronal populations represents a key pursuit in computational neuroscience. Recent research has increasingly focused on jointly modeling neural activity and behavior to unravel their interconnections. Despite significant efforts, these approaches often necessitate either intricate model designs or oversimplified assumptions. Given the frequent absence of perfect…
▽ More
Modeling the nonlinear dynamics of neuronal populations represents a key pursuit in computational neuroscience. Recent research has increasingly focused on jointly modeling neural activity and behavior to unravel their interconnections. Despite significant efforts, these approaches often necessitate either intricate model designs or oversimplified assumptions. Given the frequent absence of perfectly paired neural-behavioral datasets in real-world scenarios when deploying these models, a critical yet understudied research question emerges: how to develop a model that performs well using only neural activity as input at inference, while benefiting from the insights gained from behavioral signals during training?
To this end, we propose BLEND, the behavior-guided neural population dynamics modeling framework via privileged knowledge distillation. By considering behavior as privileged information, we train a teacher model that takes both behavior observations (privileged features) and neural activities (regular features) as inputs. A student model is then distilled using only neural activity. Unlike existing methods, our framework is model-agnostic and avoids making strong assumptions about the relationship between behavior and neural activity. This allows BLEND to enhance existing neural dynamics modeling architectures without developing specialized models from scratch. Extensive experiments across neural population activity modeling and transcriptomic neuron identity prediction tasks demonstrate strong capabilities of BLEND, reporting over 50% improvement in behavioral decoding and over 15% improvement in transcriptomic neuron identity prediction after behavior-guided distillation. Furthermore, we empirically explore various behavior-guided distillation strategies within the BLEND framework and present a comprehensive analysis of effectiveness and implications for model performance.
△ Less
Submitted 6 February, 2025; v1 submitted 2 October, 2024;
originally announced October 2024.
-
Boltzmann-Aligned Inverse Folding Model as a Predictor of Mutational Effects on Protein-Protein Interactions
Authors:
Xiaoran Jiao,
Weian Mao,
Wengong Jin,
Peiyuan Yang,
Hao Chen,
Chunhua Shen
Abstract:
Predicting the change in binding free energy ($ΔΔG$) is crucial for understanding and modulating protein-protein interactions, which are critical in drug design. Due to the scarcity of experimental $ΔΔG$ data, existing methods focus on pre-training, while neglecting the importance of alignment. In this work, we propose the Boltzmann Alignment technique to transfer knowledge from pre-trained invers…
▽ More
Predicting the change in binding free energy ($ΔΔG$) is crucial for understanding and modulating protein-protein interactions, which are critical in drug design. Due to the scarcity of experimental $ΔΔG$ data, existing methods focus on pre-training, while neglecting the importance of alignment. In this work, we propose the Boltzmann Alignment technique to transfer knowledge from pre-trained inverse folding models to $ΔΔG$ prediction. We begin by analyzing the thermodynamic definition of $ΔΔG$ and introducing the Boltzmann distribution to connect energy with protein conformational distribution. However, the protein conformational distribution is intractable; therefore, we employ Bayes' theorem to circumvent direct estimation and instead utilize the log-likelihood provided by protein inverse folding models for $ΔΔG$ estimation. Compared to previous inverse folding-based methods, our method explicitly accounts for the unbound state of protein complex in the $ΔΔG$ thermodynamic cycle, introducing a physical inductive bias and achieving both supervised and unsupervised state-of-the-art (SoTA) performance. Experimental results on SKEMPI v2 indicate that our method achieves Spearman coefficients of 0.3201 (unsupervised) and 0.5134 (supervised), significantly surpassing the previously reported SoTA values of 0.2632 and 0.4324, respectively. Futhermore, we demonstrate the capability of our method on binding energy prediction, protein-protein docking and antibody optimization tasks.
△ Less
Submitted 12 October, 2024;
originally announced October 2024.
-
Advancing biomolecular understanding and design following human instructions
Authors:
Xiang Zhuang,
Keyan Ding,
Tianwen Lyu,
Yinuo Jiang,
Xiaotong Li,
Zhuoyi Xiang,
Zeyuan Wang,
Ming Qin,
Kehua Feng,
Jike Wang,
Qiang Zhang,
Huajun Chen
Abstract:
Understanding and designing biomolecules, such as proteins and small molecules, is central to advancing drug discovery, synthetic biology and enzyme engineering. Recent breakthroughs in artificial intelligence have revolutionized biomolecular research, achieving remarkable accuracy in biomolecular prediction and design. However, a critical gap remains between artificial intelligence's computationa…
▽ More
Understanding and designing biomolecules, such as proteins and small molecules, is central to advancing drug discovery, synthetic biology and enzyme engineering. Recent breakthroughs in artificial intelligence have revolutionized biomolecular research, achieving remarkable accuracy in biomolecular prediction and design. However, a critical gap remains between artificial intelligence's computational capabilities and researchers' intuitive goals, particularly in using natural language to bridge complex tasks with human intentions. Large language models have shown potential to interpret human intentions, yet their application to biomolecular research remains nascent due to challenges including specialized knowledge requirements, multimodal data integration, and semantic alignment between natural language and biomolecules. To address these limitations, we present InstructBioMol, a large language model designed to bridge natural language and biomolecules through a comprehensive any-to-any alignment of natural language, molecules and proteins. This model can integrate multimodal biomolecules as the input, and enable researchers to articulate design goals in natural language, providing biomolecular outputs that meet precise biological needs. Experimental results demonstrate that InstructBioMol can understand and design biomolecules following human instructions. In particular, it can generate drug molecules with a 10% improvement in binding affinity and design enzymes that achieve an enzyme-substrate pair prediction score of 70.4. This highlights its potential to transform real-world biomolecular research. The code is available at https://github.com/HICAI-ZJU/InstructBioMol.
△ Less
Submitted 25 July, 2025; v1 submitted 10 October, 2024;
originally announced October 2024.
-
Brain-JEPA: Brain Dynamics Foundation Model with Gradient Positioning and Spatiotemporal Masking
Authors:
Zijian Dong,
Ruilin Li,
Yilei Wu,
Thuan Tinh Nguyen,
Joanna Su Xian Chong,
Fang Ji,
Nathanael Ren Jie Tong,
Christopher Li Hsian Chen,
Juan Helen Zhou
Abstract:
We introduce Brain-JEPA, a brain dynamics foundation model with the Joint-Embedding Predictive Architecture (JEPA). This pioneering model achieves state-of-the-art performance in demographic prediction, disease diagnosis/prognosis, and trait prediction through fine-tuning. Furthermore, it excels in off-the-shelf evaluations (e.g., linear probing) and demonstrates superior generalizability across d…
▽ More
We introduce Brain-JEPA, a brain dynamics foundation model with the Joint-Embedding Predictive Architecture (JEPA). This pioneering model achieves state-of-the-art performance in demographic prediction, disease diagnosis/prognosis, and trait prediction through fine-tuning. Furthermore, it excels in off-the-shelf evaluations (e.g., linear probing) and demonstrates superior generalizability across different ethnic groups, surpassing the previous large model for brain activity significantly. Brain-JEPA incorporates two innovative techniques: Brain Gradient Positioning and Spatiotemporal Masking. Brain Gradient Positioning introduces a functional coordinate system for brain functional parcellation, enhancing the positional encoding of different Regions of Interest (ROIs). Spatiotemporal Masking, tailored to the unique characteristics of fMRI data, addresses the challenge of heterogeneous time-series patches. These methodologies enhance model performance and advance our understanding of the neural circuits underlying cognition. Overall, Brain-JEPA is paving the way to address pivotal questions of building brain functional coordinate system and masking brain activity at the AI-neuroscience interface, and setting a potentially new paradigm in brain activity analysis through downstream adaptation.
△ Less
Submitted 28 September, 2024;
originally announced September 2024.
-
Drug Discovery SMILES-to-Pharmacokinetics Diffusion Models with Deep Molecular Understanding
Authors:
Bing Hu,
Anita Layton,
Helen Chen
Abstract:
Artificial intelligence (AI) is increasingly used in every stage of drug development. One challenge facing drug discovery AI is that drug pharmacokinetic (PK) datasets are often collected independently from each other, often with limited overlap, creating data overlap sparsity. Data sparsity makes data curation difficult for researchers looking to answer research questions in poly-pharmacy, drug c…
▽ More
Artificial intelligence (AI) is increasingly used in every stage of drug development. One challenge facing drug discovery AI is that drug pharmacokinetic (PK) datasets are often collected independently from each other, often with limited overlap, creating data overlap sparsity. Data sparsity makes data curation difficult for researchers looking to answer research questions in poly-pharmacy, drug combination research, and high-throughput screening. We propose Imagand, a novel SMILES-to-Pharmacokinetic (S2PK) diffusion model capable of generating an array of PK target properties conditioned on SMILES inputs. We show that Imagand-generated synthetic PK data closely resembles real data univariate and bivariate distributions, and improves performance for downstream tasks. Imagand is a promising solution for data overlap sparsity and allows researchers to efficiently generate ligand PK data for drug discovery research. Code is available at https://github.com/bing1100/Imagand.
△ Less
Submitted 1 July, 2025; v1 submitted 14 August, 2024;
originally announced August 2024.
-
Multimodal Data Integration for Precision Oncology: Challenges and Future Directions
Authors:
Huajun Zhou,
Fengtao Zhou,
Chenyu Zhao,
Yingxue Xu,
Luyang Luo,
Hao Chen
Abstract:
The essence of precision oncology lies in its commitment to tailor targeted treatments and care measures to each patient based on the individual characteristics of the tumor. The inherent heterogeneity of tumors necessitates gathering information from diverse data sources to provide valuable insights from various perspectives, fostering a holistic comprehension of the tumor. Over the past decade,…
▽ More
The essence of precision oncology lies in its commitment to tailor targeted treatments and care measures to each patient based on the individual characteristics of the tumor. The inherent heterogeneity of tumors necessitates gathering information from diverse data sources to provide valuable insights from various perspectives, fostering a holistic comprehension of the tumor. Over the past decade, multimodal data integration technology for precision oncology has made significant strides, showcasing remarkable progress in understanding the intricate details within heterogeneous data modalities. These strides have exhibited tremendous potential for improving clinical decision-making and model interpretation, contributing to the advancement of cancer care and treatment. Given the rapid progress that has been achieved, we provide a comprehensive overview of about 300 papers detailing cutting-edge multimodal data integration techniques in precision oncology. In addition, we conclude the primary clinical applications that have reaped significant benefits, including early assessment, diagnosis, prognosis, and biomarker discovery. Finally, derived from the findings of this survey, we present an in-depth analysis that explores the pivotal challenges and reveals essential pathways for future research in the field of multimodal data integration for precision oncology.
△ Less
Submitted 27 June, 2024;
originally announced June 2024.
-
Floating Anchor Diffusion Model for Multi-motif Scaffolding
Authors:
Ke Liu,
Weian Mao,
Shuaike Shen,
Xiaoran Jiao,
Zheng Sun,
Hao Chen,
Chunhua Shen
Abstract:
Motif scaffolding seeks to design scaffold structures for constructing proteins with functions derived from the desired motif, which is crucial for the design of vaccines and enzymes. Previous works approach the problem by inpainting or conditional generation. Both of them can only scaffold motifs with fixed positions, and the conditional generation cannot guarantee the presence of motifs. However…
▽ More
Motif scaffolding seeks to design scaffold structures for constructing proteins with functions derived from the desired motif, which is crucial for the design of vaccines and enzymes. Previous works approach the problem by inpainting or conditional generation. Both of them can only scaffold motifs with fixed positions, and the conditional generation cannot guarantee the presence of motifs. However, prior knowledge of the relative motif positions in a protein is not readily available, and constructing a protein with multiple functions in one protein is more general and significant because of the synergies between functions. We propose a Floating Anchor Diffusion (FADiff) model. FADiff allows motifs to float rigidly and independently in the process of diffusion, which guarantees the presence of motifs and automates the motif position design. Our experiments demonstrate the efficacy of FADiff with high success rates and designable novel scaffolds. To the best of our knowledge, FADiff is the first work to tackle the challenge of scaffolding multiple motifs without relying on the expertise of relative motif positions in the protein. Code is available at https://github.com/aim-uofa/FADiff.
△ Less
Submitted 5 June, 2024;
originally announced June 2024.
-
Synthetic Data from Diffusion Models Improve Drug Discovery Prediction
Authors:
Bing Hu,
Ashish Saragadam,
Anita Layton,
Helen Chen
Abstract:
Artificial intelligence (AI) is increasingly used in every stage of drug development. Continuing breakthroughs in AI-based methods for drug discovery require the creation, improvement, and refinement of drug discovery data. We posit a new data challenge that slows the advancement of drug discovery AI: datasets are often collected independently from each other, often with little overlap, creating d…
▽ More
Artificial intelligence (AI) is increasingly used in every stage of drug development. Continuing breakthroughs in AI-based methods for drug discovery require the creation, improvement, and refinement of drug discovery data. We posit a new data challenge that slows the advancement of drug discovery AI: datasets are often collected independently from each other, often with little overlap, creating data sparsity. Data sparsity makes data curation difficult for researchers looking to answer key research questions requiring values posed across multiple datasets. We propose a novel diffusion GNN model Syngand capable of generating ligand and pharmacokinetic data end-to-end. We show and provide a methodology for sampling pharmacokinetic data for existing ligands using our Syngand model. We show the initial promising results on the efficacy of the Syngand-generated synthetic target property data on downstream regression tasks with AqSolDB, LD50, and hERG central. Using our proposed model and methodology, researchers can easily generate synthetic ligand data to help them explore research questions that require data spanning multiple datasets.
△ Less
Submitted 6 May, 2024;
originally announced May 2024.
-
Physical formula enhanced multi-task learning for pharmacokinetics prediction
Authors:
Ruifeng Li,
Dongzhan Zhou,
Ancheng Shen,
Ao Zhang,
Mao Su,
Mingqian Li,
Hongyang Chen,
Gang Chen,
Yin Zhang,
Shufei Zhang,
Yuqiang Li,
Wanli Ouyang
Abstract:
Artificial intelligence (AI) technology has demonstrated remarkable potential in drug dis-covery, where pharmacokinetics plays a crucial role in determining the dosage, safety, and efficacy of new drugs. A major challenge for AI-driven drug discovery (AIDD) is the scarcity of high-quality data, which often requires extensive wet-lab work. A typical example of this is pharmacokinetic experiments. I…
▽ More
Artificial intelligence (AI) technology has demonstrated remarkable potential in drug dis-covery, where pharmacokinetics plays a crucial role in determining the dosage, safety, and efficacy of new drugs. A major challenge for AI-driven drug discovery (AIDD) is the scarcity of high-quality data, which often requires extensive wet-lab work. A typical example of this is pharmacokinetic experiments. In this work, we develop a physical formula enhanced mul-ti-task learning (PEMAL) method that predicts four key parameters of pharmacokinetics simultaneously. By incorporating physical formulas into the multi-task framework, PEMAL facilitates effective knowledge sharing and target alignment among the pharmacokinetic parameters, thereby enhancing the accuracy of prediction. Our experiments reveal that PEMAL significantly lowers the data demand, compared to typical Graph Neural Networks. Moreover, we demonstrate that PEMAL enhances the robustness to noise, an advantage that conventional Neural Networks do not possess. Another advantage of PEMAL is its high flexibility, which can be potentially applied to other multi-task machine learning scenarios. Overall, our work illustrates the benefits and potential of using PEMAL in AIDD and other scenarios with data scarcity and noise.
△ Less
Submitted 16 April, 2024;
originally announced April 2024.
-
BrainMass: Advancing Brain Network Analysis for Diagnosis with Large-scale Self-Supervised Learning
Authors:
Yanwu Yang,
Chenfei Ye,
Guinan Su,
Ziyao Zhang,
Zhikai Chang,
Hairui Chen,
Piu Chan,
Yue Yu,
Ting Ma
Abstract:
Foundation models pretrained on large-scale datasets via self-supervised learning demonstrate exceptional versatility across various tasks. Due to the heterogeneity and hard-to-collect medical data, this approach is especially beneficial for medical image analysis and neuroscience research, as it streamlines broad downstream tasks without the need for numerous costly annotations. However, there ha…
▽ More
Foundation models pretrained on large-scale datasets via self-supervised learning demonstrate exceptional versatility across various tasks. Due to the heterogeneity and hard-to-collect medical data, this approach is especially beneficial for medical image analysis and neuroscience research, as it streamlines broad downstream tasks without the need for numerous costly annotations. However, there has been limited investigation into brain network foundation models, limiting their adaptability and generalizability for broad neuroscience studies. In this study, we aim to bridge this gap. In particular, (1) we curated a comprehensive dataset by collating images from 30 datasets, which comprises 70,781 samples of 46,686 participants. Moreover, we introduce pseudo-functional connectivity (pFC) to further generates millions of augmented brain networks by randomly dropping certain timepoints of the BOLD signal. (2) We propose the BrainMass framework for brain network self-supervised learning via mask modeling and feature alignment. BrainMass employs Mask-ROI Modeling (MRM) to bolster intra-network dependencies and regional specificity. Furthermore, Latent Representation Alignment (LRA) module is utilized to regularize augmented brain networks of the same participant with similar topological properties to yield similar latent representations by aligning their latent embeddings. Extensive experiments on eight internal tasks and seven external brain disorder diagnosis tasks show BrainMass's superior performance, highlighting its significant generalizability and adaptability. Nonetheless, BrainMass demonstrates powerful few/zero-shot learning abilities and exhibits meaningful interpretation to various diseases, showcasing its potential use for clinical applications.
△ Less
Submitted 3 March, 2024;
originally announced March 2024.
-
Brain-Like Replay Naturally Emerges in Reinforcement Learning Agents
Authors:
Jiyi Wang,
Likai Tang,
Huimiao Chen,
Marcelo G Mattar,
Sen Song
Abstract:
Replay is a powerful strategy to promote learning in artificial intelligence and the brain. However, the conditions to generate it and its functional advantages have not been fully recognized. In this study, we develop a modular reinforcement learning model that could generate replay. We prove that replay generated in this way helps complete the task. We also analyze the information contained in t…
▽ More
Replay is a powerful strategy to promote learning in artificial intelligence and the brain. However, the conditions to generate it and its functional advantages have not been fully recognized. In this study, we develop a modular reinforcement learning model that could generate replay. We prove that replay generated in this way helps complete the task. We also analyze the information contained in the representation and provide a mechanism for how replay makes a difference. Our design avoids complex assumptions and enables replay to emerge naturally within a task-optimized paradigm. Our model also reproduces key phenomena observed in biological agents. This research explores the structural biases in modular ANN to generate replay and its potential utility in developing efficient RL.
△ Less
Submitted 6 October, 2024; v1 submitted 2 February, 2024;
originally announced February 2024.
-
Geometric-Facilitated Denoising Diffusion Model for 3D Molecule Generation
Authors:
Can Xu,
Haosen Wang,
Weigang Wang,
Pengfei Zheng,
Hongyang Chen
Abstract:
Denoising diffusion models have shown great potential in multiple research areas. Existing diffusion-based generative methods on de novo 3D molecule generation face two major challenges. Since majority heavy atoms in molecules allow connections to multiple atoms through single bonds, solely using pair-wise distance to model molecule geometries is insufficient. Therefore, the first one involves pro…
▽ More
Denoising diffusion models have shown great potential in multiple research areas. Existing diffusion-based generative methods on de novo 3D molecule generation face two major challenges. Since majority heavy atoms in molecules allow connections to multiple atoms through single bonds, solely using pair-wise distance to model molecule geometries is insufficient. Therefore, the first one involves proposing an effective neural network as the denoising kernel that is capable to capture complex multi-body interatomic relationships and learn high-quality features. Due to the discrete nature of graphs, mainstream diffusion-based methods for molecules heavily rely on predefined rules and generate edges in an indirect manner. The second challenge involves accommodating molecule generation to diffusion and accurately predicting the existence of bonds. In our research, we view the iterative way of updating molecule conformations in diffusion process is consistent with molecular dynamics and introduce a novel molecule generation method named Geometric-Facilitated Molecular Diffusion (GFMDiff). For the first challenge, we introduce a Dual-Track Transformer Network (DTN) to fully excevate global spatial relationships and learn high quality representations which contribute to accurate predictions of features and geometries. As for the second challenge, we design Geometric-Facilitated Loss (GFLoss) which intervenes the formation of bonds during the training period, instead of directly embedding edges into the latent space. Comprehensive experiments on current benchmarks demonstrate the superiority of GFMDiff.
△ Less
Submitted 22 April, 2024; v1 submitted 5 January, 2024;
originally announced January 2024.
-
De novo protein design using geometric vector field networks
Authors:
Weian Mao,
Muzhi Zhu,
Zheng Sun,
Shuaike Shen,
Lin Yuanbo Wu,
Hao Chen,
Chunhua Shen
Abstract:
Innovations like protein diffusion have enabled significant progress in de novo protein design, which is a vital topic in life science. These methods typically depend on protein structure encoders to model residue backbone frames, where atoms do not exist. Most prior encoders rely on atom-wise features, such as angles and distances between atoms, which are not available in this context. Thus far,…
▽ More
Innovations like protein diffusion have enabled significant progress in de novo protein design, which is a vital topic in life science. These methods typically depend on protein structure encoders to model residue backbone frames, where atoms do not exist. Most prior encoders rely on atom-wise features, such as angles and distances between atoms, which are not available in this context. Thus far, only several simple encoders, such as IPA, have been proposed for this scenario, exposing the frame modeling as a bottleneck. In this work, we proffer the Vector Field Network (VFN), which enables network layers to perform learnable vector computations between coordinates of frame-anchored virtual atoms, thus achieving a higher capability for modeling frames. The vector computation operates in a manner similar to a linear layer, with each input channel receiving 3D virtual atom coordinates instead of scalar values. The multiple feature vectors output by the vector computation are then used to update the residue representations and virtual atom coordinates via attention aggregation. Remarkably, VFN also excels in modeling both frames and atoms, as the real atoms can be treated as the virtual atoms for modeling, positioning VFN as a potential universal encoder. In protein diffusion (frame modeling), VFN exhibits an impressive performance advantage over IPA, excelling in terms of both designability (67.04% vs. 53.58%) and diversity (66.54% vs. 51.98%). In inverse folding (frame and atom modeling), VFN outperforms the previous SoTA model, PiFold (54.7% vs. 51.66%), on sequence recovery rate. We also propose a method of equipping VFN with the ESM model, which significantly surpasses the previous ESM-based SoTA (62.67% vs. 55.65%), LM-Design, by a substantial margin.
△ Less
Submitted 18 October, 2023;
originally announced October 2023.
-
Node-based Knowledge Graph Contrastive Learning for Medical Relationship Prediction
Authors:
Zhiguang Fan,
Yuedong Yang,
Mingyuan Xu,
Hongming Chen
Abstract:
The embedding of Biomedical Knowledge Graphs (BKGs) generates robust representations, valuable for a variety of artificial intelligence applications, including predicting drug combinations and reasoning disease-drug relationships. Meanwhile, contrastive learning (CL) is widely employed to enhance the distinctiveness of these representations. However, constructing suitable contrastive pairs for CL,…
▽ More
The embedding of Biomedical Knowledge Graphs (BKGs) generates robust representations, valuable for a variety of artificial intelligence applications, including predicting drug combinations and reasoning disease-drug relationships. Meanwhile, contrastive learning (CL) is widely employed to enhance the distinctiveness of these representations. However, constructing suitable contrastive pairs for CL, especially within Knowledge Graphs (KGs), has been challenging. In this paper, we proposed a novel node-based contrastive learning method for knowledge graph embedding, NC-KGE. NC-KGE enhances knowledge extraction in embeddings and speeds up training convergence by constructing appropriate contrastive node pairs on KGs. This scheme can be easily integrated with other knowledge graph embedding (KGE) methods. For downstream task such as biochemical relationship prediction, we have incorporated a relation-aware attention mechanism into NC-KGE, focusing on the semantic relationships and node interactions. Extensive experiments show that NC-KGE performs competitively with state-of-the-art models on public datasets like FB15k-237 and WN18RR. Particularly in biomedical relationship prediction tasks, NC-KGE outperforms all baselines on datasets such as PharmKG8k-28, DRKG17k-21, and BioKG72k-14, especially in predicting drug combination relationships. We release our code at https://github.com/zhi520/NC-KGE.
△ Less
Submitted 16 October, 2023;
originally announced October 2023.
-
InstructProtein: Aligning Human and Protein Language via Knowledge Instruction
Authors:
Zeyuan Wang,
Qiang Zhang,
Keyan Ding,
Ming Qin,
Xiang Zhuang,
Xiaotong Li,
Huajun Chen
Abstract:
Large Language Models (LLMs) have revolutionized the field of natural language processing, but they fall short in comprehending biological sequences such as proteins. To address this challenge, we propose InstructProtein, an innovative LLM that possesses bidirectional generation capabilities in both human and protein languages: (i) taking a protein sequence as input to predict its textual function…
▽ More
Large Language Models (LLMs) have revolutionized the field of natural language processing, but they fall short in comprehending biological sequences such as proteins. To address this challenge, we propose InstructProtein, an innovative LLM that possesses bidirectional generation capabilities in both human and protein languages: (i) taking a protein sequence as input to predict its textual function description and (ii) using natural language to prompt protein sequence generation. To achieve this, we first pre-train an LLM on both protein and natural language corpora, enabling it to comprehend individual languages. Then supervised instruction tuning is employed to facilitate the alignment of these two distinct languages. Herein, we introduce a knowledge graph-based instruction generation framework to construct a high-quality instruction dataset, addressing annotation imbalance and instruction deficits in existing protein-text corpus. In particular, the instructions inherit the structural relations between proteins and function annotations in knowledge graphs, which empowers our model to engage in the causal modeling of protein functions, akin to the chain-of-thought processes in natural languages. Extensive experiments on bidirectional protein-text generation tasks show that InstructProtein outperforms state-of-the-art LLMs by large margins. Moreover, InstructProtein serves as a pioneering step towards text-based protein function prediction and sequence design, effectively bridging the gap between protein and human language understanding.
△ Less
Submitted 4 October, 2023;
originally announced October 2023.
-
EC-Conf: An Ultra-fast Diffusion Model for Molecular Conformation Generation with Equivariant Consistency
Authors:
Zhiguang Fan,
Yuedong Yang,
Mingyuan Xu,
Hongming Chen
Abstract:
Despite recent advancement in 3D molecule conformation generation driven by diffusion models, its high computational cost in iterative diffusion/denoising process limits its application. In this paper, an equivariant consistency model (EC-Conf) was proposed as a fast diffusion method for low-energy conformation generation. In EC-Conf, a modified SE (3)-equivariant transformer model was directly us…
▽ More
Despite recent advancement in 3D molecule conformation generation driven by diffusion models, its high computational cost in iterative diffusion/denoising process limits its application. In this paper, an equivariant consistency model (EC-Conf) was proposed as a fast diffusion method for low-energy conformation generation. In EC-Conf, a modified SE (3)-equivariant transformer model was directly used to encode the Cartesian molecular conformations and a highly efficient consistency diffusion process was carried out to generate molecular conformations. It was demonstrated that, with only one sampling step, it can already achieve comparable quality to other diffusion-based models running with thousands denoising steps. Its performance can be further improved with a few more sampling iterations. The performance of EC-Conf is evaluated on both GEOM-QM9 and GEOM-Drugs sets. Our results demonstrate that the efficiency of EC-Conf for learning the distribution of low energy molecular conformation is at least two magnitudes higher than current SOTA diffusion models and could potentially become a useful tool for conformation generation and sampling. We release our code at https://github.com/zhi520/EcConf.
△ Less
Submitted 23 November, 2023; v1 submitted 31 July, 2023;
originally announced August 2023.
-
Quantifying interictal intracranial EEG to predict focal epilepsy
Authors:
Ryan S Gallagher,
Nishant Sinha,
Akash R Pattnaik,
William K. S. Ojemann,
Alfredo Lucas,
Joshua J. LaRocque,
John M Bernabei,
Adam S Greenblatt,
Elizabeth M Sweeney,
H Isaac Chen,
Kathryn A Davis,
Erin C Conrad,
Brian Litt
Abstract:
Intracranial EEG (IEEG) is used for 2 main purposes, to determine: (1) if epileptic networks are amenable to focal treatment and (2) where to intervene. Currently these questions are answered qualitatively and sometimes differently across centers. There is a need for objective, standardized methods to guide surgical decision making and to enable large scale data analysis across centers and prospec…
▽ More
Intracranial EEG (IEEG) is used for 2 main purposes, to determine: (1) if epileptic networks are amenable to focal treatment and (2) where to intervene. Currently these questions are answered qualitatively and sometimes differently across centers. There is a need for objective, standardized methods to guide surgical decision making and to enable large scale data analysis across centers and prospective clinical trials.
We analyzed interictal data from 101 patients with drug resistant epilepsy who underwent presurgical evaluation with IEEG. We chose interictal data because of its potential to reduce the morbidity and cost associated with ictal recording. 65 patients had unifocal seizure onset on IEEG, and 36 were non-focal or multi-focal. We quantified the spatial dispersion of implanted electrodes and interictal IEEG abnormalities for each patient. We compared these measures against the 5 Sense Score (5SS), a pre-implant estimate of the likelihood of focal seizure onset, and assessed their ability to predict the clinicians choice of therapeutic intervention and the patient outcome.
The spatial dispersion of IEEG electrodes predicted network focality with precision similar to the 5SS (AUC = 0.67), indicating that electrode placement accurately reflected pre-implant information. A cross-validated model combining the 5SS and the spatial dispersion of interictal IEEG abnormalities significantly improved this prediction (AUC = 0.79; p<0.05). The combined model predicted ultimate treatment strategy (surgery vs. device) with an AUC of 0.81 and post-surgical outcome at 2 years with an AUC of 0.70. The 5SS, interictal IEEG, and electrode placement were not correlated and provided complementary information.
Quantitative, interictal IEEG significantly improved upon pre-implant estimates of network focality and predicted treatment with precision approaching that of clinical experts.
△ Less
Submitted 27 July, 2023;
originally announced July 2023.
-
Graph Sampling-based Meta-Learning for Molecular Property Prediction
Authors:
Xiang Zhuang,
Qiang Zhang,
Bin Wu,
Keyan Ding,
Yin Fang,
Huajun Chen
Abstract:
Molecular property is usually observed with a limited number of samples, and researchers have considered property prediction as a few-shot problem. One important fact that has been ignored by prior works is that each molecule can be recorded with several different properties simultaneously. To effectively utilize many-to-many correlations of molecules and properties, we propose a Graph Sampling-ba…
▽ More
Molecular property is usually observed with a limited number of samples, and researchers have considered property prediction as a few-shot problem. One important fact that has been ignored by prior works is that each molecule can be recorded with several different properties simultaneously. To effectively utilize many-to-many correlations of molecules and properties, we propose a Graph Sampling-based Meta-learning (GS-Meta) framework for few-shot molecular property prediction. First, we construct a Molecule-Property relation Graph (MPG): molecule and properties are nodes, while property labels decide edges. Then, to utilize the topological information of MPG, we reformulate an episode in meta-learning as a subgraph of the MPG, containing a target property node, molecule nodes, and auxiliary property nodes. Third, as episodes in the form of subgraphs are no longer independent of each other, we propose to schedule the subgraph sampling process with a contrastive loss function, which considers the consistency and discrimination of subgraphs. Extensive experiments on 5 commonly-used benchmarks show GS-Meta consistently outperforms state-of-the-art methods by 5.71%-6.93% in ROC-AUC and verify the effectiveness of each proposed module. Our code is available at https://github.com/HICAI-ZJU/GS-Meta.
△ Less
Submitted 29 June, 2023;
originally announced June 2023.
-
Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models
Authors:
Yin Fang,
Xiaozhuan Liang,
Ningyu Zhang,
Kangwei Liu,
Rui Huang,
Zhuo Chen,
Xiaohui Fan,
Huajun Chen
Abstract:
Large Language Models (LLMs), with their remarkable task-handling capabilities and innovative outputs, have catalyzed significant advancements across a spectrum of fields. However, their proficiency within specialized domains such as biomolecular studies remains limited. To address this challenge, we introduce Mol-Instructions, a comprehensive instruction dataset designed for the biomolecular doma…
▽ More
Large Language Models (LLMs), with their remarkable task-handling capabilities and innovative outputs, have catalyzed significant advancements across a spectrum of fields. However, their proficiency within specialized domains such as biomolecular studies remains limited. To address this challenge, we introduce Mol-Instructions, a comprehensive instruction dataset designed for the biomolecular domain. Mol-Instructions encompasses three key components: molecule-oriented instructions, protein-oriented instructions, and biomolecular text instructions. Each component aims to improve the understanding and prediction capabilities of LLMs concerning biomolecular features and behaviors. Through extensive instruction tuning experiments on LLMs, we demonstrate the effectiveness of Mol-Instructions in enhancing large models' performance in the intricate realm of biomolecular studies, thus fostering progress in the biomolecular research community. Mol-Instructions is publicly available for ongoing research and will undergo regular updates to enhance its applicability.
△ Less
Submitted 4 March, 2024; v1 submitted 13 June, 2023;
originally announced June 2023.
-
Comprehensive evaluation of deep and graph learning on drug-drug interactions prediction
Authors:
Xuan Lin,
Lichang Dai,
Yafang Zhou,
Zu-Guo Yu,
Wen Zhang,
Jian-Yu Shi,
Dong-Sheng Cao,
Li Zeng,
Haowen Chen,
Bosheng Song,
Philip S. Yu,
Xiangxiang Zeng
Abstract:
Recent advances and achievements of artificial intelligence (AI) as well as deep and graph learning models have established their usefulness in biomedical applications, especially in drug-drug interactions (DDIs). DDIs refer to a change in the effect of one drug to the presence of another drug in the human body, which plays an essential role in drug discovery and clinical research. DDIs prediction…
▽ More
Recent advances and achievements of artificial intelligence (AI) as well as deep and graph learning models have established their usefulness in biomedical applications, especially in drug-drug interactions (DDIs). DDIs refer to a change in the effect of one drug to the presence of another drug in the human body, which plays an essential role in drug discovery and clinical research. DDIs prediction through traditional clinical trials and experiments is an expensive and time-consuming process. To correctly apply the advanced AI and deep learning, the developer and user meet various challenges such as the availability and encoding of data resources, and the design of computational methods. This review summarizes chemical structure based, network based, NLP based and hybrid methods, providing an updated and accessible guide to the broad researchers and development community with different domain knowledge. We introduce widely-used molecular representation and describe the theoretical frameworks of graph neural network models for representing molecular structures. We present the advantages and disadvantages of deep and graph learning methods by performing comparative experiments. We discuss the potential technical challenges and highlight future directions of deep and graph learning models for accelerating DDIs prediction.
△ Less
Submitted 8 June, 2023;
originally announced June 2023.
-
ProGroTrack: Deep Learning-Assisted Tracking of Intracellular Protein Growth Dynamics
Authors:
Kai San Chan,
Huimiao Chen,
Chenyu Jin,
Yuxuan Tian,
Dingchang Lin
Abstract:
Accurate tracking of cellular and subcellular structures, along with their dynamics, plays a pivotal role in understanding the underlying mechanisms of biological systems. This paper presents a novel approach, ProGroTrack, that combines the You Only Look Once (YOLO) and ByteTrack algorithms within the detection-based tracking (DBT) framework to track intracellular protein nanostructures. Focusing…
▽ More
Accurate tracking of cellular and subcellular structures, along with their dynamics, plays a pivotal role in understanding the underlying mechanisms of biological systems. This paper presents a novel approach, ProGroTrack, that combines the You Only Look Once (YOLO) and ByteTrack algorithms within the detection-based tracking (DBT) framework to track intracellular protein nanostructures. Focusing on iPAK4 protein fibers as a representative case study, we conducted a comprehensive evaluation of YOLOv5 and YOLOv8 models, revealing the superior performance of YOLOv5 on our dataset. Notably, YOLOv5x achieved an impressive mAP50 of 0.839 and F-score of 0.819. To further optimize detection capabilities, we incorporated semi-supervised learning for model improvement, resulting in enhanced performances in all metrics. Subsequently, we successfully applied our approach to track the growth behavior of iPAK4 protein fibers, revealing their two distinct growth phases consistent with a previously reported kinetic model. This research showcases the promising potential of our approach, extending beyond iPAK4 fibers. It also offers a significant advancement in precise tracking of dynamic processes in live cells, and fostering new avenues for biomedical research.
△ Less
Submitted 26 May, 2023;
originally announced May 2023.
-
Deep Learning Methods for Small Molecule Drug Discovery: A Survey
Authors:
Wenhao Hu,
Yingying Liu,
Xuanyu Chen,
Wenhao Chai,
Hangyue Chen,
Hongwei Wang,
Gaoang Wang
Abstract:
With the development of computer-assisted techniques, research communities including biochemistry and deep learning have been devoted into the drug discovery field for over a decade. Various applications of deep learning have drawn great attention in drug discovery, such as molecule generation, molecular property prediction, retrosynthesis prediction, and reaction prediction. While most existing s…
▽ More
With the development of computer-assisted techniques, research communities including biochemistry and deep learning have been devoted into the drug discovery field for over a decade. Various applications of deep learning have drawn great attention in drug discovery, such as molecule generation, molecular property prediction, retrosynthesis prediction, and reaction prediction. While most existing surveys only focus on one of the applications, limiting the view of researchers in the community. In this paper, we present a comprehensive review on the aforementioned four aspects, and discuss the relationships among different applications. The latest literature and classical benchmarks are presented for better understanding the development of variety of approaches.
We commence by summarizing the molecule representation format in these works, followed by an introduction of recent proposed approaches for each of the four tasks. Furthermore, we review a variety of commonly used datasets and evaluation metrics and compare the performance of deep learning-based models. Finally, we conclude by identifying remaining challenges and discussing the future trend for deep learning methods in drug discovery.
△ Less
Submitted 5 March, 2023; v1 submitted 1 March, 2023;
originally announced March 2023.
-
Federated Generalized Linear Mixed Models for Collaborative Genome-wide Association Studies
Authors:
Wentao Li,
Han Chen,
Xiaoqian Jiang,
Arif Harmanci
Abstract:
As the sequencing costs are decreasing, there is great incentive to perform large scale association studies to increase power of detecting new variants. Federated association testing among different institutions is a viable solution for increasing sample sizes by sharing the intermediate testing statistics that are aggregated by a central server. There are, however, standing challenges to performi…
▽ More
As the sequencing costs are decreasing, there is great incentive to perform large scale association studies to increase power of detecting new variants. Federated association testing among different institutions is a viable solution for increasing sample sizes by sharing the intermediate testing statistics that are aggregated by a central server. There are, however, standing challenges to performing federated association testing. Association tests are known to be confounded by numerous factors such as population stratification, which can be especially important in multiancestral studies and in admixed populations among different sites. Furthermore, disease etiology should be considered via flexible models to avoid biases in the significance of the genetic effect. A rising challenge for performing large scale association studies is the privacy of participants and related ethical concerns of stigmatization and marginalization. Here, we present dMEGA, a flexible and efficient method for performing federated generalized linear mixed model based association testing among multiple sites while underlying genotype and phenotype data are not explicitly shared. dMEGA first utilizes a reference projection to estimate population-based covariates without sharing genotype dataset among sites. Next, dMEGA uses Laplacian approximation for the parameter likelihoods and decomposes parameter estimation into efficient local-gradient updates among sites. We use simulated and real datasets to demonstrate the accuracy and efficiency of dMEGA. Overall, dMEGA's formulation is flexible to integrate fixed and random effects in a federated setting.
△ Less
Submitted 1 October, 2022;
originally announced October 2022.
-
Artificial Intelligence Models for Cell Type and Subtype Identification Based on Single-Cell RNA Sequencing Data in Vision Science
Authors:
Yeganeh Madadi,
Aboozar Monavarfeshani,
Hao Chen,
W. Daniel Stamer,
Robert W. Williams,
Siamak Yousefi
Abstract:
Single-cell RNA sequencing (scRNA-seq) provides a high throughput, quantitative and unbiased framework for scientists in many research fields to identify and characterize cell types within heterogeneous cell populations from various tissues. However, scRNA-seq based identification of discrete cell-types is still labor intensive and depends on prior molecular knowledge. Artificial intelligence has…
▽ More
Single-cell RNA sequencing (scRNA-seq) provides a high throughput, quantitative and unbiased framework for scientists in many research fields to identify and characterize cell types within heterogeneous cell populations from various tissues. However, scRNA-seq based identification of discrete cell-types is still labor intensive and depends on prior molecular knowledge. Artificial intelligence has provided faster, more accurate, and user-friendly approaches for cell-type identification. In this review, we discuss recent advances in cell-type identification methods using artificial intelligence techniques based on single-cell and single-nucleus RNA sequencing data in vision science.
△ Less
Submitted 26 September, 2022;
originally announced September 2022.
-
Toward reliable signals decoding for electroencephalogram: A benchmark study to EEGNeX
Authors:
Xia Chen,
Xiangbin Teng,
Han Chen,
Yafeng Pan,
Philipp Geyer
Abstract:
This study examines the efficacy of various neural network (NN) models in interpreting mental constructs via electroencephalogram (EEG) signals. Through the assessment of 16 prevalent NN models and their variants across four brain-computer interface (BCI) paradigms, we gauged their information representation capability. Rooted in comprehensive literature review findings, we proposed EEGNeX, a nove…
▽ More
This study examines the efficacy of various neural network (NN) models in interpreting mental constructs via electroencephalogram (EEG) signals. Through the assessment of 16 prevalent NN models and their variants across four brain-computer interface (BCI) paradigms, we gauged their information representation capability. Rooted in comprehensive literature review findings, we proposed EEGNeX, a novel, purely ConvNet-based architecture. We pitted it against both existing cutting-edge strategies and the Mother of All BCI Benchmarks (MOABB) involving 11 distinct EEG motor imagination (MI) classification tasks and revealed that EEGNeX surpasses other state-of-the-art methods. Notably, it shows up to 2.1%-8.5% improvement in the classification accuracy in different scenarios with statistical significance (p < 0.05) compared to its competitors. This study not only provides deeper insights into designing efficient NN models for EEG data but also lays groundwork for future explorations into the relationship between bioelectric brain signals and NN architectures. For the benefit of broader scientific collaboration, we have made all benchmark models, including EEGNeX, publicly available at (https://github.com/chenxiachan/EEGNeX).
△ Less
Submitted 24 September, 2023; v1 submitted 15 July, 2022;
originally announced July 2022.
-
Multi-modal Protein Knowledge Graph Construction and Applications
Authors:
Siyuan Cheng,
Xiaozhuan Liang,
Zhen Bi,
Huajun Chen,
Ningyu Zhang
Abstract:
Existing data-centric methods for protein science generally cannot sufficiently capture and leverage biology knowledge, which may be crucial for many protein tasks. To facilitate research in this field, we create ProteinKG65, a knowledge graph for protein science. Using gene ontology and Uniprot knowledge base as a basis, we transform and integrate various kinds of knowledge with aligned descripti…
▽ More
Existing data-centric methods for protein science generally cannot sufficiently capture and leverage biology knowledge, which may be crucial for many protein tasks. To facilitate research in this field, we create ProteinKG65, a knowledge graph for protein science. Using gene ontology and Uniprot knowledge base as a basis, we transform and integrate various kinds of knowledge with aligned descriptions and protein sequences, respectively, to GO terms and protein entities. ProteinKG65 is mainly dedicated to providing a specialized protein knowledge graph, bringing the knowledge of Gene Ontology to protein function and structure prediction. We also illustrate the potential applications of ProteinKG65 with a prototype. Our dataset can be downloaded at https://w3id.org/proteinkg65.
△ Less
Submitted 14 November, 2022; v1 submitted 27 May, 2022;
originally announced July 2022.
-
Global Mapping of Gene/Protein Interactions in PubMed Abstracts: A Framework and an Experiment with P53 Interactions
Authors:
Xin Li,
Hsinchun Chen,
Zan Huang,
Hua Su,
Jesse D. Martinez
Abstract:
Gene/protein interactions provide critical information for a thorough understanding of cellular processes. Recently, considerable interest and effort has been focused on the construction and analysis of genome-wide gene networks. The large body of biomedical literature is an important source of gene/protein interaction information. Recent advances in text mining tools have made it possible to auto…
▽ More
Gene/protein interactions provide critical information for a thorough understanding of cellular processes. Recently, considerable interest and effort has been focused on the construction and analysis of genome-wide gene networks. The large body of biomedical literature is an important source of gene/protein interaction information. Recent advances in text mining tools have made it possible to automatically extract such documented interactions from free-text literature. In this paper, we propose a comprehensive framework for constructing and analyzing large-scale gene functional networks based on the gene/protein interactions extracted from biomedical literature repositories using text mining tools. Our proposed framework consists of analyses of the network topology, network topology-gene function relationship, and temporal network evolution to distill valuable information embedded in the gene functional interactions in literature. We demonstrate the application of the proposed framework using a testbed of P53-related PubMed abstracts, which shows that literature-based P53 networks exhibit small-world and scale-free properties. We also found that high degree genes in the literature-based networks have a high probability of appearing in the manually curated database and genes in the same pathway tend to form local clusters in our literature-based networks. Temporal analysis showed that genes interacting with many other genes tend to be involved in a large number of newly discovered interactions.
△ Less
Submitted 21 April, 2022;
originally announced April 2022.
-
Gene Function Prediction with Gene Interaction Networks: A Context Graph Kernel Approach
Authors:
Xin Li,
Hsinchun Chen,
Jiexun Li,
Zhu Zhang
Abstract:
Predicting gene functions is a challenge for biologists in the post genomic era. Interactions among genes and their products compose networks that can be used to infer gene functions. Most previous studies adopt a linkage assumption, i.e., they assume that gene interactions indicate functional similarities between connected genes. In this study, we propose to use a gene's context graph, i.e., the…
▽ More
Predicting gene functions is a challenge for biologists in the post genomic era. Interactions among genes and their products compose networks that can be used to infer gene functions. Most previous studies adopt a linkage assumption, i.e., they assume that gene interactions indicate functional similarities between connected genes. In this study, we propose to use a gene's context graph, i.e., the gene interaction network associated with the focal gene, to infer its functions. In a kernel-based machine-learning framework, we design a context graph kernel to capture the information in context graphs. Our experimental study on a testbed of p53-related genes demonstrates the advantage of using indirect gene interactions and shows the empirical superiority of the proposed approach over linkage-assumption-based methods, such as the algorithm to minimize inconsistent connected genes and diffusion kernels.
△ Less
Submitted 21 April, 2022;
originally announced April 2022.
-
Free energy landscape of two-state protein Acylphosphatase with large contact order revealed by force-dependent folding and unfolding dynamics
Authors:
Xuening Ma,
Hao Sun,
Haiyan Hong,
Zilong Guo,
Huanhuan Su,
Hu Chen
Abstract:
Acylphosphatase (AcP) is a small protein with 98 amino acid residues that catalyzes the hydrolysis of carboxyl-phosphate bonds. AcP is a typical two-state protein with slow folding rate due to its relatively large contact order in the native structure. The mechanical properties and unfolding behavior of AcP has been studied by atomic force microscope. But the folding and unfolding dynamics at low…
▽ More
Acylphosphatase (AcP) is a small protein with 98 amino acid residues that catalyzes the hydrolysis of carboxyl-phosphate bonds. AcP is a typical two-state protein with slow folding rate due to its relatively large contact order in the native structure. The mechanical properties and unfolding behavior of AcP has been studied by atomic force microscope. But the folding and unfolding dynamics at low forces has not been reported. Here using stable magnetic tweezers, we measured the force-dependent folding rates within a force range from 1 pN to 3 pN, and unfolding rates from 15 pN to 40 pN. The obtained unfolding rates show different force sensitivities at forces below and above ~27 pN, which determines a free energy landscape with two energy barriers. Our results indicate that the free energy landscape of small globule proteins have general Bactrian camel shape, and large contact order of the native state produces a high barrier dominate at low forces.
△ Less
Submitted 11 March, 2022;
originally announced March 2022.
-
Nuclei Segmentation with Point Annotations from Pathology Images via Self-Supervised Learning and Co-Training
Authors:
Yi Lin,
Zhiyong Qu,
Hao Chen,
Zhongke Gao,
Yuexiang Li,
Lili Xia,
Kai Ma,
Yefeng Zheng,
Kwang-Ting Cheng
Abstract:
Nuclei segmentation is a crucial task for whole slide image analysis in digital pathology. Generally, the segmentation performance of fully-supervised learning heavily depends on the amount and quality of the annotated data. However, it is time-consuming and expensive for professional pathologists to provide accurate pixel-level ground truth, while it is much easier to get coarse labels such as po…
▽ More
Nuclei segmentation is a crucial task for whole slide image analysis in digital pathology. Generally, the segmentation performance of fully-supervised learning heavily depends on the amount and quality of the annotated data. However, it is time-consuming and expensive for professional pathologists to provide accurate pixel-level ground truth, while it is much easier to get coarse labels such as point annotations. In this paper, we propose a weakly-supervised learning method for nuclei segmentation that only requires point annotations for training. First, coarse pixel-level labels are derived from the point annotations based on the Voronoi diagram and the k-means clustering method to avoid overfitting. Second, a co-training strategy with an exponential moving average method is designed to refine the incomplete supervision of the coarse labels. Third, a self-supervised visual representation learning method is tailored for nuclei segmentation of pathology images that transforms the hematoxylin component images into the H&E stained images to gain better understanding of the relationship between the nuclei and cytoplasm. We comprehensively evaluate the proposed method using two public datasets. Both visual and quantitative results demonstrate the superiority of our method to the state-of-the-art methods, and its competitive performance compared to the fully-supervised methods. Code: https://github.com/hust-linyi/SC-Net
△ Less
Submitted 17 August, 2023; v1 submitted 16 February, 2022;
originally announced February 2022.
-
Prompt-Guided Injection of Conformation to Pre-trained Protein Model
Authors:
Qiang Zhang,
Zeyuan Wang,
Yuqiang Han,
Haoran Yu,
Xurui Jin,
Huajun Chen
Abstract:
Pre-trained protein models (PTPMs) represent a protein with one fixed embedding and thus are not capable for diverse tasks. For example, protein structures can shift, namely protein folding, between several conformations in various biological processes. To enable PTPMs to produce task-aware representations, we propose to learn interpretable, pluggable and extensible protein prompts as a way of inj…
▽ More
Pre-trained protein models (PTPMs) represent a protein with one fixed embedding and thus are not capable for diverse tasks. For example, protein structures can shift, namely protein folding, between several conformations in various biological processes. To enable PTPMs to produce task-aware representations, we propose to learn interpretable, pluggable and extensible protein prompts as a way of injecting task-related knowledge into PTPMs. In this regard, prior PTPM optimization with the masked language modeling task can be interpreted as learning a sequence prompt (Seq prompt) that enables PTPMs to capture the sequential dependency between amino acids. To incorporate conformational knowledge to PTPMs, we propose an interaction-conformation prompt (IC prompt) that is learned through back-propagation with the protein-protein interaction task. As an instantiation, we present a conformation-aware pre-trained protein model that learns both sequence and interaction-conformation prompts in a multi-task setting. We conduct comprehensive experiments on nine protein datasets. Results confirm our expectation that using the sequence prompt does not hurt PTPMs' performance on sequence-related tasks while incorporating the interaction-conformation prompt significantly improves PTPMs' performance on tasks where conformational knowledge counts. We also show the learned prompts can be combined and extended to deal with new complex tasks.
△ Less
Submitted 7 February, 2022;
originally announced February 2022.
-
OntoProtein: Protein Pretraining With Gene Ontology Embedding
Authors:
Ningyu Zhang,
Zhen Bi,
Xiaozhuan Liang,
Siyuan Cheng,
Haosen Hong,
Shumin Deng,
Jiazhang Lian,
Qiang Zhang,
Huajun Chen
Abstract:
Self-supervised protein language models have proved their effectiveness in learning the proteins representations. With the increasing computational power, current protein language models pre-trained with millions of diverse sequences can advance the parameter scale from million-level to billion-level and achieve remarkable improvement. However, those prevailing approaches rarely consider incorpora…
▽ More
Self-supervised protein language models have proved their effectiveness in learning the proteins representations. With the increasing computational power, current protein language models pre-trained with millions of diverse sequences can advance the parameter scale from million-level to billion-level and achieve remarkable improvement. However, those prevailing approaches rarely consider incorporating knowledge graphs (KGs), which can provide rich structured knowledge facts for better protein representations. We argue that informative biology knowledge in KGs can enhance protein representation with external knowledge. In this work, we propose OntoProtein, the first general framework that makes use of structure in GO (Gene Ontology) into protein pre-training models. We construct a novel large-scale knowledge graph that consists of GO and its related proteins, and gene annotation texts or protein sequences describe all nodes in the graph. We propose novel contrastive learning with knowledge-aware negative sampling to jointly optimize the knowledge graph and protein embedding during pre-training. Experimental results show that OntoProtein can surpass state-of-the-art methods with pre-trained protein language models in TAPE benchmark and yield better performance compared with baselines in protein-protein interaction and protein function prediction. Code and datasets are available in https://github.com/zjunlp/OntoProtein.
△ Less
Submitted 3 June, 2022; v1 submitted 23 January, 2022;
originally announced January 2022.
-
Molecular Contrastive Learning with Chemical Element Knowledge Graph
Authors:
Yin Fang,
Qiang Zhang,
Haihong Yang,
Xiang Zhuang,
Shumin Deng,
Wen Zhang,
Ming Qin,
Zhuo Chen,
Xiaohui Fan,
Huajun Chen
Abstract:
Molecular representation learning contributes to multiple downstream tasks such as molecular property prediction and drug design. To properly represent molecules, graph contrastive learning is a promising paradigm as it utilizes self-supervision signals and has no requirements for human annotations. However, prior works fail to incorporate fundamental domain knowledge into graph semantics and thus…
▽ More
Molecular representation learning contributes to multiple downstream tasks such as molecular property prediction and drug design. To properly represent molecules, graph contrastive learning is a promising paradigm as it utilizes self-supervision signals and has no requirements for human annotations. However, prior works fail to incorporate fundamental domain knowledge into graph semantics and thus ignore the correlations between atoms that have common attributes but are not directly connected by bonds. To address these issues, we construct a Chemical Element Knowledge Graph (KG) to summarize microscopic associations between elements and propose a novel Knowledge-enhanced Contrastive Learning (KCL) framework for molecular representation learning. KCL framework consists of three modules. The first module, knowledge-guided graph augmentation, augments the original molecular graph based on the Chemical Element KG. The second module, knowledge-aware graph representation, extracts molecular representations with a common graph encoder for the original molecular graph and a Knowledge-aware Message Passing Neural Network (KMPNN) to encode complex information in the augmented molecular graph. The final module is a contrastive objective, where we maximize agreement between these two views of molecular graphs. Extensive experiments demonstrated that KCL obtained superior performances against state-of-the-art baselines on eight molecular datasets. Visualization experiments properly interpret what KCL has learned from atoms and attributes in the augmented molecular graphs. Our codes and data are available at https://github.com/ZJU-Fangyin/KCL.
△ Less
Submitted 10 March, 2022; v1 submitted 1 December, 2021;
originally announced December 2021.
-
PDBL: Improving Histopathological Tissue Classification with Plug-and-Play Pyramidal Deep-Broad Learning
Authors:
Jiatai Lin,
Guoqiang Han,
Xipeng Pan,
Hao Chen,
Danyi Li,
Xiping Jia,
Zhenwei Shi,
Zhizhen Wang,
Yanfen Cui,
Haiming Li,
Changhong Liang,
Li Liang,
Zaiyi Liu,
Chu Han
Abstract:
Histopathological tissue classification is a fundamental task in pathomics cancer research. Precisely differentiating different tissue types is a benefit for the downstream researches, like cancer diagnosis, prognosis and etc. Existing works mostly leverage the popular classification backbones in computer vision to achieve histopathological tissue classification. In this paper, we proposed a super…
▽ More
Histopathological tissue classification is a fundamental task in pathomics cancer research. Precisely differentiating different tissue types is a benefit for the downstream researches, like cancer diagnosis, prognosis and etc. Existing works mostly leverage the popular classification backbones in computer vision to achieve histopathological tissue classification. In this paper, we proposed a super lightweight plug-and-play module, named Pyramidal Deep-Broad Learning (PDBL), for any well-trained classification backbone to further improve the classification performance without a re-training burden. We mimic how pathologists observe pathology slides in different magnifications and construct an image pyramid for the input image in order to obtain the pyramidal contextual information. For each level in the pyramid, we extract the multi-scale deep-broad features by our proposed Deep-Broad block (DB-block). We equipped PDBL in three popular classification backbones, ShuffLeNetV2, EfficientNetb0, and ResNet50 to evaluate the effectiveness and efficiency of our proposed module on two datasets (Kather Multiclass Dataset and the LC25000 Dataset). Experimental results demonstrate the proposed PDBL can steadily improve the tissue-level classification performance for any CNN backbones, especially for the lightweight models when given a small among of training samples (less than 10%), which greatly saves the computational time and annotation efforts.
△ Less
Submitted 4 November, 2021;
originally announced November 2021.
-
Habitability Models for Astrobiology
Authors:
Abel Méndez,
Edgard E. Rivera-Valentín,
Dirk Schulze-Makuch,
Justin Filiberto,
Ramses M. Ramírez,
Tana Wood,
Alfonso Dávila,
Chris McKay,
Kevin N. Ortiz Ceballos,
Marcos Jusino-Maldonado,
Nicole J. Torres-Santiago,
Guillermo Nery,
René Heller,
Paul K. Byrne,
Michael J. Malaska,
Erica Nathan,
Marta F. Simões,
André Antunes,
Jesús Martínez-Frías,
Ludmila Carone,
Noam R. Izenberg,
Dimitra Atri,
Humberto I. Carvajal Chitty,
Priscilla Nowajewski-Barra,
Frances Rivera-Hernández
, et al. (9 additional authors not shown)
Abstract:
Habitability has been generally defined as the capability of an environment to support life. Ecologists have been using Habitat Suitability Models (HSMs) for more than four decades to study the habitability of Earth from local to global scales. Astrobiologists have been proposing different habitability models for some time, with little integration and consistency among them, being different in fun…
▽ More
Habitability has been generally defined as the capability of an environment to support life. Ecologists have been using Habitat Suitability Models (HSMs) for more than four decades to study the habitability of Earth from local to global scales. Astrobiologists have been proposing different habitability models for some time, with little integration and consistency among them, being different in function to those used by ecologists. Habitability models are not only used to determine if environments are habitable or not, but they also are used to characterize what key factors are responsible for the gradual transition from low to high habitability states. Here we review and compare some of the different models used by ecologists and astrobiologists and suggest how they could be integrated into new habitability standards. Such standards will help to improve the comparison and characterization of potentially habitable environments, prioritize target selections, and study correlations between habitability and biosignatures. Habitability models are the foundation of planetary habitability science and the synergy between ecologists and astrobiologists is necessary to expand our understanding of the habitability of Earth, the Solar System, and extrasolar planets.
△ Less
Submitted 11 August, 2021;
originally announced August 2021.
-
Rate-Independent Computation in Continuous Chemical Reaction Networks
Authors:
Ho-Lin Chen,
David Doty,
Wyatt Reeves,
David Soloveichik
Abstract:
Coupled chemical interactions in a well-mixed solution are commonly formalized as chemical reaction networks (CRNs). However, despite the widespread use of CRNs in the natural sciences, the range of computational behaviors exhibited by CRNs is not well understood. Here we study the following problem: what functions $f:\mathbb{R}^k \to \mathbb{R}$ can be computed by a CRN, in which the CRN eventual…
▽ More
Coupled chemical interactions in a well-mixed solution are commonly formalized as chemical reaction networks (CRNs). However, despite the widespread use of CRNs in the natural sciences, the range of computational behaviors exhibited by CRNs is not well understood. Here we study the following problem: what functions $f:\mathbb{R}^k \to \mathbb{R}$ can be computed by a CRN, in which the CRN eventually produces the correct amount of the "output" molecule, no matter the rate at which reactions proceed? This captures a previously unexplored, but very natural class of computations: for example, the reaction $X_1 + X_2 \to Y$ can be thought to compute the function $y = \min(x_1, x_2)$. Such a CRN is robust in the sense that it is correct no matter the kinetic model of chemistry, so long as it respects the stoichiometric constraints.
We develop a reachability relation based on "what could happen" if reaction rates can vary arbitrarily over time. We define *stable computation* analogously to probability 1 computation in distributed computing, and connect it with a seemingly stronger notion of rate-independent computation based on convergence under a wide class of generalized rate laws. We also consider the "dual-rail representation" that can represent negative values as the difference of two concentrations and allows the composition of CRN modules. We prove that a function is rate-independently computable if and only if it is piecewise linear (with rational coefficients) and continuous (dual-rail representation), or non-negative with discontinuities occurring only when some inputs switch from zero to positive (direct representation). The many contexts where continuous piecewise linear functions are powerful targets for implementation, combined with the systematic construction we develop for computing these functions, demonstrate the potential of rate-independent chemical computation.
△ Less
Submitted 7 April, 2023; v1 submitted 28 July, 2021;
originally announced July 2021.