Search | arXiv e-print repository

A Multimodal Foundation Model to Enhance Generalizability and Data Efficiency for Pan-cancer Prognosis Prediction

Authors: Huajun Zhou, Fengtao Zhou, Jiabo Ma, Yingxue Xu, Xi Wang, Xiuming Zhang, Li Liang, Zhenhui Li, Hao Chen

Abstract: Multimodal data provides heterogeneous information for a holistic understanding of the tumor microenvironment. However, existing AI models often struggle to harness the rich information within multimodal data and extract poorly generalizable representations. Here we present MICE (Multimodal data Integration via Collaborative Experts), a multimodal foundation model that effectively integrates pathology images, clinical reports, and genomics data for precise pan-cancer prognosis prediction. Instead of conventional multi-expert modules, MICE employs multiple functionally diverse experts to comprehensively capture both cross-cancer and cancer-specific insights. Leveraging data from 11,799 patients across 30 cancer types, we enhanced MICE's generalizability by coupling contrastive and supervised learning. MICE outperformed both unimodal and state-of-the-art multi-expert-based multimodal models, demonstrating substantial improvements in C-index ranging from 3.8% to 11.2% on internal cohorts and 5.8% to 8.8% on independent cohorts, respectively. Moreover, it exhibited remarkable data efficiency across diverse clinical scenarios. With its enhanced generalizability and data efficiency, MICE establishes an effective and scalable foundation for pan-cancer prognosis prediction, holding strong potential to personalize tailored therapies and improve treatment outcomes.

Submitted 15 September, 2025; originally announced September 2025.

Comments: 27 pages, 7 figures
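
Note on the C-index reported in the MICE abstract above: the abstract does not say which estimator was used, so the following is only a minimal sketch of Harrell's concordance index, a common choice for survival models. The function name `concordance_index` and the toy inputs are illustrative assumptions, not taken from the paper.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index: share of comparable patient pairs ranked correctly.

    times  -- observed follow-up times
    events -- 1 if the event (e.g. death) was observed, 0 if censored
    risks  -- predicted risk scores (higher means earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had an observed event
            # strictly before patient j's follow-up time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0   # higher risk, earlier event: correct
                elif risks[i] == risks[j]:
                    concordant += 0.5   # ties in predicted risk count half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy example (made-up numbers): four patients, the third one censored.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
toy_risks = [0.5, 0.7, 0.2, 0.4]
toy_c = concordance_index(toy_times, toy_events, toy_risks)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the reported percentage improvements should be read.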

arXiv:2503.03989 [pdf, other]

Integrating Protein Dynamics into Structure-Based Drug Design via Full-Atom Stochastic Flows

Authors: Xiangxin Zhou, Yi Xiao, Haowei Lin, Xinheng He, Jiaqi Guan, Yang Wang, Qiang Liu, Feng Zhou, Liang Wang, Jianzhu Ma

Abstract: The dynamic nature of proteins, influenced by ligand interactions, is essential for comprehending protein function and progressing drug discovery. Traditional structure-based drug design (SBDD) approaches typically target binding sites with rigid structures, limiting their practical application in drug development. While molecular dynamics simulation can theoretically capture all the biologically relevant conformations, the transition rate is dictated by the intrinsic energy barrier between them, making the sampling process computationally expensive. To overcome the aforementioned challenges, we propose to use generative modeling for SBDD considering conformational changes of protein pockets. We curate a dataset of apo and multiple holo states of protein-ligand complexes, simulated by molecular dynamics, and propose a full-atom flow model (and a stochastic version), named DynamicFlow, that learns to transform apo pockets and noisy ligands into holo pockets and corresponding 3D ligand molecules. Our method uncovers promising ligand molecules and corresponding holo conformations of pockets. Additionally, the resultant holo-like states provide superior inputs for traditional SBDD approaches, playing a significant role in practical drug discovery.

Submitted 5 March, 2025; originally announced March 2025.

Comments: Accepted to ICLR 2025

arXiv:2410.13872 [pdf, other]

BLEND: Behavior-guided Neural Population Dynamics Modeling via Privileged Knowledge Distillation

Authors: Zhengrui Guo, Fangxu Zhou, Wei Wu, Qichen Sun, Lishuang Feng, Jinzhuo Wang, Hao Chen

Abstract: Modeling the nonlinear dynamics of neuronal populations represents a key pursuit in computational neuroscience. Recent research has increasingly focused on jointly modeling neural activity and behavior to unravel their interconnections. Despite significant efforts, these approaches often necessitate either intricate model designs or oversimplified assumptions. Given the frequent absence of perfectly paired neural-behavioral datasets in real-world scenarios when deploying these models, a critical yet understudied research question emerges: how to develop a model that performs well using only neural activity as input at inference, while benefiting from the insights gained from behavioral signals during training? To this end, we propose BLEND, the behavior-guided neural population dynamics modeling framework via privileged knowledge distillation. By considering behavior as privileged information, we train a teacher model that takes both behavior observations (privileged features) and neural activities (regular features) as inputs. A student model is then distilled using only neural activity. Unlike existing methods, our framework is model-agnostic and avoids making strong assumptions about the relationship between behavior and neural activity. This allows BLEND to enhance existing neural dynamics modeling architectures without developing specialized models from scratch. Extensive experiments across neural population activity modeling and transcriptomic neuron identity prediction tasks demonstrate strong capabilities of BLEND, reporting over 50% improvement in behavioral decoding and over 15% improvement in transcriptomic neuron identity prediction after behavior-guided distillation. Furthermore, we empirically explore various behavior-guided distillation strategies within the BLEND framework and present a comprehensive analysis of effectiveness and implications for model performance.

Submitted 6 February, 2025; v1 submitted 2 October, 2024; originally announced October 2024.

Comments: Accepted by ICLR'2025

arXiv:2409.18597 [pdf]

TemporalPaD: a reinforcement-learning framework for temporal feature representation and dimension reduction

Authors: Xuechen Mu, Zhenyu Huang, Kewei Li, Haotian Zhang, Xiuli Wang, Yusi Fan, Kai Zhang, Fengfeng Zhou

Abstract: Recent advancements in feature representation and dimension reduction have highlighted their crucial role in enhancing the efficacy of predictive modeling. This work introduces TemporalPaD, a novel end-to-end deep learning framework designed for temporal pattern datasets. TemporalPaD integrates reinforcement learning (RL) with neural networks to achieve concurrent feature representation and feature reduction. The framework consists of three cooperative modules: a Policy Module, a Representation Module, and a Classification Module, structured based on the Actor-Critic (AC) framework. The Policy Module, responsible for dimensionality reduction through RL, functions as the actor, while the Representation Module for feature extraction and the Classification Module collectively serve as the critic. We comprehensively evaluate TemporalPaD using 29 UCI datasets, a well-known benchmark for validating feature reduction algorithms, through 10 independent tests and 10-fold cross-validation. Additionally, given that TemporalPaD is specifically designed for time series data, we apply it to a real-world DNA classification problem involving enhancer category and enhancer strength. The results demonstrate that TemporalPaD is an efficient and effective framework for achieving feature reduction, applicable to both structured data and sequence datasets. The source code of the proposed TemporalPaD is freely available as supplementary material to this article and at http://www.healthinformaticslab.org/supp/.

Submitted 27 September, 2024; originally announced September 2024.

arXiv:2407.21298 [pdf, other]

A Vectorization Method Induced By Maximal Margin Classification For Persistent Diagrams

Authors: An Wu, Yu Pan, Fuqi Zhou, Jinghui Yan, Chuanlu Liu

Abstract: Persistent homology is an effective method for extracting topological information, represented as persistent diagrams, of spatial structure data. Hence it is well-suited for the study of protein structures. Attempts to incorporate persistent homology in machine learning methods of protein function prediction have resulted in several techniques for vectorizing persistent diagrams. However, current vectorization methods are excessively artificial and cannot ensure the effective utilization of information or the rationality of the methods. To address this problem, we propose a more geometrical vectorization method of persistent diagrams based on maximal margin classification for Banach space, and additionally propose a framework that utilizes topological data analysis to identify proteins with specific functions. We evaluated our vectorization method using a binary classification task on proteins and compared it with the statistical methods that exhibit the best performance among thirteen commonly used vectorization methods. The experimental results indicate that our approach surpasses the statistical methods in both robustness and precision.

Submitted 30 July, 2024; originally announced July 2024.

arXiv:2406.19611 [pdf, other]

Multimodal Data Integration for Precision Oncology: Challenges and Future Directions

Authors: Huajun Zhou, Fengtao Zhou, Chenyu Zhao, Yingxue Xu, Luyang Luo, Hao Chen

Abstract: The essence of precision oncology lies in its commitment to tailor targeted treatments and care measures to each patient based on the individual characteristics of the tumor. The inherent heterogeneity of tumors necessitates gathering information from diverse data sources to provide valuable insights from various perspectives, fostering a holistic comprehension of the tumor. Over the past decade, multimodal data integration technology for precision oncology has made significant strides, showcasing remarkable progress in understanding the intricate details within heterogeneous data modalities. These strides have exhibited tremendous potential for improving clinical decision-making and model interpretation, contributing to the advancement of cancer care and treatment. Given the rapid progress that has been achieved, we provide a comprehensive overview of about 300 papers detailing cutting-edge multimodal data integration techniques in precision oncology. In addition, we conclude the primary clinical applications that have reaped significant benefits, including early assessment, diagnosis, prognosis, and biomarker discovery. Finally, derived from the findings of this survey, we present an in-depth analysis that explores the pivotal challenges and reveals essential pathways for future research in the field of multimodal data integration for precision oncology.

Submitted 27 June, 2024; originally announced June 2024.

Comments: 15 pages, 4 figures

arXiv:2404.09738 [pdf]

AMPCliff: quantitative definition and benchmarking of activity cliffs in antimicrobial peptides

Authors: Kewei Li, Yuqian Wu, Yinheng Li, Yutong Guo, Yan Wang, Yiyang Liang, Yusi Fan, Lan Huang, Ruochi Zhang, Fengfeng Zhou

Abstract: Since the mechanism of action of drug molecules in the human body is difficult to reproduce in the in vitro environment, it becomes difficult to reveal the causes of the activity cliff (AC) phenomenon of drug molecules. We found that the AC of small molecules has been extensively investigated, but limited knowledge has accumulated about the AC phenomenon in peptides with canonical amino acids. Understanding the mechanism of AC in canonical amino acids might help understand the one in drug molecules. This study introduces a quantitative definition and benchmarking framework AMPCliff for the AC phenomenon in antimicrobial peptides (AMPs) composed of canonical amino acids. A comprehensive analysis of the existing AMP dataset reveals a significant prevalence of AC within AMPs. AMPCliff quantifies the activities of AMPs by the MIC, and defines 0.9 as the minimum threshold for the normalized BLOSUM62 similarity score between a pair of aligned peptides with at least two-fold MIC changes. This study establishes a benchmark dataset of paired AMPs in Staphylococcus aureus from the publicly available AMP dataset GRAMPA, and conducts a rigorous procedure to evaluate various AMP AC prediction models, including nine machine learning, four deep learning algorithms, four masked language models, and four generative language models. Our analysis reveals that these models are capable of detecting AMP AC events and the pre-trained protein language model ESM2 demonstrates superior performance across the evaluations. The predictive performance of AMP activity cliffs remains to be further improved, considering that ESM2 with 33 layers only achieves the Spearman correlation coefficient 0.4669 for the regression task of the MIC values on the benchmark dataset. Source code and additional resources are available at https://www.healthinformaticslab.org/supp/ or https://github.com/Kewei2023/AMPCliff-generation.

Submitted 3 November, 2024; v1 submitted 15 April, 2024; originally announced April 2024.
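
Note on the activity-cliff criterion stated in the AMPCliff abstract above: the two thresholds (normalized BLOSUM62 similarity of at least 0.9, and at least a two-fold MIC change) can be sketched as a simple pair test. This is a hedged illustration only. The function name `is_activity_cliff` is hypothetical, the similarity score is taken as a precomputed input because the abstract does not describe how it is normalized, and treating both thresholds as inclusive is an assumption based on the wording "minimum" and "at least".

```python
def is_activity_cliff(similarity, mic_a, mic_b,
                      min_similarity=0.9, min_fold=2.0):
    """Return True if a peptide pair meets both thresholds from the abstract.

    similarity -- normalized BLOSUM62 similarity of the aligned pair,
                  assumed to be computed elsewhere
    mic_a, mic_b -- MIC values of the two peptides, in the same unit
    """
    if mic_a <= 0 or mic_b <= 0:
        raise ValueError("MIC values must be positive")
    # Fold change is symmetric: larger MIC divided by smaller MIC.
    fold_change = max(mic_a, mic_b) / min(mic_a, mic_b)
    return similarity >= min_similarity and fold_change >= min_fold


# Made-up pairs for illustration: (similarity, MIC of peptide A, MIC of peptide B)
toy_pairs = [(0.95, 4.0, 16.0), (0.95, 4.0, 6.0), (0.80, 4.0, 16.0)]
toy_flags = [is_activity_cliff(s, a, b) for s, a, b in toy_pairs]
```

Only the first toy pair qualifies: the second changes MIC by less than two-fold, and the third is not similar enough.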

arXiv:2401.11360 [pdf]

PepHarmony: A Multi-View Contrastive Learning Framework for Integrated Sequence and Structure-Based Peptide Encoding

Authors: Ruochi Zhang, Haoran Wu, Chang Liu, Huaping Li, Yuqian Wu, Kewei Li, Yifan Wang, Yifan Deng, Jiahui Chen, Fengfeng Zhou, Xin Gao

Abstract: Recent advances in protein language models have catalyzed significant progress in peptide sequence representation. Despite extensive exploration in this field, pre-trained models tailored for peptide-specific needs remain largely unaddressed due to the difficulty in capturing the complex and sometimes unstable structures of peptides. This study introduces a novel multi-view contrastive learning framework PepHarmony for the sequence-based peptide encoding task. PepHarmony innovatively combines both sequence- and structure-level information into a sequence-level encoding module through contrastive learning. We carefully select datasets from the Protein Data Bank (PDB) and AlphaFold database to encompass a broad spectrum of peptide sequences and structures. The experimental data highlights PepHarmony's exceptional capability in capturing the intricate relationship between peptide sequences and structures compared with the baseline and fine-tuned models. The robustness of our model is confirmed through extensive ablation studies, which emphasize the crucial roles of contrastive loss and strategic data sorting in enhancing predictive performance. The proposed PepHarmony framework serves as a notable contribution to peptide representations, and offers valuable insights for future applications in peptide drug discovery and peptide engineering. We have made all the source code utilized in this study publicly accessible via GitHub at https://github.com/zhangruochi/PepHarmony or http://www.healthinformaticslab.org/supp/.

Submitted 20 January, 2024; originally announced January 2024.

Comments: 25 pages, 5 figures, 3 tables

arXiv:2311.17964 [pdf]

Linear normalised hash function for clustering gene sequences and identifying reference sequences from multiple sequence alignments

Authors: Manal Helal, Fanrong Kong, Sharon C-A Chen, Fei Zhou, Dominic E Dwyer, John Potter, Vitali Sintchenko

Abstract: The aim of this study was to develop a method that would identify the cluster centroids and the optimal number of clusters for a given sensitivity level and could work equally well for the different sequence datasets. A novel method that combines the linear mapping hash function and multiple sequence alignment (MSA) was developed. This method takes advantage of the sequences already sorted by similarity in the MSA output, and identifies the optimal number of clusters, cluster cut-offs, and cluster centroids that can represent reference gene vouchers for the different species. The linear mapping hash function can map a distance matrix already ordered by similarity to indices to reveal gaps in the values around which the optimal cut-offs of the different clusters can be identified. The method was evaluated using sets of closely related (16S rRNA gene sequences of Nocardia species) and highly variable (VP1 genomic region of Enterovirus 71) sequences and outperformed existing unsupervised machine learning clustering methods and dimensionality reduction methods. This method does not require prior knowledge of the number of clusters or the distance between clusters, handles clusters of different sizes and shapes, and scales linearly with the dataset. The combination of MSA with the linear mapping hash function is a computationally efficient way of gene sequence clustering and can be a valuable tool for the assessment of similarity, clustering of different microbial genomes, identifying reference sequences, and for the study of evolution of bacteria and viruses.

Submitted 29 November, 2023; originally announced November 2023.

ACM Class: I.2.6

Journal ref: Microbial Informatics and Experimentation volume 2, Article number: 2 (2012) https://microbialinformaticsj.biomedcentral.com/counter/pdf/10.1186/2042-5783-2-2.pdf

arXiv:2311.04419 [pdf]

PepLand: a large-scale pre-trained peptide representation model for a comprehensive landscape of both canonical and non-canonical amino acids

Authors: Ruochi Zhang, Haoran Wu, Yuting Xiu, Kewei Li, Ningning Chen, Yu Wang, Yan Wang, Xin Gao, Fengfeng Zhou

Abstract: In recent years, the scientific community has become increasingly interested in peptides with non-canonical amino acids due to their superior stability and resistance to proteolytic degradation. These peptides present promising modifications to biological, pharmacological, and physicochemical attributes in both endogenous and engineered peptides. Notwithstanding their considerable advantages, the scientific community exhibits a conspicuous absence of an effective pre-trained model adept at distilling feature representations from such complex peptide sequences. We herein propose PepLand, a novel pre-training architecture for representation and property analysis of peptides spanning both canonical and non-canonical amino acids. In essence, PepLand leverages a comprehensive multi-view heterogeneous graph neural network tailored to unveil the subtle structural representations of peptides. Empirical validations underscore PepLand's effectiveness across an array of peptide property predictions, encompassing protein-protein interactions, permeability, solubility, and synthesizability. The rigorous evaluation confirms PepLand's unparalleled capability in capturing salient synthetic peptide features, thereby laying a robust foundation for transformative advances in peptide-centric research domains. We have made all the source code utilized in this study publicly accessible via GitHub at https://github.com/zhangruochi/pepland

Submitted 7 November, 2023; originally announced November 2023.

arXiv:2304.06176 [pdf]

Surface-guided computing to analyze subcellular morphology and membrane-associated signals in 3D

Authors: Felix Y. Zhou, Andrew Weems, Gabriel M. Gihana, Bingying Chen, Bo-Jui Chang, Meghan Driscoll, Gaudenz Danuser

Abstract: Signal transduction and cell function are governed by the spatiotemporal organization of membrane-associated molecules. Despite significant advances in visualizing molecular distributions by 3D light microscopy, cell biologists still have limited quantitative understanding of the processes implicated in the regulation of molecular signals at the whole cell scale. In particular, complex and transient cell surface morphologies challenge the complete sampling of cell geometry, membrane-associated molecular concentration and activity and the computing of meaningful parameters such as the cofluctuation between morphology and signals. Here, we introduce u-Unwrap3D, a framework to remap arbitrarily complex 3D cell surfaces and membrane-associated signals into equivalent lower dimensional representations. The mappings are bidirectional, allowing the application of image processing operations in the data representation best suited for the task and to subsequently present the results in any of the other representations, including the original 3D cell surface. Leveraging this surface-guided computing paradigm, we track segmented surface motifs in 2D to quantify the recruitment of Septin polymers by blebbing events; we quantify actin enrichment in peripheral ruffles; and we measure the speed of ruffle movement along topographically complex cell surfaces. Thus, u-Unwrap3D provides access to spatiotemporal analyses of cell biological parameters on unconstrained 3D surface geometries and signals.

Submitted 12 April, 2023; originally announced April 2023.

Comments: 49 pages, 10 figures

arXiv:2212.10883 [pdf, other]

Detecting Temporal shape changes with the Euler Characteristic Transform

Authors: Lewis Marsh, Felix Y. Zhou, Xiao Qin, Xin Lu, Helen M. Byrne, Heather A. Harrington

Abstract: Organoids are multi-cellular structures which are cultured in vitro from stem cells to resemble specific organs (e.g., brain, liver) in their three-dimensional composition. Dynamic changes in the shape and composition of these model systems can be used to understand the effect of mutations and treatments in health and disease. In this paper, we propose a new technique in the field of topological data analysis for DEtecting Temporal shape changes with the Euler Characteristic Transform (DETECT). DETECT is a rotationally invariant signature of dynamically changing shapes. We demonstrate our method on a data set of segmented videos of mouse small intestine organoid experiments and show that it outperforms classical shape descriptors. We verify our method on a synthetic organoid data set and illustrate how it generalises to 3D. We conclude that DETECT offers rigorous quantification of organoids and opens up computationally scalable methods for distinguishing different growth regimes and assessing treatment effects.

Submitted 22 December, 2022; v1 submitted 21 December, 2022; originally announced December 2022.

arXiv:q-bio/0409011 [pdf]

SUMO Substrates and Sites Prediction Combining Pattern Recognition and Phylogenetic Conservation

Authors: Yu Xue, Fengfeng Zhou, Hualei Lu, Guoliang Chen, Xuebiao Yao

Abstract: Small Ubiquitin-related modifier (SUMO) proteins are widely expressed in eukaryotic cells and are reversibly coupled to their substrates by motif recognition, a process called sumoylation. Two interesting questions are: 1) how many potential SUMO substrates may be included in mammalian proteomes, such as human and mouse, and 2) given a SUMO substrate, can we recognize its sumoylation sites? To answer these two questions, previous prediction systems of SUMO substrates mainly adopted pattern recognition methods, which could achieve high sensitivity but with relatively many potential false positives. We therefore use phylogenetic conservation between mouse and human to reduce the number of potential false positives.

Submitted 9 September, 2004; originally announced September 2004.

Comments: 15 pages (including 1 figure and 2 tables)

Showing 1–13 of 13 results for author: Zhou, F