-
Benchmarking AI scientists in omics data-driven biological research
Authors:
Erpai Luo,
Jinmeng Jia,
Yifan Xiong,
Xiangyu Li,
Xiaobo Guo,
Baoqi Yu,
Lei Wei,
Xuegong Zhang
Abstract:
The rise of large language models and multi-agent systems has sparked growing interest in AI scientists capable of autonomous biological research. However, existing benchmarks either focus on reasoning without data or on data analysis with predefined statistical answers, lacking realistic, data-driven evaluation settings. Here, we introduce the Biological AI Scientist Benchmark (BaisBench), a benc…
▽ More
The rise of large language models and multi-agent systems has sparked growing interest in AI scientists capable of autonomous biological research. However, existing benchmarks either focus on reasoning without data or on data analysis with predefined statistical answers, lacking realistic, data-driven evaluation settings. Here, we introduce the Biological AI Scientist Benchmark (BaisBench), a benchmark designed to assess AI scientists' ability to generate biological discoveries through data analysis and reasoning with external knowledge. BaisBench comprises two tasks: cell type annotation on 31 expert-labeled single-cell datasets, and scientific discovery through answering 198 multiple-choice questions derived from the biological insights of 41 recent single-cell studies. Systematic experiments on state-of-the-art AI scientists and LLM agents showed that while promising, current models still substantially underperform human experts on both tasks. We hope BaisBench will fill this gap and serve as a foundation for advancing and evaluating AI models for scientific discovery. The benchmark can be found at: https://github.com/EperLuo/BaisBench.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
QueST: Querying Functional and Structural Niches on Spatial Transcriptomics Data via Contrastive Subgraph Embedding
Authors:
Mo Chen,
Minsheng Hao,
Xuegong Zhang,
Lei Wei
Abstract:
The functional or structural spatial regions within tissues, referred to as spatial niches, are elements for illustrating the spatial contexts of multicellular organisms. A key challenge is querying shared niches across diverse tissues, which is crucial for achieving a comprehensive understanding of the organization and phenotypes of cell populations. However, current data analysis methods predomi…
▽ More
The functional or structural spatial regions within tissues, referred to as spatial niches, are elements for illustrating the spatial contexts of multicellular organisms. A key challenge is querying shared niches across diverse tissues, which is crucial for achieving a comprehensive understanding of the organization and phenotypes of cell populations. However, current data analysis methods predominantly focus on creating spatial-aware embeddings for cells, neglecting the development of niche-level representations for effective querying. To address this gap, we introduce QueST, a novel niche representation learning model designed for querying spatial niches across multiple samples. QueST utilizes a novel subgraph contrastive learning approach to explicitly capture niche-level characteristics and incorporates adversarial training to mitigate batch effects. We evaluate QueST on established benchmarks using human and mouse datasets, demonstrating its superiority over state-of-the-art graph representation learning methods in accurate niche queries. Overall, QueST offers a specialized model for spatial niche queries, paving the way for deeper insights into the patterns and mechanisms of cell spatial organization across tissues. Source code can be found at https://github.com/cmhimself/QueST.
△ Less
Submitted 14 October, 2024;
originally announced October 2024.
-
Diagnosis and Pathogenic Analysis of Autism Spectrum Disorder Using Fused Brain Connection Graph
Authors:
Lu Wei,
Yi Huang,
Guosheng Yin,
Fode Zhang,
Manxue Zhang,
Bin Liu
Abstract:
We propose a model for diagnosing Autism spectrum disorder (ASD) using multimodal magnetic resonance imaging (MRI) data. Our approach integrates brain connectivity data from diffusion tensor imaging (DTI) and functional MRI (fMRI), employing graph neural networks (GNNs) for fused graph classification. To improve diagnostic accuracy, we introduce a loss function that maximizes inter-class and minim…
▽ More
We propose a model for diagnosing Autism spectrum disorder (ASD) using multimodal magnetic resonance imaging (MRI) data. Our approach integrates brain connectivity data from diffusion tensor imaging (DTI) and functional MRI (fMRI), employing graph neural networks (GNNs) for fused graph classification. To improve diagnostic accuracy, we introduce a loss function that maximizes inter-class and minimizes intra-class margins. We also analyze network node centrality, calculating degree, subgraph, and eigenvector centralities on a bimodal fused brain graph to identify pathological regions linked to ASD. Two non-parametric tests assess the statistical significance of these centralities between ASD patients and healthy controls. Our results reveal consistency between the tests, yet the identified regions differ significantly across centralities, suggesting distinct physiological interpretations. These findings enhance our understanding of ASD's neurobiological basis and offer new directions for clinical diagnosis.
△ Less
Submitted 21 September, 2024;
originally announced October 2024.
-
scDiffusion: conditional generation of high-quality single-cell data using diffusion model
Authors:
Erpai Luo,
Minsheng Hao,
Lei Wei,
Xuegong Zhang
Abstract:
Single-cell RNA sequencing (scRNA-seq) data are important for studying the laws of life at single-cell level. However, it is still challenging to obtain enough high-quality scRNA-seq data. To mitigate the limited availability of data, generative models have been proposed to computationally generate synthetic scRNA-seq data. Nevertheless, the data generated with current models are not very realisti…
▽ More
Single-cell RNA sequencing (scRNA-seq) data are important for studying the laws of life at single-cell level. However, it is still challenging to obtain enough high-quality scRNA-seq data. To mitigate the limited availability of data, generative models have been proposed to computationally generate synthetic scRNA-seq data. Nevertheless, the data generated with current models are not very realistic yet, especially when we need to generate data with controlled conditions. In the meantime, the Diffusion models have shown their power in generating data at high fidelity, providing a new opportunity for scRNA-seq generation.
In this study, we developed scDiffusion, a generative model combining diffusion model and foundation model to generate high-quality scRNA-seq data with controlled conditions. We designed multiple classifiers to guide the diffusion process simultaneously, enabling scDiffusion to generate data under multiple condition combinations. We also proposed a new control strategy called Gradient Interpolation. This strategy allows the model to generate continuous trajectories of cell development from a given cell state.
Experiments showed that scDiffusion can generate single-cell gene expression data closely resembling real scRNA-seq data. Also, scDiffusion can conditionally produce data on specific cell types including rare cell types. Furthermore, we could use the multiple-condition generation of scDiffusion to generate cell type that was out of the training data. Leveraging the Gradient Interpolation strategy, we generated a continuous developmental trajectory of mouse embryonic cells. These experiments demonstrate that scDiffusion is a powerful tool for augmenting the real scRNA-seq data and can provide insights into cell fate research.
△ Less
Submitted 4 March, 2024; v1 submitted 8 January, 2024;
originally announced January 2024.
-
Unidirectional brain-computer interface: Artificial neural network encoding natural images to fMRI response in the visual cortex
Authors:
Ruixing Liang,
Xiangyu Zhang,
Qiong Li,
Lai Wei,
Hexin Liu,
Avisha Kumar,
Kelley M. Kempski Leadingham,
Joshua Punnoose,
Leibny Paola Garcia,
Amir Manbachi
Abstract:
While significant advancements in artificial intelligence (AI) have catalyzed progress across various domains, its full potential in understanding visual perception remains underexplored. We propose an artificial neural network dubbed VISION, an acronym for "Visual Interface System for Imaging Output of Neural activity," to mimic the human brain and show how it can foster neuroscientific inquiries…
▽ More
While significant advancements in artificial intelligence (AI) have catalyzed progress across various domains, its full potential in understanding visual perception remains underexplored. We propose an artificial neural network dubbed VISION, an acronym for "Visual Interface System for Imaging Output of Neural activity," to mimic the human brain and show how it can foster neuroscientific inquiries. Using visual and contextual inputs, this multimodal model predicts the brain's functional magnetic resonance imaging (fMRI) scan response to natural images. VISION successfully predicts human hemodynamic responses as fMRI voxel values to visual inputs with an accuracy exceeding state-of-the-art performance by 45%. We further probe the trained networks to reveal representational biases in different visual areas, generate experimentally testable hypotheses, and formulate an interpretable metric to associate these hypotheses with cortical functions. With both a model and evaluation metric, the cost and time burdens associated with designing and implementing functional analysis on the visual cortex could be reduced. Our work suggests that the evolution of computational models may shed light on our fundamental understanding of the visual cortex and provide a viable approach toward reliable brain-machine interfaces.
△ Less
Submitted 26 September, 2023;
originally announced September 2023.
-
MolCAP: Molecular Chemical reActivity pretraining and prompted-finetuning enhanced molecular representation learning
Authors:
Yu Wang,
JingJie Zhang,
Junru Jin,
Leyi Wei
Abstract:
Molecular representation learning (MRL) is a fundamental task for drug discovery. However, previous deep-learning (DL) methods focus excessively on learning robust inner-molecular representations by mask-dominated pretraining framework, neglecting abundant chemical reactivity molecular relationships that have been demonstrated as the determining factor for various molecular property prediction tas…
▽ More
Molecular representation learning (MRL) is a fundamental task for drug discovery. However, previous deep-learning (DL) methods focus excessively on learning robust inner-molecular representations by mask-dominated pretraining framework, neglecting abundant chemical reactivity molecular relationships that have been demonstrated as the determining factor for various molecular property prediction tasks. Here, we present MolCAP to promote MRL, a graph pretraining Transformer based on chemical reactivity (IMR) knowledge with prompted finetuning. Results show that MolCAP outperforms comparative methods based on traditional molecular pretraining framework, in 13 publicly available molecular datasets across a diversity of biomedical tasks. Prompted by MolCAP, even basic graph neural networks are capable of achieving surprising performance that outperforms previous models, indicating the promising prospect of applying reactivity information for MRL. In addition, manual designed molecular templets are potential to uncover the dataset bias. All in all, we expect our MolCAP to gain more chemical meaningful insights for the entire process of drug discovery.
△ Less
Submitted 13 June, 2023;
originally announced June 2023.
-
Multi-view deep learning based molecule design and structural optimization accelerates the SARS-CoV-2 inhibitor discovery
Authors:
Chao Pang,
Yu Wang,
Yi Jiang,
Ruheng Wang,
Ran Su,
Leyi Wei
Abstract:
In this work, we propose MEDICO, a Multi-viEw Deep generative model for molecule generation, structural optimization, and the SARS-CoV-2 Inhibitor disCOvery. To the best of our knowledge, MEDICO is the first-of-this-kind graph generative model that can generate molecular graphs similar to the structure of targeted molecules, with a multi-view representation learning framework to sufficiently and a…
▽ More
In this work, we propose MEDICO, a Multi-viEw Deep generative model for molecule generation, structural optimization, and the SARS-CoV-2 Inhibitor disCOvery. To the best of our knowledge, MEDICO is the first-of-this-kind graph generative model that can generate molecular graphs similar to the structure of targeted molecules, with a multi-view representation learning framework to sufficiently and adaptively learn comprehensive structural semantics from targeted molecular topology and geometry. We show that our MEDICO significantly outperforms the state-of-the-art methods in generating valid, unique, and novel molecules under benchmarking comparisons. In particular, we showcase the multi-view deep learning model enables us to generate not only the molecules structurally similar to the targeted molecules but also the molecules with desired chemical properties, demonstrating the strong capability of our model in exploring the chemical space deeply. Moreover, case study results on targeted molecule generation for the SARS-CoV-2 main protease (Mpro) show that by integrating molecule docking into our model as chemical priori, we successfully generate new small molecules with desired drug-like properties for the Mpro, potentially accelerating the de novo design of Covid-19 drugs. Further, we apply MEDICO to the structural optimization of three well-known Mpro inhibitors (N3, 11a, and GC376) and achieve ~88% improvement in their binding affinity to Mpro, demonstrating the application value of our model for the development of therapeutics for SARS-CoV-2 infection.
△ Less
Submitted 3 December, 2022;
originally announced December 2022.
-
Learning Melanocytic Cell Masks from Adjacent Stained Tissue
Authors:
Mikio Tada,
Ursula E. Lang,
Iwei Yeh,
Elizabeth S. Keiser,
Maria L. Wei,
Michael J. Keiser
Abstract:
Melanoma is one of the most aggressive forms of skin cancer, causing a large proportion of skin cancer deaths. However, melanoma diagnoses by pathologists shows low interrater reliability. As melanoma is a cancer of the melanocyte, there is a clear need to develop a melanocytic cell segmentation tool that is agnostic to pathologist variability and automates pixel-level annotation. Gigapixel-level…
▽ More
Melanoma is one of the most aggressive forms of skin cancer, causing a large proportion of skin cancer deaths. However, melanoma diagnoses by pathologists shows low interrater reliability. As melanoma is a cancer of the melanocyte, there is a clear need to develop a melanocytic cell segmentation tool that is agnostic to pathologist variability and automates pixel-level annotation. Gigapixel-level pathologist labeling, however, is impractical. Herein, we propose a means to train deep neural networks for melanocytic cell segmentation from hematoxylin and eosin (H&E) stained sections and paired immunohistochemistry (IHC) of adjacent tissue sections, achieving a mean IOU of 0.64 despite imperfect ground-truth labels.
△ Less
Submitted 13 March, 2024; v1 submitted 31 October, 2022;
originally announced November 2022.
-
MechRetro is a chemical-mechanism-driven graph learning framework for interpretable retrosynthesis prediction and pathway planning
Authors:
Yu Wang,
Chao Pang,
Yuzhe Wang,
Yi Jiang,
Junru Jin,
Sirui Liang,
Quan Zou,
Leyi Wei
Abstract:
Leveraging artificial intelligence for automatic retrosynthesis speeds up organic pathway planning in digital laboratories. However, existing deep learning approaches are unexplainable, like "black box" with few insights, notably limiting their applications in real retrosynthesis scenarios. Here, we propose MechRetro, a chemical-mechanism-driven graph learning framework for interpretable retrosynt…
▽ More
Leveraging artificial intelligence for automatic retrosynthesis speeds up organic pathway planning in digital laboratories. However, existing deep learning approaches are unexplainable, like "black box" with few insights, notably limiting their applications in real retrosynthesis scenarios. Here, we propose MechRetro, a chemical-mechanism-driven graph learning framework for interpretable retrosynthetic prediction and pathway planning, which learns several retrosynthetic actions to simulate a reverse reaction via elaborate self-adaptive joint learning. By integrating chemical knowledge as prior information, we design a novel Graph Transformer architecture to adaptively learn discriminative and chemically meaningful molecule representations, highlighting the strong capacity in molecule feature representation learning. We demonstrate that MechRetro outperforms the state-of-the-art approaches for retrosynthetic prediction with a large margin on large-scale benchmark datasets. Extending MechRetro to the multi-step retrosynthesis analysis, we identify efficient synthetic routes via an interpretable reasoning mechanism, leading to a better understanding in the realm of knowledgeable synthetic chemists. We also showcase that MechRetro discovers a novel pathway for protokylol, along with energy scores for uncertainty assessment, broadening the applicability for practical scenarios. Overall, we expect MechRetro to provide meaningful insights for high-throughput automated organic synthesis in drug discovery.
△ Less
Submitted 5 October, 2022;
originally announced October 2022.
-
The dichotomy structure of Y chromosome Haplogroup N
Authors:
Kang Hu,
Shi Yan,
Kai Liu,
Chao Ning,
Lan-Hai Wei,
Shi-Lin Li,
Bing Song,
Ge Yu,
Feng Chen,
Li-Jun Liu,
Zhi-Peng Zhao,
Chuan-Chao Wang,
Ya-Jun Yang,
Zhen-Dong Qin,
Jing-Ze Tan,
Fu-Zhong Xue,
Hui Li,
Long-Li Kang,
Li Jin
Abstract:
Haplogroup N-M231 of human Y chromosome is a common clade from Eastern Asia to Northern Europe, being one of the most frequent haplogroups in Altaic and Uralic-speaking populations. Using newly discovered bi-allelic markers from high-throughput DNA sequencing, we largely improved the phylogeny of Haplogroup N, in which 16 subclades could be identified by 33 SNPs. More than 400 males belonging to H…
▽ More
Haplogroup N-M231 of human Y chromosome is a common clade from Eastern Asia to Northern Europe, being one of the most frequent haplogroups in Altaic and Uralic-speaking populations. Using newly discovered bi-allelic markers from high-throughput DNA sequencing, we largely improved the phylogeny of Haplogroup N, in which 16 subclades could be identified by 33 SNPs. More than 400 males belonging to Haplogroup N in 34 populations in China were successfully genotyped, and populations in Northern Asia and Eastern Europe were also compared together. We found that all the N samples were typed as inside either clade N1-F1206 (including former N1a-M128, N1b-P43 and N1c-M46 clades), most of which were found in Altaic, Uralic, Russian and Chinese-speaking populations, or N2-F2930, common in Tibeto-Burman and Chinese-speaking populations. Our detailed results suggest that Haplogroup N developed in the region of China since the final stage of late Paleolithic Era.
△ Less
Submitted 24 April, 2015;
originally announced April 2015.
-
Y Chromosome of Aisin Gioro, the Imperial House of Qing Dynasty
Authors:
Shi Yan,
Harumasa Tachibana,
Lan-Hai Wei,
Ge Yu,
Shao-Qing Wen,
Chuan-Chao Wang
Abstract:
House of Aisin Gioro is the imperial family of the last dynasty in Chinese history - Qing Dynasty (1644 - 1911). Aisin Gioro family originated from Jurchen tribes and developed the Manchu people before they conquered China. By investigating the Y chromosomal short tandem repeats (STRs) of 7 modern male individuals who claim belonging to Aisin Gioro family (in which 3 have full records of pedigree)…
▽ More
House of Aisin Gioro is the imperial family of the last dynasty in Chinese history - Qing Dynasty (1644 - 1911). Aisin Gioro family originated from Jurchen tribes and developed the Manchu people before they conquered China. By investigating the Y chromosomal short tandem repeats (STRs) of 7 modern male individuals who claim belonging to Aisin Gioro family (in which 3 have full records of pedigree), we found that 3 of them (in which 2 keep full pedigree, whose most recent common ancestor is Nurgaci) shows very close relationship (1 - 2 steps of difference in 17 STR) and the haplotype is rare. We therefore conclude that this haplotype is the Y chromosome of the House of Aisin Gioro. Further tests of single nucleotide polymorphisms (SNPs) indicates that they belong to Haplogroup C3b2b1*-M401(xF5483), although their Y-STR results are distant to the "star cluster", which also belongs to the same haplogroup. This study forms the base for the pedigree research of the imperial family of Qing Dynasty by means of genetics.
△ Less
Submitted 19 December, 2014;
originally announced December 2014.
-
Y Chromosomes of 40% Chinese Are Descendants of Three Neolithic Super-grandfathers
Authors:
Shi Yan,
Chuan-Chao Wang,
Hong-Xiang Zheng,
Wei Wang,
Zhen-Dong Qin,
Lan-Hai Wei,
Yi Wang,
Xue-Dong Pan,
Wen-Qing Fu,
Yun-Gang He,
Li-Jun Xiong,
Wen-Fei Jin,
Shi-Lin Li,
Yu An,
Hui Li,
Li Jin
Abstract:
Demographic change of human populations is one of the central questions for delving into the past of human beings. To identify major population expansions related to male lineages, we sequenced 78 East Asian Y chromosomes at 3.9 Mbp of the non-recombining region (NRY), discovered >4,000 new SNPs, and identified many new clades. The relative divergence dates can be estimated much more precisely usi…
▽ More
Demographic change of human populations is one of the central questions for delving into the past of human beings. To identify major population expansions related to male lineages, we sequenced 78 East Asian Y chromosomes at 3.9 Mbp of the non-recombining region (NRY), discovered >4,000 new SNPs, and identified many new clades. The relative divergence dates can be estimated much more precisely using molecular clock. We found that all the Paleolithic divergences were binary; however, three strong star-like Neolithic expansions at ~6 kya (thousand years ago) (assuming a constant substitution rate of 1e-9/bp/year) indicates that ~40% of modern Chinese are patrilineal descendants of only three super-grandfathers at that time. This observation suggests that the main patrilineal expansion in China occurred in the Neolithic Era and might be related to the development of agriculture.
△ Less
Submitted 14 October, 2013;
originally announced October 2013.
-
From Cellular Characteristics to Disease Diagnosis: Uncovering Phenotypes with Supercells
Authors:
Julián Candia,
Ryan Maunu,
Meghan Driscoll,
Angélique Biancotto,
Pradeep Dagur,
J. Philip McCoy Jr,
H. Nida Sen,
Lai Wei,
Amos Maritan,
Kan Cao,
Robert B. Nussenblatt,
Jayanth R. Banavar,
Wolfgang Losert
Abstract:
Cell heterogeneity and the inherent complexity due to the interplay of multiple molecular processes within the cell pose difficult challenges for current single-cell biology. We introduce an approach that identifies a disease phenotype from multiparameter single-cell measurements, which is based on the concept of "supercell statistics", a single-cell-based averaging procedure followed by a machine…
▽ More
Cell heterogeneity and the inherent complexity due to the interplay of multiple molecular processes within the cell pose difficult challenges for current single-cell biology. We introduce an approach that identifies a disease phenotype from multiparameter single-cell measurements, which is based on the concept of "supercell statistics", a single-cell-based averaging procedure followed by a machine learning classification scheme. We are able to assess the optimal tradeoff between the number of single cells averaged and the number of measurements needed to capture phenotypic differences between healthy and diseased patients, as well as between different diseases that are difficult to diagnose otherwise. We apply our approach to two kinds of single-cell datasets, addressing the diagnosis of a premature aging disorder using images of cell nuclei, as well as the phenotypes of two non-infectious uveitides (the ocular manifestations of Behçet's disease and sarcoidosis) based on multicolor flow cytometry. In the former case, one nuclear shape measurement taken over a group of 30 cells is sufficient to classify samples as healthy or diseased, in agreement with usual laboratory practice. In the latter, our method is able to identify a minimal set of 5 markers that accurately predict Behçet's disease and sarcoidosis. This is the first time that a quantitative phenotypic distinction between these two diseases has been achieved. To obtain this clear phenotypic signature, about one hundred CD8+ T cells need to be measured. Beyond these specific cases, the approach proposed here is applicable to datasets generated by other kinds of state-of-the-art and forthcoming single-cell technologies, such as multidimensional mass cytometry, single-cell gene expression, and single-cell full genome sequencing techniques.
△ Less
Submitted 1 August, 2013;
originally announced August 2013.