Search | arXiv e-print repository

Towards an AI co-scientist

Authors: Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, Khaled Saab, Dan Popovici, Jacob Blum, Fan Zhang, Katherine Chou, Avinatan Hassidim, Burak Gokturk, Amin Vahdat, Pushmeet Kohli, Yossi Matias, Andrew Carroll, Kavita Kulkarni, Nenad Tomasev, Yuan Guan, Vikram Dhillon, et al. (9 additional authors not shown)

Abstract: Scientific discovery relies on scientists generating novel hypotheses that undergo rigorous experimental validation. To augment this process, we introduce an AI co-scientist, a multi-agent system built on Gemini 2.0. The AI co-scientist is intended to help uncover new, original knowledge and to formulate demonstrably novel research hypotheses and proposals, building upon prior evidence and aligned to scientist-provided research objectives and guidance. The system's design incorporates a generate, debate, and evolve approach to hypothesis generation, inspired by the scientific method and accelerated by scaling test-time compute. Key contributions include: (1) a multi-agent architecture with an asynchronous task execution framework for flexible compute scaling; (2) a tournament evolution process for self-improving hypotheses generation. Automated evaluations show continued benefits of test-time compute, improving hypothesis quality. While general purpose, we focus development and validation in three biomedical areas: drug repurposing, novel target discovery, and explaining mechanisms of bacterial evolution and anti-microbial resistance. For drug repurposing, the system proposes candidates with promising validation findings, including candidates for acute myeloid leukemia that show tumor inhibition in vitro at clinically applicable concentrations. For novel target discovery, the AI co-scientist proposed new epigenetic targets for liver fibrosis, validated by anti-fibrotic activity and liver cell regeneration in human hepatic organoids. Finally, the AI co-scientist recapitulated unpublished experimental results via a parallel in silico discovery of a novel gene transfer mechanism in bacterial evolution. These results, detailed in separate, co-timed reports, demonstrate the potential to augment biomedical and scientific discovery and usher in an era of AI-empowered scientists.

Submitted 26 February, 2025; originally announced February 2025.

Comments: 81 pages in total (main 38 pages, appendix 43 pages), 13 main figures, 40 appendix figures, 1 main table, 2 appendix tables, 143 main references, 7 appendix references

arXiv:2407.19349 [pdf]

Predicting T-Cell Receptor Specificity

Authors: Tengyao Tu, Wei Zeng, Kun Zhao, Zhenyu Zhang

Abstract: Researching the specificity of TCR contributes to the development of immunotherapy and provides new opportunities and strategies for personalized cancer immunotherapy. Therefore, we established a TCR generative specificity detection framework consisting of an antigen selector and a TCR classifier based on the Random Forest algorithm, aiming to efficiently screen out TCRs and target antigens and achieve TCR specificity prediction. Furthermore, we used the k-fold validation method to compare the performance of our model with ordinary deep learning methods. The result proves that adding a classifier to the model based on the random forest algorithm is very effective, and our model generally outperforms ordinary deep learning methods. Moreover, we put forward feasible optimization suggestions for the shortcomings and challenges of our model found during model implementation.

Submitted 27 July, 2024; originally announced July 2024.

arXiv:2307.10343 [pdf, other]

ProtiGeno: a prokaryotic short gene finder using protein language models

Authors: Tony Tu, Gautham Krishna, Amirali Aghazadeh

Abstract: Prokaryotic gene prediction plays an important role in understanding the biology of organisms and their function with applications in medicine and biotechnology. Although the current gene finders are highly sensitive in finding long genes, their sensitivity decreases noticeably in finding shorter genes (<180 nts). The culprit is insufficient annotated gene data to identify distinguishing features in short open reading frames (ORFs). We develop a deep learning-based method called ProtiGeno, specifically targeting short prokaryotic genes using a protein language model trained on millions of evolved proteins. In systematic large-scale experiments on 4,288 prokaryotic genomes, we demonstrate that ProtiGeno predicts short coding and noncoding genes with higher accuracy and recall than the current state-of-the-art gene finders. We discuss the predictive features of ProtiGeno and possible limitations by visualizing the three-dimensional structure of the predicted short genes. Data, codes, and models are available at https://github.com/tonytu16/protigeno.

Submitted 19 July, 2023; originally announced July 2023.

Comments: Accepted at the 2023 ICML Workshop on Computational Biology

ACM Class: I.2.1; J.3

arXiv:2212.02226 [pdf, other]

Inferring latent neural sources via deep transcoding of simultaneously acquired EEG and fMRI

Authors: Xueqing Liu, Tao Tu, Paul Sajda

Abstract: Simultaneous EEG-fMRI is a multi-modal neuroimaging technique that provides complementary spatial and temporal resolution. A challenge has been developing principled and interpretable approaches for fusing the modalities, specifically approaches enabling inference of latent source spaces representative of neural activity. In this paper, we address this inference problem within the framework of transcoding -- mapping from a specific encoding (modality) to a decoding (the latent source space) and then encoding the latent source space to the other modality. Specifically, we develop a symmetric method consisting of a cyclic convolutional transcoder that transcodes EEG to fMRI and vice versa. Without any prior knowledge of either the hemodynamic response function or lead field matrix, the completely data-driven method exploits the temporal and spatial relationships between the modalities and latent source spaces to learn these mappings. We quantify, for both the simulated and real EEG-fMRI data, how well the modalities can be transcoded from one to another as well as the source spaces that are recovered, all evaluated on unseen data. In addition to enabling a new way to symmetrically infer a latent source space, the method can also be seen as low-cost computational neuroimaging -- i.e. generating an 'expensive' fMRI BOLD image from 'low cost' EEG data.

Submitted 27 November, 2022; originally announced December 2022.

arXiv:1911.11846 [pdf]

Physics Approaches to the Spatial Distribution of Immune Cells in Tumors

Authors: Clare C. Yu, Juliana C. Wortman, Ting-Fang He, Shawn Solomon, Robert Z. Zhang, Anthony Rosario, Roger Wang, Travis Y. Tu, Daniel Schmolze, Yuan Yuan, Susan E. Yost, Xuefei Li, Herbert Levine, Gurinder Atwal, Peter P. Lee

Abstract: The goal of immunotherapy is to enhance the ability of the immune system to kill cancer cells. Immunotherapy is more effective and, in general, the prognosis is better, when more immune cells infiltrate the tumor. We explore the question of whether the spatial distribution rather than just the density of immune cells in the tumor is important in forecasting whether cancer recurs. After reviewing previous work on this issue, we introduce a novel application of maximum entropy to quantify the spatial distribution of discrete point-like objects. We apply our approach to B and T cells in images of tumor tissue taken from triple negative breast cancer (TNBC) patients. We find that there is a distinct difference in the spatial distribution of immune cells between good clinical outcome (no recurrence of cancer within at least 5 years of diagnosis) and poor clinical outcome (recurrence within 3 years of diagnosis). Our results highlight the importance of spatial distribution of immune cells within tumors with regard to clinical outcome, and raise new questions on their role in cancer recurrence.

Submitted 26 November, 2019; originally announced November 2019.

Showing 1–5 of 5 results for author: Tu, T