-
MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research
Authors:
James Burgess,
Jeffrey J Nirschl,
Laura Bravo-Sánchez,
Alejandro Lozano,
Sanket Rajan Gupte,
Jesus G. Galaz-Montoya,
Yuhui Zhang,
Yuchang Su,
Disha Bhowmik,
Zachary Coman,
Sarina M. Hasan,
Alexandra Johannesson,
William D. Leineweber,
Malvika G Nair,
Ridhi Yarlagadda,
Connor Zuraski,
Wah Chiu,
Sarah Cohen,
Jan N. Hansen,
Manuel D Leonetti,
Chad Liu,
Emma Lundberg,
Serena Yeung-Levy
Abstract:
Scientific research demands sophisticated reasoning over multimodal data, a challenge especially prevalent in biology. Despite recent advances in multimodal large language models (MLLMs) for AI-assisted research, existing multimodal reasoning benchmarks only target up to college-level difficulty, while research-level benchmarks emphasize lower-level perception, falling short of the complex multimodal reasoning needed for scientific discovery. To bridge this gap, we introduce MicroVQA, a visual-question answering (VQA) benchmark designed to assess three reasoning capabilities vital in research workflows: expert image understanding, hypothesis generation, and experiment proposal. MicroVQA consists of 1,042 multiple-choice questions (MCQs) curated by biology experts across diverse microscopy modalities, ensuring VQA samples represent real scientific practice. In constructing the benchmark, we find that standard MCQ generation methods induce language shortcuts, motivating a new two-stage pipeline: an optimized LLM prompt structures question-answer pairs into MCQs; then, an agent-based 'RefineBot' updates them to remove shortcuts. Benchmarking on state-of-the-art MLLMs reveals a peak performance of 53%; models with smaller LLMs only slightly underperform top models, suggesting that language-based reasoning is less challenging than multimodal reasoning; and tuning with scientific articles enhances performance. Expert analysis of chain-of-thought responses shows that perception errors are the most frequent, followed by knowledge errors and then overgeneralization errors. These insights highlight the challenges in multimodal scientific reasoning, showing MicroVQA is a valuable resource advancing AI-driven biomedical research. MicroVQA is available at https://huggingface.co/datasets/jmhb/microvqa, and project page at https://jmhb0.github.io/microvqa.
Submitted 17 March, 2025;
originally announced March 2025.
-
CellFlow: Simulating Cellular Morphology Changes via Flow Matching
Authors:
Yuhui Zhang,
Yuchang Su,
Chenyu Wang,
Tianhong Li,
Zoe Wefers,
Jeffrey Nirschl,
James Burgess,
Daisy Ding,
Alejandro Lozano,
Emma Lundberg,
Serena Yeung-Levy
Abstract:
Building a virtual cell capable of accurately simulating cellular behaviors in silico has long been a dream in computational biology. We introduce CellFlow, an image-generative model that simulates cellular morphology changes induced by chemical and genetic perturbations using flow matching. Unlike prior methods, CellFlow models distribution-wise transformations from unperturbed to perturbed cell states, effectively distinguishing actual perturbation effects from experimental artifacts such as batch effects, a major challenge in biological data. Evaluated on chemical (BBBC021), genetic (RxRx1), and combined perturbation (JUMP) datasets, CellFlow generates biologically meaningful cell images that faithfully capture perturbation-specific morphological changes, achieving a 35% improvement in FID scores and a 12% increase in mode-of-action prediction accuracy over existing methods. Additionally, CellFlow enables continuous interpolation between cellular states, providing a potential tool for studying perturbation dynamics. These capabilities mark a significant step toward realizing virtual cell modeling for biomedical research.
Submitted 13 February, 2025;
originally announced February 2025.
-
How to Build the Virtual Cell with Artificial Intelligence: Priorities and Opportunities
Authors:
Charlotte Bunne,
Yusuf Roohani,
Yanay Rosen,
Ankit Gupta,
Xikun Zhang,
Marcel Roed,
Theo Alexandrov,
Mohammed AlQuraishi,
Patricia Brennan,
Daniel B. Burkhardt,
Andrea Califano,
Jonah Cool,
Abby F. Dernburg,
Kirsty Ewing,
Emily B. Fox,
Matthias Haury,
Amy E. Herr,
Eric Horvitz,
Patrick D. Hsu,
Viren Jain,
Gregory R. Johnson,
Thomas Kalil,
David R. Kelley,
Shana O. Kelley,
Anna Kreshuk, et al. (17 additional authors not shown)
Abstract:
The cell is arguably the most fundamental unit of life and is central to understanding biology. Accurate modeling of cells is important for this understanding as well as for determining the root causes of disease. Recent advances in artificial intelligence (AI), combined with the ability to generate large-scale experimental data, present novel opportunities to model cells. Here we propose a vision of leveraging advances in AI to construct virtual cells, high-fidelity simulations of cells and cellular systems under different conditions that are directly learned from biological data across measurements and scales. We discuss desired capabilities of such AI Virtual Cells, including generating universal representations of biological entities across scales, and facilitating interpretable in silico experiments to predict and understand their behavior using virtual instruments. We further address the challenges, opportunities and requirements to realize this vision including data needs, evaluation strategies, and community standards and engagement to ensure biological accuracy and broad utility. We envision a future where AI Virtual Cells help identify new drug targets, predict cellular responses to perturbations, as well as scale hypothesis exploration. With open science collaborations across the biomedical ecosystem that includes academia, philanthropy, and the biopharma and AI industries, a comprehensive predictive understanding of cell mechanisms and interactions has come into reach.
Submitted 14 October, 2024; v1 submitted 17 September, 2024;
originally announced September 2024.
-
Enabling Global Image Data Sharing in the Life Sciences
Authors:
Peter Bajcsy,
Sreenivas Bhattiprolu,
Katy Boerner,
Beth A Cimini,
Lucy Collinson,
Jan Ellenberg,
Reto Fiolka,
Maryellen Giger,
Wojtek Goscinski,
Matthew Hartley,
Nathan Hotaling,
Rick Horwitz,
Florian Jug,
Anna Kreshuk,
Emma Lundberg,
Aastha Mathur,
Kedar Narayan,
Shuichi Onami,
Anne L. Plant,
Fred Prior,
Jason Swedlow,
Adam Taylor,
Antje Keppler
Abstract:
Coordinated collaboration is essential to realize the added value of and infrastructure requirements for global image data sharing in the life sciences. In this White Paper, we take a first step at presenting some of the most common use cases as well as critical/emerging use cases of (including the use of artificial intelligence for) biological and medical image data, which would benefit tremendously from better frameworks for sharing (including technical, resourcing, legal, and ethical aspects). In the second half of this paper, we paint an ideal world scenario for how global image data sharing could work and benefit all life sciences and beyond. As this is still a long way off, we conclude by suggesting several concrete measures directed toward our institutions, existing imaging communities and data initiatives, and national funders, as well as publishers. Our vision is that within the next ten years, most researchers in the world will be able to make their datasets openly available and use quality image data of interest to them for their research and benefit. This paper is published in parallel with a companion White Paper entitled Harmonizing the Generation and Pre-publication Stewardship of FAIR Image Data, which addresses challenges and opportunities related to producing well-documented and high-quality image data that is ready to be shared. The driving goal is to address remaining challenges and democratize access to everyday practices and tools for a spectrum of biomedical researchers, regardless of their expertise, access to resources, and geographical location.
Submitted 9 August, 2024; v1 submitted 23 January, 2024;
originally announced January 2024.
-
Harmonizing the Generation and Pre-publication Stewardship of FAIR Image Data
Authors:
Nikki Bialy,
Frank Alber,
Brenda Andrews,
Michael Angelo,
Brian Beliveau,
Lacramioara Bintu,
Alistair Boettiger,
Ulrike Boehm,
Claire M. Brown,
Mahmoud Bukar Maina,
James J. Chambers,
Beth A. Cimini,
Kevin Eliceiri,
Rachel Errington,
Orestis Faklaris,
Nathalie Gaudreault,
Ronald N. Germain,
Wojtek Goscinski,
David Grunwald,
Michael Halter,
Dorit Hanein,
John W. Hickey,
Judith Lacoste,
Alex Laude,
Emma Lundberg, et al. (22 additional authors not shown)
Abstract:
Together with the molecular knowledge of genes and proteins, biological images promise to significantly enhance the scientific understanding of complex cellular systems and to advance predictive and personalized therapeutic products for human health. For this potential to be realized, quality-assured image data must be shared among labs at a global scale to be compared, pooled, and reanalyzed, thus unleashing untold potential beyond the original purpose for which the data was generated. There are two broad sets of requirements to enable image data sharing in the life sciences. One set of requirements is articulated in the companion White Paper entitled Enabling Global Image Data Sharing in the Life Sciences, which is published in parallel and addresses the need to build the cyberinfrastructure for sharing the digital array data. In this White Paper, we detail a broad set of requirements, which involves collecting, managing, presenting, and propagating contextual information essential to assess the quality, understand the content, interpret the scientific implications, and reuse image data in the context of the experimental details. We start by providing an overview of the main lessons learned to date through international community activities, which have recently made considerable progress toward generating community standard practices for imaging Quality Control (QC) and metadata. We then provide a clear set of recommendations for amplifying this work. The driving goal is to address remaining challenges and democratize access to everyday practices and tools for a spectrum of biomedical researchers, regardless of their expertise, access to resources, and geographical location.
Submitted 30 August, 2024; v1 submitted 23 January, 2024;
originally announced January 2024.
-
New views of old proteins: clarifying the enigmatic proteome
Authors:
Participants in a NIH Workshop on Functional, Integrative Proteomics:
Kristin E. Burnum Johnson,
Thomas P. Conrads,
Richard R. Drake,
Amy E. Herr,
Ravi Iyengar,
Ryan T. Kelly,
Emma Lundberg,
Michael J. MacCoss,
Alexandra Naba,
Garry P. Nolan,
Pavel A. Pevzner,
Karin D. Rodland,
Salvatore Sechi,
Nikolai Slavov,
Jeffrey M. Spraggins,
Jennifer E. Van Eyk,
Marc Vidal,
Christine Vogel,
David R. Walt,
Neil L. Kelleher
Abstract:
All human diseases involve proteins, yet our current tools to characterize and quantify them are limited. To better elucidate proteins across space, time, and molecular composition, we provide provocative projections for technologies to meet the challenges that protein biology presents. With a broad perspective, we discuss grand opportunities to transition the science of proteomics into a more propulsive enterprise. Extrapolating recent trends, we offer potential futures for a next generation of disruptive approaches to define, quantify and visualize the multiple dimensions of the proteome, thereby transforming our understanding and interactions with human disease in the coming decade.
Submitted 17 August, 2021;
originally announced August 2021.
-
Spatial mapping of protein composition and tissue organization: a primer for multiplexed antibody-based imaging
Authors:
John W. Hickey,
Elizabeth K. Neumann,
Andrea J. Radtke,
Jeannie M. Camarillo,
Rebecca T. Beuschel,
Alexandre Albanese,
Elizabeth McDonough,
Julia Hatler,
Anne E. Wiblin,
Jeremy Fisher,
Josh Croteau,
Eliza C. Small,
Anup Sood,
Richard M. Caprioli,
R. Michael Angelo,
Garry P. Nolan,
Kwanghun Chung,
Stephen M. Hewitt,
Ronald N. Germain,
Jeffrey M. Spraggins,
Emma Lundberg,
Michael P. Snyder,
Neil L. Kelleher,
Sinem K. Saka
Abstract:
Tissues and organs are composed of distinct cell types that must operate in concert to perform physiological functions. Efforts to create high-dimensional biomarker catalogs of these cells are largely based on transcriptomic single-cell approaches that lack the spatial context required to understand critical cellular communication and correlated structural organization. To probe in situ biology with sufficient coverage depth, several multiplexed protein imaging methods have recently been developed. Though these antibody-based technologies differ in strategy and mode of immunolabeling and detection tags, they commonly utilize antibodies directed against protein biomarkers to provide detailed spatial and functional maps of complex tissues. As these promising antibody-based multiplexing approaches become more widely adopted, new frameworks and considerations are critical for training future users, generating molecular tools, validating antibody panels, and harmonizing datasets. In this perspective, we provide essential resources and key considerations for obtaining robust and reproducible multiplexed antibody-based imaging data, compiling specialized knowledge from domain experts and technology developers.
Submitted 16 July, 2021;
originally announced July 2021.
-
ImJoy: an open-source computational platform for the deep learning era
Authors:
Wei Ouyang,
Florian Mueller,
Martin Hjelmare,
Emma Lundberg,
Christophe Zimmer
Abstract:
Deep learning methods have shown extraordinary potential for analyzing very diverse biomedical data, but their dissemination beyond developers is hindered by important computational hurdles. We introduce ImJoy (https://imjoy.io/), a flexible and open-source browser-based platform designed to facilitate widespread reuse of deep learning solutions in biomedical research. We highlight ImJoy's main features and illustrate its functionalities with deep learning plugins for mobile and interactive image analysis and genomics.
Submitted 30 May, 2019;
originally announced May 2019.