Search | arXiv e-print repository

cp_measure: API-first feature extraction for image-based profiling workflows

Authors: Alán F. Muñoz, Tim Treis, Alexandr A. Kalinin, Shatavisha Dasgupta, Fabian Theis, Anne E. Carpenter, Shantanu Singh

Abstract: Biological image analysis has traditionally focused on measuring specific visual properties of interest for cells or other entities. A complementary paradigm gaining increasing traction is image-based profiling - quantifying many distinct visual features to form comprehensive profiles which may reveal hidden patterns in cellular states, drug responses, and disease mechanisms. While current tools l… ▽ More Biological image analysis has traditionally focused on measuring specific visual properties of interest for cells or other entities. A complementary paradigm gaining increasing traction is image-based profiling - quantifying many distinct visual features to form comprehensive profiles which may reveal hidden patterns in cellular states, drug responses, and disease mechanisms. While current tools like CellProfiler can generate these feature sets, they pose significant barriers to automated and reproducible analyses, hindering machine learning workflows. Here we introduce cp_measure, a Python library that extracts CellProfiler's core measurement capabilities into a modular, API-first tool designed for programmatic feature extraction. We demonstrate that cp_measure features retain high fidelity with CellProfiler features while enabling seamless integration with the scientific Python ecosystem. Through applications to 3D astrocyte imaging and spatial transcriptomics, we showcase how cp_measure enables reproducible, automated image-based profiling pipelines that scale effectively for machine learning applications in computational biology. △ Less

Submitted 1 July, 2025; originally announced July 2025.

Comments: 10 pages, 4 figures, 4 supplementary figures. CODEML Workshop paper accepted (non-archival), as a part of ICML2025 events

ACM Class: I.4.7

arXiv:2504.17247 [pdf, other]

Targeted AMP generation through controlled diffusion with efficient embeddings

Authors: Diogo Soares, Leon Hetzel, Paulina Szymczak, Fabian Theis, Stephan Günnemann, Ewa Szczurek

Abstract: Deep learning-based antimicrobial peptide (AMP) discovery faces critical challenges such as low experimental hit rates as well as the need for nuanced controllability and efficient modeling of peptide properties. To address these challenges, we introduce OmegAMP, a framework that leverages a diffusion-based generative model with efficient low-dimensional embeddings, precise controllability mechani… ▽ More Deep learning-based antimicrobial peptide (AMP) discovery faces critical challenges such as low experimental hit rates as well as the need for nuanced controllability and efficient modeling of peptide properties. To address these challenges, we introduce OmegAMP, a framework that leverages a diffusion-based generative model with efficient low-dimensional embeddings, precise controllability mechanisms, and novel classifiers with drastically reduced false positive rates for candidate filtering. OmegAMP enables the targeted generation of AMPs with specific physicochemical properties, activity profiles, and species-specific effectiveness. Moreover, it maximizes sample diversity while ensuring faithfulness to the underlying data distribution during generation. We demonstrate that OmegAMP achieves state-of-the-art performance across all stages of the AMP discovery pipeline, significantly advancing the potential of computational frameworks in combating antimicrobial resistance. △ Less

Submitted 24 April, 2025; originally announced April 2025.

arXiv:2503.20027 [pdf]

A scalable gene network model of regulatory dynamics in single cells

Authors: Paul Bertin, Joseph D. Viviano, Alejandro Tejada-Lapuerta, Weixu Wang, Stefan Bauer, Fabian J. Theis, Yoshua Bengio

Abstract: Single-cell data provide high-dimensional measurements of the transcriptional states of cells, but extracting insights into the regulatory functions of genes, particularly identifying transcriptional mechanisms affected by biological perturbations, remains a challenge. Many perturbations induce compensatory cellular responses, making it difficult to distinguish direct from indirect effects on gene… ▽ More Single-cell data provide high-dimensional measurements of the transcriptional states of cells, but extracting insights into the regulatory functions of genes, particularly identifying transcriptional mechanisms affected by biological perturbations, remains a challenge. Many perturbations induce compensatory cellular responses, making it difficult to distinguish direct from indirect effects on gene regulation. Modeling how gene regulatory functions shape the temporal dynamics of these responses is key to improving our understanding of biological perturbations. Dynamical models based on differential equations offer a principled way to capture transcriptional dynamics, but their application to single-cell data has been hindered by computational constraints, stochasticity, sparsity, and noise. Existing methods either rely on low-dimensional representations or make strong simplifying assumptions, limiting their ability to model transcriptional dynamics at scale. We introduce a Functional and Learnable model of Cell dynamicS, FLeCS, that incorporates gene network structure into coupled differential equations to model gene regulatory functions. Given (pseudo)time-series single-cell data, FLeCS accurately infers cell dynamics at scale, provides improved functional insights into transcriptional mechanisms perturbed by gene knockouts, both in myeloid differentiation and K562 Perturb-seq experiments, and simulates single-cell trajectories of A549 cells following small-molecule perturbations. △ Less

Submitted 25 March, 2025; originally announced March 2025.

Comments: 42 pages, 10 figures

arXiv:2501.02526 [pdf, other]

Unified Guidance for Geometry-Conditioned Molecular Generation

Authors: Sirine Ayadi, Leon Hetzel, Johanna Sommer, Fabian Theis, Stephan Günnemann

Abstract: Effectively designing molecular geometries is essential to advancing pharmaceutical innovations, a domain, which has experienced great attention through the success of generative models and, in particular, diffusion models. However, current molecular diffusion models are tailored towards a specific downstream task and lack adaptability. We introduce UniGuide, a framework for controlled geometric g… ▽ More Effectively designing molecular geometries is essential to advancing pharmaceutical innovations, a domain, which has experienced great attention through the success of generative models and, in particular, diffusion models. However, current molecular diffusion models are tailored towards a specific downstream task and lack adaptability. We introduce UniGuide, a framework for controlled geometric guidance of unconditional diffusion models that allows flexible conditioning during inference without the requirement of extra training or networks. We show how applications such as structure-based, fragment-based, and ligand-based drug design are formulated in the UniGuide framework and demonstrate on-par or superior performance compared to specialised models. Offering a more versatile approach, UniGuide has the potential to streamline the development of molecular generative models, allowing them to be readily used in diverse application scenarios. △ Less

Submitted 5 January, 2025; originally announced January 2025.

Comments: 38th Conference on Neural Information Processing Systems (NeurIPS)

arXiv:2409.11654 [pdf, other]

How to Build the Virtual Cell with Artificial Intelligence: Priorities and Opportunities

Authors: Charlotte Bunne, Yusuf Roohani, Yanay Rosen, Ankit Gupta, Xikun Zhang, Marcel Roed, Theo Alexandrov, Mohammed AlQuraishi, Patricia Brennan, Daniel B. Burkhardt, Andrea Califano, Jonah Cool, Abby F. Dernburg, Kirsty Ewing, Emily B. Fox, Matthias Haury, Amy E. Herr, Eric Horvitz, Patrick D. Hsu, Viren Jain, Gregory R. Johnson, Thomas Kalil, David R. Kelley, Shana O. Kelley, Anna Kreshuk , et al. (17 additional authors not shown)

Abstract: The cell is arguably the most fundamental unit of life and is central to understanding biology. Accurate modeling of cells is important for this understanding as well as for determining the root causes of disease. Recent advances in artificial intelligence (AI), combined with the ability to generate large-scale experimental data, present novel opportunities to model cells. Here we propose a vision… ▽ More The cell is arguably the most fundamental unit of life and is central to understanding biology. Accurate modeling of cells is important for this understanding as well as for determining the root causes of disease. Recent advances in artificial intelligence (AI), combined with the ability to generate large-scale experimental data, present novel opportunities to model cells. Here we propose a vision of leveraging advances in AI to construct virtual cells, high-fidelity simulations of cells and cellular systems under different conditions that are directly learned from biological data across measurements and scales. We discuss desired capabilities of such AI Virtual Cells, including generating universal representations of biological entities across scales, and facilitating interpretable in silico experiments to predict and understand their behavior using virtual instruments. We further address the challenges, opportunities and requirements to realize this vision including data needs, evaluation strategies, and community standards and engagement to ensure biological accuracy and broad utility. We envision a future where AI Virtual Cells help identify new drug targets, predict cellular responses to perturbations, as well as scale hypothesis exploration. With open science collaborations across the biomedical ecosystem that includes academia, philanthropy, and the biopharma and AI industries, a comprehensive predictive understanding of cell mechanisms and interactions has come into reach. △ Less

Submitted 14 October, 2024; v1 submitted 17 September, 2024; originally announced September 2024.

arXiv:2407.11734 [pdf, other]

Multi-Modal and Multi-Attribute Generation of Single Cells with CFGen

Authors: Alessandro Palma, Till Richter, Hanyi Zhang, Manuel Lubetzki, Alexander Tong, Andrea Dittadi, Fabian Theis

Abstract: Generative modeling of single-cell RNA-seq data is crucial for tasks like trajectory inference, batch effect removal, and simulation of realistic cellular data. However, recent deep generative models simulating synthetic single cells from noise operate on pre-processed continuous gene expression approximations, overlooking the discrete nature of single-cell data, which limits their effectiveness a… ▽ More Generative modeling of single-cell RNA-seq data is crucial for tasks like trajectory inference, batch effect removal, and simulation of realistic cellular data. However, recent deep generative models simulating synthetic single cells from noise operate on pre-processed continuous gene expression approximations, overlooking the discrete nature of single-cell data, which limits their effectiveness and hinders the incorporation of robust noise models. Additionally, aspects like controllable multi-modal and multi-label generation of cellular data remain underexplored. This work introduces CellFlow for Generation (CFGen), a flow-based conditional generative model that preserves the inherent discreteness of single-cell data. CFGen generates whole-genome multi-modal single-cell data reliably, improving the recovery of crucial biological data characteristics while tackling relevant generative tasks such as rare cell type augmentation and batch correction. We also introduce a novel framework for compositional data generation using Flow Matching. By showcasing CFGen on a diverse set of biological datasets and settings, we provide evidence of its value to the fields of computational biology and deep generative models. △ Less

Submitted 3 March, 2025; v1 submitted 16 July, 2024; originally announced July 2024.

Comments: 41 pages, 22 figures

Journal ref: The Thirteenth International Conference on Learning Representations (2025)

arXiv:2311.07621 [pdf, other]

To Transformers and Beyond: Large Language Models for the Genome

Authors: Micaela E. Consens, Cameron Dufault, Michael Wainberg, Duncan Forster, Mehran Karimzadeh, Hani Goodarzi, Fabian J. Theis, Alan Moses, Bo Wang

Abstract: In the rapidly evolving landscape of genomics, deep learning has emerged as a useful tool for tackling complex computational challenges. This review focuses on the transformative role of Large Language Models (LLMs), which are mostly based on the transformer architecture, in genomics. Building on the foundation of traditional convolutional neural networks and recurrent neural networks, we explore… ▽ More In the rapidly evolving landscape of genomics, deep learning has emerged as a useful tool for tackling complex computational challenges. This review focuses on the transformative role of Large Language Models (LLMs), which are mostly based on the transformer architecture, in genomics. Building on the foundation of traditional convolutional neural networks and recurrent neural networks, we explore both the strengths and limitations of transformers and other LLMs for genomics. Additionally, we contemplate the future of genomic modeling beyond the transformer architecture based on current trends in research. The paper aims to serve as a guide for computational biologists and computer scientists interested in LLMs for genomic data. We hope the paper can also serve as an educational introduction and discussion for biologists to a fundamental shift in how we will be analyzing genomic data in the future. △ Less

Submitted 12 November, 2023; originally announced November 2023.

arXiv:2311.02455 [pdf, other]

Mixed Models with Multiple Instance Learning

Authors: Jan P. Engelmann, Alessandro Palma, Jakub M. Tomczak, Fabian J. Theis, Francesco Paolo Casale

Abstract: Predicting patient features from single-cell data can help identify cellular states implicated in health and disease. Linear models and average cell type expressions are typically favored for this task for their efficiency and robustness, but they overlook the rich cell heterogeneity inherent in single-cell data. To address this gap, we introduce MixMIL, a framework integrating Generalized Linear… ▽ More Predicting patient features from single-cell data can help identify cellular states implicated in health and disease. Linear models and average cell type expressions are typically favored for this task for their efficiency and robustness, but they overlook the rich cell heterogeneity inherent in single-cell data. To address this gap, we introduce MixMIL, a framework integrating Generalized Linear Mixed Models (GLMM) and Multiple Instance Learning (MIL), upholding the advantages of linear models while modeling cell state heterogeneity. By leveraging predefined cell embeddings, MixMIL enhances computational efficiency and aligns with recent advancements in single-cell representation learning. Our empirical results reveal that MixMIL outperforms existing MIL models in single-cell datasets, uncovering new associations and elucidating biological mechanisms across different domains. △ Less

Submitted 8 March, 2024; v1 submitted 4 November, 2023; originally announced November 2023.

Comments: AISTATS 2024 Oral, Code: https://github.com/AIH-SGML/MixMIL

arXiv:2310.14935 [pdf]

Causal machine learning for single-cell genomics

Authors: Alejandro Tejada-Lapuerta, Paul Bertin, Stefan Bauer, Hananeh Aliee, Yoshua Bengio, Fabian J. Theis

Abstract: Advances in single-cell omics allow for unprecedented insights into the transcription profiles of individual cells. When combined with large-scale perturbation screens, through which specific biological mechanisms can be targeted, these technologies allow for measuring the effect of targeted perturbations on the whole transcriptome. These advances provide an opportunity to better understand the ca… ▽ More Advances in single-cell omics allow for unprecedented insights into the transcription profiles of individual cells. When combined with large-scale perturbation screens, through which specific biological mechanisms can be targeted, these technologies allow for measuring the effect of targeted perturbations on the whole transcriptome. These advances provide an opportunity to better understand the causative role of genes in complex biological processes such as gene regulation, disease progression or cellular development. However, the high-dimensional nature of the data, coupled with the intricate complexity of biological systems renders this task nontrivial. Within the machine learning community, there has been a recent increase of interest in causality, with a focus on adapting established causal techniques and algorithms to handle high-dimensional data. In this perspective, we delineate the application of these methodologies within the realm of single-cell genomics and their challenges. We first present the model that underlies most of current causal approaches to single-cell biology and discuss and challenge the assumptions it entails from the biological point of view. We then identify open problems in the application of causal approaches to single-cell data: generalising to unseen environments, learning interpretable models, and learning causal models of dynamics. For each problem, we discuss how various research directions - including the development of computational approaches and the adaptation of experimental protocols - may offer ways forward, or on the contrary pose some difficulties. With the advent of single cell atlases and increasing perturbation data, we expect causal models to become a crucial tool for informed experimental design. △ Less

Submitted 23 October, 2023; originally announced October 2023.

Comments: 35 pages, 7 figures, 3 tables, 1 box

arXiv:2307.00558 [pdf, other]

Conditionally Invariant Representation Learning for Disentangling Cellular Heterogeneity

Authors: Hananeh Aliee, Ferdinand Kapl, Soroor Hediyeh-Zadeh, Fabian J. Theis

Abstract: This paper presents a novel approach that leverages domain variability to learn representations that are conditionally invariant to unwanted variability or distractors. Our approach identifies both spurious and invariant latent features necessary for achieving accurate reconstruction by placing distinct conditional priors on latent features. The invariant signals are disentangled from noise by enf… ▽ More This paper presents a novel approach that leverages domain variability to learn representations that are conditionally invariant to unwanted variability or distractors. Our approach identifies both spurious and invariant latent features necessary for achieving accurate reconstruction by placing distinct conditional priors on latent features. The invariant signals are disentangled from noise by enforcing independence which facilitates the construction of an interpretable model with a causal semantic. By exploiting the interplay between data domains and labels, our method simultaneously identifies invariant features and builds invariant predictors. We apply our method to grand biological challenges, such as data integration in single-cell genomics with the aim of capturing biological variations across datasets with many samples, obtained from different conditions or multiple laboratories. Our approach allows for the incorporation of specific biological mechanisms, including gene programs, disease states, or treatment conditions into the data integration process, bridging the gap between the theoretical assumptions and real biological applications. Specifically, the proposed approach helps to disentangle biological signals from data biases that are unrelated to the target task or the causal explanation of interest. Through extensive benchmarking using large-scale human hematopoiesis and human lung cancer data, we validate the superiority of our approach over existing methods and demonstrate that it can empower deeper insights into cellular heterogeneity and the identification of disease cell states. △ Less

Submitted 2 July, 2023; originally announced July 2023.

arXiv:2306.17246 [pdf, other]

The power of motifs as inductive bias for learning molecular distributions

Authors: Johanna Sommer, Leon Hetzel, David Lüdke, Fabian Theis, Stephan Günnemann

Abstract: Machine learning for molecules holds great potential for efficiently exploring the vast chemical space and thus streamlining the drug discovery process by facilitating the design of new therapeutic molecules. Deep generative models have shown promising results for molecule generation, but the benefits of specific inductive biases for learning distributions over small graphs are unclear. Our study… ▽ More Machine learning for molecules holds great potential for efficiently exploring the vast chemical space and thus streamlining the drug discovery process by facilitating the design of new therapeutic molecules. Deep generative models have shown promising results for molecule generation, but the benefits of specific inductive biases for learning distributions over small graphs are unclear. Our study aims to investigate the impact of subgraph structures and vocabulary design on distribution learning, using small drug molecules as a case study. To this end, we introduce Subcover, a new subgraph-based fragmentation scheme, and evaluate it through a two-step variational auto-encoder. Our results show that Subcover's improved identification of chemically meaningful subgraphs leads to a relative improvement of the FCD score by 30%, outperforming previous methods. Our findings highlight the potential of Subcover to enhance the performance and scalability of existing methods, contributing to the advancement of drug discovery. △ Less

Submitted 4 April, 2023; originally announced June 2023.

Comments: Accepted for publication at the MLDD workshop, ICLR 2023

arXiv:2211.03793 [pdf, other]

Uncertainty Quantification for Atlas-Level Cell Type Transfer

Authors: Jan Engelmann, Leon Hetzel, Giovanni Palla, Lisa Sikkema, Malte Luecken, Fabian Theis

Abstract: Single-cell reference atlases are large-scale, cell-level maps that capture cellular heterogeneity within an organ using single cell genomics. Given their size and cellular diversity, these atlases serve as high-quality training data for the transfer of cell type labels to new datasets. Such label transfer, however, must be robust to domain shifts in gene expression due to measurement technique, l… ▽ More Single-cell reference atlases are large-scale, cell-level maps that capture cellular heterogeneity within an organ using single cell genomics. Given their size and cellular diversity, these atlases serve as high-quality training data for the transfer of cell type labels to new datasets. Such label transfer, however, must be robust to domain shifts in gene expression due to measurement technique, lab specifics and more general batch effects. This requires methods that provide uncertainty estimates on the cell type predictions to ensure correct interpretation. Here, for the first time, we introduce uncertainty quantification methods for cell type classification on single-cell reference atlases. We benchmark four model classes and show that currently used models lack calibration, robustness, and actionable uncertainty scores. Furthermore, we demonstrate how models that quantify uncertainty are better suited to detect unseen cell types in the setting of atlas-level cell type transfer. △ Less

Submitted 7 November, 2022; originally announced November 2022.

Comments: Workshop paper at the 2022 ICML Workshop on Computational Biology

arXiv:2205.07110 [pdf, other]

SystemMatch: optimizing preclinical drug models to human clinical outcomes via generative latent-space matching

Authors: Scott Gigante, Varsha G. Raghavan, Amanda M. Robinson, Robert A. Barton, Adeeb H. Rahman, Drausin F. Wulsin, Jacques Banchereau, Noam Solomon, Luis F. Voloch, Fabian J. Theis

Abstract: Translating the relevance of preclinical models ($\textit{in vitro}$, animal models, or organoids) to their relevance in humans presents an important challenge during drug development. The rising abundance of single-cell genomic data from human tumors and tissue offers a new opportunity to optimize model systems by their similarity to targeted human cell types in disease. In this work, we introduc… ▽ More Translating the relevance of preclinical models ($\textit{in vitro}$, animal models, or organoids) to their relevance in humans presents an important challenge during drug development. The rising abundance of single-cell genomic data from human tumors and tissue offers a new opportunity to optimize model systems by their similarity to targeted human cell types in disease. In this work, we introduce SystemMatch to assess the fit of preclinical model systems to an $\textit{in sapiens}$ target population and to recommend experimental changes to further optimize these systems. We demonstrate this through an application to developing $\textit{in vitro}$ systems to model human tumor-derived suppressive macrophages. We show with held-out $\textit{in vivo}$ controls that our pipeline successfully ranks macrophage subpopulations by their biological similarity to the target population, and apply this analysis to rank a series of 18 $\textit{in vitro}$ macrophage systems perturbed with a variety of cytokine stimulations. We extend this analysis to predict the behavior of 66 $\textit{in silico}$ model systems generated using a perturbational autoencoder and apply a $k$-medoids approach to recommend a subset of these model systems for further experimental development in order to fully explore the space of possible perturbations. Through this use case, we demonstrate a novel approach to model system development to generate a system more similar to human biology. △ Less

Submitted 14 May, 2022; originally announced May 2022.

Comments: Published at the MLDD workshop, ICLR 2022

arXiv:2204.13545 [pdf, other]

Predicting Cellular Responses to Novel Drug Perturbations at a Single-Cell Resolution

Authors: Leon Hetzel, Simon Böhm, Niki Kilbertus, Stephan Günnemann, Mohammad Lotfollahi, Fabian Theis

Abstract: Single-cell transcriptomics enabled the study of cellular heterogeneity in response to perturbations at the resolution of individual cells. However, scaling high-throughput screens (HTSs) to measure cellular responses for many drugs remains a challenge due to technical limitations and, more importantly, the cost of such multiplexed experiments. Thus, transferring information from routinely perform… ▽ More Single-cell transcriptomics enabled the study of cellular heterogeneity in response to perturbations at the resolution of individual cells. However, scaling high-throughput screens (HTSs) to measure cellular responses for many drugs remains a challenge due to technical limitations and, more importantly, the cost of such multiplexed experiments. Thus, transferring information from routinely performed bulk RNA HTS is required to enrich single-cell data meaningfully. We introduce chemCPA, a new encoder-decoder architecture to study the perturbational effects of unseen drugs. We combine the model with an architecture surgery for transfer learning and demonstrate how training on existing bulk RNA HTS datasets can improve generalisation performance. Better generalisation reduces the need for extensive and costly screens at single-cell resolution. We envision that our proposed method will facilitate more efficient experiment designs through its ability to generate in-silico hypotheses, ultimately accelerating drug discovery. △ Less

Submitted 30 December, 2022; v1 submitted 28 April, 2022; originally announced April 2022.

Comments: 10 pages. NeurIPS 2022 conference paper

arXiv:2104.11364 [pdf]

A field guide to cultivating computational biology

Authors: Anne E Carpenter, Casey S Greene, Piero Carnici, Benilton S Carvalho, Michiel de Hoon, Stacey Finley, Kim-Anh Le Cao, Jerry SH Lee, Luigi Marchionni, Suzanne Sindi, Fabian J Theis, Gregory P Way, Jean YH Yang, Elana J Fertig

Abstract: Biomedical research centers can empower basic discovery and novel therapeutic strategies by leveraging their large-scale datasets from experiments and patients. This data, together with new technologies to create and analyze it, has ushered in an era of data-driven discovery which requires moving beyond the traditional individual, single-discipline investigator research model. This interdisciplina… ▽ More Biomedical research centers can empower basic discovery and novel therapeutic strategies by leveraging their large-scale datasets from experiments and patients. This data, together with new technologies to create and analyze it, has ushered in an era of data-driven discovery which requires moving beyond the traditional individual, single-discipline investigator research model. This interdisciplinary niche is where computational biology thrives. It has matured over the past three decades and made major contributions to scientific knowledge and human health, yet researchers in the field often languish in career advancement, publication, and grant review. We propose solutions for individual scientists, institutions, journal publishers, funding agencies, and educators. △ Less

Submitted 22 April, 2021; originally announced April 2021.

arXiv:1910.01791 [pdf, other]

Conditional out-of-sample generation for unpaired data using trVAE

Authors: Mohammad Lotfollahi, Mohsen Naghipourfar, Fabian J. Theis, F. Alexander Wolf

Abstract: While generative models have shown great success in generating high-dimensional samples conditional on low-dimensional descriptors (learning e.g. stroke thickness in MNIST, hair color in CelebA, or speaker identity in Wavenet), their generation out-of-sample poses fundamental problems. The conditional variational autoencoder (CVAE) as a simple conditional generative model does not explicitly relat… ▽ More While generative models have shown great success in generating high-dimensional samples conditional on low-dimensional descriptors (learning e.g. stroke thickness in MNIST, hair color in CelebA, or speaker identity in Wavenet), their generation out-of-sample poses fundamental problems. The conditional variational autoencoder (CVAE) as a simple conditional generative model does not explicitly relate conditions during training and, hence, has no incentive of learning a compact joint distribution across conditions. We overcome this limitation by matching their distributions using maximum mean discrepancy (MMD) in the decoder layer that follows the bottleneck. This introduces a strong regularization both for reconstructing samples within the same condition and for transforming samples across conditions, resulting in much improved generalization. We refer to the architecture as \emph{transformer} VAE (trVAE). Benchmarking trVAE on high-dimensional image and tabular data, we demonstrate higher robustness and higher accuracy than existing approaches. In particular, we show qualitatively improved predictions for cellular perturbation response to treatment and disease based on high-dimensional single-cell gene expression data, by tackling previously problematic minority classes and multiple conditions. For generic tasks, we improve Pearson correlations of high-dimensional estimated means and variances with their ground truths from 0.89 to 0.97 and 0.75 to 0.87, respectively. △ Less

Submitted 30 October, 2019; v1 submitted 3 October, 2019; originally announced October 2019.

Comments: Added reference to Johansson et al. (2016) and removed sentences from Lopez et al. (2018) in the background section (see acknowledgements)

arXiv:1909.12550 [pdf]

Single-cell eQTLGen Consortium: a personalized understanding of disease

Authors: Monique G. P. van der Wijst, Dylan H. de Vries, Hilde E. Groot, Gosia Trynka, Chung-Chau Hon, Martijn C. Nawijn, Youssef Idaghdour, Pim van der Harst, Chun J. Ye, Joseph Powell, Fabian J. Theis, Ahmed Mahfouz, Matthias Heinig, Lude Franke

Abstract: In recent years, functional genomics approaches combining genetic information with bulk RNA-sequencing data have identified the downstream expression effects of disease-associated genetic risk factors through so-called expression quantitative trait locus (eQTL) analysis. Single-cell RNA-sequencing creates enormous opportunities for mapping eQTLs across different cell types and in dynamic processes… ▽ More In recent years, functional genomics approaches combining genetic information with bulk RNA-sequencing data have identified the downstream expression effects of disease-associated genetic risk factors through so-called expression quantitative trait locus (eQTL) analysis. Single-cell RNA-sequencing creates enormous opportunities for mapping eQTLs across different cell types and in dynamic processes, many of which are obscured when using bulk methods. The enormous increase in throughput and reduction in cost per cell now allow this technology to be applied to large-scale population genetics studies. Therefore, we have founded the single-cell eQTLGen consortium (sc-eQTLGen), aimed at pinpointing disease-causing genetic variants and identifying the cellular contexts in which they affect gene expression. Ultimately, this information can enable development of personalized medicine. Here, we outline the goals, approach, potential utility and early proofs-of-concept of the sc-eQTLGen consortium. We also provide a set of study design considerations for future single-cell eQTL studies. △ Less

Submitted 27 September, 2019; originally announced September 2019.

Comments: 26 pages, 5 figures, position paper of sc-eQTLGen consortium

arXiv:1810.04281 [pdf, other]

Fully integrative data analysis of NMR metabolic fingerprints with comprehensive patient data: a case report based on the German Chronic Kidney Disease (GCKD) study

Authors: Helena U. Zacharias, Michael Altenbuchinger, Stefan Solbrig, Andreas Schäfer, Mustafa Buyukozkan, Ulla T. Schultheiß, Fruzsina Kotsis, Anna Köttgen, Jan Krumsiek, Fabian J. Theis, Rainer Spang, Peter J. Oefner, Wolfram Gronwald, GCKD study investigators

Abstract: Omics data facilitate the gain of novel insights into the pathophysiology of diseases and, consequently, their diagnosis, treatment, and prevention. To that end, it is necessary to integrate omics data with other data types such as clinical, phenotypic, and demographic parameters of categorical or continuous nature. Here, we exemplify this data integration issue for a study on chronic kidney disea… ▽ More Omics data facilitate the gain of novel insights into the pathophysiology of diseases and, consequently, their diagnosis, treatment, and prevention. To that end, it is necessary to integrate omics data with other data types such as clinical, phenotypic, and demographic parameters of categorical or continuous nature. Here, we exemplify this data integration issue for a study on chronic kidney disease (CKD), where complex clinical and demographic parameters were assessed together with one-dimensional (1D) 1H NMR metabolic fingerprints. Routine analysis screens for associations of single metabolic features with clinical parameters, which requires confounding variables typically chosen by expert knowledge to be taken into account. This knowledge can be incomplete or unavailable. The results of this article are manifold. We introduce a framework for data integration that intrinsically adjusts for confounding variables. We give its mathematical and algorithmic foundation, provide a state-of-the-art implementation, and give several sanity checks. In particular, we show that the discovered associations remain significant after variable adjustment based on expert knowledge. In contrast, we illustrate that the discovery of associations in routine analysis can be biased by incorrect or incomplete expert knowledge in univariate screening approaches. Finally, we exemplify how our data integration approach reveals important associations between CKD comorbidities and metabolites. Moreover, we evaluate the predictive performance of the estimated models on independent validation data and contrast the results with a naive screening approach. △ Less

Submitted 8 October, 2018; originally announced October 2018.

arXiv:1511.01658 [pdf, other]

A simulation-based approach for solving optimisation problems with ODE-type steady state constraints

Authors: Anna Fiedler, Fabian J. Theis, Jan Hasenauer

Abstract: Ordinary differential equations (ODEs) are widely used to model biological, (bio-)chemical and technical processes. The parameters of these ODEs are often estimated from experimental data using ODE-constrained optimisation. This article proposes a simple simulation-based approach for solving optimisation problems with steady state constraints relying on an ODE. This simulation-based optimisation m… ▽ More Ordinary differential equations (ODEs) are widely used to model biological, (bio-)chemical and technical processes. The parameters of these ODEs are often estimated from experimental data using ODE-constrained optimisation. This article proposes a simple simulation-based approach for solving optimisation problems with steady state constraints relying on an ODE. This simulation-based optimisation method is tailored to the problem structure and exploits the local geometry of the steady state manifold and its stability properties. A parameterisation of the steady state manifold is not required. We prove local convergence of the method for locally strictly convex objective functions. Effciency and reliability of the proposed method are demonstrated in two examples. The proposed method demonstrated better convergence properties than existing general purpose methods and a significantly higher number of converged starts per time. △ Less

Submitted 5 November, 2015; originally announced November 2015.

Comments: 11 pages, 3 figures

arXiv:1506.06392 [pdf, other]

Data-driven modelling of biological multi-scale processes

Authors: Jan Hasenauer, Nick Jagiella, Sabrina Hross, Fabian J. Theis

Abstract: Biological processes involve a variety of spatial and temporal scales. A holistic understanding of many biological processes therefore requires multi-scale models which capture the relevant properties on all these scales. In this manuscript we review mathematical modelling approaches used to describe the individual spatial scales and how they are integrated into holistic models. We discuss the rel… ▽ More Biological processes involve a variety of spatial and temporal scales. A holistic understanding of many biological processes therefore requires multi-scale models which capture the relevant properties on all these scales. In this manuscript we review mathematical modelling approaches used to describe the individual spatial scales and how they are integrated into holistic models. We discuss the relation between spatial and temporal scales and the implication of that on multi-scale modelling. Based upon this overview over state-of-the-art modelling approaches, we formulate key challenges in mathematical and computational modelling of biological multi-scale and multi-physics processes. In particular, we considered the availability of analysis tools for multi-scale models and model-based multi-scale data integration. We provide a compact review of methods for model-based data integration and model-based hypothesis testing. Furthermore, novel approaches and recent trends are discussed, including computation time reduction using reduced order and surrogate models, which contribute to the solution of inference problems. We conclude the manuscript by providing a few ideas for the development of tailored multi-scale inference methods. △ Less

Submitted 21 June, 2015; originally announced June 2015.

Comments: This manuscript will appear in the Journal of Coupled Systems and Multiscale Dynamics (American Scientific Publishers)

MSC Class: 92Bxx; 93A30

arXiv:1407.2112 [pdf]

MCA: Multiresolution Correlation Analysis, a graphical tool for subpopulation identification in single-cell gene expression data

Authors: Justin Feigelman, Fabian J. Theis, Carsten Marr

Abstract: Background: Biological data often originate from samples containing mixtures of subpopulations, corresponding e.g. to distinct cellular phenotypes. However, identification of distinct subpopulations may be difficult if biological measurements yield distributions that are not easily separable. Results: We present Multiresolution Correlation Analysis (MCA), a method for visually identifying subpopul… ▽ More Background: Biological data often originate from samples containing mixtures of subpopulations, corresponding e.g. to distinct cellular phenotypes. However, identification of distinct subpopulations may be difficult if biological measurements yield distributions that are not easily separable. Results: We present Multiresolution Correlation Analysis (MCA), a method for visually identifying subpopulations based on the local pairwise correlation between covariates, without needing to define an a priori interaction scale. We demonstrate that MCA facilitates the identification of differentially regulated subpopulations in simulated data from a small gene regulatory network, followed by application to previously published single-cell qPCR data from mouse embryonic stem cells. We show that MCA recovers previously identified subpopulations, provides additional insight into the underlying correlation structure, reveals potentially spurious compartmentalizations, and provides insight into novel subpopulations. Conclusions: MCA is a useful method for the identification of subpopulations in low-dimensional expression data, as emerging from qPCR or FACS measurements. With MCA it is possible to investigate the robustness of covariate correlations with respect subpopulations, graphically identify outliers, and identify factors contributing to differential regulation between pairs of covariates. MCA thus provides a framework for investigation of expression correlations for genes of interests and biological hypothesis generation. △ Less

Submitted 8 July, 2014; originally announced July 2014.

Comments: BioVis 2014 conference

arXiv:1112.0121 [pdf, other]

doi 10.1016/j.bpj.2011.11.4000

Stability and multi-attractor dynamics of a toggle switch based on a two-stage model of stochastic gene expression

Authors: Michael K. Strasser, Fabian J. Theis, Carsten Marr

Abstract: A toggle switch consists of two genes that mutually repress each other. This regulatory motif is active during cell differentiation and is thought to act as a memory device, being able to choose and maintain cell fate decisions. In this contribution, we study the stability and dynamics of a two-stage gene expression switch within a probabilistic framework inspired by the properties of the Pu/Gata… ▽ More A toggle switch consists of two genes that mutually repress each other. This regulatory motif is active during cell differentiation and is thought to act as a memory device, being able to choose and maintain cell fate decisions. In this contribution, we study the stability and dynamics of a two-stage gene expression switch within a probabilistic framework inspired by the properties of the Pu/Gata toggle switch in myeloid progenitor cells. We focus on low mRNA numbers, high protein abundance and monomeric transcription factor binding. Contrary to the expectation from a deterministic description, this switch shows complex multi-attractor dy- namics without autoactivation and cooperativity. Most importantly, the four attractors of the system, which only emerge in a probabilistic two-stage description, can be identified with committed and primed states in cell differentiation. We first study the dynamics of the system and infer the mechanisms that move the system between attractors using both the quasi-potential and the probability flux of the system. Second, we show that the residence times of the system in one of the committed attractors are geometrically distributed and provide an analytical expression of the distribution. Most importantly we find that the mean residence time increases linearly with the mean protein level. Finally, we study the implications of this distribution for the stability of a switch and discuss the influence of the stability on a specific cell differentiation mechanism. Our model explains lineage priming and proposes the need of either high protein numbers or long term modifications such as chromatin remodeling to achieve stable cell fate decisions. Notably we present a system with high protein abundance that nevertheless requires a probabilistic description to exhibit multistability, complex switching dynamics and lineage priming. △ Less

Submitted 1 December, 2011; originally announced December 2011.

Comments: to appear in the Biophysical Journal

arXiv:1005.4397 [pdf, other]

doi 10.1371/journal.pcbi.1000836

Patterns of subnet usage reveal distinct scales of regulation in the transcriptional regulatory network of Escherichia coli

Authors: Carsten Marr, Fabian J. Theis, Larry S. Liebovitch, Marc-Thorsten Hütt

Abstract: The set of regulatory interactions between genes, mediated by transcription factors, forms a species' transcriptional regulatory network (TRN). By comparing this network with measured gene expression data one can identify functional properties of the TRN and gain general insight into transcriptional control. We define the subnet of a node as the subgraph consisting of all nodes topologically downs… ▽ More The set of regulatory interactions between genes, mediated by transcription factors, forms a species' transcriptional regulatory network (TRN). By comparing this network with measured gene expression data one can identify functional properties of the TRN and gain general insight into transcriptional control. We define the subnet of a node as the subgraph consisting of all nodes topologically downstream of the node, including itself. Using a large set of microarray expression data of the bacterium Escherichia coli, we find that the gene expression in different subnets exhibits a structured pattern in response to environmental changes and genotypic mutation. Subnets with less changes in their expression pattern have a higher fraction of feed-forward loop motifs and a lower fraction of small RNA targets within them. Our study implies that the TRN consists of several scales of regulatory organization: 1) subnets with more varying gene expression controlled by both transcription factors and post-transcriptional RNA regulation, and 2) subnets with less varying gene expression having more feed-forward loops and less post-transcriptional RNA regulation. △ Less

Submitted 24 May, 2010; originally announced May 2010.

Comments: 14 pages, 8 figures, to be published in PLoS Computational Biology

Showing 1–23 of 23 results for author: Theis, F