-
In-silico biological discovery with large perturbation models
Authors:
Djordje Miladinovic,
Tobias Höppe,
Mathieu Chevalley,
Andreas Georgiou,
Lachlan Stuart,
Arash Mehrjou,
Marcus Bantscheff,
Bernhard Schölkopf,
Patrick Schwab
Abstract:
Data generated in perturbation experiments link perturbations to the changes they elicit and therefore contain information relevant to numerous biological discovery tasks -- from understanding the relationships between biological entities to developing therapeutics. However, these data encompass diverse perturbations and readouts, and the complex dependence of experimental outcomes on their biological context makes it challenging to integrate insights across experiments. Here, we present the Large Perturbation Model (LPM), a deep-learning model that integrates multiple, heterogeneous perturbation experiments by representing perturbation, readout, and context as disentangled dimensions. LPM outperforms existing methods across multiple biological discovery tasks, including predicting post-perturbation transcriptomes of unseen experiments, identifying shared molecular mechanisms of action between chemical and genetic perturbations, and facilitating the inference of gene-gene interaction networks.
Submitted 30 March, 2025;
originally announced March 2025.
-
Multi-megabase scale genome interpretation with genetic language models
Authors:
Frederik Träuble,
Lachlan Stuart,
Andreas Georgiou,
Pascal Notin,
Arash Mehrjou,
Ron Schwessinger,
Mathieu Chevalley,
Kim Branson,
Bernhard Schölkopf,
Cornelia van Duijn,
Debora Marks,
Patrick Schwab
Abstract:
Understanding how molecular changes caused by genetic variation drive disease risk is crucial for deciphering disease mechanisms. However, interpreting genome sequences is challenging because of the vast size of the human genome, and because its consequences manifest across a wide range of cells, tissues and scales -- spanning from molecular to whole organism level. Here, we present Phenformer, a multi-scale genetic language model that learns to generate mechanistic hypotheses as to how differences in genome sequence lead to disease-relevant changes in expression across cell types and tissues directly from DNA sequences of up to 88 million base pairs. Using whole genome sequencing data from more than 150 000 individuals, we show that Phenformer generates mechanistic hypotheses about disease-relevant cell and tissue types that match literature better than existing state-of-the-art methods, while using only sequence data. Furthermore, disease risk predictors enriched by Phenformer show improved prediction performance and generalisation to diverse populations. Accurate multi-megabase scale interpretation of whole genomes without additional experimental data enables both a deeper understanding of molecular mechanisms involved in disease and improved disease risk prediction at the level of individuals.
Submitted 13 January, 2025;
originally announced January 2025.
-
Efficient Differentiable Discovery of Causal Order
Authors:
Mathieu Chevalley,
Arash Mehrjou,
Patrick Schwab
Abstract:
In the algorithm Intersort, Chevalley et al. (2024) proposed a score-based method to discover the causal order of variables in a Directed Acyclic Graph (DAG) model, leveraging interventional data to outperform existing methods. However, as a score-based method over the permutahedron, Intersort is computationally expensive and non-differentiable, limiting its ability to be utilised in problems involving large-scale datasets, such as those in genomics and climate models, or to be integrated into end-to-end gradient-based learning frameworks. We address this limitation by reformulating Intersort using differentiable sorting and ranking techniques. Our approach enables scalable and differentiable optimization of causal orderings, allowing the continuous score function to be incorporated as a regularizer in downstream tasks. Empirical results demonstrate that causal discovery algorithms benefit significantly from regularizing on the causal order, underscoring the effectiveness of our method. Our work opens the door to efficiently incorporating regularization for causal order into the training of differentiable models and thereby addresses a long-standing limitation of purely associational supervised learning.
Submitted 11 October, 2024;
originally announced October 2024.
-
Deriving Causal Order from Single-Variable Interventions: Guarantees & Algorithm
Authors:
Mathieu Chevalley,
Patrick Schwab,
Arash Mehrjou
Abstract:
Targeted and uniform interventions to a system are crucial for unveiling causal relationships. While several methods have been developed to leverage interventional data for causal structure learning, their practical application in real-world scenarios often remains challenging. Recent benchmark studies have highlighted these difficulties, even when large numbers of single-variable intervention samples are available. In this work, we demonstrate, both theoretically and empirically, that such datasets contain a wealth of causal information that can be effectively extracted under realistic assumptions about the data distribution. More specifically, we introduce a novel variant of interventional faithfulness, which relies on comparisons between the marginal distributions of each variable across observational and interventional settings, and we introduce a score on causal orders. Under this assumption, we are able to prove strong theoretical guarantees on the optimum of our score that also hold for large-scale settings. To empirically verify our theory, we introduce Intersort, an algorithm designed to infer the causal order from datasets containing large numbers of single-variable interventions by approximately optimizing our score. Intersort outperforms baselines (GIES, DCDI, PC and EASE) on almost all simulated data settings replicating common benchmarks in the field. Our proposed novel approach to modeling interventional datasets thus offers a promising avenue for advancing causal inference, highlighting significant potential for further enhancements under realistic assumptions.
Submitted 19 May, 2025; v1 submitted 28 May, 2024;
originally announced May 2024.
-
The CausalBench challenge: A machine learning contest for gene network inference from single-cell perturbation data
Authors:
Mathieu Chevalley,
Jacob Sackett-Sanders,
Yusuf Roohani,
Pascal Notin,
Artemy Bakulin,
Dariusz Brzezinski,
Kaiwen Deng,
Yuanfang Guan,
Justin Hong,
Michael Ibrahim,
Wojciech Kotlowski,
Marcin Kowiel,
Panagiotis Misiakos,
Achille Nazaret,
Markus Püschel,
Chris Wendler,
Arash Mehrjou,
Patrick Schwab
Abstract:
In drug discovery, mapping interactions between genes within cellular systems is a crucial early step. Such maps are not only foundational for understanding the molecular mechanisms underlying disease biology but also pivotal for formulating hypotheses about potential targets for new medicines. Recognizing the need to elevate the construction of these gene-gene interaction networks, especially from large-scale, real-world datasets of perturbed single cells, the CausalBench Challenge was initiated. This challenge aimed to inspire the machine learning community to enhance state-of-the-art methods, emphasizing better utilization of expansive genetic perturbation data. Using the framework provided by the CausalBench benchmark, participants were tasked with refining the current methodologies or proposing new ones. This report provides an analysis and summary of the methods submitted during the challenge to give a partial image of the state of the art at the time of the challenge. Notably, the winning solutions significantly improved performance compared to previous baselines, establishing a new state of the art for this critical task in biology and medicine.
Submitted 19 May, 2025; v1 submitted 29 August, 2023;
originally announced August 2023.
-
CausalBench: A Large-scale Benchmark for Network Inference from Single-cell Perturbation Data
Authors:
Mathieu Chevalley,
Yusuf Roohani,
Arash Mehrjou,
Jure Leskovec,
Patrick Schwab
Abstract:
Causal inference is a vital aspect of multiple scientific disciplines and is routinely applied to high-impact applications such as medicine. However, evaluating the performance of causal inference methods in real-world environments is challenging due to the need for observations under both interventional and control conditions. Traditional evaluations conducted on synthetic datasets do not reflect the performance in real-world systems. To address this, we introduce CausalBench, a benchmark suite for evaluating network inference methods on real-world interventional data from large-scale single-cell perturbation experiments. CausalBench incorporates biologically-motivated performance metrics, including new distribution-based interventional metrics. A systematic evaluation of state-of-the-art causal inference methods using our CausalBench suite highlights how poor scalability of current methods limits performance. Moreover, methods that use interventional information do not outperform those that only use observational data, contrary to what is observed on synthetic benchmarks. Thus, CausalBench opens new avenues in causal network inference research and provides a principled and reliable way to track progress in leveraging real-world interventional data.
Submitted 3 July, 2023; v1 submitted 31 October, 2022;
originally announced October 2022.
-
Invariant Causal Mechanisms through Distribution Matching
Authors:
Mathieu Chevalley,
Charlotte Bunne,
Andreas Krause,
Stefan Bauer
Abstract:
Learning representations that capture the underlying data-generating process is a key problem for the data-efficient and robust use of neural networks. One key property for robustness that the learned representation should capture, and which has recently received a lot of attention, is described by the notion of invariance. In this work, we provide a causal perspective and a new algorithm for learning invariant representations. Empirically, we show that this algorithm works well on a diverse set of tasks; in particular, we observe state-of-the-art performance on domain generalization, where we are able to significantly boost the score of existing models.
Submitted 23 June, 2022;
originally announced June 2022.
-
By the user, for the user: A user-centric approach to quantifying the privacy of websites
Authors:
Matius Chairani,
Mathieu Chevalley,
Abderrahmane Lazraq,
Sruti Bhagavatula
Abstract:
Third-party tracking is common on almost all commercially operated websites. Prior work has studied in detail the extent of third-party tracking on the web, detection of third-party trackers, and defending against third-party tracking. Existing research and tools have also attempted to inform web users of trackers and the extent of their privacy violations. However, existing tools do not take into account users' perceptions of and understanding of the extent of trackers on the web. Taking these factors into account is important for the usability of such tools so that users can be aware and protect themselves to a reasonable and necessary extent that aligns with their overall comfort with trackers. In this paper, we elicit user perceptions and preferences about different trackers on various websites through an online survey of 43 users. We use this data to bootstrap a privacy scoring system. This scoring system weights the usage of trackers and the dispersion of user data within a page to third parties, with the type of website being visited. Our work presents a proof-of-concept methodology and tool to calculate a user-centric privacy score with preliminary bootstrap user data. We conclude with concrete future directions.
Submitted 15 November, 2019; v1 submitted 13 November, 2019;
originally announced November 2019.