-
Leakage and Interpretability in Concept-Based Models
Authors:
Enrico Parisini,
Tapabrata Chakraborti,
Chris Harbron,
Ben D. MacArthur,
Christopher R. S. Banerji
Abstract:
Concept Bottleneck Models aim to improve interpretability by predicting high-level intermediate concepts, representing a promising approach for deployment in high-risk scenarios. However, they are known to suffer from information leakage, whereby models exploit unintended information encoded within the learned concepts. We introduce an information-theoretic framework to rigorously characterise and quantify leakage, and define two complementary measures: the concepts-task leakage (CTL) and interconcept leakage (ICL) scores. We show that these measures are strongly predictive of model behaviour under interventions and outperform existing alternatives in robustness and reliability. Using this framework, we identify the primary causes of leakage and provide strong evidence that Concept Embedding Models exhibit substantial leakage regardless of the choice of hyperparameters. Finally, we propose practical guidelines for designing concept-based models to reduce leakage and ensure interpretability.
Submitted 19 May, 2025; v1 submitted 18 April, 2025;
originally announced April 2025.
-
Anonymising Clinical Data for Secondary Use
Authors:
Irene Ferreira,
Chris Harbron,
Alex Hughes,
Tamsin Sargood,
Christoph Gerlinger
Abstract:
Secondary use of data already collected in clinical studies has become increasingly popular in recent years, with the commitment of the pharmaceutical industry and many academic institutions in Europe and the US to provide access to their clinical trial data. Whilst this clearly provides societal benefit in helping to progress medical research, it has to be balanced against the protection of subjects' privacy. There are two main scenarios for sharing subject data: within Clinical Study Reports and as Individual Patient Level Data. These scenarios have different associated risks and generally require different approaches. In any data sharing scenario, there is a trade-off between data utility and the risk of subject re-identification, and achieving this balance is key. Quantitative metrics can guide the amount of de-identification required and new technologies may also start to provide alternative ways to achieve the risk-utility balance.
Submitted 17 May, 2023;
originally announced July 2023.
-
A Complete Characterisation of Structured Missingness
Authors:
James Jackson,
Robin Mitra,
Niels Hagenbuch,
Sarah McGough,
Chris Harbron
Abstract:
Our capacity to process large complex data sources is ever-increasing, providing us with new, important applied research questions to address, such as how to handle missing values in large-scale databases. Mitra et al. (2023) noted the phenomenon of Structured Missingness (SM), in which missingness has an underlying structure. Existing taxonomies for defining missingness mechanisms typically assume that variables' missingness indicator vectors $M_1$, $M_2$, ..., $M_p$ are independent after conditioning on the relevant portion of the data matrix $\mathbf{X}$. As this is often unsuitable for characterising SM in multivariate settings, we introduce a taxonomy for SM, where each $M_j$ can depend on $\mathbf{M}_{-j}$ (i.e., all missingness indicator vectors except $M_j$), in addition to $\mathbf{X}$. We embed this new framework within the well-established decomposition of mechanisms into MCAR, MAR, and MNAR (Rubin, 1976), allowing us to recast mechanisms into a broader setting, where we can consider the combined effect of $\mathbf{X}$ and $\mathbf{M}_{-j}$ on $M_j$. We also demonstrate, via simulations, the impact of SM on inference and prediction, and consider contextual instances of SM arising in a de-identified nationwide (US-based) clinico-genomic database (CGDB). We hope to stimulate interest in SM, and encourage timely research into this phenomenon.
Submitted 5 July, 2023;
originally announced July 2023.
-
Learning from data with structured missingness
Authors:
Robin Mitra,
Sarah F. McGough,
Tapabrata Chakraborti,
Chris Holmes,
Ryan Copping,
Niels Hagenbuch,
Stefanie Biedermann,
Jack Noonan,
Brieuc Lehmann,
Aditi Shenvi,
Xuan Vinh Doan,
David Leslie,
Ginestra Bianconi,
Ruben Sanchez-Garcia,
Alisha Davies,
Maxine Mackintosh,
Eleni-Rosalina Andrinopoulou,
Anahid Basiri,
Chris Harbron,
Ben D. MacArthur
Abstract:
Missing data are an unavoidable complication in many machine learning tasks. When data are 'missing at random' there exists a range of tools and techniques to deal with the issue. However, as machine learning studies become more ambitious, and seek to learn from ever-larger volumes of heterogeneous data, an increasingly encountered problem arises in which missing values exhibit an association or structure, either explicitly or implicitly. Such 'structured missingness' raises a range of challenges that have not yet been systematically addressed, and presents a fundamental hindrance to machine learning at scale. Here, we outline the current literature and propose a set of grand challenges in learning from data with structured missingness.
Submitted 3 April, 2023;
originally announced April 2023.
-
The use of external controls: To what extent can it currently be recommended?
Authors:
Hans Ulrich Burger,
Christoph Gerlinger,
Chris Harbron,
Armin Koch,
Martin Posch,
Justine Rochon,
Anja Schiel
Abstract:
With more and better clinical data being captured outside of clinical studies and greater data sharing of clinical studies, external controls may become a more attractive alternative to randomized clinical trials. Both industry and regulators recognize that in situations where a randomized study cannot be performed, external controls can provide the needed contextualization to allow a better interpretation of studies without a randomized control. It is also agreed that external controls will not fully replace randomized clinical trials as the gold standard for formal proof of efficacy in drug development and the yardstick of clinical research. However, it remains unclear in which situations conclusions about efficacy and a positive benefit/risk can reliably be based on the use of an external control. This paper will provide an overview of the types of external control, their applications and the different sources of bias their use may incur, and discuss potential mitigation steps. It will also give recommendations on how the use of external controls can be justified.
Submitted 16 September, 2022;
originally announced September 2022.
-
A multi-arm multi-stage platform design that allows pre-planned addition of arms while still controlling the family-wise error
Authors:
Peter Greenstreet,
Thomas Jaki,
Alun Bedding,
Chris Harbron,
Pavel Mozgunov
Abstract:
There is growing interest in platform trials that allow new treatment arms to be added as the trial progresses and treatments to be stopped part way through the trial, either for lack of benefit/futility or for superiority. In some situations, platform trials need to guarantee that error rates are controlled. This paper presents a multi-stage design that allows additional arms to be added in a platform trial in a pre-planned fashion, while still controlling the family-wise error rate. A method is given to compute the sample size required to achieve a desired level of power and we show how the distribution of the sample size and the expected sample size can be found. A motivating trial is presented which focuses on two settings, with the first being a set number of stages per active treatment arm and the second being a set total number of stages, with treatments that are added later getting fewer stages. Through this example we show that the proposed method results in a smaller sample size while still controlling the errors compared to running multiple separate trials.
Submitted 12 December, 2021;
originally announced December 2021.
-
A meta-analytic framework to adjust for bias in external control studies
Authors:
Devin Incerti,
Michael T Bretscher,
Ray Lin,
Chris Harbron
Abstract:
While randomized controlled trials (RCTs) are the gold standard for estimating treatment effects in medical research, there is increasing use of and interest in using real-world data for drug development. One such use case is the construction of external control arms for evaluation of efficacy in single-arm trials, particularly in cases where randomization is either infeasible or unethical. However, it is well known that treated patients in non-randomized studies may not be comparable to control patients -- on either measured or unmeasured variables -- and that the underlying population differences between the two groups may result in biased treatment effect estimates as well as increased variability in estimation. To address these challenges for analyses of time-to-event outcomes, we developed a meta-analytic framework that uses historical reference studies to adjust a log hazard ratio estimate in a new external control study for its additional bias and variability. The set of historical studies is formed by constructing external control arms for historical RCTs, and a meta-analysis compares the trial controls to the external control arms. Importantly, a prospective external control study can be performed independently of the meta-analysis using standard causal inference techniques for observational data. We illustrate our approach with a simulation study and an empirical example based on reference studies for advanced non-small cell lung cancer. In our empirical analysis, external control patients had lower survival than trial controls (hazard ratio: 0.907), but our methodology is able to correct for this bias. An implementation of our approach is available in the R package ecmeta.
Submitted 7 October, 2021;
originally announced October 2021.