-
Optimizing Prognostic Biomarker Discovery in Pancreatic Cancer Through Hybrid Ensemble Feature Selection and Multi-Omics Data
Authors:
John Zobolas,
Anne-Marie George,
Alberto López,
Sebastian Fischer,
Marc Becker,
Tero Aittokallio
Abstract:
Prediction of patient survival using high-dimensional multi-omics data requires systematic feature selection methods that ensure predictive performance, sparsity, and reliability for prognostic biomarker discovery. We developed a hybrid ensemble feature selection (hEFS) approach that combines data subsampling with multiple prognostic models, integrating both embedded and wrapper-based strategies for survival prediction. Omics features are ranked using a voting-theory-inspired aggregation mechanism across models and subsamples, while the optimal number of features is selected via a Pareto front, balancing predictive accuracy and model sparsity without any user-defined thresholds. When applied to multi-omics datasets from three pancreatic cancer cohorts, hEFS identifies significantly fewer and more stable biomarkers compared to the conventional, late-fusion CoxLasso models, while maintaining comparable discrimination performance. Implemented within the open-source mlr3fselect R package, hEFS offers a robust, interpretable, and clinically valuable tool for prognostic modelling and biomarker discovery in high-dimensional survival settings.
Submitted 2 September, 2025;
originally announced September 2025.
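The hEFS abstract above describes two steps that a small example can make concrete: aggregating feature selections across models and subsamples by voting, and choosing the number of features on a Pareto front of sparsity against accuracy. The following is a minimal Python sketch of those two ideas only, not the mlr3fselect implementation. Approval voting is assumed here as one possible voting rule (the abstract does not name one), and the function names, feature names, and scores are hypothetical.

```python
from collections import Counter


def approval_ranking(selected_sets):
    """Rank features by how many runs selected them (approval voting, an assumed rule)."""
    votes = Counter(f for s in selected_sets for f in set(s))
    # Descending vote count, then feature name so ties are deterministic.
    return sorted(votes, key=lambda f: (-votes[f], f))


def pareto_front(points):
    """Return non-dominated (n_features, score) pairs.

    Fewer features and a higher score are both preferred.
    """
    front = []
    # Sort by feature count ascending, best score first within a count.
    for n, score in sorted(points, key=lambda p: (p[0], -p[1])):
        # Keep a point only if it strictly improves on every sparser point.
        if not front or score > front[-1][1]:
            front.append((n, score))
    return front


# Hypothetical selections from three model/subsample runs.
runs = [{"g1", "g2"}, {"g1", "g3"}, {"g1", "g2", "g4"}]
ranking = approval_ranking(runs)

# Hypothetical (number of top-ranked features, discrimination score) pairs.
candidates = [(1, 0.60), (2, 0.66), (3, 0.65), (4, 0.70), (5, 0.70)]
front = pareto_front(candidates)
```

In this sketch every point on the front is a defensible trade-off; how a single point is then picked from the front is not specified in the abstract and is left out here.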
-
Tutorial on survival modeling with applications to omics data
Authors:
Zhi Zhao,
John Zobolas,
Manuela Zucknick,
Tero Aittokallio
Abstract:
Motivation: Identification of genomic, molecular and clinical markers prognostic of patient survival is important for developing personalized disease prevention, diagnostic and treatment approaches. Modern omics technologies have made it possible to investigate the prognostic impact of markers at multiple molecular levels, including genomics, epigenomics, transcriptomics, proteomics and metabolomics, and how these potential risk factors complement clinical characterization of patient outcomes for survival prognosis. However, the massive sizes of the omics data sets, along with their correlation structures, pose challenges for studying relationships between the molecular information and patients' survival outcomes. Results: We present a general workflow for survival analysis that is applicable to high-dimensional omics data as inputs when identifying survival-associated features and validating survival models. In particular, we focus on the commonly used Cox-type penalized regressions and hierarchical Bayesian models for feature selection in survival analysis, which are especially useful for high-dimensional data, but the framework is applicable more generally. Availability and implementation: A step-by-step R tutorial using The Cancer Genome Atlas survival and omics data for the execution and evaluation of survival models has been made available at https://ocbe-uio.github.io/survomics/survomics.html.
Submitted 4 March, 2024; v1 submitted 24 February, 2023;
originally announced February 2023.
-
Boolean function metrics can assist modelers to check and choose logical rules
Authors:
John Zobolas,
Pedro T. Monteiro,
Martin Kuiper,
Åsmund Flobak
Abstract:
Computational models of biological processes provide one of the most powerful methods for a detailed analysis of the mechanisms that drive the behavior of complex systems. Logic-based modeling has enhanced our understanding and interpretation of those systems. Defining rules that determine how the output activity of biological entities is regulated by their respective inputs has proven to be challenging, due to increasingly larger models and the presence of noise in data, allowing multiple model parameterizations to fit the experimental observations.
We present several Boolean function metrics that provide modelers with the appropriate framework to analyze the impact of a particular model parameterization. We demonstrate the link between a semantic characterization of a Boolean function and its consistency with the model's underlying regulatory structure. We further define the properties that outline such consistency and show that several of the Boolean functions under study violate them, questioning their biological plausibility and subsequent use. We also illustrate that regulatory functions can differ substantially in their asymptotic output behavior: some are biased towards specific Boolean outcomes, while others depend on the ratio between activating and inhibitory regulators.
Application results show that in a specific cancer signaling network, the function bias can be used to guide the choice of logical operators for a model that matches data observations. Moreover, graph analysis indicates that the standardized Boolean function bias becomes more prominent with increasing numbers of regulators, confirming that rule specification can effectively determine regulatory outcome despite the complex dynamics of biological networks.
Submitted 2 April, 2021;
originally announced April 2021.
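The notion of a Boolean function being biased towards one outcome, mentioned in the abstract above, can be illustrated by counting how often a regulatory rule evaluates to true over all input configurations. The sketch below is an illustration under stated assumptions, not the paper's metrics: the two rules shown ("any activator present AND no inhibitor present", and "any activator present OR no inhibitor present") are chosen here as examples, and the names `truth_density`, `and_not`, and `or_not` are hypothetical.

```python
from itertools import product


def truth_density(fn, n_act, n_inh):
    """Fraction of all input configurations for which fn returns True."""
    total = 0
    true_count = 0
    for bits in product([False, True], repeat=n_act + n_inh):
        acts, inhs = bits[:n_act], bits[n_act:]
        true_count += fn(acts, inhs)  # True counts as 1, False as 0
        total += 1
    return true_count / total


def and_not(acts, inhs):
    """Assumed example rule: active if any activator is on and no inhibitor is on."""
    return any(acts) and not any(inhs)


def or_not(acts, inhs):
    """Assumed example rule: active if any activator is on or no inhibitor is on."""
    return any(acts) or not any(inhs)


# One activator and one inhibitor, then three of each.
small = (truth_density(and_not, 1, 1), truth_density(or_not, 1, 1))
large = (truth_density(and_not, 3, 3), truth_density(or_not, 3, 3))
```

With one activator and one inhibitor the two rules are true in 25% and 75% of configurations respectively; with three of each the gap widens to about 11% and 89%, which matches the abstract's remark that bias becomes more prominent as the number of regulators grows.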