Showing 1–11 of 11 results for author: Boulesteix, A

Searching in archive cs.

  1. arXiv:2412.03491

    stat.ML cs.LG stat.ME

    Beyond algorithm hyperparameters: on preprocessing hyperparameters and associated pitfalls in machine learning applications

    Authors: Christina Sauer, Anne-Laure Boulesteix, Luzia Hanßum, Farina Hodiamont, Claudia Bausewein, Theresa Ullmann

    Abstract: Adequately generating and evaluating prediction models based on supervised machine learning (ML) is often challenging, especially for less experienced users in applied research areas. Special attention is required in settings where the model generation process involves hyperparameter tuning, i.e. data-driven optimization of different types of hyperparameters to improve the predictive performance o…

    Submitted 4 December, 2024; originally announced December 2024.

  2. arXiv:2409.18836

    stat.ML cs.LG

    Constructing Confidence Intervals for 'the' Generalization Error -- a Comprehensive Benchmark Study

    Authors: Hannah Schulz-Kümpel, Sebastian Fischer, Roman Hornung, Anne-Laure Boulesteix, Thomas Nagler, Bernd Bischl

    Abstract: When assessing the quality of prediction models in machine learning, confidence intervals (CIs) for the generalization error, which measures predictive performance, are a crucial tool. Luckily, there exist many methods for computing such CIs and new promising approaches are continuously being proposed. Typically, these methods combine various resampling procedures, most popular among them cross-va…

    Submitted 15 January, 2025; v1 submitted 27 September, 2024; originally announced September 2024.

  3. arXiv:2405.02200

    cs.LG stat.ML

    Position: Why We Must Rethink Empirical Research in Machine Learning

    Authors: Moritz Herrmann, F. Julian D. Lange, Katharina Eggensperger, Giuseppe Casalicchio, Marcel Wever, Matthias Feurer, David Rügamer, Eyke Hüllermeier, Anne-Laure Boulesteix, Bernd Bischl

    Abstract: We warn against a common but incomplete understanding of empirical research in machine learning that leads to non-replicable results, makes findings unreliable, and threatens to undermine progress in the field. To overcome this alarming situation, we call for more awareness of the plurality of ways of gaining knowledge experimentally but also of some epistemic limitations. In particular, we argue…

    Submitted 25 May, 2024; v1 submitted 3 May, 2024; originally announced May 2024.

    Comments: 20 pages, accepted for publication at ICML 2024, camera-ready version

  4. arXiv:2402.18612

    stat.ME cs.CY cs.LG

    Understanding overfitting in random forest for probability estimation: a visualization and simulation study

    Authors: Lasai Barreñada, Paula Dhiman, Dirk Timmerman, Anne-Laure Boulesteix, Ben Van Calster

    Abstract: Random forests have become popular for clinical risk prediction modelling. In a case study on predicting ovarian malignancy, we observed training c-statistics close to 1. Although this suggests overfitting, performance was competitive on test data. We aimed to understand the behaviour of random forests by (1) visualizing data space in three real world case studies and (2) a simulation study. For t…

    Submitted 30 September, 2024; v1 submitted 28 February, 2024; originally announced February 2024.

    Comments: 20 pages, 8 figures

    Journal ref: Diagn Progn Res 8, 14 (2024)

  5. arXiv:2310.15108

    stat.ML cs.LG stat.AP stat.CO stat.ME

    Evaluating machine learning models in non-standard settings: An overview and new findings

    Authors: Roman Hornung, Malte Nalenz, Lennart Schneider, Andreas Bender, Ludwig Bothmann, Bernd Bischl, Thomas Augustin, Anne-Laure Boulesteix

    Abstract: Estimating the generalization error (GE) of machine learning models is fundamental, with resampling methods being the most common approach. However, in non-standard settings, particularly those where observations are not independently and identically distributed, resampling using simple random data divisions may lead to biased GE estimates. This paper strives to present well-grounded guidelines fo…

    Submitted 23 October, 2023; originally announced October 2023.

  6. arXiv:2302.03991

    q-bio.GN cs.AI cs.LG stat.AP stat.CO

    Prediction approaches for partly missing multi-omics covariate data: A literature review and an empirical comparison study

    Authors: Roman Hornung, Frederik Ludwigs, Jonas Hagenberg, Anne-Laure Boulesteix

    Abstract: As the availability of omics data has increased in the last few years, more multi-omics data have been generated, that is, high-dimensional molecular data consisting of several types such as genomic, transcriptomic, or proteomic data, all obtained from the same patients. Such data lend themselves to being used as covariates in automatic outcome prediction because each omics type may contribute uni…

    Submitted 8 February, 2023; originally announced February 2023.

  7. arXiv:2107.05847

    stat.ML cs.LG

    Hyperparameter Optimization: Foundations, Algorithms, Best Practices and Open Challenges

    Authors: Bernd Bischl, Martin Binder, Michel Lang, Tobias Pielok, Jakob Richter, Stefan Coors, Janek Thomas, Theresa Ullmann, Marc Becker, Anne-Laure Boulesteix, Difan Deng, Marius Lindauer

    Abstract: Most machine learning algorithms are configured by one or several hyperparameters that must be carefully chosen and often considerably impact performance. To avoid a time consuming and unreproducible manual trial-and-error process to find well-performing hyperparameter configurations, various automatic hyperparameter optimization (HPO) methods, e.g., based on resampling error estimation for superv…

    Submitted 24 November, 2021; v1 submitted 13 July, 2021; originally announced July 2021.

  8. arXiv:2003.03621

    stat.ML cs.LG stat.AP stat.ME

    Large-scale benchmark study of survival prediction methods using multi-omics data

    Authors: Moritz Herrmann, Philipp Probst, Roman Hornung, Vindi Jurinovic, Anne-Laure Boulesteix

    Abstract: Multi-omics data, that is, datasets containing different types of high-dimensional molecular variables (often in addition to classical clinical variables), are increasingly generated for the investigation of various diseases. Nevertheless, questions remain regarding the usefulness of multi-omics data for the prediction of disease outcomes such as survival time. It is also unclear which methods are…

    Submitted 7 March, 2020; originally announced March 2020.

    Comments: 23 pages, 6 tables, 3 figures

    Journal ref: Briefings in Bioinformatics (2020) bbaa167

  9. arXiv:1804.03515

    stat.ML cs.LG

    Hyperparameters and Tuning Strategies for Random Forest

    Authors: Philipp Probst, Marvin Wright, Anne-Laure Boulesteix

    Abstract: The random forest algorithm (RF) has several hyperparameters that have to be set by the user, e.g., the number of observations drawn randomly for each tree and whether they are drawn with or without replacement, the number of variables drawn randomly for each split, the splitting rule, the minimum number of samples that a node must contain and the number of trees. In this paper, we first provide a…

    Submitted 26 February, 2019; v1 submitted 10 April, 2018; originally announced April 2018.

    Comments: 19 pages, 2 figures

    Journal ref: WIREs Data Mining Knowl Discov 2019

  10. arXiv:1705.05654

    stat.ML cs.LG

    To tune or not to tune the number of trees in random forest?

    Authors: Philipp Probst, Anne-Laure Boulesteix

    Abstract: The number of trees T in the random forest (RF) algorithm for supervised learning has to be set by the user. It is controversial whether T should simply be set to the largest computationally manageable value or whether a smaller T may in some cases be better. While the principle underlying bagging is that "more trees are better", in practice the classification error rate sometimes reaches a minimu…

    Submitted 16 May, 2017; originally announced May 2017.

    Comments: 20 pages, 4 figures

    Journal ref: Journal of Machine Learning Research 18 (2018) 1-18

  11. arXiv:1208.2651

    stat.CO cs.CV stat.ME stat.ML

    A Plea for Neutral Comparison Studies in Computational Sciences

    Authors: Anne-Laure Boulesteix, Manuel J. A. Eugster

    Abstract: In a context where most published articles are devoted to the development of "new methods", comparison studies are generally appreciated by readers but surprisingly given poor consideration by many scientific journals. In connection with recent articles on over-optimism and epistemology published in Bioinformatics, this letter stresses the importance of neutral comparison studies for the objective…

    Submitted 13 August, 2012; originally announced August 2012.