Showing 1–13 of 13 results for author: Rahnenführer, J

  1. Methods for Quantifying Dataset Similarity: a Review, Taxonomy and Comparison

    Authors: Marieke Stolte, Franziska Kappenberg, Jörg Rahnenführer, Andrea Bommert

    Abstract: Quantifying the similarity between datasets has widespread applications in statistics and machine learning. The performance of a predictive model on novel datasets, referred to as generalizability, depends on how similar the training and evaluation datasets are. Exploiting or transferring insights between similar datasets is a key aspect of meta-learning and transfer-learning. In simulation studie…

    Submitted 17 June, 2025; v1 submitted 7 December, 2023; originally announced December 2023.

    Comments: 124 pages

    MSC Class: 62E99; 62G10; 62H15; 62H30; 05C90

    Journal ref: Statist. Surv. 18, 163 - 298, 2024

  2. Simulation study to evaluate when Plasmode simulation is superior to parametric simulation in estimating the mean squared error of the least squares estimator in linear regression

    Authors: Marieke Stolte, Nicholas Schreck, Alla Slynko, Maral Saadati, Axel Benner, Jörg Rahnenführer, Andrea Bommert

    Abstract: Simulation is a crucial tool for the evaluation and comparison of statistical methods. How to design fair and neutral simulation studies is therefore of great interest for researchers developing new methods and practitioners confronted with the choice of the most suitable method. The term simulation usually refers to parametric simulation, that is, computer experiments using artificial data made u…

    Submitted 17 June, 2025; v1 submitted 7 December, 2023; originally announced December 2023.

    Journal ref: PLOS ONE (2024)

  3. Employing an Adjusted Stability Measure for Multi-Criteria Model Fitting on Data Sets with Similar Features

    Authors: Andrea Bommert, Jörg Rahnenführer, Michel Lang

    Abstract: Fitting models with high predictive accuracy that include all relevant but no irrelevant or redundant features is a challenging task on data sets with similar (e.g. highly correlated) features. We propose the approach of tuning the hyperparameters of a predictive model in a multi-criteria fashion with respect to predictive accuracy and feature selection stability. We evaluate this approach based o…

    Submitted 15 June, 2021; originally announced June 2021.

  4. arXiv:2105.09223  [pdf, other]

    stat.ME

    Improving Adaptive Seamless Designs through Bayesian optimization

    Authors: Jakob Richter, Tim Friede, Jörg Rahnenführer

    Abstract: We propose to use Bayesian optimization (BO) to improve the efficiency of the design selection process in clinical trials. BO is a method to optimize expensive black-box functions, by using a regression as a surrogate to guide the search. In clinical trials, planning test procedures and sample sizes is a crucial task. A common goal is to maximize the test power, given a set of treatments, correspo…

    Submitted 19 May, 2021; originally announced May 2021.

    Comments: Submitted to: Biometrical Journal

  5. Adjusted Measures for Feature Selection Stability for Data Sets with Similar Features

    Authors: Andrea Bommert, Jörg Rahnenführer

    Abstract: For data sets with similar features, for example highly correlated features, most existing stability measures behave in an undesired way: They consider features that are almost identical but have different identifiers as different features. Existing adjusted stability measures, that is, stability measures that take into account the similarities between features, have major theoretical drawbacks. W…

    Submitted 25 September, 2020; originally announced September 2020.

  6. arXiv:2008.06298  [pdf, ps, other]

    stat.ML cs.LG

    Feature Selection Methods for Cost-Constrained Classification in Random Forests

    Authors: Rudolf Jagdhuber, Michel Lang, Jörg Rahnenführer

    Abstract: Cost-sensitive feature selection describes a feature selection problem, where features raise individual costs for inclusion in a model. These costs make it possible to incorporate disfavored aspects of features, e.g. failure rates of a measuring device, or patient harm, in the model selection process. Random Forests define a particularly challenging problem for feature selection, as features are generally e…

    Submitted 17 August, 2020; v1 submitted 14 August, 2020; originally announced August 2020.

    Comments: Corrected minor typo in Figure 1, Added ancillary files

  7. arXiv:2008.05163  [pdf, ps, other]

    stat.ML cs.LG

    Implications on Feature Detection when using the Benefit-Cost Ratio

    Authors: Rudolf Jagdhuber, Jörg Rahnenführer

    Abstract: In many practical machine learning applications, there are two objectives: one is to maximize predictive accuracy and the other is to minimize costs of the resulting model. These costs of individual features may be financial costs, but can also refer to other aspects, like for example evaluation time. Feature selection addresses both objectives, as it reduces the number of features and can improve…

    Submitted 15 August, 2020; v1 submitted 12 August, 2020; originally announced August 2020.

    Comments: v2: Added ancillary files and corrected floating of figures. 10 pages, 2 figures, submitted to SN Computer Science

  8. arXiv:2004.07542  [pdf, other]

    stat.AP stat.ME

    Combining heterogeneous subgroups with graph-structured variable selection priors for Cox regression

    Authors: Katrin Madjar, Manuela Zucknick, Katja Ickstadt, Jörg Rahnenführer

    Abstract: Important objectives in cancer research are the prediction of a patient's risk based on molecular measurements such as gene expression data and the identification of new prognostic biomarkers (e.g. genes). In clinical practice, this is often challenging because patient cohorts are typically small and can be heterogeneous. In classical subgroup analysis, a separate prediction model is fitted using…

    Submitted 16 April, 2020; originally announced April 2020.

    Comments: under review, 19 pages, 10 figures

  9. arXiv:2003.08965  [pdf, other]

    stat.ME stat.AP

    Weighted Cox regression for the prediction of heterogeneous patient subgroups

    Authors: Katrin Madjar, Jörg Rahnenführer

    Abstract: An important task in clinical medicine is the construction of risk prediction models for specific subgroups of patients based on high-dimensional molecular measurements such as gene expression data. Major objectives in modeling high-dimensional data are good prediction performance and feature selection to find a subset of predictors that are truly associated with a clinical outcome such as a time-…

    Submitted 19 March, 2020; originally announced March 2020.

    Comments: under review, 15 pages, 6 figures

  10. arXiv:2003.04980  [pdf, other]

    cs.CL cs.AI cs.LG stat.ML

    Improving Reliability of Latent Dirichlet Allocation by Assessing Its Stability Using Clustering Techniques on Replicated Runs

    Authors: Jonas Rieger, Lars Koppers, Carsten Jentsch, Jörg Rahnenführer

    Abstract: For organizing large text corpora topic modeling provides useful tools. A widely used method is Latent Dirichlet Allocation (LDA), a generative probabilistic model which models single texts in a collection of texts as mixtures of latent topics. The assignments of words to topics rely on initial values such that generally the outcome of LDA is not fully reproducible. In addition, the reassignment v…

    Submitted 14 February, 2020; originally announced March 2020.

    Comments: 16 pages, 2 figures

  11. arXiv:1902.08999  [pdf, other]

    cs.LG stat.ML

    High Dimensional Restrictive Federated Model Selection with multi-objective Bayesian Optimization over shifted distributions

    Authors: Xudong Sun, Andrea Bommert, Florian Pfisterer, Jörg Rahnenführer, Michel Lang, Bernd Bischl

    Abstract: A novel machine learning optimization process coined Restrictive Federated Model Selection (RFMS) is proposed under the scenario, for example, when data from healthcare units cannot leave the site it is situated on and it is forbidden to carry out training algorithms on remote data sites due to either technical or privacy and trust concerns. To carry out clinical research under this scenario, a…

    Submitted 8 August, 2019; v1 submitted 24 February, 2019; originally announced February 2019.

  12. arXiv:1606.05110  [pdf, other]

    stat.ML cs.CY

    Machine Learning meets Data-Driven Journalism: Boosting International Understanding and Transparency in News Coverage

    Authors: Elena Erdmann, Karin Boczek, Lars Koppers, Gerret von Nordheim, Christian Pölitz, Alejandro Molina, Katharina Morik, Henrik Müller, Jörg Rahnenführer, Kristian Kersting

    Abstract: Migration crisis, climate change or tax havens: Global challenges need global solutions. But agreeing on a joint approach is difficult without a common ground for discussion. Public spheres are highly segmented because news is mainly produced and received on a national level. Gaining a global view on international debates about important issues is hindered by the enormous quantity of news and b…

    Submitted 16 June, 2016; originally announced June 2016.

    Comments: presented at 2016 ICML Workshop on #Data4Good: Machine Learning in Social Good Applications, New York, NY

  13. arXiv:1504.00826  [pdf, ps, other]

    q-bio.MN q-bio.QM

    TiMEx: A Waiting Time Model for Mutually Exclusive Groups of Cancer Alterations

    Authors: Simona Constantinescu, Ewa Szczurek, Pejman Mohammadi, Jörg Rahnenführer, Niko Beerenwinkel

    Abstract: Despite recent technological advances in genomic sciences, our understanding of cancer progression and its driving genetic alterations remains incomplete. Here, we introduce TiMEx, a generative probabilistic model for detecting patterns of various degrees of mutual exclusivity across genetic alterations, which can indicate pathways involved in cancer progression. TiMEx explicitly accounts for the…

    Submitted 27 October, 2015; v1 submitted 3 April, 2015; originally announced April 2015.

    Comments: Paper accepted for oral presentation at RECOMB CCB Satellite Meeting (April 2015, Warsaw)