Search | arXiv e-print repository

arXiv:2004.13809 [pdf, other]

Analysis of ensemble feature selection for correlated high-dimensional RNA-Seq cancer data

Authors: Aneta Polewko-Klim, Witold R. Rudnicki

Abstract: Discovery of diagnostic and prognostic molecular markers is important and actively pursued the research field in cancer research. For complex diseases, this process is often performed using Machine Learning. The current study compares two approaches for the discovery of relevant variables: by application of a single feature selection algorithm, versus by an ensemble of diverse algorithms. These ap… ▽ More Discovery of diagnostic and prognostic molecular markers is important and actively pursued the research field in cancer research. For complex diseases, this process is often performed using Machine Learning. The current study compares two approaches for the discovery of relevant variables: by application of a single feature selection algorithm, versus by an ensemble of diverse algorithms. These approaches are used to identify variables that are relevant discerning of four cancer types using RNA-seq profiles from the Cancer Genome Atlas. The comparison is carried out in two directions: evaluating the predictive performance of models and monitoring the stability of selected variables. The most informative features are identified using a four feature selection algorithms, namely U-test, ReliefF, and two variants of the MDFS algorithm. Discerning normal and tumor tissues is performed using the Random Forest algorithm. The highest stability of the feature set was obtained when U-test was used. Unfortunately, models built on feature sets obtained from the ensemble of feature selection algorithms were no better than for models developed on feature sets obtained from individual algorithms. On the other hand, the feature selectors leading to the best classification results varied between data sets. △ Less

Submitted 28 April, 2020; originally announced April 2020.

Comments: 14 pages, 1 table, 29 figure, submitted to International Conference on Computational Science, Amsterdam 2020

arXiv:2003.08342 [pdf, other]

Bootstrap Bias Corrected Cross Validation applied to Super Learning

Authors: Krzysztof Mnich, Agnieszka Kitlas Golińska, Aneta Polewko-Klim, Witold R. Rudnicki

Abstract: Super learner algorithm can be applied to combine results of multiple base learners to improve quality of predictions. The default method for verification of super learner results is by nested cross validation. It has been proposed by Tsamardinos et al., that nested cross validation can be replaced by resampling for tuning hyper-parameters of the learning algorithms. We apply this idea to verifica… ▽ More Super learner algorithm can be applied to combine results of multiple base learners to improve quality of predictions. The default method for verification of super learner results is by nested cross validation. It has been proposed by Tsamardinos et al., that nested cross validation can be replaced by resampling for tuning hyper-parameters of the learning algorithms. We apply this idea to verification of super learner and compare with other verification methods, including nested cross validation. Tests were performed on artificial data sets of diverse size and on seven real, biomedical data sets. The resampling method, called Bootstrap Bias Correction, proved to be a reasonably precise and very cost-efficient alternative for nested cross validation. △ Less

Submitted 18 March, 2020; originally announced March 2020.

Comments: 14 pages, 4 tables, 1 figure, submitted to International Conference on Computational Science, Amsterdam 2020

arXiv:1811.00631 [pdf, other]

MDFS - MultiDimensional Feature Selection

Authors: Radosław Piliszek, Krzysztof Mnich, Szymon Migacz, Paweł Tabaszewski, Andrzej Sułecki, Aneta Polewko-Klim, Witold Rudnicki

Abstract: Identification of informative variables in an information system is often performed using simple one-dimensional filtering procedures that discard information about interactions between variables. Such approach may result in removing some relevant variables from consideration. Here we present an R package MDFS (MultiDimensional Feature Selection) that performs identification of informative variabl… ▽ More Identification of informative variables in an information system is often performed using simple one-dimensional filtering procedures that discard information about interactions between variables. Such approach may result in removing some relevant variables from consideration. Here we present an R package MDFS (MultiDimensional Feature Selection) that performs identification of informative variables taking into account synergistic interactions between multiple descriptors and the decision variable. MDFS is an implementation of an algorithm based on information theory. Computational kernel of the package is implemented in C++. A high-performance version implemented in CUDA C is also available. The applications of MDFS are demonstrated using the well-known Madelon dataset that has synergistic variables by design. The dataset comes from the UCI Machine Learning Repository. It is shown that multidimensional analysis is more sensitive than one-dimensional tests and returns more reliable rankings of importance. △ Less

Submitted 31 October, 2018; originally announced November 2018.

Comments: 12 pages, 3 figures, 5 tables, license: CC-BY

Showing 1–3 of 3 results for author: Rudnicki, W