Search | arXiv e-print repository

doi 10.1007/s10618-022-00840-5

Grouped Feature Importance and Combined Features Effect Plot

Authors: Quay Au, Julia Herbinger, Clemens Stachl, Bernd Bischl, Giuseppe Casalicchio

Abstract: Interpretable machine learning has become a very active area of research due to the rising popularity of machine learning algorithms and their inherently challenging interpretability. Most work in this area has been focused on the interpretation of single features in a model. However, for researchers and practitioners, it is often equally important to quantify the importance or visualize the effec… ▽ More Interpretable machine learning has become a very active area of research due to the rising popularity of machine learning algorithms and their inherently challenging interpretability. Most work in this area has been focused on the interpretation of single features in a model. However, for researchers and practitioners, it is often equally important to quantify the importance or visualize the effect of feature groups. To address this research gap, we provide a comprehensive overview of how existing model-agnostic techniques can be defined for feature groups to assess the grouped feature importance, focusing on permutation-based, refitting, and Shapley-based methods. We also introduce an importance-based sequential procedure that identifies a stable and well-performing combination of features in the grouped feature space. Furthermore, we introduce the combined features effect plot, which is a technique to visualize the effect of a group of features based on a sparse, interpretable linear combination of features. We used simulation studies and a real data example from computational psychology to analyze, compare, and discuss these methods. △ Less

Submitted 23 April, 2021; originally announced April 2021.

Journal ref: Data Mining and Knowledge Discovery 36, 1401--1450 (2022)

arXiv:1905.01973 [pdf, other]

Who wrote this book? A challenge for e-commerce

Authors: Béranger Dumont, Simona Maggio, Ghiles Sidi Said, Quoc-Tien Au

Abstract: Modern e-commerce catalogs contain millions of references, associated with textual and visual information that is of paramount importance for the products to be found via search or browsing. Of particular significance is the book category, where the author name(s) field poses a significant challenge. Indeed, books written by a given author (such as F. Scott Fitzgerald) might be listed with differe… ▽ More Modern e-commerce catalogs contain millions of references, associated with textual and visual information that is of paramount importance for the products to be found via search or browsing. Of particular significance is the book category, where the author name(s) field poses a significant challenge. Indeed, books written by a given author (such as F. Scott Fitzgerald) might be listed with different authors' names in a catalog due to abbreviations and spelling variants and mistakes, among others. To solve this problem at scale, we design a composite system involving open data sources for books as well as machine learning components leveraging deep learning-based techniques for natural language processing. In particular, we use Siamese neural networks for an approximate match with known author names, and direct correction of the provided author's name using sequence-to-sequence learning with neural networks. We evaluate this approach on product data from the e-commerce website Rakuten France, and find that the top proposal of the system is the normalized author name with 72% accuracy. △ Less

Submitted 19 April, 2019; originally announced May 2019.

Comments: 8 pages, 1 figure

arXiv:1904.03943 [pdf, other]

Component-Wise Boosting of Targets for Multi-Output Prediction

Authors: Quay Au, Daniel Schalk, Giuseppe Casalicchio, Ramona Schoedel, Clemens Stachl, Bernd Bischl

Abstract: Multi-output prediction deals with the prediction of several targets of possibly diverse types. One way to address this problem is the so called problem transformation method. This method is often used in multi-label learning, but can also be used for multi-output prediction due to its generality and simplicity. In this paper, we introduce an algorithm that uses the problem transformation method f… ▽ More Multi-output prediction deals with the prediction of several targets of possibly diverse types. One way to address this problem is the so called problem transformation method. This method is often used in multi-label learning, but can also be used for multi-output prediction due to its generality and simplicity. In this paper, we introduce an algorithm that uses the problem transformation method for multi-output prediction, while simultaneously learning the dependencies between target variables in a sparse and interpretable manner. In a first step, predictions are obtained for each target individually. Target dependencies are then learned via a component-wise boosting approach. We compare our new method with similar approaches in a benchmark using multi-label, multivariate regression and mixed-type datasets. △ Less

Submitted 8 April, 2019; originally announced April 2019.

arXiv:1703.08991 [pdf, other]

doi 10.32614/RJ-2017-012

Multilabel Classification with R Package mlr

Authors: Philipp Probst, Quay Au, Giuseppe Casalicchio, Clemens Stachl, Bernd Bischl

Abstract: We implemented several multilabel classification algorithms in the machine learning package mlr. The implemented methods are binary relevance, classifier chains, nested stacking, dependent binary relevance and stacking, which can be used with any base learner that is accessible in mlr. Moreover, there is access to the multilabel classification versions of randomForestSRC and rFerns. All these meth… ▽ More We implemented several multilabel classification algorithms in the machine learning package mlr. The implemented methods are binary relevance, classifier chains, nested stacking, dependent binary relevance and stacking, which can be used with any base learner that is accessible in mlr. Moreover, there is access to the multilabel classification versions of randomForestSRC and rFerns. All these methods can be easily compared by different implemented multilabel performance measures and resampling methods in the standardized mlr framework. In a benchmark experiment with several multilabel datasets, the performance of the different methods is evaluated. △ Less

Submitted 3 April, 2017; v1 submitted 27 March, 2017; originally announced March 2017.

Comments: 18 pages, 2 figures, to be published in R Journal; reference corrected

Journal ref: The R Journal 9/1 (2017) 352-369

Showing 1–4 of 4 results for author: Au, Q