Search | arXiv e-print repository

arXiv:2405.19865 [pdf, other]

Reduced Rank Regression for Mixed Predictor and Response Variables

Authors: Mark de Rooij, Lorenza Cotugno, Roberta Siciliano

Abstract: In this paper, we propose the generalized mixed reduced rank regression method, GMR$^3$ for short. GMR$^3$ is a regression method for a mix of numeric, binary, and ordinal response variables. The predictor variables can be a mix of binary, nominal, ordinal, and numeric variables. For dealing with the categorical predictors we use optimal scaling. A majorization-minimization algorithm is derived for maximum likelihood estimation under a local independence assumption. A series of simulation studies is shown (Section 4) to evaluate the performance of the algorithm with different types of predictor and response variables. In Section 5.2, we briefly discuss the choices to make when applying the model to empirical data and give suggestions for supporting such choices. In Section 6.1, we show an application of GMR$^3$ using the Eurobarometer Surveys data set of 2023.

Submitted 22 January, 2025; v1 submitted 30 May, 2024; originally announced May 2024.

Comments: 29 pages, 4 figures

arXiv:2402.07634 [pdf, other]

A Multinomial Canonical Decomposition Model, with emphasis on the analysis of Multivariate Binary data

Authors: Mark de Rooij

Abstract: In this paper, we propose to decompose the canonical parameter of a multinomial model into a set of participant scores and category scores. External information about the participants or the categories can be used to restrict these scores. Therefore, we impose the constraint that the scores are linear combinations of the external variables. For the estimation of the parameters of the decomposition, we derive a majorization-minimization algorithm. We place special emphasis on the case where the categories represent profiles of binary response variables. In that case, the multinomial model becomes a regression model for multiple binary response variables and researchers might be interested in the effect of an external variable for the participant (i.e., a predictor) on a binary response variable or in the effect of this predictor on the association among binary response variables. We derive interpretational rules for these relationships in terms of changes in log odds or log odds ratios. Connections between our multinomial canonical decomposition and loglinear models, multinomial logistic regression, multinomial reduced rank logistic regression, and double constrained correspondence analysis are discussed. We use two empirical data sets, the first to show the relationships between a loglinear analysis approach and our modelling approach. The second data set is used as an illustration of our modelling approach and describes the model selection and interpretation in detail.

Submitted 22 January, 2025; v1 submitted 12 February, 2024; originally announced February 2024.

Comments: 28 pages, 0 figures

arXiv:2402.07629 [pdf, other]

doi 10.1017/psy.2025.10

Logistic Multidimensional Data Analysis for Ordinal Response Variables using a Cumulative Link function

Authors: Mark de Rooij, Ligaya Breemer, Dion Woestenburg, Frank Busing

Abstract: We present a multidimensional data analysis framework for the analysis of ordinal response variables. Underlying the ordinal variables, we assume a continuous latent variable, leading to cumulative logit models. The framework includes unsupervised methods, when no predictor variables are available, and supervised methods, when predictor variables are available. We distinguish between dominance variables and proximity variables, where dominance variables are analyzed using inner product models, whereas the proximity variables are analyzed using distance models. An expectation-majorization-minimization algorithm is derived for estimation of the parameters of the models. We illustrate our methodology with three empirical data sets highlighting the advantages of the proposed framework. A simulation study is conducted to evaluate the performance of the algorithm.

Submitted 22 January, 2025; v1 submitted 12 February, 2024; originally announced February 2024.

Comments: 56 pages, 10 figures

arXiv:2402.07624 [pdf, other]

doi 10.1007/s41237-024-00248-z

Supervised and Unsupervised Mapping of Binary Variables: A proximity perspective

Authors: Mark de Rooij, Dion Woestenburg, Frank Busing

Abstract: We propose a new mapping tool for supervised and unsupervised analysis of multivariate binary data with multiple items, questions, or response variables. The mapping assumes an underlying proximity response function, where participants can have multiple reasons to disagree or say ``no'' to a question. The probability to endorse, or to agree with an item depends on an item specific parameter and the distance in a joint space between a point representing the item and a point representing the participant. The item specific parameter defines a circle in the joint space around the location of the item such that for participants positioned within the circle the endorsement probability is larger than 0.5. For map estimation, we develop and test an MM-algorithm in which the negative log-likelihood function is majorized with a weighted least squares function. The weighted least squares function can be minimized with standard algorithms for multidimensional unfolding. To illustrate the new mapping, two empirical data sets are analyzed. The mappings are interpreted in detail and the unsupervised map is compared to a visualization based on correspondence analysis. In a Monte Carlo study, we test the performance of the algorithm in terms of recovery of population parameters and conclude that this recovery is adequate. A second Monte Carlo study investigates the predictive performance of the new mapping compared to a similar mapping with a monotone response function.

Submitted 22 January, 2025; v1 submitted 12 February, 2024; originally announced February 2024.

Comments: 38 pages, 11 figures

arXiv:2308.08387 [pdf, other]

Continuous Sweep for Binary Quantification Learning

Authors: Kevin Kloos, Julian D. Karch, Quinten A. Meertens, Mark de Rooij

Abstract: A quantifier is a supervised machine learning algorithm, focused on estimating the class prevalence in a dataset rather than labeling its individual observations. We introduce Continuous Sweep, a new parametric binary quantifier inspired by the well-performing Median Sweep, which is an ensemble method based on Adjusted Count estimators. We modified two aspects of Median Sweep: 1) using parametric class distributions instead of empirical distributions for the true and false positive rate; 2) using the mean instead of the median of a set of Adjusted Count estimates. These two modifications allow for a theoretical analysis of the bias and variance of Continuous Sweep. Furthermore, the expressions of bias and variance can be used to define optimal decision boundaries of the set of Adjusted Count estimates to be used in the ensemble. We show in three simulation studies that Continuous Sweep outperforms the quantifiers in the group Classify, Count, and Correct, including Median Sweep, and is competitive with the two best quantifiers from the group Distribution Matchers. An empirical data set is also analysed with these quantifiers, showing similar performance.

Submitted 11 October, 2024; v1 submitted 16 August, 2023; originally announced August 2023.

MSC Class: 68U99

arXiv:2210.14484 [pdf, other]

doi 10.1016/j.inffus.2024.102524

Imputation of missing values in multi-view data

Authors: Wouter van Loon, Marjolein Fokkema, Frank de Vos, Marisa Koini, Reinhold Schmidt, Mark de Rooij

Abstract: Data for which a set of objects is described by multiple distinct feature sets (called views) is known as multi-view data. When missing values occur in multi-view data, all features in a view are likely to be missing simultaneously. This may lead to very large quantities of missing data which, especially when combined with high-dimensionality, can make the application of conditional imputation methods computationally infeasible. However, the multi-view structure could be leveraged to reduce the complexity and computational load of imputation. We introduce a new imputation method based on the existing stacked penalized logistic regression (StaPLR) algorithm for multi-view learning. It performs imputation in a dimension-reduced space to address computational challenges inherent to the multi-view context. We compare the performance of the new imputation method with several existing imputation algorithms in simulated data sets and a real data application. The results show that the new imputation method leads to competitive results at a much lower computational cost, and makes the use of advanced imputation algorithms such as missForest and predictive mean matching possible in settings where they would otherwise be computationally infeasible.

Submitted 20 June, 2024; v1 submitted 26 October, 2022; originally announced October 2022.

Comments: 49 pages, 15 figures. Accepted manuscript

Journal ref: Information Fusion 111 (2024) 102524

arXiv:2108.05761 [pdf, other]

doi 10.3389/fnins.2022.830630

Analyzing hierarchical multi-view MRI data with StaPLR: An application to Alzheimer's disease classification

Authors: Wouter van Loon, Frank de Vos, Marjolein Fokkema, Botond Szabo, Marisa Koini, Reinhold Schmidt, Mark de Rooij

Abstract: Multi-view data refers to a setting where features are divided into feature sets, for example because they correspond to different sources. Stacked penalized logistic regression (StaPLR) is a recently introduced method that can be used for classification and automatically selecting the views that are most important for prediction. We introduce an extension of this method to a setting where the data has a hierarchical multi-view structure. We also introduce a new view importance measure for StaPLR, which allows us to compare the importance of views at any level of the hierarchy. We apply our extended StaPLR algorithm to Alzheimer's disease classification where different MRI measures have been calculated from three scan types: structural MRI, diffusion-weighted MRI, and resting-state fMRI. StaPLR can identify which scan types and which derived MRI measures are most important for classification, and it outperforms elastic net regression in classification performance.

Submitted 26 April, 2022; v1 submitted 12 August, 2021; originally announced August 2021.

Comments: 36 pages, 9 figures. Accepted manuscript

Journal ref: Frontiers in Neuroscience 16:830630 (2022) 1-15

arXiv:2107.13920 [pdf, other]

The Bradley-Terry Regression Trunk approach for modelling preference data with small trees

Authors: Alessio Baldassarre, Elise Dusseldorp, Antonio D'Ambrosio, Mark de Rooij, Claudio Conversano

Abstract: This paper introduces the Bradley-Terry Regression Trunk model, a novel probabilistic approach for the analysis of preference data expressed through paired comparison rankings. In some cases, it may be reasonable to assume that the preferences expressed by individuals depend on their characteristics. Within the framework of tree-based partitioning, we specify a tree-based model estimating the joint effects of subject-specific covariates over and above their main effects. We combine a tree-based model and the log-linear Bradley-Terry model using the outcome of the comparisons as response variable. The proposed model provides a solution to discover interaction effects when no a-priori hypotheses are available. It produces a small tree, called trunk, that represents a fair compromise between a simple interpretation of the interaction effects and an easy to read partition of judges based on their characteristics and the preferences they have expressed. We present an application on a real data set following two different approaches, and a simulation study to test the model's performance. Simulations showed that the quality of the model performance increases when the number of rankings and objects increases. In addition, the performance is considerably amplified when the judges' characteristics have a high impact on their choices.

Submitted 29 July, 2021; originally announced July 2021.

arXiv:2102.08232 [pdf, other]

doi 10.1007/978-981-99-2240-6_4

The MELODIC family for simultaneous binary logistic regression in a reduced space

Authors: Mark de Rooij, Patrick J. F. Groenen

Abstract: Logistic regression is a commonly used method for binary classification. Researchers often have more than a single binary response variable and simultaneous analysis is beneficial because it provides insight into the dependencies among response variables as well as between the predictor variables and the responses. Moreover, in such a simultaneous analysis the equations can lend each other strength, which might increase predictive accuracy. In this paper, we propose the MELODIC family for simultaneous binary logistic regression modeling. In this family, the regression models are defined in a Euclidean space of reduced dimension, based on a distance rule. The model may be interpreted in terms of logistic regression coefficients or in terms of a biplot. We discuss a fast iterative majorization (or MM) algorithm for parameter estimation. Two applications are shown in detail: one relating personality characteristics to drug consumption profiles and one relating personality characteristics to depressive and anxiety disorders. We present a thorough comparison of our MELODIC family with alternative approaches for multivariate binary data.

Submitted 24 June, 2022; v1 submitted 16 February, 2021; originally announced February 2021.

Comments: Comment [v2]: added a paragraph on page 7 about the equivalence to a logistic reduced rank model Comment [v2]: the description of the relationship towards logistic reduced rank models is updated on page 37

arXiv:2010.16271 [pdf, other]

doi 10.1007/s11634-024-00587-5

View selection in multi-view stacking: Choosing the meta-learner

Authors: Wouter van Loon, Marjolein Fokkema, Botond Szabo, Mark de Rooij

Abstract: Multi-view stacking is a framework for combining information from different views (i.e. different feature sets) describing the same set of objects. In this framework, a base-learner algorithm is trained on each view separately, and their predictions are then combined by a meta-learner algorithm. In a previous study, stacked penalized logistic regression, a special case of multi-view stacking, has been shown to be useful in identifying which views are most important for prediction. In this article we expand this research by considering seven different algorithms to use as the meta-learner, and evaluating their view selection and classification performance in simulations and two applications on real gene-expression data sets. Our results suggest that if both view selection and classification accuracy are important to the research at hand, then the nonnegative lasso, nonnegative adaptive lasso and nonnegative elastic net are suitable meta-learners. Exactly which among these three is to be preferred depends on the research context. The remaining four meta-learners, namely nonnegative ridge regression, nonnegative forward selection, stability selection and the interpolating predictor, show few advantages that would make them preferable to the other three.

Submitted 15 April, 2024; v1 submitted 30 October, 2020; originally announced October 2020.

Comments: 47 pages, 17 figures. Accepted manuscript

MSC Class: 62; 68

Journal ref: Advances in Data Analysis and Classification (2024)

arXiv:1911.11463 [pdf, other]

The Early Roots of Statistical Learning in the Psychometric Literature: A review and two new results

Authors: Mark de Rooij, Bunga Citra Pratiwi, Marjolein Fokkema, Elise Dusseldorp, Henk Kelderman

Abstract: Machine and statistical learning techniques are becoming increasingly important for the analysis of psychological data. Four core concepts of machine learning are the bias-variance trade-off, cross-validation, regularization, and basis expansion. We present some early psychometric papers, from almost a century ago, that dealt with cross-validation and regularization. From this review it is safe to conclude that the origins of these lie partly in the field of psychometrics. From our historical review, two new ideas arose which we investigated further: The first is about the relationship between reliability and predictive validity; the second is whether optimal regression weights should be estimated by regularizing their values towards equality or shrinking their values towards zero. In a simulation study we show that the reliability of a test score does not influence the predictive validity as much as is usually written in psychometric textbooks. Using an empirical example we show that regularization towards equal regression coefficients is beneficial in terms of prediction error.

Submitted 26 November, 2019; originally announced November 2019.

Comments: 22 pages, 3 figures

arXiv:1811.02316 [pdf, other]

doi 10.1016/j.inffus.2020.03.007

Stacked Penalized Logistic Regression for Selecting Views in Multi-View Learning

Authors: Wouter van Loon, Marjolein Fokkema, Botond Szabo, Mark de Rooij

Abstract: In biomedical research, many different types of patient data can be collected, such as various types of omics data and medical imaging modalities. Applying multi-view learning to these different sources of information can increase the accuracy of medical classification models compared with single-view procedures. However, collecting biomedical data can be expensive and/or burdensome for patients, so it is important to reduce the amount of required data collection. It is therefore necessary to develop multi-view learning methods which can accurately identify those views that are most important for prediction. In recent years, several biomedical studies have used an approach known as multi-view stacking (MVS), where a model is trained on each view separately and the resulting predictions are combined through stacking. In these studies, MVS has been shown to increase classification accuracy. However, the MVS framework can also be used for selecting a subset of important views. To study the view selection potential of MVS, we develop a special case called stacked penalized logistic regression (StaPLR). Compared with existing view-selection methods, StaPLR can make use of faster optimization algorithms and is easily parallelized. We show that nonnegativity constraints on the parameters of the function which combines the views play an important role in preventing unimportant views from entering the model. We investigate the performance of StaPLR through simulations, and consider two real data examples. We compare the performance of StaPLR with an existing view selection method called the group lasso and observe that, in terms of view selection, StaPLR is often more conservative and has a consistently lower false positive rate.

Submitted 12 May, 2020; v1 submitted 6 November, 2018; originally announced November 2018.

Comments: 26 pages, 9 figures. Accepted manuscript

Journal ref: Information Fusion 61 (2020) 113-123

Showing 1–12 of 12 results for author: de Rooij, M