-
Onset of a conceptual outline map to get a hold on the jungle of cluster analysis
Authors:
Iven Van Mechelen,
Christian Hennig,
Henk A. L. Kiers
Abstract:
The domain of cluster analysis is a meeting point for a very rich multidisciplinary encounter, with cluster-analytic methods being studied and developed in discrete mathematics, numerical analysis, statistics, data analysis, data science, and computer science (including machine learning, data mining, and knowledge discovery), to name but a few. The other side of the coin, however, is that the domain suffers from a major accessibility problem as well as from the fact that it is divided across many fairly isolated islands. As a way out, the present paper offers a thorough and in-depth review of the clustering domain as a whole in the form of an outline map based on an overarching conceptual framework and a common language. With this framework we wish to contribute to structuring the clustering domain, to characterizing methods that have often been developed and studied in quite different contexts, to identifying links between methods, and to introducing a frame of reference for optimally setting up cluster analyses in data-analytic practice.
Submitted 11 July, 2024; v1 submitted 26 April, 2023;
originally announced April 2023.
-
Penalized Optimal Scaling for Ordinal Variables with an Application to International Classification of Functioning Core Sets
Authors:
Aisouda Hoshiyar,
Henk A. L. Kiers,
Jan Gertheiss
Abstract:
Ordinal data occur frequently in the social sciences. When applying principal component analysis (PCA), however, those data are often treated as numeric, implying linear relationships between the variables at hand, or non-linear PCA is applied, where the obtained quantifications are sometimes hard to interpret. Non-linear PCA for categorical data, also called optimal scoring/scaling, constructs new variables by assigning numerical values to categories such that the proportion of variance in those new variables that is explained by a predefined number of principal components is maximized. We propose a penalized version of non-linear PCA for ordinal variables that is a smoothed intermediate between standard PCA on category labels and non-linear PCA as used so far. The new approach is by no means limited to monotonic effects and offers both better interpretability of the non-linear transformation of the category labels and better performance on validation data than unpenalized non-linear PCA and/or standard linear PCA. In particular, an application of penalized optimal scaling to ordinal data as given with the International Classification of Functioning, Disability and Health (ICF) is provided.
Submitted 17 January, 2023; v1 submitted 6 October, 2021;
originally announced October 2021.
-
Heterofusion: Fusing genomics data of different measurement scales
Authors:
Age K. Smilde,
Yipeng Song,
Johan A. Westerhuis,
Henk A. L. Kiers,
Nanne Aben,
Lodewyk F. A. Wessels
Abstract:
In systems biology, it is becoming increasingly common to measure biochemical entities at different levels of the same biological system. Hence, data fusion problems are abundant in the life sciences. With the availability of a multitude of measuring techniques, one of the central problems is the heterogeneity of the data. In this paper, we discuss a specific form of heterogeneity, namely that of measurements obtained at different measurement scales, such as binary, ordinal, interval and ratio-scaled variables. Three generic fusion approaches are presented, of which two are new to the systems biology community. The methods are presented, put in context and illustrated with a real-life genomics example.
Submitted 23 April, 2019;
originally announced April 2019.
-
Common and Distinct Components in Data Fusion
Authors:
Age K. Smilde,
Ingrid Mage,
Tormod Naes,
Thomas Hankemeier,
Mirjam A. Lips,
Henk A. L. Kiers,
Evrim Acar,
Rasmus Bro
Abstract:
In many areas of science, multiple sets of data are collected pertaining to the same system. Examples are food products that are characterized by different sets of variables, bio-processes that are sampled on-line with different instruments, or biological systems of which different genomics measurements are obtained. Data fusion is concerned with analyzing such sets of data simultaneously to arrive at a global view of the system under study. One of the upcoming areas of data fusion is exploring whether the data sets have something in common or not. This gives insight into common and distinct variation in each data set, thereby facilitating understanding of the relationships between the data sets. Unfortunately, research on methods to distinguish common and distinct components is fragmented in both terminology and methods: there is no common ground, which hampers comparing methods and understanding their relative merits. This paper provides a unifying framework for this subfield of data fusion by using rigorous arguments from linear algebra. The most frequently used methods for distinguishing common and distinct components are explained in this framework, and some practical examples are given of these methods in the areas of (medical) biology and food science.
Submitted 8 July, 2016;
originally announced July 2016.