-
Selective inference after convex clustering with $\ell_1$ penalization
Authors:
François Bachoc,
Cathy Maugis-Rabusseau,
Pierre Neuvial
Abstract:
Classical inference methods notoriously fail when applied to data-driven test hypotheses or inference targets. Instead, dedicated methodologies are required to obtain statistical guarantees for these selective inference problems. Selective inference is particularly relevant post-clustering, typically when testing a difference in mean between two clusters. In this paper, we address convex clustering with $\ell_1$ penalization, by leveraging related selective inference tools for regression, based on Gaussian vectors conditioned to polyhedral sets. In the one-dimensional case, we prove a polyhedral characterization of obtaining given clusters, which enables us to suggest a test procedure with statistical guarantees. This characterization also allows us to provide a computationally efficient regularization path algorithm. Then, we extend the above test procedure and guarantees to multi-dimensional clustering with $\ell_1$ penalization, and also to more general multi-dimensional clusterings that aggregate one-dimensional ones. With various numerical experiments, we validate our statistical guarantees and we demonstrate the power of our methods to detect differences in mean between clusters. Our methods are implemented in the R package poclin.
Submitted 4 September, 2023;
originally announced September 2023.
-
Clustering transformed compositional data using K-means, with applications in gene expression and bicycle sharing system data
Authors:
Antoine Godichon-Baggioni,
Cathy Maugis-Rabusseau,
Andrea Rau
Abstract:
Although there is no shortage of clustering algorithms proposed in the literature, the question of the most relevant strategy for clustering compositional data (i.e., data made up of profiles, whose rows belong to the simplex) remains largely unexplored in cases where the observed value of an observation is equal or close to zero for one or more samples. This work is motivated by the analysis of two sets of compositional data, both focused on the categorization of profiles but arising from considerably different applications: (1) identifying groups of co-expressed genes from high-throughput RNA sequencing data, in which a given gene may be completely silent in one or more experimental conditions; and (2) finding patterns in the usage of stations over the course of one week in the Velib' bicycle sharing system in Paris, France. For both of these applications, we focus on the use of appropriately chosen data transformations, including the Centered Log Ratio and a novel extension we propose called the Log Centered Log Ratio, in conjunction with the K-means algorithm. We use a nonasymptotic penalized criterion, whose penalty is calibrated with the slope heuristics, to select the number of clusters present in the data. Finally, we illustrate the performance of this clustering strategy, which is implemented in the Bioconductor package coseq, on both the gene expression and bicycle sharing system data.
Submitted 20 April, 2017;
originally announced April 2017.
-
Parameter recovery in two-component contamination mixtures: the $\mathbb{L}^2$ strategy
Authors:
Sébastien Gadat,
Jonas Kahn,
Clément Marteau,
Cathy Maugis-Rabusseau
Abstract:
In this paper, we consider a parametric density contamination model. We work with a sample of i.i.d. data with a common density, $f^\star =(1-λ^\star) φ+ λ^\star φ(.-μ^\star)$, where the shape $φ$ is assumed to be known. We establish the optimal rates of convergence for the estimation of the mixture parameters $(λ^\star,μ^\star)$. In particular, we prove that the classical parametric rate $1/\sqrt{n}$ cannot be reached when at least one of these parameters is allowed to tend to $0$ with $n$.
Submitted 21 November, 2018; v1 submitted 1 April, 2016;
originally announced April 2016.
-
Multidimensional two-component Gaussian mixtures detection
Authors:
Béatrice Laurent,
Clément Marteau,
Cathy Maugis-Rabusseau
Abstract:
Let $(X_1,\ldots,X_n)$ be a $d$-dimensional i.i.d. sample from a distribution with density $f$. The problem of detection of a two-component mixture is considered. Our aim is to decide whether $f$ is the density of a standard Gaussian random $d$-vector ($f=φ_d$) or a two-component mixture: $f=(1-\varepsilon)φ_d +\varepsilon φ_d (.-μ)$, where $(\varepsilon,μ)$ are unknown parameters. Optimal separation conditions on $\varepsilon, μ, n$ and the dimension $d$ are established, allowing the two hypotheses to be separated with prescribed errors. Several testing procedures are proposed and two alternative subsets are considered.
Submitted 30 September, 2015;
originally announced September 2015.
-
Non-asymptotic detection of two-component mixtures with unknown means
Authors:
Béatrice Laurent,
Clément Marteau,
Cathy Maugis-Rabusseau
Abstract:
This work is concerned with the detection of a mixture distribution from an $\mathbb{R}$-valued sample. Given a sample $X_1,\dots,X_n$ and an even density $φ$, our aim is to detect whether the sample distribution is $φ(\cdot-μ)$ for some unknown mean $μ$, or is defined as a two-component mixture based on translations of $φ$. We propose a procedure which is based on several spacings of the order statistics, which provides a level-$α$ test for all $n$. Our test is therefore a multiple testing procedure and we prove from a theoretical and practical point of view that it automatically adapts to the proportion of the mixture and to the difference of the means of the two components of the mixture under the alternative. From a theoretical point of view, we prove the optimality of the power of our procedure in various situations. A simulation study shows the good performance of our test compared with several classical procedures.
Submitted 21 January, 2016; v1 submitted 25 April, 2013;
originally announced April 2013.