-
An integrated method for clustering and association network inference
Authors:
Jeanne Tous,
Julien Chiquet
Abstract:
We consider inference in high-dimensional Gaussian graphical models. These models provide a rigorous framework to describe a network of statistical dependencies between entities, such as genes in genomic regulation studies or species in ecology. Penalized methods, including the standard Graphical-Lasso, are well-known approaches to infer the parameters of these models. As the number of variables in the model (i.e., of entities in the network) grows, network inference and interpretation become more complex. We propose Normal-Block, a new model that clusters variables and considers a network at the cluster level. Normal-Block both adds structure to the network and reduces its size. We build on Graphical-Lasso to add a penalty on the network's edges and limit the detection of spurious dependencies. We also propose a zero-inflated version of the model to account for real-world data properties. For the inference procedure, we propose a direct heuristic method and another, more rigorous one that simultaneously infers the clustering of variables and the association network between clusters, using a penalized variational Expectation-Maximization approach. An implementation of the model in R, in a package called normalblockr, is available on GitHub (https://github.com/jeannetous/normalblockr). We present the results in terms of clustering and network inference using both simulated data and various types of real-world data (proteomics, word occurrences on webpages, and microbiota distribution).
Submitted 28 March, 2025;
originally announced March 2025.
-
Evaluating Parameter Uncertainty in the Poisson Lognormal Model with Corrected Variational Estimators
Authors:
Bastien Batardière,
Julien Chiquet,
Mahendra Mariadassou
Abstract:
Count data analysis is essential across diverse fields, from ecology and accident analysis to single-cell RNA sequencing (scRNA-seq) and metagenomics. While log transformations are computationally efficient, model-based approaches such as the Poisson-Log-Normal (PLN) model provide robust statistical foundations and are more amenable to extensions. The PLN model, with its latent Gaussian structure, not only captures overdispersion but also enables correlation between variables and inclusion of covariates, making it suitable for multivariate count data analysis. Variational approximations are a gold standard for estimating the parameters of complex latent variable models such as PLN, by maximizing a surrogate likelihood. However, variational estimators lack theoretical statistical properties such as consistency and asymptotic normality. In this paper, we investigate the consistency and variance estimation of PLN parameters using M-estimation theory. We derive the Sandwich estimator, previously studied in Westling and McCormick (2019), specifically for the PLN model. We compare this approach to the variational Fisher Information method, demonstrating the Sandwich estimator's effectiveness in terms of coverage through simulation studies. Finally, we validate our method on a scRNA-seq dataset.
Submitted 13 November, 2024;
originally announced November 2024.
-
Importance sampling-based gradient method for dimension reduction in Poisson log-normal model
Authors:
Bastien Batardière,
Julien Chiquet,
Joon Kwon,
Julien Stoehr
Abstract:
High-dimensional count data poses significant challenges for statistical analysis, necessitating effective methods that also preserve explainability. We focus on a low-rank constrained variant of the Poisson log-normal model, which relates the observed data to a latent low-dimensional multivariate Gaussian variable via a Poisson distribution. Variational inference methods have become a gold-standard solution to infer such a model. While computationally efficient, they usually lack theoretical statistical properties with respect to the model. To address this issue, we propose a projected stochastic gradient scheme that directly maximizes the log-likelihood. We prove the convergence of the proposed method when using importance sampling for estimating the gradient. Specifically, we obtain a rate of convergence of $O(T^{-1/2} + N^{-1})$ with $T$ the number of iterations and $N$ the number of Monte Carlo draws. The latter follows from a novel descent lemma for non-convex $L$-smooth objective functions and random biased gradient estimates. We also demonstrate numerically the efficiency of our solution compared to its variational competitor. Our method not only scales with respect to the number of observed samples but also provides access to the desirable properties of the maximum likelihood estimator.
Submitted 23 April, 2025; v1 submitted 1 October, 2024;
originally announced October 2024.
-
Zero-inflation in the Multivariate Poisson Lognormal Family
Authors:
Bastien Batardière,
Julien Chiquet,
François Gindraud,
Mahendra Mariadassou
Abstract:
Analyzing high-dimensional count data is a challenge, and statistical model-based approaches provide an adequate and efficient framework that preserves explainability. The (multivariate) Poisson-Log-Normal (PLN) model is one such model: it assumes count data are driven by an underlying structured latent Gaussian variable, so that the dependencies between counts stem solely from the latent dependencies. However, PLN does not account for zero-inflation, a feature frequently observed in real-world datasets. Here we introduce the Zero-Inflated PLN (ZIPLN) model, adding a multivariate zero-inflated component to the model, as an additional Bernoulli latent variable. The zero-inflation can be fixed, site-specific, feature-specific or depend on covariates. We estimate model parameters using variational inference that scales up to datasets with a few thousand variables and compare two approximations: (i) independent Gaussian and Bernoulli variational distributions or (ii) Gaussian variational distribution conditioned on the Bernoulli one. The method is assessed on synthetic data and the efficiency of ZIPLN is established even when zero-inflation concerns up to $90\%$ of the observed counts. We then apply both ZIPLN and PLN to a cow microbiome dataset containing $90.6\%$ zeroes. Accounting for zero-inflation significantly increases log-likelihood and reduces dispersion in the latent space, thus leading to improved group discrimination.
Submitted 23 May, 2024;
originally announced May 2024.
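The generative process described in this abstract (a latent Gaussian vector, Poisson counts driven by it, and a Bernoulli mask that forces extra zeros) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names `poisson` and `simulate_zipln`, the parameter names, the use of a lower-triangular Cholesky factor for the latent covariance, and the feature-specific zero-inflation probabilities are all assumptions made for the example. Covariates and offsets are omitted.

```python
import math
import random


def poisson(lam, rng):
    """Draw from Poisson(lam) with Knuth's method (adequate for moderate lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def simulate_zipln(n, mu, chol, pi, seed=0):
    """Simulate n count vectors from a zero-inflated Poisson log-normal sketch.

    mu   : latent mean, one value per feature
    chol : lower-triangular Cholesky factor of the latent covariance
    pi   : per-feature probability that a count is forced to zero
    """
    rng = random.Random(seed)
    p = len(mu)
    counts = []
    for _ in range(n):
        # Latent Gaussian layer: z = mu + chol @ eps, with eps standard normal.
        eps = [rng.gauss(0.0, 1.0) for _ in range(p)]
        z = [mu[j] + sum(chol[j][k] * eps[k] for k in range(j + 1))
             for j in range(p)]
        row = []
        for j in range(p):
            # Bernoulli zero-inflation mask, then Poisson emission otherwise.
            inflated = rng.random() < pi[j]
            row.append(0 if inflated else poisson(math.exp(z[j]), rng))
        counts.append(row)
    return counts
```

Setting every entry of `pi` to zero gives a plain PLN sketch, which makes it easy to compare the proportion of zeros with and without the inflation component on simulated data.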
-
Automated calibration of consensus weighted distance-based clustering approaches using sharp
Authors:
Barbara Bodinier,
Dragana Vuckovic,
Sabrina Rodrigues,
Sarah Filippi,
Julien Chiquet,
Marc Chadeau-Hyam
Abstract:
In consensus clustering, a clustering algorithm is used in combination with a subsampling procedure to detect stable clusters. Previous studies on both simulated and real data suggest that consensus clustering outperforms native algorithms. We extend here consensus clustering to allow for attribute weighting in the calculation of pairwise distances using existing regularised approaches. We propose a procedure for the calibration of the number of clusters (and regularisation parameter) by maximising a novel consensus score calculated directly from consensus clustering outputs, making it extremely computationally competitive. Our simulation study shows better clustering performance of (i) models calibrated by maximising our consensus score compared to existing calibration scores, and (ii) weighted compared to unweighted approaches in the presence of features that do not contribute to cluster definition. Application to real gene expression data measured in lung tissue reveals clear clusters corresponding to different lung cancer subtypes. The R package sharp (version 1.4.0) is available on CRAN.
Submitted 26 April, 2023;
originally announced April 2023.
-
A Probabilistic Graph Coupling View of Dimension Reduction
Authors:
Hugues Van Assel,
Thibault Espinasse,
Julien Chiquet,
Franck Picard
Abstract:
Most popular dimension reduction (DR) methods like t-SNE and UMAP are based on minimizing a cost between input and latent pairwise similarities. Though widely used, these approaches lack clear probabilistic foundations to enable a full understanding of their properties and limitations. To that end, we introduce a unifying statistical framework based on the coupling of hidden graphs using cross entropy. These graphs induce a Markov random field dependency structure among the observations in both input and latent spaces. We show that existing pairwise similarity DR methods can be retrieved from our framework with particular choices of priors for the graphs. Moreover, this reveals that these methods suffer from a statistical deficiency that explains poor performance in conserving coarse-grain dependencies. Our model is leveraged and extended to address this issue while new links are drawn with Laplacian eigenmaps and PCA.
Submitted 5 October, 2023; v1 submitted 31 January, 2022;
originally announced January 2022.
-
Automated calibration for stability selection in penalised regression and graphical models
Authors:
Barbara Bodinier,
Sarah Filippi,
Therese Haugdahl Nost,
Julien Chiquet,
Marc Chadeau-Hyam
Abstract:
Stability selection represents an attractive approach to identify sparse sets of features jointly associated with an outcome in high-dimensional contexts. We introduce an automated calibration procedure that maximises an in-house stability score and accommodates a priori known block structure in the data (e.g. multi-OMIC). It applies to (LASSO) penalised regression and graphical models. Simulations show our approach outperforms non-stability-based and stability selection approaches using the original calibration. Application of multi-block graphical LASSO on real (epigenetic and transcriptomic) data from the Norwegian Women and Cancer study reveals a central/credible and novel cross-OMIC role of LRRN3 in the biological response to smoking. Proposed approaches were implemented in the R package sharp.
Submitted 22 February, 2023; v1 submitted 4 June, 2021;
originally announced June 2021.
-
Adjusting the adjusted Rand Index -- A multinomial story
Authors:
Martina Sundqvist,
Julien Chiquet,
Guillem Rigaill
Abstract:
The Adjusted Rand Index ($ARI$) is arguably one of the most popular measures for cluster comparison. The adjustment of the $ARI$ is based on a hypergeometric distribution assumption which is unsatisfying from a modeling perspective as (i) it is not appropriate when the two clusterings are dependent, (ii) it forces the size of the clusters, and (iii) it ignores randomness of the sampling. In this work, we present a new "modified" version of the Rand Index. First, we redefine the $MRI$ by only counting the pairs consistent by similarity and ignoring the pairs consistent by difference, increasing the interpretability of the score. Second, we base the adjusted version, $MARI$, on a multinomial distribution instead of a hypergeometric distribution. The multinomial model is advantageous as it does not force the size of the clusters, properly models randomness, and is easily extended to the dependent case. We show that the $ARI$ is biased under the multinomial model and that the difference between the $ARI$ and $MARI$ can be large for small $n$ but essentially vanishes for large $n$, where $n$ is the number of individuals. Finally, we provide an efficient algorithm to compute all these quantities ($(A)RI$ and $M(A)RI$) by relying on a sparse representation of the contingency table in our \texttt{aricode} package. The space and time complexity is linear in the number of samples and importantly does not depend on the number of clusters as we do not explicitly compute the contingency table.
Submitted 17 November, 2020;
originally announced November 2020.
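The pair counting behind the Rand index and its standard hypergeometric adjustment can be sketched as follows. This is a minimal Python illustration, not the aricode implementation: the function names `pairs` and `rand_indices` are hypothetical, and a dictionary holding only the non-empty contingency cells stands in for the sparse representation mentioned in the abstract. Only $RI$ and $ARI$ are covered; the multinomial-based $MARI$ is not reproduced here.

```python
from collections import Counter


def pairs(k):
    """Number of unordered pairs among k items."""
    return k * (k - 1) // 2


def rand_indices(c1, c2):
    """Return (RI, ARI) for two labelings of the same n >= 2 items.

    Only non-empty cells of the contingency table are stored, so the
    cost is linear in n and does not depend on the number of clusters.
    """
    if len(c1) != len(c2):
        raise ValueError("labelings must have the same length")
    n = len(c1)
    if n < 2:
        raise ValueError("at least two items are required")

    cells = Counter(zip(c1, c2))   # sparse contingency table
    rows = Counter(c1)             # cluster sizes in the first labeling
    cols = Counter(c2)             # cluster sizes in the second labeling

    same_both = sum(pairs(v) for v in cells.values())
    same_1 = sum(pairs(v) for v in rows.values())
    same_2 = sum(pairs(v) for v in cols.values())
    total = pairs(n)

    # Rand index: pairs grouped together in both, or separated in both.
    ri = (total + 2 * same_both - same_1 - same_2) / total

    # Standard ARI: subtract the expectation under the hypergeometric model.
    expected = same_1 * same_2 / total
    max_index = (same_1 + same_2) / 2
    if max_index == expected:
        ari = 1.0  # degenerate case, e.g. a single cluster in both labelings
    else:
        ari = (same_both - expected) / (max_index - expected)
    return ri, ari
```

For instance, `rand_indices([0, 0, 1, 1], [0, 1, 0, 1])` gives an $RI$ of 1/3 and an $ARI$ of -0.5, while two labelings that differ only by a renaming of the clusters give 1.0 for both.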
-
Identification of deregulated transcription factors involved in subtypes of cancers
Authors:
Magali Champion,
Julien Chiquet,
Pierre Neuvial,
Mohamed Elati,
François Radvanyi,
Etienne Birmelé
Abstract:
We propose a methodology for the identification of transcription factors involved in the deregulation of genes in tumoral cells. This strategy is based on the inference of a reference gene regulatory network that connects transcription factors to their downstream targets using gene expression data. The behavior of genes in tumor samples is then carefully compared to this network of reference to detect deregulated target genes. A linear model is finally used to measure the ability of each transcription factor to explain those deregulations. We assess the performance of our method by numerical experiments on a breast cancer data set. We show that the information about deregulation is complementary to the expression data as the combination of the two improves the supervised classification performance of samples into cancer subtypes.
Submitted 17 April, 2020;
originally announced April 2020.
-
missSBM: An R Package for Handling Missing Values in the Stochastic Block Model
Authors:
Pierre Barbillon,
Julien Chiquet,
Timothée Tabouy
Abstract:
The Stochastic Block Model (SBM) is a popular probabilistic model for random graphs. It is commonly used for clustering network data by aggregating nodes that share similar connectivity patterns into blocks. When fitting an SBM to a network which is partially observed, it is important to take into account the underlying process that generates the missing values, otherwise the inference may be biased. This paper introduces missSBM, an R package fitting the SBM when the network is partially observed, i.e., the adjacency matrix contains not only 1s or 0s encoding presence or absence of edges but also NAs encoding missing information between pairs of nodes. This package implements a set of algorithms for fitting the binary SBM, possibly in the presence of external covariates, by performing variational inference adapted to several observation processes. Our implementation automatically explores different block numbers to select the most relevant model according to the Integrated Classification Likelihood (ICL) criterion. The ICL criterion can also help determine which observation process better corresponds to a given dataset. Finally, missSBM can be used to perform imputation of missing entries in the adjacency matrix. We illustrate the package on a network data set consisting of interactions between political blogs sampled during the French presidential election in 2007.
Submitted 27 May, 2021; v1 submitted 28 June, 2019;
originally announced June 2019.
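The data structure described in this abstract, an adjacency matrix whose entries are 1, 0 or missing, can be sketched as follows. This is a minimal Python illustration and not the missSBM API: the function names `simulate_sbm`, `hide_dyads` and `observed_density` are hypothetical, `None` plays the role of R's `NA`, and the masking scheme (each dyad hidden independently with the same probability) is just one simple observation process chosen for the example.

```python
import random


def simulate_sbm(block_of, connect, seed=0):
    """Simulate an undirected binary SBM.

    block_of : block index of each node
    connect  : connect[q][l] is the edge probability between blocks q and l
    """
    rng = random.Random(seed)
    n = len(block_of)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            edge = int(rng.random() < connect[block_of[i]][block_of[j]])
            adj[i][j] = adj[j][i] = edge
    return adj


def hide_dyads(adj, miss_prob, seed=0):
    """Return a copy of adj where each dyad is masked with probability miss_prob."""
    rng = random.Random(seed)
    n = len(adj)
    obs = [row[:] for row in adj]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < miss_prob:
                obs[i][j] = obs[j][i] = None  # missing dyad, like NA in R
    return obs


def observed_density(obs, block_of, q, l):
    """Edge frequency between blocks q and l, using observed dyads only."""
    edges = total = 0
    n = len(obs)
    for i in range(n):
        for j in range(i + 1, n):
            if (block_of[i], block_of[j]) not in ((q, l), (l, q)):
                continue
            if obs[i][j] is None:
                continue
            edges += obs[i][j]
            total += 1
    return edges / total if total else None
```

Here the block memberships are taken as known, which is only for illustration; in the setting of the abstract they are latent and estimated by variational inference together with the connectivity parameters.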
-
Fast Computation of Genome-Metagenome Interaction Effects
Authors:
Florent Guinot,
Marie Szafranski,
Julien Chiquet,
Anouk Zancarini,
Christine Le Signor,
Christophe Mougel,
Christophe Ambroise
Abstract:
Motivation. Association studies have been widely used to search for associations between common genetic variants and a given phenotype. However, it is now generally accepted that genes and environment must be examined jointly when estimating phenotypic variance. In this work we consider two types of biological markers: genotypic markers, which characterize an observation in terms of inherited genetic information, and metagenomic markers, which are related to the environment. Both types of markers are available in their millions and can be used to characterize any observation uniquely. Objective. Our focus is on detecting interactions between groups of genetic and metagenomic markers in order to gain a better understanding of the complex relationship between environment and genome in the expression of a given phenotype. Contributions. We propose a novel approach for efficiently detecting interactions between complementary datasets in a high-dimensional setting with a reduced computational cost. The method, named SICOMORE, reduces the dimension of the search space by selecting a subset of supervariables in the two complementary datasets. These supervariables are given by a weighted group structure defined on sets of variables at different scales. A Lasso selection is then applied on each type of supervariable to obtain a subset of potential interactions that will be explored via linear model testing. Results. We compare SICOMORE with other approaches in simulations, with varying sample sizes, noise, and numbers of true interactions. SICOMORE exhibits convincing results in terms of recall, as well as competitive performance with respect to running time. The method is also used to detect interactions between genomic markers in Medicago truncatula and metagenomic markers in its rhizosphere bacterial community. Software availability. An R package is available, along with its documentation and associated scripts, allowing the reader to reproduce the results presented in the paper.
Submitted 18 June, 2020; v1 submitted 29 October, 2018;
originally announced October 2018.
-
Variational inference for sparse network reconstruction from count data
Authors:
Julien Chiquet,
Mahendra Mariadassou,
Stéphane Robin
Abstract:
In multivariate statistics, the question of finding direct interactions can be formulated as a problem of network inference - or network reconstruction - for which the Gaussian graphical model (GGM) provides a canonical framework. Unfortunately, the Gaussian assumption does not apply to count data which are encountered in domains such as genomics, social sciences or ecology.
To circumvent this limitation, state-of-the-art approaches use two-step strategies that first transform counts to pseudo-Gaussian observations and then apply a (partial) correlation-based approach from the abundant literature of GGM inference. We adopt a different stance by relying on a latent model where we directly model counts by means of Poisson distributions that are conditional on latent (hidden) Gaussian correlated variables. In this multivariate Poisson lognormal model, the dependency structure is completely captured by the latent layer. This parametric model makes it possible to account for the effects of covariates on the counts.
To perform network inference, we add a sparsity-inducing constraint on the inverse covariance matrix of the latent Gaussian vector. Unlike the usual Gaussian setting, the penalized likelihood is generally not tractable, and we resort instead to a variational approach for approximate likelihood maximization. The corresponding optimization problem is solved by alternating a gradient ascent on the variational parameters and a graphical-Lasso step on the covariance matrix.
We show that our approach is highly competitive with the existing methods on simulations inspired by microbiological data. We then illustrate on three different data sets how accounting for sampling efforts via offsets and integrating external covariates (which is rarely done in the existing literature) drastically changes the topology of the inferred network.
Submitted 8 June, 2018;
originally announced June 2018.
-
Variable selection in multivariate linear models with high-dimensional covariance matrix estimation
Authors:
Marie Perrot-Dockès,
Céline Lévy-Leduc,
Laure Sansonnet,
Julien Chiquet
Abstract:
In this paper, we propose a novel variable selection approach in the framework of multivariate linear models taking into account the dependence that may exist between the responses. It consists in first estimating the covariance matrix of the responses and then plugging this estimator into a Lasso criterion, in order to obtain a sparse estimator of the coefficient matrix. The properties of our approach are investigated both from a theoretical and a numerical point of view. More precisely, we give general conditions that the estimators of the covariance matrix and its inverse have to satisfy in order to recover the positions of the null and non-null entries of the coefficient matrix when the size of the covariance matrix is not fixed and can tend to infinity. We prove that these conditions are satisfied in the particular case of some Toeplitz matrices. Our approach is implemented in the R package MultiVarSel available from the Comprehensive R Archive Network (CRAN) and is very attractive since it benefits from a low computational load. We also assess the performance of our methodology using synthetic data and compare it with alternative approaches. Our numerical experiments show that including the estimation of the covariance matrix in the Lasso criterion dramatically improves the variable selection performance in many cases.
Submitted 13 July, 2017;
originally announced July 2017.
-
Variational Inference for Stochastic Block Models from Sampled Data
Authors:
Timothée Tabouy,
Pierre Barbillon,
Julien Chiquet
Abstract:
This paper deals with non-observed dyads during the sampling of a network and the resulting issues in the inference of the Stochastic Block Model (SBM). We review sampling designs and recover Missing At Random (MAR) and Not Missing At Random (NMAR) conditions for the SBM. We introduce variants of the variational EM algorithm for inferring the SBM under various sampling designs (MAR and NMAR), all available as an R package. Model selection criteria based on Integrated Classification Likelihood are derived for selecting both the number of blocks and the sampling design. We investigate the accuracy and the range of applicability of these algorithms with simulations. We explore two real-world networks from ethnology (seed circulation network) and biology (protein-protein interaction network), where the interpretation depends considerably on the sampling design considered.
Submitted 9 January, 2019; v1 submitted 13 July, 2017;
originally announced July 2017.
-
A multivariate variable selection approach for analyzing LC-MS metabolomics data
Authors:
M. Perrot-Dockès,
C. Lévy-Leduc,
J. Chiquet,
L. Sansonnet,
M. Brégère,
M.-P. Étienne,
S. Robin,
G. Genta-Jouve
Abstract:
Omic data are characterized by the presence of strong dependence structures that result either from data acquisition or from some underlying biological processes. In metabolomics, for instance, data resulting from Liquid Chromatography-Mass Spectrometry (LC-MS) -- a technique which gives access to a large coverage of metabolites -- exhibit such patterns. These data sets are typically used to find the metabolites characterizing a phenotype of interest associated with the samples. However, applying statistical procedures that do not adjust the variable selection step to the dependence pattern may result in a loss of power and the selection of spurious variables. The goal of this paper is to propose a variable selection procedure in the multivariate linear model that accounts for the dependence structure of the multiple outputs, which may lead, in the LC-MS framework, to the selection of more relevant metabolites. We propose a novel Lasso-based approach in the multivariate framework of the general linear model taking into account the dependence structure by using various models of the covariance matrix of the residuals. Our numerical experiments show that including the estimation of the covariance matrix of the residuals in the Lasso criterion dramatically improves the variable selection performance. Our approach is also successfully applied to an LC-MS data set made of African copal samples for which it is able to provide a small list of metabolites without altering the phenotype discrimination. Our methodology is implemented in the R package MultiVarSel which is available from the CRAN (Comprehensive R Archive Network).
Submitted 31 March, 2017;
originally announced April 2017.
-
Variational inference for probabilistic Poisson PCA
Authors:
Julien Chiquet,
Mahendra Mariadassou,
Stéphane Robin
Abstract:
Many application domains such as ecology or genomics have to deal with multivariate non-Gaussian observations. A typical example is the joint observation of the respective abundances of a set of species in a series of sites, aiming to understand the co-variations between these species. The Gaussian setting provides a canonical way to model such dependencies, but does not apply in general. We consider here the multivariate exponential family framework for which we introduce a generic model with multivariate Gaussian latent variables. We show that approximate maximum likelihood inference can be achieved via a variational algorithm for which gradient descent easily applies. We show that this setting enables us to account for covariates and offsets. We then focus on the case of the Poisson-lognormal model in the context of community ecology. We demonstrate the efficiency of our algorithm on microbial ecology datasets. We illustrate the importance of accounting for the effects of covariates to better understand interactions between species.
Submitted 30 April, 2018; v1 submitted 20 March, 2017;
originally announced March 2017.
-
Fast Detection of Block Boundaries in Block Wise Constant Matrices: An Application to HiC data
Authors:
Vincent Brault,
Julien Chiquet,
Céline Lévy-Leduc
Abstract:
We propose a novel approach for estimating the location of block boundaries (change-points) in a random matrix consisting of a block wise constant matrix observed in white noise. Our method rephrases this task as a variable selection problem, which we address with a penalized least-squares criterion using an $\ell_1$-type penalty. We first provide some theoretical results ensuring the consistency of our change-point estimators. Then, we explain how to implement our method in a very efficient way. Finally, we provide some empirical evidence to support our claims and apply our approach to HiC data, which are used in molecular biology to better understand the influence of chromosomal conformation on cell function.
Submitted 11 March, 2016;
originally announced March 2016.
-
A model for gene deregulation detection using expression data
Authors:
Thomas Picchetti,
Julien Chiquet,
Mohamed Elati,
Pierre Neuvial,
Rémy Nicolle,
Etienne Birmelé
Abstract:
In tumoral cells, gene regulation mechanisms are severely altered, and these modifications in the regulations may be characteristic of different subtypes of cancer. However, these alterations do not necessarily induce differential expressions between the subtypes. To answer this question, we propose a statistical methodology to identify the misregulated genes given a reference network and gene expression data. Our model is based on a regulatory process in which all genes are allowed to be deregulated. We derive an EM algorithm where the hidden variables correspond to the status (under/over/normally expressed) of the genes and where the E-step is solved thanks to a message passing algorithm. Our procedure provides posterior probabilities of deregulation in a given sample for each gene. We assess the performance of our method by numerical experiments on simulations and on a bladder cancer data set.
Submitted 8 January, 2016; v1 submitted 21 May, 2015;
originally announced May 2015.
-
Fast tree inference with weighted fusion penalties
Authors:
Julien Chiquet,
Pierre Gutierrez,
Guillem Rigaill
Abstract:
Given a data set with many features observed in a large number of conditions, it is desirable to fuse and aggregate conditions which are similar to ease the interpretation and extract the main characteristics of the data. This paper presents a multidimensional fusion penalty framework to address this question when the number of conditions is large. If the fusion penalty is encoded by an $\ell_q$-norm, we prove for uniform weights that the path of solutions is a tree which is suitable for interpretability. For the $\ell_1$ and $\ell_\infty$-norms, the path is piecewise linear and we derive a homotopy algorithm to recover exactly the whole tree structure. For weighted $\ell_1$-fusion penalties, we demonstrate that distance-decreasing weights lead to balanced tree structures. For a subclass of these weights that we call "exponentially adaptive", we derive an $\mathcal{O}(n\log(n))$ homotopy algorithm and we prove an asymptotic oracle property. This guarantees that we recover the underlying structure of the data efficiently both from a statistical and a computational point of view. We provide a fast implementation of the homotopy algorithm for the single feature case, as well as an efficient embedded cross-validation procedure that takes advantage of the tree structure of the path of solutions. Our proposal outperforms its competing procedures on simulations both in terms of timings and prediction accuracy. As an example we consider phenotypic data: given one or several traits, we reconstruct a balanced tree structure and assess its agreement with the known taxonomy.
Submitted 27 May, 2015; v1 submitted 22 July, 2014;
originally announced July 2014.
-
Structured Regularization for conditional Gaussian Graphical Models
Authors:
Julien Chiquet,
Tristan Mary-Huard,
Stéphane Robin
Abstract:
Conditional Gaussian graphical models (cGGM) are a recent reparametrization of the multivariate linear regression model which explicitly exhibits $i)$ the partial covariances between the predictors and the responses, and $ii)$ the partial covariances between the responses themselves. Such models are particularly suitable for interpretability since partial covariances describe strong relationships between variables. In this framework, we propose a regularization scheme to enhance the learning strategy of the model by driving the selection of the relevant input features by prior structural information. It comes with an efficient alternating optimization procedure which is guaranteed to converge to the global minimum. On top of showing competitive performance on artificial and real datasets, our method demonstrates capabilities for fine interpretation of its parameters, as illustrated on three high-dimensional datasets from spectroscopy, genetics, and genomics.
Submitted 25 September, 2014; v1 submitted 24 March, 2014;
originally announced March 2014.
-
Sparsity by Worst-Case Penalties
Authors:
Yves Grandvalet,
Julien Chiquet,
Christophe Ambroise
Abstract:
This paper proposes a new interpretation of sparse penalties such as the elastic-net and the group-lasso. Beyond providing a new viewpoint on these penalization schemes, our approach results in a unified optimization strategy. Our experiments demonstrate that this strategy, implemented on the elastic-net, is computationally extremely efficient for small to medium size problems. Our accompanying software solves problems very accurately, at machine precision, in the time required to get a rough estimate with competing state-of-the-art algorithms. We illustrate on real and artificial datasets that this accuracy is required for the correctness of the support of the solution, which is an important element for the interpretability of sparsity-inducing penalties.
Submitted 19 July, 2017; v1 submitted 7 October, 2012;
originally announced October 2012.
-
Sparsity with sign-coherent groups of variables via the cooperative-Lasso
Authors:
Julien Chiquet,
Yves Grandvalet,
Camille Charbonnier
Abstract:
We consider the problems of estimation and selection of parameters endowed with a known group structure, when the groups are assumed to be sign-coherent, that is, gathering either nonnegative, nonpositive or null parameters. To tackle this problem, we propose the cooperative-Lasso penalty. We derive the optimality conditions defining the cooperative-Lasso estimate for generalized linear models, and propose an efficient active set algorithm suited to high-dimensional problems. We study the asymptotic consistency of the estimator in the linear regression setup and derive its irrepresentable conditions, which are milder than the ones of the group-Lasso regarding the matching of groups with the sparsity pattern of the true parameters. We also address the problem of model selection in linear regression by deriving an approximation of the degrees of freedom of the cooperative-Lasso estimator. Simulations comparing the proposed estimator to the group and sparse group-Lasso comply with our theoretical results, showing consistent improvements in support recovery for sign-coherent groups. We finally propose two examples illustrating the wide applicability of the cooperative-Lasso: first to the processing of ordinal variables, where the penalty acts as a monotonicity prior; second to the processing of genomic data, where the set of differentially expressed probes is enriched by incorporating all the probes of the microarray that are related to the corresponding genes.
Submitted 2 July, 2012; v1 submitted 14 March, 2011;
originally announced March 2011.
-
Inferring Multiple Graphical Structures
Authors:
Julien Chiquet,
Yves Grandvalet,
Christophe Ambroise
Abstract:
Gaussian Graphical Models provide a convenient framework for representing dependencies between variables. Recently, this tool has received considerable interest for the discovery of biological networks. The literature focuses on the case where a single network is inferred from a set of measurements, but, as wetlab data is typically scarce, several assays, where the experimental conditions affect interactions, are usually merged to infer a single network. In this paper, we propose two approaches for estimating multiple related graphs, by rendering the closeness assumption into an empirical prior or group penalties. We provide quantitative results demonstrating the benefits of the proposed approaches. The methods presented in this paper are embedded in the R package 'simone' from version 1.0-0 onward.
Submitted 12 May, 2010; v1 submitted 22 December, 2009;
originally announced December 2009.
-
Weighted-Lasso for Structured Network Inference from Time Course Data
Authors:
Camille Charbonnier,
Julien Chiquet,
Christophe Ambroise
Abstract:
We present a weighted-Lasso method to infer the parameters of a first-order vector auto-regressive model that describes time course expression data generated by directed gene-to-gene regulation networks. These networks are assumed to have a prior internal structure of connectivity which drives the inference method. This prior structure can be either derived from prior biological knowledge or inferred by the method itself. We illustrate the performance of this structure-based penalization both on synthetic data and on two canonical regulatory networks: first, the yeast cell cycle regulation network, by analyzing Spellman et al.'s dataset, and second, the E. coli S.O.S. DNA repair network, by analyzing U. Alon's lab data.
Submitted 9 December, 2009; v1 submitted 9 October, 2009;
originally announced October 2009.
-
Inferring sparse Gaussian graphical models with latent structure
Authors:
Christophe Ambroise,
Julien Chiquet,
Catherine Matias
Abstract:
Our concern is selecting the concentration matrix's nonzero coefficients for a sparse Gaussian graphical model in a high-dimensional setting. This corresponds to estimating the graph of conditional dependencies between the variables. We describe a novel framework taking into account a latent structure on the concentration matrix. This latent structure is used to drive a penalty matrix and thus to recover a graphical model with a constrained topology. Our method uses an $\ell_1$ penalized likelihood criterion. Inference of the graph of conditional dependencies between the variates and of the hidden variables is performed simultaneously in an iterative \textsc{em}-like algorithm. The performance of our method is illustrated on synthetic as well as real data, the latter concerning breast cancer.
Submitted 17 October, 2008;
originally announced October 2008.