-
A Structured Estimator for large Covariance Matrices in the Presence of Pairwise and Spatial Covariates
Authors:
Martin Metodiev,
Marie Perrot-Dockès,
Sarah Ouadah,
Bailey K. Fosdick,
Stéphane Robin,
Pierre Latouche,
Adrian E. Raftery
Abstract:
We consider the problem of estimating a high-dimensional covariance matrix from a small number of observations when covariates on pairs of variables are available and the variables can have spatial structure. This is motivated by the problem arising in demography of estimating the covariance matrix of the total fertility rate (TFR) of 195 different countries when only 11 observations are available. We construct an estimator for high-dimensional covariance matrices by exploiting information about pairwise covariates, such as whether pairs of variables belong to the same cluster, or spatial structure of the variables, and interactions between the covariates. We reformulate the problem in terms of a mixed effects model. This requires the estimation of only a small number of parameters, which are easy to interpret and which can be selected using standard procedures. The estimator is consistent under general conditions, and asymptotically normal. It works if the mean and variance structure of the data is already specified or if some of the data are missing. We assess its performance under our model assumptions, as well as under model misspecification, using simulations. We find that it outperforms several popular alternatives. We apply it to the TFR dataset and draw some conclusions.
Submitted 7 November, 2024;
originally announced November 2024.
-
Online and Offline Robust Multivariate Linear Regression
Authors:
Antoine Godichon-Baggioni,
Stephane S. Robin,
Laure Sansonnet
Abstract:
We consider the robust estimation of the parameters of multivariate Gaussian linear regression models. To this aim, we consider a robust version of the usual (Mahalanobis) least-squares criterion, with or without Ridge regularization. We introduce two methods for each considered contrast: (i) online stochastic gradient descent algorithms and their averaged versions and (ii) offline fixed-point algorithms. Under weak assumptions, we prove the asymptotic normality of the resulting estimates. Because the variance matrix of the noise is usually unknown, we propose to plug a robust estimate of it into the Mahalanobis-based stochastic gradient descent algorithms. We show, on synthetic data, the dramatic gain in terms of robustness of the proposed estimates as compared to the classical least-squares ones. We also show the computational efficiency of the online versions of the proposed algorithms. All the proposed algorithms are implemented in the R package RobRegression available on CRAN.
Submitted 30 April, 2024;
originally announced April 2024.
-
Composite likelihood inference for the Poisson log-normal model
Authors:
Julien Stoehr,
Stephane S. Robin
Abstract:
Inferring parameters of a latent variable model can be a daunting task when the conditional distribution of the latent variables given the observed ones is intractable. Variational approaches prove to be computationally efficient but, possibly, lack theoretical guarantees on the estimates, while sampling based solutions are quite the opposite. Starting from already available variational approximations, we define a first Monte Carlo EM algorithm to obtain maximum likelihood estimators, focusing on the Poisson log-normal model which provides a generic framework for the analysis of multivariate count data. We then extend this algorithm to the case of a composite likelihood in order to be able to handle higher dimensional count data.
Submitted 18 April, 2024; v1 submitted 22 February, 2024;
originally announced February 2024.
-
Multiple change-point detection for some point processes
Authors:
C. Dion-Blanc,
D. Hawat,
E. Lebarbier,
S. Robin
Abstract:
The aim of change-point detection is to identify behavioral shifts within time series data. This article focuses on scenarios where the data is derived from an inhomogeneous Poisson process or a marked Poisson process. We present a methodology for detecting multiple offline change-points using a minimum contrast estimator. Specifically, we address how to manage the continuous nature of the process given the available discrete observations. Additionally, we select the appropriate number of changes via a cross-validation procedure which is particularly effective given the characteristics of the Poisson process. Lastly, we show how to apply this methodology to self-exciting processes with changes in the intensity. Through experiments, with both simulated and real datasets, we showcase the advantages of the proposed method, which has been implemented in the R package CptPointProcess.
Submitted 6 November, 2024; v1 submitted 17 February, 2023;
originally announced February 2023.
-
A robust model-based clustering based on the geometric median and the Median Covariation Matrix
Authors:
Antoine Godichon-Baggioni,
Stéphane Robin
Abstract:
Grouping observations into homogeneous groups is a recurrent task in statistical data analysis. We consider Gaussian Mixture Models, which are the most famous parametric model-based clustering method. We propose a new robust approach for model-based clustering, which consists in a modification of the EM algorithm (more specifically, the M-step) by replacing the estimates of the mean and the variance by robust versions based on the median and the median covariation matrix. All the proposed methods are available in the R package RGMM accessible on CRAN.
Submitted 15 November, 2022;
originally announced November 2022.
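As a hedged illustration of the geometric median named in the abstract above, the sketch below uses Weiszfeld's iteration, a classical fixed-point scheme chosen here only for brevity. The abstract does not say which algorithm the authors or the RGMM package use, and the function name `geometric_median`, the tolerances, and the toy data are all hypothetical.

```python
import math


def geometric_median(points, tol=1e-9, max_iter=1000):
    """Approximate the geometric median with Weiszfeld's iteration.

    Illustrative sketch only; not taken from the paper or from RGMM.
    """
    dim = len(points[0])
    # Start from the coordinate-wise mean.
    current = [sum(p[k] for p in points) / len(points) for k in range(dim)]
    for _ in range(max_iter):
        weight_sum = 0.0
        weighted = [0.0] * dim
        for p in points:
            dist = math.dist(p, current)
            if dist < 1e-12:
                continue  # skip a point that coincides with the iterate
            w = 1.0 / dist
            weight_sum += w
            for k in range(dim):
                weighted[k] += w * p[k]
        if weight_sum == 0.0:
            break  # every point coincides with the iterate
        updated = [v / weight_sum for v in weighted]
        step = math.dist(updated, current)
        current = updated
        if step < tol:
            break
    return current


# Toy data (hypothetical): four corners of the unit square plus one outlier.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (100.0, 100.0)]
mean = [sum(p[k] for p in points) / len(points) for k in range(2)]
median = geometric_median(points)
```

On this toy data the mean is dragged toward the outlier, while the geometric median stays inside the unit square, which is the kind of robustness the abstract refers to.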
-
Querying multiple sets of $p$-values through composed hypothesis testing
Authors:
Tristan Mary-Huard,
Sarmistha Das,
Indranil Mukhopadhyay,
Stéphane Robin
Abstract:
Motivation: Combining the results of different experiments to exhibit complex patterns or to improve statistical power is a typical aim of data integration. The starting point of the statistical analysis often comes as sets of p-values resulting from previous analyses, which need to be combined in a flexible way to explore complex hypotheses, while guaranteeing a low proportion of false discoveries. Results: We introduce the generic concept of composed hypothesis, which corresponds to an arbitrarily complex combination of simple hypotheses. We rephrase the problem of testing a composed hypothesis as a classification task, and show that finding items for which the composed null hypothesis is rejected boils down to fitting a mixture model and classifying the items according to their posterior probabilities. We show that inference can be efficiently performed and provide a thorough classification rule to control for type I error. The performance and the usefulness of the approach are illustrated on simulations and on two different applications. The method is scalable, does not require any parameter tuning, and provided valuable biological insight on the considered application cases. Availability: The QCH methodology is implemented in the qch R package hosted on CRAN.
Submitted 1 December, 2021; v1 submitted 29 April, 2021;
originally announced April 2021.
-
Mixture-based estimation of entropy
Authors:
Stéphane Robin,
Luca Scrucca
Abstract:
The entropy is a measure of uncertainty that plays a central role in information theory. When the distribution of the data is unknown, an estimate of the entropy needs to be obtained from the data sample itself. We propose a semi-parametric estimate, based on a mixture model approximation of the distribution of interest. The estimate can rely on any type of mixture, but we focus on Gaussian mixture models to demonstrate its accuracy and versatility. Performance of the proposed approach is assessed through a series of simulation studies. We also illustrate its use on two real-life data examples.
Submitted 5 January, 2022; v1 submitted 8 October, 2020;
originally announced October 2020.
-
Accounting for missing actors in interaction network inference from abundance data
Authors:
Raphaëlle Momal,
Stéphane Robin,
Christophe Ambroise
Abstract:
Network inference aims at unraveling the dependency structure relating jointly observed variables. Graphical models provide a general framework to distinguish between marginal and conditional dependency. Unobserved variables (missing actors) may induce apparent conditional dependencies. In the context of count data, we introduce a mixture of Poisson log-normal distributions with tree-shaped graphical models, to recover the dependency structure, including missing actors. We design a variational EM algorithm and assess its performance on synthetic data. We demonstrate the ability of our approach to recover environmental drivers on two ecological datasets. The corresponding R package is available from github.com/Rmomal/nestor.
Submitted 28 July, 2020;
originally announced July 2020.
-
Bayesian inference for network Poisson models
Authors:
Sophie Donnet,
Stéphane Robin
Abstract:
This work is motivated by the analysis of ecological interaction networks. Poisson stochastic blockmodels are widely used in this field to decipher the structure that underlies a weighted network, while accounting for covariate effects. Efficient algorithms based on variational approximations exist for frequentist inference, but without statistical guarantees for the resulting estimates. In the absence of variational Bayes estimates, we show that a good proxy of the posterior distribution can be straightforwardly derived from the frequentist variational estimation procedure, using a Laplace approximation. We use this proxy to sample from the true posterior distribution via a sequential Monte-Carlo algorithm. As shown in the simulation study, the efficiency of the posterior sampling is greatly improved by the accuracy of the approximate posterior distribution. The proposed procedure can be easily extended to other latent variable models. We use this methodology to assess the influence of available covariates on the organization of two ecological networks, as well as the existence of a residual interaction structure.
Submitted 23 July, 2019;
originally announced July 2019.
-
Tree-based Inference of Species Interaction Network from Abundance Data
Authors:
Raphaëlle Momal,
Stéphane Robin,
Christophe Ambroise
Abstract:
The behavior of ecological systems mainly relies on the interactions between the species they involve. We consider the problem of inferring the species interaction network from abundance data. To be relevant, any network inference methodology needs to handle count data and to account for possible environmental effects. It also needs to distinguish between direct interactions and indirect associations, and graphical models provide a convenient framework for this purpose. We introduce a generic statistical model for network inference based on abundance data. The model includes fixed effects to account for environmental covariates and sampling efforts, and correlated random effects to encode species interactions. The inferred network is obtained by averaging over all possible tree-shaped (and therefore sparse) networks, in a computationally efficient manner. An output of the procedure is the probability for each edge to be part of the underlying network. A simulation study shows that the proposed methodology compares well with state-of-the-art approaches, even when the underlying graph strongly differs from a tree. The analysis of two datasets highlights the influence of covariates on the inferred network. Accounting for covariates is critical to avoid spurious edges. The proposed approach could be extended to perform network comparison or to look for missing species.
Submitted 28 October, 2019; v1 submitted 7 May, 2019;
originally announced May 2019.
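To make the idea of averaging over all spanning trees concrete, here is a minimal sketch. It assumes a tree distribution proportional to the product of edge weights and uses Kirchhoff's weighted matrix-tree theorem to sum over trees with one determinant. The abstract does not spell out its algebra, so this illustrates the general technique rather than the authors' procedure; the function names and the unit-weight toy graph are hypothetical.

```python
from itertools import combinations


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-15:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def tree_weight_sum(n_nodes, weights):
    """Sum over spanning trees of the product of edge weights.

    Matrix-tree theorem: the sum equals any cofactor of the weighted Laplacian.
    """
    lap = [[0.0] * n_nodes for _ in range(n_nodes)]
    for (i, j), w in weights.items():
        lap[i][i] += w
        lap[j][j] += w
        lap[i][j] -= w
        lap[j][i] -= w
    reduced = [row[1:] for row in lap[1:]]  # drop first row and column
    return determinant(reduced)


def edge_probabilities(n_nodes, weights):
    """P(edge in tree) = 1 - (weight of trees avoiding the edge) / (total)."""
    total = tree_weight_sum(n_nodes, weights)
    probs = {}
    for edge in weights:
        without = {e: w for e, w in weights.items() if e != edge}
        probs[edge] = 1.0 - tree_weight_sum(n_nodes, without) / total
    return probs


# Toy example (hypothetical): complete graph on 4 nodes with unit weights.
n_nodes = 4
weights = {edge: 1.0 for edge in combinations(range(n_nodes), 2)}
total = tree_weight_sum(n_nodes, weights)
probs = edge_probabilities(n_nodes, weights)
```

With unit weights, `total` counts the spanning trees, and by symmetry every edge gets the same probability. Recomputing one determinant per edge is the simplest correct approach, not the fastest.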
-
Nine Quick Tips for Analyzing Network Data
Authors:
Stephane Robin,
Vincent Miele,
Catherine Matias,
Stéphane Dray
Abstract:
These tips provide a quick and concentrated guide for beginners in the analysis of network data.
Submitted 26 July, 2019; v1 submitted 5 April, 2019;
originally announced April 2019.
-
Variational inference for sparse network reconstruction from count data
Authors:
Julien Chiquet,
Mahendra Mariadassou,
Stéphane Robin
Abstract:
In multivariate statistics, the question of finding direct interactions can be formulated as a problem of network inference (or network reconstruction), for which the Gaussian graphical model (GGM) provides a canonical framework. Unfortunately, the Gaussian assumption does not apply to count data, which are encountered in domains such as genomics, social sciences or ecology.
To circumvent this limitation, state-of-the-art approaches use two-step strategies that first transform counts to pseudo-Gaussian observations and then apply a (partial) correlation-based approach from the abundant literature of GGM inference. We adopt a different stance by relying on a latent model where we directly model counts by means of Poisson distributions that are conditional on latent (hidden) Gaussian correlated variables. In this multivariate Poisson log-normal model, the dependency structure is completely captured by the latent layer. This parametric model makes it possible to account for the effects of covariates on the counts.
To perform network inference, we add a sparsity-inducing constraint on the inverse covariance matrix of the latent Gaussian vector. Unlike the usual Gaussian setting, the penalized likelihood is generally not tractable, and we resort instead to a variational approach for approximate likelihood maximization. The corresponding optimization problem is solved by alternating a gradient ascent on the variational parameters and a graphical-Lasso step on the covariance matrix.
We show that our approach is highly competitive with the existing methods on simulations inspired by microbiological data. We then illustrate on three different data sets how accounting for sampling efforts via offsets and integrating external covariates (which is almost never done in the existing literature) drastically changes the topology of the inferred network.
Submitted 8 June, 2018;
originally announced June 2018.
-
Using deterministic approximations to accelerate SMC for posterior sampling
Authors:
Sophie Donnet,
Stéphane Robin
Abstract:
Sequential Monte Carlo has become a standard tool for Bayesian Inference of complex models. This approach can be computationally demanding, especially when initialized from the prior distribution. On the other hand, deterministic approximations of the posterior distribution are often available with no theoretical guarantees. We propose a bridge sampling scheme starting from such a deterministic approximation of the posterior distribution and targeting the true one. The resulting Shortened Bridge Sampler (SBS) relies on a sequence of distributions that is determined in an adaptive way. We illustrate the robustness and the efficiency of the methodology on a large simulation study. When applied to network datasets, SBS inference leads to different statistical conclusions from those supplied by the standard variational Bayes approximation.
Submitted 25 July, 2017;
originally announced July 2017.
-
Variational inference for coupled Hidden Markov Models applied to the joint detection of copy number variations
Authors:
Xiaoqiang Wang,
Emilie Lebarbier,
Julie Aubert,
Stéphane Robin
Abstract:
Hidden Markov models provide a natural statistical framework for the detection of the copy number variations (CNV) in genomics. In this paper, we consider a Hidden Markov Model involving several correlated hidden processes at the same time. When dealing with a large number of series, maximum likelihood inference (performed classically using the EM algorithm) becomes intractable. We thus propose an approximate inference algorithm based on a variational approach (VEM). A simulation study is performed to assess the performance of the proposed method and an application to the detection of structural variations in plant genomes is presented.
Submitted 21 June, 2017;
originally announced June 2017.
-
Incomplete graphical model inference via latent tree aggregation
Authors:
Geneviève Robin,
Christophe Ambroise,
Stéphane Robin
Abstract:
Graphical network inference is used in many fields such as genomics or ecology to infer the conditional independence structure between variables, from measurements of gene expression or species abundances for instance. In many practical cases, not all variables involved in the network have been observed, and the samples are actually drawn from a distribution where some variables have been marginalized out. This challenges the sparsity assumption commonly made in graphical model inference, since marginalization yields locally dense structures, even when the original network is sparse. We present a procedure for inferring Gaussian graphical models when some variables are unobserved, that accounts both for the influence of missing variables and the low density of the original network. Our model is based on the aggregation of spanning trees, and the estimation procedure on the Expectation-Maximization algorithm. We treat the graph structure and the unobserved nodes as missing variables and compute posterior probabilities of edge appearance. To provide a complete methodology, we also propose several model selection criteria to estimate the number of missing nodes. A simulation study and an illustration on flow cytometry data reveal that our method has favorable edge detection properties compared to existing graph inference techniques. The methods are implemented in an R package.
Submitted 21 March, 2018; v1 submitted 26 May, 2017;
originally announced May 2017.
-
A multivariate variable selection approach for analyzing LC-MS metabolomics data
Authors:
M. Perrot-Dockès,
C. Lévy-Leduc,
J. Chiquet,
L. Sansonnet,
M. Brégère,
M. -P. Étienne,
S. Robin,
G. Genta-Jouve
Abstract:
Omic data are characterized by the presence of strong dependence structures that result either from data acquisition or from some underlying biological processes. In metabolomics, for instance, data resulting from Liquid Chromatography-Mass Spectrometry (LC-MS), a technique which gives access to a large coverage of metabolites, exhibit such patterns. These data sets are typically used to find the metabolites characterizing a phenotype of interest associated with the samples. However, applying statistical procedures that do not adjust the variable selection step to the dependence pattern may result in a loss of power and the selection of spurious variables. The goal of this paper is to propose a variable selection procedure in the multivariate linear model that accounts for the dependence structure of the multiple outputs, which may lead in the LC-MS framework to the selection of more relevant metabolites. We propose a novel Lasso-based approach in the multivariate framework of the general linear model, taking into account the dependence structure by using various modelings of the covariance matrix of the residuals. Our numerical experiments show that including the estimation of the covariance matrix of the residuals in the Lasso criterion dramatically improves the variable selection performance. Our approach is also successfully applied to an LC-MS data set made of African copal samples, for which it is able to provide a small list of metabolites without altering the phenotype discrimination. Our methodology is implemented in the R package MultiVarSel, which is available from the CRAN (Comprehensive R Archive Network).
Submitted 31 March, 2017;
originally announced April 2017.
-
Variational inference for probabilistic Poisson PCA
Authors:
Julien Chiquet,
Mahendra Mariadassou,
Stéphane Robin
Abstract:
Many application domains such as ecology or genomics have to deal with multivariate non-Gaussian observations. A typical example is the joint observation of the respective abundances of a set of species in a series of sites, aiming to understand the co-variations between these species. The Gaussian setting provides a canonical way to model such dependencies, but does not apply in general. We consider here the multivariate exponential family framework for which we introduce a generic model with multivariate Gaussian latent variables. We show that approximate maximum likelihood inference can be achieved via a variational algorithm for which gradient descent easily applies. We show that this setting enables us to account for covariates and offsets. We then focus on the case of the Poisson log-normal model in the context of community ecology. We demonstrate the efficiency of our algorithm on microbial ecology datasets. We illustrate the importance of accounting for the effects of covariates to better understand interactions between species.
Submitted 30 April, 2018; v1 submitted 20 March, 2017;
originally announced March 2017.
-
Exact Bayesian inference for off-line change-point detection in tree-structured graphical models
Authors:
Loïc Schwaller,
Stéphane Robin
Abstract:
We consider the problem of change-point detection in multivariate time-series. The multivariate distribution of the observations is supposed to follow a graphical model, whose graph and parameters are affected by abrupt changes throughout time. We demonstrate that it is possible to perform exact Bayesian inference whenever one considers a simple class of undirected graphs called spanning trees as possible structures. We are then able to integrate on the graph and segmentation spaces at the same time by combining classical dynamic programming with algebraic results pertaining to spanning trees. In particular, we show that quantities such as posterior distributions for change-points or posterior edge probabilities over time can efficiently be obtained. We illustrate our results on both synthetic and experimental data arising from biology and neuroscience.
Submitted 16 June, 2016; v1 submitted 25 March, 2016;
originally announced March 2016.
-
Goodness of fit of logistic models for random graphs
Authors:
Pierre Latouche,
Stéphane Robin,
Sarah Ouadah
Abstract:
Logistic regression is a natural and simple tool to understand how covariates contribute to explaining the topology of a binary network. Once the model is fitted, the practitioner is interested in the goodness-of-fit of the regression in order to check if the covariates are sufficient to explain the whole topology of the network and, if they are not, to analyze the residual structure. To address this problem, we introduce a generic model that combines logistic regression with a network-oriented residual term. This residual term takes the form of the graphon function of a W-graph. Using a variational Bayes framework, we infer the residual graphon by averaging over a series of blockwise constant functions. This approach allows us to define a generic goodness-of-fit criterion, which corresponds to the posterior probability for the residual graphon to be constant. Experiments on toy data are carried out to assess the accuracy of the procedure. Several networks from social sciences and ecology are studied to illustrate the proposed methodology.
Submitted 6 January, 2017; v1 submitted 2 August, 2015;
originally announced August 2015.
-
Detection of adaptive shifts on phylogenies using shifted stochastic processes on a tree
Authors:
Paul Bastide,
Mahendra Mariadassou,
Stéphane Robin
Abstract:
Comparative and evolutionary ecologists are interested in the distribution of quantitative traits among related species. The classical framework for these distributions consists of a random process running along the branches of a phylogenetic tree relating the species. We consider shifts in the process parameters, which reveal fast adaptation to changes of ecological niches. We show that models with shifts are not identifiable in general. Constraining the models to be parsimonious in the number of shifts partially alleviates the problem, but several evolutionary scenarios can still provide the same joint distribution for the extant species. We provide a recursive algorithm to enumerate all the equivalent scenarios and to count the effectively different scenarios. We introduce an incomplete-data framework and develop a maximum likelihood estimation procedure based on the EM algorithm. Finally, we propose a model selection procedure, based on the number of effective scenarios, to estimate the number of shifts and prove an oracle inequality.
Submitted 13 July, 2016; v1 submitted 2 August, 2015;
originally announced August 2015.
-
Exact and approximate inference in graphical models: variable elimination and beyond
Authors:
Nathalie Peyrard,
Marie-Josée Cros,
Simon de Givry,
Alain Franc,
Stéphane Robin,
Régis Sabbadin,
Thomas Schiex,
Matthieu Vignes
Abstract:
Probabilistic graphical models offer a powerful framework to account for the dependence structure between variables, which is represented as a graph. However, the dependence between variables may render inference tasks intractable. In this paper we review techniques exploiting the graph structure for exact inference, borrowed from optimisation and computer science. They are built on the principle of variable elimination, whose complexity is dictated in an intricate way by the order in which variables are eliminated. The so-called treewidth of the graph characterises this algorithmic complexity: low-treewidth graphs can be processed efficiently. The first message that we illustrate is therefore that, for inference in graphical models, the number of variables is not the limiting factor, and it is worth checking the treewidth before turning to approximate methods. We show how algorithms providing an upper bound of the treewidth can be exploited to derive a 'good' elimination order that enables exact inference. The second message is that when the treewidth is too large, algorithms for approximate inference linked to the principle of variable elimination, such as loopy belief propagation and variational approaches, can lead to accurate results while being much less time consuming than Monte-Carlo approaches. We illustrate the techniques reviewed in this article on benchmarks of inference problems in genetic linkage analysis and computer vision, as well as on hidden variable restoration in coupled Hidden Markov Models.
Submitted 12 March, 2018; v1 submitted 29 June, 2015;
originally announced June 2015.
-
A factor model approach for the joint segmentation with between-series correlation
Authors:
Xavier Collilieux,
Emilie Lebarbier,
Stéphane Robin
Abstract:
We consider the segmentation of a set of correlated time-series, the correlation being allowed to take an arbitrary form but being the same at each time-position. We show that encoding the dependency in a factor model enables us to use the dynamic programming algorithm for the inference of the breakpoints, which remains one of the most efficient algorithms. We propose a model selection procedure to determine both the number of breakpoints and the number of factors. The proposed method is implemented in the FASeg R package, which is available on CRAN. We demonstrate the performance of our procedure through simulation experiments, and an application to geodesic data is presented.
Submitted 17 July, 2018; v1 submitted 21 May, 2015;
originally announced May 2015.
-
SegCorr: a statistical procedure for the detection of genomic regions of correlated expression
Authors:
Eleni Ioanna Delatola,
Emilie Lebarbier,
Tristan Mary-Huard,
François Radvanyi,
Stéphane Robin,
Jennifer Wong
Abstract:
Motivation: Detecting local correlations in expression between neighboring genes along the genome has proved to be an effective strategy to identify possible causes of transcriptional deregulation in cancer. It has been successfully used to illustrate the role of mechanisms such as copy number variation (CNV) or epigenetic alterations as factors that may significantly alter expression in large chromosomal regions (gene silencing or gene activation).
Results: The identification of correlated regions requires segmenting the gene expression correlation matrix into regions of homogeneously correlated genes and assessing whether the observed local correlation is significantly higher than the background chromosomal correlation. A unified statistical framework is proposed to achieve these two tasks, where optimal segmentation is efficiently performed using a dynamic programming algorithm, and detection of highly correlated regions is then achieved using an exact test procedure. We also propose a simple and efficient procedure to correct the expression signal for mechanisms already known to impact expression correlation. The performance and robustness of the proposed procedure, called SegCorr, are evaluated on simulated data. The procedure is illustrated on cancer data, where the signal is corrected for correlations possibly caused by copy number variation. The correction permitted the detection of regions with high correlations linked to DNA methylation.
Availability and implementation: The R package SegCorr is available on CRAN.
Submitted 22 April, 2015;
originally announced April 2015.
-
A closed-form approach to Bayesian inference in tree-structured graphical models
Authors:
Loïc Schwaller,
Stéphane Robin,
Michael Stumpf
Abstract:
We consider the inference of the structure of an undirected graphical model in a Bayesian framework. To avoid convergence issues and highly demanding Monte Carlo sampling, we focus on exact inference. More specifically, we aim at achieving the inference with closed-form posteriors, avoiding any sampling step. To this aim, we restrict the set of considered graphs to mixtures of spanning trees. We investigate under which conditions on the priors (on both tree structures and parameters) exact Bayesian inference can be achieved. Under these conditions, we derive a fast and exact algorithm to compute the posterior probability for an edge to belong to the tree model using an algebraic result called the Matrix-Tree theorem. We show that the assumption we have made does not prevent our approach from performing well on synthetic and flow cytometry data.
Submitted 1 May, 2017; v1 submitted 10 April, 2015;
originally announced April 2015.
-
Modelling overdispersion heterogeneity in differential expression analysis using mixtures
Authors:
Elisabetta Bonafede,
Franck Picard,
Stéphane Robin,
Cinzia Viroli
Abstract:
Next-generation sequencing technologies now constitute a method of choice to measure gene expression. The data to analyze are read counts, commonly modeled using Negative Binomial distributions. A relevant issue associated with this probabilistic framework is the reliable estimation of the overdispersion parameter, which is made harder by the limited number of replicates generally available for each gene. Many strategies have been proposed to estimate this parameter, but when differential analysis is the purpose, they often result in procedures based on plug-in estimates, and we show here that this discrepancy between the estimation framework and the testing framework can lead to uncontrolled type-I errors. Instead we propose a mixture model that allows each gene to share information with other genes that exhibit similar variability. Three consistent statistical tests are developed for differential expression analysis. We show that the proposed method improves the sensitivity of detecting differentially expressed genes with respect to the common procedures, since it comes closest to the nominal type-I error level while maintaining high power. The method is finally illustrated on prostate cancer RNA-seq data.
Submitted 7 November, 2014; v1 submitted 23 October, 2014;
originally announced October 2014.
-
Nonparametric species richness estimation under convexity constraint
Authors:
Cécile Durot,
Sylvie Huet,
François Koladjo,
Stéphane Robin
Abstract:
We consider the estimation of the total number $N$ of species based on the abundances of species that have been observed. We adopt a nonparametric approach where the true abundance distribution $p$ is only assumed to be convex. From this assumption, we propose a definition for convex abundance distributions. We use a least-squares estimate of the truncated version of $p$ under the convexity constraint. We deduce two estimators of the total number of species, whose asymptotic distributions are derived. We propose three different procedures, including a bootstrap one, to obtain a confidence interval for $N$. The performance of the estimators is assessed in a simulation study and compared with that of competitors. The proposed method is illustrated on several examples.
Submitted 18 April, 2014;
originally announced April 2014.
-
Network impact on persistence in a finite population dynamic diffusion model: application to an emergent seed exchange network
Authors:
Pierre Barbillon,
Mathieu Thomas,
Isabelle Goldringer,
Frédéric Hospital,
Stéphane Robin
Abstract:
Dynamic extinction colonisation models (also called contact processes) are widely studied in epidemiology and in metapopulation theory. Contacts are usually assumed to be possible only through a network of connected patches. This network accounts for a spatial landscape or a social organisation of interactions. Thanks to social network literature, heterogeneous networks of contacts can be considered. A major issue is to assess the influence of the network in the dynamic model. Most work with this common purpose uses deterministic models or an approximation of a stochastic Extinction-Colonisation model (sEC) which are relevant only for large networks. When working with a limited size network, the induced stochasticity is essential and has to be taken into account in the conclusions. Here, a rigorous framework is proposed for limited size networks and the limitations of the deterministic approximation are exhibited. This framework allows exact computations when the number of patches is small. Otherwise, simulations are used and enhanced by adapted simulation techniques when necessary. A sensitivity analysis was conducted to compare four main topologies of networks in contrasting settings to determine the role of the network. A challenging case was studied in this context: seed exchange of crop species in the Réseau Semences Paysannes (RSP), an emergent French farmers' organisation. A stochastic Extinction-Colonisation model was used to characterize the consequences of substantial changes in terms of RSP's social organisation on the ability of the system to maintain crop varieties.
Submitted 16 April, 2014;
originally announced April 2014.
-
Structured Regularization for conditional Gaussian Graphical Models
Authors:
Julien Chiquet,
Tristan Mary-Huard,
Stéphane Robin
Abstract:
Conditional Gaussian graphical models (cGGM) are a recent reparametrization of the multivariate linear regression model which explicitly exhibits $i)$ the partial covariances between the predictors and the responses, and $ii)$ the partial covariances between the responses themselves. Such models are particularly suitable for interpretability since partial covariances describe strong relationships between variables. In this framework, we propose a regularization scheme to enhance the learning strategy of the model by driving the selection of the relevant input features by prior structural information. It comes with an efficient alternating optimization procedure which is guaranteed to converge to the global minimum. On top of showing competitive performance on artificial and real datasets, our method demonstrates capabilities for fine interpretation of its parameters, as illustrated on three high-dimensional datasets from spectroscopy, genetics, and genomics.
Submitted 25 September, 2014; v1 submitted 24 March, 2014;
originally announced March 2014.
-
A robust approach for estimating change-points in the mean of an AR(1) process
Authors:
Souhil Chakar,
Émilie Lebarbier,
Céline Lévy-Leduc,
Stéphane Robin
Abstract:
We consider the problem of multiple change-point estimation in the mean of a Gaussian AR(1) process. Taking into account the dependence structure does not allow us to use the dynamic programming algorithm, which is the only algorithm giving the optimal solution in the independent case. We propose a robust estimator of the autocorrelation parameter, which is consistent and satisfies a central limit theorem. Then, we propose to follow the classical inference approach, by plugging this estimator into the criteria used for change-point estimation. We show that the asymptotic properties of these estimators are the same as those of the classical estimators in the independent framework. The same plug-in approach is then used to approximate the modified BIC and choose the number of segments. This method is implemented in the R package AR1seg, which is available from the Comprehensive R Archive Network (CRAN). This package is used in the simulation section, in which we show that, for finite sample sizes, taking into account the dependence structure improves the statistical performance of the change-point estimators and of the selection criterion.
Submitted 3 March, 2015; v1 submitted 8 March, 2014;
originally announced March 2014.
-
Modeling heterogeneity in random graphs through latent space models: a selective review
Authors:
Catherine Matias,
Stéphane Robin
Abstract:
We present a selective review on probabilistic modeling of heterogeneity in random graphs. We focus on latent space models and more particularly on stochastic block models and their extensions that have undergone major developments in the last five years.
Submitted 25 September, 2014; v1 submitted 18 February, 2014;
originally announced February 2014.
-
Variational Bayes model averaging for graphon functions and motif frequencies inference in W-graph models
Authors:
P. Latouche,
S. Robin
Abstract:
W-graph refers to a general class of random graph models that can be seen as a random graph limit. It is characterized by both its graphon function and its motif frequencies. In this paper, relying on an existing variational Bayes algorithm for stochastic block models along with the corresponding weights for model averaging, we derive an estimate of the graphon function as an average of stochastic block models with an increasing number of blocks. In the same framework, we derive the variational posterior frequency of any motif. A simulation study and an illustration on a social network complete our work.
Submitted 5 November, 2015; v1 submitted 23 October, 2013;
originally announced October 2013.
-
Comparing change-point locations of independent profiles with application to gene annotation
Authors:
Alice Cleynen,
Stéphane Robin
Abstract:
We are interested in the comparison of transcript boundaries from cells which originated in different environments. The goal is to assess whether this phenomenon, called differential splicing, is used to modify the transcription of the genome in response to stress factors. We address this question by comparing the change-point locations in the individual segmentation of each profile, which corresponds to the RNA-Seq data for a gene in one growth condition. This requires the ability to evaluate the uncertainty of the change-point positions, and the work of Rigaill et al. (2011) provides an appropriate framework in such a case. Building on their approach, we propose two methods for the comparison of change-points, and illustrate our results on a yeast dataset. We show that the UTR boundaries are subject to differential splicing, while the intron boundaries are conserved in all profiles. Our approach is implemented in an R package called EBS, which is available on CRAN.
Submitted 11 July, 2013;
originally announced July 2013.
-
Finite state space non parametric Hidden Markov Models are in general identifiable
Authors:
Elisabeth Gassiat,
Alice Cleynen,
Stéphane Robin
Abstract:
In this paper, we prove that finite state space non parametric hidden Markov models are identifiable as soon as the transition matrix of the latent Markov chain has full rank and the emission probability distributions are linearly independent. We then propose several non parametric likelihood-based estimation methods, which we apply to models used in applications. We finally show on examples that the use of non parametric modeling and estimation may improve classification performance.
Submitted 19 June, 2013;
originally announced June 2013.
-
Hidden Markov Models with mixtures as emission distributions
Authors:
Stevenn Volant,
Caroline Bérard,
Marie-Laure Martin-Magniette,
Stéphane Robin
Abstract:
In unsupervised classification, Hidden Markov Models (HMM) are used to account for a neighborhood structure between observations. The emission distributions are often assumed to belong to some parametric family. In this paper, a semiparametric model, in which the emission distributions are mixtures of parametric distributions, is proposed to obtain greater flexibility. We show that the classical EM algorithm can be adapted to infer the model parameters. For the initialisation step, starting from a large number of components, a hierarchical method to combine them into the hidden states is proposed. Three likelihood-based criteria to select the components to be combined are discussed. To estimate the number of hidden states, BIC-like criteria are derived. A simulation study is carried out both to determine the best combination of merging criterion and model selection criterion and to evaluate the accuracy of classification. The proposed method is also illustrated using a biological dataset from the model plant Arabidopsis thaliana. An R package, HMMmix, is freely available on CRAN.
Submitted 22 June, 2012;
originally announced June 2012.
-
Segmentor3IsBack: an R package for the fast and exact segmentation of Seq-data
Authors:
Alice Cleynen,
Michel Koskas,
Emilie Lebarbier,
Guillem Rigaill,
Stephane Robin
Abstract:
Genome annotation is an important issue in biology which has long been addressed with gene prediction methods and manual experiments requiring biological expertise. The expanding Next Generation Sequencing technologies and their enhanced precision allow a new approach to the domain: the segmentation of RNA-Seq data to determine gene boundaries. Because of its almost linear complexity, we propose to use the Pruned Dynamic Programming Algorithm, whose performance has been acknowledged for CGH arrays, for Seq-experiment outputs. This requires the adaptation of the algorithm to the negative binomial distribution with which we model the data. We show that if the dispersion in the signal is known, the PDP algorithm can be used, and we provide an estimator for this dispersion. We then propose to estimate the number of segments, which can be associated with coding or non-coding regions of the genome, using an oracle penalty. We illustrate the results of our approach on a real dataset and show its good performance. Our algorithm is available as an R package on the CRAN repository.
Submitted 1 July, 2013; v1 submitted 25 April, 2012;
originally announced April 2012.
-
Estimation of a convex discrete distribution
Authors:
Cécile Durot,
François Koladjo,
Sylvie Huet,
Stéphane Robin
Abstract:
Non-parametric estimation of a convex discrete distribution may be of interest in several applications, such as the estimation of species abundance distribution in ecology. In this paper we study the least squares estimator of a discrete distribution under the constraint of convexity. We show that this estimator exists and is unique, and that it always outperforms the classical empirical estimator in terms of the $\ell_{2}$-distance. We provide an algorithm for its computation, based on the support reduction algorithm. We compare its performance to that of the empirical estimator in a simulation study.
Submitted 28 February, 2012;
originally announced February 2012.
-
Variational Bayes approach for model aggregation in unsupervised classification with Markovian dependency
Authors:
Stevenn Volant,
Marie-Laure Martin Magniette,
Stéphane Robin
Abstract:
We consider a binary unsupervised classification problem where each observation is associated with an unobserved label that we want to retrieve. More precisely, we assume that there are two groups of observations: normal and abnormal. The 'normal' observations come from a known distribution, whereas the distribution of the 'abnormal' observations is unknown. Several models have been developed to fit this unknown distribution. In this paper, we propose an alternative based on a mixture of Gaussian distributions. The inference is done within a variational Bayesian framework and our aim is to infer the posterior probability of belonging to the class of interest. To this end, it makes no sense to estimate the number of mixture components, since each mixture model provides more or less relevant information for the posterior probability estimation. By computing a weighted average (named aggregated estimator) over the model collection, Bayesian Model Averaging (BMA) is one way of combining models in order to account for the information provided by each model. The aim is then the estimation of the weights and of the posterior probability for one specific model. In this work, we derive optimal approximations of these quantities from the variational theory and propose other approximations of the weights. To apply our method, we consider that the data are dependent (Markovian dependency) and hence we consider a Hidden Markov Model. A simulation study is carried out to evaluate the accuracy of the estimates in terms of classification. We also present an application to the analysis of public health surveillance systems.
Submitted 4 May, 2011;
originally announced May 2011.
-
Unsupervised Classification for Tiling Arrays: ChIP-chip and Transcriptome
Authors:
Caroline Bérard,
Marie-Laure Martin-Magniette,
Véronique Brunaud,
Sébastien Aubourg,
Stéphane Robin
Abstract:
Tiling arrays make possible a large-scale exploration of the genome thanks to probes which cover the whole genome at very high density, with up to 2,000,000 probes. The biological questions usually addressed are either the expression difference between two conditions or the detection of transcribed regions. In this work we propose to consider both questions simultaneously as an unsupervised classification problem by modeling the joint distribution of the two conditions. In contrast to previous methods, we account for all available information on the probes as well as biological knowledge such as annotation and spatial dependence between probes. Since probes are not biologically relevant units, we propose a classification rule for non-connected regions covered by several probes. Applications to transcriptomic and ChIP-chip data of Arabidopsis thaliana obtained with a NimbleGen tiling array highlight the importance of precise modeling and of the region classification.
Submitted 28 April, 2011;
originally announced April 2011.
-
Uncovering latent structure in valued graphs: A variational approach
Authors:
Mahendra Mariadassou,
Stéphane Robin,
Corinne Vacher
Abstract:
As more and more network-structured data sets are available, the statistical analysis of valued graphs has become commonplace. Looking for a latent structure is one of the many strategies used to better understand the behavior of a network. Several methods already exist for the binary case. We present a model-based strategy to uncover groups of nodes in valued graphs. This framework can be used for a wide span of parametric random graph models and allows covariates to be included. Variational tools allow us to achieve approximate maximum likelihood estimation of the parameters of these models. We provide a simulation study showing that our estimation method performs well over a broad range of situations. We apply this method to analyze host-parasite interaction networks in forest ecosystems.
Submitted 8 November, 2010;
originally announced November 2010.
-
Exact posterior distributions over the segmentation space and model selection for multiple change-point detection problems
Authors:
Guillem Rigaill,
Emilie Lebarbier,
Stéphane Robin
Abstract:
In segmentation problems, inference on change-point position and model selection are two difficult issues due to the discrete nature of change-points. In a Bayesian context, we derive exact, non-asymptotic, explicit and tractable formulae for the posterior distribution of variables such as the number of change-points or their positions. We also derive a new selection criterion that accounts for the reliability of the results. All these results are based on an efficient strategy to explore the whole segmentation space, which is very large. We illustrate our methodology on both simulated data and a comparative genomic hybridisation profile.
Submitted 25 April, 2010;
originally announced April 2010.