-
Multivariate Generalised Linear Mixed Models With Graphical Latent Covariance Structure
Authors:
Jeanett S. Pelck,
Rodrigo Labouriau
Abstract:
This paper introduces a method for studying the correlation structure of a range of responses modelled by a multivariate generalised linear mixed model (MGLMM). The methodology requires the existence of clusters of observations and that each of the several responses studied is modelled using a generalised linear mixed model (GLMM) containing random components representing the clusters. We construct a MGLMM by assuming that the distribution of each of the random components representing the clusters is the marginal distribution of a (sufficiently regular) multivariate elliptically contoured distribution. We use an undirected graphical model to represent the correlation structure of the random components representing the clusters of observations for each response. This representation allows us to draw conclusions regarding unknown underlying determining factors related to the clusters of observations. Using a combination of an undirected graph and a directed acyclic graph (DAG), we jointly represent the correlation structure of the responses and the related random components. Applying the theory of graphical models allows us to describe and draw conclusions on the correlation and, in some cases, the dependence between responses of different statistical nature (e.g., following different distributions, different linear predictors and link functions). We present some simulation studies illustrating the proposed methodology.
Submitted 30 July, 2021;
originally announced July 2021.
-
Conditional Inference for Multivariate Generalised Linear Mixed Models
Authors:
Jeanett S. Pelck,
Rodrigo Labouriau
Abstract:
We propose a method for inference in generalised linear mixed models (GLMMs) and several extensions of these models. First, we extend the GLMM by allowing the distribution of the random components to be non-Gaussian, that is, assuming an absolutely continuous distribution with respect to the Lebesgue measure that is symmetric around zero, unimodal and with finite moments up to fourth order. Second, we allow the conditional distribution to follow a dispersion model instead of exponential dispersion models. Finally, we extend these models to a multivariate framework where multiple responses are combined by imposing a multivariate absolutely continuous distribution on the random components representing common clusters of observations in all the marginal models.
Maximum likelihood inference in these models involves evaluating an integral that often cannot be computed in closed form. We suggest an inference method that predicts values of random components and does not involve the integration of conditional likelihood quantities.
The multivariate GLMMs that we studied can be constructed with marginal GLMMs of different statistical nature and, at the same time, represent complex dependence structures, providing a rather flexible tool for applications.
Submitted 25 July, 2021;
originally announced July 2021.
-
Multivariate Methods for Detection of Rubbery Rot in Storage Apples by Monitoring Volatile Organic Compounds: An Example of Multivariate Generalised Mixed Models
Authors:
J. S. Pelck,
H. Holthusen,
M. Edelenbos,
A. Luca,
R. Labouriau
Abstract:
This article is a case study illustrating the use of a multivariate statistical method for screening potential chemical markers for early detection of post-harvest disease in storage fruit. We simultaneously measure a range of volatile organic compounds (VOCs) and two measures of severity of disease infection in apples under storage: the number of apples presenting visible symptoms and the lesion area. We use multivariate generalised linear mixed models (MGLMM) for studying association patterns of those simultaneously observed responses via the covariance structure of random components. Remarkably, those MGLMMs can be used to represent patterns of association between quantities of different statistical nature. In the particular example considered in this paper, there are positive responses (concentrations of VOC, Gamma distribution based models), positive responses possibly containing observations with zero values (lesion area, Compound Poisson distribution based models) and binomially distributed responses (proportion of apples presenting infection symptoms). We represent patterns of association inferred with the MGLMMs using graphical models (a network represented by a graph), which allow us to eliminate spurious associations due to a cascade of indirect correlations between the responses.
Submitted 23 July, 2021;
originally announced July 2021.
-
A Multivariate Methodology for Analysing Students' Performance Using Register Data
Authors:
Jeanett S. Pelck,
Rafael Pimentel Maia,
Hildete P. Pinheiro,
Rodrigo Labouriau
Abstract:
We present a new method for jointly modelling the students' results in the university's admission exams and their performance in subsequent courses at the university. The case considered involved all the students enrolled at the University of Campinas in 2014 to evening studies programs in educational branches related to exact sciences. We collected the number of attempts used for passing the university course of geometry and the results of the admission exams of those students in seven disciplines. The method introduced involved a combination of multivariate generalised linear mixed models (GLMM) and graphical models for representing the covariance structure of the random components. The models we used allowed us to discuss the association of quantities of very different nature. We used Gaussian GLMM for modelling the performance in the admission exams and a frailty discrete-time Cox proportional model, represented by a GLMM, to describe the number of attempts for passing Geometry.
The analyses were stratified into two populations: the students who received a bonus giving advantages in the university's admission process to compensate for social and racial inequalities and those who did not receive the compensation. The two populations presented different patterns. Using general properties of graphical models, we argue that, on the one hand, the predicted performance in the admission exam of Mathematics could be used as the sole predictor of the performance in geometry for the students who received the bonus. On the other hand, the Portuguese admission exam's predicted performance could be used as a single predictor of the performance in geometry for the students who did not receive the bonus.
Submitted 21 February, 2021;
originally announced February 2021.
-
Inference Functions for Semiparametric Models
Authors:
Rodrigo Labouriau
Abstract:
The paper discusses inference techniques for semiparametric models based on suitable versions of inference functions. The text contains two parts. In the first part, we review the optimality theory for non-parametric models based on the notions of path differentiability and statistical functional differentiability. Those notions are adapted to the context of semiparametric models by applying the inference theory of statistical functionals to the functional that associates the value of the interest parameter to the corresponding probability measure. The second part of the paper discusses the theory of inference functions for semiparametric models. We define a class of regular inference functions, and provide two equivalent characterisations of those inference functions: one adapted from the classic theory of inference functions for parametric models, and one motivated by differential geometric considerations concerning the statistical model. Those characterisations yield an optimality theory for estimation under semiparametric models. We present a necessary and sufficient condition for the coincidence of the bound for the concentration of estimators based on inference functions and the semiparametric Cramér-Rao bound. Projecting the score function for the parameter of interest on specially designed spaces of functions, we obtain optimal inference functions. Considering estimation when a sufficient statistic is present, we provide an alternative justification for the conditioning principle in a context of semiparametric models. The article closes with a characterisation of when the semiparametric Cramér-Rao bound is attained by estimators derived from regular inference functions.
Submitted 14 November, 2020;
originally announced November 2020.
-
Using Multivariate Generalised Linear Mixed Models for Studying Roots Development: An Example Based on Minirhizotron Observations
Authors:
Jeanett S. Pelck,
Rodrigo Labouriau
Abstract:
The characterisation of the spatial and temporal distribution of the root system in a cultivated field depends on the soil volume occupied by the root systems (the scatter), and the local intensity of the root colonisation in the field (the intensity). We introduce a multivariate generalised linear mixed model for simultaneously describing the scatter and the intensity using data obtained with minirhizotrons (i.e., tubes with observation windows, which are inserted in the soil, enabling direct observation of the roots). The models presented allow studying intricate spatial and temporal dependence patterns using a graphical model to represent the dependence structure of latent random components.
The scatter is described by a binomial mixed model (presence of roots in observation windows). The number of roots crossing the reference lines in the observation windows of the minirhizotron is used to estimate the intensity through a specially defined Poisson mixed model. We exploit the fact that it is possible to construct multivariate extensions of generalised linear mixed models that allow us to simultaneously represent patterns of dependency of the scatter and the intensity along time and space.
We present an example where the intensity and scatter are simultaneously determined at three different time points. A positive association between the intensity and scatter at each time point was found, suggesting that the plants are not compensating for a reduced occupation of the soil by increasing the number of roots per volume of soil. Using the general properties of graphical models, we identify a first-order Markovian dependence pattern between successively observed scatters and intensities. This lack of memory indicates that no long-lasting temporal causal effects are affecting the roots' development. The two dependence patterns described above cannot be detected with univariate models.
Submitted 1 November, 2020;
originally announced November 2020.
-
Construction and Extension of Dispersion Models
Authors:
Rodrigo Labouriau
Abstract:
There are two main classes of dispersion models studied in the literature: proper (PDM), and exponential dispersion models (EDM). Dispersion models that are neither proper nor exponential dispersion models are termed here non-standard dispersion models (NSDM). This paper exposes a technique for constructing new PDMs and NSDMs. This construction provides a solution to an open question in the theory of dispersion models about the extension of non-standard dispersion models. Given a unit deviance function, a dispersion model is usually constructed by calculating a normalising function that makes the density function integrate to one. This calculation involves the solution of non-trivial integral equations. The main idea explored here is to use characteristic functions of real non-lattice symmetric probability measures to construct a family of unit deviances that are sufficiently regular to make the associated integral equations tractable. The integral equations associated with those unit deviances admit a trivial solution, in the sense that the normalising function is a constant function independent of the observed values. However, we show, using the machinery of distributions (i.e., generalised functions) and expansions of the normalising function with respect to specially constructed Riesz systems, that those integral equations also admit infinitely many non-trivial solutions, generating many NSDMs. We conclude that the cardinality of the class of non-standard dispersion models is larger than the cardinality of the class of real non-lattice symmetric probability measures.
Submitted 18 August, 2020; v1 submitted 12 August, 2020;
originally announced August 2020.
-
On the Bias of the Score Function of Finite Mixture Models
Authors:
Rodrigo Labouriau
Abstract:
We characterise the unbiasedness of the score function, viewed as an inference function for a class of finite mixture models. The models studied represent the situation where there is a stratification of the observations in a finite number of groups. We show that, under mild regularity conditions, the score function for estimating the parameters identifying each group's distribution is unbiased. We also show that if one introduces a mixture in the scenario described above so that, for some observations, it is only known that they belong to some of the groups with a probability not in $\{ 0, 1 \}$, then the score function becomes biased. We then argue that, under further mild regularity conditions, the maximum likelihood estimate is not consistent. The results above are extended to regular models containing arbitrary nuisance parameters, including semiparametric models.
Submitted 15 May, 2023; v1 submitted 9 February, 2020;
originally announced February 2020.
-
An introduction to Bent Jorgensen's ideas
Authors:
Gauss M. Cordeiro,
Rodrigo Labouriau,
Denise A. Botter
Abstract:
We briefly expose some key aspects of the theory and use of dispersion models, for which Bent Jorgensen played a crucial role as a driving force and a source of inspiration. Starting with the general notion of dispersion models, built using minimalistic mathematical assumptions, we specialize in two classes of families of distributions with different statistical flavors: exponential dispersion and proper dispersion models. The construction of dispersion models involves the solution of integral equations that are, in general, intractable. These difficulties disappear when more mathematical structure is assumed: the construction reduces to the calculation of a moment generating function or of a Riemann-Stieltjes integral for the exponential dispersion and the proper dispersion models, respectively. A new technique for constructing dispersion models based on characteristic functions is introduced, turning the integral equations above into a tractable convolution equation and yielding examples of dispersion models that are neither proper dispersion nor exponential dispersion models. A corollary is that the cardinalities of the classes of regular and non-regular dispersion models are both large.
Some selected applications are discussed, including exponential family non-linear models (for which generalized linear models are particular cases) and several models for clustered and dependent data based on a latent Lévy process.
Submitted 25 August, 2020; v1 submitted 19 September, 2019;
originally announced September 2019.
-
The Laplace transform and polynomial approximation in L2
Authors:
Rodrigo Labouriau
Abstract:
This short note gives a sufficient condition for having the class of polynomials dense in the space of square integrable functions with respect to a finite measure dominated by the Lebesgue measure in the real line, here denoted by $L^2$. It is shown that if the Laplace transform of the measure in play is bounded in a neighbourhood of the origin, then the moments of all orders are finite and the class of polynomials is dense in $L^2$. The existence of the moments of all orders is well known for the case where the measure is concentrated in the positive real line (see Feller, 1966), but the result concerning the polynomial approximation is original, even though the proof is relatively simple. Additionally, an alternative, stronger condition is given that is easier to verify and does not involve the calculation of the Laplace transform. The condition essentially says that the density of the measure should have exponentially decaying tails. The tools presented are of interest for constructing semiparametric extensions of classic parametric models.
Submitted 9 March, 2016;
originally announced March 2016.
-
A Note on the Identifiability of Generalized Linear Mixed Models
Authors:
Rodrigo Labouriau
Abstract:
I present here a simple proof that, under general regularity conditions, the standard parametrization of generalized linear mixed models is identifiable. The proof is based on the assumptions of generalized linear mixed models on the first and second order moments and some general mild regularity conditions and, therefore, is extensible to quasi-likelihood based generalized linear models. In particular, binomial and Poisson mixed models with dispersion parameter are identifiable when equipped with the standard parametrization.
Submitted 4 May, 2014;
originally announced May 2014.
-
Multivariate Survival Mixed Models for Genetic Analysis of Longevity Traits
Authors:
Rafael Pimentel Maia,
Per Madsen,
Rodrigo Labouriau
Abstract:
A class of multivariate mixed survival models for continuous and discrete time with a complex covariance structure is introduced in a context of quantitative genetic applications. The methods introduced can be used in many applications in quantitative genetics, although the discussion presented concentrates on longevity studies. The framework presented allows combining models based on continuous time with models based on discrete time in a joint analysis. The continuous time models are approximations of the frailty model in which the hazard function is assumed to be piece-wise constant. The discrete time models used are multivariate variants of the discrete relative risk models. These models allow for regular parametric likelihood-based inference by exploiting a coincidence of their likelihood functions and the likelihood functions of suitably defined multivariate generalized linear mixed models. The models include a dispersion parameter, which is essential for obtaining a decomposition of the variance of the trait of interest as a sum of parcels representing the additive genetic effects, environmental effects and unspecified sources of variability, as required in quantitative genetic applications. The methods presented are implemented in such a way that large and complex quantitative genetic data can be analyzed.
Submitted 4 May, 2014; v1 submitted 4 March, 2013;
originally announced March 2013.
-
Characterization of differentially expressed genes using high-dimensional co-expression networks
Authors:
Gabriel C. G. de Abreu,
Rodrigo Labouriau
Abstract:
We present a technique to characterize differentially expressed genes in terms of their position in a high-dimensional co-expression network. The set-up of Gaussian graphical models is used to construct representations of the co-expression network in such a way that redundancy and the propagation of spurious information along the network are avoided. The proposed inference procedure is based on the minimization of the Bayesian Information Criterion (BIC) in the class of decomposable graphical models. This class of models can be used to represent complex relationships and has suitable properties that allow effective inference in problems with a high degree of complexity (e.g. several thousands of genes) and a small number of observations (e.g. 10-100), as typically occurs in high throughput gene expression studies. Taking advantage of the internal structure of decomposable graphical models, we construct a compact representation of the co-expression network that allows identifying the regions with a high concentration of differentially expressed genes. It is argued that differentially expressed genes located in highly interconnected regions of the co-expression network are less informative than differentially expressed genes located in less interconnected regions. Based on that idea, a measure of uncertainty that resembles the notion of relative entropy is proposed. Our methods are illustrated with three publicly available data sets on microarray experiments (the largest involving more than 50,000 genes and 64 patients) and a short simulation study.
Submitted 15 November, 2010;
originally announced November 2010.
-
High-dimensional Graphical Model Search with gRapHD R Package
Authors:
Gabriel C. G. de Abreu,
Rodrigo Labouriau,
David Edwards
Abstract:
This paper presents the R package gRapHD for efficient selection of high-dimensional undirected graphical models. The package provides tools for selecting trees, forests and decomposable models minimizing information criteria such as AIC or BIC, and for displaying the independence graphs of the models. It also has some useful tools for analysing graphical structures. It supports the use of discrete, continuous, or both types of variables simultaneously.
Submitted 22 September, 2010; v1 submitted 7 September, 2009;
originally announced September 2009.
-
An efficient strategy to characterize alleles and complex haplotypes using DNA-markers
Authors:
Rodrigo Labouriau,
Poul Sørensen,
Helle R. Juul-Madsen
Abstract:
We consider the problem of detecting and estimating the strength of association between a trait of interest and alleles or haplotypes in a small genomic region (e.g. a gene or a gene complex), when no direct information on that region is available but the values of neighbouring DNA-markers are at hand. We argue that the effects of the non-observable haplotypes of the genomic regions can and should be represented by factors representing disjoint groups of marker-alleles. A theoretical argument based on a hypothetical phylogenetic tree supports this general claim.
The techniques described allow one to identify, and to infer the number of, detectable haplotypes in the genomic region that are associated with a trait. The methods proposed use an exhaustive combinatorial search coupled with the maximization of a version of the likelihood function penalized for the number of parameters. This procedure can easily be implemented with standard statistical methods for a moderate number of marker-alleles.
Submitted 10 April, 2008;
originally announced April 2008.