Search | arXiv e-print repository

doi 10.1002/bimj.202100105

Semi-supervised empirical Bayes group-regularized factor regression

Authors: Magnus M. Münch, Mark A. van de Wiel, Aad W. van der Vaart, Carel F. W. Peeters

Abstract: The features in high dimensional biomedical prediction problems are often well described with lower dimensional manifolds. An example is genes that are organised in smaller functional networks. The outcome can then be described with the factor regression model. A benefit of the factor model is that it allows for straightforward inclusion of unlabeled observations in the estimation of the model, i.e., semi-supervised learning. In addition, the high dimensional features in biomedical prediction problems are often well characterised. Examples are genes, for which annotation is available, and metabolites with $p$-values from a previous study available. In this paper, the extra information on the features is included in the prior model for the features. The extra information is weighted and included in the estimation through empirical Bayes, with variational approximations to speed up the computation. The method is demonstrated in simulations and two applications. One application considers influenza vaccine efficacy prediction based on microarray data. The second application predicts oral cancer metastasis from RNAseq data.

Submitted 6 April, 2021; originally announced April 2021.

Comments: 19 pages, 5 figures, submitted to Biometrical Journal

Journal ref: Biometrical Journal, 64(7): 1289-1306 (2022)

arXiv:1901.10217

Incorporating prior information and borrowing information in high-dimensional sparse regression using the horseshoe and variational Bayes

Authors: Gino B. Kpogbezan, Mark A. van de Wiel, Wessel N. van Wieringen, Aad W. van der Vaart

Abstract: We introduce a sparse high-dimensional regression approach that can incorporate prior information on the regression parameters and can borrow information across a set of similar datasets. Prior information may for instance come from previous studies or genomic databases, and information may be borrowed across a set of genes or genomic networks. The approach is based on prior modelling of the regression parameters using the horseshoe prior, with a prior on the sparsity index that depends on external information. Multiple datasets are integrated by applying an empirical Bayes strategy on hyperparameters. For computational efficiency we approximate the posterior distribution using a variational Bayes method. The proposed framework is useful for analysing large-scale data sets with complex dependence structures. We illustrate this by applications to the reconstruction of gene regulatory networks and to eQTL mapping.

Submitted 29 January, 2019; originally announced January 2019.

arXiv:1805.00389

doi 10.1093/biostatistics/kxz062

Adaptive group-regularized logistic elastic net regression

Authors: Magnus M. Münch, Carel F. W. Peeters, Aad W. van der Vaart, Mark A. van de Wiel

Abstract: In high-dimensional data settings, additional information on the features is often available. Examples of such external information in omics research are: (a) p-values from a previous study, (b) a summary of prior information, and (c) omics annotation. The inclusion of this information in the analysis may enhance classification performance and feature selection, but is not straightforward in the standard regression setting. As a solution to this problem, we propose a group-regularized (logistic) elastic net regression method, where each penalty parameter corresponds to a group of features based on the external information. The method, termed gren, makes use of the Bayesian formulation of logistic elastic net regression to estimate both the model and penalty parameters in an approximate empirical-variational Bayes framework. Simulations and an application to a colon cancer microRNA study show that, if the partitioning of the features is informative, classification performance and feature selection are indeed enhanced.

Submitted 1 May, 2018; originally announced May 2018.

Comments: 19 pages, 5 figures, supplementary material available from first author's personal website

Journal ref: Biostatistics, 22(4): 723-737, 2021

arXiv:1711.06926

The Bayes Lepski's Method and Credible Bands through Volume of Tubular Neighborhoods

Authors: William Weimin Yoo, Aad W. van der Vaart

Abstract: For a general class of priors based on random series basis expansion, we develop the Bayes Lepski's method to estimate an unknown regression function. In this approach, the series truncation point is determined based on a stopping rule that balances the posterior mean bias and the posterior standard deviation. Equipped with this mechanism, we present a method to construct adaptive Bayesian credible bands, where this statistical task is reformulated into a problem in geometry, and the band's radius is computed based on finding the volume of a certain tubular neighborhood embedded on a unit sphere. We consider two special cases involving B-splines and wavelets, and discuss some interesting consequences such as the uncertainty principle and self-similarity. Lastly, we show how to program the Bayes Lepski stopping rule on a computer, and numerical simulations in conjunction with our theoretical investigations concur that this is a promising Bayesian uncertainty quantification procedure.

Submitted 18 November, 2017; originally announced November 2017.

Comments: 42 pages, 2 figures, 1 table

MSC Class: Primary 62G15 62G05; secondary 62G08 62C10

arXiv:1605.07514

An empirical Bayes approach to network recovery using external knowledge

Authors: Gino B. Kpogbezan, Aad W. van der Vaart, Wessel N. van Wieringen, Gwenaël G. R. Leday, Mark A. van de Wiel

Abstract: Reconstruction of a high-dimensional network may benefit substantially from the inclusion of prior knowledge on the network topology. In the case of gene interaction networks such knowledge may come for instance from pathway repositories like KEGG, or be inferred from data of a pilot study. The Bayesian framework provides a natural means of including such prior knowledge. Based on a Bayesian Simultaneous Equation Model, we develop an appealing empirical Bayes procedure which automatically assesses the relevance of the used prior knowledge. We use a variational Bayes method to approximate the posterior densities and compare its accuracy with that of a Gibbs sampling strategy. Our method is computationally fast, and can outperform known competitors. In a simulation study we show that accurate prior data can greatly improve the reconstruction of the network, but need not harm the reconstruction if wrong. We demonstrate the benefits of the method in an analysis of gene expression data from GEO. In particular, the edges of the recovered network have superior reproducibility (compared to that of competitors) over resampled versions of the data.

Submitted 24 May, 2016; originally announced May 2016.

arXiv:1510.03771

Gene network reconstruction using global-local shrinkage priors

Authors: Gwenaël G. R. Leday, Mathisca C. M. de Gunst, Gino B. Kpogbezan, Aad W. Van der Vaart, Wessel N. Van Wieringen, Mark A. Van de Wiel

Abstract: Reconstructing a gene network from high-throughput molecular data is often a challenging task, as the number of parameters to estimate is easily much larger than the sample size. A conventional remedy is to regularize or penalize the model likelihood. In network models, this is often done locally in the neighbourhood of each node or gene. However, estimation of the many regularization parameters is often difficult and can result in large statistical uncertainties. In this paper we propose to combine local regularization with global shrinkage of the regularization parameters to borrow strength between genes and improve inference. We employ a simple Bayesian model with non-sparse, conjugate priors to facilitate the use of fast variational approximations to posteriors. We discuss empirical Bayes estimation of hyper-parameters of the priors, and propose a novel approach to rank-based posterior thresholding. Using extensive model- and data-based simulations, we demonstrate that the proposed inference strategy outperforms popular (sparse) methods, yields more stable edges, and is more reproducible.

Submitted 13 October, 2015; originally announced October 2015.

Comments: 27 pages, 5 figures

arXiv:1312.1795

doi 10.1214/12-AOAS605

Modeling association between DNA copy number and gene expression with constrained piecewise linear regression splines

Authors: Gwenaël G. R. Leday, Aad W. van der Vaart, Wessel N. van Wieringen, Mark A. van de Wiel

Abstract: DNA copy number and mRNA expression are widely used data types in cancer studies, which combined provide more insight than separately. Whereas in existing literature the form of the relationship between these two types of markers is fixed a priori, in this paper we model their association. We employ piecewise linear regression splines (PLRS), which combine good interpretation with sufficient flexibility to identify any plausible type of relationship. The specification of the model leads to estimation and model selection in a constrained, nonstandard setting. We provide methodology for testing the effect of DNA on mRNA and choosing the appropriate model. Furthermore, we present a novel approach to obtain reliable confidence bands for constrained PLRS, which incorporates model uncertainty. The procedures are applied to colorectal and breast cancer data. Common assumptions are found to be potentially misleading for biologically relevant genes. More flexible models may bring more insight into the interaction between the two markers.

Submitted 6 December, 2013; originally announced December 2013.

Comments: Published at http://dx.doi.org/10.1214/12-AOAS605 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOAS-AOAS605

Journal ref: Annals of Applied Statistics 2013, Vol. 7, No. 2, 823-845

Showing 1–7 of 7 results for author: van der Vaart, A W