Neural Networks beyond explainability: Selective inference for sequence motifs

Authors: Antoine Villié, Philippe Veber, Yohann de Castro, Laurent Jacob

Abstract: Over the past decade, neural networks have been successful at making predictions from biological sequences, especially in the context of regulatory genomics. As in other fields of deep learning, tools have been devised to extract features such as sequence motifs that can explain the predictions made by a trained network. Here we intend to go beyond explainable machine learning and introduce SEISM, a selective inference procedure to test the association between these extracted features and the predicted phenotype. In particular, we discuss how training a one-layer convolutional network is formally equivalent to selecting motifs maximizing some association score. We adapt existing sampling-based selective inference procedures by quantizing this selection over an infinite set to a large but finite grid. Finally, we show that sampling under a specific choice of parameters is sufficient to characterize the composite null hypothesis typically used for selective inference, a result that goes well beyond our particular framework. We illustrate the behavior of our method in terms of calibration, power and speed and discuss its power/speed trade-off with a simpler data-split strategy. SEISM paves the way to an easier analysis of neural networks used in regulatory genomics, and to more powerful methods for genome-wide association studies (GWAS).

Submitted 23 December, 2022; originally announced December 2022.

arXiv:1901.03864 [pdf]

Anatomy of the vertebral column lymphatic network in mice

Authors: Laurent Jacob, Ligia Boisserand, Juliette Pestel, Salli Antila, Jean-Mickael Thomas, Marie-Stephane Aigrot, Thomas Mathivet, Seyoung Lee, Kari Alitalo, Nicolas Renier, Anne Eichmann, Jean-Leon Thomas

Abstract: Cranial lymphatic vessels (LVs) are involved in transport of fluids, macromolecules and CNS immune responses. Little information about spinal LVs is available, because these delicate structures are embedded within vertebral tissues and difficult to visualize using traditional histology. Here we reveal an extended vertebral column LV network using three-dimensional imaging of decalcified iDISCO-clarified spine segments. Spinal LVs are metameric circuits exiting along spinal nerve roots and connecting to lymph nodes and the thoracic duct. They navigate in the epidural space and the dura mater around the spinal cord, and associate with leukocytes, peripheral dorsal root and sympathetic ganglia. Spinal LVs are VEGF-C-dependent and remodel extensively after spinal cord injury. They constitute an extension to cranial circuits for meningeal fluids, but also a route for perineural fluids and a link with peripheral immune and nervous circuits. Vertebral column LVs may be potential targets to improve the maintenance and repair of spinal tissues as well as gatekeepers of CNS immunity.

Submitted 12 January, 2019; originally announced January 2019.

Comments: 8 figures + 2 supplemental figures

arXiv:1211.4259 [pdf, ps, other]

Correcting gene expression data when neither the unwanted variation nor the factor of interest are observed

Authors: Laurent Jacob, Johann Gagnon-Bartsch, Terence P. Speed

Abstract: When dealing with large-scale gene expression studies, observations are commonly contaminated by unwanted variation factors such as platforms or batches. Not taking this unwanted variation into account when analyzing the data can lead to spurious associations and to missing important signals. When the analysis is unsupervised, e.g., when the goal is to cluster the samples or to build a corrected version of the dataset (as opposed to the study of an observed factor of interest), taking unwanted variation into account can become a difficult task. The unwanted variation factors may be correlated with the unobserved factor of interest, so that correcting for the former can remove the latter if not done carefully. We show how negative control genes and replicate samples can be used to estimate unwanted variation in gene expression, and discuss how this information can be used to correct the expression data or build estimators for unsupervised problems. The proposed methods are then evaluated on three gene expression datasets. They generally manage to remove unwanted variation without losing the signal of interest and compare favorably to state-of-the-art corrections.

Submitted 18 November, 2012; originally announced November 2012.

arXiv:1206.6980 [pdf, ps, other]

doi 10.1214/11-AOAS528

More power via graph-structured tests for differential expression of gene networks

Authors: Laurent Jacob, Pierre Neuvial, Sandrine Dudoit

Abstract: We consider multivariate two-sample tests of means, where the location shift between the two populations is expected to be related to a known graph structure. An important application of such tests is the detection of differentially expressed genes between two patient populations, as shifts in expression levels are expected to be coherent with the structure of graphs reflecting gene properties such as biological process, molecular function, regulation or metabolism. For a fixed graph of interest, we demonstrate that accounting for graph structure can yield more powerful tests under the assumption of smooth distribution shift on the graph. We also investigate the identification of nonhomogeneous subgraphs of a given large graph, which poses both computational and multiple hypothesis testing problems. The relevance and benefits of the proposed approach are illustrated on synthetic data and on breast and bladder cancer gene expression data analyzed in the context of KEGG and NCI pathways.

Submitted 29 June, 2012; originally announced June 2012.

Comments: Published at http://dx.doi.org/10.1214/11-AOAS528 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org). arXiv admin note: substantial text overlap with arXiv:1009.5173

Report number: IMS-AOAS-AOAS528

Journal ref: Annals of Applied Statistics 2012, Vol. 6, No. 2, 561-600

arXiv:1009.5173 [pdf, ps, other]

doi 10.1214/11-AOAS528

Gains in Power from Structured Two-Sample Tests of Means on Graphs

Authors: Laurent Jacob, Pierre Neuvial, Sandrine Dudoit

Abstract: We consider multivariate two-sample tests of means, where the location shift between the two populations is expected to be related to a known graph structure. An important application of such tests is the detection of differentially expressed genes between two patient populations, as shifts in expression levels are expected to be coherent with the structure of graphs reflecting gene properties such as biological process, molecular function, regulation, or metabolism. For a fixed graph of interest, we demonstrate that accounting for graph structure can yield more powerful tests under the assumption of smooth distribution shift on the graph. We also investigate the identification of non-homogeneous subgraphs of a given large graph, which poses both computational and multiple testing problems. The relevance and benefits of the proposed approach are illustrated on synthetic data and on breast cancer gene expression data analyzed in the context of KEGG pathways.

Submitted 27 September, 2010; originally announced September 2010.

Journal ref: Annals of Applied Statistics 2012, Vol. 6, No. 2, 561-600

arXiv:1001.3109 [pdf, ps, other]

Increasing stability and interpretability of gene expression signatures

Authors: Anne-Claire Haury, Laurent Jacob, Jean-Philippe Vert

Abstract: Motivation: Molecular signatures for diagnosis or prognosis estimated from large-scale gene expression data often lack robustness and stability, rendering their biological interpretation challenging. Increasing the signature's interpretability and stability across perturbations of a given dataset and, if possible, across datasets, is urgently needed to ease the discovery of important biological processes and, eventually, new drug targets. Results: We propose a new method to construct signatures with increased stability and easier interpretability. The method uses a gene network as side interpretation and enforces a large connectivity among the genes in the signature, leading to signatures typically made of genes clustered in a few subnetworks. It combines the recently proposed graph Lasso procedure with a stability selection procedure. We evaluate its relevance for the estimation of a prognostic signature in breast cancer, and highlight in particular the increase in interpretability and stability of the signature.

Submitted 18 January, 2010; originally announced January 2010.

arXiv:0801.4301 [pdf, ps, other]

Virtual screening of GPCRs: an in silico chemogenomics approach

Authors: Laurent Jacob, Brice Hoffmann, Véronique Stoven, Jean-Philippe Vert

Abstract: The G-protein coupled receptor (GPCR) superfamily is currently the largest class of therapeutic targets. In silico prediction of interactions between GPCRs and small molecules is therefore a crucial step in the drug discovery process, which remains a daunting task owing to the difficulty of characterizing the 3D structure of most GPCRs, and to the limited number of known ligands for some members of the superfamily. Chemogenomics, which attempts to characterize interactions between all members of a target class and all small molecules simultaneously, has recently been proposed as an interesting alternative to traditional docking or ligand-based virtual screening strategies. We propose new methods for in silico chemogenomics and validate them on the virtual screening of GPCRs. The methods represent an extension of a recently proposed machine learning strategy, based on support vector machines (SVM), which provides a flexible framework to incorporate various information sources on the biological space of targets and on the chemical space of small molecules. We investigate the use of 2D and 3D descriptors for small molecules, and test a variety of descriptors for GPCRs. We show for instance that incorporating information about the known hierarchical classification of the target family and about key residues in their inferred binding pockets significantly improves the prediction accuracy of our model. In particular, we are able to predict ligands of orphan GPCRs with an estimated accuracy of 78.1%.

Submitted 28 January, 2008; originally announced January 2008.

arXiv:0709.3931 [pdf, ps, other]

Kernel methods for in silico chemogenomics

Authors: Laurent Jacob, Jean-Philippe Vert

Abstract: Predicting interactions between small molecules and proteins is a crucial ingredient of the drug discovery process. In particular, accurate predictive models are increasingly used to preselect potential lead compounds from large molecule databases, or to screen for side effects. While classical in silico approaches focus on predicting interactions with a given specific target, new chemogenomics approaches adopt cross-target views. Building on recent developments in the use of kernel methods in bio- and chemoinformatics, we present a systematic framework to screen the chemical space of small molecules for interaction with the biological space of proteins. We show that this framework allows information sharing across the targets, resulting in a dramatic improvement of ligand prediction accuracy for three important classes of drug targets: enzymes, GPCRs and ion channels.

Submitted 25 September, 2007; originally announced September 2007.

arXiv:q-bio/0702008 [pdf, ps, other]

Epitope prediction improved by multitask support vector machines

Authors: Laurent Jacob, Jean-Philippe Vert

Abstract: Motivation: In silico methods for the prediction of antigenic peptides binding to MHC class I molecules play an increasingly important role in the identification of T-cell epitopes. Statistical and machine learning methods, in particular, are widely used to score candidate epitopes based on their similarity with known epitopes and non-epitopes. The genes coding for the MHC molecules, however, are highly polymorphic, and statistical methods have difficulty building models for alleles with few known epitopes. In this case, recent work has demonstrated the utility of leveraging information across alleles to improve the performance of the prediction. Results: We design a support vector machine algorithm that is able to learn epitope models for all alleles simultaneously, by sharing information across similar alleles. The sharing of information across alleles is controlled by a user-defined measure of similarity between alleles. We show that this similarity can be defined in terms of supertypes, or more directly by comparing key residues known to play a role in the peptide-MHC binding. We illustrate the potential of this approach on various benchmark experiments where it outperforms other state-of-the-art methods.

Submitted 6 February, 2007; originally announced February 2007.

Journal ref: We use various multitask kernels in order to improve MHC-I-peptide binding prediction, in particular for MHC alleles for which few training data is available. (05/02/2007)

Showing 1–9 of 9 results for author: Jacob, L