Search | arXiv e-print repository

arXiv:2003.06516 [pdf, other]

doi 10.1038/s41746-020-0301-z

Deep Representation Learning of Electronic Health Records to Unlock Patient Stratification at Scale

Authors: Isotta Landi, Benjamin S. Glicksberg, Hao-Chih Lee, Sarah Cherng, Giulia Landi, Matteo Danieletto, Joel T. Dudley, Cesare Furlanello, Riccardo Miotto

Abstract: Deriving disease subtypes from electronic health records (EHRs) can guide next-generation personalized medicine. However, challenges in summarizing and representing patient data prevent widespread practice of scalable EHR-based stratification analysis. Here we present an unsupervised framework based on deep learning to process heterogeneous EHRs and derive patient representations that can efficiently and effectively enable patient stratification at scale. We considered EHRs of 1,608,741 patients from a diverse hospital cohort comprising a total of 57,464 clinical concepts. We introduce a representation learning model based on word embeddings, convolutional neural networks, and autoencoders (i.e., ConvAE) to transform patient trajectories into low-dimensional latent vectors. We evaluated these representations as broadly enabling patient stratification by applying hierarchical clustering to different multi-disease and disease-specific patient cohorts. ConvAE significantly outperformed several baselines in a clustering task to identify patients with different complex conditions, with 2.61 entropy and 0.31 purity average scores. When applied to stratify patients within a certain condition, ConvAE led to various clinically relevant subtypes for different disorders, including type 2 diabetes, Parkinson's disease and Alzheimer's disease, largely related to comorbidities, disease progression, and symptom severity. With these results, we demonstrate that ConvAE can generate patient representations that lead to clinically meaningful insights. This scalable framework can help better understand varying etiologies in heterogeneous sub-populations and unlock patterns for EHR-based research in the realm of personalized medicine.

Submitted 18 July, 2020; v1 submitted 13 March, 2020; originally announced March 2020.

Comments: C.F. and R.M. share senior authorship

Journal ref: npj Digit. Med. 3, 96 (2020)
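
Note on the clustering scores: the abstract above reports entropy and purity but does not say how they are computed. The sketch below uses the standard textbook definitions (purity as the fraction of samples in each cluster's majority class, entropy as the size-weighted average of per-cluster label entropies). The base-2 logarithm, the function name and the toy labels are assumptions for illustration, not taken from the paper.

```python
from collections import Counter
from math import log2


def purity_and_entropy(cluster_ids, true_labels):
    """Return (purity, entropy) of a clustering against reference labels."""
    n = len(true_labels)
    # Count how often each reference label occurs inside each cluster.
    clusters = {}
    for cid, label in zip(cluster_ids, true_labels):
        clusters.setdefault(cid, Counter())[label] += 1

    # Purity: share of samples that belong to their cluster's majority label.
    purity = sum(max(counts.values()) for counts in clusters.values()) / n

    # Entropy: per-cluster label entropy, weighted by cluster size.
    entropy = 0.0
    for counts in clusters.values():
        size = sum(counts.values())
        h = -sum((k / size) * log2(k / size) for k in counts.values())
        entropy += (size / n) * h
    return purity, entropy


# Hypothetical toy cohort: two clusters, three conditions.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["T2D", "T2D", "PD", "PD", "PD", "AD"]
purity, entropy = purity_and_entropy(clusters, labels)
print(f"purity={purity:.3f} entropy={entropy:.3f}")
```

Higher purity and lower entropy both indicate clusters that align more closely with the reference conditions.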

arXiv:1909.07786 [pdf, other]

High Resolution Forecasting of Heat Waves impacts on Leaf Area Index by Multiscale Multitemporal Deep Learning

Authors: Andrea Gobbi, Marco Cristoforetti, Giuseppe Jurman, Cesare Furlanello

Abstract: Climate change impacts could cause progressive decrease of crop quality and yield, up to harvest failures. In particular, heat waves and other climate extremes can lead to localized food shortages and even threaten food security of communities worldwide. In this study, we apply a deep learning architecture for high resolution forecasting (300 m, 10 days) of the Leaf Area Index (LAI), whose dynamics have been widely used to model the growth phase of crops and the impact of heat waves. LAI models can be computed at 0.1 degree spatial resolution with an autoregressive component adjusted with weather conditions, validated with remote sensing measurements. However, model actionability is poor in regions of varying terrain morphology at this scale (about 8 km at the Alps latitude). Our deep learning model aims instead at forecasting LAI by training on multiscale multitemporal (MSMT) data from the Copernicus Global Land Service (CGLS) project for all Europe at 300 m resolution and medium-resolution historical weather data. Further, the deep learning model inputs integrate high-resolution land surface features, known to improve forecasts of agricultural productivity. The historical weather data are then replaced with forecast values to predict LAI values at a 10-day horizon over Europe. We propose the MSMT model to develop a high resolution crop-specific warning system for mitigating damage due to heat waves and other extreme events.

Submitted 13 September, 2019; originally announced September 2019.

arXiv:1909.06539 [pdf, other]

AI slipping on tiles: data leakage in digital pathology

Authors: Nicole Bussola, Alessia Marcolini, Valerio Maggio, Giuseppe Jurman, Cesare Furlanello

Abstract: Reproducibility of AI models on biomedical data remains a major concern for their acceptance into clinical practice. Initiatives for reproducibility in the development of predictive biomarkers, such as the MAQC Consortium, have already underlined the importance of appropriate Data Analysis Plans (DAPs) to control for different types of bias, including data leakage from the training to the test set. In the context of digital pathology, the leakage typically lurks in weakly designed experiments not accounting for the subjects in their data partitioning schemes. This issue is then exacerbated when fractions or subregions of slides (i.e. "tiles") are considered. Although this aspect is largely recognized by the community, we argue that it is often overlooked. In this study, we assess the impact of data leakage on the performance of machine learning models trained and validated on multiple histology data collections. We prove that, even with a properly designed DAP (10x5 repeated cross-validation), predictive scores can be inflated up to 41% when tiles from the same subject are used both in training and validation sets by deep learning models. We replicate the experiments for 4 classification tasks on 3 histopathological datasets, for a total of 374 subjects, 556 slides and more than 27,000 tiles. Also, we discuss the effects of data leakage on transfer learning strategies with models pre-trained on general-purpose datasets or off-task digital pathology collections. Finally, we propose a solution that automates the creation of leakage-free deep learning pipelines for digital pathology based on histolab, a novel Python package for histology data preprocessing. We validate the solution on two public datasets (TCGA and GTEx).

Submitted 17 November, 2020; v1 submitted 14 September, 2019; originally announced September 2019.
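
Note on subject-wise partitioning: the leakage described in the abstract above arises when tiles from one subject end up on both sides of a split. A minimal sketch of a leakage-free split, assuming each tile carries a subject ID, is given below. It is plain Python and does not use or reproduce the histolab API; the function name, parameters and toy data are hypothetical.

```python
import random
from collections import defaultdict


def subject_wise_split(tile_subjects, test_fraction=0.2, seed=0):
    """Split tile indices so that no subject appears in both sets.

    tile_subjects: list where entry i is the subject ID of tile i.
    Returns (train_indices, test_indices).
    """
    # Group tile indices by the subject they were cut from.
    by_subject = defaultdict(list)
    for idx, subject in enumerate(tile_subjects):
        by_subject[subject].append(idx)

    # Shuffle subjects (not tiles), then assign whole subjects to one side.
    subjects = sorted(by_subject)
    random.Random(seed).shuffle(subjects)
    n_test = max(1, round(test_fraction * len(subjects)))

    test = [i for s in subjects[:n_test] for i in by_subject[s]]
    train = [i for s in subjects[n_test:] for i in by_subject[s]]
    return train, test


# Hypothetical toy data: 10 tiles from 5 subjects.
tiles = ["s1", "s1", "s2", "s2", "s2", "s3", "s4", "s4", "s5", "s5"]
train_idx, test_idx = subject_wise_split(tiles, test_fraction=0.4)

train_subjects = {tiles[i] for i in train_idx}
test_subjects = {tiles[i] for i in test_idx}
print(sorted(train_subjects), sorted(test_subjects))
```

The same idea extends to repeated cross-validation by assigning subjects, rather than tiles, to folds.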

arXiv:1711.08198 [pdf, other]

A multiobjective deep learning approach for predictive classification in Neuroblastoma

Authors: Valerio Maggio, Marco Chierici, Giuseppe Jurman, Cesare Furlanello

Abstract: Neuroblastoma is a strongly heterogeneous cancer with very diverse clinical courses that may vary from spontaneous regression to fatal progression; an accurate estimation of the patient's risk at diagnosis is essential to design appropriate tumor treatment strategies. Neuroblastoma is a paradigm disease where different diagnostic and prognostic endpoints should be predicted from common molecular and clinical information, with increasing complexity, as shown in the FDA MAQC-II study. Here we introduce the novel multiobjective deep learning architecture CDRP (Concatenated Diagnostic Relapse Prognostic), composed of 8 layers, to obtain a combined diagnostic and prognostic prediction from high-throughput transcriptomics data. Two distinct loss functions are optimized for the Event Free Survival (EFS) and Overall Survival (OS) prognosis, respectively. We use the High-Risk (HR) diagnostic information as an additional input generated by an autoencoder embedding. The latter is used as network regulariser, based on a clinical algorithm commonly adopted for stratifying patients from cancer stage, age at onset of disease, and MYCN, the specific molecular marker. The architecture was applied to Illumina HiSeq2000 RNA-Seq for 498 neuroblastoma patients (176 at high risk) from the Sequencing Quality Control (SEQC) study, obtaining state-of-the-art results on the diagnostic endpoint and improving prediction of prognosis over the HR cohort.

Submitted 22 February, 2018; v1 submitted 22 November, 2017; originally announced November 2017.

Comments: NIPS ML4H workshop 2017 & MAQC 2018

arXiv:1710.05918 [pdf, other]

Convolutional neural networks for structured omics: OmicsCNN and the OmicsConv layer

Authors: Giuseppe Jurman, Valerio Maggio, Diego Fioravanti, Ylenia Giarratano, Isotta Landi, Margherita Francescatto, Claudio Agostinelli, Marco Chierici, Manlio De Domenico, Cesare Furlanello

Abstract: Convolutional Neural Networks (CNNs) are a popular deep learning architecture widely applied in different domains, in particular in image classification, for which the concept of convolution with a filter comes naturally. Unfortunately, the requirement of a distance (or, at least, of a neighbourhood function) in the input feature space has so far prevented its direct use on data types such as omics data. However, a number of omics data are metrizable, i.e., they can be endowed with a metric structure, which enables the adoption of a convolution-based deep learning framework, e.g., for prediction. We propose a generalized solution for CNNs on omics data, implemented through a dedicated Keras layer. In particular, for metagenomics data, a metric can be derived from the patristic distance on the phylogenetic tree. For transcriptomics data, we combine Gene Ontology semantic similarity and gene co-expression to define a distance; the function is defined through a multilayer network where 3 layers are defined by the GO mutual semantic similarity while the fourth one by gene co-expression. As a general tool, feature distance on omics data is enabled by OmicsConv, a novel Keras layer, obtaining OmicsCNN, a dedicated deep learning framework. Here we demonstrate OmicsCNN on gut microbiota sequencing data, for Inflammatory Bowel Disease (IBD) 16S data, first on synthetic data and then on a metagenomics collection of gut microbiota of 222 IBD patients.

Submitted 16 October, 2017; originally announced October 2017.

Comments: 7 pages, 3 figures. arXiv admin note: text overlap with arXiv:1709.02268

arXiv:1709.02268 [pdf, other]

Phylogenetic Convolutional Neural Networks in Metagenomics

Authors: Diego Fioravanti, Ylenia Giarratano, Valerio Maggio, Claudio Agostinelli, Marco Chierici, Giuseppe Jurman, Cesare Furlanello

Abstract: Background: Convolutional Neural Networks can be effectively used only when data are endowed with an intrinsic concept of neighbourhood in the input space, as is the case of pixels in images. We introduce here Ph-CNN, a novel deep learning architecture for the classification of metagenomics data based on the Convolutional Neural Networks, with the patristic distance defined on the phylogenetic tree being used as the proximity measure. The patristic distance between variables is used together with a sparsified version of MultiDimensional Scaling to embed the phylogenetic tree in a Euclidean space. Results: Ph-CNN is tested with a domain adaptation approach on synthetic data and on a metagenomics collection of gut microbiota of 38 healthy subjects and 222 Inflammatory Bowel Disease patients, divided into 6 subclasses. Classification performance is promising when compared to classical algorithms like Support Vector Machines and Random Forest and a baseline fully connected neural network, e.g. the Multi-Layer Perceptron. Conclusion: Ph-CNN represents a novel deep learning approach for the classification of metagenomics data. Operatively, the algorithm has been implemented as a custom Keras layer taking care of passing to the following convolutional layer not only the data but also the ranked list of neighbourhood of each sample, thus mimicking the case of image data, transparently to the user. Keywords: Metagenomics; Deep learning; Convolutional Neural Networks; Phylogenetic trees

Submitted 6 September, 2017; originally announced September 2017.

Comments: Presented at BMTL 2017, Naples

arXiv:1707.06552 [pdf, other]

Towards a scientific blockchain framework for reproducible data analysis

Authors: C. Furlanello, M. De Domenico, G. Jurman, N. Bussola

Abstract: Publishing reproducible analyses is a long-standing and widespread challenge for the scientific community, funding bodies and publishers. Although a definitive solution is still elusive, the problem is recognized to affect all disciplines and lead to a critical system inefficiency. Here, we propose a blockchain-based approach to enhance scientific reproducibility, with a focus on life science studies and precision medicine. While the interest of permanently encoding all the key study information (including endpoints, data and metadata, protocols, analytical methods and all findings) into an immutable ledger has already been highlighted, here we apply the blockchain approach to solve the issue of rewarding the time and expertise of scientists who commit to verifying reproducibility. Our mechanism builds a trustless ecosystem of researchers, funding bodies and publishers cooperating to guarantee digital and permanent access to information and reproducible results. As a natural byproduct, a procedure to quantify scientists' and institutions' reputation for ranking purposes is obtained.

Submitted 20 July, 2017; originally announced July 2017.

Comments: 8 pages, 1 figure

arXiv:1602.00467 [pdf, ps, other]

Differential network analysis and graph classification: a glocal approach

Authors: Giuseppe Jurman, Michele Filosi, Samantha Riccadonna, Roberto Visintainer, Cesare Furlanello

Abstract: Based on the glocal HIM metric and its induced graph kernel, we propose a novel solution in differential network analysis that integrates network comparison and classification tasks. The HIM distance is defined as the one-parameter family of product metrics linearly combining the normalised Hamming distance H and the normalised Ipsen-Mikhailov spectral distance IM. The combination of the two components within a single metric allows overcoming their drawbacks and obtaining a measure that is simultaneously global and local. Furthermore, plugging the HIM kernel into a Support Vector Machine gives us a classification algorithm based on the HIM distance. First, we outline the theory underlying the metric construction. We introduce two diverse applications of the HIM distance and the HIM kernel to biological datasets. This versatility supports the adoption of the HIM family as a general tool for information extraction, quantifying difference among diverse instances of a complex system. An Open Source implementation of the HIM metrics is provided by the R package nettools and in its web interface ReNette.

Submitted 1 February, 2016; originally announced February 2016.

Comments: Submitted for BMTL 2015 Proceedings
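
Note on the HIM combination: the abstract above describes HIM as a one-parameter family combining the normalised Hamming distance H and the normalised Ipsen-Mikhailov distance IM, without giving the formula. The sketch below assumes the form HIM_xi = sqrt(H^2 + xi * IM^2) / sqrt(1 + xi), which is an assumption here and should be checked against the paper. The Hamming part is computed for unweighted, loop-free adjacency matrices; the IM value is passed in as a number, because its spectral computation is outside the scope of this sketch. Function names and the toy graphs are hypothetical, and this is not the nettools API.

```python
from math import sqrt


def normalised_hamming(a, b):
    """Fraction of off-diagonal adjacency entries on which two graphs differ.

    a, b: square 0/1 adjacency matrices (lists of lists) on the same nodes.
    """
    n = len(a)
    diff = sum(
        abs(a[i][j] - b[i][j])
        for i in range(n)
        for j in range(n)
        if i != j
    )
    return diff / (n * (n - 1))


def him(h, im, xi=1.0):
    """Combine a local term h and a spectral term im (assumed HIM form)."""
    return sqrt(h * h + xi * im * im) / sqrt(1.0 + xi)


# Toy example: empty graph versus complete graph on 3 nodes.
empty = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
full = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
h = normalised_hamming(empty, full)

im_value = 0.5  # placeholder, NOT a computed Ipsen-Mikhailov distance
print(h, him(h, im_value, xi=1.0))
```

Under this assumed form, xi = 0 reduces HIM to the Hamming term alone, while larger xi gives more weight to the spectral term.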

arXiv:1310.6547 [pdf, ps, other]

Sparse Predictive Structure of Deconvolved Functional Brain Networks

Authors: Tommaso Furlanello, Marco Cristoforetti, Cesare Furlanello, Giuseppe Jurman

Abstract: The functional and structural representation of the brain as a complex network is marked by the fact that the comparison of noisy and intrinsically correlated high-dimensional structures between experimental conditions or groups shuns typical mass univariate methods. Furthermore, most network estimation methods cannot distinguish between real and spurious correlation arising from the convolution due to nodes' interaction, which thus introduces additional noise in the data. We propose a machine learning pipeline aimed at identifying multivariate differences between brain networks associated with different experimental conditions. The pipeline (1) leverages the deconvolved individual contribution of each edge and (2) maps the task into a sparse classification problem in order to construct the associated "sparse deconvolved predictive network", i.e., a graph with the same nodes as those compared but whose edge weights are defined by their relevance for out-of-sample predictions in classification. We present an application of the proposed method by decoding the covert attention direction (left or right) based on the single-trial functional connectivity matrix extracted from high-frequency magnetoencephalography (MEG) data. Our results demonstrate how network deconvolution matched with sparse classification methods outperforms typical approaches for MEG decoding.

Submitted 24 October, 2013; originally announced October 2013.

arXiv:1210.3149 [pdf, other]

DTW-MIC Coexpression Networks from Time-Course Data

Authors: Samantha Riccadonna, Giuseppe Jurman, Roberto Visintainer, Michele Filosi, Cesare Furlanello

Abstract: When modeling coexpression networks from high-throughput time course data, Pearson Correlation Coefficient (PCC) is one of the most effective and popular similarity functions. However, its reliability is limited since it cannot capture non-linear interactions and time shifts. Here we propose to overcome these two issues by employing a novel similarity function, Dynamic Time Warping Maximal Information Coefficient (DTW-MIC), combining a measure taking care of functional interactions of signals (MIC) and a measure identifying horizontal displacements (DTW). By using the Hamming-Ipsen-Mikhailov (HIM) metric to quantify network differences, the effectiveness of the DTW-MIC approach is demonstrated on both synthetic and transcriptomic datasets.

Submitted 16 October, 2014; v1 submitted 11 October, 2012; originally announced October 2012.
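
Note on the DTW component: the abstract above names Dynamic Time Warping as the part of DTW-MIC that handles horizontal displacements (time shifts). The sketch below is the classic dynamic-programming DTW with absolute-difference cost, shown only to illustrate why a shifted signal can still match well. It does not reproduce how the paper normalises DTW or combines it with MIC; the function name and toy series are hypothetical.

```python
def dtw_distance(x, y):
    """Classic DTW cost between two numeric sequences (absolute-difference cost)."""
    inf = float("inf")
    n, m = len(x), len(y)
    # cost[i][j] = best alignment cost of x[:i] against y[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            # Extend the cheapest of: step in x only, step in y only, step in both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Toy time courses: b is a copy of a delayed by one time step.
a = [0, 1, 2, 1, 0, 0]
b = [0, 0, 1, 2, 1, 0]

pointwise = sum(abs(p - q) for p, q in zip(a, b))
warped = dtw_distance(a, b)
print(pointwise, warped)
```

The pointwise comparison penalises the delay, whereas the warped alignment absorbs it, which is the property the abstract attributes to DTW.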

arXiv:1209.1654 [pdf, ps, other]

Stability Indicators in Network Reconstruction

Authors: Giuseppe Jurman, Michele Filosi, Roberto Visintainer, Samantha Riccadonna, Cesare Furlanello

Abstract: The number of algorithms available to reconstruct a biological network from a dataset of high-throughput measurements is nowadays overwhelming, but evaluating their performance when the gold standard is unknown is a difficult task. Here we propose to use a few reconstruction stability tools as a quantitative solution to this problem. We introduce four indicators to quantitatively assess the stability of a reconstructed network in terms of variability with respect to data subsampling. In particular, we give a measure of the mutual distances among the set of networks generated by a collection of data subsets (and from the network generated on the whole dataset) and we rank nodes and edges according to their decreasing variability within the same set of networks. As a key ingredient, we employ a global/local network distance combined with a bootstrap procedure. We demonstrate the use of the indicators in a controlled situation on a toy dataset, and we show their application on a miRNA microarray dataset with paired tumoral and non-tumoral tissues extracted from a cohort of 241 hepatocellular carcinoma patients.

Submitted 7 September, 2012; originally announced September 2012.

arXiv:1208.4271 [pdf, ps, other]

doi 10.1093/bioinformatics/bts707

Minerva and minepy: a C engine for the MINE suite and its R, Python and MATLAB wrappers

Authors: Davide Albanese, Michele Filosi, Roberto Visintainer, Samantha Riccadonna, Giuseppe Jurman, Cesare Furlanello

Abstract: We introduce a novel implementation in ANSI C of the MINE family of algorithms for computing maximal information-based measures of dependence between two variables in large datasets, with the aim of a low memory footprint and ease of integration within bioinformatics pipelines. We provide the libraries minerva (with the R interface) and minepy for Python, MATLAB, Octave and C++. The C solution reduces the large memory requirement of the original Java implementation, has good upscaling properties, and offers a native parallelization for the R interface. Low memory requirements are demonstrated on the MINE benchmarks as well as on large (n=1340) microarray and Illumina GAII RNA-seq transcriptomics datasets. Availability and Implementation: Source code and binaries are freely available for download under GPL3 licence at http://minepy.sourceforge.net for minepy and through the CRAN repository http://cran.r-project.org for the R package minerva. All software is multiplatform (MS Windows, Linux and OSX).

Submitted 10 December, 2012; v1 submitted 21 August, 2012; originally announced August 2012.

Comments: Bioinformatics 2012, in press

arXiv:1201.3216 [pdf, ps, other]

Evaluating sources of variability in pathway profiling

Authors: A. Barla, S. Riccadonna, S. Masecchia, M. Squillario, M. Filosi, G. Jurman, C. Furlanello

Abstract: A bioinformatics platform is introduced aimed at identifying models of disease-specific pathways, as well as a set of network measures that can quantify changes in terms of global structure or single link disruptions. The approach integrates a network comparison framework with machine learning molecular profiling. The platform includes different tools combined in one Open Source pipeline, supporting reproducibility of the analysis. We describe here the computational pipeline and explore the main sources of variability that can affect the results, namely the classifier, the feature ranking/selection algorithm, the enrichment procedure, the inference method and the network comparison function. The proposed pipeline is tested on a microarray dataset of late stage Parkinson's Disease patients together with healthy controls. Choosing different machine learning approaches, we get low pathway profiling overlap in terms of common enriched elements. Nevertheless, they identify different but equally meaningful biological aspects of the same process, suggesting the integration of information across different methods as the best overall strategy. All the elements of the proposed pipeline are available as Open Source Software: availability details are provided in the main text.

Submitted 16 January, 2012; originally announced January 2012.

arXiv:1109.1108 [pdf, ps, other]

Single-base mismatch profiles for NGS samples

Authors: Marco Chierici, Giuseppe Jurman, Marco Roncador, Cesare Furlanello

Abstract: Within the preprocessing pipeline of a Next Generation Sequencing sample, its set of Single-Base Mismatches is one of the first outcomes, together with the number of correctly aligned reads. The union of these two sets provides a 4x4 matrix (called Single Base Indicator, SBI in what follows) representing a blueprint of the sample and its preprocessing ingredients such as the sequencer, the alignment software, and the pipeline parameters. In this note we show that, under the same technological conditions, there is a strong relation between the SBI and the biological nature of the sample. To reach this goal we need to introduce a similarity measure between SBIs: we also show how two measures commonly used in machine learning can be of help in this context.

Submitted 6 September, 2011; originally announced September 2011.

arXiv:1109.0220 [pdf, ps, other]

Biological network comparison via Ipsen-Mikhailov distance

Authors: Giuseppe Jurman, Samantha Riccadonna, Roberto Visintainer, Cesare Furlanello

Abstract: Highlighting similarities and differences between networks is an informative task in investigating many biological processes. Typical examples are detecting differences between an inferred network and the corresponding gold standard, or evaluating changes in a dynamic network along time. Although fruitful insights can be drawn by qualitative or feature-based methods, a distance must be used whenever a quantitative assessment is required. Here we introduce the Ipsen-Mikhailov metric for biological network comparison, based on the difference of the distributions of the Laplacian eigenvalues of the compared graphs. Being a spectral measure, its focus is on the general structure of the net so it can overcome the issues affecting local metrics such as the edit distances. Relation with the classical Matthews Correlation Coefficient (MCC) is discussed, showing the finer discriminant resolution achieved by the Ipsen-Mikhailov metric. We conclude with three examples of application in functional genomic tasks, including stability of network reconstruction as robustness to data subsampling, variability in dynamical networks and differences in networks associated with a classification task.

Submitted 1 September, 2011; originally announced September 2011.

arXiv:1105.4486 [pdf, other]

A machine learning pipeline for discriminant pathways identification

Authors: Annalisa Barla, Giuseppe Jurman, Roberto Visintainer, Margherita Squillario, Michele Filosi, Samantha Riccadonna, Cesare Furlanello

Abstract: Motivation: Identifying the molecular pathways more prone to disruption during a pathological process is a key task in network medicine and, more generally, in systems biology. Results: In this work we propose a pipeline that couples a machine learning solution for molecular profiling with a recent network comparison method. The pipeline can identify changes occurring between specific sub-modules of networks built in a case-control biomarker study, discriminating key groups of genes whose interactions are modified by an underlying condition. The proposal is independent of the classification algorithm used. Three applications on genomewide data are presented regarding children's susceptibility to air pollution and two neurodegenerative diseases: Parkinson's and Alzheimer's. Availability: Details about the software used for the experiments discussed in this paper are provided in the Appendix.

Submitted 27 May, 2011; v1 submitted 23 May, 2011; originally announced May 2011.

Journal ref: A. Barla, G. Jurman, R. Visintainer, M. Squillario, M. Filosi, S. Riccadonna, C. Furlanello. A machine learning pipeline for discriminant pathways identification. In Proc. CIBB 2011

arXiv:1005.0103 [pdf, ps, other]

An introduction to spectral distances in networks (extended version)

Authors: Giuseppe Jurman, Roberto Visintainer, Cesare Furlanello

Abstract: Many functions have been recently defined to assess the similarity among networks as tools for quantitative comparison. They stem from very different frameworks - and they are tuned for dealing with different situations. Here we show an overview of the spectral distances, highlighting their behavior in some basic cases of static and dynamic synthetic and real networks.

Submitted 26 October, 2010; v1 submitted 1 May, 2010; originally announced May 2010.

Journal ref: G. Jurman, R. Visintainer, C. Furlanello. An introduction to spectral distances in networks. Frontiers in Artificial Intelligence and Applications, 226:227-234, 2011

arXiv:1004.1341 [pdf, ps, other]

doi 10.1371/journal.pone.0036540

Algebraic Comparison of Partial Lists in Bioinformatics

Authors: Giuseppe Jurman, Samantha Riccadonna, Roberto Visintainer, Cesare Furlanello

Abstract: The outcome of a functional genomics pipeline is usually a partial list of genomic features, ranked by their relevance in modelling biological phenotype in terms of a classification or regression model. Due to resampling protocols or just within a meta-analysis comparison, instead of one list it is often the case that sets of alternative feature lists (possibly of different lengths) are obtained. Here we introduce a method, based on the algebraic theory of symmetric groups, for studying the variability between lists ("list stability") in the case of lists of unequal length. We provide algorithms evaluating stability for lists embedded in the full feature set or just limited to the features occurring in the partial lists. The method is demonstrated first on synthetic data in a gene filtering task and then for finding gene profiles on a recent prostate cancer dataset.

Submitted 8 April, 2010; originally announced April 2010.

Journal ref: PLoS ONE 7(5): e36540 (2012)

Showing 1–18 of 18 results for author: Furlanello, C