-
Mitigating Health Data Poverty: Generative Approaches versus Resampling for Time-series Clinical Data
Authors:
Raffaele Marchesi,
Nicolo Micheletti,
Giuseppe Jurman,
Venet Osmani
Abstract:
Several approaches have been developed to mitigate algorithmic bias stemming from health data poverty, where minority groups are underrepresented in training datasets. Augmenting the minority class using resampling (such as SMOTE) is a widely used approach due to the simplicity of the algorithms. However, these algorithms decrease data variability and may introduce correlations between samples, giving rise to the use of generative approaches based on GANs. Generating high-dimensional, time-series, authentic data that provides wide coverage of the real data distribution remains a challenging task for both resampling and GAN-based approaches. In this work we propose the CA-GAN architecture, which addresses some of the shortcomings of current approaches, and we provide a detailed comparison with both SMOTE and WGAN-GP*, using a high-dimensional, time-series, real dataset of 3343 hypotensive Caucasian and Black patients. We show that our approach is better at both generating authentic data of the minority class and remaining within the original distribution of the real data.
Submitted 26 October, 2022; v1 submitted 25 October, 2022;
originally announced October 2022.
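The abstract above notes that resampling methods such as SMOTE reduce data variability. A minimal sketch of SMOTE-style interpolation in pure Python can make that concrete: each synthetic sample is placed on the segment between a minority sample and one of its nearest minority neighbours, so new points never leave the region already spanned by the real ones. The function name `smote_like`, its parameters, and the toy data are illustrative assumptions; this is not the paper's pipeline, and it ignores the time-series structure of the clinical data.

```python
import math
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by SMOTE-style interpolation.

    minority: list of equal-length numeric feature vectors (at least 2).
    k: number of nearest minority neighbours to choose from.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Pick a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class (hypothetical data): the four corners of a unit square.
toy_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for point in smote_like(toy_minority, n_new=3):
    print(point)
```

Because every synthetic point is a convex combination of two real points, the augmented set cannot cover regions of feature space the minority samples do not already enclose, which is one way to read the variability concern raised in the abstract.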
-
In-field grape berries counting for yield estimation using dilated CNNs
Authors:
L. Coviello,
M. Cristoforetti,
G. Jurman,
C. Furlanello
Abstract:
Digital technologies ignited a revolution in the agrifood domain known as precision agriculture. A main question for enabling precision agriculture at scale is whether accurate product quality control can be made available at minimal cost, leveraging existing technologies and agronomists' skills. As a contribution in this direction, we demonstrate a tool for accurate fruit yield estimation from smartphone cameras, built by adapting Deep Learning algorithms originally developed for crowd counting.
Submitted 26 September, 2019;
originally announced September 2019.
-
Convolutional neural networks for structured omics: OmicsCNN and the OmicsConv layer
Authors:
Giuseppe Jurman,
Valerio Maggio,
Diego Fioravanti,
Ylenia Giarratano,
Isotta Landi,
Margherita Francescatto,
Claudio Agostinelli,
Marco Chierici,
Manlio De Domenico,
Cesare Furlanello
Abstract:
Convolutional Neural Networks (CNNs) are a popular deep learning architecture widely applied in different domains, in particular in image classification, for which the concept of convolution with a filter comes naturally. Unfortunately, the requirement of a distance (or, at least, of a neighbourhood function) in the input feature space has so far prevented their direct use on data types such as omics data. However, a number of omics data types are metrizable, i.e., they can be endowed with a metric structure, which enables the adoption of a convolution-based deep learning framework, e.g., for prediction. We propose a generalized solution for CNNs on omics data, implemented through a dedicated Keras layer. In particular, for metagenomics data, a metric can be derived from the patristic distance on the phylogenetic tree. For transcriptomics data, we combine Gene Ontology semantic similarity and gene co-expression to define a distance; the function is defined through a multilayer network where 3 layers are defined by the GO mutual semantic similarity while the fourth one by gene co-expression. As a general tool, feature distance on omics data is enabled by OmicsConv, a novel Keras layer, obtaining OmicsCNN, a dedicated deep learning framework. Here we demonstrate OmicsCNN on gut microbiota sequencing data, for Inflammatory Bowel Disease (IBD) 16S data, first on synthetic data and then on a metagenomics collection of gut microbiota of 222 IBD patients.
Submitted 16 October, 2017;
originally announced October 2017.
-
Seasonal Linear Predictivity in National Football Championships
Authors:
Giuseppe Jurman
Abstract:
Predicting the results of sport matches and competitions is an emerging research field, benefiting from the growing amount of available data and novel data analytics techniques. Excellent forecasts can be achieved by advanced machine learning methods applied to detailed historical data, especially in very popular sports such as football (soccer). Here we show that, despite the large number of confounding factors, the results of a football team in longer competitions (e.g., a national league) follow a basically linear trend that is also useful for predictive purposes. In support of this claim, we present a set of linear regression experiments on a database collecting the yearly results of 707 teams playing in 22 divisions from 11 countries, in 20 football seasons.
Submitted 19 November, 2015;
originally announced November 2015.
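As a rough illustration of how a linear trend can be used for prediction, the sketch below fits an ordinary least-squares line to a team's cumulative points over the first matchdays and extrapolates to the end of the season. The data, the function names `fit_line` and `predict`, and the choice of cumulative points against matchday as the regression variables are all assumptions made for illustration; the paper's actual experimental protocol may differ.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Evaluate the fitted line at x."""
    return slope * x + intercept


# Hypothetical cumulative points of one team after matchdays 1..10.
matchdays = list(range(1, 11))
points = [3, 4, 7, 7, 10, 13, 14, 17, 18, 21]

slope, intercept = fit_line(matchdays, points)
# Extrapolate to the last matchday of a hypothetical 38-round league.
print("projected final points:", round(predict(slope, intercept, 38), 1))
```

The fitted slope is simply the team's average points per match over the observed window, so the extrapolation is only as good as the assumption that this rate stays roughly constant.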
-
Convolutional Neural Network for Stereotypical Motor Movement Detection in Autism
Authors:
Nastaran Mohammadian Rad,
Andrea Bizzego,
Seyed Mostafa Kia,
Giuseppe Jurman,
Paola Venuti,
Cesare Furlanello
Abstract:
Autism Spectrum Disorders (ASDs) are often associated with specific atypical postural or motor behaviors, of which Stereotypical Motor Movements (SMMs) have a specific visibility. While the identification and the quantification of SMM patterns remain complex, their automation would provide support to accurate tuning of the intervention in the therapy of autism. Therefore, it is essential to develop automatic SMM detection systems in a real world setting, taking care of strong inter-subject and intra-subject variability. Wireless accelerometer sensing technology can provide a valid infrastructure for real-time SMM detection. However, such variability remains a problem for machine learning methods as well, in particular whenever handcrafted features extracted from accelerometer signals are considered. Here, we propose to employ the deep learning paradigm in order to learn discriminating features from multi-sensor accelerometer signals. Our results provide preliminary evidence that feature learning and transfer learning embedded in the deep architecture yield more accurate SMM detectors in longitudinal scenarios.
Submitted 7 June, 2016; v1 submitted 5 November, 2015;
originally announced November 2015.
-
Sparse Predictive Structure of Deconvolved Functional Brain Networks
Authors:
Tommaso Furlanello,
Marco Cristoforetti,
Cesare Furlanello,
Giuseppe Jurman
Abstract:
The functional and structural representation of the brain as a complex network is marked by the fact that the comparison of noisy and intrinsically correlated high-dimensional structures between experimental conditions or groups shuns typical mass univariate methods. Furthermore, most network estimation methods cannot distinguish between real and spurious correlation arising from the convolution due to nodes' interaction, which thus introduces additional noise in the data. We propose a machine learning pipeline aimed at identifying multivariate differences between brain networks associated with different experimental conditions. The pipeline (1) leverages the deconvolved individual contribution of each edge and (2) maps the task into a sparse classification problem in order to construct the associated "sparse deconvolved predictive network", i.e., a graph with the same nodes as those compared but whose edge weights are defined by their relevance for out-of-sample predictions in classification. We present an application of the proposed method by decoding the covert attention direction (left or right) based on the single-trial functional connectivity matrix extracted from high-frequency magnetoencephalography (MEG) data. Our results demonstrate how network deconvolution matched with sparse classification methods outperforms typical approaches for MEG decoding.
Submitted 24 October, 2013;
originally announced October 2013.
-
Minerva and minepy: a C engine for the MINE suite and its R, Python and MATLAB wrappers
Authors:
Davide Albanese,
Michele Filosi,
Roberto Visintainer,
Samantha Riccadonna,
Giuseppe Jurman,
Cesare Furlanello
Abstract:
We introduce a novel implementation in ANSI C of the MINE family of algorithms for computing maximal information-based measures of dependence between two variables in large datasets, with the aim of a low memory footprint and ease of integration within bioinformatics pipelines. We provide the libraries minerva (with the R interface) and minepy for Python, MATLAB, Octave and C++. The C solution reduces the large memory requirement of the original Java implementation, has good upscaling properties, and offers a native parallelization for the R interface. Low memory requirements are demonstrated on the MINE benchmarks as well as on large (n=1340) microarray and Illumina GAII RNA-seq transcriptomics datasets.
Availability and Implementation: Source code and binaries are freely available for download under GPL3 licence at http://minepy.sourceforge.net for minepy and through the CRAN repository http://cran.r-project.org for the R package minerva. All software is multiplatform (MS Windows, Linux and OSX).
Submitted 10 December, 2012; v1 submitted 21 August, 2012;
originally announced August 2012.
-
mlpy: Machine Learning Python
Authors:
Davide Albanese,
Roberto Visintainer,
Stefano Merler,
Samantha Riccadonna,
Giuseppe Jurman,
Cesare Furlanello
Abstract:
mlpy is a Python Open Source Machine Learning library built on top of NumPy/SciPy and the GNU Scientific Libraries. mlpy provides a wide range of state-of-the-art machine learning methods for supervised and unsupervised problems and it is aimed at finding a reasonable compromise among modularity, maintainability, reproducibility, usability and efficiency. mlpy is multiplatform, works with Python 2 and 3, and is distributed under GPL3 at the website http://mlpy.fbk.eu.
Submitted 1 March, 2012; v1 submitted 29 February, 2012;
originally announced February 2012.
-
A unifying view for performance measures in multi-class prediction
Authors:
Giuseppe Jurman,
Cesare Furlanello
Abstract:
In the last few years, many different performance measures have been introduced to overcome the weakness of the most natural metric, the Accuracy. Among them, Matthews Correlation Coefficient has recently gained popularity among researchers not only in machine learning but also in several application fields such as bioinformatics. Nonetheless, further novel functions are being proposed in the literature. We show that Confusion Entropy, a recently introduced classifier performance measure for multi-class problems, has a strong (monotone) relation with the multi-class generalization of a classical metric, the Matthews Correlation Coefficient. Computational evidence in support of the claim is provided, together with an outline of the theoretical explanation.
Submitted 17 August, 2010;
originally announced August 2010.
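For reference, the multi-class Matthews Correlation Coefficient mentioned in the abstract above can be computed directly from a confusion matrix. The sketch below uses the commonly cited closed form in terms of row sums, column sums, and the trace. The function name `multiclass_mcc` and the convention of returning 0.0 when the denominator vanishes are assumptions of this example, and Confusion Entropy is not implemented here.

```python
import math


def multiclass_mcc(confusion):
    """Multi-class MCC from a square confusion matrix.

    confusion[i][j] = number of samples of true class i predicted as class j.
    """
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(k))
    true_counts = [sum(confusion[i]) for i in range(k)]  # row sums
    pred_counts = [sum(confusion[i][j] for i in range(k)) for j in range(k)]  # column sums

    numerator = correct * total - sum(t * p for t, p in zip(true_counts, pred_counts))
    denom_pred = total ** 2 - sum(p * p for p in pred_counts)
    denom_true = total ** 2 - sum(t * t for t in true_counts)
    denominator = math.sqrt(denom_pred * denom_true)
    # Assumed convention: report 0.0 when the coefficient is undefined,
    # e.g. when every sample is predicted as the same class.
    return numerator / denominator if denominator else 0.0


# Hypothetical 3-class confusion matrix.
example = [
    [5, 1, 0],
    [2, 6, 1],
    [0, 1, 4],
]
print("MCC:", multiclass_mcc(example))
```

Unlike accuracy, the coefficient takes the class marginals into account, so a classifier that always predicts the majority class does not score well.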
-
Algebraic Comparison of Partial Lists in Bioinformatics
Authors:
Giuseppe Jurman,
Samantha Riccadonna,
Roberto Visintainer,
Cesare Furlanello
Abstract:
The outcome of a functional genomics pipeline is usually a partial list of genomic features, ranked by their relevance in modelling biological phenotype in terms of a classification or regression model. Due to resampling protocols or just within a meta-analysis comparison, instead of one list it is often the case that sets of alternative feature lists (possibly of different lengths) are obtained. Here we introduce a method, based on the algebraic theory of symmetric groups, for studying the variability between lists ("list stability") in the case of lists of unequal length. We provide algorithms evaluating stability for lists embedded in the full feature set or just limited to the features occurring in the partial lists. The method is demonstrated first on synthetic data in a gene filtering task and then for finding gene profiles on a recent prostate cancer dataset.
Submitted 8 April, 2010;
originally announced April 2010.