-
MirLibSpark: A Scalable NGS Plant MicroRNA Prediction Pipeline for Multi-Library Functional Annotation
Authors:
Chao-Jung Wu,
Amine M. Remita,
Abdoulaye Baniré Diallo
Abstract:
The emergence of the Next Generation Sequencing increases drastically the volume of transcriptomic data. Although many standalone algorithms and workflows for novel microRNA (miRNA) prediction have been proposed, few are designed for processing large volume of sequence data from large genomes, and even fewer further annotate functional miRNAs by analyzing multiple libraries. We propose an improved…
▽ More
The emergence of the Next Generation Sequencing increases drastically the volume of transcriptomic data. Although many standalone algorithms and workflows for novel microRNA (miRNA) prediction have been proposed, few are designed for processing large volume of sequence data from large genomes, and even fewer further annotate functional miRNAs by analyzing multiple libraries. We propose an improved pipeline for a high volume data facility by implementing mirLibSpark based on the Apache Spark framework. This pipeline is the fastest actual method, and provides an accuracy improvement compared to the standard. In this paper, we deliver the first distributed functional miRNA predictor as a standalone and fully automated package. It is an efficient and accurate miRNA predictor with functional insight. Furthermore, it compiles with the gold-standard requirement on plant miRNA predictions.
△ Less
Submitted 29 January, 2025;
originally announced January 2025.
-
Prior Density Learning in Variational Bayesian Phylogenetic Parameters Inference
Authors:
Amine M. Remita,
Golrokh Vitae,
Abdoulaye Baniré Diallo
Abstract:
The advances in variational inference are providing promising paths in Bayesian estimation problems. These advances make variational phylogenetic inference an alternative approach to Markov Chain Monte Carlo methods for approximating the phylogenetic posterior. However, one of the main drawbacks of such approaches is modelling the prior through fixed distributions, which could bias the posterior a…
▽ More
The advances in variational inference are providing promising paths in Bayesian estimation problems. These advances make variational phylogenetic inference an alternative approach to Markov Chain Monte Carlo methods for approximating the phylogenetic posterior. However, one of the main drawbacks of such approaches is modelling the prior through fixed distributions, which could bias the posterior approximation if they are distant from the current data distribution. In this paper, we propose an approach and an implementation framework to relax the rigidity of the prior densities by learning their parameters using a gradient-based method and a neural network-based parameterization. We applied this approach for branch lengths and evolutionary parameters estimation under several Markov chain substitution models. The results of performed simulations show that the approach is powerful in estimating branch lengths and evolutionary model parameters. They also show that a flexible prior model could provide better results than a predefined prior model. Finally, the results highlight that using neural networks improves the initialization of the optimization of the prior density parameters.
△ Less
Submitted 8 September, 2023; v1 submitted 5 February, 2023;
originally announced February 2023.
-
EvoVGM: a Deep Variational Generative Model for Evolutionary Parameter Estimation
Authors:
Amine M. Remita,
Abdoulaye Baniré Diallo
Abstract:
Most evolutionary-oriented deep generative models do not explicitly consider the underlying evolutionary dynamics of biological sequences as it is performed within the Bayesian phylogenetic inference framework. In this study, we propose a method for a deep variational Bayesian generative model (EvoVGM) that jointly approximates the true posterior of local evolutionary parameters and generates sequ…
▽ More
Most evolutionary-oriented deep generative models do not explicitly consider the underlying evolutionary dynamics of biological sequences as it is performed within the Bayesian phylogenetic inference framework. In this study, we propose a method for a deep variational Bayesian generative model (EvoVGM) that jointly approximates the true posterior of local evolutionary parameters and generates sequence alignments. Moreover, it is instantiated and tuned for continuous-time Markov chain substitution models such as JC69, K80 and GTR. We train the model via a low-variance stochastic estimator and a gradient ascent algorithm. Here, we analyze the consistency and effectiveness of EvoVGM on synthetic sequence alignments simulated with several evolutionary scenarios and different sizes. Finally, we highlight the robustness of a fine-tuned EvoVGM model using a sequence alignment of gene S of coronaviruses.
△ Less
Submitted 30 June, 2022; v1 submitted 25 May, 2022;
originally announced May 2022.
-
Statistical Linear Models in Virus Genomic Alignment-free Classification: Application to Hepatitis C Viruses
Authors:
Amine M. Remita,
Abdoulaye Baniré Diallo
Abstract:
Viral sequence classification is an important task in pathogen detection, epidemiological surveys and evolutionary studies. Statistical learning methods are widely used to classify and identify viral sequences in samples from environments. These methods face several challenges associated with the nature and properties of viral genomes such as recombination, mutation rate and diversity. Also, new g…
▽ More
Viral sequence classification is an important task in pathogen detection, epidemiological surveys and evolutionary studies. Statistical learning methods are widely used to classify and identify viral sequences in samples from environments. These methods face several challenges associated with the nature and properties of viral genomes such as recombination, mutation rate and diversity. Also, new generations of sequencing technologies rise other difficulties by generating massive amounts of fragmented sequences. While linear classifiers are often used to classify viruses, there is a lack of exploration of the accuracy space of existing models in the context of alignment free approaches. In this study, we present an exhaustive assessment procedure exploring the power of linear classifiers in genotyping and subtyping partial and complete genomes. It is applied to the Hepatitis C viruses (HCV). Several variables are considered in this investigation such as classifier types (generative and discriminative) and their hyper-parameters (smoothing value and regularization penalty function), the classification task (genotyping and subtyping), the length of the tested sequences (partial and complete) and the length of k-mer words. Overall, several classifiers perform well given a set of precise combination of the experimental variables mentioned above. Finally, we provide the procedure and benchmark data to allow for more robust assessment of classification from virus genomes.
△ Less
Submitted 28 May, 2024; v1 submitted 11 October, 2019;
originally announced October 2019.