
doi 10.1145/3307339.3343463

MirLibSpark: A Scalable NGS Plant MicroRNA Prediction Pipeline for Multi-Library Functional Annotation

Authors: Chao-Jung Wu, Amine M. Remita, Abdoulaye Baniré Diallo

Abstract: The emergence of Next Generation Sequencing has drastically increased the volume of transcriptomic data. Although many standalone algorithms and workflows for novel microRNA (miRNA) prediction have been proposed, few are designed for processing large volumes of sequence data from large genomes, and even fewer further annotate functional miRNAs by analyzing multiple libraries. We propose an improved pipeline for a high-volume data facility by implementing mirLibSpark based on the Apache Spark framework. This pipeline is the fastest current method and provides an accuracy improvement compared to the standard. In this paper, we deliver the first distributed functional miRNA predictor as a standalone and fully automated package. It is an efficient and accurate miRNA predictor with functional insight. Furthermore, it complies with the gold-standard requirements for plant miRNA predictions.

Submitted 29 January, 2025; originally announced January 2025.

Comments: 13 pages, 4 figures, 2 tables, published in conference proceedings

ACM Class: J.3; I.5.3; D.2.11

Journal ref: Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (ACM-BCB) pp. 669-674, 2019

arXiv:2405.02374 [pdf, other]

Protein binding affinity prediction under multiple substitutions applying eGNNs on Residue and Atomic graphs combined with Language model information: eGRAL

Authors: Arturo Fiorellini-Bernardis, Sebastien Boyer, Christoph Brunken, Bakary Diallo, Karim Beguir, Nicolas Lopez-Carranza, Oliver Bent

Abstract: Protein-protein interactions (PPIs) play a crucial role in numerous biological processes. Developing methods that predict binding affinity changes under substitution mutations is fundamental for modelling and re-engineering biological systems. Deep learning is increasingly recognized as a powerful tool capable of bridging the gap between in-silico predictions and in-vitro observations. With this contribution, we propose eGRAL, a novel SE(3) equivariant graph neural network (eGNN) architecture designed for predicting binding affinity changes from multiple amino acid substitutions in protein complexes. eGRAL leverages residue, atomic and evolutionary scales, thanks to features extracted from protein large language models. To address the limited availability of large-scale affinity assays with structural information, we generate a simulated dataset comprising approximately 500,000 data points. Our model is pre-trained on this dataset, then fine-tuned and tested on experimental data.

Submitted 3 May, 2024; originally announced May 2024.

arXiv:2302.02522 [pdf, other]

doi 10.1007/978-3-031-36911-7_8

Prior Density Learning in Variational Bayesian Phylogenetic Parameters Inference

Authors: Amine M. Remita, Golrokh Vitae, Abdoulaye Baniré Diallo

Abstract: The advances in variational inference are providing promising paths in Bayesian estimation problems. These advances make variational phylogenetic inference an alternative approach to Markov Chain Monte Carlo methods for approximating the phylogenetic posterior. However, one of the main drawbacks of such approaches is modelling the prior through fixed distributions, which could bias the posterior approximation if they are distant from the current data distribution. In this paper, we propose an approach and an implementation framework to relax the rigidity of the prior densities by learning their parameters using a gradient-based method and a neural network-based parameterization. We applied this approach for branch lengths and evolutionary parameters estimation under several Markov chain substitution models. The results of performed simulations show that the approach is powerful in estimating branch lengths and evolutionary model parameters. They also show that a flexible prior model could provide better results than a predefined prior model. Finally, the results highlight that using neural networks improves the initialization of the optimization of the prior density parameters.

Submitted 8 September, 2023; v1 submitted 5 February, 2023; originally announced February 2023.

Comments: Accepted as a full paper for publication at RECOMB-CG 2023 (LNBI proof version). 15 pages (excluding references), 6 tables and 1 figure

Journal ref: In Jahn, K., Vinař, T. (eds) Comparative Genomics. RECOMB-CG 2023. Lecture Notes in Computer Science, vol 13883. Springer, Cham

arXiv:2205.13034 [pdf, other]

doi 10.1145/3535508.3545563

EvoVGM: a Deep Variational Generative Model for Evolutionary Parameter Estimation

Authors: Amine M. Remita, Abdoulaye Baniré Diallo

Abstract: Most evolutionary-oriented deep generative models do not explicitly consider the underlying evolutionary dynamics of biological sequences as it is performed within the Bayesian phylogenetic inference framework. In this study, we propose a method for a deep variational Bayesian generative model (EvoVGM) that jointly approximates the true posterior of local evolutionary parameters and generates sequence alignments. Moreover, it is instantiated and tuned for continuous-time Markov chain substitution models such as JC69, K80 and GTR. We train the model via a low-variance stochastic estimator and a gradient ascent algorithm. Here, we analyze the consistency and effectiveness of EvoVGM on synthetic sequence alignments simulated with several evolutionary scenarios and different sizes. Finally, we highlight the robustness of a fine-tuned EvoVGM model using a sequence alignment of gene S of coronaviruses.

Submitted 30 June, 2022; v1 submitted 25 May, 2022; originally announced May 2022.

Comments: Accepted as a full paper for publication in ACM-BCB 2022 (Camera-ready version)

Journal ref: In 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (BCB '22), August 7-10, 2022, Northbrook, IL, USA. ACM, New York, NY, USA, 10 pages

arXiv:2201.00126 [pdf]

Etude de classification des bacteriophages

Authors: Dung Nguyen, Alix Boc, Abdoulaye Banire Diallo, Vladimir Makarenkov

Abstract: Phages are one of the most present groups of organisms in the biosphere. Their identification continues and their taxonomies are divergent. However, due to their evolution mode and the complexity of their species ecosystem, their classification is not complete. Here, we present a new approach to the phages classification that combines the methods of horizontal gene transfer detection and ancestral sequence reconstruction.

Submitted 1 January, 2022; originally announced January 2022.

Comments: in French

arXiv:2001.03260 [pdf, other]

doi 10.1109/BIBM47256.2019.8983041

Supporting supervised learning in fungal Biosynthetic Gene Cluster discovery: new benchmark datasets

Authors: Hayda Almeida, Adrian Tsang, Abdoulaye Baniré Diallo

Abstract: Fungal Biosynthetic Gene Clusters (BGCs) of secondary metabolites are clusters of genes capable of producing natural products, compounds that play an important role in the production of a wide variety of bioactive compounds, including antibiotics and pharmaceuticals. Identifying BGCs can lead to the discovery of novel natural products to benefit human health. Previous work has been focused on developing automatic tools to support BGC discovery in plants, fungi, and bacteria. Data-driven methods, as well as probabilistic and supervised learning methods have been explored in identifying BGCs. Most methods applied to identify fungal BGCs were data-driven and presented limited scope. Supervised learning methods have been shown to perform well at identifying BGCs in bacteria, and could be well suited to perform the same task in fungi. But labeled data instances are needed to perform supervised learning. Openly accessible BGC databases contain only a very small portion of previously curated fungal BGCs. Making new fungal BGC datasets available could motivate the development of supervised learning methods for fungal BGCs and potentially improve prediction performance compared to data-driven methods. In this work we propose new publicly available fungal BGC datasets to support the BGC discovery task using supervised learning. These datasets are prepared to perform binary classification and predict candidate BGC regions in fungal genomes. In addition we analyse the performance of a well supported supervised learning tool developed to predict BGCs.

Submitted 9 January, 2020; originally announced January 2020.

Comments: Accepted to Machine Learning and Artificial Intelligence in Bioinformatics and Medical Informatics (MABM2019) at IEEE BIBM 2019

arXiv:1910.05421 [pdf, other]

doi 10.1109/BIBM47256.2019.8983375

Statistical Linear Models in Virus Genomic Alignment-free Classification: Application to Hepatitis C Viruses

Authors: Amine M. Remita, Abdoulaye Baniré Diallo

Abstract: Viral sequence classification is an important task in pathogen detection, epidemiological surveys and evolutionary studies. Statistical learning methods are widely used to classify and identify viral sequences in samples from environments. These methods face several challenges associated with the nature and properties of viral genomes such as recombination, mutation rate and diversity. Also, new generations of sequencing technologies raise other difficulties by generating massive amounts of fragmented sequences. While linear classifiers are often used to classify viruses, there is a lack of exploration of the accuracy space of existing models in the context of alignment-free approaches. In this study, we present an exhaustive assessment procedure exploring the power of linear classifiers in genotyping and subtyping partial and complete genomes. It is applied to the Hepatitis C viruses (HCV). Several variables are considered in this investigation such as classifier types (generative and discriminative) and their hyper-parameters (smoothing value and regularization penalty function), the classification task (genotyping and subtyping), the length of the tested sequences (partial and complete) and the length of k-mer words. Overall, several classifiers perform well given a precise combination of the experimental variables mentioned above. Finally, we provide the procedure and benchmark data to allow for more robust assessment of classification from virus genomes.

Submitted 28 May, 2024; v1 submitted 11 October, 2019; originally announced October 2019.

Comments: Accepted as a regular paper for publication in IEEE BIBM 2019 [v3: Fix indices in Markov classifier]

Journal ref: 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), San Diego, CA, USA, 2019, pp. 474-481

arXiv:1604.00045 [pdf, other]

PGR: A Graph Repository of Protein 3D-Structures

Authors: Wajdi Dhifli, Abdoulaye Baniré Diallo

Abstract: Graph theory and graph mining constitute rich fields of computational techniques to study the structures, topologies and properties of graphs. These techniques constitute a good asset in bioinformatics if there exist efficient methods for transforming biological data into graphs. In this paper, we present Protein Graph Repository (PGR), a novel database of protein 3D-structures transformed into graphs allowing the use of the large repertoire of graph theory techniques in protein mining. This repository contains graph representations of all currently known protein 3D-structures described in the Protein Data Bank (PDB). PGR also provides an efficient online converter of protein 3D-structures into graphs, biological and graph-based description, pre-computed protein graph attributes and statistics, visualization of each protein graph, as well as a graph-based protein similarity search tool. Such a repository presents an enrichment of existing online databases that will help bridge the gap between graph mining and protein structure analysis. PGR data and features are unique and not included in any other protein database. The repository is available at http://wjdi.bioinfo.uqam.ca/.

Submitted 24 January, 2016; originally announced April 2016.

Showing 1–8 of 8 results for author: Diallo, B