-
Well-supported phylogenies using largest subsets of core-genes by discrete particle swarm optimization
Authors:
Reem Alsrraj,
Bassam AlKindy,
Christophe Guyeux,
Laurent Philippe,
Jean-François Couchot
Abstract:
The number of complete chloroplastic genomes increases day after day, making it possible to rethink plant phylogeny in the biomolecular era. Given a set of close plants sharing on the order of one hundred core chloroplastic genes, this article focuses on how to extract the largest subset of sequences in order to obtain the most supported species tree. Due to computational complexity, a discrete and distributed Particle Swarm Optimization (DPSO) is proposed. It is finally applied to the core genes of the order Rosales.
Submitted 25 June, 2017;
originally announced June 2017.
-
A Pipeline for Insertion Sequence Detection and Study for Bacterial Genome
Authors:
Huda Al-Nayyef,
Christophe Guyeux,
Jacques M. Bahi
Abstract:
Insertion Sequences (ISs) are small DNA segments that are able to move within genomes. These mobile genetic elements (MGEs) seem to play an essential role in genome rearrangements and in the evolution of prokaryotic genomes, but the tools that discover ISs efficiently and accurately are still too few and not fully precise. Two main factors strongly affect IS discovery, namely gene annotation and functionality prediction. Indeed, some specific genes called "transposases" are enzymes that are responsible for the production and catalysis of such transposition, but there is currently no fully accurate method that can decide whether a given predicted gene is a real transposase or not. This is why the authors of this article aim to design a novel pipeline for IS detection and classification, which embeds the most recently available tools developed in this field of research, namely OASIS (Optimized Annotation System for Insertion Sequence) and the ISFinder database (an up-to-date and accurate repository of known insertion sequences). As the latter depend on predicted coding sequences, the proposed pipeline also encompasses various bacterial gene annotation tools (that is, Prokka, BASys, and Prodigal). A complete IS detection and classification pipeline is then proposed and tested on a set of 23 complete genomes of Pseudomonas aeruginosa. This pipeline can also be used to investigate the performance of annotation tools, which has led us to conclude that Prodigal is the best software for IS prediction. A deeper study of IS elements in P. aeruginosa has then been conducted, leading to the conclusion that close genomes inside this species also have similar numbers of IS families and groups.
Submitted 26 June, 2017;
originally announced June 2017.
-
Relation between Insertion Sequences and Genome Rearrangements in Pseudomonas aeruginosa
Authors:
Huda Al-Nayyef,
Christophe Guyeux,
Marie Petitjean,
Didier Hocquet,
Jacques M. Bahi
Abstract:
During the evolution of microorganisms, genomes undergo various changes in their lengths, gene orders, and gene contents. Investigating these structural rearrangements helps to understand how genomes have been modified over time. Some elements that play an important role in genome rearrangements are called insertion sequences (ISs). They are the simplest type of transposable elements (TEs) and are widely spread within prokaryotic genomes. ISs can be defined as DNA segments that have the ability to move (cut and paste) themselves to another location, within the same chromosome or not. Due to their ability to move around, they are often presented as responsible for some of these genomic recombinations. The authors of this research work have examined this claim by checking whether a relation between insertion sequences (ISs) and genome rearrangements can be found. To achieve this goal, a new pipeline that combines various tools has first been designed, for detecting the distribution of ORFs that belong to each IS category. Secondly, links between these predicted ISs and observed rearrangements of two close genomes have been investigated, both by visual inspection and by using computational approaches. The proposal has been tested on 18 complete bacterial genomes of Pseudomonas aeruginosa, leading to the conclusion that the IS3 family of insertion sequences is related to genomic inversions.
Submitted 25 June, 2017;
originally announced June 2017.
-
Finding optimal finite biological sequences over finite alphabets: the OptiFin toolbox
Authors:
Régis Garnier,
Christophe Guyeux,
Stéphane Chrétien
Abstract:
In this paper, we present a toolbox for a specific optimization problem that frequently arises in bioinformatics or genomics. In this specific optimization problem, the state space is a set of words of specified length over a finite alphabet. A score is associated with each word. The overall objective is to find the words which have the lowest possible score. This type of general optimization problem is encountered in, e.g., 3D conformation optimization for protein structure prediction, or largest core genes subset discovery based on the best supported phylogenetic tree for a set of species. In order to solve this problem, we propose a toolbox that can be easily launched using MPI and embeds 3 well-known metaheuristics. The toolbox is fully parametrized and well documented. It has been specifically designed to be easily modified and possibly improved by the user depending on the application, and does not require the user to be a computer scientist. We show that the toolbox performs very well on two difficult practical problems.
Submitted 25 June, 2017;
originally announced June 2017.
-
On the ability to reconstruct ancestral genomes from Mycobacterium genus
Authors:
Christophe Guyeux,
Bashar Al-Nuaimi,
Bassam AlKindy,
Jean-François Couchot,
Michel Salomon
Abstract:
Technical progress during the last decades has led to a situation in which the accumulation of genome sequence data is increasingly fast and cheap. The huge amount of molecular data available nowadays can help address new and essential questions in evolution. However, reconstructing the evolution of DNA sequences requires models, algorithms, and statistical and computational methods of ever increasing complexity. Since the most dramatic genomic changes are caused by genome rearrangements (gene duplications, gain/loss events), it becomes crucial to understand their mechanisms and reconstruct ancestors of the given genomes. This problem is known to be NP-complete even in the "simplest" case of three genomes. Heuristic algorithms are usually executed to provide approximations of the exact solution.
We state that, even if the ancestral reconstruction problem is NP-hard in theory, its exact resolution is feasible in various situations, encompassing organelles and some bacteria. Such an accurate reconstruction, which also identifies some highly homoplasic mutations whose ancestral status is undecidable, will be initiated in this work-in-progress, to reconstruct ancestral genomes of two pathogenic Mycobacterium bacteria. By mixing automatic reconstruction of obvious situations with human interventions on signaled problematic cases, we will indicate that it should be possible to achieve a concrete, complete, and really accurate reconstruction of lineages of the Mycobacterium tuberculosis complex. Thus, it is possible to investigate how these genomes have evolved from their last common ancestors.
Submitted 30 April, 2017;
originally announced May 2017.
-
Taenia Biomolecular Phylogeny and the Impact of Mitochondrial Genes on this Latter
Authors:
Huda Al-Nayyef,
Christophe Guyeux,
Jacques M. Bahi
Abstract:
Variations in mitochondrial genes are usually considered to infer phylogenies. However, some of these genes are less constrained than others, and thus may blur the phylogenetic signal shared by the majority of the mitochondrial DNA sequences. To investigate such effects, in this research work, the molecular phylogeny of the genus Taenia is studied using 14 coding sequences extracted from mitochondrial genomes of 17 species. We constructed 16,384 trees, using combinations of 1 up to 14 genes. We obtained 131 topologies, and we showed that only four particular instances were relevant. Using further statistical investigations, we then extracted a particular topology, which displays more robustness properties.
Submitted 15 March, 2017;
originally announced March 2017.
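The tree count quoted in the abstract above can be related to plain subset counting. The following is a minimal sketch, assuming that each tree corresponds to one subset of the 14 coding sequences; the gene labels and the `count_subsets` helper are hypothetical placeholders and are not taken from the paper.

```python
from itertools import combinations

# Hypothetical labels standing in for the 14 mitochondrial coding sequences.
genes = [f"gene{i}" for i in range(1, 15)]


def count_subsets(items, min_size=0):
    """Count the subsets of `items` that contain at least `min_size` elements."""
    return sum(
        1
        for k in range(min_size, len(items) + 1)
        for _ in combinations(items, k)
    )


all_subsets = count_subsets(genes)            # includes the empty subset
non_empty = count_subsets(genes, min_size=1)  # subsets of 1 up to 14 genes
print(all_subsets, non_empty)
```

Under this assumption, 16,384 equals 2^14, the number of all subsets of 14 genes when the empty subset is counted.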
-
Predicting the Evolution of Gene $ura3$ in the Yeast Saccharomyces Cerevisiae
Authors:
Jacques M. Bahi,
Christophe Guyeux,
Antoine Perasso
Abstract:
Since the late '60s, various genome evolutionary models have been proposed to predict the evolution of a DNA sequence as the generations pass. Most of these models are based on nucleotide evolution, so they use a mutation matrix of size 4x4. They encompass, for instance, the well-known models of Jukes and Cantor, Kimura, and Tamura. In essence, all of these models relate the evolution of DNA sequences to the computation of the successive powers of a mutation matrix. To make this computation possible, particular forms for the mutation matrix are assumed, which are not compatible with mutation rates that have been recently obtained experimentally on gene ura3 of the yeast Saccharomyces cerevisiae. Using this experimental study, the authors of this paper have deduced a simple mutation matrix, computed the future evolution of the purine/pyrimidine rate for ura3, investigated the particular behavior of cytosines and thymines compared to purines, and simulated the evolution of each nucleotide.
Submitted 8 February, 2017;
originally announced February 2017.
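The abstract above relates sequence evolution to successive powers of a 4x4 mutation matrix. The following is a minimal sketch of that idea for the symmetric Jukes and Cantor form only, not for the matrix deduced in the paper; the substitution probability `0.01`, the nucleotide order A, C, G, T, and all function names are assumptions chosen for illustration.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, t):
    """Return m**t for a non-negative integer t, by repeated squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = [row[:] for row in m]
    while t > 0:
        if t & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        t >>= 1
    return result


def jukes_cantor(alpha):
    """4x4 Jukes-Cantor matrix: each substitution has probability alpha."""
    return [[1 - 3 * alpha if i == j else alpha for j in range(4)]
            for i in range(4)]


def distribution_after(m, p0, t):
    """Nucleotide distribution after t generations, starting from row vector p0."""
    mt = mat_pow(m, t)
    return [sum(p0[i] * mt[i][j] for i in range(4)) for j in range(4)]


M = jukes_cantor(0.01)        # assumed per-generation substitution probability
start = [1.0, 0.0, 0.0, 0.0]  # a site that is currently A (order A, C, G, T)

for t in (1, 10, 1000):
    print(t, [round(x, 4) for x in distribution_after(M, start, t)])
```

With this symmetric form the distribution tends towards equal frequencies of the four nucleotides, which is one of the structural constraints the abstract describes as incompatible with the experimental rates.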
-
A clustering tool for nucleotide sequences using Laplacian Eigenmaps and Gaussian Mixture Models
Authors:
Marine Bruneau,
Thierry Mottet,
Serge Moulin,
Maël Kerbiriou,
Franz Chouly,
Stéphane Chretien,
Christophe Guyeux
Abstract:
We propose a new procedure for clustering nucleotide sequences based on the "Laplacian Eigenmaps" and Gaussian Mixture modelling. This proposal is then applied to a set of 100 DNA sequences from the mitochondrially encoded NADH dehydrogenase 3 (ND3) gene of a collection of Platyhelminthes and Nematoda species. The resulting clusters are then shown to be consistent with the gene phylogenetic tree computed using a maximum likelihood approach. This comparison shows in particular that the clustering produced by the methodology combining Laplacian Eigenmaps with Gaussian Mixture models is coherent with the phylogeny as well as with the NCBI taxonomy. We also developed a Python package for this procedure which is available online.
Submitted 26 October, 2016;
originally announced October 2016.
-
Relation between Gene Content and Taxonomy in Chloroplasts
Authors:
Bashar Al-Nuaimi,
Christophe Guyeux,
Bassam AlKindy,
Jean-François Couchot,
Michel Salomon
Abstract:
The aim of this study is to investigate the relation that can be found between the phylogeny of a large set of complete chloroplast genomes and the evolution of gene content inside these sequences. Core and pan genomes have been computed on de novo annotation of these 845 genomes, the former being used for producing a well-supported phylogenetic tree while the latter provides information regarding the evolution of gene contents over time. It also details the specificity of some branches of the tree, when specificity is obtained on accessory genes. After having detailed the material and methods, we emphasize some remarkable relations between well-known events of the chloroplast history, like endosymbiosis, and the evolution of gene contents over the phylogenetic tree.
Submitted 20 September, 2016;
originally announced September 2016.
-
Binary Particle Swarm Optimization versus Hybrid Genetic Algorithm for Inferring Well Supported Phylogenetic Trees
Authors:
Bassam AlKindy,
Bashar Al-Nuaimi,
Christophe Guyeux,
Jean-François Couchot,
Michel Salomon,
Reem Alsrraj,
Laurent Philippe
Abstract:
The number of completely sequenced chloroplast genomes increases rapidly every day, making it possible to build large-scale phylogenetic trees of plant species. Considering a subset of close plant species defined according to their chloroplasts, the phylogenetic tree that can be inferred from their core genes is not necessarily well supported, due to the possible occurrence of problematic genes (i.e., homoplasy, incomplete lineage sorting, horizontal gene transfers, etc.) which may blur the phylogenetic signal. However, a trustworthy phylogenetic tree can still be obtained provided that the number of such blurring genes is reduced. The problem is thus to determine the largest subset of core genes that produces the best-supported tree. To discard problematic genes and due to the overwhelming number of possible combinations, this article focuses on how to extract the largest subset of sequences in order to obtain the most supported species tree. Due to computational complexity, a Binary Particle Swarm Optimization (BPSO) is proposed in sequential and distributed fashions. Results obtained from both versions of the BPSO are compared with those computed using a hybrid approach embedding both genetic algorithms and statistical tests. The proposal has been applied to different cases of plant families, leading to encouraging results for these families.
Submitted 31 August, 2016;
originally announced August 2016.
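To make the idea of a binary particle swarm over gene subsets more concrete, here is a minimal sequential sketch of a common inertia-weight variant of binary PSO, in which each bit says whether a core gene is kept. It is not the authors' implementation: the parameter values, the `PROBLEMATIC` index set, and `toy_fitness` are hypothetical stand-ins, whereas the paper's objective is based on the support of inferred phylogenetic trees.

```python
import math
import random


def binary_pso(fitness, n_bits, n_particles=20, n_iters=50,
               w=0.7, c1=1.5, c2=1.5, v_max=4.0, seed=0):
    """Maximize `fitness` over bit vectors of length `n_bits`."""
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_bits)]
           for _ in range(n_particles)]
    vel = [[0.0] * n_bits for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_score = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_score[i])
    gbest, gbest_score = pbest[g][:], pbest_score[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(n_bits):
                # Velocity update pulled towards personal and global bests.
                v = (w * vel[i][d]
                     + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                     + c2 * rng.random() * (gbest[d] - pos[i][d]))
                v = max(-v_max, min(v_max, v))  # clamp to avoid saturation
                vel[i][d] = v
                # The sigmoid of the velocity is the probability of bit = 1.
                prob_one = 1.0 / (1.0 + math.exp(-v))
                pos[i][d] = 1 if rng.random() < prob_one else 0
            score = fitness(pos[i])
            if score > pbest_score[i]:
                pbest[i], pbest_score[i] = pos[i][:], score
                if score > gbest_score:
                    gbest, gbest_score = pos[i][:], score
    return gbest, gbest_score


# Hypothetical indices of genes that blur the phylogenetic signal.
PROBLEMATIC = {2, 5, 11}


def toy_fitness(bits):
    """Toy objective: reward kept genes, penalize problematic ones."""
    kept = sum(bits)
    bad = sum(bits[i] for i in PROBLEMATIC)
    return kept - 5 * bad


best, score = binary_pso(toy_fitness, n_bits=16)
print(best, score)
```

In a distributed setting, the fitness evaluations of the particles are the natural unit of work to farm out, since each one would involve a full tree inference.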
-
Relaxing the Hypotheses of Symmetry and Time-Reversibility in Genome Evolutionary Models
Authors:
Jacques M. Bahi,
Christophe Guyeux,
Antoine Perasso
Abstract:
Various genome evolutionary models have been proposed in recent decades to predict the evolution of a DNA sequence over time, essentially described using a mutation matrix. In essence, all of these models relate the evolution of DNA sequences to the computation of the successive powers of the mutation matrix. To make this computation possible, hypotheses are assumed for the matrix, such as symmetry and time-reversibility, which are not compatible with mutation rates that have been recently obtained experimentally on genes ura3 and can1 of the yeast Saccharomyces cerevisiae. In this work, the authors systematically investigate the possibility of relaxing either the symmetry or the time-reversibility hypothesis of the mutation matrix, by investigating all the possible matrices of size 2x2 and 3x3. As an application example, the experimental study on the yeast Saccharomyces cerevisiae has been used in order to deduce a simple mutation matrix, and to compute the future evolution of the purine/pyrimidine rate for ura3 on the one hand, and of the particular behavior of cytosines and thymines compared to purines on the other hand.
Submitted 22 August, 2016;
originally announced August 2016.
-
Protein Folding in the 2D Hydrophobic-Hydrophilic (HP) Square Lattice Model is Chaotic
Authors:
Jacques M. Bahi,
Nathalie Côté,
Christophe Guyeux,
Michel Salomon
Abstract:
Among the unsolved problems in computational biology, protein folding is one of the most interesting challenges. To study this folding, tools like neural networks and genetic algorithms have received a lot of attention, mainly due to the NP-completeness of the folding process. The background idea that has given rise to the use of these algorithms is obviously that the folding process is predictable. However, this important assumption is disputable, as chaotic properties of such a process have been recently highlighted. In this paper, which is an extension of a former work accepted to the 2011 International Joint Conference on Neural Networks (IJCNN11), the topological behavior of a well-known dynamical system used for protein folding prediction is evaluated. It is mathematically established that the folding dynamics in the 2D hydrophobic-hydrophilic (HP) square lattice model, simply called "the 2D model" in this document, is indeed a chaotic dynamical system as defined by Devaney. Furthermore, the chaotic behavior of this model is further characterized, both qualitatively and quantitatively, by studying other mathematical properties of disorder, namely: indecomposability, instability, strong transitivity, and constants of expansivity and sensitivity. Some consequences for both biological paradigms and structure prediction using this model are then discussed. In particular, it is shown that some neural networks seem to be unable to predict the evolution of this model with accuracy, due to its complex behavior.
Submitted 20 August, 2016;
originally announced August 2016.
-
Chaos in DNA Evolution
Authors:
Jacques M. Bahi,
Christophe Guyeux,
Antoine Perasso
Abstract:
In this paper, we explain why the chaotic model (CM) of Bahi and Michel (2008) accurately simulates gene mutations over time. First, we demonstrate that the CM model is a truly chaotic one, as defined by Devaney. Then, we show that mutations occurring in gene mutations have the same chaotic dynamic, thus making the use of chaotic models relevant for genome evolution.
Submitted 20 August, 2016;
originally announced August 2016.
-
Simulation based estimation of branching models for LTR retrotransposons
Authors:
Serge Moulin,
Nicolas Seux,
Stéphane Chrétien,
Christophe Guyeux,
Emmanuelle Lerat
Abstract:
Motivation: LTR retrotransposons are mobile elements that are able, like retroviruses, to copy and move inside eukaryotic genomes. In the present work, we propose a branching model for studying the propagation of LTR retrotransposons in these genomes. This model makes it possible to take into account both the positions and the degradations of LTR retrotransposon copies. In our model, the duplication rate is also allowed to vary with the degradation level.
Results: Various functions have been implemented in order to simulate their spread and visualization tools are proposed. Based on these simulation tools, we show that an accurate estimation of the parameters of this propagation model can be performed. We applied this method to the study of the spread of the transposable elements ROO, GYPSY, and DM412 on a chromosome of Drosophila melanogaster.
Availability: Our proposal has been implemented using Python software. Source code is freely available on the web at https://github.com/SergeMOULIN/retrotransposons-spread.
Submitted 7 March, 2016;
originally announced March 2016.
-
Chaos of Protein Folding
Authors:
Jacques M. Bahi,
Nathalie M. -L. Cote,
Christophe Guyeux
Abstract:
As protein folding is an NP-complete problem, artificial intelligence tools like neural networks and genetic algorithms are used to attempt to predict the 3D shape of an amino acid sequence. Underlying these attempts, it is supposed that this folding process is predictable. However, to the best of our knowledge, this important assumption has been neither proven nor studied. In this paper the topological dynamic of protein folding is evaluated. It is mathematically established that protein folding in the 2D hydrophobic-hydrophilic (HP) square lattice model is chaotic as defined by Devaney. Consequences for both structure prediction and biology are then outlined.
Submitted 31 October, 2015;
originally announced November 2015.
-
Improved Core Genes Prediction for Constructing well-supported Phylogenetic Trees in large sets of Plant Species
Authors:
Bassam AlKindy,
Huda Al-Nayyef,
Christophe Guyeux,
Jean-François Couchot,
Michel Salomon,
Jacques M. Bahi
Abstract:
Inferring well-supported phylogenetic trees that precisely reflect the evolutionary process is a challenging task that completely depends on the way the related core genes have been found. In previous computational biology studies, many similarity-based algorithms, mainly dependent on calculating sequence alignment matrices, have been proposed to find them. In these kinds of approaches, a significantly high similarity score between two coding sequences extracted from a given annotation tool is taken to mean that they are the same gene. In a previous article, we presented a quality test approach (QTA) that improves the core genes quality by combining two annotation tools (namely NCBI, a partially human-curated database, and DOGMA, an efficient annotation algorithm for chloroplasts). This method takes advantage of both sequence similarity and gene features to guarantee that the core genome contains correct and well-clustered coding sequences (i.e., genes). We then show in this article how useful such well-defined core genes are for biomolecular phylogenetic reconstructions, by investigating various subsets of core genes at various family or genus levels, leading to subtrees with strong bootstraps that are finally merged into a well-supported supertree.
Submitted 23 April, 2015;
originally announced April 2015.
-
Hybrid Genetic Algorithm and Lasso Test Approach for Inferring Well Supported Phylogenetic Trees based on Subsets of Chloroplastic Core Genes
Authors:
Bassam AlKindy,
Christophe Guyeux,
Jean-François Couchot,
Michel Salomon,
Christian Parisod,
Jacques M. Bahi
Abstract:
The number of completely sequenced chloroplast genomes increases rapidly every day, making it possible to build large-scale phylogenetic trees of plant species. Considering a subset of close plant species defined according to their chloroplasts, the phylogenetic tree that can be inferred from their core genes is not necessarily well supported, due to the possible occurrence of "problematic" genes (i.e., homoplasy, incomplete lineage sorting, horizontal gene transfers, etc.) which may blur the phylogenetic signal. However, a trustworthy phylogenetic tree can still be obtained if the number of problematic genes is low, the problem being to determine the largest subset of core genes that produces the best supported tree. To discard problematic genes and due to the overwhelming number of possible combinations, we propose a hybrid approach that embeds both genetic algorithms and statistical tests. Given a set of organisms, the result is a pipeline of many stages for the production of well-supported phylogenetic trees. The proposal has been applied to different cases of plant families, leading to encouraging results for these families.
Submitted 20 April, 2015;
originally announced April 2015.
-
Gene Similarity-based Approaches for Determining Core-Genes of Chloroplasts
Authors:
Bassam AlKindy,
Christophe Guyeux,
Jean-François Couchot,
Michel Salomon,
Jacques M. Bahi
Abstract:
In computational biology and bioinformatics, how to understand evolution processes within various related organisms has received a lot of attention in recent decades. However, accurate methodologies are still needed to discover the evolution of gene content. In a previous work, two novel approaches based on sequence similarities and gene features have been proposed. More precisely, we proposed to use gene names, sequence similarities, or both, obtained either from NCBI or from DOGMA annotation tools. DOGMA has the advantage of being an up-to-date, accurate automatic tool specifically designed for chloroplasts, whereas NCBI possesses high-quality human-curated genes (together with wrongly annotated ones). The key idea of the former proposal was to take the best from these two tools. However, the first proposal was limited by name variations and spelling errors on the NCBI side, leading to core trees of low quality. In this paper, these flaws are fixed by improving the comparison of NCBI and DOGMA results, and by relaxing constraints on gene names while adding a stage of post-validation on gene sequences. The two stages of similarity measures, on names and sequences, are thus proposed for sequence clustering. This improves results that can be obtained using either NCBI or DOGMA alone. Results obtained with this quality control test are further investigated and compared with previously released ones, on both computational and biological aspects, considering a set of 99 chloroplastic genomes.
Submitted 17 December, 2014;
originally announced December 2014.
-
Finding the Core-Genes of Chloroplasts
Authors:
Bassam AlKindy,
Jean-François Couchot,
Christophe Guyeux,
Arnaud Mouly,
Michel Salomon,
Jacques M. Bahi
Abstract:
Due to the recent evolution of sequencing techniques, the number of available genomes is rising steadily, making large-scale genomic comparisons between sets of close species possible. An interesting question is: which functional genes are common to a collection of species or, conversely, what is specific to a given species when compared to other ones belonging to the same genus, family, etc. Investigating this problem means finding both the core and pan genomes of a collection of species, i.e., the genes common to all the species vs. the set of all genes in all species under consideration. However, obtaining trustworthy core and pan genomes is not an easy task: it involves a large amount of computation and requires a rigorous methodology. Surprisingly, as far as we know, the methodology for finding core and pan genomes has not been deeply investigated. This research work tries to fill this gap by focusing only on chloroplastic genomes, whose reasonable sizes allow a deep study. To achieve this goal, a collection of 99 chloroplasts is considered in this article. Two methodologies have been investigated, respectively based on sequence similarities and on gene names taken from annotation tools. The obtained results are finally evaluated in terms of biological relevance.
Submitted 22 September, 2014;
originally announced September 2014.
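The definitions of core and pan genomes given in the abstract above (genes common to all species vs. all genes found in any species) amount to a set intersection and a set union. The sketch below assumes each genome has already been reduced to a set of gene names; the function name `core_and_pan` and the toy species are hypothetical, and the hard part discussed in the article (deciding when two annotated genes are the same) is not addressed here.

```python
def core_and_pan(genomes):
    """Return (core, pan) for a dict mapping species name -> set of gene names.

    core: genes present in every species (intersection).
    pan:  genes present in at least one species (union).
    """
    gene_sets = list(genomes.values())
    if not gene_sets:
        return set(), set()
    core = set.intersection(*gene_sets)
    pan = set.union(*gene_sets)
    return core, pan


# Toy input, for illustration only (not data from the article).
genomes = {
    "species_A": {"rbcL", "psbA", "matK", "ndhF"},
    "species_B": {"rbcL", "psbA", "matK"},
    "species_C": {"rbcL", "psbA", "ycf1"},
}
core, pan = core_and_pan(genomes)
print(sorted(core))  # genes shared by all three toy species
print(sorted(pan))   # every gene seen in at least one toy species
```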
-
Computational investigations of folded self-avoiding walks related to protein folding
Authors:
Jacques M. Bahi,
Christophe Guyeux,
Kamel Mazouzi,
Laurent Philippe
Abstract:
Various subsets of self-avoiding walks naturally appear when investigating existing methods designed to predict the 3D conformation of a protein of interest. Two such subsets, namely the folded and the unfoldable self-avoiding walks, are studied computationally in this article. We show that these two sets are equal and correspond to the whole set of $n$-step self-avoiding walks for $n\leqslant 14$, but that they are different for numerous $n \geqslant 108$, which are common protein lengths. Concrete counterexamples are provided and the computational methods used to discover them are described in full detail. A tool for studying these subsets of walks, related to both pivot moves and protein conformations, is finally presented.
Submitted 18 June, 2013;
originally announced June 2013.
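As a point of reference for the "whole $n$-step self-avoiding walks" mentioned in the abstract above, the following sketch enumerates them by depth-first search. It assumes the 2D square lattice with walks starting at the origin, which the abstract does not state explicitly; `count_saws` is a hypothetical name, and the exhaustive search is only practical for small $n$.

```python
def count_saws(n):
    """Count n-step self-avoiding walks from the origin on the square lattice."""
    moves = [(1, 0), (-1, 0), (0, 1), (0, -1)]

    def extend(pos, visited, remaining):
        if remaining == 0:
            return 1
        total = 0
        x, y = pos
        for dx, dy in moves:
            nxt = (x + dx, y + dy)
            if nxt not in visited:  # self-avoidance: never revisit a site
                visited.add(nxt)
                total += extend(nxt, visited, remaining - 1)
                visited.remove(nxt)  # backtrack
        return total

    return extend((0, 0), {(0, 0)}, n)


for n in range(1, 5):
    print(n, count_saws(n))
```

The number of walks grows exponentially with $n$, so this brute-force count cannot reach lengths such as $n \geqslant 108$ and says nothing about how the article's counterexamples were found.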
-
Protein structure prediction software generate two different sets of conformations. Or the study of unfolded self-avoiding walks
Authors:
Jacques M. Bahi,
Christophe Guyeux,
Jean-Marc Nicod,
Laurent Philippe
Abstract:
Self-avoiding walks (SAWs) are the source of very difficult problems in probability and enumerative combinatorics. They are also of great interest as they are, for instance, the basis of protein structure prediction in bioinformatics. The authors of this article have previously shown that, depending on the prediction algorithm, the sets of obtained conformations differ: all the self-avoiding walks can be reached using stretching-based algorithms, whereas only the folded SAWs can be attained with methods that iteratively fold the straight line. A first study of (un)folded self-avoiding walks is presented in this article. The contribution is mainly a survey of what is currently known about these sets. In particular, we provide clear definitions of various subsets of self-avoiding walks related to pivot moves (folded or unfoldable SAWs, etc.) and the first results we have obtained, theoretically or computationally, on these sets. A list of open questions is provided too, and the consequences for the protein structure prediction problem are finally investigated.
Submitted 6 June, 2013;
originally announced June 2013.
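The "methods that iteratively fold the straight line" mentioned in the abstract above can be pictured with a small sketch of a pivot move. It assumes the 2D square lattice and takes a pivot move to be a 90° rotation of the tail of the walk around one of its sites; the article's formal definitions of folded and unfoldable SAWs may differ in detail, and the function names are hypothetical.

```python
def straight_walk(n):
    """The n-step straight line along the x axis, as a list of lattice sites."""
    return [(i, 0) for i in range(n + 1)]


def is_self_avoiding(walk):
    """A walk is self-avoiding when no lattice site is visited twice."""
    return len(set(walk)) == len(walk)


def pivot(walk, k, clockwise=False):
    """Rotate the part of the walk after site k by 90 degrees around walk[k]."""
    px, py = walk[k]
    rotated = []
    for x, y in walk[k + 1:]:
        dx, dy = x - px, y - py
        if clockwise:
            dx, dy = dy, -dx
        else:
            dx, dy = -dy, dx
        rotated.append((px + dx, py + dy))
    return walk[:k + 1] + rotated


# Fold the 3-step straight line twice, keeping a move only if it stays a SAW.
walk = straight_walk(3)
for k in (1, 2):
    candidate = pivot(walk, k)
    if is_self_avoiding(candidate):
        walk = candidate
print(walk)

# A third fold at site 2 would bring the last site back onto an occupied one.
print(is_self_avoiding(pivot(walk, 2)))
```

In this picture, a walk would count as reachable by folding when some sequence of such accepted moves leads to it from the straight line.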
-
Is protein folding problem really a NP-complete one ? First investigations
Authors:
Jacques M. Bahi,
Wojciech Bienia,
Nathalie Côté,
Christophe Guyeux
Abstract:
Determining the 3D conformation of proteins is necessary to understand their functions or interactions with other molecules. It is commonly accepted that, when proteins fold from their primary linear structures to their final 3D conformations, they tend to choose the ones that minimize their free energy. To find the 3D conformation of a protein knowing its amino acid sequence, bioinformaticians use various models of different resolutions and artificial intelligence tools, as the protein folding prediction problem is NP-complete. More precisely, determining the backbone structure of the protein using the low-resolution models (2D HP square and 3D HP cubic), by finding the conformation that minimizes free energy, is intractable to solve exactly. Both the proof of NP-completeness and the 2D prediction consider that acceptable conformations have to satisfy a self-avoiding walk (SAW) requirement, as two different amino acids cannot occupy the same position in the lattice. It is shown in this document that the SAW requirement considered when proving NP-completeness is different from the SAW requirement used in various prediction programs, and that both are different from the real biological requirement. Indeed, the proof of NP-completeness and the in silico predictions consider conformations that are not possible in practice. Consequences of this fact are investigated in this research work.
Submitted 6 June, 2013;
originally announced June 2013.
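For readers unfamiliar with the 2D HP square model named in the abstract above, the following sketch evaluates the quantity that is minimized. It assumes the standard HP energy convention: each pair of hydrophobic (H) residues that are lattice neighbours but not consecutive in the chain contributes -1. The function name `hp_energy` and the toy sequence are hypothetical, and the sketch only scores a given conformation; it does not search for the optimum, which is the intractable part.

```python
def hp_energy(sequence, walk):
    """Energy of an HP sequence placed on a 2D square-lattice conformation.

    sequence: string over {"H", "P"}; walk: list of (x, y) sites, one per residue.
    """
    if len(sequence) != len(walk):
        raise ValueError("one lattice site is needed per residue")
    if len(set(walk)) != len(walk):
        raise ValueError("conformation is not self-avoiding")
    index_at = {site: i for i, site in enumerate(walk)}
    contacts = 0
    for i, (x, y) in enumerate(walk):
        if sequence[i] != "H":
            continue
        # Look only right and up so that each neighbouring pair is counted once.
        for neighbour in ((x + 1, y), (x, y + 1)):
            j = index_at.get(neighbour)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                contacts += 1
    return -contacts


folded = [(0, 0), (1, 0), (1, 1), (0, 1)]      # a unit square
extended = [(0, 0), (1, 0), (2, 0), (3, 0)]    # the straight line
print(hp_energy("HPPH", folded), hp_energy("HPPH", extended))
```

The self-avoidance check in `hp_energy` is the lattice-level SAW requirement; the abstract's point is that different proofs and programs impose different versions of that requirement on how conformations are reached.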