-
A frame-based representation of genomic sequences for removing errors and rare variant detection in NGS data
Authors:
Raunaq Malhotra,
Manjari Mukhopadhyay,
Mary Poss,
Raj Acharya
Abstract:
We propose a frame-based representation of k-mers for detecting sequencing errors and rare variants in next generation sequencing data obtained from populations of closely related genomes. Frames are sets of non-orthogonal basis functions, traditionally used in signal processing for noise removal. We define a frame for genomes and sequenced reads to consist of discrete spatial signals of every k-mer of a given size. We show that each k-mer in the sequenced data can be projected onto multiple frames and that these projections are maximized for spatial signals corresponding to the k-mer's substrings. Our proposed classifier, MultiRes, is trained on the projections of k-mers as features for marking k-mers as erroneous or as true variations in the genome. We evaluate MultiRes on simulated and real viral population datasets and compare it to other error correction methods known in the literature. MultiRes produces 4 to 500 times fewer false positive k-mer predictions than other methods, which is essential for accurate estimation of viral population diversity and for de novo assembly of viral populations. It has high recall of the true k-mers, comparable to other error correction methods. MultiRes also has greater than 95% recall for detecting single nucleotide polymorphisms (SNPs) and reports fewer false positive SNPs, while detecting a higher number of rare variants than other variant calling methods for viral populations. The software is freely available on GitHub (https://github.com/raunaq-m/MultiRes).
Submitted 16 April, 2016;
originally announced April 2016.
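As a rough illustration of the k-mer representation this abstract builds on, the following Python sketch counts k-mers in a set of reads and, for one k-mer, collects the counts of its substrings at smaller sizes. It is not the MultiRes method: the frame projections and the classifier are not implemented, and the function names `count_kmers` and `sub_kmer_counts` and the toy reads are hypothetical.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def sub_kmer_counts(kmer, reads, sizes):
    """For one k-mer, list the counts of its substrings at each smaller size."""
    features = {}
    for s in sizes:
        table = count_kmers(reads, s)
        features[s] = [table[kmer[i:i + s]] for i in range(len(kmer) - s + 1)]
    return features

# Toy reads (assumed data): the second read differs from the others in its last base.
reads = ["ACGTACGT", "ACGTACGA", "ACGTACGT"]
counts = count_kmers(reads, 4)
features = sub_kmer_counts("ACGA", reads, [2, 3])
```

In this toy input the 4-mer `ACGA` occurs once while `ACGT` occurs five times, and the substring counts of `ACGA` drop only at its last position. A count profile like this is the kind of multi-resolution feature a classifier could use to separate errors from rare variants, though how MultiRes does so is described only at the level of the abstract here.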
-
Separating Putative Pathogens from Background Contamination with Principal Orthogonal Decomposition: Evidence for Leptospira in the Ugandan Neonatal Septisome
Authors:
Steven J. Schiff,
Julius Kiwanuka,
Gina Riggio,
Lan Nguyen,
Kevin Mu,
Emily Sproul,
Joel Bazira,
Juliet Mwanga,
Dickson Tumusiime,
Eunice Nyesigire,
Nkangi Lwanga,
Kaleb T. Bogale,
Vivek Kapur,
James Broach,
Sarah Morton,
Benjamin C. Warf,
Mary Poss
Abstract:
Neonatal sepsis (NS) is responsible for over 1 million deaths worldwide each year. In the developing world, NS is often treated without an identified microbial pathogen. Amplicon sequencing of the bacterial 16S rRNA gene can be used to identify organisms that are difficult to detect by routine microbiological methods. However, contaminating bacteria are ubiquitous in both hospital settings and research reagents, and must be accounted for to make effective use of these data. In the present study, we sequenced the bacterial 16S rRNA gene obtained from blood and cerebrospinal fluid (CSF) of 80 neonates presenting with NS to the Mbarara Regional Hospital in Uganda. Assuming that patterns of background contamination would be independent of pathogenic microorganism DNA, we applied a novel quantitative approach using principal orthogonal decomposition to separate background contamination from potential pathogens in sequencing data. We designed our quantitative approach by contrasting blood, CSF, and control specimens, and employed a variety of statistical random matrix bootstrap hypotheses to estimate statistical significance. These analyses demonstrate that Leptospira appears to be present in some infants presenting within 48 hr of birth, indicative of infection in utero, and in infants up to 28 days of age, suggesting environmental exposure. This organism cannot be cultured in routine bacteriological settings, and is enzootic in the cattle with which the rural peoples of western Uganda often live in close proximity. Our findings demonstrate that statistical approaches to remove background organisms common in 16S sequence data can reveal putative pathogens in small-volume biological samples from newborns. This computational analysis thus reveals an important medical finding that has the potential to alter therapy and prevention efforts in a critically ill population.
Submitted 1 February, 2016;
originally announced February 2016.
-
Maximum Likelihood de novo reconstruction of viral populations using paired end sequencing data
Authors:
Raunaq Malhotra,
Manjari Mukhopadhyay,
Steven Wu,
Allen Rodrigo,
Mary Poss,
Raj Acharya
Abstract:
We present MLEHaplo, a maximum likelihood de novo assembly algorithm for reconstructing viral haplotypes in a virus population from paired-end next generation sequencing (NGS) data. Using the pairing information of reads in our proposed Viral Path Reconstruction Algorithm (ViPRA), we generate a small subset of paths from a De Bruijn graph of reads that serve as candidate paths for true viral haplotypes. Our proposed method MLEHaplo then generates a maximum likelihood estimate of the viral population using the paths reconstructed by ViPRA. We evaluate and compare MLEHaplo on simulated datasets of 1200 base pairs at different sequencing coverage, on HCV strains with sequencing errors, and on a lab mixture of five HIV-1 strains. MLEHaplo reconstructs full-length viral haplotypes having 100% sequence identity to the true viral haplotypes in most of the small-genome simulated viral populations at 250x sequencing coverage. While reference-based methods either under-estimate or over-estimate the viral haplotypes, MLEHaplo limits the over-estimation to 3 times the size of true viral haplotypes, reconstructs the full phylogeny in the HCV dataset to greater than 99% sequence identity, and captures more sequence variation for the HIV-1 strains dataset compared to their known consensus sequences.
Submitted 16 April, 2016; v1 submitted 14 February, 2015;
originally announced February 2015.
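To make the idea of candidate paths in a De Bruijn graph concrete, here is a minimal Python sketch that builds the graph from reads and enumerates the sequences spelled by paths between two nodes. It is an illustration under simplifying assumptions, not ViPRA or MLEHaplo: it ignores read pairing, does no likelihood estimation, and assumes an acyclic toy graph. The names `de_bruijn` and `enumerate_paths` and the two toy haplotypes are hypothetical.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def enumerate_paths(graph, start, end, max_len):
    """Depth-first enumeration of sequences spelled by paths from start to end."""
    paths = []
    stack = [(start, start)]
    while stack:
        node, seq = stack.pop()
        if node == end:
            paths.append(seq)
            continue
        if len(seq) >= max_len:
            continue  # give up on paths longer than the expected genome length
        for nxt in sorted(graph.get(node, ())):
            stack.append((nxt, seq + nxt[-1]))
    return sorted(paths)

# Two toy haplotypes (assumed data) that differ at one position.
reads = ["ACGTTGCA", "ACGATGCA"]
graph = de_bruijn(reads, 4)
candidates = enumerate_paths(graph, "ACG", "GCA", max_len=8)
```

On real data the number of such paths grows quickly, which is presumably why the abstract emphasizes using pairing information to keep only a small subset of them.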
-
Clustering pipeline for determining consensus sequences in targeted next-generation sequencing
Authors:
Raunaq Malhotra,
Daniel Elleder,
Le Bao,
David R Hunter,
Raj Acharya,
Mary Poss
Abstract:
Analyses of targeted genomic sequencing data from next-generation sequencing (NGS) technologies typically involve mapping reads to a reference sequence or clustering reads. For a number of species a reference genome is not available, so analyses of targeted sequencing data (for example, of polymorphic structural variation caused by mobile elements) are difficult; clustering methods are preferred for such data. Clustering of reads requires a clustering threshold parameter, which is used to compare and group reads. However, determining the optimal clustering threshold for a read dataset is challenging because it depends on the sequence composition, the number of sequences present, and the amount of sequencing error in the dataset. High values of the clustering threshold can falsely inflate the number of recovered genomic regions, while low values can merge reads from distinct regions into a single cluster. Thus, an algorithm that can empirically determine the clustering threshold is needed. We propose a pipeline for clustering genomic sequences wherein the clustering threshold is empirically determined from the NGS data. The optimal threshold is chosen based on two internal clustering measures, which assess clusters for small intra-cluster diameters and large inter-cluster distances. We evaluate the pipeline on two simulated datasets derived from the human genome sequence, simulating different genomic regions and sequencing depths. The total number of clusters obtained from our pipeline is closer to the actual number of reference sequences than that obtained from a single round of clustering. Also, the number of clusters whose consensus sequence matches a corresponding reference sequence is higher in our pipeline. We observe that the presence of repeat regions affects clustering accuracy.
Submitted 13 February, 2016; v1 submitted 6 October, 2014;
originally announced October 2014.
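The idea of choosing a clustering threshold from an internal measure can be sketched in Python as follows. This is an assumption-laden toy, not the authors' pipeline: it substitutes a single Dunn-style ratio (smallest inter-cluster distance over largest intra-cluster diameter) for the two unnamed measures in the abstract, uses greedy seed-based clustering on equal-length sequences with Hamming distance, and treats the threshold as a maximum distance, so its direction is reversed relative to a similarity threshold. All function names and the toy sequences are hypothetical.

```python
from itertools import combinations

def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def greedy_cluster(seqs, threshold):
    """Assign each sequence to the first cluster whose seed is within threshold."""
    clusters = []
    for s in seqs:
        for c in clusters:
            if hamming(s, c[0]) <= threshold:
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters

def dunn_like(clusters):
    """Smallest inter-cluster distance divided by largest intra-cluster diameter."""
    if len(clusters) < 2:
        return 0.0
    diameter = max(
        (hamming(a, b) for c in clusters for a, b in combinations(c, 2)),
        default=0,
    )
    separation = min(
        hamming(a, b)
        for c1, c2 in combinations(clusters, 2)
        for a in c1 for b in c2
    )
    # Floor the diameter at 1 so all-singleton clusterings do not divide by zero.
    return separation / max(diameter, 1)

def pick_threshold(seqs, candidates):
    """Return the candidate threshold whose clustering scores highest."""
    return max(candidates, key=lambda t: dunn_like(greedy_cluster(seqs, t)))

# Toy sequences (assumed data): two groups with small within-group variation.
seqs = ["AAAAAAAA", "AAAAAAAT", "AAAAAATT", "CCCCCCCC", "CCCCCCCG", "CCCCCCGG"]
best = pick_threshold(seqs, [0, 1, 2, 8])
```

With these toy sequences the score peaks at the threshold that recovers the two underlying groups; thresholds that are too strict split the groups and thresholds that are too loose merge them, mirroring the trade-off described in the abstract.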