-
ntLink: a toolkit for de novo genome assembly scaffolding and mapping using long reads
Authors:
Lauren Coombe,
René L. Warren,
Johnathan Wong,
Vladimir Nikolic,
Inanc Birol
Abstract:
With the increasing affordability and accessibility of genome sequencing data, de novo genome assembly is an important first step to a wide variety of downstream studies and analyses. Therefore, bioinformatics tools that enable the generation of high-quality genome assemblies in a computationally efficient manner are essential. Recent developments in long-read sequencing technologies have greatly benefited genome assembly work, including scaffolding, by providing long-range evidence that can aid in resolving the challenging repetitive regions of complex genomes. ntLink is a flexible and resource-efficient genome scaffolding tool that utilizes long-read sequencing data to improve upon draft genome assemblies built from any sequencing technology, including the same long reads. Instead of using read alignments to identify candidate joins, ntLink utilizes minimizer-based mappings to infer how input sequences should be ordered and oriented into scaffolds. Recent improvements to ntLink have added important features such as overlap detection, gap-filling and in-code scaffolding iterations. Here, we present three basic protocols demonstrating how to use each of these new features to yield highly contiguous genome assemblies, while still maintaining ntLink's proven computational efficiency. Further, as we illustrate in the alternate protocols, the lightweight minimizer-based mappings that enable ntLink scaffolding can also be utilized for other downstream applications, such as misassembly detection. With its modularity and multiple modes of execution, ntLink has broad benefit to the genomics community, from genome scaffolding and beyond. ntLink is an open-source project and is freely available from https://github.com/bcgsc/ntLink.
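The abstract above contrasts minimizer-based mappings with read alignments. As a rough illustration of the general idea only, and not ntLink's actual implementation, the sketch below picks the lexicographically smallest k-mer in each window of `w` consecutive k-mers and reports which minimizers two sequences share. The function names `minimizers` and `shared_minimizers`, the lexicographic ordering, and the toy sequences are all assumptions made for this example; production tools typically order k-mers by hash value and account for both strands.

```python
def minimizers(seq, k, w):
    """Return a sorted list of (position, k-mer) window minimizers.

    Illustrative assumption: k-mers are ranked lexicographically and only
    the forward strand is considered.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    # Slide a window over w consecutive k-mers and keep the smallest one;
    # ties are broken by the leftmost position.
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        offset = min(range(w), key=lambda j: (window[j], j))
        selected.add((start + offset, window[offset]))
    return sorted(selected)


def shared_minimizers(a, b, k, w):
    """Return the set of minimizer k-mers found in both sequences."""
    in_a = {kmer for _, kmer in minimizers(a, k, w)}
    in_b = {kmer for _, kmer in minimizers(b, k, w)}
    return in_a & in_b


if __name__ == "__main__":
    # Hypothetical toy inputs: a "contig" and a "read" that overlap.
    print(minimizers("ACGTACGT", 3, 2))
    print(shared_minimizers("ACGTACGT", "TTACGTT", 3, 2))
```

A long read that shares minimizers with two different draft sequences is evidence that those sequences lie near each other in the genome, which is the kind of signal a scaffolder can use without computing full alignments.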
Submitted 20 January, 2023;
originally announced January 2023.
-
HLA predictions from long sequence read alignments, streamed directly into HLAminer
Authors:
René L. Warren
Abstract:
The rapidly changing landscape of sequencing technologies brings new opportunities to genomics research. Longer sequence reads and higher sequence throughput, coupled with ever-improving base accuracy and decreasing per-base cost, are now making long reads suitable for analyzing polymorphic regions of the human genome, such as those of the human leucocyte antigen (HLA) gene complex. Here I present a simple protocol for predicting HLA signatures from whole genome shotgun (WGS) long sequencing reads, by directly streaming sequence alignments into HLAminer. The method is as simple as running minimap2; it scales with the number of sequences to align, and it can be used with any read aligner capable of SAM format output, without the need to store bulky alignment files to disk. I show how the predictions are robust even with older and less [base] accurate WGS nanopore datasets and relatively low (10X) sequence coverage, and present a step-by-step protocol to predict HLA class I and II genes from the long sequencing reads of modern third-generation technologies.
Submitted 19 September, 2022;
originally announced September 2022.
-
PASS: De novo assembler for short peptide sequences
Authors:
René L. Warren
Abstract:
The ability to characterize proteins at sequence-level resolution is vital to biological research. Currently, the leading method for protein sequencing is liquid chromatography mass spectrometry (LC-MS), whereby proteins are reduced to their constituent peptides by enzymatic digest and subsequently analyzed on an LC-MS instrument. The short peptide sequences that result from this analysis are used to characterize the original protein content of the sample. Here we present PASS, a de novo assembler for short peptide sequences that can be used to reconstruct large portions of protein targets, a step that can facilitate downstream sample characterization efforts. We show how, with adequate peptide sequence coverage and little-to-no additional sequence processing, PASS reconstructs protein sequences into relatively large (100 amino acid or longer) contigs having high (93.1 - 99.1%) sequence identity to reference antibody light and heavy chain proteins. Availability: PASS is released under the GNU General Public License Version 3 (GPLv3) and is publicly available from https://github.com/warrenlr/PASS
Submitted 10 August, 2022;
originally announced August 2022.
-
GapPredict: A Language Model for Resolving Gaps in Draft Genome Assemblies
Authors:
Eric Chen,
Justin Chu,
Jessica Zhang,
Rene L. Warren,
Inanc Birol
Abstract:
Short-read DNA sequencing instruments can yield over 1e+12 bases per run, typically composed of reads 150 bases long. Despite this high throughput, de novo assembly algorithms have difficulty reconstructing contiguous genome sequences using short reads due to both repetitive and difficult-to-sequence regions in these genomes. Some of the short-read assembly challenges are mitigated by scaffolding assembled sequences using paired-end reads. However, unresolved sequences in these scaffolds appear as "gaps". Here, we introduce GapPredict, a tool that uses a character-level language model to predict unresolved nucleotides in scaffold gaps. We benchmarked GapPredict against the state-of-the-art gap-filling tool Sealer, and observed that GapPredict can fill 65.6% of the sampled gaps that were left unfilled by Sealer, demonstrating the practical utility of deep learning approaches to the gap-filling problem in genome sequence assembly.
Submitted 24 May, 2021; v1 submitted 21 May, 2021;
originally announced May 2021.
-
Interactive SARS-CoV-2 mutation timemaps
Authors:
Rene L. Warren,
Inanc Birol
Abstract:
As the year 2020 draws to an end, several new strains have been reported for the SARS-CoV-2 coronavirus, the agent responsible for the COVID-19 pandemic that has afflicted us all this past year. However, it is difficult to comprehend the scale, in sequence space, geographical location and time, at which SARS-CoV-2 mutates and evolves in its human hosts. To get an appreciation for the rapid evolution of the coronavirus, we built interactive scalable vector graphics maps that show daily nucleotide variations in genomes from the six most populated continents compared to that of the initial, ground-zero SARS-CoV-2 isolate sequenced at the beginning of the year. Availability: Mutation time maps are available from https://bcgsc.github.io/SARS2/
Submitted 31 December, 2020;
originally announced December 2020.
-
HLA predictions from the bronchoalveolar lavage fluid samples of five patients at the early stage of the Wuhan seafood market COVID-19 outbreak
Authors:
Rene L Warren,
Inanc Birol
Abstract:
We are in the midst of a global viral pandemic, one with no cure and a high mortality rate. The Human Leukocyte Antigen (HLA) gene complex plays a critical role in host immunity. We predicted HLA class I and II alleles from the transcriptome sequencing data prepared from the bronchoalveolar lavage fluid samples of five patients at the early stage of the COVID-19 outbreak. We identified the HLA-I allele A*24:02 in four out of five patients, which is higher than the expected frequency (17.2%) in the South Han Chinese population. The difference is statistically significant with a p-value less than $10^{-4}$. Our analysis results may help provide future insights on disease susceptibility.
Submitted 27 April, 2020; v1 submitted 15 April, 2020;
originally announced April 2020.
-
Measuring the viscous and elastic properties of single cells using video particle tracking microrheology
Authors:
Rebecca Louisa Warren,
Manlio Tassieri,
Xiang Li,
Andrew Glidle,
Allan Carlsson,
Jonathan M. Cooper
Abstract:
We present a simple and non-invasive experimental procedure to measure the linear viscoelastic properties of cells by passive video particle tracking microrheology. In order to do this, a generalised Langevin equation is adopted to relate the time-dependent thermal fluctuations of a bead, chemically bound to the cell's exterior, to the frequency-dependent viscoelastic moduli of the cell. It is shown that these moduli are related to the cell's cytoskeletal structure, which in this work is changed by varying the solution osmolarity from iso- to hypo-osmotic conditions. At high frequencies, the viscoelastic moduli frequency dependence changes from $\propto \omega^{3/4}$ found in iso-osmotic solutions to $\propto \omega^{1/2}$ in hypo-osmotic solutions; the first situation is typical of bending modes in isotropic in vitro reconstituted F-actin networks, and the second could indicate that the restructured cytoskeleton behaves as a gel with "dangling branches". The insights gained from this form of rheological analysis could prove to be a valuable addition to studies that address cellular physiology and pathology.
Submitted 11 November, 2011;
originally announced November 2011.