-
Deep learning models for predicting RNA degradation via dual crowdsourcing
Authors:
Hannah K. Wayment-Steele,
Wipapat Kladwang,
Andrew M. Watkins,
Do Soon Kim,
Bojan Tunguz,
Walter Reade,
Maggie Demkin,
Jonathan Romano,
Roger Wellington-Oguri,
John J. Nicol,
Jiayang Gao,
Kazuki Onodera,
Kazuki Fujikawa,
Hanfei Mao,
Gilles Vandewiele,
Michele Tinti,
Bram Steenwinckel,
Takuya Ito,
Taiga Noumi,
Shujun He,
Keiichiro Ishi,
Youhan Lee,
Fatih Öztürk,
Anthony Chiu,
Emin Öztürk, et al. (4 additional authors not shown)
Abstract:
Messenger RNA-based medicines hold immense potential, as evidenced by their rapid deployment as COVID-19 vaccines. However, worldwide distribution of mRNA molecules has been limited by their thermostability, which is fundamentally limited by the intrinsic instability of RNA molecules to a chemical degradation reaction called in-line hydrolysis. Predicting the degradation of an RNA molecule is a key task in designing more stable RNA-based therapeutics. Here, we describe a crowdsourced machine learning competition ("Stanford OpenVaccine") on Kaggle, involving single-nucleotide resolution measurements on 6043 102-130-nucleotide diverse RNA constructs that were themselves solicited through crowdsourcing on the RNA design platform Eterna. The entire experiment was completed in less than 6 months, and 41% of nucleotide-level predictions from the winning model were within experimental error of the ground truth measurement. Furthermore, these models generalized to blindly predicting orthogonal degradation data on much longer mRNA molecules (504-1588 nucleotides) with improved accuracy compared to previously published models. Top teams integrated natural language processing architectures and data augmentation techniques with predictions from previous dynamic programming models for RNA secondary structure. These results indicate that such models are capable of representing in-line hydrolysis with excellent accuracy, supporting their use for designing stabilized messenger RNAs. The integration of two crowdsourcing platforms, one for data set creation and another for machine learning, may be fruitful for other urgent problems that demand scientific discovery on rapid timescales.
Submitted 22 April, 2022; v1 submitted 14 October, 2021;
originally announced October 2021.
-
Correcting a SHAPE-directed RNA structure by a mutate-map-rescue approach
Authors:
Siqi Tian,
Pablo Cordero,
Wipapat Kladwang,
Rhiju Das
Abstract:
The three-dimensional conformations of non-coding RNAs underpin their biochemical functions but have largely eluded experimental characterization. Here, we report that integrating a classic mutation/rescue strategy with high-throughput chemical mapping enables rapid RNA structure inference with unusually strong validation. We revisit a paradigmatic 16S rRNA domain for which SHAPE (selective 2'-hydroxyl acylation with primer extension) suggested a conformational change between apo- and holo-ribosome conformations. Computational support estimates, data from alternative chemical probes, and mutate-and-map (M2) experiments expose limitations of prior methodology and instead give a near-crystallographic secondary structure. Systematic interrogation of single base pairs via a high-throughput mutation/rescue approach then permits incisive validation and refinement of the M2-based secondary structure and further uncovers the functional conformation as an excited state (25 ± 5% population) accessible via a single-nucleotide register shift. These results correct an erroneous SHAPE inference of a ribosomal conformational change and suggest a general mutate-map-rescue approach for dissecting RNA dynamic structure landscapes.
Submitted 21 January, 2014;
originally announced January 2014.
-
Massively Parallel RNA Chemical Mapping with a Reduced Bias MAP-seq Protocol
Authors:
Matthew G. Seetin,
Wipapat Kladwang,
J. P. Bida,
Rhiju Das
Abstract:
Chemical mapping methods probe RNA structure by revealing and leveraging correlations of a nucleotide's structural accessibility or flexibility with its reactivity to various chemical probes. Pioneering work by Lucks and colleagues has expanded this method to probe hundreds of molecules at once on an Illumina sequencing platform, obviating the use of slab gels or capillary electrophoresis on one molecule at a time. Here, we describe optimizations to this method from our lab, resulting in the MAP-seq protocol (Multiplexed Accessibility Probing read out through sequencing), version 1.0. The protocol permits the quantitative probing of thousands of RNAs at once, by several chemical modification reagents, on the time scale of a day using a table-top Illumina machine. This method and a software package MAPseeker (http://simtk.org/home/map_seeker) address several potential sources of bias, by eliminating PCR steps, improving ligation efficiencies of ssDNA adapters, and avoiding problematic heuristics in prior algorithms. We hope that the step-by-step description of MAP-seq 1.0 will help other RNA mapping laboratories to transition from electrophoretic to next-generation sequencing methods and to further reduce the turnaround time and any remaining biases of the protocol.
Submitted 3 April, 2013;
originally announced April 2013.
-
A mutate-and-map protocol for inferring base pairs in structured RNA
Authors:
Pablo Cordero,
Wipapat Kladwang,
Christopher C. VanLang,
Rhiju Das
Abstract:
Chemical mapping is a widespread technique for structural analysis of nucleic acids in which a molecule's reactivity to different probes is quantified at single-nucleotide resolution and used to constrain structural modeling. This experimental framework has been extensively revisited in the past decade with new strategies for high-throughput read-outs, chemical modification, and rapid data analysis. Recently, we have coupled the technique to high-throughput mutagenesis. Point mutations of a base-paired nucleotide can lead to exposure of not only that nucleotide but also its interaction partner. Carrying out the mutation and mapping for the entire system gives an experimental approximation of the molecule's contact map. Here, we give our in-house protocol for this mutate-and-map strategy, based on 96-well capillary electrophoresis, and we provide practical tips on interpreting the data to infer nucleic acid structure.
Submitted 31 January, 2013;
originally announced January 2013.
-
Quantitative DMS mapping for automated RNA secondary structure inference
Authors:
Pablo Cordero,
Wipapat Kladwang,
Christopher C. VanLang,
Rhiju Das
Abstract:
For decades, dimethyl sulfate (DMS) mapping has informed manual modeling of RNA structure in vitro and in vivo. Here, we incorporate DMS data into automated secondary structure inference using a pseudo-energy framework developed for 2'-OH acylation (SHAPE) mapping. On six non-coding RNAs with crystallographic models, DMS-guided modeling achieves overall false negative and false discovery rates of 9.5% and 11.6%, comparable to or better than SHAPE-guided modeling; and non-parametric bootstrapping provides straightforward confidence estimates. Integrating DMS/SHAPE data and including CMCT reactivities give small additional improvements. These results establish DMS mapping - an already routine technique - as a quantitative tool for unbiased RNA structure modeling.
Submitted 5 July, 2012;
originally announced July 2012.
-
Ultraviolet Shadowing of RNA Causes Substantial Non-Poissonian Chemical Damage in Seconds
Authors:
Wipapat Kladwang,
Justine Hum,
Rhiju Das
Abstract:
Chemical purity of RNA samples is critical for high-precision studies of RNA folding and catalytic behavior, but such purity may be compromised by photodamage accrued during ultraviolet (UV) visualization of gel-purified samples. Here, we quantitatively assess the breadth and extent of such damage by using reverse transcription followed by single-nucleotide-resolution capillary electrophoresis. We detected UV-induced lesions across a dozen natural and artificial RNAs including riboswitch domains, other non-coding RNAs, and artificial sequences; across multiple sequence contexts, dominantly at but not limited to pyrimidine doublets; and from multiple lamps that are recommended for UV shadowing in the literature. Most strikingly, irradiation time-courses reveal detectable damage within a few seconds of exposure, and these data can be quantitatively fit to a 'skin effect' model that accounts for the increased exposure of molecules near the top of irradiated gel slices. The results indicate that 200-nucleotide RNAs subjected to 20 seconds or less of UV shadowing can incur damage to 20% of molecules, and the molecule-by-molecule distribution of these lesions is more heterogeneous than a Poisson distribution. Photodamage from UV shadowing is thus likely a widespread but unappreciated cause of artifactual heterogeneity in quantitative and single-molecule-resolution RNA biophysical measurements.
Submitted 21 February, 2012;
originally announced February 2012.
-
Automated RNA structure prediction uncovers a missing link in double glycine riboswitches
Authors:
Wipapat Kladwang,
Fang-Chieh Chou,
Rhiju Das
Abstract:
The tertiary structures of functional RNA molecules remain difficult to decipher. A new generation of automated RNA structure prediction methods may help address these challenges but have not yet been experimentally validated. Here we apply four prediction tools to a remarkable class of double glycine riboswitches that exhibit ligand-binding cooperativity. A novel method (BPPalign), RMdetect, JAR3D, and Rosetta 3D modeling give consistent predictions for a new stem P0 and kink-turn motif. These elements structure the linker between the RNAs' double aptamers. Chemical mapping on the F. nucleatum riboswitch with SHAPE, DMS, and CMCT probing, mutate-and-map studies, and mutation/rescue experiments all provide strong evidence for the structured linker. Under solution conditions that separate two glycine binding transitions, disrupting this helix-junction-helix structure gives 120-fold and 6- to 30-fold poorer association constants for the two transitions, corresponding to an overall energetic impact of 4.3 ± 0.5 kcal/mol. Prior biochemical and crystallography studies from several labs did not include this critical element due to over-truncation of the RNA. We argue that several further undiscovered elements are likely to exist in the flanking regions of this and other RNA switches, and automated prediction tools can now play a powerful role in their detection and dissection.
Submitted 4 October, 2011;
originally announced October 2011.
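As a rough consistency check on the energetics quoted in the abstract above, fold-changes in association constants can be converted to free energies with ΔΔG = RT ln(fold change). The sketch below is not taken from the paper: it assumes a temperature of 25 °C (the abstract does not state one), assumes the overall impact is the sum of the penalties for the two binding transitions, and uses a hypothetical helper name.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)

def ddg_from_fold_change(fold_change, temperature_k=298.15):
    """Free-energy penalty (kcal/mol) for a given fold loss in association constant.

    temperature_k defaults to 25 degrees C, which is an assumption.
    """
    return R_KCAL * temperature_k * math.log(fold_change)

# Fold-changes quoted in the abstract: 120-fold, and 6- to 30-fold.
first = ddg_from_fold_change(120)
second_low = ddg_from_fold_change(6)
second_high = ddg_from_fold_change(30)

# Assumed: the overall impact is the sum over both transitions.
total_low = first + second_low
total_high = first + second_high
print(f"{total_low:.1f} to {total_high:.1f} kcal/mol")
```

Under these assumptions the range comes out near 3.9 to 4.9 kcal/mol, which is roughly in line with the quoted 4.3 ± 0.5 kcal/mol.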
-
Can biopolymer structures be sampled enumeratively? Atomic-accuracy RNA loop modeling by a stepwise ansatz
Authors:
Parin Sripakdeevong,
Wipapat Kladwang,
Rhiju Das
Abstract:
Atomic-accuracy structure prediction of macromolecules is a long-sought goal of computational biophysics. Accurate modeling should be achievable by optimizing a physically realistic energy function but is presently precluded by incomplete sampling of a biopolymer's many degrees of freedom. We present herein a working hypothesis, called the "stepwise ansatz", for recursively constructing well-packed atomic-detail models in small steps, enumerating several million conformations for each monomer and covering all build-up paths. By implementing the strategy in Rosetta and making use of high-performance computing, we provide first tests of this hypothesis on a benchmark of fifteen RNA loop modeling problems drawn from riboswitches, ribozymes, and the ribosome, including ten cases that were not solvable by prior knowledge-based modeling approaches. For each loop problem, this deterministic stepwise assembly (SWA) method either reaches atomic accuracy or exposes flaws in Rosetta's all-atom energy function, indicating the resolution of the conformational sampling bottleneck. To our knowledge, SWA is the first enumerative, ab initio build-up method to systematically outperform existing Monte Carlo and knowledge-based methods for 3D structure prediction. As a rigorous experimental test, we have applied SWA to a small RNA motif of previously unknown structure, the C7.2 tetraloop/tetraloop-receptor, and stringently tested this blind prediction with nucleotide-resolution structure mapping data.
Submitted 27 April, 2011;
originally announced April 2011.
-
HiTRACE: High-throughput robust analysis for capillary electrophoresis
Authors:
Sungroh Yoon,
Jinkyu Kim,
Justine Hum,
Hanjoo Kim,
Seunghyun Park,
Wipapat Kladwang,
Rhiju Das
Abstract:
Motivation: Capillary electrophoresis (CE) of nucleic acids is a workhorse technology underlying high-throughput genome analysis and large-scale chemical mapping for nucleic acid structural inference. Despite the wide availability of CE-based instruments, there remain challenges in leveraging their full power for quantitative analysis of RNA and DNA structure, thermodynamics, and kinetics. In particular, the slow rate and poor automation of available analysis tools have bottlenecked a new generation of studies involving hundreds of CE profiles per experiment.
Results: We propose a computational method called high-throughput robust analysis for capillary electrophoresis (HiTRACE) to automate the key tasks in large-scale nucleic acid CE analysis, including the profile alignment that has heretofore been a rate-limiting step in the highest throughput experiments. We illustrate the application of HiTRACE on thirteen data sets representing four different RNAs, three chemical modification strategies, and up to 480 single mutant variants; the largest data sets each include 87,360 bands. By applying a series of robust dynamic programming algorithms, HiTRACE outperforms prior tools in terms of alignment and fitting quality, as assessed by measures including the correlation of quantified band intensities between replicate data sets. Furthermore, while the smallest of these data sets required 7 to 10 hours of manual intervention using prior approaches, HiTRACE quantitation of even the largest data sets herein was achieved in 3 to 12 minutes. The HiTRACE method therefore resolves a critical barrier to the efficient and accurate analysis of nucleic acid structure in experiments involving tens of thousands of electrophoretic bands.
Submitted 12 May, 2011; v1 submitted 21 April, 2011;
originally announced April 2011.
-
Two-dimensional chemical mapping for non-coding RNAs
Authors:
Wipapat Kladwang,
Christopher C. VanLang,
Pablo Cordero,
Rhiju Das
Abstract:
Non-coding RNA molecules fold into precise base pairing patterns to carry out critical roles in genetic regulation and protein synthesis. We show here that coupling systematic mutagenesis with high-throughput SHAPE chemical mapping enables accurate base pair inference of domains from ribosomal RNA, ribozymes, and riboswitches. For a six-RNA benchmark that challenged prior chemical/computational methods, this mutate-and-map strategy gives secondary structures in agreement with crystallographic data (2% error rates), including a blind test on a double-glycine riboswitch. Through modeling of partially ordered RNA states, the method enables the first test of an 'interdomain helix-swap' hypothesis for ligand-binding cooperativity in a glycine riboswitch. Finally, the mutate-and-map data report on tertiary contacts within non-coding RNAs; coupled with the Rosetta/FARFAR algorithm, these data give nucleotide-resolution three-dimensional models (5.7 Å helix RMSD) of an adenine riboswitch. These results highlight the promise of a two-dimensional chemical strategy for inferring the secondary and tertiary structures that underlie non-coding RNA behavior.
Submitted 5 April, 2011;
originally announced April 2011.
-
Understanding the errors of SHAPE-directed RNA structure modeling
Authors:
Wipapat Kladwang,
Christopher C. VanLang,
Pablo Cordero,
Rhiju Das
Abstract:
Single-nucleotide-resolution chemical mapping for structured RNA is being rapidly advanced by new chemistries, faster readouts, and coupling to computational algorithms. Recent tests have shown that selective 2'-hydroxyl acylation by primer extension (SHAPE) can give near-zero error rates (0-2%) in modeling the helices of RNA secondary structure. Here, we benchmark the method using six molecules for which crystallographic data are available: tRNA(phe) and 5S rRNA from Escherichia coli, the P4-P6 domain of the Tetrahymena group I ribozyme, and ligand-bound domains from riboswitches for adenine, cyclic di-GMP, and glycine. SHAPE-directed modeling of these highly structured RNAs gave an overall false negative rate (FNR) of 17% and a false discovery rate (FDR) of 21%, with at least one helix prediction error in five of the six cases. Extensive variations of data processing, normalization, and modeling parameters did not significantly mitigate modeling errors. Only one variation, filtering out data collected with deoxyinosine triphosphate during primer extension, gave a modest improvement (FNR = 12% and FDR = 14%). The residual structure modeling errors are explained by the insufficient information content of these RNAs' SHAPE data, as evaluated by a nonparametric bootstrapping analysis. Beyond these benchmark cases, bootstrapping suggests a low level of confidence (<50%) in the majority of helices in a previously proposed SHAPE-directed model for the HIV-1 RNA genome. Thus, SHAPE-directed RNA modeling is not always unambiguous, and helix-by-helix confidence estimates, as described herein, may be critical for interpreting results from this powerful methodology.
Submitted 7 September, 2011; v1 submitted 28 March, 2011;
originally announced March 2011.
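The false negative rate (FNR) and false discovery rate (FDR) quoted in the abstract above can be illustrated with a minimal sketch. It assumes the standard definitions (FNR is the fraction of reference helices missing from the model; FDR is the fraction of modeled helices absent from the reference) and, as a simplification, matches helices by exact label; the paper's actual helix-matching criteria are not given in the abstract. The function name and toy helix labels are hypothetical.

```python
def helix_error_rates(reference, predicted):
    """Return (FNR, FDR) for two collections of helix identifiers.

    FNR = fraction of reference helices missing from the prediction.
    FDR = fraction of predicted helices absent from the reference.
    Helices are compared by exact identity (a simplifying assumption).
    """
    reference, predicted = set(reference), set(predicted)
    true_positives = len(reference & predicted)
    fnr = 1 - true_positives / len(reference) if reference else 0.0
    fdr = 1 - true_positives / len(predicted) if predicted else 0.0
    return fnr, fdr

# Toy example with made-up helix labels (not data from the paper).
ref = {"P1", "P2", "P3", "P4"}
pred = {"P1", "P2", "P3", "P5", "P6"}
fnr, fdr = helix_error_rates(ref, pred)
print(f"FNR = {fnr:.0%}, FDR = {fdr:.0%}")
```

In this toy case one of four reference helices is missed and two of five predicted helices are spurious, so the two rates differ even though they share the same count of correct helices.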