Evaluation of tools for differential gene expression analysis by RNA-seq on a 48 biological replicate experiment
Authors:
Nicholas J. Schurch,
Pieta Schofield,
Marek Gierliński,
Christian Cole,
Alexander Sherstnev,
Vijender Singh,
Nicola Wrobel,
Karim Gharbi,
Gordon G. Simpson,
Tom Owen-Hughes,
Mark Blaxter,
Geoffrey J. Barton
Abstract:
An RNA-seq experiment with 48 biological replicates in each of 2 conditions was performed to determine the number of biological replicates ($n_r$) required, and to identify the most effective statistical analysis tools for identifying differential gene expression (DGE). When $n_r=3$, seven of the nine tools evaluated give true positive rates (TPR) of only 20 to 40 percent. For high fold-change genes ($|\log_{2}(FC)|\gt2$) the TPR is $\gt85$ percent. Two tools performed poorly, over- or under-predicting the number of differentially expressed (DE) genes. Increasing replication gives a large increase in TPR when considering all DE genes but only a small increase for high fold-change genes. Achieving a TPR $\gt85$ percent across all fold-changes requires $n_r\gt20$. For future RNA-seq experiments these results suggest $n_r\gt6$, rising to $n_r\gt12$ when identifying DGE irrespective of fold-change is important. For $6 \lt n_r \lt 12$, superior TPR makes edgeR the leading tool tested. For $n_r \ge12$, minimizing false positives is more important and DESeq outperforms the other tools.
Submitted 8 June, 2015; v1 submitted 8 May, 2015;
originally announced May 2015.
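As a reading aid for the TPR figures quoted in the abstract above, here is a minimal Python sketch of how a true positive rate can be computed from a set of called genes and a reference set of truly DE genes. The abstract does not say how the reference set was built, so the function name, the gene identifiers, and the existence of a fixed reference set are all assumptions made for illustration.

```python
def true_positive_rate(called_genes, reference_de_genes):
    """Fraction of reference DE genes that a tool also called DE.

    Both arguments are iterables of gene identifiers. The reference set
    stands in for "truth" and is an assumption of this sketch.
    """
    called = set(called_genes)
    reference = set(reference_de_genes)
    if not reference:
        raise ValueError("reference set of DE genes must not be empty")
    true_positives = len(called & reference)
    return true_positives / len(reference)


# Hypothetical gene identifiers, for illustration only.
reference = {"g1", "g2", "g3", "g4"}
called = {"g1", "g2", "g5"}  # g5 is a false positive and does not affect TPR
tpr = true_positive_rate(called, reference)  # 2 of 4 reference genes recovered
```

False positives such as `g5` do not enter the TPR at all, which is why the abstract weighs false positives separately when comparing tools at higher replicate numbers.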
Statistical models for RNA-seq data derived from a two-condition 48-replicate experiment
Authors:
Marek Gierliński,
Christian Cole,
Pietà Schofield,
Nicholas J. Schurch,
Alexander Sherstnev,
Vijender Singh,
Nicola Wrobel,
Karim Gharbi,
Gordon Simpson,
Tom Owen-Hughes,
Mark Blaxter,
Geoffrey J. Barton
Abstract:
High-throughput RNA sequencing (RNA-seq) is now the standard method to determine differential gene expression. Identifying differentially expressed genes crucially depends on estimates of read count variability. These estimates are typically based on statistical models such as the negative binomial distribution, which is employed by the tools edgeR, DESeq and cuffdiff. Until now, the validity of these models has usually been tested on either low-replicate RNA-seq data or simulations. A 48-replicate RNA-seq experiment in yeast was performed and the data were tested against theoretical models. The observed gene read counts were consistent with both log-normal and negative binomial distributions, while the mean-variance relation followed the line of constant dispersion parameter of ~0.01. The high-replicate data also allowed for strict quality control and screening of bad replicates, which can drastically affect the gene read-count distribution. The RNA-seq data have been submitted to the ENA archive with project ID PRJEB5348.
Submitted 4 May, 2015;
originally announced May 2015.
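To make the "line of constant dispersion" in the abstract above concrete, the following Python sketch evaluates the usual negative binomial mean-variance relation, variance = mean + dispersion × mean². This parameterisation is assumed here (it is the common one for count models of this kind); the abstract itself only quotes the dispersion value of ~0.01, and the function names are hypothetical.

```python
def nb_variance(mean, dispersion=0.01):
    """Variance of a negative binomial count with the given mean.

    Assumes the parameterisation var = mean + dispersion * mean**2.
    The first term is the Poisson (shot noise) part, the second is the
    extra variability between replicates.
    """
    if mean < 0 or dispersion < 0:
        raise ValueError("mean and dispersion must be non-negative")
    return mean + dispersion * mean ** 2


def crossover_mean(dispersion=0.01):
    """Mean count at which the two variance terms are equal (mean = 1 / dispersion)."""
    if dispersion <= 0:
        raise ValueError("dispersion must be positive")
    return 1.0 / dispersion


# Evaluate the relation over a few illustrative mean counts.
for mean in (10, 100, 1000, 10000):
    variance = nb_variance(mean)
    print(f"mean={mean:>6}  variance={variance:>12.1f}  variance/mean={variance / mean:.2f}")
```

Under this assumption, with a dispersion of 0.01 the Poisson term dominates for mean counts well below about 100, and the quadratic term dominates above that, so highly expressed genes show variance far in excess of their mean.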