Search | arXiv e-print repository

arXiv:2010.04665 [pdf, other]

Scaling Systematic Literature Reviews with Machine Learning Pipelines

Authors: Seraphina Goldfarb-Tarrant, Alexander Robertson, Jasmina Lazic, Theodora Tsouloufi, Louise Donnison, Karen Smyth

Abstract: Systematic reviews, which entail the extraction of data from large numbers of scientific documents, are an ideal avenue for the application of machine learning. They are vital to many fields of science and philanthropy, but are very time-consuming and require experts. Yet the three main stages of a systematic review are easily done automatically: searching for documents can be done via APIs and scrapers, selection of relevant documents can be done via binary classification, and extraction of data can be done via sequence-labelling classification. Despite the promise of automation for this field, little research exists that examines the various ways to automate each of these tasks. We construct a pipeline that automates each of these aspects, and experiment with many human-time vs. system quality trade-offs. We test the ability of classifiers to work well on small amounts of data and to generalise to data from countries not represented in the training data. We test different types of data extraction with varying difficulty in annotation, and five different neural architectures to do the extraction. We find that we can get surprising accuracy and generalisability of the whole pipeline system with only 2 weeks of human-expert annotation, which is only 15% of the time it takes to do the whole review manually and can be repeated and extended to new data with no additional effort.

Submitted 9 October, 2020; originally announced October 2020.

Comments: In EMNLP 2020 Scholarly Document Processing Workshop

arXiv:1603.06687 [pdf, other]

statmod: Probability Calculations for the Inverse Gaussian Distribution

Authors: Göknur Giner, Gordon K. Smyth

Abstract: The inverse Gaussian distribution (IGD) is a well known and often used probability distribution for which fully reliable numerical algorithms have not been available. Our aim in this article is to develop software for this distribution for the R programming environment. We develop fast, reliable basic probability functions (dinvgauss, pinvgauss, qinvgauss and rinvgauss) that work for all possible parameter values and which achieve close to full machine accuracy. The most challenging task is to compute quantiles for given cumulative probabilities and we develop a simple but elegant mathematical solution to this problem. We show that Newton's method for finding the quantiles of an IGD always converges monotonically when started from the mode of the distribution. Simple Taylor series expansions are used to improve accuracy on the log-scale. The IGD probability functions provide the same options and obey the same conventions as do probability functions provided in the standard R stats package. The IGD functions are part of the statmod package available from the CRAN repository.

Submitted 27 July, 2016; v1 submitted 22 March, 2016; originally announced March 2016.

Comments: 18 pages, 2 figures. Accepted for publication in The R Journal, Volume 8 (2016)

MSC Class: 60-04
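
The quantile idea in the abstract above (Newton's method started from the mode) can be illustrated with a minimal Python sketch. This is not the statmod implementation: the function names `ig_pdf`, `ig_cdf`, `ig_mode` and `ig_quantile` are hypothetical, the parameterisation by mean and shape is an assumption, and the naive CDF formula used here can overflow for large shape/mean ratios, so it does not reproduce the accuracy safeguards the article describes.

```python
import math


def _norm_cdf(z):
    # Standard normal CDF expressed through the complementary error function.
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def ig_pdf(x, mean, shape):
    # Inverse Gaussian density with the given mean and shape, for x > 0.
    return math.sqrt(shape / (2.0 * math.pi * x ** 3)) * math.exp(
        -shape * (x - mean) ** 2 / (2.0 * mean ** 2 * x)
    )


def ig_cdf(x, mean, shape):
    # Textbook closed form; exp(2 * shape / mean) may overflow for extreme inputs.
    r = math.sqrt(shape / x)
    return _norm_cdf(r * (x / mean - 1.0)) + math.exp(
        2.0 * shape / mean
    ) * _norm_cdf(-r * (x / mean + 1.0))


def ig_mode(mean, shape):
    # Location of the density maximum, used as the Newton starting point.
    k = 1.5 * mean / shape
    return mean * (math.sqrt(1.0 + k * k) - k)


def ig_quantile(p, mean=1.0, shape=1.0, tol=1e-12, max_iter=200):
    # Newton iteration x <- x + (p - F(x)) / f(x), started from the mode.
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    x = ig_mode(mean, shape)
    for _ in range(max_iter):
        step = (p - ig_cdf(x, mean, shape)) / ig_pdf(x, mean, shape)
        x += step
        if abs(step) <= tol * x:
            break
    return x
```

A call such as `ig_quantile(0.5, mean=2.0, shape=3.0)` returns an approximate median; feeding the result back into `ig_cdf` is a simple way to check it.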

arXiv:1603.05766 [pdf, other]

doi 10.2202/1544-6115.1585

Permutation p-values should never be zero: calculating exact p-values when permutations are randomly drawn

Authors: Belinda Phipson, Gordon K. Smyth

Abstract: Permutation tests are amongst the most commonly used statistical tools in modern genomic research, a process by which p-values are attached to a test statistic by randomly permuting the sample or gene labels. Yet permutation p-values published in the genomic literature are often computed incorrectly, understated by about 1/m, where m is the number of permutations. The same is often true in the more general situation when Monte Carlo simulation is used to assign p-values. Although the p-value understatement is usually small in absolute terms, the implications can be serious in a multiple testing context. The understatement arises from the intuitive but mistaken idea of using permutation to estimate the tail probability of the test statistic. We argue instead that permutation should be viewed as generating an exact discrete null distribution. The relevant literature, some of which is likely to have been relatively inaccessible to the genomic community, is reviewed and summarized. A computation strategy is developed for exact p-values when permutations are randomly drawn. The strategy is valid for any number of permutations and samples. Some simple recommendations are made for the implementation of permutation tests in practice.

Submitted 18 March, 2016; originally announced March 2016.

Comments: 12 pages, 2 figures

MSC Class: 62G09; 62G10

Journal ref: Stat. Appl. Genet. Molec. Biol., Volume 9 (2010), Issue 1, Article 39
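
The understatement described in the abstract above can be made concrete with a small Python sketch. It contrasts the naive estimate b/m with the widely used (b + 1)/(m + 1) form, where b is the number of random permutations whose statistic is at least as extreme as the observed one. The function name `permutation_p_values`, the difference-in-means statistic and the default of 999 permutations are illustrative assumptions; this is the standard correction, not the exact p-value computation strategy the article develops.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def permutation_p_values(x, y, num_permutations=999, seed=0):
    """Two-sample permutation test on the absolute difference in means.

    Returns (b, naive, corrected) where b counts random permutations with a
    statistic at least as extreme as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(_mean(x) - _mean(y))
    pooled = list(x) + list(y)
    n = len(x)
    b = 0
    for _ in range(num_permutations):
        rng.shuffle(pooled)  # randomly reassign group labels
        stat = abs(_mean(pooled[:n]) - _mean(pooled[n:]))
        if stat >= observed:
            b += 1
    naive = b / num_permutations  # can be exactly zero
    corrected = (b + 1) / (num_permutations + 1)  # never below 1 / (m + 1)
    return b, naive, corrected
```

With well separated groups the naive value can be exactly zero, while the corrected value is bounded below by 1/(m + 1), which matters once many such p-values are adjusted for multiple testing.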

arXiv:1602.08678 [pdf, other]

doi 10.1214/16-AOAS920

Robust hyperparameter estimation protects against hypervariable genes and improves power to detect differential expression

Authors: Belinda Phipson, Stanley Lee, Ian J. Majewski, Warren S. Alexander, Gordon K. Smyth

Abstract: One of the most common analysis tasks in genomic research is to identify genes that are differentially expressed (DE) between experimental conditions. Empirical Bayes (EB) statistical tests using moderated genewise variances have been very effective for this purpose, especially when the number of biological replicate samples is small. The EB procedures can however be heavily influenced by a small number of genes with very large or very small variances. This article improves the differential expression tests by robustifying the hyperparameter estimation procedure. The robust procedure has the effect of decreasing the informativeness of the prior distribution for outlier genes while increasing its informativeness for other genes. This effect has the double benefit of reducing the chance that hypervariable genes will be spuriously identified as DE while increasing statistical power for the main body of genes. The robust EB algorithm is fast and numerically stable. The procedure allows exact small-sample null distributions for the test statistics and reduces exactly to the original EB procedure when no outlier genes are present. Simulations show that the robustified tests have similar performance to the original tests in the absence of outlier genes but have greater power and robustness when outliers are present. The article includes case studies for which the robust method correctly identifies and downweights genes associated with hidden covariates and detects more genes likely to be scientifically relevant to the experimental conditions. The new procedure is implemented in the limma software package freely available from the Bioconductor repository.

Submitted 27 July, 2016; v1 submitted 28 February, 2016; originally announced February 2016.

Comments: 23 pages, 4 figures

MSC Class: 62F35 (primary); 62P10 (secondary)

Journal ref: Ann. Appl. Stat., Volume 10, Number 2 (2016), 946-963

arXiv:1406.4893 [pdf]

doi 10.1038/ncomms6125

Assessing Technical Performance in Differential Gene Expression Experiments with External Spike-in RNA Control Ratio Mixtures

Authors: Sarah A. Munro, Steve P. Lund, P. Scott Pine, Hans Binder, Djork-Arné Clevert, Ana Conesa, Joaquin Dopazo, Mario Fasold, Sepp Hochreiter, Huixiao Hong, Nederah Jafari, David P. Kreil, Paweł P. Łabaj, Sheng Li, Yang Liao, Simon Lin, Joseph Meehan, Christopher E. Mason, Javier Santoyo, Robert A. Setterquist, Leming Shi, Wei Shi, Gordon K. Smyth, Nancy Stralis-Pavese, Zhenqiang Su, et al. (8 additional authors not shown)

Abstract: There is a critical need for standard approaches to assess, report, and compare the technical performance of genome-scale differential gene expression experiments. We assess technical performance with a proposed "standard" dashboard of metrics derived from analysis of external spike-in RNA control ratio mixtures. These control ratio mixtures with defined abundance ratios enable assessment of diagnostic performance of differentially expressed transcript lists, limit of detection of ratio (LODR) estimates, and expression ratio variability and measurement bias. The performance metrics suite is applicable to analysis of a typical experiment, and here we also apply these metrics to evaluate technical performance among laboratories. An interlaboratory study using identical samples shared amongst 12 laboratories with three different measurement processes demonstrated generally consistent diagnostic power across 11 laboratories. Ratio measurement variability and bias were also comparable amongst laboratories for the same measurement process. Different biases were observed for measurement processes using different mRNA enrichment protocols.

Submitted 18 June, 2014; originally announced June 2014.

Comments: 65 pages, 6 Main Figures, 33 Supplementary Figures

Journal ref: Nat. Commun. (2014) 5:5125

arXiv:1305.3347 [pdf, other]

doi 10.1093/bioinformatics/btt656

featureCounts: An efficient general-purpose program for assigning sequence reads to genomic features

Authors: Yang Liao, Gordon K Smyth, Wei Shi

Abstract: Next-generation sequencing technologies generate millions of short sequence reads, which are usually aligned to a reference genome. In many applications, the key information required for downstream analysis is the number of reads mapping to each genomic feature, for example to each exon or each gene. The process of counting reads is called read summarization. Read summarization is required for a great variety of genomic analyses but has so far received relatively little attention in the literature. We present featureCounts, a read summarization program suitable for counting reads generated from either RNA or genomic DNA sequencing experiments. featureCounts implements highly efficient chromosome hashing and feature blocking techniques. It is considerably faster than existing methods (by an order of magnitude for gene-level summarization) and requires far less computer memory. It works with either single or paired-end reads and provides a wide range of options appropriate for different sequencing applications. featureCounts is available under GNU General Public License as part of the Subread (http://subread.sourceforge.net) or Rsubread (http://www.bioconductor.org) software packages.

Submitted 14 November, 2013; v1 submitted 14 May, 2013; originally announced May 2013.

Comments: This manuscript has now been published on Bioinformatics Yang Liao, Gordon K Smyth and Wei Shi. featureCounts: an efficient general-purpose program for assigning sequence reads to genomic features. Bioinformatics 2013

Journal ref: Bioinformatics 30 (2014), 923-930
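
To make the notion of read summarization in the abstract above concrete, here is a deliberately naive Python sketch that counts aligned reads per feature. It is not the featureCounts algorithm: the function name `count_reads`, the tuple layouts, the 1-based inclusive coordinates and the rule that reads overlapping more than one feature stay unassigned are all assumptions made for illustration. The only idea borrowed from the abstract is grouping features by chromosome in a hash table before matching; feature blocking is not attempted.

```python
from collections import defaultdict


def count_reads(features, reads):
    """Count reads per feature with a simple overlap rule.

    features: iterable of (feature_id, chrom, start, end), 1-based inclusive.
    reads:    iterable of (chrom, start, end), 1-based inclusive.
    Reads overlapping zero or several features are left unassigned.
    """
    # Hash features by chromosome so each read only scans its own chromosome.
    by_chrom = defaultdict(list)
    counts = {}
    for feature_id, chrom, start, end in features:
        by_chrom[chrom].append((start, end, feature_id))
        counts[feature_id] = 0

    for chrom, read_start, read_end in reads:
        hits = [
            feature_id
            for start, end, feature_id in by_chrom.get(chrom, [])
            if read_start <= end and start <= read_end  # intervals overlap
        ]
        if len(hits) == 1:
            counts[hits[0]] += 1
    return counts
```

The linear scan over each chromosome's features is the part a production tool would replace with a faster lookup structure.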

arXiv:1302.3685 [pdf, other]

doi 10.1038/nprot.2013.099

Count-based differential expression analysis of RNA sequencing data using R and Bioconductor

Authors: Simon Anders, Davis J. McCarthy, Yunshen Chen, Michal Okoniewski, Gordon K. Smyth, Wolfgang Huber, Mark D. Robinson

Abstract: RNA sequencing (RNA-seq) has been rapidly adopted for the profiling of transcriptomes in many areas of biology, including studies into gene regulation, development and disease. Of particular interest is the discovery of differentially expressed genes across different conditions (e.g., tissues, perturbations), while optionally adjusting for other systematic factors that affect the data collection process. There are a number of subtle yet critical aspects of these analyses, such as read counting, appropriate treatment of biological variability, quality control checks and appropriate setup of statistical modeling. Several variations have been presented in the literature, and there is a need for guidance on current best practices. This protocol presents a "state-of-the-art" computational and statistical RNA-seq differential expression analysis workflow largely based on the free open-source R language and Bioconductor software and in particular, two widely-used tools DESeq and edgeR. Hands-on time for typical small experiments (e.g., 4-10 samples) can be <1 hour, with computation time <1 day using a standard desktop PC.

Submitted 20 June, 2013; v1 submitted 15 February, 2013; originally announced February 2013.

Showing 1–7 of 7 results for author: Smyth, K