-
FAIR Data Pipeline: provenance-driven data management for traceable scientific workflows
Authors:
Sonia Natalie Mitchell,
Andrew Lahiff,
Nathan Cummings,
Jonathan Hollocombe,
Bram Boskamp,
Ryan Field,
Dennis Reddyhoff,
Kristian Zarebski,
Antony Wilson,
Bruno Viola,
Martin Burke,
Blair Archibald,
Paul Bessell,
Richard Blackwell,
Lisa A Boden,
Alys Brett,
Sam Brett,
Ruth Dundas,
Jessica Enright,
Alejandra N. Gonzalez-Beltran,
Claire Harris,
Ian Hinder,
Christopher David Hughes,
Martin Knight,
Vino Mano, et al. (13 additional authors not shown)
Abstract:
Modern epidemiological analyses to understand and combat the spread of disease depend critically on access to, and use of, data. Rapidly evolving data, such as data streams changing during a disease outbreak, are particularly challenging. Data management is further complicated by data being imprecisely identified when used. Public trust in policy decisions resulting from such analyses is easily damaged and is often low, with cynicism arising where claims of "following the science" are made without accompanying evidence. Tracing the provenance of such decisions back through open software to primary data would clarify this evidence, enhancing the transparency of the decision-making process. Here, we demonstrate a Findable, Accessible, Interoperable and Reusable (FAIR) data pipeline developed during the COVID-19 pandemic that allows easy annotation of data as they are consumed by analyses, while tracing the provenance of scientific outputs back through the analytical source code to data sources. Such a tool provides a mechanism for the public, and fellow scientists, to better assess the trust that should be placed in scientific evidence, while allowing scientists to support policy-makers in openly justifying their decisions. We believe that tools such as this should be promoted for use across all areas of policy-facing research.
Submitted 4 May, 2022; v1 submitted 13 October, 2021;
originally announced October 2021.
-
A hypothesis testing framework for the ratio of means of two negative binomial distributions: classifying the efficacy of anthelmintic treatment against intestinal parasites
Authors:
Matthew Denwood,
Giles Innocent,
Jamie Prentice,
Louise Matthews,
Stuart Reid,
Christian Pipper,
Bruno Levecke,
Ray Kaplan,
Andrew Kotze,
Jennifer Keiser,
Marta Palmeirim,
Iain McKendrick
Abstract:
Over-dispersed count data typically pose a challenge to analysis using standard statistical methods, particularly when evaluating the efficacy of an intervention through the observed effect on the mean. We outline a novel statistical method for analysing such data, along with a statistically coherent framework within which the observed efficacy is assigned one of four easily interpretable classifications relative to a target efficacy: "adequate", "reduced", "borderline" or "inconclusive". We illustrate our approach by analysing the anthelmintic efficacy of mebendazole using a dataset of egg reduction rates relating to three intestinal parasites from a treatment arm of a randomised controlled trial involving 91 children on Pemba Island, Tanzania. Numerical validation of the type I error rates of the novel method indicates that it performs as well as the best existing computationally-simple method, but with the additional advantage of providing valid inference in the case of an observed efficacy of 100%. The framework and statistical analysis method presented also allow the required sample size of a prospective study to be determined via simulation. Both the framework and method presented have high potential utility within medical parasitology, as well as other fields where over-dispersed count datasets are commonplace. To facilitate the use of these methods within the wider medical community, user interfaces for both study planning and analysis of existing datasets are freely provided along with our open-source code via: http://www.fecrt.com/framework
Submitted 15 October, 2019;
originally announced October 2019.
-
How to partition diversity
Authors:
Richard Reeve,
Tom Leinster,
Christina A. Cobbold,
Jill Thompson,
Neil Brummitt,
Sonia N. Mitchell,
Louise Matthews
Abstract:
Diversity measurement underpins the study of biological systems, but measures used vary across disciplines. Despite their common use and broad utility, no unified framework has emerged for measuring, comparing and partitioning diversity. The introduction of information theory into diversity measurement has laid the foundations, but the framework is incomplete without the ability to partition diversity, which is central to fundamental questions across the life sciences: How do we prioritise communities for conservation? How do we identify reservoirs and sources of pathogenic organisms? How do we measure ecological disturbance arising from climate change?
The lack of a common framework means that diversity measures from different fields have conflicting fundamental properties, allowing conclusions reached to depend on the measure chosen. This conflict is unnecessary and unhelpful. A mathematically consistent framework would transform disparate fields by delivering scientific insights in a common language. It would also allow the transfer of theoretical and practical developments between fields.
We meet this need, providing a versatile unified framework for partitioning biological diversity. It encompasses any kind of similarity between individuals, from functional to genetic, allowing comparisons between qualitatively different kinds of diversity. Where existing partitioning measures aggregate information across the whole population, our approach permits the direct comparison of subcommunities, allowing us to pinpoint distinct, diverse or representative subcommunities and investigate population substructure. The framework is provided as a ready-to-use R package to easily test our approach.
Submitted 8 December, 2016; v1 submitted 25 April, 2014;
originally announced April 2014.