-
Packaging research artefacts with RO-Crate
Authors:
Stian Soiland-Reyes,
Peter Sefton,
Mercè Crosas,
Leyla Jael Castro,
Frederik Coppens,
José M. Fernández,
Daniel Garijo,
Björn Grüning,
Marco La Rosa,
Simone Leo,
Eoghan Ó Carragáin,
Marc Portier,
Ana Trisovic,
RO-Crate Community,
Paul Groth,
Carole Goble
Abstract:
An increasing number of researchers support reproducibility by including pointers to and descriptions of datasets, software and methods in their publications. However, scientific articles may be ambiguous, incomplete and difficult to process by automated systems. In this paper we introduce RO-Crate, an open, community-driven, and lightweight approach to packaging research artefacts along with their metadata in a machine readable manner. RO-Crate is based on Schema.org annotations in JSON-LD, aiming to establish best practices to formally describe metadata in an accessible and practical way for their use in a wide variety of situations.
An RO-Crate is a structured archive of all the items that contributed to a research outcome, including their identifiers, provenance, relations and annotations. As a general purpose packaging approach for data and their metadata, RO-Crate is used across multiple areas, including bioinformatics, digital humanities and regulatory sciences. By applying "just enough" Linked Data standards, RO-Crate simplifies the process of making research outputs FAIR while also enhancing research reproducibility.
An RO-Crate for this article is available at https://w3id.org/ro/doi/10.5281/zenodo.5146227
Submitted 6 December, 2021; v1 submitted 14 August, 2021;
originally announced August 2021.
-
A large-scale study on research code quality and execution
Authors:
Ana Trisovic,
Matthew K. Lau,
Thomas Pasquier,
Mercè Crosas
Abstract:
This article presents a study on the quality and execution of research code from publicly-available replication datasets at the Harvard Dataverse repository. Research code is typically created by a group of scientists and published together with academic papers to facilitate research transparency and reproducibility. For this study, we define ten questions to address aspects impacting research reproducibility and reuse. First, we retrieve and analyze more than 2000 replication datasets with over 9000 unique R files published from 2010 to 2020. Second, we execute the code in a clean runtime environment to assess its ease of reuse. Common coding errors were identified, and some of them were solved with automatic code cleaning to aid code execution. We find that 74% of R files crashed in the initial execution, while 56% crashed when code cleaning was applied, showing that many errors can be prevented with good coding practices. We also analyze the replication datasets from journals' collections and discuss the impact of the journal policy strictness on the code re-execution rate. Finally, based on our results, we propose a set of recommendations for code dissemination aimed at researchers, journals, and repositories.
Submitted 23 March, 2021;
originally announced March 2021.
-
Advancing computational reproducibility in the Dataverse data repository platform
Authors:
Ana Trisovic,
Philip Durbin,
Tania Schlatter,
Gustavo Durand,
Sonia Barbosa,
Danny Brooke,
Mercè Crosas
Abstract:
Recent reproducibility case studies have raised concerns showing that much of the deposited research has not been reproducible. One of their conclusions was that the way data repositories store research data and code cannot fully facilitate reproducibility due to the absence of a runtime environment needed for the code execution. New specialized reproducibility tools provide cloud-based computational environments for code encapsulation, thus enabling research portability and reproducibility. However, they do not often enable research discoverability, standardized data citation, or long-term archival like data repositories do. This paper addresses the shortcomings of data repositories and reproducibility tools and how they could be overcome to improve the current lack of computational reproducibility in published and archived research outputs.
Submitted 16 June, 2020; v1 submitted 6 May, 2020;
originally announced May 2020.
-
Software Citation Implementation Challenges
Authors:
Daniel S. Katz,
Daina Bouquin,
Neil P. Chue Hong,
Jessica Hausman,
Catherine Jones,
Daniel Chivvis,
Tim Clark,
Mercè Crosas,
Stephan Druskat,
Martin Fenner,
Tom Gillespie,
Alejandra Gonzalez-Beltran,
Morane Gruenpeter,
Ted Habermann,
Robert Haines,
Melissa Harrison,
Edwin Henneken,
Lorraine Hwang,
Matthew B. Jones,
Alastair A. Kelly,
David N. Kennedy,
Katrin Leinweber,
Fernando Rios,
Carly B. Robinson,
Ilian Todorov, et al. (2 additional authors not shown)
Abstract:
The main output of the FORCE11 Software Citation working group (https://www.force11.org/group/software-citation-working-group) was a paper on software citation principles (https://doi.org/10.7717/peerj-cs.86) published in September 2016. This paper laid out a set of six high-level principles for software citation (importance, credit and attribution, unique identification, persistence, accessibility, and specificity) and discussed how they could be used to implement software citation in the scholarly community. In a series of talks and other activities, we have promoted software citation using these increasingly accepted principles. At the time the initial paper was published, we also provided guidance and examples on how to make software citable, though we now realize there are unresolved problems with that guidance. The purpose of this document is to provide an explanation of current issues impacting scholarly attribution of research software, organize updated implementation guidance, and identify where best practices and solutions are still needed.
Submitted 21 May, 2019;
originally announced May 2019.
-
Sharing and Preserving Computational Analyses for Posterity with encapsulator
Authors:
Thomas Pasquier,
Matthew K. Lau,
Xueyuan Han,
Elizabeth Fong,
Barbara S. Lerner,
Emery Boose,
Merce Crosas,
Aaron M. Ellison,
Margo Seltzer
Abstract:
Open data and open-source software may be part of the solution to science's "reproducibility crisis", but they are insufficient to guarantee reproducibility. Requiring minimal end-user expertise, encapsulator creates a "time capsule" with reproducible code in a self-contained computational environment. encapsulator provides end-users with a fully-featured desktop environment for reproducible research.
Submitted 6 May, 2018; v1 submitted 15 March, 2018;
originally announced March 2018.
-
An Open Science Platform for the Next Generation of Data
Authors:
Latanya Sweeney,
Merce Crosas
Abstract:
Imagine an online work environment where researchers have direct and immediate access to myriad data sources and tools and data management resources, useful throughout the research lifecycle. This is our vision for the next generation of the Dataverse Network: an Open Science Platform (OSP). For the first time, researchers would be able to seamlessly access and create primary and derived data from a variety of sources: prior research results, public data sets, harvested online data, physical instruments, private data collections, and even data from other standalone repositories. Researchers could recruit research participants and conduct research directly on the OSP, if desired, using readily available tools. Researchers could create private or shared workspaces to house data, access tools, and computation and could publish data directly on the platform or publish elsewhere with persistent data citations on the OSP. This manuscript describes the details of an Open Science Platform and its construction. Having an Open Science Platform will especially impact the rate of new scientific discoveries and make scientific findings more credible and accountable.
Submitted 18 June, 2015;
originally announced June 2015.
-
10 Simple Rules for the Care and Feeding of Scientific Data
Authors:
Alyssa Goodman,
Alberto Pepe,
Alexander W. Blocker,
Christine L. Borgman,
Kyle Cranmer,
Mercè Crosas,
Rosanne Di Stefano,
Yolanda Gil,
Paul Groth,
Margaret Hedstrom,
David W. Hogg,
Vinay Kashyap,
Ashish Mahabal,
Aneta Siemiginowska,
Aleksandra Slavkovic
Abstract:
This article offers a short guide to the steps scientists can take to ensure that their data and associated analyses continue to be of value and to be recognized. In just the past few years, hundreds of scholarly papers and reports have been written on questions of data sharing, data provenance, research reproducibility, licensing, attribution, privacy, and more, but our goal here is not to review that literature. Instead, we present a short guide intended for researchers who want to know why it is important to "care for and feed" data, with some practical advice on how to do that.
Submitted 9 January, 2014;
originally announced January 2014.