-
Making Software FAIR: A machine-assisted workflow for the research software lifecycle
Authors:
Petr Knoth,
Laurent Romary,
Patrice Lopez,
Roberto Di Cosmo,
Pavel Smrz,
Tomasz Umerle,
Melissa Harrison,
Alain Monteil,
Matteo Cancellieri,
David Pride
Abstract:
A key issue hindering discoverability, attribution and reusability of open research software is that its existence often remains hidden within the manuscript of research papers. For these resources to become first-class bibliographic records, they first need to be identified and subsequently registered with persistent identifiers (PIDs) to be made FAIR (Findable, Accessible, Interoperable and Reusable). To this day, much open research software fails to meet FAIR principles and software resources are mostly not explicitly linked from the manuscripts that introduced them or used them. SoFAIR is a 2-year international project (2024-2025) which proposes a solution to the above problem realised over the content available through the global network of open repositories. SoFAIR will extend the capabilities of widely used open scholarly infrastructures (CORE, Software Heritage, HAL) and tools (GROBID) operated by the consortium partners, delivering and deploying an effective solution for the management of the research software lifecycle, including: 1) ML-assisted identification of research software assets from within the manuscripts of scholarly papers, 2) validation of the identified assets by authors, 3) registration of software assets with PIDs and their archival.
Submitted 8 January, 2025;
originally announced January 2025.
-
The Software Heritage Open Science Ecosystem
Authors:
Roberto Di Cosmo,
Stefano Zacchiroli
Abstract:
Software Heritage is the largest public archive of software source code and associated development history, as captured by modern version control systems. As of July 2023, it has archived more than 16 billion unique source code files coming from more than 250 million collaborative development projects. In this chapter, we describe the Software Heritage ecosystem, focusing on research and open science use cases. On the one hand, Software Heritage supports empirical research on software by materializing in a single Merkle directed acyclic graph the development history of public code. This giant graph of source code artifacts (files, directories, and commits) can be used (and has been used) to study repository forks, open source contributors, vulnerability propagation, software provenance tracking, source code indexing, and more. On the other hand, Software Heritage ensures availability and guarantees integrity of the source code of software artifacts used in any field that relies on software to conduct experiments, contributing to making research reproducible. The source code used in scientific experiments can be archived (e.g., via integration with open-access repositories), referenced using persistent identifiers that allow downstream integrity checks, and linked to/from other scholarly digital artifacts.
Submitted 16 October, 2023;
originally announced October 2023.
-
Dependency Solving Is Still Hard, but We Are Getting Better at It
Authors:
Pietro Abate,
Roberto Di Cosmo,
Georgios Gousios,
Stefano Zacchiroli
Abstract:
Dependency solving is a hard (NP-complete) problem in all non-trivial component models due to either mutually incompatible versions of the same packages or explicitly declared package conflicts. As such, software upgrade planning needs to rely on highly specialized dependency solvers, lest it fall into pitfalls such as incompleteness: a combination of package versions that satisfies the dependency constraints does exist, but the package manager is unable to find it. In this paper we look back at proposals from dependency solving research dating back a few years. Specifically, we review the idea of treating dependency solving as a separate concern in package manager implementations, relying on generic dependency solvers based on tried and tested techniques such as SAT solving, PBO, MILP, etc. By conducting a census of dependency solving capabilities in state-of-the-art package managers we conclude that some proposals are starting to take off (e.g., SAT-based dependency solving) while, with few exceptions, others have not (e.g., outsourcing dependency solving to reusable components). We reflect on why that has been the case and look at novel challenges for dependency solving that have emerged since.
Submitted 16 November, 2020;
originally announced November 2020.
-
Archiving and referencing source code with Software Heritage
Authors:
Roberto Di Cosmo
Abstract:
Software, and software source code in particular, is widely used in modern research. It must be properly archived, referenced, described and cited in order to build a stable and long-lasting corpus of scientific knowledge. In this article we show how the Software Heritage universal source code archive provides a means to fully address the first two concerns, by seamlessly archiving all publicly available software source code, and by providing intrinsic persistent identifiers that allow referencing it at various granularities in a way that is both convenient and effective. We call upon the research community to widely adopt this approach.
Submitted 31 March, 2020;
originally announced April 2020.
-
Referencing Source Code Artifacts: a Separate Concern in Software Citation
Authors:
Roberto Di Cosmo,
Morane Gruenpeter,
Stefano Zacchiroli
Abstract:
Among the entities involved in software citation, software source code requires special attention, due to the role it plays in ensuring scientific reproducibility. To reference source code we need identifiers that are not only unique and persistent, but also support integrity checking intrinsically. Suitable identifiers must guarantee that denoted objects will always stay the same, without relying on external third parties and administrative processes. We analyze the role of identifiers for digital objects (IDOs), whose properties are different from, and complementary to, those of the various digital identifiers of objects (DIOs) that are today popular building blocks of software and data citation toolchains. We argue that both kinds of identifiers are needed and detail the syntax, semantics, and practical implementation of the persistent identifiers (PIDs) adopted by the Software Heritage project to reference billions of software source code artifacts such as source code files, directories, and commits.
Submitted 23 January, 2020;
originally announced January 2020.
-
How to use Software Heritage for archiving and referencing your source code: guidelines and walkthrough
Authors:
Roberto Di Cosmo
Abstract:
Software source code is an essential research output, and many research communities strongly encourage making the source code of the artefact available by archiving it in publicly accessible long-term archives. Software Heritage is a non-profit, long-term universal archive specifically designed for software source code, and able to store not only a software artifact, but also its full development history. It provides the ideal place to preserve research software artifacts, and offers powerful mechanisms to enhance research articles with precise references to relevant fragments of your source code. Using Software Heritage for your research software artifacts is straightforward and involves three simple steps. This document details each of these three steps, providing guidelines for making the most out of Software Heritage for your research.
Submitted 24 September, 2019;
originally announced September 2019.
-
Growth and Duplication of Public Source Code over Time: Provenance Tracking at Scale
Authors:
Guillaume Rousseau,
Roberto Di Cosmo,
Stefano Zacchiroli
Abstract:
We study the evolution of the largest known corpus of publicly available source code, i.e., the Software Heritage archive (4B unique source code files, 1B commits capturing their development histories across 50M software projects). On this corpus we quantify the growth rate of original, never-seen-before source code files and commits. We find the growth rates to be exponential over a period of more than 40 years. We then estimate the multiplication factor, i.e., how much the same artifacts (e.g., files or commits) appear in different contexts (e.g., commits or source code distribution places). We observe a combinatorial explosion in the multiplication of identical source code files across different commits. We discuss the implication of these findings for the problem of tracking the provenance of source code artifacts (e.g., where and when a given source code file or commit has been observed in the wild) for the entire body of publicly available source code. To that end we benchmark different data models for capturing software provenance information at this scale and growth rate. We identify a viable solution that is deployable on commodity hardware and appears to be maintainable for the foreseeable future.
Submitted 19 June, 2019;
originally announced June 2019.
-
Attributing and Referencing (Research) Software: Best Practices and Outlook from Inria
Authors:
Pierre Alliez,
Roberto Di Cosmo,
Benjamin Guedj,
Alain Girault,
Mohand-Said Hacid,
Arnaud Legrand,
Nicolas P. Rougier
Abstract:
Software is a fundamental pillar of modern scientific research, not only in computer science, but actually across all fields and disciplines. However, there is a lack of adequate means to cite and reference software, for many reasons. An obvious first reason is software authorship, which can range from a single developer to a whole team, and can even vary in time. The panorama is even more complex than that, because many roles can be involved in software development: software architect, coder, debugger, tester, team manager, and so on. Arguably, the researchers who have invented the key algorithms underlying the software can also claim a part of the authorship. And there are many other reasons that make this issue complex. We provide in this paper a contribution to the ongoing efforts to develop proper guidelines and recommendations for software citation, building upon the internal experience of Inria, the French research institute for digital sciences. As a central contribution, we make three key recommendations. (1) We propose a richer taxonomy for software contributions with a qualitative scale. (2) We claim that it is essential to put the human at the heart of the evaluation. And (3) we propose to distinguish citation from reference.
Submitted 25 November, 2019; v1 submitted 27 May, 2019;
originally announced May 2019.
-
Sources of Inter-package Conflicts in Debian
Authors:
Cyrille Artho,
Roberto Di Cosmo,
Kuniyasu Suzaki,
Stefano Zacchiroli
Abstract:
Inter-package conflicts require the presence of two or more packages in a particular configuration, and thus tend to be harder to detect and localize than conventional (intra-package) defects. Hundreds of such inter-package conflicts go undetected by the normal testing and distribution process until they are later reported by a user. The reason for this is that current meta-data is not fine-grained and accurate enough to cover all common types of conflicts. A case study of inter-package conflicts in Debian has shown that with more detailed package meta-data, at least one third of all package conflicts could be prevented relatively easily, while another one third could be found by targeted testing of packages that share common resources or characteristics. This paper reports the case study and proposes ideas to detect inter-package conflicts in the future.
Submitted 6 October, 2011;
originally announced October 2011.
-
Aligning component upgrades
Authors:
Roberto Di Cosmo,
Olivier Lhomme,
Claude Michel
Abstract:
Modern software systems, like GNU/Linux distributions or Eclipse-based development environments, are often deployed by selecting components out of large component repositories. Maintaining such software systems by performing component upgrades is a complex task, and the users need to have an expressive preference language at their disposal to specify the kind of upgrades they are interested in. Recent research has shown that it is possible to develop solvers that handle preferences expressed as a combination of a few basic criteria used in the MISC competition, ranging from the number of new components to the freshness of the final configuration. In this work we introduce a set of new criteria that allow the users to specify their preferences for solutions with components aligned to the same upstream sources, provide an efficient encoding and report on the experimental results that prove that optimising these alignment criteria is a tractable problem in practice.
Submitted 1 September, 2011;
originally announced September 2011.
-
Strong Dependencies between Software Components
Authors:
Pietro Abate,
Jaap Boender,
Roberto Di Cosmo,
Stefano Zacchiroli
Abstract:
Component-based systems often describe context requirements in terms of explicit inter-component dependencies. Studying large instances of such systems, such as free and open source software (FOSS) distributions, in terms of declared dependencies between packages is appealing. It is however also misleading when the language to express dependencies is as expressive as boolean formulae, which is often the case. In such settings, a more appropriate notion of component dependency exists: strong dependency. This paper introduces this notion as a first step towards modeling semantic, rather than syntactic, inter-component relationships. Furthermore, a notion of component sensitivity is derived from strong dependencies, with applications to quality assurance and to the evaluation of upgrade risks. An empirical study of strong dependencies and sensitivity is presented, in the context of one of the largest freely available component-based systems.
Submitted 26 May, 2009;
originally announced May 2009.
-
Package upgrades in FOSS distributions: details and challenges
Authors:
Roberto Di Cosmo,
Stefano Zacchiroli,
Paulo Trezentos
Abstract:
The upgrade problems faced by Free and Open Source Software distributions have characteristics not easily found elsewhere. We describe the structure of packages and their role in the upgrade process. We show that state-of-the-art package managers have shortcomings inhibiting their ability to cope with frequent upgrade failures. We survey current countermeasures to such failures, argue that they are not satisfactory, and sketch alternative solutions.
Submitted 10 February, 2009;
originally announced February 2009.