-
An Ecosystem of Services for FAIR Computational Workflows
Authors:
Sean R. Wilkinson,
Johan Gustafsson,
Finn Bacall,
Khalid Belhajjame,
Salvador Capella,
Jose Maria Fernandez Gonzalez,
Jacob Fosso Tande,
Luiz Gadelha,
Daniel Garijo,
Patricia Grubel,
Bjorn Grüning,
Farah Zaib Khan,
Sehrish Kanwal,
Simone Leo,
Stuart Owen,
Luca Pireddu,
Line Pouchard,
Laura Rodríguez-Navas,
Beatriz Serrano-Solano,
Stian Soiland-Reyes,
Baiba Vilne,
Alan Williams,
Merridee Ann Wouters,
Frederik Coppens,
Carole Goble
Abstract:
Computational workflows, regardless of their portability or maturity, represent major investments of both effort and expertise. They are first class, publishable research objects in their own right. They are key to sharing methodological know-how for reuse, reproducibility, and transparency. Consequently, the application of the FAIR principles to workflows is inevitable to enable them to be Findable, Accessible, Interoperable, and Reusable. Making workflows FAIR would reduce duplication of effort, assist in the reuse of best practice approaches and community-supported standards, and ensure that workflows as digital objects can support reproducible and robust science. FAIR workflows also encourage interdisciplinary collaboration, enabling workflows developed in one field to be repurposed and adapted for use in other research domains. FAIR workflows draw from both FAIR data and software principles. Workflows propose explicit method abstractions and tight bindings to data, hence making many of the data principles apply. Meanwhile, as executable pipelines with a strong emphasis on code composition and data flow between steps, the software principles apply, too. As workflows are chiefly concerned with the processing and creation of data, they also have an important role to play in ensuring and supporting data FAIRification.
The FAIR Principles for software and data mandate the use of persistent identifiers (PIDs) and machine-actionable metadata associated with workflows to enable findability, accessibility, interoperability, and reusability. Implementing the principles requires a PID and metadata framework with appropriate programmatic protocols, an accompanying ecosystem of services, tools, guidelines, policies, and best practices, as well as the buy-in of existing workflow systems such that they adapt in order to adopt. The European EOSC-Life Workflow Collaboratory is an example of such a ...
Submitted 21 May, 2025;
originally announced May 2025.
-
Applying the FAIR Principles to computational workflows
Authors:
Sean R. Wilkinson,
Meznah Aloqalaa,
Khalid Belhajjame,
Michael R. Crusoe,
Bruno de Paula Kinoshita,
Luiz Gadelha,
Daniel Garijo,
Ove Johan Ragnar Gustafsson,
Nick Juty,
Sehrish Kanwal,
Farah Zaib Khan,
Johannes Köster,
Karsten Peters-von Gehlen,
Line Pouchard,
Randy K. Rannow,
Stian Soiland-Reyes,
Nicola Soranzo,
Shoaib Sufi,
Ziheng Sun,
Baiba Vilne,
Merridee A. Wouters,
Denis Yuen,
Carole Goble
Abstract:
Recent trends within computational and data sciences show an increasing recognition and adoption of computational workflows as tools for productivity and reproducibility that also democratize access to platforms and processing know-how. As digital objects to be shared, discovered, and reused, computational workflows benefit from the FAIR principles, which stand for Findable, Accessible, Interoperable, and Reusable. The Workflows Community Initiative's FAIR Workflows Working Group (WCI-FW), a global and open community of researchers and developers working with computational workflows across disciplines and domains, has systematically addressed the application of both FAIR data and software principles to computational workflows. We present recommendations with commentary that reflects our discussions and justifies our choices and adaptations. These are offered to workflow users and authors, workflow management system developers, and providers of workflow services as guidelines for adoption and fodder for discussion. The FAIR recommendations for workflows that we propose in this paper will maximize their value as research assets and facilitate their adoption by the wider community.
Submitted 24 February, 2025; v1 submitted 4 October, 2024;
originally announced October 2024.
-
Workflow Mini-Apps: Portable, Scalable, Tunable & Faithful Representations of Scientific Workflows
Authors:
Ozgur Ozan Kilic,
Tianle Wang,
Matteo Turilli,
Mikhail Titov,
Andre Merzky,
Line Pouchard,
Shantenu Jha
Abstract:
Workflows are critical for scientific discovery. However, the sophistication, heterogeneity, and scale of workflows make building, testing, and optimizing them increasingly challenging. Furthermore, their complexity and heterogeneity make performance reproducibility hard. In this paper, we propose workflow mini-apps as a tool to address the challenges in building and testing workflows while controlling the fidelity of representing real-world workflows. Workflow mini-apps are deployed and run on various HPC systems and architectures without workflow-specific constraints. We offer insight into their design and implementation, providing an analysis of their performance and reproducibility. Workflow mini-apps thus advance the science of workflows by providing simple, portable, and managed (fidelity) representations of otherwise complex and difficult-to-control real workflows.
Submitted 26 March, 2024;
originally announced March 2024.
-
Workflows Community Summit 2022: A Roadmap Revolution
Authors:
Rafael Ferreira da Silva,
Rosa M. Badia,
Venkat Bala,
Debbie Bard,
Peer-Timo Bremer,
Ian Buckley,
Silvina Caino-Lores,
Kyle Chard,
Carole Goble,
Shantenu Jha,
Daniel S. Katz,
Daniel Laney,
Manish Parashar,
Frederic Suter,
Nick Tyler,
Thomas Uram,
Ilkay Altintas,
Stefan Andersson,
William Arndt,
Juan Aznar,
Jonathan Bader,
Bartosz Balis,
Chris Blanton,
Kelly Rosa Braghetto,
Aharon Brodutch
, et al. (80 additional authors not shown)
Abstract:
Scientific workflows have become integral tools in broad scientific computing use cases. Science discovery is increasingly dependent on workflows to orchestrate large and complex scientific experiments that range from execution of a cloud-based data preprocessing pipeline to multi-facility instrument-to-edge-to-HPC computational workflows. Given the changing landscape of scientific computing and the evolving needs of emerging scientific applications, it is paramount that the development of novel scientific workflows and system functionalities seek to increase the efficiency, resilience, and pervasiveness of existing systems and applications. Specifically, the proliferation of machine learning/artificial intelligence (ML/AI) workflows, need for processing large-scale datasets produced by instruments at the edge, intensification of near real-time data processing, support for long-term experiment campaigns, and emergence of quantum computing as an adjunct to HPC, have significantly changed the functional and operational requirements of workflow systems. Workflow systems now need to, for example, support data streams from the edge-to-cloud-to-HPC, enable the management of many small-sized files, allow data reduction while ensuring high accuracy, orchestrate distributed services (workflows, instruments, data movement, provenance, publication, etc.) across computing and user facilities, among others. Further, to accelerate science, it is also necessary that these systems implement specifications/standards and APIs for seamless (horizontal and vertical) integration between systems and applications, as well as enabling the publication of workflows and their associated products according to the FAIR principles. This document reports on discussions and findings from the 2022 international edition of the Workflows Community Summit that took place on November 29 and 30, 2022.
Submitted 31 March, 2023;
originally announced April 2023.
-
A Rigorous Uncertainty-Aware Quantification Framework Is Essential for Reproducible and Replicable Machine Learning Workflows
Authors:
Line Pouchard,
Kristofer G. Reyes,
Francis J. Alexander,
Byung-Jun Yoon
Abstract:
The ability to replicate predictions by machine learning (ML) or artificial intelligence (AI) models and results in scientific workflows that incorporate such ML/AI predictions is driven by numerous factors. An uncertainty-aware metric that can quantitatively assess the reproducibility of quantities of interest (QoI) would contribute to the trustworthiness of results obtained from scientific workflows involving ML/AI models. In this article, we discuss how uncertainty quantification (UQ) in a Bayesian paradigm can provide a general and rigorous framework for quantifying reproducibility for complex scientific workflows. Such a framework has the potential to fill a critical gap that currently exists in ML/AI for scientific workflows, as it will enable researchers to determine the impact of ML/AI model prediction variability on the predictive outcomes of ML/AI-powered workflows. We expect that the envisioned framework will contribute to the design of more reproducible and trustworthy workflows for diverse scientific applications, and ultimately, accelerate scientific discoveries.
Submitted 23 August, 2023; v1 submitted 13 January, 2023;
originally announced January 2023.
-
Figure Descriptive Text Extraction using Ontological Representation
Authors:
Gilchan Park,
Julia Rayz,
Line Pouchard
Abstract:
Experimental research publications provide figure form resources including graphs, charts, and any type of images to effectively support and convey methods and results. To describe figures, authors add captions, which are often incomplete, and more descriptions reside in body text. This work presents a method to extract figure descriptive text from the body of scientific articles. We adopted ontological semantics to aid concept recognition of figure-related information, which generates human- and machine-readable knowledge representations from sentences. Our results show that conceptual models bring an improvement in figure descriptive sentence classification over word-based approaches.
Submitted 11 August, 2022;
originally announced August 2022.
-
Chimbuko: A Workflow-Level Scalable Performance Trace Analysis Tool
Authors:
Sungsoo Ha,
Wonyong Jeong,
Gyorgy Matyasfalvi,
Cong Xie,
Kevin Huck,
Jong Youl Choi,
Abid Malik,
Li Tang,
Hubertus Van Dam,
Line Pouchard,
Wei Xu,
Shinjae Yoo,
Nicholas D'Imperio,
Kerstin Kleese Van Dam
Abstract:
Because of the limits input/output systems currently impose on high-performance computing systems, a new generation of workflows that include online data reduction and analysis is emerging. Diagnosing their performance requires sophisticated performance analysis capabilities due to the complexity of execution patterns and underlying hardware, and no tool could handle the voluminous performance trace data needed to detect potential problems. This work introduces Chimbuko, a performance analysis framework that provides real-time, distributed, in situ anomaly detection. Data volumes are reduced for human-level processing without losing necessary details. Chimbuko supports online performance monitoring via a visualization module that presents the overall workflow anomaly distribution, call stacks, and timelines. Chimbuko also supports the capture and reduction of performance provenance. To the best of our knowledge, Chimbuko is the first online, distributed, and scalable workflow-level performance trace analysis framework, and we demonstrate the tool's usefulness on Oak Ridge National Laboratory's Summit system.
Submitted 31 August, 2020;
originally announced August 2020.
-
Use Cases of Computational Reproducibility for Scientific Workflows at Exascale
Authors:
Line Pouchard,
Sterling Baldwin,
Todd Elsethagen,
Carlos Gamboa,
Shantenu Jha,
Bibi Raju,
Eric Stephan,
Li Tang,
Kerstin Kleese Van Dam
Abstract:
We propose an approach for improved reproducibility that includes capturing and relating provenance characteristics and performance metrics in a hybrid queryable system, the ProvEn server. The system capabilities are illustrated on two use cases: scientific reproducibility of results in the ACME climate simulations and performance reproducibility in molecular dynamics workflows on HPC platforms.
Submitted 20 April, 2018;
originally announced May 2018.
-
Standing Together for Reproducibility in Large-Scale Computing: Report on reproducibility@XSEDE
Authors:
Doug James,
Nancy Wilkins-Diehr,
Victoria Stodden,
Dirk Colbry,
Carlos Rosales,
Mark Fahey,
Justin Shi,
Rafael F. Silva,
Kyo Lee,
Ralph Roskies,
Laurence Loewe,
Susan Lindsey,
Rob Kooper,
Lorena Barba,
David Bailey,
Jonathan Borwein,
Oscar Corcho,
Ewa Deelman,
Michael Dietze,
Benjamin Gilbert,
Jan Harkes,
Seth Keele,
Praveen Kumar,
Jong Lee,
Erika Linke
, et al. (30 additional authors not shown)
Abstract:
This is the final report on reproducibility@xsede, a one-day workshop held in conjunction with XSEDE14, the annual conference of the Extreme Science and Engineering Discovery Environment (XSEDE). The workshop's discussion-oriented agenda focused on reproducibility in large-scale computational research. Two important themes capture the spirit of the workshop submissions and discussions: (1) organizational stakeholders, especially supercomputer centers, are in a unique position to promote, enable, and support reproducible research; and (2) individual researchers should conduct each experiment as though someone will replicate that experiment. Participants documented numerous issues, questions, technologies, practices, and potentially promising initiatives emerging from the discussion, but also highlighted four areas of particular interest to XSEDE: (1) documentation and training that promotes reproducible research; (2) system-level tools that provide build- and run-time information at the level of the individual job; (3) the need to model best practices in research collaborations involving XSEDE staff; and (4) continued work on gateways and related technologies. In addition, an intriguing question emerged from the day's interactions: would there be value in establishing an annual award for excellence in reproducible research?
Submitted 2 January, 2015; v1 submitted 17 December, 2014;
originally announced December 2014.
-
The Earth System Grid: Supporting the Next Generation of Climate Modeling Research
Authors:
David Bernholdt,
Shishir Bharathi,
David Brown,
Kasidit Chanchio,
Meili Chen,
Ann Chervenak,
Luca Cinquini,
Bob Drach,
Ian Foster,
Peter Fox,
Jose Garcia,
Carl Kesselman,
Rob Markel,
Don Middleton,
Veronika Nefedova,
Line Pouchard,
Arie Shoshani,
Alex Sim,
Gary Strand,
Dean Williams
Abstract:
Understanding the earth's climate system and how it might be changing is a preeminent scientific challenge. Global climate models are used to simulate past, present, and future climates, and experiments are executed continuously on an array of distributed supercomputers. The resulting data archive, spread over several sites, currently contains upwards of 100 TB of simulation data and is growing rapidly. Looking toward mid-decade and beyond, we must anticipate and prepare for distributed climate research data holdings of many petabytes. The Earth System Grid (ESG) is a collaborative interdisciplinary project aimed at addressing the challenge of enabling management, discovery, access, and analysis of these critically important datasets in a distributed and heterogeneous computational environment. The problem is fundamentally a Grid problem. Building upon the Globus toolkit and a variety of other technologies, ESG is developing an environment that addresses authentication, authorization for data access, large-scale data transport and management, services and abstractions for high-performance remote data access, mechanisms for scalable data replication, cataloging with rich semantic and syntactic information, data discovery, distributed monitoring, and Web-based portals for using the system.
Submitted 13 December, 2007;
originally announced December 2007.