-
A Terminology for Scientific Workflow Systems
Authors:
Frédéric Suter,
Tainã Coleman,
İlkay Altintaş,
Rosa M. Badia,
Bartosz Balis,
Kyle Chard,
Iacopo Colonnelli,
Ewa Deelman,
Paolo Di Tommaso,
Thomas Fahringer,
Carole Goble,
Shantenu Jha,
Daniel S. Katz,
Johannes Köster,
Ulf Leser,
Kshitij Mehta,
Hilary Oliver,
J. -Luc Peterson,
Giovanni Pizzi,
Loïc Pottier,
Raül Sirvent,
Eric Suchyta,
Douglas Thain,
Sean R. Wilkinson,
Justin M. Wozniak
, et al. (1 additional authors not shown)
Abstract:
The term scientific workflow has evolved over the last two decades to encompass a broad range of compositions of interdependent compute tasks and data movements. It has also become an umbrella term for processing in modern scientific applications. Today, many scientific applications can be considered as workflows made of multiple dependent steps, and hundreds of workflow management systems (WMSs)…
▽ More
The term scientific workflow has evolved over the last two decades to encompass a broad range of compositions of interdependent compute tasks and data movements. It has also become an umbrella term for processing in modern scientific applications. Today, many scientific applications can be considered as workflows made of multiple dependent steps, and hundreds of workflow management systems (WMSs) have been developed to manage and run these workflows. However, no turnkey solution has emerged to address the diversity of scientific processes and the infrastructure on which they are implemented. Instead, new research problems requiring the execution of scientific workflows with some novel feature often lead to the development of an entirely new WMS. A direct consequence is that many existing WMSs share some salient features, offer similar functionalities, and can manage the same categories of workflows but also have some distinct capabilities. This situation makes researchers who develop workflows face the complex question of selecting a WMS. This selection can be driven by technical considerations, to find the system that is the most appropriate for their application and for the resources available to them, or other factors such as reputation, adoption, strong community support, or long-term sustainability. To address this problem, a group of WMS developers and practitioners joined their efforts to produce a community-based terminology of WMSs. This paper summarizes their findings and introduces this new terminology to characterize WMSs. This terminology is composed of fives axes: workflow characteristics, composition, orchestration, data management, and metadata capture. Each axis comprises several concepts that capture the prominent features of WMSs. Based on this terminology, this paper also presents a classification of 23 existing WMSs according to the proposed axes and terms.
△ Less
Submitted 9 July, 2025; v1 submitted 9 June, 2025;
originally announced June 2025.
-
A Python workflow definition for computational materials design
Authors:
Jan Janssen,
Janine George,
Julian Geiger,
Marnik Bercx,
Xing Wang,
Christina Ertural,
Joerg Schaarschmidt,
Alex M. Ganose,
Giovanni Pizzi,
Tilmann Hickel,
Joerg Neugebauer
Abstract:
Numerous Workflow Management Systems (WfMS) have been developed in the field of computational materials science with different workflow formats, hindering interoperability and reproducibility of workflows in the field. To address this challenge, we introduce here the Python Workflow Definition (PWD) as a workflow exchange format to share workflows between Python-based WfMS, currently AiiDA, jobflo…
▽ More
Numerous Workflow Management Systems (WfMS) have been developed in the field of computational materials science with different workflow formats, hindering interoperability and reproducibility of workflows in the field. To address this challenge, we introduce here the Python Workflow Definition (PWD) as a workflow exchange format to share workflows between Python-based WfMS, currently AiiDA, jobflow, and pyiron. This development is motivated by the similarity of these three Python-based WfMS, that represent the different workflow steps and data transferred between them as nodes and edges in a graph. With the PWD, we aim at fostering the interoperability and reproducibility between the different WfMS in the context of Findable, Accessible, Interoperable, Reusable (FAIR) workflows. To separate the scientific from the technical complexity, the PWD consists of three components: (1) a conda environment that specifies the software dependencies, (2) a Python module that contains the Python functions represented as nodes in the workflow graph, and (3) a workflow graph stored in the JavaScript Object Notation (JSON). The first version of the PWD supports directed acyclic graph (DAG)-based workflows. Thus, any DAG-based workflow defined in one of the three WfMS can be exported to the PWD and afterwards imported from the PWD to one of the other WfMS. After the import, the input parameters of the workflow can be adjusted and computing resources can be assigned to the workflow, before it is executed with the selected WfMS. This import from and export to the PWD is enabled by the PWD Python library that implements the PWD in AiiDA, jobflow, and pyiron.
△ Less
Submitted 26 May, 2025;
originally announced May 2025.
-
Machine-learning accelerated identification of exfoliable two-dimensional materials
Authors:
Mohammad Tohidi Vahdat,
Kumar Agrawal Varoon,
Giovanni Pizzi
Abstract:
Two-dimensional (2D) materials have been a central focus of recent research because they host a variety of properties, making them attractive both for fundamental science and for applications. It is thus crucial to be able to identify accurately and efficiently if bulk three-dimensional (3D) materials are formed by layers held together by a weak binding energy that, thus, can be potentially exfoli…
▽ More
Two-dimensional (2D) materials have been a central focus of recent research because they host a variety of properties, making them attractive both for fundamental science and for applications. It is thus crucial to be able to identify accurately and efficiently if bulk three-dimensional (3D) materials are formed by layers held together by a weak binding energy that, thus, can be potentially exfoliated into 2D materials. In this work, we develop a machine-learning (ML) approach that, combined with a fast preliminary geometrical screening, is able to efficiently identify potentially exfoliable materials. Starting from a combination of descriptors for crystal structures, we work out a subset of them that are crucial for accurate predictions. Our final ML model, based on a random forest classifier, has a very high recall of 98\%. Using a SHapely Additive exPlanations (SHAP) analysis, we also provide an intuitive explanation of the five most important variables of the model. Finally, we compare the performance of our best ML model with a deep neural network architecture using the same descriptors. To make our algorithms and models easily accessible, we publish an online tool on the Materials Cloud portal that only requires a bulk 3D crystal structure as input. Our tool thus provides a practical yet straightforward approach to assess whether any 3D compound can be exfoliated into 2D layers.
△ Less
Submitted 18 July, 2022;
originally announced July 2022.
-
Workflows in AiiDA: Engineering a high-throughput, event-based engine for robust and modular computational workflows
Authors:
Martin Uhrin,
Sebastiaan P. Huber,
Jusong Yu,
Nicola Marzari,
Giovanni Pizzi
Abstract:
Over the last two decades, the field of computational science has seen a dramatic shift towards incorporating high-throughput computation and big-data analysis as fundamental pillars of the scientific discovery process. This has necessitated the development of tools and techniques to deal with the generation, storage and processing of large amounts of data. In this work we present an in-depth look…
▽ More
Over the last two decades, the field of computational science has seen a dramatic shift towards incorporating high-throughput computation and big-data analysis as fundamental pillars of the scientific discovery process. This has necessitated the development of tools and techniques to deal with the generation, storage and processing of large amounts of data. In this work we present an in-depth look at the workflow engine powering AiiDA, a widely adopted, highly flexible and database-backed informatics infrastructure with an emphasis on data reproducibility. We detail many of the design choices that were made which were informed by several important goals: the ability to scale from running on individual laptops up to high-performance supercomputers, managing jobs with runtimes spanning from fractions of a second to weeks and scaling up to thousands of jobs concurrently, and all this while maximising robustness. In short, AiiDA aims to be a Swiss army knife for high-throughput computational science. As well as the architecture, we outline important API design choices made to give workflow writers a great deal of liberty whilst guiding them towards writing robust and modular workflows, ultimately enabling them to encode their scientific knowledge to the benefit of the wider scientific community.
△ Less
Submitted 21 July, 2020; v1 submitted 17 July, 2020;
originally announced July 2020.
-
AiiDA 1.0, a scalable computational infrastructure for automated reproducible workflows and data provenance
Authors:
Sebastiaan. P. Huber,
Spyros Zoupanos,
Martin Uhrin,
Leopold Talirz,
Leonid Kahle,
Rico Häuselmann,
Dominik Gresch,
Tiziano Müller,
Aliaksandr V. Yakutovich,
Casper W. Andersen,
Francisco F. Ramirez,
Carl S. Adorf,
Fernando Gargiulo,
Snehal Kumbhar,
Elsa Passaro,
Conrad Johnston,
Andrius Merkys,
Andrea Cepellotti,
Nicolas Mounet,
Nicola Marzari,
Boris Kozinsky,
Giovanni Pizzi
Abstract:
The ever-growing availability of computing power and the sustained development of advanced computational methods have contributed much to recent scientific progress. These developments present new challenges driven by the sheer amount of calculations and data to manage. Next-generation exascale supercomputers will harden these challenges, such that automated and scalable solutions become crucial.…
▽ More
The ever-growing availability of computing power and the sustained development of advanced computational methods have contributed much to recent scientific progress. These developments present new challenges driven by the sheer amount of calculations and data to manage. Next-generation exascale supercomputers will harden these challenges, such that automated and scalable solutions become crucial. In recent years, we have been developing AiiDA (http://www.aiida.net), a robust open-source high-throughput infrastructure addressing the challenges arising from the needs of automated workflow management and data provenance recording. Here, we introduce developments and capabilities required to reach sustained performance, with AiiDA supporting throughputs of tens of thousands processes/hour, while automatically preserving and storing the full data provenance in a relational database making it queryable and traversable, thus enabling high-performance data analytics. AiiDA's workflow language provides advanced automation, error handling features and a flexible plugin model to allow interfacing with any simulation software. The associated plugin registry enables seamless sharing of extensions, empowering a vibrant user community dedicated to making simulations more robust, user-friendly and reproducible.
△ Less
Submitted 24 March, 2020;
originally announced March 2020.
-
AiiDA: Automated Interactive Infrastructure and Database for Computational Science
Authors:
Giovanni Pizzi,
Andrea Cepellotti,
Riccardo Sabatini,
Nicola Marzari,
Boris Kozinsky
Abstract:
Computational science has seen in the last decades a spectacular rise in the scope, breadth, and depth of its efforts. Notwithstanding this prevalence and impact, it is often still performed using the renaissance model of individual artisans gathered in a workshop, under the guidance of an established practitioner. Great benefits could follow instead from adopting concepts and tools coming from co…
▽ More
Computational science has seen in the last decades a spectacular rise in the scope, breadth, and depth of its efforts. Notwithstanding this prevalence and impact, it is often still performed using the renaissance model of individual artisans gathered in a workshop, under the guidance of an established practitioner. Great benefits could follow instead from adopting concepts and tools coming from computer science to manage, preserve, and share these computational efforts. We illustrate here our paradigm sustaining such vision, based around the four pillars of Automation, Data, Environment, and Sharing. We then discuss its implementation in the open-source AiiDA platform (http://www.aiida.net), that has been tuned first to the demands of computational materials science. AiiDA's design is based on directed acyclic graphs to track the provenance of data and calculations, and ensure preservation and searchability. Remote computational resources are managed transparently, and automation is coupled with data storage to ensure reproducibility. Last, complex sequences of calculations can be encoded into scientific workflows. We believe that AiiDA's design and its sharing capabilities will encourage the creation of social ecosystems to disseminate codes, data, and scientific workflows.
△ Less
Submitted 8 September, 2015; v1 submitted 5 April, 2015;
originally announced April 2015.