-
Cloud Infrastructure Provenance Collection and Management to Reproduce Scientific Workflow Execution
Authors:
Khawar Hasham,
Kamran Munir,
Richard McClatchey
Abstract:
The emergence of Cloud computing provides a new computing paradigm for scientific workflow execution. It provides dynamic, on-demand and scalable resources that enable the processing of complex workflow-based experiments. With the ever-growing size of experimental data and increasingly complex processing workflows, reproducibility has become essential. Provenance has been regarded as a mechanism to verify a workflow and to provide workflow reproducibility. One of the obstacles to reproducing an experiment execution is the lack of information about the execution infrastructure in the collected provenance. This information becomes critical in the context of the Cloud, in which resources are provisioned on demand by specifying resource configurations. Therefore, a mechanism is required that captures infrastructure information along with the provenance of workflows executing on the Cloud, to facilitate the re-creation of the execution environment on the Cloud. This paper presents a framework, ReCAP, along with the proposed mapping approaches that aid in capturing Cloud-aware provenance information and help in re-provisioning the execution resources on the Cloud with similar configurations. Experimental evaluation has shown the impact of different resource configurations on workflow execution performance, and therefore justifies the need for collecting such provenance information in the context of the Cloud. The evaluation has also demonstrated that the proposed mapping approaches can capture Cloud information in various Cloud usage scenarios without causing performance overhead and can also enable the re-provisioning of resources on the Cloud. Experiments were conducted using workflows from different scientific domains, such as astronomy and neuroscience, to demonstrate the applicability of this research to different workflows.
Submitted 19 March, 2018;
originally announced March 2018.
-
Using Cloud-Aware Provenance to Reproduce Scientific Workflow Execution on Cloud
Authors:
Khawar Hasham,
Kamran Munir,
Richard McClatchey
Abstract:
Provenance has been regarded as a mechanism to verify a workflow and to provide workflow reproducibility. This provenance of scientific workflows has been effectively carried out in Grid-based scientific workflow systems. However, the recent adoption of Cloud-based scientific workflows presents an opportunity to investigate the suitability of existing approaches or propose new approaches to collect provenance information from the Cloud and to utilize it for workflow repeatability in the Cloud infrastructure. This paper presents a novel approach that can assist in mitigating this challenge. This approach can collect Cloud infrastructure information from an outside Cloud client along with workflow provenance and can establish a mapping between them. This mapping is later used to re-provision resources on the Cloud for workflow execution. Reproducibility of the workflow execution is achieved by: (a) capturing the Cloud infrastructure information (virtual machine configuration) along with the workflow provenance, (b) re-provisioning similar resources on the Cloud and re-executing the workflow on them, and (c) comparing the outputs of the workflows. The evaluation of the prototype suggests that the proposed approach is feasible and can be investigated further. Moreover, no reference reproducibility model exists in the literature that can provide guidelines to achieve this goal in the Cloud. This paper also attempts to present a model that is used in the proposed design to achieve workflow reproducibility in the Cloud environment.
Submitted 29 November, 2015;
originally announced November 2015.
-
Scientific Workflow Repeatability through Cloud-Aware Provenance
Authors:
Khawar Hasham,
Kamran Munir,
Jetendr Shamdasani,
Richard McClatchey
Abstract:
The transformations, analyses and interpretations of data in scientific workflows are vital for the repeatability and reliability of scientific workflows. This provenance of scientific workflows has been effectively carried out in Grid-based scientific workflow systems. However, the recent adoption of Cloud-based scientific workflows presents an opportunity to investigate the suitability of existing approaches or propose new approaches to collect provenance information from the Cloud and to utilize it for workflow repeatability in the Cloud infrastructure. The dynamic nature of the Cloud makes this difficult because, unlike in the Grid, resources are provisioned on demand. This paper presents a novel approach that can assist in mitigating this challenge. This approach can collect Cloud infrastructure information along with workflow provenance and can establish a mapping between them. This mapping is later used to re-provision resources on the Cloud. Repeatability of the workflow execution is achieved by: (a) capturing the Cloud infrastructure information (virtual machine configuration) along with the workflow provenance, and (b) re-provisioning similar resources on the Cloud and re-executing the workflow on them. The evaluation of an initial prototype suggests that the proposed approach is feasible and can be investigated further.
Submitted 5 February, 2015;
originally announced February 2015.
-
An Integrated e-science Analysis Base for Computation Neuroscience Experiments and Analysis
Authors:
Kamran Munir,
Saad Liaquat Kiani,
Khawar Hasham,
Richard McClatchey,
Andrew Branson,
Jetendr Shamdasani,
the N4U Consortium
Abstract:
Recent developments in data management and imaging technologies have significantly affected diagnostic and extrapolative research in the understanding of neurodegenerative diseases. However, the impact of these new technologies is largely dependent on the speed and reliability with which the medical data can be visualised, analysed and interpreted. The EU's neuGRID for Users (N4U) is a follow-on project to neuGRID, which aims to provide an integrated environment to carry out computational neuroscience experiments. This paper reports on the design and development of the N4U Analysis Base and related Information Services, which address existing research and practical challenges by offering an integrated medical data analysis environment with the necessary building blocks for neuroscientists to optimally exploit neuroscience workflows, large image datasets and algorithms in order to conduct analyses. The N4U Analysis Base enables such analyses by indexing and interlinking the neuroimaging and clinical study datasets stored on the N4U Grid infrastructure, algorithms and scientific workflow definitions along with their associated provenance information.
Submitted 24 February, 2014;
originally announced February 2014.
-
CMS Workflow Execution using Intelligent Job Scheduling and Data Access Strategies
Authors:
Khawar Hasham,
Antonio Delgado Peris,
Ashiq Anjum,
Dave Evans,
Dirk Hufnagel,
Eduardo Huedo,
José M. Hernández,
Richard McClatchey,
Stephen Gowdy,
Simon Metson
Abstract:
Complex scientific workflows can process large amounts of data using thousands of tasks. The turnaround times of these workflows are often affected by various latencies, such as the resource discovery, scheduling and data access latencies for the individual workflow processes or actors. Minimizing these latencies will improve the overall execution time of a workflow and thus lead to a more efficient and robust processing environment. In this paper, we propose a pilot-job-based infrastructure that has intelligent data reuse and job execution strategies to minimize the scheduling, queuing, execution and data access latencies. The results have shown that significant improvements in the overall turnaround time of a workflow can be achieved with this approach. The proposed approach has been evaluated first using the CMS Tier0 data processing workflow, and then by simulating the workflows to evaluate its effectiveness in a controlled environment.
Submitted 24 February, 2012;
originally announced February 2012.
-
DIANA Scheduling Hierarchies for Optimizing Bulk Job Scheduling
Authors:
A. Anjum,
R. McClatchey,
H. Stockinger,
A. Ali,
I. Willers,
M. Thomas,
M. Sagheer,
K. Hasham,
O. Alvi
Abstract:
The use of meta-schedulers for resource management in large-scale distributed systems often leads to a hierarchy of schedulers. In this paper, we discuss why existing meta-scheduling hierarchies are sometimes not sufficient for Grid systems due to their inability to re-organise jobs already scheduled locally. Such a job re-organisation is required to adapt to evolving loads which are common in heavily used Grid infrastructures. We propose a peer-to-peer scheduling model and evaluate it using case studies and mathematical modelling. We detail the DIANA (Data Intensive and Network Aware) scheduling algorithm and its queue management system for coping with the load distribution and for supporting bulk job scheduling. We demonstrate that such a system is beneficial for dynamic, distributed and self-organizing resource management and can assist in optimizing load or job distribution in complex Grid infrastructures.
Submitted 5 July, 2007;
originally announced July 2007.