-
Low-level I/O Monitoring for Scientific Workflows
Authors:
Joel Witzke,
Ansgar Lößer,
Vasilis Bountris,
Florian Schintke,
Björn Scheuermann
Abstract:
While detailed resource usage monitoring is possible at a low level using proper tools, associating such usage with the higher-level abstractions in the application layer that actually cause it presents a number of challenges. Suppose a large-scale scientific data analysis workflow is run in a distributed execution environment such as a compute cluster or cloud environment, and we want to analyze its I/O behaviour to find and alleviate potential bottlenecks. Different tasks of the workflow can be assigned to arbitrary compute nodes and may even share the same compute nodes. Thus, locally observed resource usage is not directly associated with the individual workflow tasks. By acquiring resource usage profiles of the involved nodes, we seek to correlate the trace data with the workflow and its individual tasks. To accomplish that, we select a set of metadata associated with low-level traces that lets us associate them with higher-level task information obtained from log files of the workflow execution as well as from the job management of a task orchestrator such as Kubernetes with its container management. Ensuring a proper information chain allows the classification of observed I/O on a logical task level and may reveal the most costly or inefficient tasks of a scientific workflow, which are the most promising for optimization.
Submitted 1 August, 2024;
originally announced August 2024.
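The correlation step described in the abstract above can be illustrated with a minimal sketch. It assumes that each low-level trace record carries a node name, a container ID, and a timestamp, and that task runs with the same identifiers can be reconstructed from workflow and orchestrator logs. The record types `IoEvent` and `TaskRun`, their fields, and the function `io_per_task` are hypothetical names chosen for illustration; they are not taken from the paper or from any tracing tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class IoEvent:
    # Hypothetical low-level trace record (assumed fields, not a real tracer's schema).
    node: str
    container_id: str
    timestamp: float
    bytes_moved: int


@dataclass(frozen=True)
class TaskRun:
    # Hypothetical task record assembled from workflow and orchestrator logs.
    task: str
    node: str
    container_id: str
    start: float
    end: float


def io_per_task(events, runs):
    """Sum traced bytes per task; events matching no task run go to 'unattributed'."""
    # Index task runs by the metadata shared with the trace records.
    by_container = defaultdict(list)
    for run in runs:
        by_container[(run.node, run.container_id)].append(run)

    totals = defaultdict(int)
    for ev in events:
        owner = "unattributed"
        for run in by_container.get((ev.node, ev.container_id), []):
            # An event belongs to a run if it falls inside the run's time window.
            if run.start <= ev.timestamp <= run.end:
                owner = run.task
                break
        totals[owner] += ev.bytes_moved
    return dict(totals)


runs = [
    TaskRun("align", "node-1", "c1", 0.0, 10.0),
    TaskRun("sort", "node-1", "c2", 5.0, 20.0),
]
events = [
    IoEvent("node-1", "c1", 1.0, 4096),
    IoEvent("node-1", "c2", 6.0, 1024),
    IoEvent("node-1", "c1", 7.0, 512),
    IoEvent("node-2", "c9", 3.0, 100),
]
print(io_per_task(events, runs))
```

In this sketch two tasks share `node-1`, so the node name alone would not separate their I/O; the container ID and time window provide the missing link, and anything that cannot be matched is reported separately rather than silently dropped.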
-
Towards Advanced Monitoring for Scientific Workflows
Authors:
Jonathan Bader,
Joel Witzke,
Soeren Becker,
Ansgar Lößer,
Fabian Lehmann,
Leon Doehler,
Anh Duc Vu,
Odej Kao
Abstract:
Scientific workflows consist of thousands of highly parallelized tasks executed in a distributed environment involving many components. Automatic tracing and investigation of the components' and tasks' performance metrics, traces, and behavior are necessary to support the end user with a level of abstraction, since the large amount of data cannot be analyzed manually. The execution and monitoring of scientific workflows involve many components: the cluster infrastructure, its resource manager, the workflow, and the workflow tasks. All components in such an execution environment access different monitoring metrics and provide metrics on different abstraction levels. The combination and analysis of observed metrics from different components and their interdependencies are still widely disregarded.
We specify four different monitoring layers that can serve as an architectural blueprint for the monitoring responsibilities and the interactions of components in the scientific workflow execution context. We describe the different monitoring metrics subject to the four layers and how the layers interact. Finally, we examine five state-of-the-art scientific workflow management systems (SWMS) in order to assess which steps are needed to enable our four-layer-based approach.
Submitted 18 July, 2023; v1 submitted 23 November, 2022;
originally announced November 2022.
-
BottleMod: Modeling Data Flows and Tasks for Fast Bottleneck Analysis
Authors:
Ansgar Lößer,
Joel Witzke,
Florian Schintke,
Björn Scheuermann
Abstract:
In recent years, scientific workflows have gained more and more popularity. In scientific workflows, tasks are typically treated as black boxes. Dealing with their complex interrelations to identify optimization potentials and bottlenecks is therefore inherently hard. The progress of a scientific workflow depends on several factors, including the available input data, the available computational power, and the I/O and network bandwidth. Here, we tackle the problem of predicting the workflow progress with very low overhead. To this end, we look at suitable formalizations for the key parameters and their interactions which are sufficiently flexible to describe the input data consumption, the computational effort, and the output production of the workflow's tasks. At the same time, they allow for computationally simple and fast performance predictions, including a bottleneck analysis over the workflow runtime. A piecewise-defined bottleneck function is derived from the discrete intersections of the task models' limiting functions. This allows estimating potential performance gains from overcoming the bottlenecks and can be used as a basis for optimized resource allocation and workflow execution.
Submitted 12 September, 2022;
originally announced September 2022.
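The idea of a piecewise-defined bottleneck function built from intersections of limiting functions can be illustrated with a deliberately simplified sketch. It assumes that each limiting factor is a linear upper bound on task progress over time, given as a slope and an intercept, and that the factor with the lowest bound at a given time is the bottleneck. This linear form, the function name `bottleneck_segments`, and the factor names are assumptions made for illustration; the paper's actual task models are not reproduced here.

```python
from itertools import combinations


def bottleneck_segments(limits, t_start, t_end):
    """Split [t_start, t_end] into segments labelled with the limiting factor.

    limits maps a factor name to (slope, intercept) of an assumed linear
    upper bound on progress. Returns a list of (start, end, name) tuples.
    """
    # Candidate breakpoints: the interval ends plus every pairwise crossing inside it.
    points = {t_start, t_end}
    for (a1, b1), (a2, b2) in combinations(limits.values(), 2):
        if a1 != a2:  # parallel bounds never cross
            t = (b2 - b1) / (a1 - a2)
            if t_start < t < t_end:
                points.add(t)
    cuts = sorted(points)

    segments = []
    for left, right in zip(cuts, cuts[1:]):
        # Between two breakpoints the ordering is fixed, so the midpoint decides.
        mid = (left + right) / 2
        name = min(limits, key=lambda n: limits[n][0] * mid + limits[n][1])
        if segments and segments[-1][2] == name:
            # Merge with the previous segment if the bottleneck did not change.
            segments[-1] = (segments[-1][0], right, name)
        else:
            segments.append((left, right, name))
    return segments


limits = {"input": (2.0, 0.0), "cpu": (1.0, 4.0)}
print(bottleneck_segments(limits, 0.0, 10.0))
# [(0.0, 4.0, 'input'), (4.0, 10.0, 'cpu')]
```

Here the input data limits progress until the two bounds cross at t = 4, after which the computation becomes the bottleneck. Only the crossing points need to be computed, which is what keeps this kind of analysis cheap compared with simulating the run step by step.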