-
Accelerating Cloud-Based Transcriptomics: Performance Analysis and Optimization of the STAR Aligner Workflow
Authors:
Piotr Kica,
Sabina Lichołai,
Michał Orzechowski,
Maciej Malawski
Abstract:
In this work, we explore the Transcriptomics Atlas pipeline adapted for cost-efficient and high-throughput computing in the cloud. We propose a scalable, cloud-native architecture designed for running a resource-intensive aligner -- STAR -- and processing tens or hundreds of terabytes of RNA-sequencing data. We implement multiple optimization techniques that give significant execution time and cos…
▽ More
In this work, we explore the Transcriptomics Atlas pipeline adapted for cost-efficient and high-throughput computing in the cloud. We propose a scalable, cloud-native architecture designed for running a resource-intensive aligner -- STAR -- and processing tens or hundreds of terabytes of RNA-sequencing data. We implement multiple optimization techniques that give significant execution time and cost reduction. The impact of particular optimizations is measured in medium-scale experiments followed by a large-scale experiment that leverages all of them and validates the current design. Early stopping optimization allows a reduction in total alignment time by 23%. We analyze the scalability and efficiency of one of the most widely used sequence aligners. For the cloud environment, we identify one of the most suitable EC2 instance types and verify the applicability of spot instances usage.
△ Less
Submitted 14 June, 2025;
originally announced June 2025.
-
Serverless Approach to Running Resource-Intensive STAR Aligner
Authors:
Piotr Kica,
Michał Orzechowski,
Maciej Malawski
Abstract:
The application of serverless computing for alignment of RNA-sequences can improve many existing bioinformatics workflows by reducing operational costs and execution times. This work analyzes the applicability of serverless services for running the STAR aligner, which is known for its accuracy and large memory requirement. This presents a challenge, as serverless services were designed for light a…
▽ More
The application of serverless computing for alignment of RNA-sequences can improve many existing bioinformatics workflows by reducing operational costs and execution times. This work analyzes the applicability of serverless services for running the STAR aligner, which is known for its accuracy and large memory requirement. This presents a challenge, as serverless services were designed for light and short tasks. Nevertheless, we successfully deploy a STAR-based pipeline on AWS ECS service, propose multiple optimizations, and perform experiment with 17 TBs of data. Results are compared against standard virtual machine (VM) based solution showing that serverless is a valid alternative for small-scale batch processing. However, in large-scale where efficiency matters the most, VMs are still recommended.
△ Less
Submitted 7 April, 2025;
originally announced April 2025.
-
MICCAI-CDMRI 2023 QuantConn Challenge Findings on Achieving Robust Quantitative Connectivity through Harmonized Preprocessing of Diffusion MRI
Authors:
Nancy R. Newlin,
Kurt Schilling,
Serge Koudoro,
Bramsh Qamar Chandio,
Praitayini Kanakaraj,
Daniel Moyer,
Claire E. Kelly,
Sila Genc,
Jian Chen,
Joseph Yuan-Mou Yang,
Ye Wu,
Yifei He,
Jiawei Zhang,
Qingrun Zeng,
Fan Zhang,
Nagesh Adluru,
Vishwesh Nath,
Sudhir Pathak,
Walter Schneider,
Anurag Gade,
Yogesh Rathi,
Tom Hendriks,
Anna Vilanova,
Maxime Chamberland,
Tomasz Pieciak
, et al. (11 additional authors not shown)
Abstract:
White matter alterations are increasingly implicated in neurological diseases and their progression. International-scale studies use diffusion-weighted magnetic resonance imaging (DW-MRI) to qualitatively identify changes in white matter microstructure and connectivity. Yet, quantitative analysis of DW-MRI data is hindered by inconsistencies stemming from varying acquisition protocols. There is a…
▽ More
White matter alterations are increasingly implicated in neurological diseases and their progression. International-scale studies use diffusion-weighted magnetic resonance imaging (DW-MRI) to qualitatively identify changes in white matter microstructure and connectivity. Yet, quantitative analysis of DW-MRI data is hindered by inconsistencies stemming from varying acquisition protocols. There is a pressing need to harmonize the preprocessing of DW-MRI datasets to ensure the derivation of robust quantitative diffusion metrics across acquisitions. In the MICCAI-CDMRI 2023 QuantConn challenge, participants were provided raw data from the same individuals collected on the same scanner but with two different acquisitions and tasked with preprocessing the DW-MRI to minimize acquisition differences while retaining biological variation. Submissions are evaluated on the reproducibility and comparability of cross-acquisition bundle-wise microstructure measures, bundle shape features, and connectomics. The key innovations of the QuantConn challenge are that (1) we assess bundles and tractography in the context of harmonization for the first time, (2) we assess connectomics in the context of harmonization for the first time, and (3) we have 10x additional subjects over prior harmonization challenge, MUSHAC and 100x over SuperMUDI. We find that bundle surface area, fractional anisotropy, connectome assortativity, betweenness centrality, edge count, modularity, nodal strength, and participation coefficient measures are most biased by acquisition and that machine learning voxel-wise correction, RISH mapping, and NeSH methods effectively reduce these biases. In addition, microstructure measures AD, MD, RD, bundle length, connectome density, efficiency, and path length are least biased by these acquisition differences.
△ Less
Submitted 14 November, 2024;
originally announced November 2024.
-
Optimizing STAR Aligner for High Throughput Computing in the Cloud
Authors:
Piotr Kica,
Sabina Lichołai,
Michał Orzechowski,
Maciej Malawski
Abstract:
We propose a scalable, cloud-native architecture designed for Transcriptomics Atlas Pipeline, using a resource-intensive STAR aligner and processing tens or hundreds of terabytes of RNA-seq data. We implement the pipeline using AWS cloud services, introduce performance optimizations and perform experimental evaluation in the cloud. Our optimization techniques result in computational savings thanks…
▽ More
We propose a scalable, cloud-native architecture designed for Transcriptomics Atlas Pipeline, using a resource-intensive STAR aligner and processing tens or hundreds of terabytes of RNA-seq data. We implement the pipeline using AWS cloud services, introduce performance optimizations and perform experimental evaluation in the cloud. Our optimization techniques result in computational savings thanks to the "early stopping" approach, selection of right-sized resources, and using newer version of Ensembl genome.
△ Less
Submitted 26 August, 2024;
originally announced September 2024.
-
Serverless Computing for Scientific Applications
Authors:
Maciej Malawski,
Bartosz Balis
Abstract:
Serverless computing has become an important model in cloud computing and influenced the design of many applications. Here, we provide our perspective on how the recent landscape of serverless computing for scientific applications looks like. We discuss the advantages and problems with serverless computing for scientific applications, and based on the analysis of existing solutions and approaches,…
▽ More
Serverless computing has become an important model in cloud computing and influenced the design of many applications. Here, we provide our perspective on how the recent landscape of serverless computing for scientific applications looks like. We discuss the advantages and problems with serverless computing for scientific applications, and based on the analysis of existing solutions and approaches, we propose a science-oriented architecture for a serverless computing framework that is based on the existing designs. Finally, we provide an outlook of current trends and future directions.
△ Less
Submitted 4 September, 2023;
originally announced September 2023.
-
Serverless Approach to Sensitivity Analysis of Computational Models
Authors:
Piotr Kica,
Magdalena Otta,
Krzysztof Czechowicz,
Karol Zając,
Piotr Nowakowski,
Andrew Narracott,
Ian Halliday,
Maciej Malawski
Abstract:
Digital twins are virtual representations of physical objects or systems used for the purpose of analysis, most often via computer simulations, in many engineering and scientific disciplines. Recently, this approach has been introduced to computational medicine, within the concept of Digital Twin in Healthcare (DTH). Such research requires verification and validation of its models, as well as the…
▽ More
Digital twins are virtual representations of physical objects or systems used for the purpose of analysis, most often via computer simulations, in many engineering and scientific disciplines. Recently, this approach has been introduced to computational medicine, within the concept of Digital Twin in Healthcare (DTH). Such research requires verification and validation of its models, as well as the corresponding sensitivity analysis and uncertainty quantification (VVUQ).
From the computing perspective, VVUQ is a computationally intensive process, as it requires numerous runs with variations of input parameters. Researchers often use high-performance computing (HPC) solutions to run VVUQ studies where the number of parameter combinations can easily reach tens of thousands. However, there is a viable alternative to HPC for a substantial subset of computational models - serverless computing.
In this paper we hypothesize that using the serverless computing model can be a practical and efficient approach to selected cases of running VVUQ calculations. We show this on the example of the EasyVVUQ library, which we extend by providing support for many serverless services. The resulting library - CloudVVUQ - is evaluated using two real-world applications from the computational medicine domain adapted for serverless execution. Our experiments demonstrate the scalability of the proposed approach.
△ Less
Submitted 17 April, 2023;
originally announced April 2023.
-
Can gamification reduce the burden of self-reporting in mHealth applications? A feasibility study using machine learning from smartwatch data to estimate cognitive load
Authors:
Michal K. Grzeszczyk,
Paulina Adamczyk,
Sylwia Marek,
Ryszard Pręcikowski,
Maciej Kuś,
M. Patrycja Lelujko,
Rosmary Blanco,
Tomasz Trzciński,
Arkadiusz Sitek,
Maciej Malawski,
Aneta Lisowska
Abstract:
The effectiveness of digital treatments can be measured by requiring patients to self-report their state through applications, however, it can be overwhelming and causes disengagement. We conduct a study to explore the impact of gamification on self-reporting. Our approach involves the creation of a system to assess cognitive load (CL) through the analysis of photoplethysmography (PPG) signals. Th…
▽ More
The effectiveness of digital treatments can be measured by requiring patients to self-report their state through applications, however, it can be overwhelming and causes disengagement. We conduct a study to explore the impact of gamification on self-reporting. Our approach involves the creation of a system to assess cognitive load (CL) through the analysis of photoplethysmography (PPG) signals. The data from 11 participants is utilized to train a machine learning model to detect CL. Subsequently, we create two versions of surveys: a gamified and a traditional one. We estimate the CL experienced by other participants (13) while completing surveys. We find that CL detector performance can be enhanced via pre-training on stress detection tasks. For 10 out of 13 participants, a personalized CL detector can achieve an F1 score above 0.7. We find no difference between the gamified and non-gamified surveys in terms of CL but participants prefer the gamified version.
△ Less
Submitted 21 December, 2023; v1 submitted 7 February, 2023;
originally announced February 2023.
-
Using Unused: Non-Invasive Dynamic FaaS Infrastructure with HPC-Whisk
Authors:
Bartłomiej Przybylski,
Maciej Pawlik,
Paweł Żuk,
Bartłomiej Łagosz,
Maciej Malawski,
Krzysztof Rzadca
Abstract:
Modern HPC workload managers and their careful tuning contribute to the high utilization of HPC clusters. However, due to inevitable uncertainty it is impossible to completely avoid node idleness. Although such idle slots are usually too short for any HPC job, they are too long to ignore them. Function-as-a-Service (FaaS) paradigm promisingly fills this gap, and can be a good match, as typical Faa…
▽ More
Modern HPC workload managers and their careful tuning contribute to the high utilization of HPC clusters. However, due to inevitable uncertainty it is impossible to completely avoid node idleness. Although such idle slots are usually too short for any HPC job, they are too long to ignore them. Function-as-a-Service (FaaS) paradigm promisingly fills this gap, and can be a good match, as typical FaaS functions last seconds, not hours. Here we show how to build a FaaS infrastructure on idle nodes in an HPC cluster in such a way that it does not affect the performance of the HPC jobs significantly. We dynamically adapt to a changing set of idle physical machines, by integrating open-source software Slurm and OpenWhisk.
We designed and implemented a prototype solution that allowed us to cover up to 90\% of the idle time slots on a 50k-core cluster that runs production workloads.
△ Less
Submitted 1 November, 2022;
originally announced November 2022.
-
A Serverless Engine for High Energy Physics Distributed Analysis
Authors:
Jacek Kuśnierz,
Vincenzo Eduardo Padulano,
Maciej Malawski,
Kamil Burkiewicz,
Enric Tejedor Saavedra,
Pedro Alonso-Jordá,
Michael Pitt,
Valentina Avati
Abstract:
The Large Hadron Collider (LHC) at CERN has generated in the last decade an unprecedented volume of data for the High-Energy Physics (HEP) field. Scientific collaborations interested in analysing such data very often require computing power beyond a single machine. This issue has been tackled traditionally by running analyses in distributed environments using stateful, managed batch computing syst…
▽ More
The Large Hadron Collider (LHC) at CERN has generated in the last decade an unprecedented volume of data for the High-Energy Physics (HEP) field. Scientific collaborations interested in analysing such data very often require computing power beyond a single machine. This issue has been tackled traditionally by running analyses in distributed environments using stateful, managed batch computing systems. While this approach has been effective so far, current estimates for future computing needs of the field present large scaling challenges. Such a managed approach may not be the only viable way to tackle them and an interesting alternative could be provided by serverless architectures, to enable an even larger scaling potential.
This work describes a novel approach to running real HEP scientific applications through a distributed serverless computing engine. The engine is built upon ROOT, a well-established HEP data analysis software, and distributes its computations to a large pool of concurrent executions on Amazon Web Services Lambda Serverless Platform. Thanks to the developed tool, physicists are able to access datasets stored at CERN (also those that are under restricted access policies) and process it on remote infrastructures outside of their typical environment. The analysis of the serverless functions is monitored at runtime to gather performance metrics, both for data- and computation-intensive workloads.
△ Less
Submitted 2 June, 2022;
originally announced June 2022.
-
CXR-FL: Deep Learning-Based Chest X-ray Image Analysis Using Federated Learning
Authors:
Filip Ślazyk,
Przemysław Jabłecki,
Aneta Lisowska,
Maciej Malawski,
Szymon Płotka
Abstract:
Federated learning enables building a shared model from multicentre data while storing the training data locally for privacy. In this paper, we present an evaluation (called CXR-FL) of deep learning-based models for chest X-ray image analysis using the federated learning method. We examine the impact of federated learning parameters on the performance of central models. Additionally, we show that…
▽ More
Federated learning enables building a shared model from multicentre data while storing the training data locally for privacy. In this paper, we present an evaluation (called CXR-FL) of deep learning-based models for chest X-ray image analysis using the federated learning method. We examine the impact of federated learning parameters on the performance of central models. Additionally, we show that classification models perform worse if trained on a region of interest reduced to segmentation of the lung compared to the full image. However, focusing training of the classification model on the lung area may result in improved pathology interpretability during inference. We also find that federated learning helps maintain model generalizability. The pre-trained weights and code are publicly available at (https://github.com/SanoScience/CXR-FL).
△ Less
Submitted 8 August, 2022; v1 submitted 11 April, 2022;
originally announced April 2022.
-
A Community Roadmap for Scientific Workflows Research and Development
Authors:
Rafael Ferreira da Silva,
Henri Casanova,
Kyle Chard,
Ilkay Altintas,
Rosa M Badia,
Bartosz Balis,
Tainã Coleman,
Frederik Coppens,
Frank Di Natale,
Bjoern Enders,
Thomas Fahringer,
Rosa Filgueira,
Grigori Fursin,
Daniel Garijo,
Carole Goble,
Dorran Howell,
Shantenu Jha,
Daniel S. Katz,
Daniel Laney,
Ulf Leser,
Maciej Malawski,
Kshitij Mehta,
Loïc Pottier,
Jonathan Ozik,
J. Luc Peterson
, et al. (4 additional authors not shown)
Abstract:
The landscape of workflow systems for scientific applications is notoriously convoluted with hundreds of seemingly equivalent workflow systems, many isolated research claims, and a steep learning curve. To address some of these challenges and lay the groundwork for transforming workflows research and development, the WorkflowsRI and ExaWorks projects partnered to bring the international workflows…
▽ More
The landscape of workflow systems for scientific applications is notoriously convoluted with hundreds of seemingly equivalent workflow systems, many isolated research claims, and a steep learning curve. To address some of these challenges and lay the groundwork for transforming workflows research and development, the WorkflowsRI and ExaWorks projects partnered to bring the international workflows community together. This paper reports on discussions and findings from two virtual "Workflows Community Summits" (January and April, 2021). The overarching goals of these workshops were to develop a view of the state of the art, identify crucial research challenges in the workflows community, articulate a vision for potential community efforts, and discuss technical approaches for realizing this vision. To this end, participants identified six broad themes: FAIR computational workflows; AI workflows; exascale challenges; APIs, interoperability, reuse, and standards; training and education; and building a workflows community. We summarize discussions and recommendations for each of these themes.
△ Less
Submitted 8 October, 2021; v1 submitted 5 October, 2021;
originally announced October 2021.
-
Workflows Community Summit: Advancing the State-of-the-art of Scientific Workflows Management Systems Research and Development
Authors:
Rafael Ferreira da Silva,
Henri Casanova,
Kyle Chard,
Tainã Coleman,
Dan Laney,
Dong Ahn,
Shantenu Jha,
Dorran Howell,
Stian Soiland-Reys,
Ilkay Altintas,
Douglas Thain,
Rosa Filgueira,
Yadu Babuji,
Rosa M. Badia,
Bartosz Balis,
Silvina Caino-Lores,
Scott Callaghan,
Frederik Coppens,
Michael R. Crusoe,
Kaushik De,
Frank Di Natale,
Tu M. A. Do,
Bjoern Enders,
Thomas Fahringer,
Anne Fouilloux
, et al. (33 additional authors not shown)
Abstract:
Scientific workflows are a cornerstone of modern scientific computing, and they have underpinned some of the most significant discoveries of the last decade. Many of these workflows have high computational, storage, and/or communication demands, and thus must execute on a wide range of large-scale platforms, from large clouds to upcoming exascale HPC platforms. Workflows will play a crucial role i…
▽ More
Scientific workflows are a cornerstone of modern scientific computing, and they have underpinned some of the most significant discoveries of the last decade. Many of these workflows have high computational, storage, and/or communication demands, and thus must execute on a wide range of large-scale platforms, from large clouds to upcoming exascale HPC platforms. Workflows will play a crucial role in the data-oriented and post-Moore's computing landscape as they democratize the application of cutting-edge research techniques, computationally intensive methods, and use of new computing platforms. As workflows continue to be adopted by scientific projects and user communities, they are becoming more complex. Workflows are increasingly composed of tasks that perform computations such as short machine learning inference, multi-node simulations, long-running machine learning model training, amongst others, and thus increasingly rely on heterogeneous architectures that include CPUs but also GPUs and accelerators. The workflow management system (WMS) technology landscape is currently segmented and presents significant barriers to entry due to the hundreds of seemingly comparable, yet incompatible, systems that exist. Another fundamental problem is that there are conflicting theoretical bases and abstractions for a WMS. Systems that use the same underlying abstractions can likely be translated between, which is not the case for systems that use different abstractions. More information: https://workflowsri.org/summits/technical
△ Less
Submitted 9 June, 2021;
originally announced June 2021.
-
Serverless Containers -- rising viable approach to Scientific Workflows
Authors:
Krzysztof Burkat,
Maciej Pawlik,
Bartosz Balis,
Maciej Malawski,
Karan Vahi,
Mats Rynge,
Rafael Ferreira da Silva,
Ewa Deelman
Abstract:
Increasing popularity of the serverless computing approach has led to the emergence of new cloud infrastructures working in Container-as-a-Service (CaaS) model like AWS Fargate, Google Cloud Run, or Azure Container Instances. They introduce an innovative approach to running cloud containers where developers are freed from managing underlying resources. In this paper, we focus on evaluating capabil…
▽ More
Increasing popularity of the serverless computing approach has led to the emergence of new cloud infrastructures working in Container-as-a-Service (CaaS) model like AWS Fargate, Google Cloud Run, or Azure Container Instances. They introduce an innovative approach to running cloud containers where developers are freed from managing underlying resources. In this paper, we focus on evaluating capabilities of elastic containers and their usefulness for scientific computing in the scientific workflow paradigm using AWS Fargate and Google Cloud Run infrastructures. For experimental evaluation of our approach, we extended HyperFlow engine to support these CaaS platform, together with adapting four real-world scientific workflows composed of several dozen to over a hundred of tasks organized into a dependency graph. We used these workflows to create cost-performance benchmarks and flow execution plots, measuring delays, elasticity, and scalability. The experiments proved that serverless containers can be successfully applied for scientific workflows. Also, the results allow us to gain insights on specific advantages and limits of such platforms.
△ Less
Submitted 21 October, 2020;
originally announced October 2020.
-
Performance considerations on execution of large scale workflow applications on cloud functions
Authors:
Maciej Pawlik,
Kamil Figiela,
Maciej Malawski
Abstract:
Function-as-a-Service is a novel type of cloud service used for creating distributed applications and utilizing computing resources. Application developer supplies source code of cloud functions, which are small applications or application components, while the service provider is responsible for provisioning the infrastructure, scaling and exposing a REST style API. This environment seems to be a…
▽ More
Function-as-a-Service is a novel type of cloud service used for creating distributed applications and utilizing computing resources. Application developer supplies source code of cloud functions, which are small applications or application components, while the service provider is responsible for provisioning the infrastructure, scaling and exposing a REST style API. This environment seems to be adequate for running scientific workflows, which in recent years, have become an established paradigm for implementing and preserving complex scientific processes. In this paper, we present work done on evaluating three major FaaS providers (Amazon, Google, IBM) as a platform for running scientific workflows. The experiments were performed with a dedicated benchmarking framework, which consisted of instrumented workflow execution engine. The testing load was implemented as a large scale bag-of-tasks style workflow, where task count reached 5120 running in parallel. The studied parameters include raw performance, efficiency of infrastructure provisioning, overhead introduced by the API and network layers, as well as aspects related to run time accounting. Conclusions include insights into available performance, expressed as raw GFlops values and charts depicting relation of performance to function size. The infrastructure provisioning proved to be governed by parallelism and rate limits, which can be deducted from included charts. The overhead imposed by using a REST API proved to be a significant contribution to overall run time of individual tasks, and possibly the whole workflow. The paper ends with pointing out possible future work, which includes building performance models and designing a dedicated scheduling algorithms for running scientific workflows on FaaS.
△ Less
Submitted 8 September, 2019;
originally announced September 2019.