-
Scalable Runtime Architecture for Data-driven, Hybrid HPC and ML Workflow Applications
Authors:
Andre Merzky,
Mikhail Titov,
Matteo Turilli,
Ozgur Kilic,
Tianle Wang,
Shantenu Jha
Abstract:
Hybrid workflows combining traditional HPC and novel ML methodologies are transforming scientific computing. This paper presents the architecture and implementation of a scalable runtime system that extends RADICAL-Pilot with service-based execution to support AI-out-HPC workflows. Our runtime system enables distributed ML capabilities, efficient resource management, and seamless HPC/ML coupling across local and remote platforms. Preliminary experimental results show that our approach manages concurrent execution of ML models across local and remote HPC/cloud resources with minimal architectural overheads. This lays the foundation for prototyping three representative data-driven workflow applications and executing them at scale on leadership-class HPC platforms.
Submitted 17 March, 2025;
originally announced March 2025.
-
Deep RC: A Scalable Data Engineering and Deep Learning Pipeline
Authors:
Arup Kumar Sarker,
Aymen Alsaadi,
Alexander James Halpern,
Prabhath Tangella,
Mikhail Titov,
Niranda Perera,
Mills Staylor,
Gregor von Laszewski,
Shantenu Jha,
Geoffrey Fox
Abstract:
Significant obstacles exist in scientific domains including genetics, climate modeling, and astronomy due to the difficulty of managing, preprocessing, and training on complicated data for deep learning. Although several large-scale solutions offer distributed execution environments, open-source alternatives that integrate scalable runtime tools, deep learning and data frameworks on high-performance computing platforms remain crucial for accessibility and flexibility. In this paper, we introduce Deep Radical-Cylon (RC), a heterogeneous runtime system that combines data engineering, deep learning frameworks, and workflow engines across several HPC environments, including cloud and supercomputing infrastructures. Deep RC supports heterogeneous systems with accelerators, allows the usage of communication libraries like MPI, GLOO and NCCL across multi-node setups, and facilitates parallel and distributed deep learning pipelines by utilizing Radical Pilot as a task execution framework. When running an end-to-end pipeline including preprocessing, model training, and postprocessing with 11 neural forecasting models (PyTorch) and hydrology models (TensorFlow) under identical resource conditions, the system saves 3.28 and 75.9 seconds, respectively. The design of Deep RC guarantees the smooth integration of scalable data frameworks, such as Cylon, with deep learning processes, exhibiting strong performance on cloud platforms and scientific HPC systems. By offering a flexible, high-performance solution for resource-intensive applications, this method closes the gap between data preprocessing, model training, and postprocessing.
Submitted 22 April, 2025; v1 submitted 28 February, 2025;
originally announced February 2025.
-
Exascale Workflow Applications and Middleware: An ExaWorks Retrospective
Authors:
Aymen Alsaadi,
Mihael Hategan-Marandiuc,
Ketan Maheshwari,
Andre Merzky,
Mikhail Titov,
Matteo Turilli,
Andreas Wilke,
Justin M. Wozniak,
Kyle Chard,
Rafael Ferreira da Silva,
Shantenu Jha,
Daniel Laney
Abstract:
Exascale computers offer transformative capabilities to combine data-driven and learning-based approaches with traditional simulation applications to accelerate scientific discovery and insight. However, these software combinations and integrations are difficult to achieve due to the challenges of coordinating and deploying heterogeneous software components on diverse and massive platforms. We present the ExaWorks project, which addresses many of these challenges. We developed a workflow Software Development Toolkit (SDK), a curated collection of workflow technologies that can be composed and interoperated through a common interface, engineered following current best practices, and specifically designed to work on HPC platforms. ExaWorks also developed PSI/J, a job management abstraction API, to simplify the construction of portable software components and applications that can be used over various HPC schedulers. The PSI/J API is a minimal interface for submitting and monitoring jobs and their execution state across multiple and commonly used HPC schedulers. We also describe several leading and innovative workflow examples of ExaWorks tools used on DOE leadership platforms. Furthermore, we discuss how our project is working with the workflow community, large computing facilities, and HPC platform vendors to address the requirements of workflows sustainably at the exascale.
Submitted 15 November, 2024;
originally announced November 2024.
-
Workflows Community Summit 2024: Future Trends and Challenges in Scientific Workflows
Authors:
Rafael Ferreira da Silva,
Deborah Bard,
Kyle Chard,
Shaun de Witt,
Ian T. Foster,
Tom Gibbs,
Carole Goble,
William Godoy,
Johan Gustafsson,
Utz-Uwe Haus,
Stephen Hudson,
Shantenu Jha,
Laila Los,
Drew Paine,
Frédéric Suter,
Logan Ward,
Sean Wilkinson,
Marcos Amaris,
Yadu Babuji,
Jonathan Bader,
Riccardo Balin,
Daniel Balouek,
Sarah Beecroft,
Khalid Belhajjame,
Rajat Bhattarai
, et al. (86 additional authors not shown)
Abstract:
The Workflows Community Summit gathered 111 participants from 18 countries to discuss emerging trends and challenges in scientific workflows, focusing on six key areas: time-sensitive workflows, AI-HPC convergence, multi-facility workflows, heterogeneous HPC environments, user experience, and FAIR computational workflows. The integration of AI and exascale computing has revolutionized scientific workflows, enabling higher-fidelity models and complex, time-sensitive processes, while introducing challenges in managing heterogeneous environments and multi-facility data dependencies. The rise of large language models is driving computational demands to zettaflop scales, necessitating modular, adaptable systems and cloud-service models to optimize resource utilization and ensure reproducibility. Multi-facility workflows present challenges in data movement, curation, and overcoming institutional silos, while diverse hardware architectures require integrating workflow considerations into early system design and developing standardized resource management tools. The summit emphasized improving user experience in workflow systems and ensuring FAIR workflows to enhance collaboration and accelerate scientific discovery. Key recommendations include developing standardized metrics for time-sensitive workflows, creating frameworks for cloud-HPC integration, implementing distributed-by-design workflow modeling, establishing multi-facility authentication protocols, and accelerating AI integration in HPC workflow management. The summit also called for comprehensive workflow benchmarks, workflow-specific UX principles, and a FAIR workflow maturity model, highlighting the need for continued collaboration in addressing the complex challenges posed by the convergence of AI, HPC, and multi-facility research environments.
Submitted 18 October, 2024;
originally announced October 2024.
-
ExaWorks Software Development Kit: A Robust and Scalable Collection of Interoperable Workflow Technologies
Authors:
Matteo Turilli,
Mihael Hategan-Marandiuc,
Mikhail Titov,
Ketan Maheshwari,
Aymen Alsaadi,
Andre Merzky,
Ramon Arambula,
Mikhail Zakharchanka,
Matt Cowan,
Justin M. Wozniak,
Andreas Wilke,
Ozgur Ozan Kilic,
Kyle Chard,
Rafael Ferreira da Silva,
Shantenu Jha,
Daniel Laney
Abstract:
Scientific discovery increasingly requires executing heterogeneous scientific workflows on high-performance computing (HPC) platforms. Heterogeneous workflows contain different types of tasks (e.g., simulation, analysis, and learning) that need to be mapped, scheduled, and launched on different computing resources. That requires a software stack that enables users to code their workflows and automate resource management and workflow execution. Currently, there are many workflow technologies with diverse levels of robustness and capabilities, and users face difficult choices of software that can effectively and efficiently support their use cases on HPC machines, especially when considering the latest exascale platforms. We contributed to addressing this issue by developing the ExaWorks Software Development Kit (SDK). The SDK is a curated collection of workflow technologies engineered following current best practices and specifically designed to work on HPC platforms. We present our experience with (1) curating those technologies, (2) integrating them to provide users with new capabilities, (3) developing a continuous integration platform to test the SDK on DOE HPC platforms, (4) designing a dashboard to publish the results of those tests, and (5) devising an innovative documentation platform to help users use those technologies. Our experience details the requirements and the best practices needed to curate workflow technologies, and it also serves as a blueprint for the capabilities and services that DOE will have to offer to support a variety of scientific heterogeneous workflows on the newly available exascale HPC platforms.
Submitted 23 July, 2024;
originally announced July 2024.
-
Scaling on Frontier: Uncertainty Quantification Workflow Applications using ExaWorks to Enable Full System Utilization
Authors:
Mikhail Titov,
Robert Carson,
Matthew Rolchigo,
John Coleman,
James Belak,
Matthew Bement,
Daniel Laney,
Matteo Turilli,
Shantenu Jha
Abstract:
When running at scale, modern scientific workflows require middleware to handle allocated resources, distribute computing payloads and guarantee a resilient execution. While individual steps might not require sophisticated control methods, bringing them together as a whole workflow requires advanced management mechanisms. In this work, we used RADICAL-EnTK (Ensemble Toolkit) - one of the SDK components of the ECP ExaWorks project - to implement and execute the novel Exascale Additive Manufacturing (ExaAM) workflows on up to 8000 compute nodes of the Frontier supercomputer at the Oak Ridge Leadership Computing Facility. EnTK allowed us to address challenges such as varying resource requirements (e.g., heterogeneity, size, and runtime), different execution environment per workflow, and fault tolerance. And a native portability feature of the developed EnTK applications allowed us to adjust these applications for Frontier runs promptly, while ensuring an expected level of resource utilization (up to 90%).
Submitted 1 July, 2024;
originally announced July 2024.
-
Workflow Mini-Apps: Portable, Scalable, Tunable & Faithful Representations of Scientific Workflows
Authors:
Ozgur Ozan Kilic,
Tianle Wang,
Matteo Turilli,
Mikhail Titov,
Andre Merzky,
Line Pouchard,
Shantenu Jha
Abstract:
Workflows are critical for scientific discovery. However, the sophistication, heterogeneity, and scale of workflows make building, testing, and optimizing them increasingly challenging. Furthermore, their complexity and heterogeneity make performance reproducibility hard. In this paper, we propose workflow mini-apps as a tool to address the challenges in building and testing workflows while controlling the fidelity of representing real-world workflows. Workflow mini-apps are deployed and run on various HPC systems and architectures without workflow-specific constraints. We offer insight into their design and implementation, providing an analysis of their performance and reproducibility. Workflow mini-apps thus advance the science of workflows by providing simple, portable, and managed (fidelity) representations of otherwise complex and difficult-to-control real workflows.
Submitted 26 March, 2024;
originally announced March 2024.
-
Design and Implementation of an Analysis Pipeline for Heterogeneous Data
Authors:
Arup Kumar Sarker,
Aymen Alsaadi,
Niranda Perera,
Mills Staylor,
Gregor von Laszewski,
Matteo Turilli,
Ozgur Ozan Kilic,
Mikhail Titov,
Andre Merzky,
Shantenu Jha,
Geoffrey Fox
Abstract:
Managing and preparing complex data for deep learning, a prevalent approach in large-scale data science, can be challenging. Data transfer for model training also presents difficulties, impacting scientific fields like genomics, climate modeling, and astronomy. A large-scale solution like Google Pathways with a distributed execution environment for deep learning models exists but is proprietary. Integrating existing open-source, scalable runtime tools and data frameworks on high-performance computing (HPC) platforms is crucial to address these challenges. Our objective is to establish a smooth and unified method of combining data engineering and deep learning frameworks with diverse execution capabilities that can be deployed on various high-performance computing platforms, including cloud and supercomputers. We aim to support heterogeneous systems with accelerators, where Cylon and other data engineering and deep learning frameworks can utilize heterogeneous execution. To achieve this, we propose Radical-Cylon, a heterogeneous runtime system with a parallel and distributed data framework to execute Cylon as a task of Radical Pilot. We thoroughly explain Radical-Cylon's design and development and the execution process of Cylon tasks using Radical Pilot. This approach enables the use of heterogeneous MPI-communicators across multiple nodes. Radical-Cylon achieves better performance than Bare-Metal Cylon with minimal and constant overhead. Radical-Cylon achieves 4-15% faster execution time than batch execution while performing similar join and sort operations with 35 million and 3.5 billion rows with the same resources. The approach aims to excel in both scientific and engineering research HPC systems while demonstrating robust performance on cloud infrastructures. This dual capability fosters collaboration and innovation within the open-source scientific research community.
Submitted 7 April, 2024; v1 submitted 23 March, 2024;
originally announced March 2024.
-
A Semi-Supervised Deep Learning Approach to Dataset Collection for Query-By-Humming Task
Authors:
Amantur Amatov,
Dmitry Lamanov,
Maksim Titov,
Ivan Vovk,
Ilya Makarov,
Mikhail Kudinov
Abstract:
Query-by-Humming (QbH) is a task that involves finding the most relevant song based on a hummed or sung fragment. Despite recent successful commercial solutions, implementing QbH systems remains challenging due to the lack of high-quality datasets for training machine learning models. In this paper, we propose a deep learning data collection technique and introduce Covers and Hummings Aligned Dataset (CHAD), a novel dataset that contains 18 hours of short music fragments, paired with time-aligned hummed versions. To expand our dataset, we employ a semi-supervised model training pipeline that leverages the QbH task as a specialized case of cover song identification (CSI) task. Starting with a model trained on the initial dataset, we iteratively collect groups of fragments of cover versions of the same song and retrain the model on the extended data. Using this pipeline, we collect over 308 hours of additional music fragments, paired with time-aligned cover versions. The final model is successfully applied to the QbH task and achieves competitive results on benchmark datasets. Our study shows that the proposed dataset and training pipeline can effectively facilitate the implementation of QbH systems.
Submitted 2 December, 2023;
originally announced December 2023.
-
Workflows Community Summit 2022: A Roadmap Revolution
Authors:
Rafael Ferreira da Silva,
Rosa M. Badia,
Venkat Bala,
Debbie Bard,
Peer-Timo Bremer,
Ian Buckley,
Silvina Caino-Lores,
Kyle Chard,
Carole Goble,
Shantenu Jha,
Daniel S. Katz,
Daniel Laney,
Manish Parashar,
Frederic Suter,
Nick Tyler,
Thomas Uram,
Ilkay Altintas,
Stefan Andersson,
William Arndt,
Juan Aznar,
Jonathan Bader,
Bartosz Balis,
Chris Blanton,
Kelly Rosa Braghetto,
Aharon Brodutch
, et al. (80 additional authors not shown)
Abstract:
Scientific workflows have become integral tools in broad scientific computing use cases. Science discovery is increasingly dependent on workflows to orchestrate large and complex scientific experiments that range from execution of a cloud-based data preprocessing pipeline to multi-facility instrument-to-edge-to-HPC computational workflows. Given the changing landscape of scientific computing and the evolving needs of emerging scientific applications, it is paramount that the development of novel scientific workflows and system functionalities seek to increase the efficiency, resilience, and pervasiveness of existing systems and applications. Specifically, the proliferation of machine learning/artificial intelligence (ML/AI) workflows, need for processing large-scale datasets produced by instruments at the edge, intensification of near real-time data processing, support for long-term experiment campaigns, and emergence of quantum computing as an adjunct to HPC, have significantly changed the functional and operational requirements of workflow systems. Workflow systems now need to, for example, support data streams from the edge-to-cloud-to-HPC, enable the management of many small-sized files, allow data reduction while ensuring high accuracy, and orchestrate distributed services (workflows, instruments, data movement, provenance, publication, etc.) across computing and user facilities, among others. Further, to accelerate science, it is also necessary that these systems implement specifications/standards and APIs for seamless (horizontal and vertical) integration between systems and applications, as well as enabling the publication of workflows and their associated products according to the FAIR principles. This document reports on discussions and findings from the 2022 international edition of the Workflows Community Summit that took place on November 29 and 30, 2022.
Submitted 31 March, 2023;
originally announced April 2023.
-
The Ghost of Performance Reproducibility Past
Authors:
Srinivasan Ramesh,
Mikhail Titov,
Matteo Turilli,
Shantenu Jha,
Allen Malony
Abstract:
The importance of ensemble computing is well established. However, executing ensembles at scale introduces interesting performance fluctuations that have not been well investigated. In this paper, we trace our experience uncovering performance fluctuations of ensemble applications (primarily constituting a workflow of GROMACS tasks), and unsuccessful attempts, so far, at trying to discern the underlying cause(s) of performance fluctuations. Is the failure to discern the causative or contributing factors a failure of capability? Or imagination? Do the fluctuations have their genesis in some inscrutable aspect of the system or software? Does it warrant a fundamental reassessment and rethinking of how we assume and conceptualize performance reproducibility? Answers to these questions are not straightforward, nor are they immediate or obvious. We conclude with a discussion about the performance of ensemble applications and ruminate over the implications for how we define and measure application performance.
Submitted 27 August, 2022;
originally announced August 2022.
-
ExaWorks: Workflows for Exascale
Authors:
Aymen Al-Saadi,
Dong H. Ahn,
Yadu Babuji,
Kyle Chard,
James Corbett,
Mihael Hategan,
Stephen Herbein,
Shantenu Jha,
Daniel Laney,
Andre Merzky,
Todd Munson,
Michael Salim,
Mikhail Titov,
Matteo Turilli,
Justin M. Wozniak
Abstract:
Exascale computers will offer transformative capabilities to combine data-driven and learning-based approaches with traditional simulation applications to accelerate scientific discovery and insight. These software combinations and integrations, however, are difficult to achieve due to challenges of coordination and deployment of heterogeneous software components on diverse and massive platforms. We present the ExaWorks project, which can address many of these challenges: ExaWorks is leading a co-design process to create a workflow software development Toolkit (SDK) consisting of a wide range of workflow management tools that can be composed and interoperate through common interfaces. We describe the initial set of tools and interfaces supported by the SDK, efforts to make them easier to apply to complex science challenges, and examples of their application to exemplar cases. Furthermore, we discuss how our project is working with the workflows community, large computing facilities as well as HPC platform vendors to sustainably address the requirements of workflows at the exascale.
Submitted 30 August, 2021;
originally announced August 2021.
-
Pandemic Drugs at Pandemic Speed: Infrastructure for Accelerating COVID-19 Drug Discovery with Hybrid Machine Learning- and Physics-based Simulations on High Performance Computers
Authors:
Agastya P. Bhati,
Shunzhou Wan,
Dario Alfè,
Austin R. Clyde,
Mathis Bode,
Li Tan,
Mikhail Titov,
Andre Merzky,
Matteo Turilli,
Shantenu Jha,
Roger R. Highfield,
Walter Rocchia,
Nicola Scafuri,
Sauro Succi,
Dieter Kranzlmüller,
Gerald Mathias,
David Wifling,
Yann Donon,
Alberto Di Meglio,
Sofia Vallecorsa,
Heng Ma,
Anda Trifan,
Arvind Ramanathan,
Tom Brettin,
Alexander Partin
, et al. (4 additional authors not shown)
Abstract:
The race to meet the challenges of the global pandemic has served as a reminder that the existing drug discovery process is expensive, inefficient and slow. There is a major bottleneck in screening the vast number of potential small molecules to shortlist lead compounds for antiviral drug development. New opportunities to accelerate drug discovery lie at the interface between machine learning methods, in this case developed for linear accelerators, and physics-based methods. The two in silico methods each have their own advantages and limitations which, interestingly, complement each other. Here, we present an innovative infrastructural development that combines both approaches to accelerate drug discovery. The scale of the potential resulting workflow is such that it is dependent on supercomputing to achieve extremely high throughput. We have demonstrated the viability of this workflow for the study of inhibitors for four COVID-19 target proteins and our ability to perform the required large-scale calculations to identify lead antiviral compounds through repurposing on a variety of supercomputers.
Submitted 4 September, 2021; v1 submitted 4 March, 2021;
originally announced March 2021.
-
Design and Performance Characterization of RADICAL-Pilot on Leadership-class Platforms
Authors:
Andre Merzky,
Matteo Turilli,
Mikhail Titov,
Aymen Al-Saadi,
Shantenu Jha
Abstract:
Many extreme-scale scientific applications have workloads comprised of a large number of individual high-performance tasks. The Pilot abstraction decouples workload specification, resource management, and task execution via job placeholders and late-binding. As such, suitable implementations of the Pilot abstraction can support the collective execution of a large number of tasks on supercomputers. We introduce RADICAL-Pilot (RP) as a portable, modular and extensible pilot-enabled runtime system. We describe RP's design, architecture and implementation. We characterize its performance and show its ability to scalably execute workloads comprised of tens of thousands of heterogeneous tasks on DOE and NSF leadership-class HPC platforms. Specifically, we investigate RP's weak/strong scaling with CPU/GPU, single/multi core, (non)MPI tasks and Python functions when using most of ORNL Summit and TACC Frontera. RADICAL-Pilot can be used stand-alone or as the runtime for third-party workflow systems.
Submitted 2 November, 2021; v1 submitted 26 February, 2021;
originally announced March 2021.
-
Scalable HPC and AI Infrastructure for COVID-19 Therapeutics
Authors:
Hyungro Lee,
Andre Merzky,
Li Tan,
Mikhail Titov,
Matteo Turilli,
Dario Alfe,
Agastya Bhati,
Alex Brace,
Austin Clyde,
Peter Coveney,
Heng Ma,
Arvind Ramanathan,
Rick Stevens,
Anda Trifan,
Hubertus Van Dam,
Shunzhou Wan,
Sean Wilkinson,
Shantenu Jha
Abstract:
COVID-19 has claimed more than 1 million lives and resulted in over 40 million infections. There is an urgent need to identify drugs that can inhibit SARS-CoV-2. In response, the DOE recently established the Medical Therapeutics project as part of the National Virtual Biotechnology Laboratory, and tasked it with creating the computational infrastructure and methods necessary to advance therapeutics development. We discuss innovations in computational infrastructure and methods that are accelerating and advancing drug design. Specifically, we describe several methods that integrate artificial intelligence and simulation-based approaches, and the design of computational infrastructure to support these methods at scale. We discuss their implementation and characterize their performance, and highlight science advances that these capabilities have enabled.
Submitted 20 October, 2020;
originally announced October 2020.
-
IMPECCABLE: Integrated Modeling PipelinE for COVID Cure by Assessing Better LEads
Authors:
Aymen Al Saadi,
Dario Alfe,
Yadu Babuji,
Agastya Bhati,
Ben Blaiszik,
Thomas Brettin,
Kyle Chard,
Ryan Chard,
Peter Coveney,
Anda Trifan,
Alex Brace,
Austin Clyde,
Ian Foster,
Tom Gibbs,
Shantenu Jha,
Kristopher Keipert,
Thorsten Kurth,
Dieter Kranzlmüller,
Hyungro Lee,
Zhuozhao Li,
Heng Ma,
Andre Merzky,
Gerald Mathias,
Alexander Partin,
Junqi Yin
, et al. (11 additional authors not shown)
Abstract:
The drug discovery process currently employed in the pharmaceutical industry typically requires about 10 years and $2-3 billion to deliver one new drug. This is both too expensive and too slow, especially in emergencies like the COVID-19 pandemic. In silico methodologies need to be improved to better select lead compounds that can proceed to later stages of the drug discovery protocol, accelerating the entire process. No single methodological approach can achieve the necessary accuracy with required efficiency. Here we describe multiple algorithmic innovations to overcome this fundamental limitation, and the development and deployment of computational infrastructure at scale that integrates multiple artificial intelligence and simulation-based approaches. Three measures of performance are: (i) throughput, the number of ligands per unit time; (ii) scientific performance, the number of effective ligands sampled per unit time; and (iii) peak performance, in flop/s. The capabilities outlined here have been used in production for several months as the workhorse of the computational infrastructure to support the capabilities of the US-DOE National Virtual Biotechnology Laboratory in combination with resources from the EU Centre of Excellence in Computational Biomedicine.
Submitted 13 October, 2020;
originally announced October 2020.