-
A Graph-based Approach to Variant Extraction
Authors:
Mark A. Santcroos,
Walter A. Kosters,
Mihai Lefter,
Jeroen F. J. Laros,
Jonathan K. Vis
Abstract:
Accurate variant descriptions are of paramount importance in the field of genetics. The domain is confronted with increasingly complex variants, making it more challenging to generate proper variant descriptions. We present a graph based on all minimal alignments that is a complete representation of a variant and we provide three complementary extraction methods to derive variant descriptions from this graph. Our experiments show that our method in comparison with dbSNP results in identical HGVS descriptions for simple variants and more meaningful descriptions for complex variants.
Submitted 24 March, 2025;
originally announced March 2025.
-
A Boolean Algebra for Genetic Variants
Authors:
Jonathan K. Vis,
Mark A. Santcroos,
Walter A. Kosters,
Jeroen F. J. Laros
Abstract:
Beyond identifying genetic variants, we introduce a set of Boolean relations that allows for a comprehensive classification of the relations for every pair of variants by taking all minimal alignments into account. We present an efficient algorithm to compute these relations, including a novel way of efficiently computing all minimal alignments within the best theoretical complexity bounds. We show that for variants of the CFTR gene in dbSNP these relations are common and many non-trivial. Ultimately, we present an approach for the storing and indexing of variants in the context of a database that enables efficient querying for all these relations.
Submitted 3 January, 2023; v1 submitted 29 December, 2021;
originally announced December 2021.
-
Using Pilot Systems to Execute Many Task Workloads on Supercomputers
Authors:
Andre Merzky,
Matteo Turilli,
Manuel Maldonado,
Mark Santcroos,
Shantenu Jha
Abstract:
High performance computing systems have historically been designed to support applications comprised of mostly monolithic, single-job workloads. Pilot systems decouple workload specification, resource selection, and task execution via job placeholders and late-binding. Pilot systems help to satisfy the resource requirements of workloads comprised of multiple tasks. RADICAL-Pilot (RP) is a modular and extensible Python-based pilot system. In this paper we describe RP's design, architecture and implementation, and characterize its performance. RP is capable of spawning more than 100 tasks/second and supports the steady-state execution of up to 16K concurrent tasks. RP can be used stand-alone, as well as integrated with other application-level tools as a runtime system.
Submitted 30 July, 2018; v1 submitted 27 December, 2015;
originally announced December 2015.
-
A Comprehensive Perspective on Pilot-Job Systems
Authors:
Matteo Turilli,
Mark Santcroos,
Shantenu Jha
Abstract:
Pilot-Job systems play an important role in supporting distributed scientific computing. They are used to consume more than 700 million CPU hours a year by the Open Science Grid communities, and to process up to 1 million jobs a day for the ATLAS experiment on the Worldwide LHC Computing Grid. With the increasing importance of task-level parallelism in high-performance computing, Pilot-Job systems are also witnessing adoption beyond traditional domains. Notwithstanding the growing impact on scientific research, there is no agreed-upon definition of a Pilot-Job system and no clear understanding of the underlying abstraction and paradigm. Pilot-Job implementations have proliferated with no shared best practices or open interfaces and little interoperability. Ultimately, this is hindering the realization of the full impact of Pilot-Jobs by limiting their robustness, portability, and maintainability. This paper offers a comprehensive analysis of Pilot-Job systems, critically assessing their motivations, evolution, properties, and implementation. The three main contributions of this paper are: (i) an analysis of the motivations and evolution of Pilot-Job systems; (ii) an outline of the Pilot abstraction, its distinguishing logical components and functionalities, its terminology, and its architecture pattern; and (iii) the description of core and auxiliary properties of Pilot-Job systems and the analysis of seven exemplar Pilot-Job implementations. Together, these contributions illustrate the Pilot paradigm, its generality, and how it helps to address some challenges in distributed scientific computing.
Submitted 5 March, 2016; v1 submitted 17 August, 2015;
originally announced August 2015.
-
Pilot-Data: An Abstraction for Distributed Data
Authors:
Andre Luckow,
Mark Santcroos,
Ashley Zebrowski,
Shantenu Jha
Abstract:
Scientific problems that depend on processing large amounts of data require overcoming challenges in multiple areas: managing large-scale data distribution, controlling co-placement and scheduling of data with compute resources, and storing, transferring, and managing large volumes of data. Although there exist multiple approaches to addressing each of these challenges, an integrative approach is missing; furthermore, extending existing functionality or enabling interoperable capabilities remains difficult at best. We propose the concept of Pilot-Data to address the fundamental challenges of co-placement and scheduling of data and compute in heterogeneous and distributed environments with interoperability and extensibility as first-order concerns. Pilot-Data is an extension of the Pilot-Job abstraction for supporting the management of data in conjunction with compute tasks. Pilot-Data separates logical data units from physical storage, thereby providing the basis for efficient compute/data placement and scheduling. In this paper, we discuss the design and implementation of the Pilot-Data prototype, demonstrate its use by data-intensive applications on multiple production distributed cyberinfrastructure and illustrate the advantages arising from flexible execution modes enabled by Pilot-Data. Our experiments utilize an implementation of Pilot-Data in conjunction with a scalable Pilot-Job (BigJob) to establish the application performance that can be enabled by the use of Pilot-Data. We demonstrate how the concept of Pilot-Data also provides the basis upon which to build tools and support capabilities like affinity which in turn can be used for advanced data-compute co-placement and scheduling.
Submitted 18 November, 2013; v1 submitted 26 January, 2013;
originally announced January 2013.
-
P*: A Model of Pilot-Abstractions
Authors:
Andre Luckow,
Mark Santcroos,
Ole Weidner,
Andre Merzky,
Pradeep Mantha,
Shantenu Jha
Abstract:
Pilot-Jobs support effective distributed resource utilization, and are arguably one of the most widely-used distributed computing abstractions - as measured by the number and types of applications that use them, as well as the number of production distributed cyberinfrastructures that support them. In spite of broad uptake, there does not exist a well-defined, unifying conceptual model of Pilot-Jobs which can be used to define, compare and contrast different implementations. Often Pilot-Job implementations are strongly coupled to the distributed cyber-infrastructure they were originally designed for. These factors present a barrier to extensibility and interoperability. This paper is an attempt to (i) provide a minimal but complete model (P*) of Pilot-Jobs, (ii) establish the generality of the P* Model by mapping various existing and well-known Pilot-Job frameworks such as Condor and DIANE to P*, (iii) derive an interoperable and extensible API for the P* Model (Pilot-API), (iv) validate the implementation of the Pilot-API by concurrently using multiple distinct Pilot-Job frameworks on distinct production distributed cyberinfrastructures, and (v) apply the P* Model to Pilot-Data.
Submitted 27 July, 2012;
originally announced July 2012.