-
Tasking framework for Adaptive Speculative Parallel Mesh Generation
Authors:
Christos Tsolakis,
Polykarpos Thomadakis,
Nikos Chrisochoides
Abstract:
Handling the ever-increasing complexity of mesh generation codes along with the intricacies of newer hardware often results in codes that are both difficult to comprehend and maintain. Different facets of codes such as thread management and load balancing are often intertwined, resulting in efficient but highly complex software. In this work, we present a framework which aids in establishing a core principle, deemed separation of concerns, where functionality is separated from performance aspects of various mesh operations. In particular, thread management and scheduling decisions are elevated into a generic and reusable tasking framework. The results indicate that our approach can successfully abstract the load balancing aspects of two case studies, while providing access to a plethora of different execution back-ends. One would expect this new flexibility to incur some additional cost. However, for the configurations studied in this work, we observed up to 13% speedup for some meshing operations and up to 5.8% speedup over the entire application runtime compared to hand-optimized code. Moreover, we show that by using different task creation strategies, the overhead compared to straightforward task execution models can be improved dramatically, by as much as 1200%, without compromises in portability and functionality.
Submitted 27 April, 2024;
originally announced April 2024.
-
Towards Distributed Semi-speculative Adaptive Anisotropic Parallel Mesh Generation
Authors:
Kevin Garner,
Christos Tsolakis,
Polykarpos Thomadakis,
Nikos Chrisochoides
Abstract:
This paper presents the foundational elements of a distributed memory method for mesh generation that is designed to leverage concurrency offered by large-scale computing. To achieve this goal, meshing functionality is separated from performance aspects by utilizing a separate entity for each: a shared memory mesh generation code called CDT3D, and PREMA for parallel runtime support. Although CDT3D is designed for scalability, lessons are presented regarding additional measures that were taken to enable the code's integration into the distributed memory method as a black box. In the presented method, an initial mesh is data decomposed and subdomains are distributed amongst the nodes of a high-performance computing (HPC) cluster. Meshing operations within CDT3D utilize a speculative execution model, enabling the strict adaptation of subdomains' interior elements. Interface elements undergo several iterations of shifting so that they are adapted when their data dependencies are resolved. PREMA aids in this endeavor by providing asynchronous message passing between encapsulations of data, workload balancing, and migration capabilities, all within a globally addressable namespace. PREMA also assists in establishing data dependencies between subdomains, thus enabling "neighborhoods" of subdomains to work independently of each other in performing interface shifts and adaptation. Preliminary results show that the presented method is able to produce meshes of comparable quality to those generated by the original shared memory CDT3D code. Given the costly overhead of collective communication seen by existing state-of-the-art software, relative communication performance of the presented distributed memory method also shows that its emphasis on avoiding global synchronization presents a potentially viable solution in achieving scalability when targeting large configurations of cores.
Submitted 20 December, 2023;
originally announced December 2023.
-
Experience with Distributed Memory Delaunay-based Image-to-Mesh Conversion Implementation
Authors:
Polykarpos Thomadakis,
Nikos Chrisochoides
Abstract:
This paper presents some of our findings on the scalability of parallel 3D mesh generation on distributed memory machines. The primary objective of this study was to evaluate a distributed memory approach for implementing a 3D parallel Delaunay-based algorithm that converts images to meshes by leveraging an efficient shared memory implementation. The secondary objective was to evaluate the effectiveness of labor (i.e., reduced development time) while introducing minimal overheads to maintain the parallel efficiency of the end product, i.e., the distributed implementation. The distributed algorithm utilizes two existing and independently developed parallel Delaunay-based methods: (1) a fine-grained method that employs multi-threading and speculative execution on shared memory nodes and (2) a loosely coupled Delaunay-refinement framework for multi-node platforms. The shared memory implementation uses a FIFO work-sharing scheme for thread scheduling, while the distributed memory implementation utilizes MPI and the Master-Worker (MW) model. The findings from the specific MPI-MW implementation we tested suggest that (1) execution on 40 cores, not necessarily in the same node, is 2.3 times faster than execution on ten cores, and (2) the best speedup is 5.4 with 180 cores, again compared with the best performance on ten cores. A closer look at the performance of the distributed memory and shared memory implementations executing on a single node (40 cores) suggests that the overheads introduced in the MPI-MW implementation are high and render the MPI-MW implementation 4 times slower than the shared memory code using the same number of cores. These findings raise several questions on the potential scalability of a "black box" approach, i.e., re-using a code designed to execute efficiently on shared memory machines without considering its potential use in a distributed memory setting.
Submitted 23 August, 2023;
originally announced August 2023.
-
Runtime Support for Performance Portability on Heterogeneous Distributed Platforms
Authors:
Polykarpos Thomadakis,
Nikos Chrisochoides
Abstract:
Hardware heterogeneity is here to stay for high-performance computing. Large-scale systems are currently equipped with multiple GPU accelerators per compute node and are expected to incorporate more specialized hardware. This shift in the computing ecosystem offers many opportunities for performance improvement; however, it also increases the complexity of programming for such architectures. This work introduces a runtime framework that enables effortless programming for heterogeneous systems while efficiently utilizing hardware resources. The framework is integrated within a distributed and scalable runtime system to facilitate performance portability across heterogeneous nodes. Along with the design, this paper describes the implementation and optimizations performed, achieving up to 300% improvement on a single device and linear scalability on a node equipped with four GPUs. The framework in a distributed memory environment offers portable abstractions that enable efficient inter-node communication among devices with varying capabilities. It delivers superior performance compared to MPI+CUDA by up to 20% for large messages while keeping the overheads for small messages within 10%. Furthermore, the results of our performance evaluation in a distributed Jacobi proxy application demonstrate that our software imposes minimal overhead and achieves a performance improvement of up to 40%. This is accomplished by the optimizations at the library level as well as by creating opportunities to leverage application-specific optimizations like over-decomposition.
Submitted 7 March, 2023; v1 submitted 4 March, 2023;
originally announced March 2023.
-
Towards Performance Portable Programming for Distributed Heterogeneous Systems
Authors:
Polykarpos Thomadakis,
Nikos Chrisochoides
Abstract:
Hardware heterogeneity is here to stay for high-performance computing. Large-scale systems are currently equipped with multiple GPU accelerators per compute node and are expected to incorporate more specialized hardware in the future. This shift in the computing ecosystem offers many opportunities for performance improvement; however, it also increases the complexity of programming for such architectures. This work introduces a runtime framework that enables effortless programming for heterogeneous systems while efficiently utilizing hardware resources. The framework is integrated within a distributed and scalable runtime system to facilitate performance portability across heterogeneous nodes. Along with the design, this paper describes the implementation and optimizations performed, achieving up to 300% improvement in a shared memory benchmark and up to 10 times in distributed device communication. Preliminary results indicate that our software incurs low overhead and achieves 40% improvement in a distributed Jacobi proxy application while hiding the idiosyncrasies of the hardware.
Submitted 3 October, 2022;
originally announced October 2022.
-
Convolutional Auto-Encoders for Drift Chamber data de-noising for CLAS12
Authors:
Gagik Gavalian,
Polykarpos Thomadakis,
Angelos Angelopoulos,
Nikos Chrisochoides
Abstract:
In this article, we present the results of using Convolutional Auto-Encoders for de-noising raw data for CLAS12 drift chambers. The de-noising neural network provides increased efficiency in track reconstruction and also improved performance for high luminosity experimental data collection. The de-noising neural network, used in conjunction with the previously developed track classifier neural network \cite{Gavalian:2022hfa}, leads to a significant track reconstruction efficiency increase for current luminosity ($0.6\times10^{35}~cm^{-2}~sec^{-1}$). The increase in experimentally measured quantities will allow running experiments at twice the luminosity with the same track reconstruction efficiency. This will lead to huge savings in accelerator operational costs, and large savings for Jefferson Lab and collaborating institutions.
Submitted 13 June, 2022; v1 submitted 5 May, 2022;
originally announced May 2022.
-
CLAS12 Track Reconstruction with Artificial Intelligence
Authors:
Gagik Gavalian,
Polykarpos Thomadakis,
Angelos Angelopoulos,
Nikos Chrisochoides,
Raffaella De Vita,
Veronique Ziegler
Abstract:
In this article we describe the implementation of Artificial Intelligence models in track reconstruction software for the CLAS12 detector at Jefferson Lab. The Artificial Intelligence based approach resulted in improved track reconstruction efficiency in high luminosity experimental conditions. The track reconstruction efficiency increased by $10-12\%$ for single particle, and statistics in multi-particle physics reactions increased by $15\%-35\%$ depending on the number of particles in the reaction. The implementation of artificial intelligence in the workflow also resulted in a speedup of the tracking by $35\%$.
Submitted 13 June, 2022; v1 submitted 14 February, 2022;
originally announced February 2022.
-
Using Machine Learning for Particle Track Identification in the CLAS12 Detector
Authors:
Polykarpos Thomadakis,
Angelos Angelopoulos,
Gagik Gavalian,
Nikos Chrisochoides
Abstract:
Particle track reconstruction is the most computationally intensive process in nuclear physics experiments. Traditional algorithms use a combinatorial approach that exhaustively tests track measurements ("hits") to identify those that form an actual particle trajectory. In this article, we describe the development of four machine learning (ML) models that assist the tracking algorithm by identifying valid track candidates from the measurements in drift chambers. Several types of machine learning models were tested, including Convolutional Neural Networks (CNN), Multi-Layer Perceptrons (MLP), Extremely Randomized Trees (ERT) and Recurrent Neural Networks (RNN). As a result of this work, an MLP network classifier was implemented as part of the CLAS12 reconstruction software to provide the tracking code with recommended track candidates. The resulting software achieved accuracy of greater than 99% and resulted in an end-to-end speedup of 35% compared to existing algorithms.
Submitted 28 April, 2022; v1 submitted 28 August, 2020;
originally announced August 2020.
-
Reed-Solomon and Concatenated Codes with Applications in Space Communication
Authors:
Polykarpos Thomadakis,
Antonios Argyriou
Abstract:
In this paper we provide a detailed description of Reed-Solomon (RS) codes, the most important algorithms for decoding them, and their use in concatenated coding systems for space applications. In the current literature there is scattered information regarding the bit-level implementation of such codes for either space systems or any other type of application. Consequently, we start with a general overview of the channel coding systems used in space communications and then focus on the finer details. We first present a detailed description of the required algebra of RS codes with detailed examples. Next, the steps of the encoding and decoding algorithms are described in detail, again with additional examples. We then focus on a particularly important class of concatenated encoders/decoders, namely the Consultative Committee for Space Data Systems (CCSDS) concatenated coding system, which uses RS as the outer code and a convolutional inner code. Finally, we perform a thorough performance evaluation of the presented codes under the AWGN channel model.
Submitted 13 August, 2016;
originally announced August 2016.