-
Analyzing the Performance Portability of SYCL across CPUs, GPUs, and Hybrid Systems with SW Sequence Alignment
Authors:
Manuel Costanzo,
Enzo Rucci,
Carlos García-Sánchez,
Marcelo Naiouf,
Manuel Prieto-Matías
Abstract:
The high-performance computing (HPC) landscape is undergoing rapid transformation, with an increasing emphasis on energy-efficient and heterogeneous computing environments. This comprehensive study extends our previous research on SYCL's performance portability by evaluating its effectiveness across a broader spectrum of computing architectures, including CPUs, GPUs, and hybrid CPU-GPU configurations from NVIDIA, Intel, and AMD. Our analysis covers single-GPU, multi-GPU, single-CPU, and CPU-GPU hybrid setups, using two common bioinformatic applications as a case study. The results demonstrate SYCL's versatility across different architectures, maintaining comparable performance to CUDA on NVIDIA GPUs while achieving similar architectural efficiency rates on AMD and Intel GPUs in the majority of cases tested. SYCL also demonstrated remarkable versatility and effectiveness across CPUs from various manufacturers, including the latest hybrid architectures from Intel. Although SYCL showed excellent functional portability in hybrid CPU-GPU configurations, performance varied significantly based on specific hardware combinations. Some performance limitations were identified in multi-GPU and CPU-GPU configurations, primarily attributed to workload distribution strategies rather than SYCL-specific constraints. These findings position SYCL as a promising unified programming model for heterogeneous computing environments, particularly for bioinformatic applications.
Submitted 14 April, 2025; v1 submitted 11 December, 2024;
originally announced December 2024.
-
Comparing Performance and Portability between CUDA and SYCL for Protein Database Search on NVIDIA, AMD, and Intel GPUs
Authors:
Manuel Costanzo,
Enzo Rucci,
Carlos García Sánchez,
Marcelo Naiouf,
Manuel Prieto-Matías
Abstract:
The heterogeneous computing paradigm has led to the need for portable and efficient programming solutions that can leverage the capabilities of various hardware devices, such as NVIDIA, Intel, and AMD GPUs. This study evaluates the portability and performance of the SYCL and CUDA languages for one fundamental bioinformatics application (Smith-Waterman protein database search) across different GPU architectures, considering single and multi-GPU configurations from different vendors. The experimental work showed that, while both CUDA and SYCL versions achieve similar performance on NVIDIA devices, the latter demonstrated remarkable code portability to other GPU architectures, such as AMD and Intel. Furthermore, the architectural efficiency rates achieved on these devices were superior in 3 of the 4 cases tested. This brief study highlights the potential of SYCL as a viable solution for achieving both performance and portability in the heterogeneous computing ecosystem.
Submitted 10 November, 2023; v1 submitted 18 September, 2023;
originally announced September 2023.
-
Enhanced 6D Pose Estimation for Robotic Fruit Picking
Authors:
Marco Costanzo,
Marco De Simone,
Sara Federico,
Ciro Natale,
Salvatore Pirozzi
Abstract:
This paper proposes a novel method to refine the 6D pose estimation inferred by an instance-level deep neural network which processes a single RGB image and that has been trained on synthetic images only. The proposed optimization algorithm usefully exploits the depth measurement of a standard RGB-D camera to estimate the dimensions of the considered object, even though the network is trained on a single CAD model of the same object with given dimensions. The improved accuracy in the pose estimation allows a robot to grasp apples of various types and significantly different dimensions successfully; this was not possible using the standard pose estimation algorithm, except for the fruits with dimensions very close to those of the CAD drawing used in the training process. Grasping fresh fruits without damaging each item also demands a suitable grasp force control. A parallel gripper equipped with special force/tactile sensors is thus adopted to achieve safe grasps with the minimum force necessary to lift the fruits without any slippage and any deformation at the same time, with no knowledge of their weight.
Submitted 25 May, 2023;
originally announced May 2023.
-
Visual motion analysis of the player's finger
Authors:
Marco Costanzo
Abstract:
This work is about the extraction of the motion of the fingers, in their three articulations, of a keyboard player from a video sequence. The problem is relevant in several respects: the extracted finger movements may be used to compute keystroke efficiency and individual joint contributions, as shown by Werner Goebl and Caroline Palmer in the paper 'Temporal Control and Hand Movement Efficiency in Skilled Music Performance'. Those measures are directly related to the precision in timing and force measures. A very good approach to the hand gesture recognition problem has been presented in the paper 'Real-Time Hand Gesture Recognition Using Finger Segmentation'. Detecting the keys pressed on a keyboard can be a complex task because shadows can degrade the quality of the result and possibly cause keys that were not pressed to be detected. Among the several existing approaches, many are based on frame subtraction to detect the movements of the keys caused by their being pressed. Detecting the keys that are pressed could be useful to automatically evaluate the performance of a pianist or to automatically write sheet music of the melody that is being played.
Submitted 24 February, 2023;
originally announced March 2023.
-
Assessing Opportunities of SYCL for Biological Sequence Alignment on GPU-based Systems
Authors:
Manuel Costanzo,
Enzo Rucci,
Carlos García Sánchez,
Marcelo Naiouf,
Manuel Prieto-Matías
Abstract:
Bioinformatics and Computational Biology are two fields that have been exploiting GPUs for more than two decades, with CUDA being the most widely used programming language for them. However, as CUDA is an NVIDIA proprietary language, it implies a strong portability restriction to a wide range of heterogeneous architectures, like AMD or Intel GPUs. To face this issue, the Khronos Group has recently proposed the SYCL standard, an open, royalty-free, cross-platform abstraction layer that enables heterogeneous systems to be programmed using standard, single-source C++ code. Over the past few years, several implementations of the SYCL standard have emerged, with oneAPI being the one from Intel. This paper presents the migration process of the SW# suite, a biological sequence alignment tool developed in CUDA, to SYCL using Intel's oneAPI ecosystem. The experimental results show that SW# was completely migrated with little programmer intervention in terms of hand-coding. In addition, it was possible to port the migrated code between different architectures (considering multiple vendor GPUs and also CPUs), with no noticeable performance degradation on 5 different NVIDIA GPUs. Moreover, performance remained stable when switching to another SYCL implementation. As a consequence, SYCL and its implementations can offer attractive opportunities for the Bioinformatics community, especially considering the large body of CUDA-based legacy codes.
Submitted 23 February, 2024; v1 submitted 19 November, 2022;
originally announced November 2022.
-
Migrating CUDA to oneAPI: A Smith-Waterman Case Study
Authors:
Manuel Costanzo,
Enzo Rucci,
Carlos Garcia Sanchez,
Marcelo Naiouf,
Manuel Prieto-Matias
Abstract:
To face the programming challenges related to heterogeneous computing, Intel recently introduced oneAPI, a new programming environment that allows code developed in Data Parallel C++ (DPC++) language to be run on different devices such as CPUs, GPUs, FPGAs, among others. To tackle CUDA-based legacy codes, oneAPI provides a compatibility tool (dpct) that facilitates the migration to DPC++. Due to the large amount of existing CUDA-based software in the bioinformatics context, this paper presents our experiences porting SW#db, a well-known sequence alignment tool, to DPC++ using dpct. From the experimental work, it was possible to prove the usefulness of dpct for SW#db code migration and the cross-GPU vendor, cross-architecture portability of the migrated DPC++ code. In addition, the performance results showed that the migrated DPC++ code reports similar efficiency rates to its CUDA-native counterpart or even better in some tests (approximately +5%).
Submitted 20 June, 2022; v1 submitted 21 March, 2022;
originally announced March 2022.
-
Getting the best from skylines and top-k queries
Authors:
Marco Costanzo
Abstract:
Top-k and skylines are two important techniques that can be used to extract the best objects from a set. Both approaches have well-known pros and cons: a major limitation of skyline queries is that they cannot control the cardinality of the output and make it difficult to specify a trade-off among attributes, whereas ranking queries allow both. On the other hand, using ranking means that users must specify ranking functions and give up the simplicity of skylines. Flexible/restricted skylines present a new approach to tackle this problem, combining the best characteristics of both techniques by means of a new flexible dominance relation.
Submitted 25 February, 2022;
originally announced February 2022.
-
Performance vs Programming Effort between Rust and C on Multicore Architectures: Case Study in N-Body
Authors:
Manuel Costanzo,
Enzo Rucci,
Marcelo Naiouf,
Armando De Giusti
Abstract:
Historically, Fortran and C have been the default programming languages in High-Performance Computing (HPC). In both, programmers have primitives and functions available that allow manipulating system memory and interacting directly with the underlying hardware, resulting in efficient code in both response times and resource use. On the other hand, it is a real challenge to generate code that is maintainable and scalable over time in these types of languages. In 2010, Rust emerged as a new programming language designed for concurrent and secure applications, which adopts features of procedural, object-oriented and functional languages. Among its design principles, Rust is aimed at matching C in terms of efficiency, but with increased code security and productivity. This paper presents a comparative study between C and Rust in terms of performance and programming effort, selecting as a case study the simulation of N computational bodies (N-Body), a popular problem in the HPC community. Based on the experimental work, it was possible to establish that Rust is a language that reduces programming effort while maintaining acceptable performance levels, meaning that it is a possible alternative to C for HPC.
Submitted 19 October, 2021; v1 submitted 25 July, 2021;
originally announced July 2021.
-
Early Experiences Migrating CUDA codes to oneAPI
Authors:
Manuel Costanzo,
Enzo Rucci,
Carlos García Sanchez,
Marcelo Naiouf
Abstract:
The heterogeneous computing paradigm represents a real programming challenge due to the proliferation of devices with different hardware characteristics. Recently, Intel introduced oneAPI, a new programming environment that allows code developed in DPC++ to be run on different devices such as CPUs, GPUs, and FPGAs, among others. This paper presents our first experiences in porting two CUDA applications to DPC++ using the oneAPI dpct tool. From the experimental work, it was possible to verify that dpct does not achieve 100% of the migration task; however, it performs most of the work, informing the programmer of possible pending adaptations. Additionally, it was possible to verify the functional portability of the DPC++ code obtained, having successfully executed it on different CPU and GPU architectures.
Submitted 27 May, 2021;
originally announced May 2021.
-
Comparison of HPC Architectures for Computing All-Pairs Shortest Paths. Intel Xeon Phi KNL vs NVIDIA Pascal
Authors:
Manuel Costanzo,
Enzo Rucci,
Ulises Costi,
Franco Chichizola,
Marcelo Naiouf
Abstract:
Today, one of the main challenges for high-performance computing systems is to improve their performance while keeping energy consumption at acceptable levels. In this context, a consolidated strategy consists of using accelerators such as GPUs or many-core Intel Xeon Phi processors. In this work, devices of the NVIDIA Pascal and Intel Xeon Phi Knights Landing architectures are described and compared. Selecting the Floyd-Warshall algorithm as a representative case of graph and memory-bound applications, optimized implementations were developed to analyze and compare performance and energy efficiency on both devices. As expected, Xeon Phi proved superior when considering double-precision data. However, contrary to what was considered in our preliminary analysis, it was found that the performance and energy efficiency of both devices were comparable when using the single-precision data type.
Submitted 15 May, 2021;
originally announced May 2021.
-
Manipulation Planning and Control for Shelf Replenishment
Authors:
Marco Costanzo,
Simon Stelter,
Ciro Natale,
Salvatore Pirozzi,
Georg Bartels,
Alexis Maldonado,
Michael Beetz
Abstract:
Manipulation planning and control are relevant building blocks of a robotic system, and their tight integration is a key factor in improving robot autonomy and allowing robots to perform manipulation tasks of increasing complexity, such as those needed in the in-store logistics domain. Supermarkets contain a large variety of objects to be placed on the shelf layers with specific constraints; doing this with a robot is a challenge and requires high dexterity. However, an integration of reactive grasping control and motion planning can allow robots to perform such tasks even with grippers with limited dexterity. The main contribution of the paper is a novel method for planning manipulation tasks to be executed using a reactive control layer that provides more control modalities, i.e., slipping avoidance and controlled sliding. Experiments with a new force/tactile sensor equipping the gripper of a mobile manipulator show that the approach allows the robot to successfully perform manipulation tasks unfeasible with a standard fixed grasp.
Submitted 12 March, 2020; v1 submitted 23 December, 2019;
originally announced December 2019.
-
High energy Coulomb-scattered electrons for relativistic particle beam diagnostics
Authors:
P. Thieberger,
Z. Altinbas,
C. Carlson,
C. Chasman,
M. Costanzo,
C. Degen,
K. A. Drees,
W. Fischer,
D. Gassner,
X. Gu,
K. Hamdi,
J. Hock,
A. Marusic,
T. Miller,
M. Minty,
C. Montag,
Y. Luo,
A. I. Pikin,
S. M. White
Abstract:
A new system used for monitoring energetic Coulomb-scattered electrons as the main diagnostic for accurately aligning the electron and ion beams in the new Relativistic Heavy Ion Collider (RHIC) electron lenses is described in detail. The theory of electron scattering from relativistic ions is developed and applied to the design and implementation of the system used to achieve and maintain the alignment. Commissioning with gold and 3He beams is then described as well as the successful utilization of the new system during the 2015 RHIC polarized proton run. Systematic errors of the new method are then estimated. Finally, some possible future applications of Coulomb-scattered electrons for beam diagnostics are briefly discussed.
Submitted 24 March, 2016; v1 submitted 19 January, 2016;
originally announced January 2016.
-
Status of head-on beam-beam compensation in RHIC
Authors:
W. Fischer,
Z. Altinbas,
M. Anerella,
M. Blaskiewicz,
D. Bruno,
M. Costanzo,
W. C. Dawson,
D. M. Gassner,
X. Gu,
R. C. Gupta,
K. Hamdi,
J. Hock,
L. T. Hoff,
R. Hulsart,
A. K. Jain,
R. Lambiase,
Y. Luo,
M. Mapes,
A. Marone,
R. Michnoff,
T. A. Miller,
M. Minty,
C. Montag,
J. Muratore,
S. Nemesure
, et al. (12 additional authors not shown)
Abstract:
In polarized proton operation, the performance of the Relativistic Heavy Ion Collider (RHIC) is limited by the head-on beam-beam effect. To overcome this limitation, two electron lenses are under commissioning. We give an overview of head-on beam-beam compensation in general and in the specific design for RHIC, which is based on electron lenses. The status of installation and commissioning is presented along with plans for the future.
Submitted 20 October, 2014;
originally announced October 2014.