Search | arXiv e-print repository

doi 10.1016/j.future.2025.107838

Analyzing the Performance Portability of SYCL across CPUs, GPUs, and Hybrid Systems with SW Sequence Alignment

Authors: Manuel Costanzo, Enzo Rucci, Carlos García-Sánchez, Marcelo Naiouf, Manuel Prieto-Matías

Abstract: The high-performance computing (HPC) landscape is undergoing rapid transformation, with an increasing emphasis on energy-efficient and heterogeneous computing environments. This comprehensive study extends our previous research on SYCL's performance portability by evaluating its effectiveness across a broader spectrum of computing architectures, including CPUs, GPUs, and hybrid CPU-GPU configurations from NVIDIA, Intel, and AMD. Our analysis covers single-GPU, multi-GPU, single-CPU, and CPU-GPU hybrid setups, using two common bioinformatic applications as a case study. The results demonstrate SYCL's versatility across different architectures, maintaining comparable performance to CUDA on NVIDIA GPUs while achieving similar architectural efficiency rates on AMD and Intel GPUs in the majority of cases tested. SYCL also demonstrated remarkable versatility and effectiveness across CPUs from various manufacturers, including the latest hybrid architectures from Intel. Although SYCL showed excellent functional portability in hybrid CPU-GPU configurations, performance varied significantly based on specific hardware combinations. Some performance limitations were identified in multi-GPU and CPU-GPU configurations, primarily attributed to workload distribution strategies rather than SYCL-specific constraints. These findings position SYCL as a promising unified programming model for heterogeneous computing environments, particularly for bioinformatic applications.

Submitted 14 April, 2025; v1 submitted 11 December, 2024; originally announced December 2024.

Comments: arXiv admin note: text overlap with arXiv:2309.09609

arXiv:2309.09609 [pdf, other]

doi 10.1109/SBAC-PAD59825.2023.00023

Comparing Performance and Portability between CUDA and SYCL for Protein Database Search on NVIDIA, AMD, and Intel GPUs

Authors: Manuel Costanzo, Enzo Rucci, Carlos García Sánchez, Marcelo Naiouf, Manuel Prieto-Matías

Abstract: The heterogeneous computing paradigm has led to the need for portable and efficient programming solutions that can leverage the capabilities of various hardware devices, such as NVIDIA, Intel, and AMD GPUs. This study evaluates the portability and performance of the SYCL and CUDA languages for one fundamental bioinformatics application (Smith-Waterman protein database search) across different GPU architectures, considering single and multi-GPU configurations from different vendors. The experimental work showed that, while both CUDA and SYCL versions achieve similar performance on NVIDIA devices, the latter demonstrated remarkable code portability to other GPU architectures, such as AMD and Intel. Furthermore, the architectural efficiency rates achieved on these devices were superior in 3 of the 4 cases tested. This brief study highlights the potential of SYCL as a viable solution for achieving both performance and portability in the heterogeneous computing ecosystem.

Submitted 10 November, 2023; v1 submitted 18 September, 2023; originally announced September 2023.

Comments: This article was accepted for publication in 2023 IEEE 35th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)

arXiv:2305.15856 [pdf, other]

Enhanced 6D Pose Estimation for Robotic Fruit Picking

Authors: Marco Costanzo, Marco De Simone, Sara Federico, Ciro Natale, Salvatore Pirozzi

Abstract: This paper proposes a novel method to refine the 6D pose estimation inferred by an instance-level deep neural network which processes a single RGB image and that has been trained on synthetic images only. The proposed optimization algorithm usefully exploits the depth measurement of a standard RGB-D camera to estimate the dimensions of the considered object, even though the network is trained on a single CAD model of the same object with given dimensions. The improved accuracy in the pose estimation allows a robot to grasp apples of various types and significantly different dimensions successfully; this was not possible using the standard pose estimation algorithm, except for the fruits with dimensions very close to those of the CAD drawing used in the training process. Grasping fresh fruits without damaging each item also demands a suitable grasp force control. A parallel gripper equipped with special force/tactile sensors is thus adopted to achieve safe grasps with the minimum force necessary to lift the fruits without any slippage and any deformation at the same time, with no knowledge of their weight.

Submitted 25 May, 2023; originally announced May 2023.

Comments: To be published in conference proceedings of the 9th Int. Conf. on Control, Decision and Information Technologies (CODIT 2023), July 3-6, Rome, Italy

arXiv:2303.12697 [pdf]

Visual motion analysis of the player's finger

Authors: Marco Costanzo

Abstract: This work is about the extraction of the motion of fingers, in their three articulations, of a keyboard player from a video sequence. The problem is relevant in several respects: the extraction of the movements of the fingers may be used to compute the keystroke efficiency and individual joint contributions, as shown by Werner Goebl and Caroline Palmer in the paper 'Temporal Control and Hand Movement Efficiency in Skilled Music Performance'. Those measures are directly related to the precision in timing and force measures. A very good approach to the hand gesture recognition problem has been presented in the paper 'Real-Time Hand Gesture Recognition Using Finger Segmentation'. Detecting the keys pressed on a keyboard is a task that can be complex because shadows can degrade the quality of the result and possibly cause the detection of keys that were not pressed. Among the several approaches that already exist, many are based on the subtraction of frames in order to detect the movements of the keys caused by their being pressed. Detecting the keys that are pressed could be useful to automatically evaluate the performance of a pianist or to automatically write sheet music of the melody that is being played.

Submitted 24 February, 2023; originally announced March 2023.

Comments: 34 pages, 49 figures

arXiv:2211.10769 [pdf, other]

doi 10.1007/s11227-024-05907-2

Assessing Opportunities of SYCL for Biological Sequence Alignment on GPU-based Systems

Authors: Manuel Costanzo, Enzo Rucci, Carlos García Sánchez, Marcelo Naiouf, Manuel Prieto-Matías

Abstract: Bioinformatics and Computational Biology are two fields that have been exploiting GPUs for more than two decades, with CUDA being the most used programming language for them. However, as CUDA is an NVIDIA proprietary language, it implies a strong portability restriction to a wide range of heterogeneous architectures, like AMD or Intel GPUs. To address this issue, the Khronos Group has recently proposed the SYCL standard, which is an open, royalty-free, cross-platform abstraction layer that enables the programming of a heterogeneous system to be written using standard, single-source C++ code. Over the past few years, several implementations of this SYCL standard have emerged, with oneAPI being the one from Intel. This paper presents the migration process of the SW# suite, a biological sequence alignment tool developed in CUDA, to SYCL using Intel's oneAPI ecosystem. The experimental results show that SW# was completely migrated with a small programmer intervention in terms of hand-coding. In addition, it was possible to port the migrated code between different architectures (considering multiple vendor GPUs and also CPUs), with no noticeable performance degradation on 5 different NVIDIA GPUs. Moreover, performance remained stable when switching to another SYCL implementation. As a consequence, SYCL and its implementations can offer attractive opportunities for the Bioinformatics community, especially considering the vast existence of CUDA-based legacy codes.

Submitted 23 February, 2024; v1 submitted 19 November, 2022; originally announced November 2022.

Journal ref: J Supercomput (2024)

arXiv:2203.11100 [pdf, ps, other]

doi 10.1007/978-3-031-07802-6_9

Migrating CUDA to oneAPI: A Smith-Waterman Case Study

Authors: Manuel Costanzo, Enzo Rucci, Carlos Garcia Sanchez, Marcelo Naiouf, Manuel Prieto-Matias

Abstract: To face the programming challenges related to heterogeneous computing, Intel recently introduced oneAPI, a new programming environment that allows code developed in Data Parallel C++ (DPC++) language to be run on different devices such as CPUs, GPUs, FPGAs, among others. To tackle CUDA-based legacy codes, oneAPI provides a compatibility tool (dpct) that facilitates the migration to DPC++. Due to the large amount of existing CUDA-based software in the bioinformatics context, this paper presents our experiences porting SW#db, a well-known sequence alignment tool, to DPC++ using dpct. From the experimental work, it was possible to prove the usefulness of dpct for SW#db code migration and the cross-GPU vendor, cross-architecture portability of the migrated DPC++ code. In addition, the performance results showed that the migrated DPC++ code reports similar efficiency rates to its CUDA-native counterpart or even better in some tests (approximately +5%).

Submitted 20 June, 2022; v1 submitted 21 March, 2022; originally announced March 2022.

Comments: Accepted in IWBBIO 2022

Journal ref: In: Bioinformatics and Biomedical Engineering. IWBBIO 2022. Lecture Notes in Computer Science, vol 13347. Springer, Cham (2022)

arXiv:2202.12618 [pdf]

Getting the best from skylines and top-k queries

Authors: Marco Costanzo

Abstract: Top-k and skylines are two important techniques that can be used to extract the best objects from a set. Both approaches have well-known pros and cons: a significant limitation of skyline queries is the impossibility of controlling the cardinality of the output and the difficulty in specifying a trade-off among attributes, whereas ranking queries allow both. On the other hand, the usage of ranking requires users to specify ranking functions and gives up the simplicity of skylines. Flexible/restricted skylines present a new approach to tackle this problem, combining the best characteristics of both techniques by making use of a new flexible relation of dominance.

Submitted 25 February, 2022; originally announced February 2022.

Comments: 24 pages

arXiv:2107.11912 [pdf, other]

Performance vs Programming Effort between Rust and C on Multicore Architectures: Case Study in N-Body

Authors: Manuel Costanzo, Enzo Rucci, Marcelo Naiouf, Armando De Giusti

Abstract: Historically, Fortran and C have been the default programming languages in High-Performance Computing (HPC). In both, programmers have primitives and functions available that allow manipulating system memory and interacting directly with the underlying hardware, resulting in efficient code in both response times and resource use. On the other hand, it is a real challenge to generate code that is maintainable and scalable over time in these types of languages. In 2010, Rust emerged as a new programming language designed for concurrent and secure applications, which adopts features of procedural, object-oriented and functional languages. Among its design principles, Rust is aimed at matching C in terms of efficiency, but with increased code security and productivity. This paper presents a comparative study between C and Rust in terms of performance and programming effort, selecting as a case study the simulation of N computational bodies (N-Body), a popular problem in the HPC community. Based on the experimental work, it was possible to establish that Rust is a language that reduces programming effort while maintaining acceptable performance levels, meaning that it is a possible alternative to C for HPC.

Submitted 19 October, 2021; v1 submitted 25 July, 2021; originally announced July 2021.

Comments: This article was accepted for publication in 2021 XLVI Latin American Computing Conference (CLEI)

arXiv:2105.13489 [pdf, other]

Early Experiences Migrating CUDA codes to oneAPI

Authors: Manuel Costanzo, Enzo Rucci, Carlos García Sanchez, Marcelo Naiouf

Abstract: The heterogeneous computing paradigm represents a real programming challenge due to the proliferation of devices with different hardware characteristics. Recently Intel introduced oneAPI, a new programming environment that allows code developed in DPC++ to be run on different devices such as CPUs, GPUs, FPGAs, among others. This paper presents our first experiences in porting two CUDA applications to DPC++ using the oneAPI dpct tool. From the experimental work, it was possible to verify that dpct does not achieve 100% of the migration task; however, it performs most of the work, informing the programmer of possible pending adaptations. Additionally, it was possible to verify the functional portability of the DPC++ code obtained, having successfully executed it on different CPU and GPU architectures.

Submitted 27 May, 2021; originally announced May 2021.

Comments: Accepted for publication in 9th Conference on Cloud Computing, Big Data & Emerging Topics (JCC-BD&ET 2021, https://jcc.info.unlp.edu.ar/en/)

arXiv:2105.07298 [pdf, other]

doi 10.1007/978-3-030-75836-3_3

Comparison of HPC Architectures for Computing All-Pairs Shortest Paths. Intel Xeon Phi KNL vs NVIDIA Pascal

Authors: Manuel Costanzo, Enzo Rucci, Ulises Costi, Franco Chichizola, Marcelo Naiouf

Abstract: Today, one of the main challenges for high-performance computing systems is to improve their performance while keeping energy consumption at acceptable levels. In this context, a consolidated strategy consists of using accelerators such as GPUs or many-core Intel Xeon Phi processors. In this work, devices of the NVIDIA Pascal and Intel Xeon Phi Knights Landing architectures are described and compared. Selecting the Floyd-Warshall algorithm as a representative case of graph and memory-bound applications, optimized implementations were developed to analyze and compare performance and energy efficiency on both devices. As expected, Xeon Phi proved superior when considering double-precision data. However, contrary to what was considered in our preliminary analysis, it was found that the performance and energy efficiency of both devices were comparable using the single-precision datatype.

Submitted 15 May, 2021; originally announced May 2021.

Comments: Computer Science - CACIC 2020. CACIC 2020. Communications in Computer and Information Science, vol 1409. Springer, Cham

arXiv:1912.11018 [pdf, other]

doi 10.1109/LRA.2020.2969179

Manipulation Planning and Control for Shelf Replenishment

Authors: Marco Costanzo, Simon Stelter, Ciro Natale, Salvatore Pirozzi, Georg Bartels, Alexis Maldonado, Michael Beetz

Abstract: Manipulation planning and control are relevant building blocks of a robotic system, and their tight integration is a key factor in improving robot autonomy and allowing robots to perform manipulation tasks of increasing complexity, such as those needed in the in-store logistics domain. Supermarkets contain a large variety of objects to be placed on the shelf layers with specific constraints; doing this with a robot is a challenge and requires high dexterity. However, an integration of reactive grasping control and motion planning can allow robots to perform such tasks even with grippers with limited dexterity. The main contribution of the paper is a novel method for planning manipulation tasks to be executed using a reactive control layer that provides more control modalities, i.e., slipping avoidance and controlled sliding. Experiments with a new force/tactile sensor equipping the gripper of a mobile manipulator show that the approach allows the robot to successfully perform manipulation tasks unfeasible with a standard fixed grasp.

Submitted 12 March, 2020; v1 submitted 23 December, 2019; originally announced December 2019.

Comments: 8 pages, 12 figures, accepted at RAL

arXiv:1601.05126 [pdf]

doi 10.1103/PhysRevAccelBeams.19.041002

High energy Coulomb-scattered electrons for relativistic particle beam diagnostics

Authors: P. Thieberger, Z. Altinbas, C. Carlson, C. Chasman, M. Costanzo, C. Degen, K. A. Drees, W. Fischer, D. Gassner, X. Gu, K. Hamdi, J. Hock, A. Marusic, T. Miller, M. Minty, C. Montag, Y. Luo, A. I. Pikin, S. M. White

Abstract: A new system used for monitoring energetic Coulomb-scattered electrons as the main diagnostic for accurately aligning the electron and ion beams in the new Relativistic Heavy Ion Collider (RHIC) electron lenses is described in detail. The theory of electron scattering from relativistic ions is developed and applied to the design and implementation of the system used to achieve and maintain the alignment. Commissioning with gold and 3He beams is then described as well as the successful utilization of the new system during the 2015 RHIC polarized proton run. Systematic errors of the new method are then estimated. Finally, some possible future applications of Coulomb-scattered electrons for beam diagnostics are briefly discussed.

Submitted 24 March, 2016; v1 submitted 19 January, 2016; originally announced January 2016.

Comments: 16 pages, 23 figures

arXiv:1410.5315 [pdf, other]

doi 10.5170/CERN-2014-004.109

Status of head-on beam-beam compensation in RHIC

Authors: W. Fischer, Z. Altinbas, M. Anerella, M. Blaskiewicz, D. Bruno, M. Costanzo, W. C. Dawson, D. M. Gassner, X. Gu, R. C. Gupta, K. Hamdi, J. Hock, L. T. Hoff, R. Hulsart, A. K. Jain, R. Lambiase, Y. Luo, M. Mapes, A. Marone, R. Michnoff, T. A. Miller, M. Minty, C. Montag, J. Muratore, S. Nemesure, et al. (12 additional authors not shown)

Abstract: In polarized proton operation, the performance of the Relativistic Heavy Ion Collider (RHIC) is limited by the head-on beam-beam effect. To overcome this limitation, two electron lenses are under commissioning. We give an overview of head-on beam-beam compensation in general and in the specific design for RHIC, which is based on electron lenses. The status of installation and commissioning is presented along with plans for the future.

Submitted 20 October, 2014; originally announced October 2014.

Comments: 12 pages, contribution to the ICFA Mini-Workshop on Beam-Beam Effects in Hadron Colliders, CERN, Geneva, Switzerland, 18-22 Mar 2013

Journal ref: CERN Yellow Report CERN-2014-004, pp.109-120

Showing 1–13 of 13 results for author: Costanzo, M