-
APEIRON: composing smart TDAQ systems for high energy physics experiments
Authors:
Roberto Ammendola,
Andrea Biagioni,
Carlotta Chiarini,
Andrea Ciardiello,
Paolo Cretaro,
Ottorino Frezza,
Francesca Lo Cicero,
Alessandro Lonardo,
Michele Martinelli,
Pier Stanislao Paolucci,
Cristian Rossi,
Francesco Simula,
Matteo Turisini,
Piero Vicini
Abstract:
APEIRON is a framework encompassing the general architecture of a distributed heterogeneous processing platform and the corresponding software stack, from the low level device drivers up to the high level programming model. The framework is designed to be efficiently used for studying, prototyping and deploying smart trigger and data acquisition (TDAQ) systems for high energy physics experiments.
Submitted 3 July, 2023;
originally announced July 2023.
-
Architectural improvements and technological enhancements for the APEnet+ interconnect system
Authors:
R. Ammendola,
A. Biagioni,
O. Frezza,
A. Lonardo,
F. Lo Cicero,
M. Martinelli,
P. S. Paolucci,
E. Pastorelli,
D. Rossetti,
F. Simula,
L. Tosoratto,
P. Vicini
Abstract:
The APEnet+ board delivers a point-to-point, low-latency, 3D torus network interface card. In this paper we describe the latest generation of APEnet NIC, APEnet v5, integrated in a PCIe Gen3 board based on a state-of-the-art, 28 nm Altera Stratix V FPGA. The NIC features a network architecture designed following the Remote DMA paradigm and tailored to tightly bind the computing power of modern GPUs to the communication fabric. For the APEnet v5 board we show characterizing figures such as achieved bandwidth and BER, obtained by exploiting new high-performance Altera transceivers and PCIe Gen3 compliance.
Submitted 4 January, 2022;
originally announced January 2022.
-
Real-time cortical simulations: energy and interconnect scaling on distributed systems
Authors:
Francesco Simula,
Elena Pastorelli,
Pier Stanislao Paolucci,
Michele Martinelli,
Alessandro Lonardo,
Andrea Biagioni,
Cristiano Capone,
Fabrizio Capuani,
Paolo Cretaro,
Giulia De Bonis,
Francesca Lo Cicero,
Luca Pontisso,
Piero Vicini,
Roberto Ammendola
Abstract:
We profile the impact of computation and inter-processor communication on the energy consumption and on the scaling of cortical simulations approaching the real-time regime on distributed computing platforms. We also compare the speed and energy consumption of processor architectures typical of standard HPC and embedded platforms. We demonstrate the importance of the design of a low-latency interconnect for speed and energy consumption. The cost of cortical simulations is quantified using the Joule per synaptic event metric on both architectures. Reaching efficient real-time performance on large-scale cortical simulations is of increasing relevance both for future bio-inspired artificial intelligence applications and for understanding the cognitive functions of the brain, a scientific quest that will require embedding large-scale simulations into highly complex virtual or real worlds. This work stands at the crossroads between the WaveScalES experiment in the Human Brain Project (HBP), which includes the objective of large-scale thalamo-cortical simulations of brain states and their transitions, and the ExaNeSt and EuroExa projects, which investigate the design of an ARM-based, low-power High Performance Computing (HPC) architecture with a dedicated interconnect scalable to millions of cores; simulations of deep sleep Slow Wave Activity (SWA) and Asynchronous aWake (AW) regimes expressed by thalamo-cortical models are among their benchmarks.
Submitted 26 November, 2019; v1 submitted 12 December, 2018;
originally announced December 2018.
-
Large Scale Low Power Computing System - Status of Network Design in ExaNeSt and EuroExa Projects
Authors:
Roberto Ammendola,
Andrea Biagioni,
Fabrizio Capuani,
Paolo Cretaro,
Giulia De Bonis,
Francesca Lo Cicero,
Alessandro Lonardo,
Michele Martinelli,
Pier Stanislao Paolucci,
Elena Pastorelli,
Luca Pontisso,
Francesco Simula,
Piero Vicini
Abstract:
The deployment of the next generation computing platform at ExaFlops scale requires solving new technological challenges, mainly related to the impressive number (up to 10^6) of compute elements required. This impacts system power consumption, in terms of feasibility and costs, as well as system scalability and computing efficiency. In this perspective, analysis, exploration and evaluation of technologies characterized by low power, high efficiency and a high degree of customization are strongly needed. Among the various European initiatives targeting the design of ExaFlops systems, ExaNeSt and EuroExa are EU-H2020 funded initiatives leveraging high-end MPSoC FPGAs. Last generation MPSoC FPGAs can be seen as non-mainstream but powerful HPC Exascale enabling components, thanks to the integration of embedded multi-core, ARM-based low power CPUs and a huge number of hardware resources usable to co-design application oriented accelerators and to develop a low latency, high bandwidth network architecture. In this paper we introduce ExaNet, the FPGA-based, scalable, direct network architecture of the ExaNeSt system. ExaNet allows us to explore different interconnection topologies, to evaluate advanced routing functions for congestion control and fault tolerance, and to design specific hardware components for the acceleration of collective operations. After a brief introduction to the motivations and goals of the ExaNeSt and EuroExa projects, we report on the status of the network architecture design and its hardware/software testbed, adding preliminary bandwidth and latency achievements.
Submitted 11 April, 2018;
originally announced April 2018.
-
The Brain on Low Power Architectures - Efficient Simulation of Cortical Slow Waves and Asynchronous States
Authors:
Roberto Ammendola,
Andrea Biagioni,
Fabrizio Capuani,
Paolo Cretaro,
Giulia De Bonis,
Francesca Lo Cicero,
Alessandro Lonardo,
Michele Martinelli,
Pier Stanislao Paolucci,
Elena Pastorelli,
Luca Pontisso,
Francesco Simula,
Piero Vicini
Abstract:
Efficient brain simulation is a scientific grand challenge, a parallel/distributed coding challenge and a source of requirements and suggestions for future computing architectures. Indeed, the human brain includes about 10^15 synapses and 10^11 neurons activated at a mean rate of several Hz. Full brain simulation poses Exascale challenges even if simulated at the highest abstraction level. The WaveScalES experiment in the Human Brain Project (HBP) has the goal of matching experimental measures and simulations of slow waves during deep sleep and anesthesia and the transition to other brain states. The focus is the development of dedicated large-scale parallel/distributed simulation technologies. The ExaNeSt project designs an ARM-based, low-power HPC architecture scalable to millions of cores, developing a dedicated scalable interconnect system, and SWA/AW simulations are included among the driving benchmarks. At the junction of the two projects is the INFN proprietary Distributed and Plastic Spiking Neural Networks (DPSNN) simulation engine. DPSNN can be configured to stress either the networking or the computation features available on the execution platforms. The simulation stresses the networking component when the neural net - composed of a relatively low number of neurons, each one projecting thousands of synapses - is distributed over a large number of hardware cores. As the number of neurons per core grows, computation becomes the dominating component for short-range connections. This paper reports preliminary performance results obtained on an ARM-based HPC prototype developed in the framework of the ExaNeSt project. Furthermore, a comparison is given of instantaneous power, total energy consumption, execution time and energetic cost per synaptic event of SWA/AW DPSNN simulations when executed on either ARM- or Intel-based server platforms.
Submitted 10 April, 2018;
originally announced April 2018.
-
Gaussian and exponential lateral connectivity on distributed spiking neural network simulation
Authors:
Elena Pastorelli,
Pier Stanislao Paolucci,
Francesco Simula,
Andrea Biagioni,
Fabrizio Capuani,
Paolo Cretaro,
Giulia De Bonis,
Francesca Lo Cicero,
Alessandro Lonardo,
Michele Martinelli,
Luca Pontisso,
Piero Vicini,
Roberto Ammendola
Abstract:
We measured the impact of long-range exponentially decaying intra-areal lateral connectivity on the scaling and memory occupation of a distributed spiking neural network simulator compared to that of short-range Gaussian decays. While previous studies adopted short-range connectivity, recent experimental neuroscience studies are pointing out the role of longer-range intra-areal connectivity, with implications for neural simulation platforms. Two-dimensional grids of cortical columns composed of up to 11 M point-like spiking neurons with spike frequency adaptation were connected by up to 30 G synapses using short- and long-range connectivity models. The MPI processes composing the distributed simulator were run on up to 1024 hardware cores, hosted on a 64-node server platform. The hardware platform was a cluster of IBM NX360 M5 16-core compute nodes, each one containing two Intel Xeon Haswell 8-core E5-2630 v3 processors with a clock of 2.40 GHz, interconnected through an InfiniBand network equipped with 4x QDR switches.
Submitted 19 February, 2019; v1 submitted 23 March, 2018;
originally announced March 2018.
-
Impact of exponential long range and Gaussian short range lateral connectivity on the distributed simulation of neural networks including up to 30 billion synapses
Authors:
Elena Pastorelli,
Pier Stanislao Paolucci,
Roberto Ammendola,
Andrea Biagioni,
Ottorino Frezza,
Francesca Lo Cicero,
Alessandro Lonardo,
Michele Martinelli,
Francesco Simula,
Piero Vicini
Abstract:
Recent experimental neuroscience studies are pointing out the role of long-range intra-areal connectivity that can be modeled by a distance-dependent exponential decay of the synaptic probability distribution. This short report provides a preliminary measure of the impact of exponentially decaying lateral connectivity compared to that of shorter-range Gaussian decays on the scaling behaviour and memory occupation of a distributed spiking neural network simulator (DPSNN). Two-dimensional grids of cortical columns composed of point-like spiking neurons have been connected by up to 30 billion synapses using exponential and Gaussian connectivity models. Up to 1024 hardware cores, hosted on a 64-node server platform, executed the MPI processes composing the distributed simulator. The hardware platform was a cluster of IBM NX360 M5 16-core compute nodes, each one containing two Intel Xeon Haswell 8-core E5-2630 v3 processors with a clock of 2.40 GHz, interconnected through an InfiniBand network. This study is conducted in the framework of the CORTICONIC FET project, also in view of the next-to-start activities foreseen as part of the Human Brain Project (HBP), SubProject 3 Cognitive and Systems Neuroscience, WaveScalES work-package.
Submitted 16 December, 2015;
originally announced December 2015.
-
Scaling to 1024 software processes and hardware cores of the distributed simulation of a spiking neural network including up to 20G synapses
Authors:
Elena Pastorelli,
Pier Stanislao Paolucci,
Roberto Ammendola,
Andrea Biagioni,
Ottorino Frezza,
Francesca Lo Cicero,
Alessandro Lonardo,
Michele Martinelli,
Francesco Simula,
Piero Vicini
Abstract:
This short report describes the scaling, up to 1024 software processes and hardware cores, of a distributed simulator of plastic spiking neural networks. A previous report demonstrated good scalability of the simulator up to 128 processes. Herein we extend the speed-up measurements and strong and weak scaling analysis of the simulator to the range between 1 and 1024 software processes and hardware cores. We simulated two-dimensional grids of cortical columns including up to ~20G synapses connecting ~11M neurons. The neural network was distributed over a set of MPI processes and the simulations were run on a server platform composed of up to 64 dual-socket nodes, each socket equipped with Intel Haswell E5-2630 v3 processors (8 cores @ 2.4 GHz clock). All nodes are interconnected through an InfiniBand network. The DPSNN simulator has been developed by INFN in the framework of the EURETILE and CORTICONIC European FET Projects and will be used by the WaveScalES team in the framework of the Human Brain Project (HBP), SubProject 2 - Cognitive and Systems Neuroscience. This report lays the groundwork for a more thorough comparison with the neural simulation tool NEST.
Submitted 30 November, 2015;
originally announced November 2015.
-
Power, Energy and Speed of Embedded and Server Multi-Cores applied to Distributed Simulation of Spiking Neural Networks: ARM in NVIDIA Tegra vs Intel Xeon quad-cores
Authors:
Pier Stanislao Paolucci,
Roberto Ammendola,
Andrea Biagioni,
Ottorino Frezza,
Francesca Lo Cicero,
Alessandro Lonardo,
Michele Martinelli,
Elena Pastorelli,
Francesco Simula,
Piero Vicini
Abstract:
This short note compares the instantaneous power, total energy consumption, execution time and energetic cost per synaptic event of a spiking neural network simulator (DPSNN-STDP) distributed on MPI processes when executed either on an embedded platform (based on a dual-socket quad-core ARM platform) or a server platform (Intel-based quad-core dual-socket platform). We also compare the measurements with those reported by leading custom and semi-custom designs: TrueNorth and SpiNNaker. In summary, we observed that: 1- we spent 2.2 micro-Joule per simulated event on the "embedded platform", approx. 4.4 times lower than what was spent by the "server platform"; 2- the instantaneous power consumption of the "embedded platform" was 14.4 times lower than that of the "server" one; 3- the server platform is a factor of 3.3 faster. The "embedded platform" is made of NVIDIA Jetson TK1 boards, interconnected by Ethernet, each mounting a Tegra K1 chip including a quad-core ARM Cortex-A15 at 2.3 GHz. The "server platform" is based on dual-socket quad-core Intel Xeon CPUs (E5620 at 2.4 GHz). The measurements were obtained with the DPSNN-STDP simulator (Distributed Simulator of Polychronous Spiking Neural Network with synaptic Spike Timing Dependent Plasticity) developed by INFN, which has already proved its efficient scalability and execution speed-up on hundreds of similar "server" cores and MPI processes, applied to neural nets composed of several billions of synapses.
Submitted 12 May, 2015;
originally announced May 2015.
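The three ratios quoted in the abstract above can be cross-checked against each other. Since energy is power multiplied by time, the energy ratio between the two platforms should be roughly the power ratio divided by the speed ratio. The following is a minimal sketch of that check, not code from the paper; it assumes both platforms ran the same workload, and the function and constant names are hypothetical.

```python
# Consistency check on the ratios quoted in the abstract above.
# Assumption: both platforms simulate the same workload, so
# energy ratio = (power ratio) / (speed ratio), because energy = power x time.

POWER_RATIO = 14.4            # server power / embedded power (from the abstract)
SPEED_RATIO = 3.3             # embedded run time / server run time (from the abstract)
EMBEDDED_UJ_PER_EVENT = 2.2   # microjoules per synaptic event (from the abstract)


def energy_ratio(power_ratio: float, speed_ratio: float) -> float:
    """Server energy divided by embedded energy for the same workload."""
    return power_ratio / speed_ratio


def server_energy_per_event(embedded_uj: float, ratio: float) -> float:
    """Implied server cost per synaptic event, in microjoules."""
    return embedded_uj * ratio


if __name__ == "__main__":
    ratio = energy_ratio(POWER_RATIO, SPEED_RATIO)
    server_uj = server_energy_per_event(EMBEDDED_UJ_PER_EVENT, ratio)
    print(f"implied energy ratio: {ratio:.2f}")
    print(f"implied server cost: {server_uj:.1f} uJ per synaptic event")
```

The implied ratio agrees with the "approx. 4.4 times" figure stated in the abstract to within rounding, which suggests the three quoted numbers are mutually consistent.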
-
EURETILE D7.3 - Dynamic DAL benchmark coding, measurements on MPI version of DPSNN-STDP (distributed plastic spiking neural net) and improvements to other DAL codes
Authors:
Pier Stanislao Paolucci,
Iuliana Bacivarov,
Devendra Rai,
Lars Schor,
Lothar Thiele,
Hoeseok Yang,
Elena Pastorelli,
Roberto Ammendola,
Andrea Biagioni,
Ottorino Frezza,
Francesca Lo Cicero,
Alessandro Lonardo,
Francesco Simula,
Laura Tosoratto,
Piero Vicini
Abstract:
The EURETILE project required the selection and coding of a set of dedicated benchmarks. The project is about the software and hardware architecture of future many-tile distributed fault-tolerant systems. We focus on dynamic workloads characterised by heavy numerical processing requirements. The ambition is to identify common techniques that could be applied to both the Embedded Systems and HPC domains. This document is the first public deliverable of Work Package 7: Challenging Tiled Applications.
Submitted 20 August, 2014;
originally announced August 2014.
-
NaNet: a Low-Latency, Real-Time, Multi-Standard Network Interface Card with GPUDirect Features
Authors:
A. Lonardo,
F. Ameli,
R. Ammendola,
A. Biagioni,
O. Frezza,
G. Lamanna,
F. Lo Cicero,
M. Martinelli,
P. S. Paolucci,
E. Pastorelli,
L. Pontisso,
D. Rossetti,
F. Simeone,
F. Simula,
M. Sozzi,
L. Tosoratto,
P. Vicini
Abstract:
While the GPGPU paradigm is widely recognized as an effective approach to high performance computing, its adoption in low-latency, real-time systems is still in its early stages.
Although GPUs typically show deterministic behaviour in terms of latency in executing computational kernels as soon as data is available in their internal memories, assessment of the real-time features of a standard GPGPU system needs careful characterization of all subsystems along the data stream path.
The networking subsystem turns out to be the most critical one in terms of absolute value and fluctuations of its response latency.
Our envisioned solution to this issue is NaNet, an FPGA-based PCIe Network Interface Card (NIC) design featuring a configurable and extensible set of network channels with direct access through GPUDirect to NVIDIA Fermi/Kepler GPU memories.
The NaNet design currently supports both standard - GbE (1000BASE-T) and 10GbE (10GBASE-R) - and custom - 34 Gbps APElink and 2.5 Gbps deterministic latency KM3link - channels, but its modularity allows for a straightforward inclusion of other link technologies.
To avoid host OS intervention on the data stream and remove a possible source of jitter, the design includes a network/transport layer offload module with cycle-accurate, upper-bound latency, supporting UDP, KM3link Time Division Multiplexing and APElink protocols.
After a description of the NaNet architecture and its latency/bandwidth characterization for all supported links, two real-world use cases are presented: the GPU-based low level trigger for the RICH detector in the NA62 experiment at CERN and the on-/off-shore data link for the KM3 underwater neutrino telescope.
Submitted 13 June, 2014;
originally announced June 2014.
-
Applications of Many-Core Technologies to On-line Event Reconstruction in High Energy Physics Experiments
Authors:
A. Gianelle,
S. Amerio,
D. Bastieri,
M. Corvo,
W. Ketchum,
T. Liu,
A. Lonardo,
D. Lucchesi,
S. Poprocki,
R. Rivera,
L. Tosoratto,
P. Vicini,
P. Wittich
Abstract:
Interest in many-core architectures applied to real time selections is growing in High Energy Physics (HEP) experiments. In this paper we describe performance measurements of many-core devices when applied to a typical HEP online task: the selection of events based on the trajectories of charged particles. We use as benchmark a scaled-up version of the algorithm used at the CDF experiment at the Tevatron for online track reconstruction - the SVT algorithm - as a realistic test-case for low-latency trigger systems using new computing architectures for LHC experiments. We examine the complexity/performance trade-off in porting existing serial algorithms to many-core devices. We measure the performance of different architectures (Intel Xeon Phi and AMD GPUs, in addition to NVIDIA GPUs) and different software environments (OpenCL, in addition to NVIDIA CUDA). Measurements of both data processing and data transfer latency are shown, considering different I/O strategies to/from the many-core devices.
Submitted 4 December, 2013; v1 submitted 3 December, 2013;
originally announced December 2013.
-
NaNet: a flexible and configurable low-latency NIC for real-time trigger systems based on GPUs
Authors:
R. Ammendola,
A. Biagioni,
O. Frezza,
G. Lamanna,
A. Lonardo,
F. Lo Cicero,
P. S. Paolucci,
F. Pantaleo,
D. Rossetti,
F. Simula,
M. Sozzi,
L. Tosoratto,
P. Vicini
Abstract:
NaNet is an FPGA-based PCIe X8 Gen2 NIC supporting 1/10 GbE links and the custom 34 Gbps APElink channel. The design has GPUDirect RDMA capabilities and features a network stack protocol offloading module, making it suitable for building low-latency, real-time GPU-based computing systems. We provide a detailed description of the NaNet hardware modular architecture. Benchmarks for latency and bandwidth for GbE and APElink channels are presented, followed by a performance analysis on the case study of the GPU-based low level trigger for the RICH detector in the NA62 CERN experiment, using either the NaNet GbE or APElink channels. Finally, we give an outline of the project's future activities.
Submitted 9 January, 2014; v1 submitted 15 November, 2013;
originally announced November 2013.
-
Architectural improvements and 28 nm FPGA implementation of the APEnet+ 3D Torus network for hybrid HPC systems
Authors:
Roberto Ammendola,
Andrea Biagioni,
Ottorino Frezza,
Francesca Lo Cicero,
Pier Stanislao Paolucci,
Alessandro Lonardo,
Davide Rossetti,
Francesco Simula,
Laura Tosoratto,
Piero Vicini
Abstract:
Modern Graphics Processing Units (GPUs) are now considered accelerators for general purpose computation. A tight interaction between the GPU and the interconnection network is the strategy to express the full capability-computing potential of a multi-GPU system on large HPC clusters; that is the reason why an efficient and scalable interconnect is a key technology to finally deliver GPUs for scientific HPC. In this paper we show the latest architectural and performance improvements of the APEnet+ network fabric, an FPGA-based PCIe board with 6 fully bidirectional off-board links with 34 Gbps of raw bandwidth per direction, and X8 Gen2 bandwidth towards the host PC. The board implements a Remote Direct Memory Access (RDMA) protocol that leverages the peer-to-peer (P2P) capabilities of Fermi- and Kepler-class NVIDIA GPUs to obtain real zero-copy, low-latency GPU-to-GPU transfers. Finally, we report on the development activities for 2013, focusing on the adoption of the latest generation 28 nm FPGAs and the preliminary tests performed on this new platform.
Submitted 14 November, 2013; v1 submitted 7 November, 2013;
originally announced November 2013.
-
Many-core applications to online track reconstruction in HEP experiments
Authors:
S. Amerio,
D. Bastieri,
M. Corvo,
A. Gianelle,
W. Ketchum,
T. Liu,
A. Lonardo,
D. Lucchesi,
S. Poprocki,
R. Rivera,
L. Tosoratto,
P. Vicini,
P. Wittich
Abstract:
Interest in parallel architectures applied to real time selections is growing in High Energy Physics (HEP) experiments. In this paper we describe performance measurements of Graphics Processing Units (GPUs) and Intel Many Integrated Core architecture (MIC) when applied to a typical HEP online task: the selection of events based on the trajectories of charged particles. We use as benchmark a scaled-up version of the algorithm used at the CDF experiment at the Tevatron for online track reconstruction - the SVT algorithm - as a realistic test-case for low-latency trigger systems using new computing architectures for LHC experiments. We examine the complexity/performance trade-off in porting existing serial algorithms to many-core devices. Measurements of both data processing and data transfer latency are shown, considering different I/O strategies to/from the parallel devices.
Submitted 11 November, 2013; v1 submitted 2 November, 2013;
originally announced November 2013.
-
Distributed simulation of polychronous and plastic spiking neural networks: strong and weak scaling of a representative mini-application benchmark executed on a small-scale commodity cluster
Authors:
Pier Stanislao Paolucci,
Roberto Ammendola,
Andrea Biagioni,
Ottorino Frezza,
Francesca Lo Cicero,
Alessandro Lonardo,
Elena Pastorelli,
Francesco Simula,
Laura Tosoratto,
Piero Vicini
Abstract:
We introduce a natively distributed mini-application benchmark representative of plastic spiking neural network simulators. It can be used to measure the performance of existing computing platforms and to drive the development of future parallel/distributed computing systems dedicated to the simulation of plastic spiking networks. The mini-application is designed to generate spiking behaviors and synaptic connectivity that do not change when the number of hardware processing nodes is varied, simplifying the quantitative study of scalability on commodity and custom architectures. Here, we present the strong and weak scaling and the profiling of the computational/communication components of the DPSNN-STDP benchmark (Distributed Simulation of Polychronous Spiking Neural Network with synaptic Spike-Timing Dependent Plasticity). In this first test, we used the benchmark to exercise a small-scale cluster of commodity processors (varying the number of used physical cores from 1 to 128). The cluster was interconnected through a commodity network. Bidimensional grids of columns composed of Izhikevich neurons projected synapses locally and toward first, second and third neighboring columns. The size of the simulated network varied from 6.6 Giga synapses down to 200 K synapses. The code proved to be fast and scalable: 10 wall clock seconds were required to simulate one second of activity and plasticity (per Hertz of average firing rate) of a network composed of 3.2 G synapses running on 128 hardware cores clocked @ 2.4 GHz. The mini-application has been designed to be easily interfaced with standard and custom software and hardware communication interfaces. It has been designed from its foundation to be natively distributed and parallel, and should not pose major obstacles to distribution and parallelization on several platforms.
Submitted 14 April, 2014; v1 submitted 31 October, 2013;
originally announced October 2013.
-
GPU peer-to-peer techniques applied to a cluster interconnect
Authors:
Roberto Ammendola,
Massimo Bernaschi,
Andrea Biagioni,
Mauro Bisson,
Massimiliano Fatica,
Ottorino Frezza,
Francesca Lo Cicero,
Alessandro Lonardo,
Enrico Mastrostefano,
Pier Stanislao Paolucci,
Davide Rossetti,
Francesco Simula,
Laura Tosoratto,
Piero Vicini
Abstract:
Modern GPUs support special protocols to exchange data directly across the PCI Express bus. While these protocols could be used to reduce GPU data transmission times, basically by avoiding staging to host memory, they require specific hardware features which are not available on current generation network adapters. In this paper we describe the architectural modifications required to implement peer-to-peer access to NVIDIA Fermi- and Kepler-class GPUs on an FPGA-based cluster interconnect. The current software implementation, which integrates this feature by minimally extending the RDMA programming model, is also discussed, along with some issues raised while employing it in a higher level API like MPI. Finally, the current limits of the technique are studied by analyzing the performance improvements on low-level benchmarks and on two GPU-accelerated applications, showing when and how they seem to benefit from the GPU peer-to-peer method.
Submitted 31 July, 2013;
originally announced July 2013.
-
A heterogeneous many-core platform for experiments on scalable custom interconnects and management of fault and critical events, applied to many-process applications: Vol. II, 2012 technical report
Authors:
Roberto Ammendola,
Andrea Biagioni,
Ottorino Frezza,
Werner Geurts,
Gert Goossens,
Francesca Lo Cicero,
Alessandro Lonardo,
Pier Stanislao Paolucci,
Davide Rossetti,
Francesco Simula,
Laura Tosoratto,
Piero Vicini
Abstract:
This is the second of a planned collection of four yearly volumes describing the deployment of a heterogeneous many-core platform for experiments on scalable custom interconnects and management of fault and critical events, applied to many-process applications. This volume covers several topics, among which: 1- a system for awareness of faults and critical events (named LO|FA|MO) on experimental heterogeneous many-core hardware platforms; 2- the integration and test of the experimental hardware heterogeneous many-core platform QUoNG, based on the APEnet+ custom interconnect; 3- the design of a Software-Programmable Distributed Network Processor architecture (DNP) using ASIP technology; 4- the initial stages of design of a new DNP generation onto a 28nm FPGA. These developments were performed in the framework of the EURETILE European Project under the Grant Agreement no. 247846.
Submitted 4 July, 2013;
originally announced July 2013.
-
'Mutual Watch-dog Networking': Distributed Awareness of Faults and Critical Events in Petascale/Exascale systems
Authors:
Roberto Ammendola,
Andrea Biagioni,
Ottorino Frezza,
Francesca Lo Cicero,
Alessandro Lonardo,
Pier Stanislao Paolucci,
Davide Rossetti,
Francesco Simula,
Laura Tosoratto,
Piero Vicini
Abstract:
Many-tile systems require techniques to be applied to increase component resilience and control the FIT (Failures In Time) rate. When scaling to peta- and exa-scale systems the FIT rate may become unacceptable due to the sheer number of components, requiring more systemic countermeasures. Thus, the ability to be fault aware, i.e. to detect and collect information about faults and critical events, is a necessary feature that large scale distributed architectures must provide in order to apply systemic fault tolerance techniques. In this context, the LO|FA|MO approach is a way to obtain systemic fault awareness, by implementing a mutual watchdog mechanism and guaranteeing fault detection in a no-single-point-of-failure fashion. This document contains specification and implementation details about this approach, in the shape of a technical report.
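The idea of a mutual watchdog can be illustrated with a toy simulation. This is a minimal sketch of the general technique, not the LO|FA|MO design: the class, the tick-based timeout, and the "host"/"nic" pairing are all hypothetical. Each side increments its own heartbeat counter and checks that the peer's counter keeps advancing; because both sides watch each other, whichever one survives can still raise the alarm.

```python
from dataclasses import dataclass


@dataclass
class Watchdog:
    """One side of a mutual watchdog pair (hypothetical illustration)."""
    name: str
    timeout: int           # ticks of silence tolerated before declaring a fault
    heartbeat: int = 0     # counter this node increments while it is alive
    last_seen: int = 0     # last peer heartbeat value observed
    silent_ticks: int = 0  # consecutive ticks without peer progress

    def beat(self):
        self.heartbeat += 1

    def check(self, peer_heartbeat):
        """Return True if the peer is considered faulty."""
        if peer_heartbeat != self.last_seen:
            self.last_seen = peer_heartbeat
            self.silent_ticks = 0
        else:
            self.silent_ticks += 1
        return self.silent_ticks >= self.timeout


def simulate(ticks, host_dies_at):
    """Run both watchdogs; return (detector, faulty, tick) or None."""
    host = Watchdog("host", timeout=3)
    nic = Watchdog("nic", timeout=3)
    for t in range(ticks):
        host_alive = t < host_dies_at
        if host_alive:
            host.beat()
        nic.beat()
        # Each live side inspects the other's counter.
        if host_alive and host.check(nic.heartbeat):
            return (host.name, nic.name, t)
        if nic.check(host.heartbeat):
            return (nic.name, host.name, t)
    return None


print(simulate(ticks=10, host_dies_at=4))
```

In this toy run the host stops beating at tick 4 and the other side flags it three silent ticks later. A real system would also need a separate path for the survivor to report the fault, which the sketch leaves out.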
Submitted 2 July, 2013; v1 submitted 1 July, 2013;
originally announced July 2013.
-
The Distributed Network Processor: a novel off-chip and on-chip interconnection network architecture
Authors:
Andrea Biagioni,
Francesca Lo Cicero,
Alessandro Lonardo,
Pier Stanislao Paolucci,
Mersia Perra,
Davide Rossetti,
Carlo Sidore,
Francesco Simula,
Laura Tosoratto,
Piero Vicini
Abstract:
One of the most demanding challenges for the designers of parallel computing architectures is to deliver an efficient network infrastructure providing low latency, high bandwidth communications while preserving scalability. Besides off-chip communications between processors, recent multi-tile (i.e. multi-core) architectures face the challenge of providing an efficient on-chip interconnection network between processor tiles. In this paper, we present a configurable and scalable architecture, based on our Distributed Network Processor (DNP) IP Library, targeting systems ranging from single MPSoCs to massive HPC platforms. The DNP provides inter-tile services for both on-chip and off-chip communications with a uniform RDMA style API, over a multi-dimensional direct network with a (possibly) hybrid topology.
Submitted 7 March, 2012;
originally announced March 2012.
-
APEnet+: high bandwidth 3D torus direct network for petaflops scale commodity clusters
Authors:
Roberto Ammendola,
Andrea Biagioni,
Ottorino Frezza,
Francesca Lo Cicero,
Alessandro Lonardo,
Pier Stanislao Paolucci,
Davide Rossetti,
Andrea Salamon,
Gaetano Salina,
Francesco Simula,
Laura Tosoratto,
Piero Vicini
Abstract:
We describe herein the APElink+ board, a PCIe interconnect adapter featuring the latest advances in wire speed and interface technology plus hardware support for an RDMA programming model and experimental acceleration of GPU networking; this design allows us to build a low latency, high bandwidth PC cluster, the APEnet+ network, the new generation of our cost-effective, tens-of-thousands-scalable cluster network architecture. Some test results and characterization of data transmission of a complete testbench, based on a commercial development card mounting an Altera FPGA, are provided.
Submitted 18 February, 2011;
originally announced February 2011.
-
APEnet+: a 3D toroidal network enabling Petaflops scale Lattice QCD simulations on commodity clusters
Authors:
Roberto Ammendola,
Andrea Biagioni,
Ottorino Frezza,
Francesca Lo Cicero,
Alessandro Lonardo,
Pier Paolucci,
Roberto Petronzio,
Davide Rossetti,
Andrea Salamon,
Gaetano Salina,
Francesco Simula,
Nazario Tantalo,
Laura Tosoratto,
Piero Vicini
Abstract:
Many scientific computations need multi-node parallelism to meet ever-increasing requirements in both space (memory) and time (speed). The use of GPUs as accelerators introduces yet another level of complexity for the programmer and may potentially result in large overheads due to the complex memory hierarchy. Additionally, the most demanding problems may easily employ more than a Petaflops of sustained computing power, requiring thousands of GPUs orchestrated with some parallel programming model. Here we describe APEnet+, the new generation of our interconnect, which scales up to tens of thousands of nodes with linear cost, thus improving the price/performance ratio on large clusters. The project target is the development of the APElink+ host adapter featuring a low latency, high bandwidth direct network, state-of-the-art wire speeds on the links and a PCIe x8 Gen2 host interface. It features hardware support for the RDMA programming model and experimental acceleration of GPU networking. A Linux kernel driver, a set of low-level RDMA APIs and an OpenMPI library driver are available, allowing for painless porting of standard applications. Finally, we give some insight into future work and intended developments.
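The 3D toroidal direct network named in the title can be illustrated with a small addressing sketch. It shows only the generic textbook properties of a 3D torus (six neighbours per node, wrap-around on each axis, shortest path taken the short way round each ring). The function names are hypothetical and nothing here is taken from the APEnet+ routing logic.

```python
def torus_neighbors(coord, dims):
    """Six nearest neighbours of a node in a 3D torus (wrap-around on each axis)."""
    neighbors = []
    for axis in range(3):
        for step in (+1, -1):
            n = list(coord)
            n[axis] = (n[axis] + step) % dims[axis]
            neighbors.append(tuple(n))
    return neighbors


def torus_hops(a, b, dims):
    """Minimal hop count between two nodes, going the shorter way round each ring."""
    return sum(min((x - y) % d, (y - x) % d) for x, y, d in zip(a, b, dims))


# A corner node of a 4 x 4 x 4 torus still has six neighbours thanks to wrap-around.
print(torus_neighbors((0, 0, 0), (4, 4, 4)))
print(torus_hops((0, 0, 0), (3, 2, 1), (4, 4, 4)))
```

Because every node has the same fixed number of links regardless of the machine size, the number of links grows in proportion to the number of nodes, which is consistent with the linear cost scaling the abstract mentions.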
Submitted 1 December, 2010;
originally announced December 2010.
-
C++ programming language for an abstract massively parallel SIMD architecture
Authors:
Alessandro Lonardo,
Emanuele Panizzi,
Benedetto Proietti
Abstract:
The aim of this work is to define and implement an extended C++ language to support the SIMD programming paradigm. The C++ programming language has been extended to express the full potential of an abstract SIMD machine consisting of a central Control Processor and an N-dimensional toroidal array of Numeric Processors. Very few extensions have been added to standard C++, with the goal of minimising the effort required of the programmer to learn a new language while keeping the performance of the compiled code very high. The proposed language has been implemented as a port of the GNU C++ Compiler to a SIMD supercomputer.
Submitted 19 May, 2000;
originally announced May 2000.