-
Double-Exponential Increases in Inference Energy: The Cost of the Race for Accuracy
Authors:
Zeyu Yang,
Karel Adamek,
Wesley Armour
Abstract:
Deep learning models in computer vision have achieved significant success but pose increasing concerns about energy consumption and sustainability. Despite these concerns, there is a lack of comprehensive understanding of their energy efficiency during inference. In this study, we conduct a comprehensive analysis of the inference energy consumption of 1,200 ImageNet classification models - the largest evaluation of its kind to date. Our findings reveal a steep diminishing return in accuracy gains relative to the increase in energy usage, highlighting sustainability concerns in the pursuit of marginal improvements. We identify key factors contributing to energy consumption and demonstrate methods to improve energy efficiency. To promote more sustainable AI practices, we introduce an energy efficiency scoring system and develop an interactive web application that allows users to compare models based on accuracy and energy consumption. By providing extensive empirical data and practical tools, we aim to facilitate informed decision-making and encourage collaborative efforts in developing energy-efficient AI technologies.
Submitted 12 December, 2024;
originally announced December 2024.
-
Toward using GANs in astrophysical Monte-Carlo simulations
Authors:
Ahab Isaac,
Wesley Armour,
Karel Adámek
Abstract:
Accurate modelling of spectra produced by X-ray sources requires the use of Monte-Carlo simulations. These simulations need to evaluate physical processes, such as those occurring in accretion processes around compact objects by sampling a number of different probability distributions. This is computationally time-consuming and could be sped up if replaced by neural networks. We demonstrate, on an example of the Maxwell-Jüttner distribution that describes the speed of relativistic electrons, that the generative adversarial network (GAN) is capable of statistically replicating the distribution. The average value of the Kolmogorov-Smirnov test is 0.5 for samples generated by the neural network, showing that the generated distribution cannot be distinguished from the true distribution.
Submitted 16 February, 2024;
originally announced February 2024.
-
Part-time Power Measurements: nvidia-smi's Lack of Attention
Authors:
Zeyu Yang,
Karel Adamek,
Wesley Armour
Abstract:
The GPU has emerged as the go-to accelerator for high throughput and parallel workloads, spanning scientific simulations to AI, thanks to its performance and power efficiency. Given that 6 out of the top 10 fastest supercomputers in the world use NVIDIA GPUs and many AI companies each employ tens of thousands of NVIDIA GPUs, an accurate understanding of GPU power consumption is essential for making progress to further improve its efficiency. Despite the limited documentation and the lack of understanding of its mechanisms, NVIDIA GPUs' built-in power sensor, providing easily accessible power readings via the nvidia-smi interface, is widely used in energy efficient computing research on GPUs. Our study seeks to elucidate the internal mechanisms of the power readings provided by nvidia-smi and assess the accuracy of the power and energy consumption data. We have developed a suite of micro-benchmarks to profile the behaviour of nvidia-smi power readings and have evaluated them on over 70 different GPUs from all architectural generations since power measurement was first introduced in the 'Fermi' generation. We have identified several unforeseen problems with power and energy measurement using nvidia-smi. For example, on the A100 and H100 GPUs only 25% of the runtime is sampled for power consumption; during the other 75% of the time, the GPU can be drawing drastically different power, and nvidia-smi and the results it presents are unaware of this. This, along with other findings, can lead to a drastic under- or overestimation of energy consumed, especially when considering data centres housing tens of thousands of GPUs. We propose several good practices that help to mitigate these problems. By comparing our results to those measured from an external power-meter, we have reduced the error in the energy measurement by an average of 35% and in some cases by as much as 65% in the test cases we present.
Submitted 12 December, 2024; v1 submitted 5 December, 2023;
originally announced December 2023.
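The partial-sampling problem described in the abstract above can be illustrated with a toy model. The sketch below is not the authors' benchmark and does not model how nvidia-smi works internally: the square-wave load, its 100 ms period, the wattages and the window positions are all hypothetical. Only the 25% coverage figure comes from the abstract. It shows how a sensor that observes a quarter of each period can over- or underestimate average power depending on how its window lines up with the workload.

```python
def true_power(t_ms):
    """Hypothetical load: 300 W for the first 25 ms of every 100 ms period, 100 W otherwise."""
    return 300.0 if (t_ms % 100) < 25 else 100.0


def mean_power(duration_ms, sampled=lambda t_ms: True):
    """Average power (W) over the 1 ms ticks for which sampled(t_ms) is true."""
    values = [true_power(t) for t in range(duration_ms) if sampled(t)]
    return sum(values) / len(values)


duration_ms = 1000

# Reference: every millisecond is observed.
full = mean_power(duration_ms)

# 25% coverage, window aligned with the busy phase (assumed alignment).
aligned = mean_power(duration_ms, lambda t: (t % 100) < 25)

# Same 25% coverage, window falling entirely in the idle phase (assumed alignment).
shifted = mean_power(duration_ms, lambda t: 50 <= (t % 100) < 75)

# Energy in joules is average power times elapsed time.
for label, watts in (("full", full), ("aligned", aligned), ("shifted", shifted)):
    print(f"{label:8s} {watts:6.1f} W  {watts * duration_ms / 1000.0:6.1f} J")
```

In this toy case the same workload is reported as twice its true average power or two thirds of it, purely as a function of window alignment.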
-
Accelerating Dedispersion using Many-Core Architectures
Authors:
Jan Novotný,
Karel Adámek,
M. A. Clark,
Mike Giles,
Wesley Armour
Abstract:
Astrophysical radio signals are excellent probes of the extreme physical processes that emit them. However, to reach Earth, electromagnetic radiation passes through the ionised interstellar medium (ISM), introducing a frequency-dependent time delay (dispersion) to the emitted signal. Removing dispersion enables searches for transient signals like Fast Radio Bursts (FRB) or repeating signals from isolated pulsars or those in orbit around other compact objects. The sheer volume and high resolution of data that next generation radio telescopes will produce require High-Performance Computing (HPC) solutions and algorithms to be used in time-domain data processing pipelines to extract scientifically valuable results in real-time. This paper presents a state-of-the-art implementation of brute force incoherent dedispersion on NVIDIA GPUs, and on Intel and AMD CPUs. We show that our implementation is 4x faster (8-bit 8192 channels input) than other available solutions and demonstrate, using 11 existing telescopes, that our implementation is at least 20x faster than real-time. This work is part of the AstroAccelerate package.
Submitted 9 November, 2023;
originally announced November 2023.
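The frequency-dependent delay mentioned in the abstract above follows the standard cold-plasma relation, in which the delay scales with the dispersion measure (DM) and the inverse square of the observing frequency. The sketch below is a minimal pure-Python reference for brute-force incoherent dedispersion: shift each frequency channel by its delay and sum. It is not the AstroAccelerate implementation; the function names, the channel frequencies, the DM and the sampling time are illustrative assumptions.

```python
K_DM = 4.148808e3  # dispersion constant in s MHz^2 pc^-1 cm^3


def delay_s(dm, f_mhz, f_ref_mhz):
    """Delay in seconds of frequency f_mhz relative to f_ref_mhz for a DM in pc cm^-3."""
    return K_DM * dm * (f_mhz ** -2 - f_ref_mhz ** -2)


def dedisperse(data, freqs_mhz, dm, dt_s):
    """Sum channels after removing each channel's delay relative to the top frequency.

    data[c][t] is the sample of channel c at time index t; dt_s is the sampling time.
    """
    f_ref = max(freqs_mhz)
    shifts = [round(delay_s(dm, f, f_ref) / dt_s) for f in freqs_mhz]
    n_out = len(data[0]) - max(shifts)
    return [sum(ch[t + s] for ch, s in zip(data, shifts)) for t in range(n_out)]


# Synthetic example (all values hypothetical): a unit pulse emitted at sample 10,
# arriving later in the lower-frequency channels.
freqs = [1500.0, 1400.0, 1300.0, 1200.0]
dm, dt, t0, n = 100.0, 1e-3, 10, 200
data = []
for f in freqs:
    channel = [0.0] * n
    channel[t0 + round(delay_s(dm, f, max(freqs)) / dt)] = 1.0
    data.append(channel)

series = dedisperse(data, freqs, dm, dt)
print(series.index(max(series)), max(series))
```

A search pipeline repeats this for many trial DMs, which is what makes the brute-force approach computationally expensive.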
-
A Survey of Feature detection methods for localisation of plain sections of Axial Brain Magnetic Resonance Imaging
Authors:
Jiří Martinů,
Jan Novotný,
Karel Adámek,
Petr Čermák,
Jiří Kozel,
David Školoudík
Abstract:
Matching MRI brain images between patients or mapping patients' MRI slices to the simulated atlas of a brain is key to the automatic registration of MRI of a brain. The ability to match MRI images would also enable such applications as indexing and searching MRI images among multiple patients or selecting images from the region of interest. In this work, we have introduced robustness, accuracy and cumulative distance metrics and a methodology that allow us to compare different techniques and approaches in matching brain MRI of different patients or matching an MRI brain slice to a position in the brain atlas. To that end, we have used feature detection methods AGAST, AKAZE, BRISK, GFTT, HardNet, and ORB, which are established methods in image processing, and compared them on their resistance to image degradation and their ability to match the same brain MRI slice of different patients. We have demonstrated that some of these techniques can correctly match most of the brain MRI slices of different patients. When matching is performed with the atlas of the human brain, their performance is significantly lower. The best performing feature detection method was a combination of the SIFT detector and the HardNet descriptor, which achieved 93% accuracy in matching images with other patients but only 52% accuracy when matching images to the atlas.
Submitted 8 February, 2023;
originally announced February 2023.
-
Cutting the cost of pulsar astronomy: Saving time and energy when searching for binary pulsars using NVIDIA GPUs
Authors:
Jack White,
Karel Adamek,
Wes Armour
Abstract:
Using the Fourier Domain Acceleration Search (FDAS) method to search for binary pulsars is a computationally costly process. Next generation radio telescopes will have to perform FDAS in real time, as data volumes are too large to store. FDAS is a matched filtering approach for searching time-domain radio astronomy datasets for the signatures of binary pulsars with approximately linear acceleration. In this paper we will explore how we have reduced the energy cost of an SKA-like implementation of FDAS in AstroAccelerate, utilising a combination of mixed-precision computing and dynamic frequency scaling on NVIDIA GPUs. Combining the two approaches, we have managed to save 58% of the overall energy cost of FDAS with a (<3%) sacrifice in numerical sensitivity.
Submitted 24 November, 2022;
originally announced November 2022.
-
Implementing CUDA Streams into AstroAccelerate -- A Case Study
Authors:
Jan Novotný,
Karel Adámek,
Wes Armour
Abstract:
To be able to run tasks asynchronously on NVIDIA GPUs a programmer must explicitly implement asynchronous execution in their code using the syntax of CUDA streams. Streams allow a programmer to launch independent concurrent execution tasks, providing the ability to utilise different functional units on the GPU asynchronously. For example, it is possible to transfer the results from a previous computation performed on input data n-1, over the PCIe bus whilst computing the result for input data n, by placing different tasks in different CUDA streams. The benefit of such an approach is that the time taken for the data transfer between the host and device can be hidden with computation. This case study deals with the implementation of CUDA streams into AstroAccelerate. AstroAccelerate is a GPU accelerated real-time signal processing pipeline for time-domain radio astronomy.
Submitted 6 May, 2021; v1 submitted 4 January, 2021;
originally announced January 2021.
-
Efficiency Near the Edge: Increasing the Energy Efficiency of FFTs on GPUs for Real-time Edge Computing
Authors:
Karel Adámek,
Jan Novotný,
Jeyarajan Thiyagalingam,
Wesley Armour
Abstract:
The Square Kilometre Array (SKA) is an international initiative for developing the world's largest radio telescope with a total collecting area of over a million square meters. The scale of the operation, combined with the remote location of the telescope, requires the use of energy-efficient computational algorithms. This, along with the extreme data rates that will be produced by the SKA and the requirement for a real-time observing capability, necessitates in-situ data processing in an edge style computing solution. More generally, energy efficiency in the modern computing landscape is becoming a paramount concern, whether it be the power budget that can limit some of the world's largest supercomputers, or the limited power available to the smallest Internet-of-Things devices. In this paper, we study the impact of hardware frequency scaling on the energy consumption and execution time of the Fast Fourier Transform (FFT) on NVIDIA GPUs using the cuFFT library. The FFT is used in many areas of science and it is one of the key algorithms used in radio astronomy data processing pipelines. Through the use of frequency scaling, we show that we can lower the power consumption of the NVIDIA V100 GPU when computing the FFT by up to 60% compared to the boost clock frequency, with less than a 10% increase in the execution time. Furthermore, using one common core clock frequency for all tested FFT lengths, we show on average a 50% reduction in power consumption compared to the boost core clock frequency with an increase in the execution time still below 10%. We demonstrate how these results can be used to lower the power consumption of existing data processing pipelines. These savings, when considered over years of operation, can yield significant financial savings, but can also lead to a significant reduction of greenhouse gas emissions.
Submitted 9 November, 2021; v1 submitted 13 September, 2020;
originally announced September 2020.
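The abstract above quotes power and execution-time changes separately. Since energy is average power multiplied by execution time, the two can be combined into a rough energy estimate. The sketch below does only that arithmetic; the energy figures it prints are not results reported in the abstract, and treating the quoted percentages as average power over the run, paired with the worst-case 10% slowdown, is an assumption made here for illustration. The function name is hypothetical.

```python
def relative_energy(power_reduction, time_increase):
    """Energy relative to the baseline, given fractional power and time changes.

    E = P * t, so E_new / E_old = (1 - power_reduction) * (1 + time_increase).
    """
    return (1.0 - power_reduction) * (1.0 + time_increase)


# Headline power figures from the abstract, each paired with a 10% slowdown.
best_case = relative_energy(0.60, 0.10)     # "up to 60%" lower power
common_clock = relative_energy(0.50, 0.10)  # "on average 50%" with one common clock

print(f"best case:    {best_case:.2f} of baseline energy")
print(f"common clock: {common_clock:.2f} of baseline energy")
```

Because the abstract states the slowdown stays below 10%, these ratios are upper bounds on the relative energy under the stated assumptions.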
-
GPU Fast Convolution via the Overlap-and-Save Method in Shared Memory
Authors:
Karel Adámek,
Sofia Dimoudi,
Mike Giles,
Wesley Armour
Abstract:
We present an implementation of the overlap-and-save method, a method for the convolution of very long signals with short response functions, which is tailored to GPUs. We have implemented several FFT algorithms (using the CUDA programming language) which exploit GPU shared memory, allowing for GPU accelerated convolution. We compare our implementation with an implementation of the overlap-and-save algorithm utilizing the NVIDIA FFT library (cuFFT). We demonstrate that by using a shared memory based FFT we can achieve significant speed-ups for certain problem sizes and lower the memory requirements of the overlap-and-save method on GPUs.
Submitted 10 April, 2020; v1 submitted 4 October, 2019;
originally announced October 2019.
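For readers unfamiliar with the overlap-and-save method named in the abstract above, the following is a minimal CPU reference sketch in pure Python. It is not the authors' shared-memory GPU implementation: the recursive radix-2 FFT, the function names and the block size of 16 are illustrative assumptions. The signal is cut into overlapping blocks, each block is circularly convolved with the zero-padded response via the FFT, and the first len(kernel) - 1 outputs of each block, which are corrupted by wrap-around, are discarded.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalised); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


def circular_convolve(a, b):
    """Circular convolution of two equal-length real sequences via the FFT."""
    n = len(a)
    prod = [p * q for p, q in zip(fft(a), fft(b))]
    return [v.real / n for v in fft(prod, inverse=True)]


def overlap_save(signal, kernel, block=16):
    """Full linear convolution; block must be a power of two larger than len(kernel) - 1."""
    m = len(kernel)
    step = block - (m - 1)           # valid outputs produced per block
    n_out = len(signal) + m - 1      # length of the full linear convolution
    n_blocks = -(-n_out // step)     # ceiling division
    padded_kernel = list(kernel) + [0.0] * (block - m)
    # Prepend m - 1 zeros as history for the first block, pad the tail to a full block.
    x = [0.0] * (m - 1) + list(signal)
    x += [0.0] * (n_blocks * step + (m - 1) - len(x))
    out = []
    for start in range(0, n_out, step):
        y = circular_convolve(x[start:start + block], padded_kernel)
        out.extend(y[m - 1:])        # drop samples corrupted by wrap-around
    return out[:n_out]


print([round(v, 6) for v in overlap_save([1.0, 2.0, 3.0], [1.0, 1.0], block=4)])
```

Each block is independent of the others, which is the property that makes the method attractive for parallel hardware.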
-
A polyphase filter for many-core architectures
Authors:
Karel Adámek,
Jan Novotný,
Wes Armour
Abstract:
In this article we discuss our implementation of a polyphase filter for real-time data processing in radio astronomy. We describe in detail our implementation of the polyphase filter algorithm and its behaviour on three generations of NVIDIA GPU cards, on dual Intel Xeon CPUs and the Intel Xeon Phi (Knights Corner) platforms. All of our implementations aim to exploit the potential for data reuse that the algorithm offers. Our GPU implementations explore two different methods for achieving this: the first makes use of L1/Texture cache, the second uses shared memory. We discuss the usability of each of our implementations along with their behaviours. We measure performance in execution time, which is a critical factor for real-time systems; we also present results in terms of bandwidth (GB/s), compute (GFlop/s) and type conversions (GTc/s). We include a presentation of our results in terms of the sample rate which can be processed in real-time by a chosen platform, which more intuitively describes the expected performance in a signal processing setting. Our findings show that, for the GPUs considered, the performance of our polyphase filter when using lower precision input data is limited by type conversions rather than device bandwidth. We compare these results to an implementation on the Xeon Phi. We show that our Xeon Phi implementation has a performance that is 1.47x to 1.95x greater than our CPU implementation, but it is insufficient to compete with the performance of GPUs. We conclude with a comparison of our best performing code to two other implementations of the polyphase filter, showing that our implementation is faster in nearly all cases. This work forms part of the Astro-Accelerate project, a many-core accelerated real-time data processing library for digital signal processing of time-domain radio astronomy data.
Submitted 21 April, 2016; v1 submitted 11 November, 2015;
originally announced November 2015.
-
The Implementation of a Real-Time Polyphase Filter
Authors:
Karel Adámek,
Jan Novotný,
Wes Armour
Abstract:
In this article we study the suitability of different computational accelerators for the task of real-time data processing. The algorithm used for comparison is the polyphase filter, a standard tool in signal processing and a well established algorithm. We measure performance in FLOPs and execution time, which is a critical factor for real-time systems. For our real-time studies we have chosen a data rate of 6.5GB/s, which is the estimated data rate for a single channel on the SKA's Low Frequency Aperture Array. Our findings show that GPUs are the most likely candidate for real-time data processing. GPUs are better in both performance and power consumption.
Submitted 12 November, 2014;
originally announced November 2014.