-
Multiple right hand side multigrid for domain wall fermions with a multigrid preconditioned block conjugate gradient algorithm
Authors:
Peter A Boyle
Abstract:
We introduce a class of efficient multiple right-hand side multigrid algorithms for domain wall fermions. The simultaneous solution for a modest number of right-hand sides allows for a significant reduction in the time spent solving the coarse grid operator in a multigrid preconditioner. We introduce a preconditioned block conjugate gradient with a multigrid preconditioner, giving additional algorithmic benefit from the multiple right-hand sides. There is also a very significant additional computation-rate benefit from multiple right-hand sides: they both increase the arithmetic intensity in the coarse space and increase the amount of work performed in each subroutine call, leading to excellent performance on modern GPU architectures. Further, the software implementation makes use of vendor linear algebra routines (batched GEMM) that can exploit high-throughput tensor hardware on recent Nvidia, AMD and Intel GPUs. The cost of the coarse space is made sub-dominant in this algorithm, and benchmarks from the Frontier supercomputer system show up to a factor of twenty speed-up over the standard red-black preconditioned conjugate gradient algorithm on a large system with physical quark masses.
Submitted 5 September, 2024;
originally announced September 2024.
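The arithmetic-intensity argument in this abstract can be illustrated with a rough cost model. The sketch below is not taken from the paper: it assumes a dense real m x m matrix applied to n_rhs vectors, with the matrix read once and each vector read and written once. The function name and parameters are hypothetical.

```python
def arithmetic_intensity(m: int, n_rhs: int, bytes_per_word: int = 8) -> float:
    """Flops per byte moved when a dense m x m matrix is applied to n_rhs vectors.

    Simplified model (an assumption, not the paper's operator): the matrix is
    loaded once and reused for every right-hand side.
    """
    flops = 2 * m * m * n_rhs        # one multiply and one add per matrix entry per vector
    words = m * m + 2 * m * n_rhs    # matrix once, plus input and output vectors
    return flops / (words * bytes_per_word)


# Intensity grows with the number of right-hand sides because the matrix
# traffic is amortised over more vectors.
for n in (1, 4, 16):
    print(n, arithmetic_intensity(64, n))
```

In this model the matrix load dominates memory traffic for a single right-hand side, which is why batching several right-hand sides into one matrix-matrix product (GEMM) raises the flops-per-byte ratio.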
-
Curriculum Based Multi-Task Learning for Parkinson's Disease Detection
Authors:
Nikhil J. Dhinagar,
Conor Owens-Walton,
Emily Laltoo,
Christina P. Boyle,
Yao-Liang Chen,
Philip Cook,
Corey McMillan,
Chih-Chien Tsai,
J-J Wang,
Yih-Ru Wu,
Ysbrand van der Werf,
Paul M. Thompson
Abstract:
There is great interest in developing radiological classifiers for diagnosis, staging, and predictive modeling in progressive diseases such as Parkinson's disease (PD), a neurodegenerative disease that is difficult to detect in its early stages. Here we leverage severity-based meta-data on the stages of disease to define a curriculum for training a deep convolutional neural network (CNN). Typically, deep learning networks are trained by randomly selecting samples in each mini-batch. By contrast, curriculum learning is a training strategy that aims to boost classifier performance by starting with examples that are easier to classify. Here we define a curriculum to progressively increase the difficulty of the training data corresponding to the Hoehn and Yahr (H&Y) staging system for PD (total N=1,012; 653 PD patients, 359 controls; age range: 20.0-84.9 years). Even with our multi-task setting using pre-trained CNNs and transfer learning, PD classification based on T1-weighted (T1-w) MRI was challenging (ROC AUC: 0.59-0.65), but curriculum training boosted performance (by 3.9%) compared to our baseline model. Future work with multimodal imaging may further boost performance.
Submitted 27 February, 2023;
originally announced February 2023.
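The contrast this abstract draws between random mini-batch sampling and curriculum learning can be sketched as follows. This is a minimal illustration, not the authors' training code: the sample identifiers and difficulty scores are hypothetical, and the abstract does not say how H&Y stage maps to difficulty, so the score is simply supplied by the caller.

```python
from typing import Iterator, List, Tuple

Sample = Tuple[str, float]  # (hypothetical scan id, caller-supplied difficulty score)


def curriculum_stages(samples: List[Sample], n_stages: int) -> Iterator[List[Sample]]:
    """Yield progressively larger training pools, easiest samples first."""
    ordered = sorted(samples, key=lambda s: s[1])
    for stage in range(1, n_stages + 1):
        # Ceiling division so the final stage always contains every sample.
        cutoff = (len(ordered) * stage + n_stages - 1) // n_stages
        yield ordered[:cutoff]


# Hypothetical data: lower score means easier to classify.
scans = [("scan_a", 3.0), ("scan_b", 1.0), ("scan_c", 2.0), ("scan_d", 4.0)]
for pool in curriculum_stages(scans, n_stages=2):
    print([scan_id for scan_id, _ in pool])
```

Each stage's pool would then be shuffled and split into mini-batches as usual. Only the set of samples available at each stage changes.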
-
Retinal vessel segmentation by probing adaptive to lighting variations
Authors:
Guillaume Noyel,
Christine Vartin,
Peter Boyle,
Laurent Kodjikian
Abstract:
We introduce a novel method to extract the vessels in eye fundus images which is adaptive to lighting variations. In the Logarithmic Image Processing framework, a 3-segment probe detects the vessels by probing the topographic surface of an image from below. A map of contrasts between the probe and the image allows the vessels to be detected by thresholding. In a low-contrast image, results show that our method extracts the vessels better than another state-of-the-art method. In a high-contrast image database (DRIVE) with a reference, our method has an accuracy of 0.9454, which is similar to or better than three state-of-the-art methods and below three others. The three best methods have a higher accuracy than a manual segmentation by another expert. Importantly, our method automatically adapts to the lighting conditions of the image acquisition.
Submitted 29 April, 2020;
originally announced April 2020.
-
Registration of retinal images from Public Health by minimising an error between vessels using an affine model with radial distortions
Authors:
Guillaume Noyel,
R Thomas,
S Iles,
G Bhakta,
A Crowder,
D. Owens,
P. Boyle
Abstract:
In order to estimate a registration model of eye fundus images made of an affinity and two radial distortions, we introduce an estimation criterion based on an error between the vessels. In [1], we estimated this model by minimising the error between characteristic points. In this paper, the detected vessels are selected using the circle and ellipse equations of the overlap area boundaries deduced from our model. Our method successfully registers 96% of the 271 pairs in a Public Health dataset acquired mostly with different cameras. This is better than our previous method [1] and better than three other state-of-the-art methods. On a publicly available dataset, our method still registers the images better than the reference method.
Submitted 17 April, 2019;
originally announced April 2019.
-
Accelerating HPC codes on Intel(R) Omni-Path Architecture networks: From particle physics to Machine Learning
Authors:
Peter Boyle,
Michael Chuvelev,
Guido Cossu,
Christopher Kelly,
Christoph Lehner,
Lawrence Meadows
Abstract:
We discuss practical methods to ensure near-wirespeed performance from clusters with either one or two Intel(R) Omni-Path host fabric interfaces (HFI) per node and Intel(R) Xeon Phi(TM) 72xx (Knights Landing) processors, running the Linux operating system.
The study evaluates the performance improvements achievable and the required programming approaches in two distinct example problems: firstly in Cartesian communicator halo exchange problems, appropriate for structured grid PDE solvers that arise in quantum chromodynamics simulations of particle physics, and secondly in gradient reduction appropriate to synchronous stochastic gradient descent for machine learning. As an example, we accelerate a published Baidu Research reduction code and obtain a factor of ten speedup over the original code using the techniques discussed in this paper. This displays how a factor of ten speedup in strongly scaled distributed machine learning could be achieved when synchronous stochastic gradient descent is massively parallelised with a fixed mini-batch size.
We find a significant improvement in performance robustness when memory is obtained using carefully allocated 2MB "huge" virtual memory pages, implying that non-standard allocation routines should be used for communication buffers. These can be accessed via an LD_PRELOAD override in the manner suggested by libhugetlbfs. We make use of the Intel(R) MPI 2019 library "Technology Preview" and underlying software to enable thread concurrency throughout the communication software stack via multiple PSM2 endpoints per process and use of multiple independent MPI communicators. When using a single MPI process per node, we find that this greatly accelerates delivered bandwidth in many-core Intel(R) Xeon Phi processors.
Submitted 13 November, 2017;
originally announced November 2017.
-
Performance Portability Strategies for Grid C++ Expression Templates
Authors:
Peter A. Boyle,
M. A. Clark,
Carleton DeTar,
Meifeng Lin,
Verinder Rana,
Alejandro Vaquero Avilés-Casco
Abstract:
One of the key requirements for the Lattice QCD Application Development as part of the US Exascale Computing Project is performance portability across multiple architectures. Using the Grid C++ expression template as a starting point, we report on the progress made with regard to the Grid GPU offloading strategies. We present both the successes and issues encountered in using CUDA, OpenACC and Just-In-Time compilation. Experimentation and performance on GPUs with a SU(3)$\times$SU(3) streaming test will be reported. We will also report on the challenges of using current OpenMP 4.x for GPU offloading in the same code.
Submitted 25 October, 2017;
originally announced October 2017.
-
Machines and Algorithms
Authors:
Peter A Boyle
Abstract:
I discuss the evolution of computer architectures with a focus on QCD and with reference to the interplay between architecture, engineering, data motion and algorithms. New architectures are discussed and recent performance results are displayed. I also review recent progress in multilevel solver and integration algorithms.
Submitted 1 February, 2017;
originally announced February 2017.
-
Superimposition of eye fundus images for longitudinal analysis from large public health databases
Authors:
Guillaume Noyel,
Rebecca Thomas,
Gavin Bhakta,
Andrew Crowder,
David Owens,
Peter Boyle
Abstract:
In this paper, a method is presented for superimposition (i.e. registration) of eye fundus images from persons with diabetes screened over many years for diabetic retinopathy. The method is fully automatic and robust to camera changes and colour variations across the images both in space and time. All the stages of the process are designed for longitudinal analysis of cohort public health databases where retinal examinations are made at approximately yearly intervals. The method relies on a model correcting two radial distortions and an affine transformation between pairs of images which is robustly fitted on salient points. Each stage involves linear estimators followed by non-linear optimisation. The model of image warping is also invertible for fast computation. The method has been validated (1) on a simulated montage and (2) on public health databases with 69 patients with high quality images (271 pairs acquired mostly with different types of camera and 268 pairs acquired mostly with the same type of camera) with success rates of 92% and 98%, and five patients (20 pairs) with low quality images with a success rate of 100%. Compared to two state-of-the-art methods, ours gives better results.
Submitted 18 July, 2018; v1 submitted 7 July, 2016;
originally announced July 2016.
-
Grid: A next generation data parallel C++ QCD library
Authors:
Peter Boyle,
Azusa Yamaguchi,
Guido Cossu,
Antonin Portelli
Abstract:
In these proceedings we discuss the motivation, implementation details, and performance of a new physics code base called Grid. It is intended to be more performant and more general than QDP++\cite{QDP}, but similar in spirit. Our approach is to engineer the basic type system to be consistently fast, rather than bolt on a few optimised routines, and we attempt to write all our optimised routines directly in the Grid framework. It is hoped this will deliver best-known-practice performance across the next generation of supercomputers, which will provide programming challenges to traditional scalar codes.
We illustrate the programming patterns used to implement our goals, and advances in productivity that have been enabled by using new features in C++11.
Submitted 10 December, 2015;
originally announced December 2015.