Search | arXiv e-print repository

arXiv:2503.19894 [pdf, other]

Versatile Cross-platform Compilation Toolchain for Schrödinger-style Quantum Circuit Simulation

Authors: Yuncheng Lu, Shuang Liang, Hongxiang Fan, Ce Guo, Wayne Luk, Paul H. J. Kelly

Abstract: While existing quantum hardware resources have limited availability and reliability, there is a growing demand for exploring and verifying quantum algorithms. Efficient classical simulators for high-performance quantum simulation are critical to meeting this demand. However, due to the vastly varied characteristics of classical hardware, implementing hardware-specific optimizations for different h… ▽ More While existing quantum hardware resources have limited availability and reliability, there is a growing demand for exploring and verifying quantum algorithms. Efficient classical simulators for high-performance quantum simulation are critical to meeting this demand. However, due to the vastly varied characteristics of classical hardware, implementing hardware-specific optimizations for different hardware platforms is challenging. To address such needs, we propose CAST (Cross-platform Adaptive Schrödiner-style Simulation Toolchain), a novel compilation toolchain with cross-platform (CPU and Nvidia GPU) optimization and high-performance backend supports. CAST exploits a novel sparsity-aware gate fusion algorithm that automatically selects the best fusion strategy and backend configuration for targeted hardware platforms. CAST also aims to offer versatile and high-performance backend for different hardware platforms. To this end, CAST provides an LLVM IR-based vectorization optimization for various CPU architectures and instruction sets, as well as a PTX-based code generator for Nvidia GPU support. We benchmark CAST against IBM Qiskit, Google QSimCirq, Nvidia cuQuantum backend, and other high-performance simulators. On various 32-qubit CPU-based benchmarks, CAST is able to achieve up to 8.03x speedup than Qiskit. On various 30-qubit GPU-based benchmarks, CAST is able to achieve up to 39.3x speedup than Nvidia cuQuantum backend. △ Less

Submitted 25 March, 2025; originally announced March 2025.

Comments: To appear in DAC 25

arXiv:2404.02218 [pdf, other]

doi 10.1145/3620666.3651344

A shared compilation stack for distributed-memory parallelism in stencil DSLs

Authors: George Bisbas, Anton Lydike, Emilien Bauer, Nick Brown, Mathieu Fehr, Lawrence Mitchell, Gabriel Rodriguez-Canal, Maurice Jamieson, Paul H. J. Kelly, Michel Steuwer, Tobias Grosser

Abstract: Domain Specific Languages (DSLs) increase programmer productivity and provide high performance. Their targeted abstractions allow scientists to express problems at a high level, providing rich details that optimizing compilers can exploit to target current- and next-generation supercomputers. The convenience and performance of DSLs come with significant development and maintenance costs. The siloe… ▽ More Domain Specific Languages (DSLs) increase programmer productivity and provide high performance. Their targeted abstractions allow scientists to express problems at a high level, providing rich details that optimizing compilers can exploit to target current- and next-generation supercomputers. The convenience and performance of DSLs come with significant development and maintenance costs. The siloed design of DSL compilers and the resulting inability to benefit from shared infrastructure cause uncertainties around longevity and the adoption of DSLs at scale. By tailoring the broadly-adopted MLIR compiler framework to HPC, we bring the same synergies that the machine learning community already exploits across their DSLs (e.g. Tensorflow, PyTorch) to the finite-difference stencil HPC community. We introduce new HPC-specific abstractions for message passing targeting distributed stencil computations. We demonstrate the sharing of common components across three distinct HPC stencil-DSL compilers: Devito, PSyclone, and the Open Earth Compiler, showing that our framework generates high-performance executables based upon a shared compiler ecosystem. △ Less

Submitted 7 March, 2025; v1 submitted 2 April, 2024; originally announced April 2024.

Comments: Fix some bibtex links, journal ref

Journal ref: In ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 38-56 (2024)

arXiv:2401.15036 [pdf, other]

doi 10.1109/LRA.2024.3352361

Distributed Simultaneous Localisation and Auto-Calibration using Gaussian Belief Propagation

Authors: Riku Murai, Ignacio Alzugaray, Paul H. J. Kelly, Andrew J. Davison

Abstract: We present a novel scalable, fully distributed, and online method for simultaneous localisation and extrinsic calibration for multi-robot setups. Individual a priori unknown robot poses are probabilistically inferred as robots sense each other while simultaneously calibrating their sensors and markers extrinsic using Gaussian Belief Propagation. In the presented experiments, we show how our method… ▽ More We present a novel scalable, fully distributed, and online method for simultaneous localisation and extrinsic calibration for multi-robot setups. Individual a priori unknown robot poses are probabilistically inferred as robots sense each other while simultaneously calibrating their sensors and markers extrinsic using Gaussian Belief Propagation. In the presented experiments, we show how our method not only yields accurate robot localisation and auto-calibration but also is able to perform under challenging circumstances such as highly noisy measurements, significant communication failures or limited communication range. △ Less

Submitted 26 January, 2024; originally announced January 2024.

Comments: Published in IEEE Robotics and Automation Letters (RA-L) 2024

Journal ref: IEEE Robotics and Automation Letters, vol. 9, no. 3, pp. 2136-2143, March 2024

arXiv:2312.13094 [pdf, other]

Automated MPI-X code generation for scalable finite-difference solvers

Authors: George Bisbas, Rhodri Nelson, Mathias Louboutin, Fabio Luporini, Paul H. J. Kelly, Gerard Gorman

Abstract: Partial differential equations (PDEs) are crucial in modeling diverse phenomena across scientific disciplines, including seismic and medical imaging, computational fluid dynamics, image processing, and neural networks. Solving these PDEs at scale is an intricate and time-intensive process that demands careful tuning. This paper introduces automated code-generation techniques specifically tailored… ▽ More Partial differential equations (PDEs) are crucial in modeling diverse phenomena across scientific disciplines, including seismic and medical imaging, computational fluid dynamics, image processing, and neural networks. Solving these PDEs at scale is an intricate and time-intensive process that demands careful tuning. This paper introduces automated code-generation techniques specifically tailored for distributed memory parallelism (DMP) to execute explicit finite-difference (FD) stencils at scale, a fundamental challenge in numerous scientific applications. These techniques are implemented and integrated into the Devito DSL and compiler framework, a well-established solution for automating the generation of FD solvers based on a high-level symbolic math input. Users benefit from modeling simulations for real-world applications at a high-level symbolic abstraction and effortlessly harnessing HPC-ready distributed-memory parallelism without altering their source code. This results in drastic reductions both in execution time and developer effort. A comprehensive performance evaluation of Devito's DMP via MPI demonstrates highly competitive strong and weak scaling on CPU and GPU clusters, proving its effectiveness and capability to meet the demands of large-scale scientific simulations. △ Less

Submitted 6 January, 2025; v1 submitted 20 December, 2023; originally announced December 2023.

Comments: 11 pages, 12 figures (23 pages with References and Appendix)

arXiv:2312.06741 [pdf, other]

Gaussian Splatting SLAM

Authors: Hidenobu Matsuki, Riku Murai, Paul H. J. Kelly, Andrew J. Davison

Abstract: We present the first application of 3D Gaussian Splatting in monocular SLAM, the most fundamental but the hardest setup for Visual SLAM. Our method, which runs live at 3fps, utilises Gaussians as the only 3D representation, unifying the required representation for accurate, efficient tracking, mapping, and high-quality rendering. Designed for challenging monocular settings, our approach is seamles… ▽ More We present the first application of 3D Gaussian Splatting in monocular SLAM, the most fundamental but the hardest setup for Visual SLAM. Our method, which runs live at 3fps, utilises Gaussians as the only 3D representation, unifying the required representation for accurate, efficient tracking, mapping, and high-quality rendering. Designed for challenging monocular settings, our approach is seamlessly extendable to RGB-D SLAM when an external depth sensor is available. Several innovations are required to continuously reconstruct 3D scenes with high fidelity from a live camera. First, to move beyond the original 3DGS algorithm, which requires accurate poses from an offline Structure from Motion (SfM) system, we formulate camera tracking for 3DGS using direct optimisation against the 3D Gaussians, and show that this enables fast and robust tracking with a wide basin of convergence. Second, by utilising the explicit nature of the Gaussians, we introduce geometric verification and regularisation to handle the ambiguities occurring in incremental 3D dense reconstruction. Finally, we introduce a full SLAM system which not only achieves state-of-the-art results in novel view synthesis and trajectory estimation but also reconstruction of tiny and even transparent objects. △ Less

Submitted 14 April, 2024; v1 submitted 11 December, 2023; originally announced December 2023.

Comments: CVPR2024 Highlight. First two authors contributed equally to this work. Project Page: https://rmurai.co.uk/projects/GaussianSplattingSLAM/

arXiv:2203.03092 [pdf, other]

Systematic Comparison of Path Planning Algorithms using PathBench

Authors: Hao-Ya Hsueh, Alexandru-Iosif Toma, Hussein Ali Jaafar, Edward Stow, Riku Murai, Paul H. J. Kelly, Sajad Saeedi

Abstract: Path planning is an essential component of mobile robotics. Classical path planning algorithms, such as wavefront and rapidly-exploring random tree (RRT) are used heavily in autonomous robots. With the recent advances in machine learning, development of learning-based path planning algorithms has been experiencing rapid growth. An unified path planning interface that facilitates the development an… ▽ More Path planning is an essential component of mobile robotics. Classical path planning algorithms, such as wavefront and rapidly-exploring random tree (RRT) are used heavily in autonomous robots. With the recent advances in machine learning, development of learning-based path planning algorithms has been experiencing rapid growth. An unified path planning interface that facilitates the development and benchmarking of existing and new algorithms is needed. This paper presents PathBench, a platform for developing, visualizing, training, testing, and benchmarking of existing and future, classical and learning-based path planning algorithms in 2D and 3D grid world environments. Many existing path planning algorithms are supported; e.g. A*, Dijkstra, waypoint planning networks, value iteration networks, gated path planning networks; and integrating new algorithms is easy and clearly specified. The benchmarking ability of PathBench is explored in this paper by comparing algorithms across five different hardware systems and three different map types, including built-in PathBench maps, video game maps, and maps from real world databases. Metrics, such as path length, success rate, and computational time, were used to evaluate algorithms. Algorithmic analysis was also performed on a real world robot to demonstrate PathBench's support for Robot Operating System (ROS). PathBench is open source. △ Less

Submitted 6 March, 2022; originally announced March 2022.

Comments: Accepted to Advanced Robotics Journal; 23 pages, 9 figures, 4 tables. arXiv admin note: substantial text overlap with arXiv:2105.01777

arXiv:2202.03314 [pdf, other]

doi 10.1109/TRO.2023.3324127

A Robot Web for Distributed Many-Device Localisation

Authors: Riku Murai, Joseph Ortiz, Sajad Saeedi, Paul H. J. Kelly, Andrew J. Davison

Abstract: We show that a distributed network of robots or other devices which make measurements of each other can collaborate to globally localise via efficient ad-hoc peer to peer communication. Our Robot Web solution is based on Gaussian Belief Propagation on the fundamental non-linear factor graph describing the probabilistic structure of all of the observations robots make internally or of each other, a… ▽ More We show that a distributed network of robots or other devices which make measurements of each other can collaborate to globally localise via efficient ad-hoc peer to peer communication. Our Robot Web solution is based on Gaussian Belief Propagation on the fundamental non-linear factor graph describing the probabilistic structure of all of the observations robots make internally or of each other, and is flexible for any type of robot, motion or sensor. We define a simple and efficient communication protocol which can be implemented by the publishing and reading of web pages or other asynchronous communication technologies. We show in simulations with up to 1000 robots interacting in arbitrary patterns that our solution convergently achieves global accuracy as accurate as a centralised non-linear factor graph solver while operating with high distributed efficiency of computation and communication. Via the use of robust factors in GBP, our method is tolerant to a high percentage of faults in sensor measurements or dropped communication packets. △ Less

Submitted 26 January, 2024; v1 submitted 7 February, 2022; originally announced February 2022.

Comments: Published in IEEE Transactions on Robotics (TRO) 2023

Journal ref: IEEE Transactions on Robotics, vol. 40, pp. 121-138, 2024

arXiv:2106.07456 [pdf, other]

Extending the RISC-V ISA for exploring advanced reconfigurable SIMD instructions

Authors: Philippos Papaphilippou, Paul H. J. Kelly, Wayne Luk

Abstract: This paper presents a novel, non-standard set of vector instruction types for exploring custom SIMD instructions in a softcore. The new types allow simultaneous access to a relatively high number of operands, reducing the instruction count where applicable. Additionally, a high-performance open-source RISC-V (RV32 IM) softcore is introduced, optimised for exploring custom SIMD instructions and str… ▽ More This paper presents a novel, non-standard set of vector instruction types for exploring custom SIMD instructions in a softcore. The new types allow simultaneous access to a relatively high number of operands, reducing the instruction count where applicable. Additionally, a high-performance open-source RISC-V (RV32 IM) softcore is introduced, optimised for exploring custom SIMD instructions and streaming performance. By providing instruction templates for instruction development in HDL/Verilog, efficient FPGA-based instructions can be developed with few low-level lines of code. In order to improve custom SIMD instruction performance, the softcore's cache hierarchy is optimised for bandwidth, such as with very wide blocks for the last-level cache. The approach is demonstrated on example memory-intensive applications on an FPGA. Although the exploration is based on the softcore, the goal is to provide a means to experiment with advanced SIMD instructions which could be loaded in future CPUs that feature reconfigurable regions as custom instructions. Finally, we provide some insights on the challenges and effectiveness of such future micro-architectures. △ Less

Submitted 14 June, 2021; originally announced June 2021.

Comments: Accepted at the Fifth Workshop on Computer Architecture Research with RISC-V (CARRV 2021), co-located with ISCA 2021

arXiv:2105.01777 [pdf, other]

PathBench: A Benchmarking Platform for Classical and Learned Path Planning Algorithms

Authors: Alexandru-Iosif Toma, Hao-Ya Hsueh, Hussein Ali Jaafar, Riku Murai, Paul H. J. Kelly, Sajad Saeedi

Abstract: Path planning is a key component in mobile robotics. A wide range of path planning algorithms exist, but few attempts have been made to benchmark the algorithms holistically or unify their interface. Moreover, with the recent advances in deep neural networks, there is an urgent need to facilitate the development and benchmarking of such learning-based planning algorithms. This paper presents PathB… ▽ More Path planning is a key component in mobile robotics. A wide range of path planning algorithms exist, but few attempts have been made to benchmark the algorithms holistically or unify their interface. Moreover, with the recent advances in deep neural networks, there is an urgent need to facilitate the development and benchmarking of such learning-based planning algorithms. This paper presents PathBench, a platform for developing, visualizing, training, testing, and benchmarking of existing and future, classical and learned 2D and 3D path planning algorithms, while offering support for Robot Oper-ating System (ROS). Many existing path planning algorithms are supported; e.g. A*, wavefront, rapidly-exploring random tree, value iteration networks, gated path planning networks; and integrating new algorithms is easy and clearly specified. We demonstrate the benchmarking capability of PathBench by comparing implemented classical and learned algorithms for metrics, such as path length, success rate, computational time and path deviation. These evaluations are done on built-in PathBench maps and external path planning environments from video games and real world databases. PathBench is open source. △ Less

Submitted 4 May, 2021; originally announced May 2021.

Comments: The Conference on Robots and Vision (CRV2021), Supplementary Website: https://sites.google.com/view/pathbench/

arXiv:2101.08715 [pdf, other]

Cain: Automatic Code Generation for Simultaneous Convolutional Kernels on Focal-plane Sensor-processors

Authors: Edward Stow, Riku Murai, Sajad Saeedi, Paul H. J. Kelly

Abstract: Focal-plane Sensor-processors (FPSPs) are a camera technology that enable low power, high frame rate computation, making them suitable for edge computation. Unfortunately, these devices' limited instruction sets and registers make developing complex algorithms difficult. In this work, we present Cain - a compiler that targets SCAMP-5, a general-purpose FPSP - which generates code from multiple con… ▽ More Focal-plane Sensor-processors (FPSPs) are a camera technology that enable low power, high frame rate computation, making them suitable for edge computation. Unfortunately, these devices' limited instruction sets and registers make developing complex algorithms difficult. In this work, we present Cain - a compiler that targets SCAMP-5, a general-purpose FPSP - which generates code from multiple convolutional kernels. As an example, given the convolutional kernels for an MNIST digit recognition neural network, Cain produces code that is half as long, when compared to the other available compilers for SCAMP-5. △ Less

Submitted 21 January, 2021; originally announced January 2021.

Comments: 17 pages, 4 figures, Accepted at LCPC 2020 to be published by Springer

ACM Class: D.3.4; I.4.m

arXiv:2010.10248 [pdf, other]

Temporal blocking of finite-difference stencil operators with sparse "off-the-grid" sources

Authors: George Bisbas, Fabio Luporini, Mathias Louboutin, Rhodri Nelson, Gerard Gorman, Paul H. J. Kelly

Abstract: Stencil kernels dominate a range of scientific applications, including seismic and medical imaging, image processing, and neural networks. Temporal blocking is a performance optimization that aims to reduce the required memory bandwidth of stencil computations by re-using data from the cache for multiple time steps. It has already been shown to be beneficial for this class of algorithms. However,… ▽ More Stencil kernels dominate a range of scientific applications, including seismic and medical imaging, image processing, and neural networks. Temporal blocking is a performance optimization that aims to reduce the required memory bandwidth of stencil computations by re-using data from the cache for multiple time steps. It has already been shown to be beneficial for this class of algorithms. However, applying temporal blocking to practical applications' stencils remains challenging. These computations often consist of sparsely located operators not aligned with the computational grid ("off-the-grid"). Our work is motivated by modeling problems in which source injections result in wavefields that must then be measured at receivers by interpolation from the grided wavefield. The resulting data dependencies make the adoption of temporal blocking much more challenging. We propose a methodology to inspect these data dependencies and reorder the computation, leading to performance gains in stencil codes where temporal blocking has not been applicable. We implement this novel scheme in the Devito domain-specific compiler toolchain. Devito implements a domain-specific language embedded in Python to generate optimized partial differential equation solvers using the finite-difference method from high-level symbolic problem definitions. We evaluate our scheme using isotropic acoustic, anisotropic acoustic, and isotropic elastic wave propagators of industrial significance. After auto-tuning, performance evaluation shows that this enables substantial performance improvement through temporal blocking over highly-optimized vectorized spatially-blocked code of up to 1.6x. △ Less

Submitted 25 February, 2021; v1 submitted 20 October, 2020; originally announced October 2020.

Comments: Accepted for publication at 35th IEEE International Parallel & Distributed Processing Symposium

arXiv:2006.01765 [pdf, other]

AnalogNet: Convolutional Neural Network Inference on Analog Focal Plane Sensor Processors

Authors: Matthew Z. Wong, Benoit Guillard, Riku Murai, Sajad Saeedi, Paul H. J. Kelly

Abstract: We present a high-speed, energy-efficient Convolutional Neural Network (CNN) architecture utilising the capabilities of a unique class of devices known as analog Focal Plane Sensor Processors (FPSP), in which the sensor and the processor are embedded together on the same silicon chip. Unlike traditional vision systems, where the sensor array sends collected data to a separate processor for process… ▽ More We present a high-speed, energy-efficient Convolutional Neural Network (CNN) architecture utilising the capabilities of a unique class of devices known as analog Focal Plane Sensor Processors (FPSP), in which the sensor and the processor are embedded together on the same silicon chip. Unlike traditional vision systems, where the sensor array sends collected data to a separate processor for processing, FPSPs allow data to be processed on the imaging device itself. This unique architecture enables ultra-fast image processing and high energy efficiency, at the expense of limited processing resources and approximate computations. In this work, we show how to convert standard CNNs to FPSP code, and demonstrate a method of training networks to increase their robustness to analog computation errors. Our proposed architecture, coined AnalogNet, reaches a testing accuracy of 96.9% on the MNIST handwritten digits recognition task, at a speed of 2260 FPS, for a cost of 0.7 mJ per frame. △ Less

Submitted 21 June, 2020; v1 submitted 2 June, 2020; originally announced June 2020.

Comments: 8 pages, 7 figures

arXiv:2004.11186 [pdf, other]

doi 10.1109/IROS45743.2020.9341151

BIT-VO: Visual Odometry at 300 FPS using Binary Features from the Focal Plane

Authors: Riku Murai, Sajad Saeedi, Paul H. J. Kelly

Abstract: Focal-plane Sensor-processor (FPSP) is a next-generation camera technology which enables every pixel on the sensor chip to perform computation in parallel, on the focal plane where the light intensity is captured. SCAMP-5 is a general-purpose FPSP used in this work and it carries out computations in the analog domain before analog to digital conversion. By extracting features from the image on the… ▽ More Focal-plane Sensor-processor (FPSP) is a next-generation camera technology which enables every pixel on the sensor chip to perform computation in parallel, on the focal plane where the light intensity is captured. SCAMP-5 is a general-purpose FPSP used in this work and it carries out computations in the analog domain before analog to digital conversion. By extracting features from the image on the focal plane, data which is digitized and transferred is reduced. As a consequence, SCAMP-5 offers a high frame rate while maintaining low energy consumption. Here, we present BIT-VO, which is, to the best of our knowledge, the first 6 Degrees of Freedom visual odometry algorithm which utilises the FPSP. Our entire system operates at 300 FPS in a natural scene, using binary edges and corner features detected by the SCAMP-5. △ Less

Submitted 23 April, 2020; originally announced April 2020.

Comments: 8 pages, 16 figures

Journal ref: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 2020, pp. 8579-8586

arXiv:2003.03396 [pdf, other]

Scalable Uncertainty for Computer Vision with Functional Variational Inference

Authors: Eduardo D C Carvalho, Ronald Clark, Andrea Nicastro, Paul H J Kelly

Abstract: As Deep Learning continues to yield successful applications in Computer Vision, the ability to quantify all forms of uncertainty is a paramount requirement for its safe and reliable deployment in the real-world. In this work, we leverage the formulation of variational inference in function space, where we associate Gaussian Processes (GPs) to both Bayesian CNN priors and variational family. Since… ▽ More As Deep Learning continues to yield successful applications in Computer Vision, the ability to quantify all forms of uncertainty is a paramount requirement for its safe and reliable deployment in the real-world. In this work, we leverage the formulation of variational inference in function space, where we associate Gaussian Processes (GPs) to both Bayesian CNN priors and variational family. Since GPs are fully determined by their mean and covariance functions, we are able to obtain predictive uncertainty estimates at the cost of a single forward pass through any chosen CNN architecture and for any supervised learning task. By leveraging the structure of the induced covariance matrices, we propose numerically efficient algorithms which enable fast training in the context of high-dimensional tasks such as depth estimation and semantic segmentation. Additionally, we provide sufficient conditions for constructing regression loss functions whose probabilistic counterparts are compatible with aleatoric uncertainty quantification. △ Less

Submitted 6 March, 2020; originally announced March 2020.

Comments: CVPR 2020

arXiv:1906.00877 [pdf, other]

Pangloss: a novel Markov chain prefetcher

Authors: Philippos Papaphilippou, Paul H. J. Kelly, Wayne Luk

Abstract: We present Pangloss, an efficient high-performance data prefetcher that approximates a Markov chain on delta transitions. With a limited information scope and space/logic complexity, it is able to reconstruct a variety of both simple and complex access patterns. This is achieved by a highly-efficient representation of the Markov chain to provide accurate values for transition probabilities. In add… ▽ More We present Pangloss, an efficient high-performance data prefetcher that approximates a Markov chain on delta transitions. With a limited information scope and space/logic complexity, it is able to reconstruct a variety of both simple and complex access patterns. This is achieved by a highly-efficient representation of the Markov chain to provide accurate values for transition probabilities. In addition, we have added a mechanism to reconstruct delta transitions originally obfuscated by the out-of-order execution or page transitions, such as when streaming data from multiple sources. Our single-level (L2) prefetcher achieves a geometric speedup of 1.7% and 3.2% over selected state-of-the-art baselines (KPCP and BOP). When combined with an equivalent for the L1 cache (L1 & L2), the speedups rise to 6.8% and 8.4%, and 40.4% over non-prefetch. In the multi-core evaluation, there seems to be a considerable performance improvement as well. △ Less

Submitted 3 June, 2019; originally announced June 2019.

Comments: Accepted in The Third Data Prefetching Championship (DPC3), held in conjunction with ISCA 2019

arXiv:1903.08243 [pdf, other]

doi 10.1177/1094342020945005

A study of vectorization for matrix-free finite element methods

Authors: Tianjiao Sun, Lawrence Mitchell, Kaushik Kulkarni, Andreas Klöckner, David A. Ham, Paul H. J. Kelly

Abstract: Vectorization is increasingly important to achieve high performance on modern hardware with SIMD instructions. Assembly of matrices and vectors in the finite element method, which is characterized by iterating a local assembly kernel over unstructured meshes, poses difficulties to effective vectorization. Maintaining a user-friendly high-level interface with a suitable degree of abstraction while… ▽ More Vectorization is increasingly important to achieve high performance on modern hardware with SIMD instructions. Assembly of matrices and vectors in the finite element method, which is characterized by iterating a local assembly kernel over unstructured meshes, poses difficulties to effective vectorization. Maintaining a user-friendly high-level interface with a suitable degree of abstraction while generating efficient, vectorized code for the finite element method is a challenge for numerical software systems and libraries. In this work, we study cross-element vectorization in the finite element framework Firedrake via code transformation and demonstrate the efficacy of such an approach by evaluating a wide range of matrix-free operators spanning different polynomial degrees and discretizations on two recent CPUs using three mainstream compilers. Our experiments show that our approaches for cross-element vectorization achieve 30\% of theoretical peak performance for many examples of practical significance, and exceed 50\% for cases with high arithmetic intensities, with consistent speed-up over (intra-element) vectorization restricted to the local assembly kernels. △ Less

Submitted 19 May, 2020; v1 submitted 19 March, 2019; originally announced March 2019.

Journal ref: International Journal of High Performance Computing Applications (2020)

arXiv:1808.06820 [pdf, other]

SLAMBench2: Multi-Objective Head-to-Head Benchmarking for Visual SLAM

Authors: Bruno Bodin, Harry Wagstaff, Sajad Saeedi, Luigi Nardi, Emanuele Vespa, John H Mayer, Andy Nisbet, Mikel Luján, Steve Furber, Andrew J Davison, Paul H. J. Kelly, Michael O'Boyle

Abstract: SLAM is becoming a key component of robotics and augmented reality (AR) systems. While a large number of SLAM algorithms have been presented, there has been little effort to unify the interface of such algorithms, or to perform a holistic comparison of their capabilities. This is a problem since different SLAM applications can have different functional and non-functional requirements. For example,… ▽ More SLAM is becoming a key component of robotics and augmented reality (AR) systems. While a large number of SLAM algorithms have been presented, there has been little effort to unify the interface of such algorithms, or to perform a holistic comparison of their capabilities. This is a problem since different SLAM applications can have different functional and non-functional requirements. For example, a mobile phonebased AR application has a tight energy budget, while a UAV navigation system usually requires high accuracy. SLAMBench2 is a benchmarking framework to evaluate existing and future SLAM systems, both open and close source, over an extensible list of datasets, while using a comparable and clearly specified list of performance metrics. A wide variety of existing SLAM algorithms and datasets is supported, e.g. ElasticFusion, InfiniTAM, ORB-SLAM2, OKVIS, and integrating new ones is straightforward and clearly specified by the framework. SLAMBench2 is a publicly-available software framework which represents a starting point for quantitative, comparable and validatable experimental research to investigate trade-offs across SLAM systems. △ Less

Submitted 21 August, 2018; originally announced August 2018.

Journal ref: 2018 IEEE International Conference on Robotics and Automation (ICRA'18)

arXiv:1808.06352 [pdf, other]

doi 10.1109/JPROC.2018.2856739

Navigating the Landscape for Real-time Localisation and Mapping for Robotics and Virtual and Augmented Reality

Authors: Sajad Saeedi, Bruno Bodin, Harry Wagstaff, Andy Nisbet, Luigi Nardi, John Mawer, Nicolas Melot, Oscar Palomar, Emanuele Vespa, Tom Spink, Cosmin Gorgovan, Andrew Webb, James Clarkson, Erik Tomusk, Thomas Debrunner, Kuba Kaszyk, Pablo Gonzalez-de-Aledo, Andrey Rodchenko, Graham Riley, Christos Kotselidis, Björn Franke, Michael F. P. O'Boyle, Andrew J. Davison, Paul H. J. Kelly, Mikel Luján , et al. (1 additional authors not shown)

Abstract: Visual understanding of 3D environments in real-time, at low power, is a huge computational challenge. Often referred to as SLAM (Simultaneous Localisation and Mapping), it is central to applications spanning domestic and industrial robotics, autonomous vehicles, virtual and augmented reality. This paper describes the results of a major research effort to assemble the algorithms, architectures, to… ▽ More Visual understanding of 3D environments in real-time, at low power, is a huge computational challenge. Often referred to as SLAM (Simultaneous Localisation and Mapping), it is central to applications spanning domestic and industrial robotics, autonomous vehicles, virtual and augmented reality. This paper describes the results of a major research effort to assemble the algorithms, architectures, tools, and systems software needed to enable delivery of SLAM, by supporting applications specialists in selecting and configuring the appropriate algorithm and the appropriate hardware, and compilation pathway, to meet their performance, accuracy, and energy consumption goals. The major contributions we present are (1) tools and methodology for systematic quantitative evaluation of SLAM algorithms, (2) automated, machine-learning-guided exploration of the algorithmic and implementation design space with respect to multiple objectives, (3) end-to-end simulation tools to enable optimisation of heterogeneous, accelerated architectures for the specific algorithmic requirements of the various SLAM algorithmic approaches, and (4) tools for delivering, where appropriate, accelerated, adaptive SLAM solutions in a managed, JIT-compiled, adaptive runtime context. △ Less

Submitted 20 August, 2018; originally announced August 2018.

Comments: Proceedings of the IEEE 2018

arXiv:1807.03032 [pdf, other]

Architecture and performance of Devito, a system for automated stencil computation

Authors: Fabio Luporini, Michael Lange, Mathias Louboutin, Navjot Kukreja, Jan Hückelheim, Charles Yount, Philipp Witte, Paul H. J. Kelly, Felix J. Herrmann, Gerard J. Gorman

Abstract: Stencil computations are a key part of many high-performance computing applications, such as image processing, convolutional neural networks, and finite-difference solvers for partial differential equations. Devito is a framework capable of generating highly-optimized code given symbolic equations expressed in Python, specialized in, but not limited to, affine (stencil) codes. The lowering process… ▽ More Stencil computations are a key part of many high-performance computing applications, such as image processing, convolutional neural networks, and finite-difference solvers for partial differential equations. Devito is a framework capable of generating highly-optimized code given symbolic equations expressed in Python, specialized in, but not limited to, affine (stencil) codes. The lowering process---from mathematical equations down to C++ code---is performed by the Devito compiler through a series of intermediate representations. Several performance optimizations are introduced, including advanced common sub-expressions elimination, tiling and parallelization. Some of these are obtained through well-established stencil optimizers, integrated in the back-end of the Devito compiler. The architecture of the Devito compiler, as well as the performance optimizations that are applied when generating code, are presented. The effectiveness of such performance optimizations is demonstrated using operators drawn from seismic imaging applications. △ Less

Submitted 7 February, 2020; v1 submitted 9 July, 2018; originally announced July 2018.

Comments: Submitted to ACM Transactions on Mathematical Software

MSC Class: 65N06; 68N20

arXiv:1708.03183 [pdf, other]

Automated Tiling of Unstructured Mesh Computations with Application to Seismological Modelling

Authors: Fabio Luporini, Michael Lange, Christian T. Jacobs, Gerard J. Gorman, J. Ramanujam, Paul H. J. Kelly

Abstract: Sparse tiling is a technique to fuse loops that access common data, thus increasing data locality. Unlike traditional loop fusion or blocking, the loops may have different iteration spaces and access shared datasets through indirect memory accesses, such as A[map[i]] -- hence the name "sparse". One notable example of such loops arises in discontinuous-Galerkin finite element methods, because of th… ▽ More Sparse tiling is a technique to fuse loops that access common data, thus increasing data locality. Unlike traditional loop fusion or blocking, the loops may have different iteration spaces and access shared datasets through indirect memory accesses, such as A[map[i]] -- hence the name "sparse". One notable example of such loops arises in discontinuous-Galerkin finite element methods, because of the computation of numerical integrals over different domains (e.g., cells, facets). The major challenge with sparse tiling is implementation -- not only is it cumbersome to understand and synthesize, but it is also onerous to maintain and generalize, as it requires a complete rewrite of the bulk of the numerical computation. In this article, we propose an approach to extend the applicability of sparse tiling based on raising the level of abstraction. Through a sequence of compiler passes, the mathematical specification of a problem is progressively lowered, and eventually sparse-tiled C for-loops are generated. Besides automation, we advance the state-of-the-art by introducing: a revisited, more efficient sparse tiling algorithm; support for distributed-memory parallelism; a range of fine-grained optimizations for increased run-time performance; implementation in a publicly-available library, SLOPE; and an in-depth study of the performance impact in Seigen, a real-world elastic wave equation solver for seismological problems, which shows speed-ups up to 1.28x on a platform consisting of 896 Intel Broadwell cores. △ Less

Submitted 19 June, 2019; v1 submitted 10 August, 2017; originally announced August 2017.

Comments: 29 pages including supplementary materials and references

ACM Class: D.1.2; G.4

arXiv:1702.00505 [pdf, other]

Algorithmic Performance-Accuracy Trade-off in 3D Vision Applications Using HyperMapper

Authors: Luigi Nardi, Bruno Bodin, Sajad Saeedi, Emanuele Vespa, Andrew J. Davison, Paul H. J. Kelly

Abstract: In this paper we investigate an emerging application, 3D scene understanding, likely to be significant in the mobile space in the near future. The goal of this exploration is to reduce execution time while meeting our quality of result objectives. In previous work we showed for the first time that it is possible to map this application to power constrained embedded systems, highlighting that decis… ▽ More In this paper we investigate an emerging application, 3D scene understanding, likely to be significant in the mobile space in the near future. The goal of this exploration is to reduce execution time while meeting our quality of result objectives. In previous work we showed for the first time that it is possible to map this application to power constrained embedded systems, highlighting that decision choices made at the algorithmic design-level have the most impact. As the algorithmic design space is too large to be exhaustively evaluated, we use a previously introduced multi-objective Random Forest Active Learning prediction framework dubbed HyperMapper, to find good algorithmic designs. We show that HyperMapper generalizes on a recent cutting edge 3D scene understanding algorithm and on a modern GPU-based computer architecture. HyperMapper is able to beat an expert human hand-tuning the algorithmic parameters of the class of Computer Vision applications taken under consideration in this paper automatically. In addition, we use crowd-sourcing using a 3D scene understanding Android app to show that the Pareto front obtained on an embedded system can be used to accelerate the same application on all the 83 smart-phones and tablets crowd-sourced with speedups ranging from 2 to over 12. △ Less

Submitted 21 March, 2017; v1 submitted 1 February, 2017; originally announced February 2017.

Comments: 10 pages, Keywords: design space exploration, machine learning, computer vision, SLAM, embedded systems, GPU, crowd-sourcing

Journal ref: 31st IEEE International Parallel and Distributed Processing Symposium May 29 - June 2, 2017 Orlando, Florida USA

arXiv:1604.05937 [pdf, other]

doi 10.5194/gmd-9-3803-2016

A structure-exploiting numbering algorithm for finite elements on extruded meshes, and its performance evaluation in Firedrake

Authors: Gheorghe-Teodor Bercea, Andrew T. T. McRae, David A. Ham, Lawrence Mitchell, Florian Rathgeber, Luigi Nardi, Fabio Luporini, Paul H. J. Kelly

Abstract: We present a generic algorithm for numbering and then efficiently iterating over the data values attached to an extruded mesh. An extruded mesh is formed by replicating an existing mesh, assumed to be unstructured, to form layers of prismatic cells. Applications of extruded meshes include, but are not limited to, the representation of 3D high aspect ratio domains employed by geophysical finite ele… ▽ More We present a generic algorithm for numbering and then efficiently iterating over the data values attached to an extruded mesh. An extruded mesh is formed by replicating an existing mesh, assumed to be unstructured, to form layers of prismatic cells. Applications of extruded meshes include, but are not limited to, the representation of 3D high aspect ratio domains employed by geophysical finite element simulations. These meshes are structured in the extruded direction. The algorithm presented here exploits this structure to avoid the performance penalty traditionally associated with unstructured meshes. We evaluate the implementation of this algorithm in the Firedrake finite element system on a range of low compute intensity operations which constitute worst cases for data layout performance exploration. The experiments show that having structure along the extruded direction enables the cost of the indirect data accesses to be amortized after 10-20 layers as long as the underlying mesh is well-ordered. We characterise the resulting spatial and temporal reuse in a representative set of both continuous-Galerkin and discontinuous-Galerkin discretisations. On meshes with realistic numbers of layers the performance achieved is between 70% and 90% of a theoretical hardware-specific limit. △ Less

Submitted 28 October, 2016; v1 submitted 20 April, 2016; originally announced April 2016.

Comments: Bibliography fixes, 23 pages

Journal ref: Geoscientific Model Development 9:3803-3815 (2016)

arXiv:1604.05872 [pdf, other]

doi 10.1145/3054944

An algorithm for the optimization of finite element integration loops

Authors: Fabio Luporini, David A. Ham, Paul H. J. Kelly

Abstract: We present an algorithm for the optimization of a class of finite element integration loop nests. This algorithm, which exploits fundamental mathematical properties of finite element operators, is proven to achieve a locally optimal operation count. In specified circumstances the optimum achieved is global. Extensive numerical experiments demonstrate significant performance improvements over the s… ▽ More We present an algorithm for the optimization of a class of finite element integration loop nests. This algorithm, which exploits fundamental mathematical properties of finite element operators, is proven to achieve a locally optimal operation count. In specified circumstances the optimum achieved is global. Extensive numerical experiments demonstrate significant performance improvements over the state of the art in finite element code generation in almost all cases. This validates the effectiveness of the algorithm presented here, and illustrates its limitations. △ Less

Submitted 20 April, 2016; originally announced April 2016.

ACM Class: G.1.8; G.4

arXiv:1509.04648 [pdf, other]

doi 10.1109/ICRA.2016.7487261

Comparative Design Space Exploration of Dense and Semi-Dense SLAM

Authors: M. Zeeshan Zia, Luigi Nardi, Andrew Jack, Emanuele Vespa, Bruno Bodin, Paul H. J. Kelly, Andrew J. Davison

Abstract: SLAM has matured significantly over the past few years, and is beginning to appear in serious commercial products. While new SLAM systems are being proposed at every conference, evaluation is often restricted to qualitative visualizations or accuracy estimation against a ground truth. This is due to the lack of benchmarking methodologies which can holistically and quantitatively evaluate these sys… ▽ More SLAM has matured significantly over the past few years, and is beginning to appear in serious commercial products. While new SLAM systems are being proposed at every conference, evaluation is often restricted to qualitative visualizations or accuracy estimation against a ground truth. This is due to the lack of benchmarking methodologies which can holistically and quantitatively evaluate these systems. Further investigation at the level of individual kernels and parameter spaces of SLAM pipelines is non-existent, which is absolutely essential for systems research and integration. We extend the recently introduced SLAMBench framework to allow comparing two state-of-the-art SLAM pipelines, namely KinectFusion and LSD-SLAM, along the metrics of accuracy, energy consumption, and processing frame rate on two different hardware platforms, namely a desktop and an embedded device. We also analyze the pipelines at the level of individual kernels and explore their algorithmic and hardware design spaces for the first time, yielding valuable insights. △ Less

Submitted 3 March, 2016; v1 submitted 15 September, 2015; originally announced September 2015.

Comments: IEEE International Conference on Robotics and Automation 2016

arXiv:1505.04694 [pdf, other]

Thread Parallelism for Highly Irregular Computation in Anisotropic Mesh Adaptation

Authors: Georgios Rokos, Gerard J. Gorman, Kristian Ejlebjerg Jensen, Paul H. J. Kelly

Abstract: Thread-level parallelism in irregular applications with mutable data dependencies presents challenges because the underlying data is extensively modified during execution of the algorithm and a high degree of parallelism must be realized while keeping the code race-free. In this article we describe a methodology for exploiting thread parallelism for a class of graph-mutating worklist algorithms, w… ▽ More Thread-level parallelism in irregular applications with mutable data dependencies presents challenges because the underlying data is extensively modified during execution of the algorithm and a high degree of parallelism must be realized while keeping the code race-free. In this article we describe a methodology for exploiting thread parallelism for a class of graph-mutating worklist algorithms, which guarantees safe parallel execution via processing in rounds of independent sets and using a deferred update strategy to commit changes in the underlying data structures. Scalability is assisted by atomic fetch-and-add operations to create worklists and work-stealing to balance the shared-memory workload. This work is motivated by mesh adaptation algorithms, for which we show a parallel efficiency of 60% and 50% on Intel(R) Xeon(R) Sandy Bridge and AMD Opteron(tm) Magny-Cours systems, respectively, using these techniques. △ Less

Submitted 18 May, 2015; originally announced May 2015.

Comments: To appear in the proceedings of EASC 2015

arXiv:1505.04134 [pdf, ps, other]

An Interrupt-Driven Work-Sharing For-Loop Scheduler

Authors: Georgios Rokos, Gerard J. Gorman, Paul H. J. Kelly

Abstract: In this paper we present a parallel for-loop scheduler which is based on work-stealing principles but runs under a completely cooperative scheme. POSIX signals are used by idle threads to interrupt left-behind workers, which in turn decide what portion of their workload can be given to the requester. We call this scheme Interrupt-Driven Work-Sharing (IDWS). This article describes how IDWS works, h… ▽ More In this paper we present a parallel for-loop scheduler which is based on work-stealing principles but runs under a completely cooperative scheme. POSIX signals are used by idle threads to interrupt left-behind workers, which in turn decide what portion of their workload can be given to the requester. We call this scheme Interrupt-Driven Work-Sharing (IDWS). This article describes how IDWS works, how it can be integrated into any POSIX-compliant OpenMP implementation and how a user can manually replace OpenMP parallel for-loops with IDWS in existing POSIX-compliant C++ applications. Additionally, we measure its performance using both a synthetic benchmark with varying distributions of workload across the iteration space and a real-life application on Sandy Bridge and Xeon Phi systems. Regardless the workload distribution and the underlying hardware, IDWS is always the best or among the best-performing strategies, providing a good all-around solution to the scheduling-choice dilemma. △ Less

Submitted 18 May, 2015; v1 submitted 15 May, 2015; originally announced May 2015.

arXiv:1505.04086 [pdf, ps, other]

A Fast and Scalable Graph Coloring Algorithm for Multi-core and Many-core Architectures

Authors: Georgios Rokos, Gerard Gorman, Paul H J Kelly

Abstract: Irregular computations on unstructured data are an important class of problems for parallel programming. Graph coloring is often an important preprocessing step, e.g. as a way to perform dependency analysis for safe parallel execution. The total run time of a coloring algorithm adds to the overall parallel overhead of the application whereas the number of colors used determines the amount of expos… ▽ More Irregular computations on unstructured data are an important class of problems for parallel programming. Graph coloring is often an important preprocessing step, e.g. as a way to perform dependency analysis for safe parallel execution. The total run time of a coloring algorithm adds to the overall parallel overhead of the application whereas the number of colors used determines the amount of exposed parallelism. A fast and scalable coloring algorithm using as few colors as possible is vital for the overall parallel performance and scalability of many irregular applications that depend upon runtime dependency analysis. Catalyurek et al. have proposed a graph coloring algorithm which relies on speculative, local assignment of colors. In this paper we present an improved version which runs even more optimistically with less thread synchronization and reduced number of conflicts compared to Catalyurek et al.'s algorithm. We show that the new technique scales better on multi-core and many-core systems and performs up to 1.5x faster than its predecessor on graphs with high-degree vertices, while keeping the number of colors at the same near-optimal levels. △ Less

Submitted 18 May, 2015; v1 submitted 15 May, 2015; originally announced May 2015.

Comments: To appear in the proceedings of Euro Par 2015

arXiv:1501.01809 [pdf, other]

doi 10.1145/2998441

Firedrake: automating the finite element method by composing abstractions

Authors: Florian Rathgeber, David A. Ham, Lawrence Mitchell, Michael Lange, Fabio Luporini, Andrew T. T. McRae, Gheorghe-Teodor Bercea, Graham R. Markall, Paul H. J. Kelly

Abstract: Firedrake is a new tool for automating the numerical solution of partial differential equations. Firedrake adopts the domain-specific language for the finite element method of the FEniCS project, but with a pure Python runtime-only implementation centred on the composition of several existing and new abstractions for particular aspects of scientific computing. The result is a more complete separat… ▽ More Firedrake is a new tool for automating the numerical solution of partial differential equations. Firedrake adopts the domain-specific language for the finite element method of the FEniCS project, but with a pure Python runtime-only implementation centred on the composition of several existing and new abstractions for particular aspects of scientific computing. The result is a more complete separation of concerns which eases the incorporation of separate contributions from computer scientists, numerical analysts and application specialists. These contributions may add functionality, or improve performance. Firedrake benefits from automatically applying new optimisations. This includes factorising mixed function spaces, transforming and vectorising inner loops, and intrinsically supporting block matrix operations. Importantly, Firedrake presents a simple public API for escaping the UFL abstraction. This allows users to implement common operations that fall outside pure variational formulations, such as flux-limiters. △ Less

Submitted 1 July, 2016; v1 submitted 8 January, 2015; originally announced January 2015.

Comments: Minor revisions to v2

ACM Class: G.1.8; G.4

Journal ref: ACM Transactions on Mathematical Software 43(3):24:1--24:27 (2016)

arXiv:1410.2167 [pdf, other]

doi 10.1109/ICRA.2015.7140009

Introducing SLAMBench, a performance and accuracy benchmarking methodology for SLAM

Authors: Luigi Nardi, Bruno Bodin, M. Zeeshan Zia, John Mawer, Andy Nisbet, Paul H. J. Kelly, Andrew J. Davison, Mikel Luján, Michael F. P. O'Boyle, Graham Riley, Nigel Topham, Steve Furber

Abstract: Real-time dense computer vision and SLAM offer great potential for a new level of scene modelling, tracking and real environmental interaction for many types of robot, but their high computational requirements mean that use on mass market embedded platforms is challenging. Meanwhile, trends in low-cost, low-power processing are towards massive parallelism and heterogeneity, making it difficult for… ▽ More Real-time dense computer vision and SLAM offer great potential for a new level of scene modelling, tracking and real environmental interaction for many types of robot, but their high computational requirements mean that use on mass market embedded platforms is challenging. Meanwhile, trends in low-cost, low-power processing are towards massive parallelism and heterogeneity, making it difficult for robotics and vision researchers to implement their algorithms in a performance-portable way. In this paper we introduce SLAMBench, a publicly-available software framework which represents a starting point for quantitative, comparable and validatable experimental research to investigate trade-offs in performance, accuracy and energy consumption of a dense RGB-D SLAM system. SLAMBench provides a KinectFusion implementation in C++, OpenMP, OpenCL and CUDA, and harnesses the ICL-NUIM dataset of synthetic RGB-D sequences with trajectory and scene ground truth for reliable accuracy comparison of different implementation and algorithms. We present an analysis and breakdown of the constituent algorithmic elements of KinectFusion, and experimentally investigate their execution time on a variety of multicore and GPUaccelerated platforms. For a popular embedded platform, we also present an analysis of energy efficiency for different configuration alternatives. △ Less

Submitted 26 February, 2015; v1 submitted 8 October, 2014; originally announced October 2014.

Comments: 8 pages, ICRA 2015 conference paper

Journal ref: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7140009 IEEE Xplore 2015

arXiv:1407.0904 [pdf, other]

doi 10.1145/2687415

COFFEE: an Optimizing Compiler for Finite Element Local Assembly

Authors: Fabio Luporini, Ana Lucia Varbanescu, Florian Rathgeber, Gheorghe-Teodor Bercea, J. Ramanujam, David A. Ham, Paul H. J. Kelly

Abstract: The numerical solution of partial differential equations using the finite element method is one of the key applications of high performance computing. Local assembly is its characteristic operation. This entails the execution of a problem-specific kernel to numerically evaluate an integral for each element in the discretized problem domain. Since the domain size can be huge, executing efficient ke… ▽ More The numerical solution of partial differential equations using the finite element method is one of the key applications of high performance computing. Local assembly is its characteristic operation. This entails the execution of a problem-specific kernel to numerically evaluate an integral for each element in the discretized problem domain. Since the domain size can be huge, executing efficient kernels is fundamental. Their op- timization is, however, a challenging issue. Even though affine loop nests are generally present, the short trip counts and the complexity of mathematical expressions make it hard to determine a single or unique sequence of successful transformations. Therefore, we present the design and systematic evaluation of COF- FEE, a domain-specific compiler for local assembly kernels. COFFEE manipulates abstract syntax trees generated from a high-level domain-specific language for PDEs by introducing domain-aware composable optimizations aimed at improving instruction-level parallelism, especially SIMD vectorization, and register locality. It then generates C code including vector intrinsics. Experiments using a range of finite-element forms of increasing complexity show that significant performance improvement is achieved. △ Less

Submitted 4 July, 2014; v1 submitted 3 July, 2014; originally announced July 2014.

Comments: Remove volume metadata

ACM Class: G.1.8; G.4

arXiv:1403.7209 [pdf, other]

doi 10.1109/TPDS.2015.2453972

Acceleration of a Full-scale Industrial CFD Application with OP2

Authors: István Z. Reguly, Gihan R. Mudalige, Carlo Bertolli, Michael B. Giles, Adam Betts, Paul H. J. Kelly, David Radford

Abstract: Hydra is a full-scale industrial CFD application used for the design of turbomachinery at Rolls Royce plc. It consists of over 300 parallel loops with a code base exceeding 50K lines and is capable of performing complex simulations over highly detailed unstructured mesh geometries. Unlike simpler structured-mesh applications, which feature high speed-ups when accelerated by modern processor archit… ▽ More Hydra is a full-scale industrial CFD application used for the design of turbomachinery at Rolls Royce plc. It consists of over 300 parallel loops with a code base exceeding 50K lines and is capable of performing complex simulations over highly detailed unstructured mesh geometries. Unlike simpler structured-mesh applications, which feature high speed-ups when accelerated by modern processor architectures, such as multi-core and many-core processor systems, Hydra presents major challenges in data organization and movement that need to be overcome for continued high performance on emerging platforms. We present research in achieving this goal through the OP2 domain-specific high-level framework. OP2 targets the domain of unstructured mesh problems and follows the design of an active library using source-to-source translation and compilation to generate multiple parallel implementations from a single high-level application source for execution on a range of back-end hardware platforms. We chart the conversion of Hydra from its original hand-tuned production version to one that utilizes OP2, and map out the key difficulties encountered in the process. To our knowledge this research presents the first application of such a high-level framework to a full scale production code. Specifically we show (1) how different parallel implementations can be achieved with an active library framework, even for a highly complicated industrial application such as Hydra, and (2) how different optimizations targeting contrasting parallel architectures can be applied to the whole application, seamlessly, reducing developer effort and increasing code longevity. Performance results demonstrate that not only the same runtime performance as that of the hand-tuned original production code could be achieved, but it can be significantly improved on conventional processor systems. Additionally, we achieve further... △ Less

Submitted 27 March, 2014; originally announced March 2014.

Comments: Submitted to ACM Transactions on Parallel Computing

ACM Class: C.4

Journal ref: IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 5, pp. 1265-1278, May 1 2016. doi: 10.1109/TPDS.2015.2453972

arXiv:1401.4582 [pdf]

Multidisciplinary Engineering Models: Methodology and Case Study in Spreadsheet Analytics

Authors: David Birch, Helen Liang, Paul H J Kelly, Glen Mullineux, Tony Field, Joan Ko, Alvise Simondetti

Abstract: This paper demonstrates a methodology to help practitioners maximise the utility of complex multidisciplinary engineering models implemented as spreadsheets, an area presenting unique challenges. As motivation we investigate the expanding use of Integrated Resource Management(IRM) models which assess the sustainability of urban masterplan designs. IRM models reflect the inherent complexity of mult… ▽ More This paper demonstrates a methodology to help practitioners maximise the utility of complex multidisciplinary engineering models implemented as spreadsheets, an area presenting unique challenges. As motivation we investigate the expanding use of Integrated Resource Management(IRM) models which assess the sustainability of urban masterplan designs. IRM models reflect the inherent complexity of multidisciplinary sustainability analysis by integrating models from many disciplines. This complexity makes their use time-consuming and reduces their adoption. We present a methodology and toolkit for analysing multidisciplinary engineering models implemented as spreadsheets to alleviate such problems and increase their adoption. For a given output a relevant slice of the model is extracted, visualised and analysed by computing model and interdisciplinary metrics. A sensitivity analysis of the extracted model supports engineers in their optimisation efforts. These methods expose, manage and reduce model complexity and risk whilst giving practitioners insight into multidisciplinary model composition. We report application of the methodology to several generations of an industrial IRM model and detail the insight generated, particularly considering model evolution. △ Less

Submitted 18 January, 2014; originally announced January 2014.

Comments: 12 Pages, 8 Colour figures

Journal ref: Proc. European Spreadsheet Risks Int. Grp. (EuSpRIG) 2013, ISBN: 978-1-9054045-1-3

arXiv:1308.2480 [pdf, other]

A thread-parallel algorithm for anisotropic mesh adaptation

Authors: Georgios Rokos, Gerard J. Gorman, James Southern, Paul H. J. Kelly

Abstract: Anisotropic mesh adaptation is a powerful way to directly minimise the computational cost of mesh based simulation. It is particularly important for multi-scale problems where the required number of floating-point operations can be reduced by orders of magnitude relative to more traditional static mesh approaches. Increasingly, finite element and finite volume codes are being optimised for moder… ▽ More Anisotropic mesh adaptation is a powerful way to directly minimise the computational cost of mesh based simulation. It is particularly important for multi-scale problems where the required number of floating-point operations can be reduced by orders of magnitude relative to more traditional static mesh approaches. Increasingly, finite element and finite volume codes are being optimised for modern multi-core architectures. Typically, decomposition methods implemented through the Message Passing Interface (MPI) are applied for inter-node parallelisation, while a threaded programming model, such as OpenMP, is used for intra-node parallelisation. Inter-node parallelism for mesh adaptivity has been successfully implemented by a number of groups. However, thread-level parallelism is significantly more challenging because the underlying data structures are extensively modified during mesh adaptation and a greater degree of parallelism must be realised. In this paper we describe a new thread-parallel algorithm for anisotropic mesh adaptation algorithms. For each of the mesh optimisation phases (refinement, coarsening, swapping and smoothing) we describe how independent sets of tasks are defined. We show how a deferred updates strategy can be used to update the mesh data structures in parallel and without data contention. We show that despite the complex nature of mesh adaptation and inherent load imbalances in the mesh adaptivity, a parallel efficiency of 60% is achieved on an 8 core Intel Xeon Sandybridge, and a 40% parallel efficiency is achieved using 16 cores in a 2 socket Intel Xeon Sandybridge ccNUMA system. △ Less

Submitted 12 August, 2013; originally announced August 2013.

Showing 1–33 of 33 results for author: Kelly, P H J