Search | arXiv e-print repository

Words That Unite The World: A Unified Framework for Deciphering Central Bank Communications Globally

Authors: Agam Shah, Siddhant Sukhani, Huzaifa Pardawala, Saketh Budideti, Riya Bhadani, Rudra Gopal, Siddhartha Somani, Michael Galarnyk, Soungmin Lee, Arnav Hiray, Akshar Ravichandran, Eric Kim, Pranav Aluru, Joshua Zhang, Sebastian Jaskowski, Veer Guda, Meghaj Tarte, Liqin Ye, Spencer Gosden, Rutwik Routu, Rachel Yuh, Sloka Chava, Sahasra Chava, Dylan Patrick Kelly, Aiden Chiang, et al. (2 additional authors not shown)

Abstract: Central banks around the world play a crucial role in maintaining economic stability. Deciphering policy implications in their communications is essential, especially as misinterpretations can disproportionately impact vulnerable populations. To address this, we introduce the World Central Banks (WCB) dataset, the most comprehensive monetary policy corpus to date, comprising over 380k sentences from 25 central banks across diverse geographic regions, spanning 28 years of historical data. After uniformly sampling 1k sentences per bank (25k total) across all available years, we annotate and review each sentence using dual annotators, disagreement resolutions, and secondary expert reviews. We define three tasks: Stance Detection, Temporal Classification, and Uncertainty Estimation, with each sentence annotated for all three. We benchmark seven Pretrained Language Models (PLMs) and nine Large Language Models (LLMs) (Zero-Shot, Few-Shot, and with annotation guide) on these tasks, running 15,075 benchmarking experiments. We find that a model trained on aggregated data across banks significantly surpasses a model trained on an individual bank's data, confirming the principle "the whole is greater than the sum of its parts." Additionally, rigorous human evaluations, error analyses, and predictive tasks validate our framework's economic utility. Our artifacts are accessible through HuggingFace and GitHub under the CC-BY-NC-SA 4.0 license.

Submitted 15 May, 2025; originally announced May 2025.

arXiv:2503.19894 [pdf, other]

Versatile Cross-platform Compilation Toolchain for Schrödinger-style Quantum Circuit Simulation

Authors: Yuncheng Lu, Shuang Liang, Hongxiang Fan, Ce Guo, Wayne Luk, Paul H. J. Kelly

Abstract: While existing quantum hardware resources have limited availability and reliability, there is a growing demand for exploring and verifying quantum algorithms. Efficient classical simulators for high-performance quantum simulation are critical to meeting this demand. However, due to the vastly varied characteristics of classical hardware, implementing hardware-specific optimizations for different hardware platforms is challenging. To address such needs, we propose CAST (Cross-platform Adaptive Schrödinger-style Simulation Toolchain), a novel compilation toolchain with cross-platform (CPU and Nvidia GPU) optimization and high-performance backend support. CAST exploits a novel sparsity-aware gate fusion algorithm that automatically selects the best fusion strategy and backend configuration for targeted hardware platforms. CAST also aims to offer a versatile, high-performance backend for different hardware platforms. To this end, CAST provides an LLVM IR-based vectorization optimization for various CPU architectures and instruction sets, as well as a PTX-based code generator for Nvidia GPU support. We benchmark CAST against IBM Qiskit, Google QSimCirq, Nvidia cuQuantum backend, and other high-performance simulators. On various 32-qubit CPU-based benchmarks, CAST achieves up to 8.03x speedup over Qiskit. On various 30-qubit GPU-based benchmarks, CAST achieves up to 39.3x speedup over the Nvidia cuQuantum backend.

Submitted 25 March, 2025; originally announced March 2025.

Comments: To appear in DAC 25

arXiv:2503.12315 [pdf, other]

Simulation-based Bayesian inference under model misspecification

Authors: Ryan P. Kelly, David J. Warne, David T. Frazier, David J. Nott, Michael U. Gutmann, Christopher Drovandi

Abstract: Simulation-based Bayesian inference (SBI) methods are widely used for parameter estimation in complex models where evaluating the likelihood is challenging but generating simulations is relatively straightforward. However, these methods commonly assume that the simulation model accurately reflects the true data-generating process, an assumption that is frequently violated in realistic scenarios. In this paper, we focus on the challenges faced by SBI methods under model misspecification. We consolidate recent research aimed at mitigating the effects of misspecification, highlighting three key strategies: i) robust summary statistics, ii) generalised Bayesian inference, and iii) error modelling and adjustment parameters. To illustrate both the vulnerabilities of popular SBI methods and the effectiveness of misspecification-robust alternatives, we present empirical results on an illustrative example.

Submitted 15 March, 2025; originally announced March 2025.

Comments: 46 pages, 8 figures

arXiv:2404.13557 [pdf, other]

Preconditioned Neural Posterior Estimation for Likelihood-free Inference

Authors: Xiaoyu Wang, Ryan P. Kelly, David J. Warne, Christopher Drovandi

Abstract: Simulation-based inference (SBI) methods enable the estimation of posterior distributions when the likelihood function is intractable, but where model simulation is feasible. Popular neural approaches to SBI are the neural posterior estimator (NPE) and its sequential version (SNPE). These methods can outperform statistical SBI approaches such as approximate Bayesian computation (ABC), particularly for relatively small numbers of model simulations. However, we show in this paper that the NPE methods are not guaranteed to be highly accurate, even on problems with low dimension. In such settings the posterior cannot be accurately trained over the prior predictive space, and even the sequential extension remains sub-optimal. To overcome this, we propose preconditioned NPE (PNPE) and its sequential version (PSNPE), which use a short run of ABC to effectively eliminate regions of parameter space that produce large discrepancy between simulations and data, allowing the posterior emulator to be more accurately trained. We present comprehensive empirical evidence that this melding of neural and statistical SBI methods improves performance over a range of examples, including a motivating example involving a complex agent-based model applied to real tumour growth data.

Submitted 21 April, 2024; originally announced April 2024.

Comments: 31 pages, 11 figures
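
The preconditioning step in the abstract above relies on a short run of ABC. As a rough illustration only, and not the authors' implementation, a plain ABC rejection sampler on a toy Gaussian-mean model might look like the sketch below. The toy model, the uniform prior, the sample-mean summary statistic, the tolerance `epsilon`, and all function names are assumptions made for this example.

```python
import random
import statistics


def simulate(theta, n_obs, rng):
    """Toy simulator (assumed): n_obs draws from Normal(theta, 1)."""
    return [rng.gauss(theta, 1.0) for _ in range(n_obs)]


def abc_rejection(observed_mean, n_draws=2000, n_obs=50, epsilon=0.2, seed=0):
    """Keep prior draws whose simulated summary lies within epsilon of the data.

    Prior (assumed): theta ~ Uniform(-5, 5).
    Summary statistic (assumed): the sample mean of the simulated data.
    """
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(-5.0, 5.0)
        sim_mean = statistics.fmean(simulate(theta, n_obs, rng))
        # Reject parameters that produce a large discrepancy with the data.
        if abs(sim_mean - observed_mean) < epsilon:
            accepted.append(theta)
    return accepted


if __name__ == "__main__":
    kept = abc_rejection(observed_mean=2.0)
    print(f"accepted {len(kept)} of 2000 prior draws")
```

The accepted draws concentrate in the region of parameter space that is compatible with the data; in a preconditioned scheme, a neural posterior estimator would then be trained on simulations from that restricted region rather than from the full prior.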

arXiv:2404.02218 [pdf, other]

doi 10.1145/3620666.3651344

A shared compilation stack for distributed-memory parallelism in stencil DSLs

Authors: George Bisbas, Anton Lydike, Emilien Bauer, Nick Brown, Mathieu Fehr, Lawrence Mitchell, Gabriel Rodriguez-Canal, Maurice Jamieson, Paul H. J. Kelly, Michel Steuwer, Tobias Grosser

Abstract: Domain Specific Languages (DSLs) increase programmer productivity and provide high performance. Their targeted abstractions allow scientists to express problems at a high level, providing rich details that optimizing compilers can exploit to target current- and next-generation supercomputers. The convenience and performance of DSLs come with significant development and maintenance costs. The siloed design of DSL compilers and the resulting inability to benefit from shared infrastructure cause uncertainties around longevity and the adoption of DSLs at scale. By tailoring the broadly-adopted MLIR compiler framework to HPC, we bring the same synergies that the machine learning community already exploits across their DSLs (e.g. TensorFlow, PyTorch) to the finite-difference stencil HPC community. We introduce new HPC-specific abstractions for message passing targeting distributed stencil computations. We demonstrate the sharing of common components across three distinct HPC stencil-DSL compilers: Devito, PSyclone, and the Open Earth Compiler, showing that our framework generates high-performance executables based upon a shared compiler ecosystem.

Submitted 7 March, 2025; v1 submitted 2 April, 2024; originally announced April 2024.

Comments: Fix some bibtex links, journal ref

Journal ref: In ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 38-56 (2024)

arXiv:2403.08056 [pdf, other]

Improving Memory Dependence Prediction with Static Analysis

Authors: Luke Panayi, Rohan Gandhi, Jim Whittaker, Vassilios Chouliaras, Martin Berger, Paul Kelly

Abstract: This paper explores the potential of communicating information gained by static analysis from compilers to Out-of-Order (OoO) machines, focusing on the memory dependence predictor (MDP). The MDP enables loads to issue without all in-flight store addresses being known, with minimal memory order violations. We use LLVM to find loads with no dependencies and label them via their opcode. These labelled loads skip making lookups into the MDP, improving prediction accuracy by reducing false dependencies. We communicate this information in a minimally intrusive way, i.e. without introducing additional hardware costs or instruction bandwidth, providing these improvements without any additional overhead in the CPU. We find that in select cases in Spec2017, a significant number of load instructions can skip interacting with the MDP and lead to a performance gain. These results point to greater possibilities for static analysis as a source of near-zero-cost performance gains in future CPU designs.

Submitted 5 June, 2024; v1 submitted 12 March, 2024; originally announced March 2024.

Comments: 15 pages

ACM Class: B.0; B.8; C.1

arXiv:2401.15036 [pdf, other]

doi 10.1109/LRA.2024.3352361

Distributed Simultaneous Localisation and Auto-Calibration using Gaussian Belief Propagation

Authors: Riku Murai, Ignacio Alzugaray, Paul H. J. Kelly, Andrew J. Davison

Abstract: We present a novel scalable, fully distributed, and online method for simultaneous localisation and extrinsic calibration for multi-robot setups. Individual a priori unknown robot poses are probabilistically inferred as robots sense each other while simultaneously calibrating the extrinsics of their sensors and markers using Gaussian Belief Propagation. In the presented experiments, we show how our method not only yields accurate robot localisation and auto-calibration but is also able to perform under challenging circumstances such as highly noisy measurements, significant communication failures or limited communication range.

Submitted 26 January, 2024; originally announced January 2024.

Comments: Published in IEEE Robotics and Automation Letters (RA-L) 2024

Journal ref: IEEE Robotics and Automation Letters, vol. 9, no. 3, pp. 2136-2143, March 2024

arXiv:2312.13094 [pdf, other]

Automated MPI-X code generation for scalable finite-difference solvers

Authors: George Bisbas, Rhodri Nelson, Mathias Louboutin, Fabio Luporini, Paul H. J. Kelly, Gerard Gorman

Abstract: Partial differential equations (PDEs) are crucial in modeling diverse phenomena across scientific disciplines, including seismic and medical imaging, computational fluid dynamics, image processing, and neural networks. Solving these PDEs at scale is an intricate and time-intensive process that demands careful tuning. This paper introduces automated code-generation techniques specifically tailored for distributed memory parallelism (DMP) to execute explicit finite-difference (FD) stencils at scale, a fundamental challenge in numerous scientific applications. These techniques are implemented and integrated into the Devito DSL and compiler framework, a well-established solution for automating the generation of FD solvers based on a high-level symbolic math input. Users benefit from modeling simulations for real-world applications at a high-level symbolic abstraction and effortlessly harnessing HPC-ready distributed-memory parallelism without altering their source code. This results in drastic reductions both in execution time and developer effort. A comprehensive performance evaluation of Devito's DMP via MPI demonstrates highly competitive strong and weak scaling on CPU and GPU clusters, proving its effectiveness and capability to meet the demands of large-scale scientific simulations.

Submitted 6 January, 2025; v1 submitted 20 December, 2023; originally announced December 2023.

Comments: 11 pages, 12 figures (23 pages with References and Appendix)

arXiv:2312.06741 [pdf, other]

Gaussian Splatting SLAM

Authors: Hidenobu Matsuki, Riku Murai, Paul H. J. Kelly, Andrew J. Davison

Abstract: We present the first application of 3D Gaussian Splatting in monocular SLAM, the most fundamental but the hardest setup for Visual SLAM. Our method, which runs live at 3fps, utilises Gaussians as the only 3D representation, unifying the required representation for accurate, efficient tracking, mapping, and high-quality rendering. Designed for challenging monocular settings, our approach is seamlessly extendable to RGB-D SLAM when an external depth sensor is available. Several innovations are required to continuously reconstruct 3D scenes with high fidelity from a live camera. First, to move beyond the original 3DGS algorithm, which requires accurate poses from an offline Structure from Motion (SfM) system, we formulate camera tracking for 3DGS using direct optimisation against the 3D Gaussians, and show that this enables fast and robust tracking with a wide basin of convergence. Second, by utilising the explicit nature of the Gaussians, we introduce geometric verification and regularisation to handle the ambiguities occurring in incremental 3D dense reconstruction. Finally, we introduce a full SLAM system which achieves state-of-the-art results not only in novel view synthesis and trajectory estimation but also in the reconstruction of tiny and even transparent objects.

Submitted 14 April, 2024; v1 submitted 11 December, 2023; originally announced December 2023.

Comments: CVPR2024 Highlight. First two authors contributed equally to this work. Project Page: https://rmurai.co.uk/projects/GaussianSplattingSLAM/

arXiv:2305.05580 [pdf, other]

doi 10.1007/978-3-031-31435-3_21

Fashion CUT: Unsupervised domain adaptation for visual pattern classification in clothes using synthetic data and pseudo-labels

Authors: Enric Moreu, Alex Martinelli, Martina Naughton, Philip Kelly, Noel E. O'Connor

Abstract: Accurate product information is critical for e-commerce stores to allow customers to browse, filter, and search for products. Product data quality is affected by missing or incorrect information, resulting in poor customer experience. While machine learning can be used to correct inaccurate or missing information, achieving high performance on fashion image classification tasks requires large amounts of annotated data, which is expensive to generate due to labeling costs. One solution can be to generate synthetic data, which requires no manual labeling. However, training a model with a dataset of solely synthetic images can lead to poor generalization when performing inference on real-world data because of the domain shift. We introduce a new unsupervised domain adaptation technique that converts images from the synthetic domain into the real-world domain. Our approach combines a generative neural network and a classifier that are jointly trained to produce realistic images while preserving the synthetic label information. We found that using real-world pseudo-labels during training helps the classifier to generalize in the real-world domain, reducing the synthetic bias. We successfully train a visual pattern classification model in the fashion domain without real-world annotations. Experiments show that our method outperforms other unsupervised domain adaptation algorithms.

Submitted 9 May, 2023; originally announced May 2023.

arXiv:2305.02491 [pdf, other]

Self-Supervised Learning for Organs At Risk and Tumor Segmentation with Uncertainty Quantification

Authors: Ilkin Isler, Debesh Jha, Curtis Lisle, Justin Rineer, Patrick Kelly, Bulent Aydogan, Mohamed Abazeed, Damla Turgut, Ulas Bagci

Abstract: In this study, our goal is to show the impact of self-supervised pre-training of transformers for organ at risk (OAR) and tumor segmentation as compared to costly fully-supervised learning. The proposed algorithm is called Monte Carlo Transformer based U-Net (MC-Swin-U). Unlike many other available models, our approach presents uncertainty quantification with a Monte Carlo dropout strategy while generating its voxel-wise prediction. We test and validate the proposed model on both public datasets and one private dataset and evaluate the gross tumor volume (GTV) as well as nearby risky organs' boundaries. We show that the self-supervised pre-training approach improves the segmentation scores significantly while providing additional benefits for avoiding large-scale annotation costs.

Submitted 3 May, 2023; originally announced May 2023.

arXiv:2301.13368 [pdf, other]

Misspecification-robust Sequential Neural Likelihood for Simulation-based Inference

Authors: Ryan P. Kelly, David J. Nott, David T. Frazier, David J. Warne, Chris Drovandi

Abstract: Simulation-based inference techniques are indispensable for parameter estimation of mechanistic and simulable models with intractable likelihoods. While traditional statistical approaches like approximate Bayesian computation and Bayesian synthetic likelihood have been studied under well-specified and misspecified settings, they often suffer from inefficiencies due to wasted model simulations. Neural approaches, such as sequential neural likelihood (SNL), avoid this wastage by utilising all model simulations to train a neural surrogate for the likelihood function. However, the performance of SNL under model misspecification is unreliable and can result in overconfident posteriors centred around an inaccurate parameter estimate. In this paper, we propose a novel SNL method which, through the incorporation of additional adjustment parameters, is robust to model misspecification and capable of identifying features of the data that the model is not able to recover. We demonstrate the efficacy of our approach through several illustrative examples, where our method gives more accurate point estimates and uncertainty quantification than SNL.

Submitted 7 March, 2024; v1 submitted 30 January, 2023; originally announced January 2023.

arXiv:2203.03092 [pdf, other]

Systematic Comparison of Path Planning Algorithms using PathBench

Authors: Hao-Ya Hsueh, Alexandru-Iosif Toma, Hussein Ali Jaafar, Edward Stow, Riku Murai, Paul H. J. Kelly, Sajad Saeedi

Abstract: Path planning is an essential component of mobile robotics. Classical path planning algorithms, such as wavefront and rapidly-exploring random tree (RRT), are used heavily in autonomous robots. With the recent advances in machine learning, development of learning-based path planning algorithms has been experiencing rapid growth. A unified path planning interface that facilitates the development and benchmarking of existing and new algorithms is needed. This paper presents PathBench, a platform for developing, visualizing, training, testing, and benchmarking of existing and future, classical and learning-based path planning algorithms in 2D and 3D grid world environments. Many existing path planning algorithms are supported; e.g. A*, Dijkstra, waypoint planning networks, value iteration networks, gated path planning networks; and integrating new algorithms is easy and clearly specified. The benchmarking ability of PathBench is explored in this paper by comparing algorithms across five different hardware systems and three different map types, including built-in PathBench maps, video game maps, and maps from real world databases. Metrics, such as path length, success rate, and computational time, were used to evaluate algorithms. Algorithmic analysis was also performed on a real world robot to demonstrate PathBench's support for Robot Operating System (ROS). PathBench is open source.

Submitted 6 March, 2022; originally announced March 2022.

Comments: Accepted to Advanced Robotics Journal; 23 pages, 9 figures, 4 tables. arXiv admin note: substantial text overlap with arXiv:2105.01777

arXiv:2202.03314 [pdf, other]

doi 10.1109/TRO.2023.3324127

A Robot Web for Distributed Many-Device Localisation

Authors: Riku Murai, Joseph Ortiz, Sajad Saeedi, Paul H. J. Kelly, Andrew J. Davison

Abstract: We show that a distributed network of robots or other devices which make measurements of each other can collaborate to globally localise via efficient ad-hoc peer-to-peer communication. Our Robot Web solution is based on Gaussian Belief Propagation on the fundamental non-linear factor graph describing the probabilistic structure of all of the observations robots make internally or of each other, and is flexible for any type of robot, motion or sensor. We define a simple and efficient communication protocol which can be implemented by the publishing and reading of web pages or other asynchronous communication technologies. We show in simulations with up to 1000 robots interacting in arbitrary patterns that our solution convergently achieves global accuracy as accurate as a centralised non-linear factor graph solver while operating with high distributed efficiency of computation and communication. Via the use of robust factors in GBP, our method is tolerant to a high percentage of faults in sensor measurements or dropped communication packets.

Submitted 26 January, 2024; v1 submitted 7 February, 2022; originally announced February 2022.

Comments: Published in IEEE Transactions on Robotics (TRO) 2023

Journal ref: IEEE Transactions on Robotics, vol. 40, pp. 121-138, 2024

arXiv:2202.01866 [pdf, other]

Enhancing Organ at Risk Segmentation with Improved Deep Neural Networks

Authors: Ilkin Isler, Curtis Lisle, Justin Rineer, Patrick Kelly, Damla Turgut, Jacob Ricci, Ulas Bagci

Abstract: Organ at risk (OAR) segmentation is a crucial step for treatment planning and outcome determination in radiotherapy treatments of cancer patients. Several deep learning based segmentation algorithms have been developed in recent years; however, U-Net remains the de facto algorithm designed specifically for biomedical image segmentation and has spawned many variants with known weaknesses. In this study, our goal is to present simple architectural changes in U-Net to improve its accuracy and generalization properties. Unlike many other available studies evaluating their algorithms on single center data, we thoroughly evaluate several variations of U-Net as well as our proposed enhanced architecture on multiple data sets for an extensive and reliable study of the OAR segmentation problem. Our enhanced segmentation model includes (a) architectural changes in the loss function, (b) optimization framework, and (c) convolution type. Testing on three publicly available multi-object segmentation data sets, we achieved an average of 80% dice score compared to the baseline U-Net performance of 63%.

Submitted 3 February, 2022; originally announced February 2022.

Comments: 7 pages, 3 figures, 6 tables, The paper is published in SPIE Medical Imaging 2022
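
The Dice score quoted in the abstract above measures the overlap between a predicted segmentation mask and a reference mask: twice the size of their intersection divided by the sum of their sizes. The sketch below computes it for binary masks given as flat lists of 0/1 values. The function name, the flat-list input format, and the convention of returning 1.0 when both masks are empty are assumptions for this illustration, not details taken from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient 2*|A & B| / (|A| + |B|) for two binary masks.

    Both masks are flat sequences of 0/1 values of equal length (assumed format).
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]
    reference = [1, 0, 0, 0]
    print(round(dice_score(predicted, reference), 4))
```

For a multi-organ data set, a per-organ score of this kind would typically be averaged over organs and cases to obtain a single figure such as the averages reported in the abstract.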

arXiv:2106.07456 [pdf, other]

Extending the RISC-V ISA for exploring advanced reconfigurable SIMD instructions

Authors: Philippos Papaphilippou, Paul H. J. Kelly, Wayne Luk

Abstract: This paper presents a novel, non-standard set of vector instruction types for exploring custom SIMD instructions in a softcore. The new types allow simultaneous access to a relatively high number of operands, reducing the instruction count where applicable. Additionally, a high-performance open-source RISC-V (RV32 IM) softcore is introduced, optimised for exploring custom SIMD instructions and streaming performance. By providing instruction templates for instruction development in HDL/Verilog, efficient FPGA-based instructions can be developed with few low-level lines of code. In order to improve custom SIMD instruction performance, the softcore's cache hierarchy is optimised for bandwidth, such as with very wide blocks for the last-level cache. The approach is demonstrated on example memory-intensive applications on an FPGA. Although the exploration is based on the softcore, the goal is to provide a means to experiment with advanced SIMD instructions which could be loaded in future CPUs that feature reconfigurable regions as custom instructions. Finally, we provide some insights on the challenges and effectiveness of such future micro-architectures.

Submitted 14 June, 2021; originally announced June 2021.

Comments: Accepted at the Fifth Workshop on Computer Architecture Research with RISC-V (CARRV 2021), co-located with ISCA 2021

arXiv:2106.06086 [pdf, ps, other]

PSB2: The Second Program Synthesis Benchmark Suite

Authors: Thomas Helmuth, Peter Kelly

Abstract: For the past six years, researchers in genetic programming and other program synthesis disciplines have used the General Program Synthesis Benchmark Suite to benchmark many aspects of automatic program synthesis systems. These problems have been used to make notable progress toward the goal of general program synthesis: automatically creating the types of software that human programmers code. Many of the systems that have attempted the problems in the original benchmark suite have used it to demonstrate performance improvements granted through new techniques. Over time, the suite has gradually become outdated, hindering the accurate measurement of further improvements. The field needs a new set of more difficult benchmark problems to move beyond what was previously possible. In this paper, we describe the 25 new general program synthesis benchmark problems that make up PSB2, a new benchmark suite. These problems are curated from a variety of sources, including programming katas and college courses. We selected these problems to be more difficult than those in the original suite, and give results using PushGP showing this increase in difficulty. These new problems give plenty of room for improvement, pointing the way for the next six or more years of general program synthesis research.

Submitted 10 June, 2021; originally announced June 2021.

Comments: To be published in GECCO 2021

arXiv:2105.01777 [pdf, other]

PathBench: A Benchmarking Platform for Classical and Learned Path Planning Algorithms

Authors: Alexandru-Iosif Toma, Hao-Ya Hsueh, Hussein Ali Jaafar, Riku Murai, Paul H. J. Kelly, Sajad Saeedi

Abstract: Path planning is a key component in mobile robotics. A wide range of path planning algorithms exist, but few attempts have been made to benchmark the algorithms holistically or unify their interface. Moreover, with the recent advances in deep neural networks, there is an urgent need to facilitate the development and benchmarking of such learning-based planning algorithms. This paper presents PathBench, a platform for developing, visualizing, training, testing, and benchmarking of existing and future, classical and learned 2D and 3D path planning algorithms, while offering support for Robot Operating System (ROS). Many existing path planning algorithms are supported; e.g. A*, wavefront, rapidly-exploring random tree, value iteration networks, gated path planning networks; and integrating new algorithms is easy and clearly specified. We demonstrate the benchmarking capability of PathBench by comparing implemented classical and learned algorithms for metrics, such as path length, success rate, computational time and path deviation. These evaluations are done on built-in PathBench maps and external path planning environments from video games and real world databases. PathBench is open source.

Submitted 4 May, 2021; originally announced May 2021.

Comments: The Conference on Robots and Vision (CRV2021), Supplementary Website: https://sites.google.com/view/pathbench/

arXiv:2104.05834 [pdf, other]

Generative Design of NU's Husky Carbon, A Morpho-Functional, Legged Robot

Authors: Alireza Ramezani, Pravin Dangol, Eric Sihite, Andrew Lessieur, Peter Kelly

Abstract: We report the design of a morpho-functional robot called Husky Carbon. Our goal is to integrate two forms of mobility, aerial and quadrupedal-legged locomotion, within a single platform. There are prohibitive design restrictions such as tight power budget and payload, which can particularly become important in aerial flights. To address these challenges, we pose a problem called the Mobility Value of Added Mass (MVAM) problem. In the MVAM problem, we attempt to allocate mass in our designs such that the energetic performance is affected the least. To solve the MVAM problem, we adopted a generative design approach using Grasshopper's evolutionary solver to synthesize a parametric design space for Husky. Then, this space was searched for the morphologies that could yield a minimized Total Cost Of Transport (TCOT) and payload. This approach revealed that a front-heavy quadrupedal robot can achieve a lower TCOT while retaining larger margins on allowable added mass to its design. Based on this framework Husky was built and tested as a front-heavy robot.

Submitted 12 April, 2021; originally announced April 2021.

Comments: 7 Pages, 7 figures, submitted to ICRA 2021

arXiv:2101.08886 [pdf]

A co-Design approach to develop a smart cooking appliance. Applying a Domain Specific Language for a community supported appliance

Authors: Matteo Zallio, Paula Kelly, Barry Cryan, Damon Berry

Abstract: Our environment, whether at work, in public spaces, or at home, is becoming more connected, and increasingly responsive. Meal preparation even when it involves simply heating ready-made food can be perceived as a complex process for people with disabilities. This research aimed to prototype, using a co-Design approach a Community Supported Appliance (CSA) by developing a Domain Specific Language (DSL), precisely created for a semi-automated cooking process. The DSL was shaped and expressed in the idiom of the users and allowed the CSA to support independence for users while performing daily cooking activities.

Submitted 21 June, 2021; v1 submitted 21 January, 2021; originally announced January 2021.

Comments: 9 pages, 7 figures

arXiv:2101.08715 [pdf, other]

Cain: Automatic Code Generation for Simultaneous Convolutional Kernels on Focal-plane Sensor-processors

Authors: Edward Stow, Riku Murai, Sajad Saeedi, Paul H. J. Kelly

Abstract: Focal-plane Sensor-processors (FPSPs) are a camera technology that enable low power, high frame rate computation, making them suitable for edge computation. Unfortunately, these devices' limited instruction sets and registers make developing complex algorithms difficult. In this work, we present Cain - a compiler that targets SCAMP-5, a general-purpose FPSP - which generates code from multiple convolutional kernels. As an example, given the convolutional kernels for an MNIST digit recognition neural network, Cain produces code that is half as long, when compared to the other available compilers for SCAMP-5.

Submitted 21 January, 2021; originally announced January 2021.

Comments: 17 pages, 4 figures, Accepted at LCPC 2020 to be published by Springer

ACM Class: D.3.4; I.4.m

arXiv:2010.10248 [pdf, other]

Temporal blocking of finite-difference stencil operators with sparse "off-the-grid" sources

Authors: George Bisbas, Fabio Luporini, Mathias Louboutin, Rhodri Nelson, Gerard Gorman, Paul H. J. Kelly

Abstract: Stencil kernels dominate a range of scientific applications, including seismic and medical imaging, image processing, and neural networks. Temporal blocking is a performance optimization that aims to reduce the required memory bandwidth of stencil computations by re-using data from the cache for multiple time steps. It has already been shown to be beneficial for this class of algorithms. However, applying temporal blocking to practical applications' stencils remains challenging. These computations often consist of sparsely located operators not aligned with the computational grid ("off-the-grid"). Our work is motivated by modeling problems in which source injections result in wavefields that must then be measured at receivers by interpolation from the gridded wavefield. The resulting data dependencies make the adoption of temporal blocking much more challenging. We propose a methodology to inspect these data dependencies and reorder the computation, leading to performance gains in stencil codes where temporal blocking has not been applicable. We implement this novel scheme in the Devito domain-specific compiler toolchain. Devito implements a domain-specific language embedded in Python to generate optimized partial differential equation solvers using the finite-difference method from high-level symbolic problem definitions. We evaluate our scheme using isotropic acoustic, anisotropic acoustic, and isotropic elastic wave propagators of industrial significance. After auto-tuning, performance evaluation shows that this enables substantial performance improvement through temporal blocking over highly-optimized vectorized spatially-blocked code of up to 1.6x.

Submitted 25 February, 2021; v1 submitted 20 October, 2020; originally announced October 2020.

Comments: Accepted for publication at 35th IEEE International Parallel & Distributed Processing Symposium

arXiv:2010.04702 [pdf, other]

Mechanism Design of a Bio-inspired Armwing Mechanism for Mimicking Bat Flapping Gait

Authors: E. Sihite, P. Kelly, A. Ramezani

Abstract: The objective of this work is to design and develop a bio-inspired soft and articulated armwing structure which will be an integral component of a morphing aerial co-bot, Aerobat. In our design, we draw inspiration from bats. Bat membranous wings possess unique functions that make them a good example to take inspiration from and transform current aerial drones. In contrast with other flying vertebrates, bats have an extremely articulated musculoskeletal system, key to their body impact-survivability and deliver an impressively adaptive and multimodal locomotion behavior. Bats exclusively use this capability with structural flexibility to generate the controlled force distribution on each wing membrane. The wing flexibility, complex wing kinematics, and fast muscle actuation allow these creatures to change the body configuration within a few tens of milliseconds. These characteristics are crucial to the unrivaled agility of bats and copying them can potentially transform the state-of-the-art aerial drone design.

Submitted 9 October, 2020; originally announced October 2020.

Comments: 2 pages, abstract, 2 figures, accepted at International Conference on Intelligent Robots and Systems (IROS), 2020

arXiv:2006.07187 [pdf, other]

doi 10.3390/info11060318

HMIC: Hierarchical Medical Image Classification, A Deep Learning Approach

Authors: Kamran Kowsari, Rasoul Sali, Lubaina Ehsan, William Adorno, Asad Ali, Sean Moore, Beatrice Amadi, Paul Kelly, Sana Syed, Donald Brown

Abstract: Image classification is central to the big data revolution in medicine. Improved information processing methods for diagnosis and classification of digital medical images have shown to be successful via deep learning approaches. As this field is explored, there are limitations to the performance of traditional supervised classifiers. This paper outlines an approach that is different from the current medical image classification tasks that view the issue as multi-class classification. We performed a hierarchical classification using our Hierarchical Medical Image classification (HMIC) approach. HMIC uses stacks of deep learning models to give particular comprehension at each level of the clinical picture hierarchy. For testing our performance, we use biopsy of the small bowel images that contain three categories in the parent level (Celiac Disease, Environmental Enteropathy, and histologically normal controls). For the child level, Celiac Disease Severity is classified into 4 classes (I, IIIa, IIIb, and IIIC).

Submitted 23 June, 2020; v1 submitted 12 June, 2020; originally announced June 2020.

Journal ref: Information 11, no. 6 (2020): 318

arXiv:2006.04794 [pdf]

Abstracting spreadsheet data flow through hypergraph redrawing

Authors: David Birch, Nicolai Stawinoga, Jack Binks, Bruno Nicoletti, Paul Kelly

Abstract: We believe the error prone nature of traditional spreadsheets is due to their low level of abstraction. End user programmers are forced to construct their data models from low level cells which we define as "a data container or manipulator linked by user-intent to model their world and positioned to reflect its structure". Spreadsheet cells are limited in what they may contain (scalar values) and the links between them are inherently hidden. This paper proposes a method of raising the level of abstraction of spreadsheets by "redrawing the boundary" of the cell. To expose the hidden linkage structure we transform spreadsheets into fine-grained graphs with operators and values as nodes. "cells" are then represented as hypergraph edges by drawing a boundary "wall" around a set of operator/data nodes. To extend what cells may contain and to create a higher level model of the spreadsheet we propose that researchers should seek techniques to redraw these boundaries to create higher level "cells" which will more faithfully represent the end-user's real world/mental model. We illustrate this approach via common sub-expression identification and the application of sub-tree isomorphisms for the detection of vector (array) operations.

Submitted 4 June, 2020; originally announced June 2020.

Comments: 23 Pages, 12 Colour Figures

Journal ref: Proceedings of the EuSpRIG 2019 Conference "Spreadsheet Risk Management", Browns, Covent Garden, London, pp79-102, ISBN: 978-1-905404-56-8

arXiv:2006.01765 [pdf, other]

AnalogNet: Convolutional Neural Network Inference on Analog Focal Plane Sensor Processors

Authors: Matthew Z. Wong, Benoit Guillard, Riku Murai, Sajad Saeedi, Paul H. J. Kelly

Abstract: We present a high-speed, energy-efficient Convolutional Neural Network (CNN) architecture utilising the capabilities of a unique class of devices known as analog Focal Plane Sensor Processors (FPSP), in which the sensor and the processor are embedded together on the same silicon chip. Unlike traditional vision systems, where the sensor array sends collected data to a separate processor for processing, FPSPs allow data to be processed on the imaging device itself. This unique architecture enables ultra-fast image processing and high energy efficiency, at the expense of limited processing resources and approximate computations. In this work, we show how to convert standard CNNs to FPSP code, and demonstrate a method of training networks to increase their robustness to analog computation errors. Our proposed architecture, coined AnalogNet, reaches a testing accuracy of 96.9% on the MNIST handwritten digits recognition task, at a speed of 2260 FPS, for a cost of 0.7 mJ per frame.

Submitted 21 June, 2020; v1 submitted 2 June, 2020; originally announced June 2020.

Comments: 8 pages, 7 figures

arXiv:2005.03868 [pdf, other]

Hierarchical Deep Convolutional Neural Networks for Multi-category Diagnosis of Gastrointestinal Disorders on Histopathological Images

Authors: Rasoul Sali, Sodiq Adewole, Lubaina Ehsan, Lee A. Denson, Paul Kelly, Beatrice C. Amadi, Lori Holtz, Syed Asad Ali, Sean R. Moore, Sana Syed, Donald E. Brown

Abstract: Deep convolutional neural networks (CNNs) have been successful for a wide range of computer vision tasks, including image classification. A specific area of the application lies in digital pathology for pattern recognition in the tissue-based diagnosis of gastrointestinal (GI) diseases. This domain can utilize CNNs to translate histopathological images into precise diagnostics. This is challenging since these complex biopsies are heterogeneous and require multiple levels of assessment. This is mainly due to structural similarities in different parts of the GI tract and shared features among different gut diseases. Addressing this problem with a flat model that assumes all classes (parts of the gut and their diseases) are equally difficult to distinguish leads to an inadequate assessment of each class. Since the hierarchical model restricts classification error to each sub-class, it leads to a more informative model than a flat model. In this paper, we propose to apply the hierarchical classification of biopsy images from different parts of the GI tract and the receptive diseases within each. We embedded a class hierarchy into the plain VGGNet to take advantage of its layers' hierarchical structure. The proposed model was evaluated using an independent set of image patches from 373 whole slide images. The results indicate that the hierarchical model can achieve better results than the flat model for multi-category diagnosis of GI disorders using histopathological images.

Submitted 6 August, 2020; v1 submitted 8 May, 2020; originally announced May 2020.

Comments: accepted at IEEE International Conference on Healthcare Informatics (ICHI 2020)

arXiv:2004.11186 [pdf, other]

doi 10.1109/IROS45743.2020.9341151

BIT-VO: Visual Odometry at 300 FPS using Binary Features from the Focal Plane

Authors: Riku Murai, Sajad Saeedi, Paul H. J. Kelly

Abstract: Focal-plane Sensor-processor (FPSP) is a next-generation camera technology which enables every pixel on the sensor chip to perform computation in parallel, on the focal plane where the light intensity is captured. SCAMP-5 is a general-purpose FPSP used in this work and it carries out computations in the analog domain before analog to digital conversion. By extracting features from the image on the focal plane, data which is digitized and transferred is reduced. As a consequence, SCAMP-5 offers a high frame rate while maintaining low energy consumption. Here, we present BIT-VO, which is, to the best of our knowledge, the first 6 Degrees of Freedom visual odometry algorithm which utilises the FPSP. Our entire system operates at 300 FPS in a natural scene, using binary edges and corner features detected by the SCAMP-5.

Submitted 23 April, 2020; originally announced April 2020.

Comments: 8 pages, 16 figures

Journal ref: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 2020, pp. 8579-8586

arXiv:2003.03396 [pdf, other]

Scalable Uncertainty for Computer Vision with Functional Variational Inference

Authors: Eduardo D C Carvalho, Ronald Clark, Andrea Nicastro, Paul H J Kelly

Abstract: As Deep Learning continues to yield successful applications in Computer Vision, the ability to quantify all forms of uncertainty is a paramount requirement for its safe and reliable deployment in the real-world. In this work, we leverage the formulation of variational inference in function space, where we associate Gaussian Processes (GPs) to both Bayesian CNN priors and variational family. Since GPs are fully determined by their mean and covariance functions, we are able to obtain predictive uncertainty estimates at the cost of a single forward pass through any chosen CNN architecture and for any supervised learning task. By leveraging the structure of the induced covariance matrices, we propose numerically efficient algorithms which enable fast training in the context of high-dimensional tasks such as depth estimation and semantic segmentation. Additionally, we provide sufficient conditions for constructing regression loss functions whose probabilistic counterparts are compatible with aleatoric uncertainty quantification.

Submitted 6 March, 2020; originally announced March 2020.

Comments: CVPR 2020

arXiv:1909.01963 [pdf, other]

Self-Attentive Adversarial Stain Normalization

Authors: Aman Shrivastava, Will Adorno, Yash Sharma, Lubaina Ehsan, S. Asad Ali, Sean R. Moore, Beatrice C. Amadi, Paul Kelly, Sana Syed, Donald E. Brown

Abstract: Hematoxylin and Eosin (H&E) stained Whole Slide Images (WSIs) are utilized for biopsy visualization-based diagnostic and prognostic assessment of diseases. Variation in the H&E staining process across different lab sites can lead to significant variations in biopsy image appearance. These variations introduce an undesirable bias when the slides are examined by pathologists or used for training deep learning models. To reduce this bias, slides need to be translated to a common domain of stain appearance before analysis. We propose a Self-Attentive Adversarial Stain Normalization (SAASN) approach for the normalization of multiple stain appearances to a common domain. This unsupervised generative adversarial approach includes self-attention mechanism for synthesizing images with finer detail while preserving the structural consistency of the biopsy features during translation. SAASN demonstrates consistent and superior performance compared to other popular stain normalization techniques on H&E stained duodenal biopsy image data.

Submitted 22 November, 2020; v1 submitted 4 September, 2019; originally announced September 2019.

Comments: Accepted at AIDP (ICPR 2021)

arXiv:1908.03272 [pdf, other]

Deep Learning for Visual Recognition of Environmental Enteropathy and Celiac Disease

Authors: Aman Shrivastava, Karan Kant, Saurav Sengupta, Sung-Jun Kang, Marium Khan, Asad Ali, Sean R. Moore, Beatrice C. Amadi, Paul Kelly, Donald E. Brown, Sana Syed

Abstract: Physicians use biopsies to distinguish between different but histologically similar enteropathies. The range of syndromes and pathologies that could cause different gastrointestinal conditions makes this a difficult problem. Recently, deep learning has been used successfully in helping diagnose cancerous tissues in histopathological images. These successes motivated the research presented in this paper, which describes a deep learning approach that distinguishes between Celiac Disease (CD) and Environmental Enteropathy (EE) and normal tissue from digitized duodenal biopsies. Experimental results show accuracies of over 90% for this approach. We also look into interpreting the neural network model using Gradient-weighted Class Activation Mappings and filter activations on input images to understand the visual explanations for the decisions made by the model.

Submitted 8 August, 2019; originally announced August 2019.

arXiv:1906.00877 [pdf, other]

Pangloss: a novel Markov chain prefetcher

Authors: Philippos Papaphilippou, Paul H. J. Kelly, Wayne Luk

Abstract: We present Pangloss, an efficient high-performance data prefetcher that approximates a Markov chain on delta transitions. With a limited information scope and space/logic complexity, it is able to reconstruct a variety of both simple and complex access patterns. This is achieved by a highly-efficient representation of the Markov chain to provide accurate values for transition probabilities. In addition, we have added a mechanism to reconstruct delta transitions originally obfuscated by the out-of-order execution or page transitions, such as when streaming data from multiple sources. Our single-level (L2) prefetcher achieves a geometric speedup of 1.7% and 3.2% over selected state-of-the-art baselines (KPCP and BOP). When combined with an equivalent for the L1 cache (L1 & L2), the speedups rise to 6.8% and 8.4%, and 40.4% over non-prefetch. In the multi-core evaluation, there seems to be a considerable performance improvement as well.

Submitted 3 June, 2019; originally announced June 2019.

Comments: Accepted in The Third Data Prefetching Championship (DPC3), held in conjunction with ISCA 2019

arXiv:1905.02427 [pdf]

doi 10.1016/j.jss.2019.05.013

Model Based System Assurance Using the Structured Assurance Case Metamodel

Authors: Ran Wei, Tim P. Kelly, Xiaotian Dai, Shuai Zhao, Richard Hawkins

Abstract: Assurance cases are used to demonstrate confidence in system properties of interest (e.g. safety and/or security). A number of system assurance approaches are adopted by industries in the safety-critical domain. However, the task of constructing assurance cases remains a manual, trivial and informal process. The Structured Assurance Case Metamodel (SACM) is a standard specified by the Object Management Group (OMG). SACM provides a richer set of features than existing system assurance languages/approaches. SACM provides a foundation for model-based system assurance, which has great potentials in growing technology domains such as Open Adaptive Systems. However, the intended usage of SACM has not been sufficiently explained. In addition, there has been no support to interoperate between existing assurance case (models) and SACM models. In this article, we explain the intended usage of SACM based on our involvement in the OMG specification process of SACM. In addition, to promote a model-based approach, we provide SACM compliant metamodels for existing system assurance approaches (the Goal Structuring Notation and Claims-Arguments-Evidence), and the transformations from these models to SACM. We also briefly discuss the tool support for model-based system assurance which helps practitioners to make the transition from existing system assurance approaches to model-based system assurance using SACM.

Submitted 7 May, 2019; originally announced May 2019.

Comments: 45 pages, 41 figures, Accepted by Journal of Systems and Software

arXiv:1904.05773 [pdf, other]

Diagnosis of Celiac Disease and Environmental Enteropathy on Biopsy Images Using Color Balancing on Convolutional Neural Networks

Authors: Kamran Kowsari, Rasoul Sali, Marium N. Khan, William Adorno, S. Asad Ali, Sean R. Moore, Beatrice C. Amadi, Paul Kelly, Sana Syed, Donald E. Brown

Abstract: Celiac Disease (CD) and Environmental Enteropathy (EE) are common causes of malnutrition and adversely impact normal childhood development. CD is an autoimmune disorder that is prevalent worldwide and is caused by an increased sensitivity to gluten. Gluten exposure destructs the small intestinal epithelial barrier, resulting in nutrient mal-absorption and childhood under-nutrition. EE also results in barrier dysfunction but is thought to be caused by an increased vulnerability to infections. EE has been implicated as the predominant cause of under-nutrition, oral vaccine failure, and impaired cognitive development in low-and-middle-income countries. Both conditions require a tissue biopsy for diagnosis, and a major challenge of interpreting clinical biopsy images to differentiate between these gastrointestinal diseases is striking histopathologic overlap between them. In the current study, we propose a convolutional neural network (CNN) to classify duodenal biopsy images from subjects with CD, EE, and healthy controls. We evaluated the performance of our proposed model using a large cohort containing 1000 biopsy images. Our evaluations show that the proposed model achieves an area under ROC of 0.99, 1.00, and 0.97 for CD, EE, and healthy controls, respectively. These results demonstrate the discriminative power of the proposed model in duodenal biopsies classification.

Submitted 9 October, 2019; v1 submitted 10 April, 2019; originally announced April 2019.

arXiv:1903.08243 [pdf, other]

doi 10.1177/1094342020945005

A study of vectorization for matrix-free finite element methods

Authors: Tianjiao Sun, Lawrence Mitchell, Kaushik Kulkarni, Andreas Klöckner, David A. Ham, Paul H. J. Kelly

Abstract: Vectorization is increasingly important to achieve high performance on modern hardware with SIMD instructions. Assembly of matrices and vectors in the finite element method, which is characterized by iterating a local assembly kernel over unstructured meshes, poses difficulties to effective vectorization. Maintaining a user-friendly high-level interface with a suitable degree of abstraction while generating efficient, vectorized code for the finite element method is a challenge for numerical software systems and libraries. In this work, we study cross-element vectorization in the finite element framework Firedrake via code transformation and demonstrate the efficacy of such an approach by evaluating a wide range of matrix-free operators spanning different polynomial degrees and discretizations on two recent CPUs using three mainstream compilers. Our experiments show that our approaches for cross-element vectorization achieve 30% of theoretical peak performance for many examples of practical significance, and exceed 50% for cases with high arithmetic intensities, with consistent speed-up over (intra-element) vectorization restricted to the local assembly kernels.

Submitted 19 May, 2020; v1 submitted 19 March, 2019; originally announced March 2019.

Journal ref: International Journal of High Performance Computing Applications (2020)

arXiv:1811.11874 [pdf, other]

RetinaMatch: Efficient Template Matching of Retina Images for Teleophthalmology

Authors: Chen Gong, N. Benjamin Erichson, John P. Kelly, Laura Trutoiu, Brian T. Schowengerdt, Steven L. Brunton, Eric J. Seibel

Abstract: Retinal template matching and registration is an important challenge in teleophthalmology with low-cost imaging devices. However, the images from such devices generally have a small field of view (FOV) and image quality degradations, making matching difficult. In this work, we develop an efficient and accurate retinal matching technique that combines dimension reduction and mutual information (MI), called RetinaMatch. The dimension reduction initializes the MI optimization as a coarse localization process, which narrows the optimization domain and avoids local optima. The effectiveness of RetinaMatch is demonstrated on the open fundus image database STARE with simulated reduced FOV and anticipated degradations, and on retinal images acquired by adapter-based optics attached to a smartphone. RetinaMatch achieves a success rate over 94% on human retinal images with the matched target registration errors below 2 pixels on average, excluding the observer variability. It outperforms the standard template matching solutions. In the application of measuring vessel diameter repeatedly, single pixel errors are expected. In addition, our method can be used in the process of image mosaicking with area-based registration, providing a robust approach when the feature based methods fail. To the best of our knowledge, this is the first template matching algorithm for retina images with small template images from unconstrained retinal areas. In the context of the emerging mixed reality market, we envision automated retinal image matching and registration methods as transformative for advanced teleophthalmology and long-term retinal monitoring.

Submitted 28 November, 2018; originally announced November 2018.

arXiv:1808.06820 [pdf, other]

SLAMBench2: Multi-Objective Head-to-Head Benchmarking for Visual SLAM

Authors: Bruno Bodin, Harry Wagstaff, Sajad Saeedi, Luigi Nardi, Emanuele Vespa, John H Mayer, Andy Nisbet, Mikel Luján, Steve Furber, Andrew J Davison, Paul H. J. Kelly, Michael O'Boyle

Abstract: SLAM is becoming a key component of robotics and augmented reality (AR) systems. While a large number of SLAM algorithms have been presented, there has been little effort to unify the interface of such algorithms, or to perform a holistic comparison of their capabilities. This is a problem since different SLAM applications can have different functional and non-functional requirements. For example, a mobile phone-based AR application has a tight energy budget, while a UAV navigation system usually requires high accuracy. SLAMBench2 is a benchmarking framework to evaluate existing and future SLAM systems, both open and closed source, over an extensible list of datasets, while using a comparable and clearly specified list of performance metrics. A wide variety of existing SLAM algorithms and datasets is supported, e.g. ElasticFusion, InfiniTAM, ORB-SLAM2, OKVIS, and integrating new ones is straightforward and clearly specified by the framework. SLAMBench2 is a publicly-available software framework which represents a starting point for quantitative, comparable and validatable experimental research to investigate trade-offs across SLAM systems.

Submitted 21 August, 2018; originally announced August 2018.

Journal ref: 2018 IEEE International Conference on Robotics and Automation (ICRA'18)

arXiv:1808.06352 [pdf, other]

doi 10.1109/JPROC.2018.2856739

Navigating the Landscape for Real-time Localisation and Mapping for Robotics and Virtual and Augmented Reality

Authors: Sajad Saeedi, Bruno Bodin, Harry Wagstaff, Andy Nisbet, Luigi Nardi, John Mawer, Nicolas Melot, Oscar Palomar, Emanuele Vespa, Tom Spink, Cosmin Gorgovan, Andrew Webb, James Clarkson, Erik Tomusk, Thomas Debrunner, Kuba Kaszyk, Pablo Gonzalez-de-Aledo, Andrey Rodchenko, Graham Riley, Christos Kotselidis, Björn Franke, Michael F. P. O'Boyle, Andrew J. Davison, Paul H. J. Kelly, Mikel Luján, et al. (1 additional author not shown)

Abstract: Visual understanding of 3D environments in real-time, at low power, is a huge computational challenge. Often referred to as SLAM (Simultaneous Localisation and Mapping), it is central to applications spanning domestic and industrial robotics, autonomous vehicles, virtual and augmented reality. This paper describes the results of a major research effort to assemble the algorithms, architectures, tools, and systems software needed to enable delivery of SLAM, by supporting applications specialists in selecting and configuring the appropriate algorithm and the appropriate hardware, and compilation pathway, to meet their performance, accuracy, and energy consumption goals. The major contributions we present are (1) tools and methodology for systematic quantitative evaluation of SLAM algorithms, (2) automated, machine-learning-guided exploration of the algorithmic and implementation design space with respect to multiple objectives, (3) end-to-end simulation tools to enable optimisation of heterogeneous, accelerated architectures for the specific algorithmic requirements of the various SLAM algorithmic approaches, and (4) tools for delivering, where appropriate, accelerated, adaptive SLAM solutions in a managed, JIT-compiled, adaptive runtime context.

Submitted 20 August, 2018; originally announced August 2018.

Comments: Proceedings of the IEEE 2018

arXiv:1807.03032 [pdf, other]

Architecture and performance of Devito, a system for automated stencil computation

Authors: Fabio Luporini, Michael Lange, Mathias Louboutin, Navjot Kukreja, Jan Hückelheim, Charles Yount, Philipp Witte, Paul H. J. Kelly, Felix J. Herrmann, Gerard J. Gorman

Abstract: Stencil computations are a key part of many high-performance computing applications, such as image processing, convolutional neural networks, and finite-difference solvers for partial differential equations. Devito is a framework capable of generating highly-optimized code given symbolic equations expressed in Python, specialized in, but not limited to, affine (stencil) codes. The lowering process, from mathematical equations down to C++ code, is performed by the Devito compiler through a series of intermediate representations. Several performance optimizations are introduced, including advanced common sub-expressions elimination, tiling and parallelization. Some of these are obtained through well-established stencil optimizers, integrated in the back-end of the Devito compiler. The architecture of the Devito compiler, as well as the performance optimizations that are applied when generating code, are presented. The effectiveness of such performance optimizations is demonstrated using operators drawn from seismic imaging applications.

Submitted 7 February, 2020; v1 submitted 9 July, 2018; originally announced July 2018.

Comments: Submitted to ACM Transactions on Mathematical Software

MSC Class: 65N06; 68N20

arXiv:1708.03183 [pdf, other]

Automated Tiling of Unstructured Mesh Computations with Application to Seismological Modelling

Authors: Fabio Luporini, Michael Lange, Christian T. Jacobs, Gerard J. Gorman, J. Ramanujam, Paul H. J. Kelly

Abstract: Sparse tiling is a technique to fuse loops that access common data, thus increasing data locality. Unlike traditional loop fusion or blocking, the loops may have different iteration spaces and access shared datasets through indirect memory accesses, such as A[map[i]] -- hence the name "sparse". One notable example of such loops arises in discontinuous-Galerkin finite element methods, because of the computation of numerical integrals over different domains (e.g., cells, facets). The major challenge with sparse tiling is implementation -- not only is it cumbersome to understand and synthesize, but it is also onerous to maintain and generalize, as it requires a complete rewrite of the bulk of the numerical computation. In this article, we propose an approach to extend the applicability of sparse tiling based on raising the level of abstraction. Through a sequence of compiler passes, the mathematical specification of a problem is progressively lowered, and eventually sparse-tiled C for-loops are generated. Besides automation, we advance the state-of-the-art by introducing: a revisited, more efficient sparse tiling algorithm; support for distributed-memory parallelism; a range of fine-grained optimizations for increased run-time performance; implementation in a publicly-available library, SLOPE; and an in-depth study of the performance impact in Seigen, a real-world elastic wave equation solver for seismological problems, which shows speed-ups up to 1.28x on a platform consisting of 896 Intel Broadwell cores.

Submitted 19 June, 2019; v1 submitted 10 August, 2017; originally announced August 2017.

Comments: 29 pages including supplementary materials and references

ACM Class: D.1.2; G.4

arXiv:1705.09866 [pdf, other]

doi 10.1007/s10596-018-9720-1

Machine learning for graph-based representations of three-dimensional discrete fracture networks

Authors: Manuel Valera, Zhengyang Guo, Priscilla Kelly, Sean Matz, Vito Adrian Cantu, Allon G. Percus, Jeffrey D. Hyman, Gowri Srinivasan, Hari S. Viswanathan

Abstract: Structural and topological information play a key role in modeling flow and transport through fractured rock in the subsurface. Discrete fracture network (DFN) computational suites such as dfnWorks are designed to simulate flow and transport in such porous media. Flow and transport calculations reveal that a small backbone of fractures exists, where most flow and transport occurs. Restricting the flowing fracture network to this backbone provides a significant reduction in the network's effective size. However, the particle tracking simulations needed to determine the reduction are computationally intensive. Such methods may be impractical for large systems or for robust uncertainty quantification of fracture networks, where thousands of forward simulations are needed to bound system behavior. In this paper, we develop an alternative network reduction approach to characterizing transport in DFNs, by combining graph theoretical and machine learning methods. We consider a graph representation where nodes signify fractures and edges denote their intersections. Using random forest and support vector machines, we rapidly identify a subnetwork that captures the flow patterns of the full DFN, based primarily on node centrality features in the graph. Our supervised learning techniques train on particle-tracking backbone paths found by dfnWorks, but run in negligible time compared to those simulations. We find that our predictions can reduce the network to approximately 20% of its original size, while still generating breakthrough curves consistent with those of the original network.

Submitted 29 January, 2018; v1 submitted 27 May, 2017; originally announced May 2017.

Comments: Computational Geosciences (2018)

Report number: LA-UR-17-24300

Journal ref: Computational Geosciences 22, 695-710 (2018)

arXiv:1702.00505 [pdf, other]

Algorithmic Performance-Accuracy Trade-off in 3D Vision Applications Using HyperMapper

Authors: Luigi Nardi, Bruno Bodin, Sajad Saeedi, Emanuele Vespa, Andrew J. Davison, Paul H. J. Kelly

Abstract: In this paper we investigate an emerging application, 3D scene understanding, likely to be significant in the mobile space in the near future. The goal of this exploration is to reduce execution time while meeting our quality of result objectives. In previous work we showed for the first time that it is possible to map this application to power constrained embedded systems, highlighting that decision choices made at the algorithmic design-level have the most impact. As the algorithmic design space is too large to be exhaustively evaluated, we use a previously introduced multi-objective Random Forest Active Learning prediction framework dubbed HyperMapper, to find good algorithmic designs. We show that HyperMapper generalizes on a recent cutting edge 3D scene understanding algorithm and on a modern GPU-based computer architecture. HyperMapper is able to automatically beat an expert human hand-tuning the algorithmic parameters of the class of Computer Vision applications considered in this paper. In addition, we use crowd-sourcing using a 3D scene understanding Android app to show that the Pareto front obtained on an embedded system can be used to accelerate the same application on all the 83 smart-phones and tablets crowd-sourced with speedups ranging from 2 to over 12.

Submitted 21 March, 2017; v1 submitted 1 February, 2017; originally announced February 2017.

Comments: 10 pages, Keywords: design space exploration, machine learning, computer vision, SLAM, embedded systems, GPU, crowd-sourcing

Journal ref: 31st IEEE International Parallel and Distributed Processing Symposium May 29 - June 2, 2017 Orlando, Florida USA

arXiv:1604.05937 [pdf, other]

doi 10.5194/gmd-9-3803-2016

A structure-exploiting numbering algorithm for finite elements on extruded meshes, and its performance evaluation in Firedrake

Authors: Gheorghe-Teodor Bercea, Andrew T. T. McRae, David A. Ham, Lawrence Mitchell, Florian Rathgeber, Luigi Nardi, Fabio Luporini, Paul H. J. Kelly

Abstract: We present a generic algorithm for numbering and then efficiently iterating over the data values attached to an extruded mesh. An extruded mesh is formed by replicating an existing mesh, assumed to be unstructured, to form layers of prismatic cells. Applications of extruded meshes include, but are not limited to, the representation of 3D high aspect ratio domains employed by geophysical finite element simulations. These meshes are structured in the extruded direction. The algorithm presented here exploits this structure to avoid the performance penalty traditionally associated with unstructured meshes. We evaluate the implementation of this algorithm in the Firedrake finite element system on a range of low compute intensity operations which constitute worst cases for data layout performance exploration. The experiments show that having structure along the extruded direction enables the cost of the indirect data accesses to be amortized after 10-20 layers as long as the underlying mesh is well-ordered. We characterise the resulting spatial and temporal reuse in a representative set of both continuous-Galerkin and discontinuous-Galerkin discretisations. On meshes with realistic numbers of layers the performance achieved is between 70% and 90% of a theoretical hardware-specific limit.

Submitted 28 October, 2016; v1 submitted 20 April, 2016; originally announced April 2016.

Comments: Bibliography fixes, 23 pages

Journal ref: Geoscientific Model Development 9:3803-3815 (2016)

arXiv:1604.05872 [pdf, other]

doi 10.1145/3054944

An algorithm for the optimization of finite element integration loops

Authors: Fabio Luporini, David A. Ham, Paul H. J. Kelly

Abstract: We present an algorithm for the optimization of a class of finite element integration loop nests. This algorithm, which exploits fundamental mathematical properties of finite element operators, is proven to achieve a locally optimal operation count. In specified circumstances the optimum achieved is global. Extensive numerical experiments demonstrate significant performance improvements over the state of the art in finite element code generation in almost all cases. This validates the effectiveness of the algorithm presented here, and illustrates its limitations.

Submitted 20 April, 2016; originally announced April 2016.

ACM Class: G.1.8; G.4

arXiv:1512.06282 [pdf, ps, other]

Contributions to the compositional semantics of first-order predicate logic

Authors: Philip Kelly, M. H. van Emden

Abstract: Henkin, Monk and Tarski gave a compositional semantics for first-order predicate logic. We extend this work by including function symbols in the language and by giving the denotation of the atomic formula as a composition of the denotations of its predicate symbol and of its tuple of arguments. In addition we give the denotation of a term as a composition of the denotations of its function symbol and of its tuple of arguments.

Submitted 19 December, 2015; originally announced December 2015.

Comments: 14 pages, 1 figure

Report number: DCS-356-IR

arXiv:1509.04648 [pdf, other]

doi 10.1109/ICRA.2016.7487261

Comparative Design Space Exploration of Dense and Semi-Dense SLAM

Authors: M. Zeeshan Zia, Luigi Nardi, Andrew Jack, Emanuele Vespa, Bruno Bodin, Paul H. J. Kelly, Andrew J. Davison

Abstract: SLAM has matured significantly over the past few years, and is beginning to appear in serious commercial products. While new SLAM systems are being proposed at every conference, evaluation is often restricted to qualitative visualizations or accuracy estimation against a ground truth. This is due to the lack of benchmarking methodologies which can holistically and quantitatively evaluate these systems. Further investigation at the level of individual kernels and parameter spaces of SLAM pipelines is non-existent, which is absolutely essential for systems research and integration. We extend the recently introduced SLAMBench framework to allow comparing two state-of-the-art SLAM pipelines, namely KinectFusion and LSD-SLAM, along the metrics of accuracy, energy consumption, and processing frame rate on two different hardware platforms, namely a desktop and an embedded device. We also analyze the pipelines at the level of individual kernels and explore their algorithmic and hardware design spaces for the first time, yielding valuable insights.

Submitted 3 March, 2016; v1 submitted 15 September, 2015; originally announced September 2015.

Comments: IEEE International Conference on Robotics and Automation 2016

arXiv:1505.04694 [pdf, other]

Thread Parallelism for Highly Irregular Computation in Anisotropic Mesh Adaptation

Authors: Georgios Rokos, Gerard J. Gorman, Kristian Ejlebjerg Jensen, Paul H. J. Kelly

Abstract: Thread-level parallelism in irregular applications with mutable data dependencies presents challenges because the underlying data is extensively modified during execution of the algorithm and a high degree of parallelism must be realized while keeping the code race-free. In this article we describe a methodology for exploiting thread parallelism for a class of graph-mutating worklist algorithms, which guarantees safe parallel execution via processing in rounds of independent sets and using a deferred update strategy to commit changes in the underlying data structures. Scalability is assisted by atomic fetch-and-add operations to create worklists and work-stealing to balance the shared-memory workload. This work is motivated by mesh adaptation algorithms, for which we show a parallel efficiency of 60% and 50% on Intel(R) Xeon(R) Sandy Bridge and AMD Opteron(tm) Magny-Cours systems, respectively, using these techniques.

Submitted 18 May, 2015; originally announced May 2015.

Comments: To appear in the proceedings of EASC 2015

arXiv:1505.04134 [pdf, ps, other]

An Interrupt-Driven Work-Sharing For-Loop Scheduler

Authors: Georgios Rokos, Gerard J. Gorman, Paul H. J. Kelly

Abstract: In this paper we present a parallel for-loop scheduler which is based on work-stealing principles but runs under a completely cooperative scheme. POSIX signals are used by idle threads to interrupt left-behind workers, which in turn decide what portion of their workload can be given to the requester. We call this scheme Interrupt-Driven Work-Sharing (IDWS). This article describes how IDWS works, how it can be integrated into any POSIX-compliant OpenMP implementation and how a user can manually replace OpenMP parallel for-loops with IDWS in existing POSIX-compliant C++ applications. Additionally, we measure its performance using both a synthetic benchmark with varying distributions of workload across the iteration space and a real-life application on Sandy Bridge and Xeon Phi systems. Regardless of the workload distribution and the underlying hardware, IDWS is always the best or among the best-performing strategies, providing a good all-around solution to the scheduling-choice dilemma.

Submitted 18 May, 2015; v1 submitted 15 May, 2015; originally announced May 2015.

arXiv:1505.04086 [pdf, ps, other]

A Fast and Scalable Graph Coloring Algorithm for Multi-core and Many-core Architectures

Authors: Georgios Rokos, Gerard Gorman, Paul H J Kelly

Abstract: Irregular computations on unstructured data are an important class of problems for parallel programming. Graph coloring is often an important preprocessing step, e.g. as a way to perform dependency analysis for safe parallel execution. The total run time of a coloring algorithm adds to the overall parallel overhead of the application whereas the number of colors used determines the amount of exposed parallelism. A fast and scalable coloring algorithm using as few colors as possible is vital for the overall parallel performance and scalability of many irregular applications that depend upon runtime dependency analysis. Catalyurek et al. have proposed a graph coloring algorithm which relies on speculative, local assignment of colors. In this paper we present an improved version which runs even more optimistically, with less thread synchronization and a reduced number of conflicts compared to Catalyurek et al.'s algorithm. We show that the new technique scales better on multi-core and many-core systems and performs up to 1.5x faster than its predecessor on graphs with high-degree vertices, while keeping the number of colors at the same near-optimal levels.

Submitted 18 May, 2015; v1 submitted 15 May, 2015; originally announced May 2015.

Comments: To appear in the proceedings of Euro Par 2015

arXiv:1501.01809 [pdf, other]

doi 10.1145/2998441

Firedrake: automating the finite element method by composing abstractions

Authors: Florian Rathgeber, David A. Ham, Lawrence Mitchell, Michael Lange, Fabio Luporini, Andrew T. T. McRae, Gheorghe-Teodor Bercea, Graham R. Markall, Paul H. J. Kelly

Abstract: Firedrake is a new tool for automating the numerical solution of partial differential equations. Firedrake adopts the domain-specific language for the finite element method of the FEniCS project, but with a pure Python runtime-only implementation centred on the composition of several existing and new abstractions for particular aspects of scientific computing. The result is a more complete separation of concerns which eases the incorporation of separate contributions from computer scientists, numerical analysts and application specialists. These contributions may add functionality, or improve performance. Firedrake benefits from automatically applying new optimisations. This includes factorising mixed function spaces, transforming and vectorising inner loops, and intrinsically supporting block matrix operations. Importantly, Firedrake presents a simple public API for escaping the UFL abstraction. This allows users to implement common operations that fall outside pure variational formulations, such as flux-limiters.

Submitted 1 July, 2016; v1 submitted 8 January, 2015; originally announced January 2015.

Comments: Minor revisions to v2

ACM Class: G.1.8; G.4

Journal ref: ACM Transactions on Mathematical Software 43(3):24:1--24:27 (2016)

Showing 1–50 of 57 results for author: Kelly, P