-
Active learning of neural population dynamics using two-photon holographic optogenetics
Authors:
Andrew Wagenmaker,
Lu Mi,
Marton Rozsa,
Matthew S. Bull,
Karel Svoboda,
Kayvon Daie,
Matthew D. Golub,
Kevin Jamieson
Abstract:
Recent advances in techniques for monitoring and perturbing neural populations have greatly enhanced our ability to study circuits in the brain. In particular, two-photon holographic optogenetics now enables precise photostimulation of experimenter-specified groups of individual neurons, while simultaneous two-photon calcium imaging enables the measurement of ongoing and induced activity across the neural population. Despite the enormous space of potential photostimulation patterns and the time-consuming nature of photostimulation experiments, very little algorithmic work has been done to determine the most effective photostimulation patterns for identifying the neural population dynamics. Here, we develop methods to efficiently select which neurons to stimulate such that the resulting neural responses will best inform a dynamical model of the neural population activity. Using neural population responses to photostimulation in mouse motor cortex, we demonstrate the efficacy of a low-rank linear dynamical systems model, and develop an active learning procedure which takes advantage of low-rank structure to determine informative photostimulation patterns. We demonstrate our approach on both real and synthetic data, obtaining in some cases as much as a two-fold reduction in the amount of data required to reach a given predictive power. Our active stimulation design method is based on a novel active learning procedure for low-rank regression, which may be of independent interest.
Submitted 8 May, 2025; v1 submitted 3 December, 2024;
originally announced December 2024.
-
Performance and scaling of the LFRic weather and climate model on different generations of HPE Cray EX supercomputers
Authors:
J. Mark Bull,
Andrew Coughtrie,
Deva Deeptimahanti,
Mark Hedley,
Caoimhín Laoide-Kemp,
Christopher Maynard,
Harry Shepherd,
Sebastiaan van de Bund,
Michèle Weiland,
Benjamin Went
Abstract:
This study presents scaling results and a performance analysis across different supercomputers and compilers for the Met Office weather and climate model, LFRic. The model is shown to scale to large numbers of nodes, meeting its design criterion of exploiting parallelism to achieve good scaling. The model is written in a Domain-Specific Language embedded in modern Fortran, and uses a Domain-Specific Compiler, PSyclone, to generate the parallel code. The performance analysis shows the effect of choice of algorithm, such as redundant computation and scaling with OpenMP threads. The analysis can be used to motivate a discussion of future work to improve the OpenMP performance of other parts of the code. Finally, an analysis of the performance tuning of the I/O server, XIOS, is presented.
Submitted 24 September, 2024;
originally announced September 2024.
-
Quantum Task Offloading with the OpenMP API
Authors:
Joseph K. L. Lee,
Oliver T. Brown,
Mark Bull,
Martin Ruefenacht,
Johannes Doerfert,
Michael Klemm,
Martin Schulz
Abstract:
Most of the widely used quantum programming languages and libraries are not designed for the tightly coupled nature of hybrid quantum-classical algorithms, which run on quantum resources that are integrated on-premise with classical HPC infrastructure. We propose a programming model using the API provided by OpenMP to target quantum devices, which provides an easy-to-use and efficient interface for HPC applications to utilize quantum compute resources. We have implemented a variational quantum eigensolver using the programming model, which has been tested using a classical simulator. We are in the process of testing on the quantum resources hosted at the Leibniz Supercomputing Centre (LRZ).
Submitted 6 November, 2023;
originally announced November 2023.
-
Extended abstract: Type oriented programming for task based parallelism
Authors:
Nick Brown,
Ludovic Capelli,
J. Mark Bull
Abstract:
Writing parallel codes is difficult and exhibits a fundamental trade-off between abstraction and performance. The high-level language abstractions designed to simplify the complexities of parallelism make certain assumptions that impact performance and scalability. On the other hand, lower-level languages, which provide many opportunities for optimisation, require in-depth knowledge and force the programmer to consider tricky details of parallelism. An approach is required which can bridge the gap and provide both ease of programming and opportunities for control and optimisation. By optionally decorating their codes with additional type information, programmers can either direct the compiler to make certain decisions or rely on sensible default choices.
Submitted 27 October, 2020;
originally announced October 2020.
-
Driving asynchronous distributed tasks with events
Authors:
Nick Brown,
Oliver Thomson Brown,
J. Mark Bull
Abstract:
Open-source matters, not just to the current cohort of HPC users but also to potential new HPC communities, such as machine learning, themselves often rooted in open-source. Many of these potential new workloads are, by their very nature, far more asynchronous and unpredictable than traditional HPC codes and open-source solutions must be found to enable new communities of developers to easily take advantage of large scale parallel machines. Task-based models have the potential to help here, but many of these either entirely abstract the user from the distributed nature of their code, placing emphasis on the runtime to make important decisions concerning scheduling and locality, or require the programmer to explicitly combine their task-based code with a distributed memory technology such as MPI, which adds considerable complexity. In this paper we describe a new approach where the programmer still splits their code up into distinct tasks, but is explicitly aware of the distributed nature of the machine and drives interactions between tasks via events. This provides the best of both worlds; the programmer is able to direct important aspects of parallelism whilst still being abstracted from the low level mechanism of how this parallelism is achieved. We demonstrate our approach via two use-cases, the Graph500 BFS benchmark and in-situ data analytics of MONC, an atmospheric model. For both applications we demonstrate considerably improved performance at large core counts and the result of this work is an approach and open-source library which is readily applicable to a wide range of codes.
Submitted 26 October, 2020;
originally announced October 2020.
-
iPregel: Vertex-centric programmability vs memory efficiency and performance, why choose?
Authors:
Ludovic A. R. Capelli,
Zhenjiang Hu,
Timothy A. K. Zakian,
Nick Brown,
J. Mark Bull
Abstract:
The vertex-centric programming model, designed to improve the programmability in graph processing application writing, has attracted great attention over the years. However, shared memory frameworks that implement the vertex-centric interface all expose a common tradeoff: programmability against memory efficiency and performance.
Our approach, iPregel, preserves vertex-centric programmability while implementing optimisations for performance, designed so that they are transparent to a user's application code and hence do not impact programmability. In this paper, we evaluate iPregel against FemtoGraph, whose characteristics are identical, the asynchronous counterpart GraphChi, and the vertex-subset-centric framework Ligra. Our experiments include three of the most popular vertex-centric benchmark applications over four real-world, publicly accessible graphs, which cover orders of magnitude from a million to a billion edges, measuring execution time and peak memory usage. Finally, we evaluate the programmability of each framework by comparing against Google's original Pregel framework.
Experiments demonstrate that iPregel, like FemtoGraph, does not sacrifice vertex-centric programmability for additional performance and memory efficiency optimisations, which contrasts with GraphChi and Ligra. Sacrificing vertex-centric programmability allowed the latter two to benefit from substantial performance and memory efficiency gains. We demonstrate that iPregel is up to 2300 times faster than FemtoGraph and generates a memory footprint up to 100 times smaller. Ligra and GraphChi are up to 17000 and 700 times faster than FemtoGraph but, when compared against iPregel, this maximum speed-up drops to 10. Furthermore, with PageRank, iPregel is the fastest overall. For memory, iPregel matches the efficiency of Ligra and is on average 3 to 6 times lighter than GraphChi.
Submitted 17 October, 2020;
originally announced October 2020.
-
iPregel: Strategies to Deal with an Extreme Form of Irregularity in Vertex-Centric Graph Processing
Authors:
Ludovic Anthony Richard Capelli,
Nick Brown,
Jonathan Mark Bull
Abstract:
Over the last decade, the vertex-centric programming model has attracted significant attention in the world of graph processing, resulting in the emergence of a number of vertex-centric frameworks. Its simple programming interface, where computation is expressed from a vertex point of view, offers both ease of programming to the user and inherent parallelism for the underlying framework to leverage. However, vertex-centric programs represent an extreme form of irregularity, both inter- and intra-core. This is because they exhibit a variety of challenges, from a workload that may vary greatly across supersteps, through fine-grain synchronisations, to memory accesses that are unpredictable in both quantity and location. In this paper, we explore three optimisations which address these challenges: a hybrid combiner carefully coupling lock-free and lock-based combinations, the partial externalisation of vertex structures to improve locality, and the shift to an edge-centric representation of the workload. The optimisations were integrated into the iPregel vertex-centric framework, enabling the evaluation of each optimisation in the context of graph processing across three general purpose benchmarks common in the vertex-centric community, each run on four publicly available graphs covering all orders of magnitude from a million to a billion edges. The result of this work is a set of techniques which we believe not only provide a significant performance improvement in vertex-centric graph processing, but are also applicable more generally to irregular applications.
Submitted 4 October, 2020;
originally announced October 2020.
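For readers unfamiliar with the vertex-centric model this abstract refers to, the following is a minimal, sequential Python sketch of Pregel-style supersteps with a message combiner. It is not iPregel's API or implementation: the function name, the choice of connected-components labelling as the example, and the dictionary-based inbox are all assumptions made for illustration.

```python
def vertex_centric_components(num_vertices, edges):
    """Label each vertex with the smallest vertex id in its connected component.

    Illustrative sketch only: supersteps are run sequentially, and a
    min-combiner keeps a single message per destination vertex.
    """
    neighbours = [[] for _ in range(num_vertices)]
    for u, v in edges:
        neighbours[u].append(v)
        neighbours[v].append(u)

    label = list(range(num_vertices))

    # Superstep 0: every vertex sends its own label to its neighbours.
    inbox = {}
    for u in range(num_vertices):
        for w in neighbours[u]:
            inbox[w] = min(inbox.get(w, label[u]), label[u])  # min-combiner

    # Later supersteps: only vertices that received a message are active.
    while inbox:
        outbox = {}
        for v, msg in inbox.items():
            if msg < label[v]:
                label[v] = msg
                for w in neighbours[v]:
                    outbox[w] = min(outbox.get(w, msg), msg)  # min-combiner
        inbox = outbox  # messages become visible in the next superstep
    return label


labels = vertex_centric_components(6, [(0, 1), (1, 2), (3, 4)])
```

The number of active vertices changes from superstep to superstep, and the neighbour lookups touch memory in a data-dependent order, which is a small-scale picture of the irregularity the abstract describes.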
-
A highly scalable approach to solving linear systems using two-stage multisplitting
Authors:
Nick Brown,
J. Mark Bull,
Iain Bethune
Abstract:
Iterative methods for solving large sparse systems of linear equations are widely used in many HPC applications. Extreme scaling of these methods can be difficult, however, since global communication to form dot products is typically required at every iteration.
To try to overcome this limitation we propose a hybrid approach, where the matrix is partitioned into blocks. Within each block, we use a highly optimised (parallel) conventional solver, but we then couple the blocks together using block Jacobi or some other multisplitting technique that can be implemented in either a synchronous or an asynchronous fashion. This allows us to limit the block size to the point where the conventional iterative methods no longer scale, and to avoid global communication (and possibly synchronisation) across all processes.
Our block framework has been built to use PETSc, a popular scientific suite for solving sparse linear systems, as the synchronous intra-block solver, and we demonstrate results on up to 32768 cores of a Cray XE6 system. At this scale, the conventional solvers are still more efficient, though trends suggest that the hybrid approach may be beneficial at higher core counts.
Submitted 26 September, 2020;
originally announced September 2020.
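As a rough illustration of the two-stage idea in this abstract (an outer block Jacobi coupling wrapped around an inner iterative solver), here is a small, synchronous, pure-Python sketch. It is not the authors' framework and does not use PETSc: the point-Jacobi inner solver, the function names, and the 4x4 test matrix are stand-ins chosen for brevity.

```python
def inner_jacobi(A, rhs, x0, sweeps):
    """Approximately solve the small system A x = rhs with point-Jacobi sweeps."""
    n = len(rhs)
    x = list(x0)
    for _ in range(sweeps):
        x = [(rhs[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
             for i in range(n)]
    return x


def two_stage_block_jacobi(A, b, block_size, outer_iters=50, inner_sweeps=5):
    """Outer stage: block Jacobi. Inner stage: a few sweeps of an iterative solver."""
    n = len(b)
    x = [0.0] * n
    blocks = [list(range(s, min(s + block_size, n))) for s in range(0, n, block_size)]
    for _ in range(outer_iters):
        x_new = list(x)
        for idx in blocks:
            in_blk = set(idx)
            # Move off-block couplings to the right-hand side, using the
            # previous outer iterate; blocks do not see each other's updates.
            rhs = [b[i] - sum(A[i][j] * x[j] for j in range(n) if j not in in_blk)
                   for i in idx]
            A_blk = [[A[i][j] for j in idx] for i in idx]
            sol = inner_jacobi(A_blk, rhs, [x[i] for i in idx], inner_sweeps)
            for k, i in enumerate(idx):
                x_new[i] = sol[k]
        x = x_new
    return x


def residual_norm(A, b, x):
    """Max-norm of b - A x."""
    n = len(b)
    return max(abs(b[i] - sum(A[i][j] * x[j] for j in range(n))) for i in range(n))


# Small strictly diagonally dominant system (hypothetical data).
A = [[4.0, -1.0, 0.0, 0.0],
     [-1.0, 4.0, -1.0, 0.0],
     [0.0, -1.0, 4.0, -1.0],
     [0.0, 0.0, -1.0, 4.0]]
b = [1.0, 2.0, 3.0, 4.0]
x = two_stage_block_jacobi(A, b, block_size=2)
```

Each block only needs the other blocks' values from the previous outer iteration, so the block solves could run independently; that is the property that lets a scheme of this kind avoid global communication at every inner iteration.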
-
An efficient algorithm for the calculation of reserves for non-unit linked life policies
Authors:
Mark Tucker,
J. Mark Bull
Abstract:
The underlying stochastic nature of the requirements for the Solvency II regulations has introduced significant challenges if the required calculations are to be performed correctly, without resorting to excessive approximations, within practical timescales. It is generally acknowledged by practising actuaries within UK life offices that it is currently impossible to correctly fulfil the requirements imposed by Solvency II using existing computational techniques based on commercially available valuation packages. Our work has already shown that it is possible to perform profitability calculations at a far higher rate than is achievable using commercial packages. One of the key factors in achieving these gains is to calculate reserves using recurrence relations that scale linearly with the number of time steps. Here, we present a general vector recurrence relation which can be used for a wide range of non-unit linked policies that are covered by Solvency II; such contracts include annuities, term assurances, and endowments. Our results suggest that by using an optimised parallel implementation of this algorithm, on an affordable hardware platform, it is possible to perform the 'brute force' approach to demonstrating solvency in a realistic timescale (of the order of a few hours).
Submitted 27 June, 2014; v1 submitted 8 January, 2014;
originally announced January 2014.
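To make concrete what a reserve recurrence that scales linearly with the number of time steps looks like, the sketch below uses the standard textbook backward recursion for a term assurance, V_t = v (q_t S + (1 - q_t) V_{t+1}) - P with V_n = 0. This is the scalar textbook relation, not the general vector recurrence the paper presents; the function name, argument names, and input values are assumptions for illustration.

```python
def term_assurance_reserves(q, sum_assured, premium, interest):
    """Return [V_0, ..., V_n] for an n-year term assurance.

    q[t] is the assumed probability of death in year t given survival to t
    (hypothetical inputs). The premium is paid at the start of each year and
    the sum assured at the end of the year of death.
    """
    v = 1.0 / (1.0 + interest)          # one-year discount factor
    n = len(q)
    reserves = [0.0] * (n + 1)          # boundary condition: V_n = 0 at expiry
    for t in range(n - 1, -1, -1):      # single backward pass, O(n) work
        expected_outgo = q[t] * sum_assured + (1.0 - q[t]) * reserves[t + 1]
        reserves[t] = v * expected_outgo - premium
    return reserves


example = term_assurance_reserves([0.1, 0.2], 1000.0, 0.0, 0.0)
```

Because each V_t reuses V_{t+1}, all n reserves come out of one pass, whereas recomputing each reserve from scratch as a sum over future years would cost O(n^2) per policy.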