-
SMaLL: A Software Framework for portable Machine Learning Libraries
Authors:
Upasana Sridhar,
Nicholai Tukanov,
Elliott Binder,
Tze Meng Low,
Scott McMillan,
Martin D. Schatz
Abstract:
Interest in deploying Deep Neural Network (DNN) inference on edge devices has resulted in an explosion of the number and types of hardware platforms to use. While the high-level programming interface, such as TensorFlow, can be readily ported across different devices, high-performance inference implementations rely on a good mapping of the high-level interface to the target hardware platform. Comm…
▽ More
Interest in deploying Deep Neural Network (DNN) inference on edge devices has resulted in an explosion of the number and types of hardware platforms to use. While the high-level programming interface, such as TensorFlow, can be readily ported across different devices, high-performance inference implementations rely on a good mapping of the high-level interface to the target hardware platform. Commonly, this mapping may use optimizing compilers to generate code at compile time or high-performance vendor libraries that have been specialized to the target platform. Both approaches rely on expert knowledge to produce the mapping, which may be time-consuming and difficult to extend to new architectures.
In this work, we present a DNN library framework, SMaLL, that is easily extensible to new architectures. The framework uses a unified loop structure and shared, cache-friendly data format across all intermediate layers, eliminating the time and memory overheads incurred by data transformation between layers. Layers are implemented by simply specifying the layer's dimensions and a kernel -- the key computing operations of each layer. The unified loop structure and kernel abstraction allows us to reuse code across layers and computing platforms. New architectures only require the 100s of lines in the kernel to be redesigned. To show the benefits of our approach, we have developed software that supports a range of layer types and computing platforms, which is easily extensible for rapidly instantiating high performance DNN libraries.
We evaluate our software by instantiating networks from the TinyMLPerf benchmark suite on 5 ARM platforms and 1 x86 platform ( an AMD Zen 2). Our framework shows end-to-end performance that is comparable to or better than ML Frameworks such as TensorFlow, TVM and LibTorch.
△ Less
Submitted 8 March, 2023;
originally announced March 2023.
-
Delayed Asynchronous Iterative Graph Algorithms
Authors:
Mark P. Blanco,
Scott McMillan,
Tze Meng Low
Abstract:
Iterative graph algorithms often compute intermediate values and update them as computation progresses. Updated output values are used as inputs for computations in current or subsequent iterations; hence the number of iterations required for values to converge can potentially reduce if the newest values are asynchronously made available to other updates computed in the same iteration. In a multi-…
▽ More
Iterative graph algorithms often compute intermediate values and update them as computation progresses. Updated output values are used as inputs for computations in current or subsequent iterations; hence the number of iterations required for values to converge can potentially reduce if the newest values are asynchronously made available to other updates computed in the same iteration. In a multi-threaded shared memory system, the immediate propagation of updated values can cause memory contention that may offset the benefit of propagating updates sooner. In some cases, the benefit of a smaller number of iterations may be diminished by each iteration taking longer. Our key idea is to combine the low memory contention that synchronous approaches have with the faster information sharing of asynchronous approaches. Our hybrid approach buffers updates from threads locally before committing them to the global store to control how often threads may cause conflicts for others while still sharing data within one iteration and hence speeding convergence. On a 112-thread CPU system, our hybrid approach attains up to 4.5% - 19.4% speedup over an asynchronous approach for Pagerank and up to 1.9% - 17% speedup over asynchronous Bellman Ford SSSP. Further, our hybrid approach attains 2.56x better performance than the synchronous approach. Finally, we provide insights as to why delaying updates is not helpful on certain graphs where connectivity is clustered on the main diagonal of the adjacency matrix.
△ Less
Submitted 29 September, 2021;
originally announced October 2021.
-
LAGraph: Linear Algebra, Network Analysis Libraries, and the Study of Graph Algorithms
Authors:
Gábor Szárnyas,
David A. Bader,
Timothy A. Davis,
James Kitchen,
Timothy G. Mattson,
Scott McMillan,
Erik Welch
Abstract:
Graph algorithms can be expressed in terms of linear algebra. GraphBLAS is a library of low-level building blocks for such algorithms that targets algorithm developers. LAGraph builds on top of the GraphBLAS to target users of graph algorithms with high-level algorithms common in network analysis. In this paper, we describe the first release of the LAGraph library, the design decisions behind the…
▽ More
Graph algorithms can be expressed in terms of linear algebra. GraphBLAS is a library of low-level building blocks for such algorithms that targets algorithm developers. LAGraph builds on top of the GraphBLAS to target users of graph algorithms with high-level algorithms common in network analysis. In this paper, we describe the first release of the LAGraph library, the design decisions behind the library, and performance using the GAP benchmark suite. LAGraph, however, is much more than a library. It is also a project to document and analyze the full range of algorithms enabled by the GraphBLAS. To that end, we have developed a compact and intuitive notation for describing these algorithms. In this paper, we present that notation with examples from the GAP benchmark suite.
△ Less
Submitted 4 April, 2021;
originally announced April 2021.
-
Towards an Objective Metric for the Performance of Exact Triangle Count
Authors:
Mark P. Blanco,
Scott McMillan,
Tze Meng Low
Abstract:
The performance of graph algorithms is often measured in terms of the number of traversed edges per second (TEPS). However, this performance metric is inadequate for a graph operation such as exact triangle counting. In triangle counting, execution times on graphs with a similar number of edges can be distinctly different as demonstrated by results from the past Graph Challenge entries. We discuss…
▽ More
The performance of graph algorithms is often measured in terms of the number of traversed edges per second (TEPS). However, this performance metric is inadequate for a graph operation such as exact triangle counting. In triangle counting, execution times on graphs with a similar number of edges can be distinctly different as demonstrated by results from the past Graph Challenge entries. We discuss the need for an objective performance metric for graph operations and the desired characteristics of such a metric such that it more accurately captures the interactions between the amount of work performed and the capabilities of the hardware on which the code is executed. Using exact triangle counting as an example, we derive a metric that captures how certain techniques employed in many implementations improve performance. We demonstrate that our proposed metric can be used to evaluate and compare multiple approaches for triangle counting, using a SIMD approach as a case study against a scalar baseline.
△ Less
Submitted 29 September, 2021; v1 submitted 16 September, 2020;
originally announced September 2020.
-
Evaluation of Sampling Methods for Robotic Sediment Sampling Systems
Authors:
Jun Han Bae,
Wonse Jo,
Jee Hwan Park,
Richard M. Voyles,
Sara K. McMillan,
Byung-Cheol Min
Abstract:
Analysis of sediments from rivers, lakes, reservoirs, wetlands and other constructed surface water impoundments is an important tool to characterize the function and health of these systems, but is generally carried out manually. This is costly and can be hazardous and difficult for humans due to inaccessibility, contamination, or availability of required equipment. Robotic sampling systems can ea…
▽ More
Analysis of sediments from rivers, lakes, reservoirs, wetlands and other constructed surface water impoundments is an important tool to characterize the function and health of these systems, but is generally carried out manually. This is costly and can be hazardous and difficult for humans due to inaccessibility, contamination, or availability of required equipment. Robotic sampling systems can ease these burdens, but little work has examined the efficiency of such sampling means and no prior work has investigated the quality of the resulting samples. This paper presents an experimental study that evaluates and optimizes sediment sampling patterns applied to a robot sediment sampling system that allows collection of minimally-disturbed sediment cores from natural and man-made water bodies for various sediment types. To meet this need, we developed and tested a robotic sampling platform in the laboratory to test functionality under a range of sediment types and operating conditions. Specifically, we focused on three patterns by which a cylindrical coring device was driven into the sediment (linear, helical, and zig-zag) for three sediment types (coarse sand, medium sand, and silt). The results show that the optimal sampling pattern varies depending on the type of sediment and can be optimized based on the sampling objective. We examined two sampling objectives: maximizing the mass of minimally disturbed sediment and minimizing the power per mass of sample. This study provides valuable data to aid in the selection of optimal sediment coring methods for various applications and builds a solid foundation for future field testing under a range of environmental conditions.
△ Less
Submitted 23 June, 2020;
originally announced June 2020.
-
Delta-stepping SSSP: from Vertices and Edges to GraphBLAS Implementations
Authors:
Upasana Sridhar,
Mark Blanco,
Rahul Mayuranath,
Daniele G. Spampinato,
Tze Meng Low,
Scott McMillan
Abstract:
GraphBLAS is an interface for implementing graph algorithms. Algorithms implemented using the GraphBLAS interface are cast in terms of linear algebra-like operations. However, many graph algorithms are canonically described in terms of operations on vertices and/or edges. Despite the known duality between these two representations, the differences in the way algorithms are described using the two…
▽ More
GraphBLAS is an interface for implementing graph algorithms. Algorithms implemented using the GraphBLAS interface are cast in terms of linear algebra-like operations. However, many graph algorithms are canonically described in terms of operations on vertices and/or edges. Despite the known duality between these two representations, the differences in the way algorithms are described using the two approaches can pose considerable difficulties in the adoption of the GraphBLAS as standard interface for development. This paper investigates a systematic approach for translating a graph algorithm described in the canonical vertex and edge representation into an implementation that leverages the GraphBLAS interface. We present a two-step approach to this problem. First, we express common vertex- and edge-centric design patterns using a linear algebraic language. Second, we map this intermediate representation to the GraphBLAS interface. We illustrate our approach by translating the delta-stepping single source shortest path algorithm from its canonical description to a GraphBLAS implementation, and highlight lessons learned when implementing using GraphBLAS.
△ Less
Submitted 16 September, 2020; v1 submitted 15 November, 2019;
originally announced November 2019.
-
Mathematical Foundations of the GraphBLAS
Authors:
Jeremy Kepner,
Peter Aaltonen,
David Bader,
Aydın Buluc,
Franz Franchetti,
John Gilbert,
Dylan Hutchison,
Manoj Kumar,
Andrew Lumsdaine,
Henning Meyerhenke,
Scott McMillan,
Jose Moreira,
John D. Owens,
Carl Yang,
Marcin Zalewski,
Timothy Mattson
Abstract:
The GraphBLAS standard (GraphBlas.org) is being developed to bring the potential of matrix based graph algorithms to the broadest possible audience. Mathematically the Graph- BLAS defines a core set of matrix-based graph operations that can be used to implement a wide class of graph algorithms in a wide range of programming environments. This paper provides an introduction to the mathematics of th…
▽ More
The GraphBLAS standard (GraphBlas.org) is being developed to bring the potential of matrix based graph algorithms to the broadest possible audience. Mathematically the Graph- BLAS defines a core set of matrix-based graph operations that can be used to implement a wide class of graph algorithms in a wide range of programming environments. This paper provides an introduction to the mathematics of the GraphBLAS. Graphs represent connections between vertices with edges. Matrices can represent a wide range of graphs using adjacency matrices or incidence matrices. Adjacency matrices are often easier to analyze while incidence matrices are often better for representing data. Fortunately, the two are easily connected by matrix mul- tiplication. A key feature of matrix mathematics is that a very small number of matrix operations can be used to manipulate a very wide range of graphs. This composability of small number of operations is the foundation of the GraphBLAS. A standard such as the GraphBLAS can only be effective if it has low performance overhead. Performance measurements of prototype GraphBLAS implementations indicate that the overhead is low.
△ Less
Submitted 13 July, 2016; v1 submitted 18 June, 2016;
originally announced June 2016.
-
Simulating the universe on an intercontinental grid of supercomputers
Authors:
Simon Portegies Zwart,
Tomoaki Ishiyama,
Derek Groen,
Keigo Nitadori,
Junichiro Makino,
Cees de Laat,
Stephen McMillan,
Kei Hiraki,
Stefan Harfst,
Paola Grosso
Abstract:
Understanding the universe is hampered by the elusiveness of its most common constituent, cold dark matter. Almost impossible to observe, dark matter can be studied effectively by means of simulation and there is probably no other research field where simulation has led to so much progress in the last decade. Cosmological N-body simulations are an essential tool for evolving density perturbation…
▽ More
Understanding the universe is hampered by the elusiveness of its most common constituent, cold dark matter. Almost impossible to observe, dark matter can be studied effectively by means of simulation and there is probably no other research field where simulation has led to so much progress in the last decade. Cosmological N-body simulations are an essential tool for evolving density perturbations in the nonlinear regime. Simulating the formation of large-scale structures in the universe, however, is still a challenge due to the enormous dynamic range in spatial and temporal coordinates, and due to the enormous computer resources required. The dynamic range is generally dealt with by the hybridization of numerical techniques. We deal with the computational requirements by connecting two supercomputers via an optical network and make them operate as a single machine. This is challenging, if only for the fact that the supercomputers of our choice are separated by half the planet, as one is located in Amsterdam and the other is in Tokyo. The co-scheduling of the two computers and the 'gridification' of the code enables us to achieve a 90% efficiency for this distributed intercontinental supercomputer.
△ Less
Submitted 5 January, 2010;
originally announced January 2010.
-
Exploring the Use of Virtual Worlds as a Scientific Research Platform: The Meta-Institute for Computational Astrophysics (MICA)
Authors:
S. G. Djorgovski,
P. Hut,
S. McMillan,
E. Vesperini,
R. Knop,
W. Farr,
M. J. Graham
Abstract:
We describe the Meta-Institute for Computational Astrophysics (MICA), the first professional scientific organization based exclusively in virtual worlds (VWs). The goals of MICA are to explore the utility of the emerging VR and VWs technologies for scientific and scholarly work in general, and to facilitate and accelerate their adoption by the scientific research community. MICA itself is an exp…
▽ More
We describe the Meta-Institute for Computational Astrophysics (MICA), the first professional scientific organization based exclusively in virtual worlds (VWs). The goals of MICA are to explore the utility of the emerging VR and VWs technologies for scientific and scholarly work in general, and to facilitate and accelerate their adoption by the scientific research community. MICA itself is an experiment in academic and scientific practices enabled by the immersive VR technologies. We describe the current and planned activities and research directions of MICA, and offer some thoughts as to what the future developments in this arena may be.
△ Less
Submitted 20 July, 2009;
originally announced July 2009.
-
A parallel gravitational N-body kernel
Authors:
Simon Portegies Zwart,
Steve McMillan,
Derek Groen,
Alessia Gualandris,
Michael Sipior,
Willem Vermin
Abstract:
We describe source code level parallelization for the {\tt kira} direct gravitational $N$-body integrator, the workhorse of the {\tt starlab} production environment for simulating dense stellar systems. The parallelization strategy, called ``j-parallelization'', involves the partition of the computational domain by distributing all particles in the system among the available processors. Partial…
▽ More
We describe source code level parallelization for the {\tt kira} direct gravitational $N$-body integrator, the workhorse of the {\tt starlab} production environment for simulating dense stellar systems. The parallelization strategy, called ``j-parallelization'', involves the partition of the computational domain by distributing all particles in the system among the available processors. Partial forces on the particles to be advanced are calculated in parallel by their parent processors, and are then summed in a final global operation. Once total forces are obtained, the computing elements proceed to the computation of their particle trajectories. We report the results of timing measurements on four different parallel computers, and compare them with theoretical predictions. The computers employ either a high-speed interconnect, a NUMA architecture to minimize the communication overhead or are distributed in a grid. The code scales well in the domain tested, which ranges from 1024 - 65536 stars on 1 - 128 processors, providing satisfactory speedup. Running the production environment on a grid becomes inefficient for more than 60 processors distributed across three sites.
△ Less
Submitted 5 November, 2007;
originally announced November 2007.
-
Distributed N-body Simulation on the Grid Using Dedicated Hardware
Authors:
Derek Groen,
Simon Portegies Zwart,
Steve McMillan,
Jun Makino
Abstract:
We present performance measurements of direct gravitational N -body simulation on the grid, with and without specialized (GRAPE-6) hardware. Our inter-continental virtual organization consists of three sites, one in Tokyo, one in Philadelphia and one in Amsterdam. We run simulations with up to 196608 particles for a variety of topologies. In many cases, high performance simulations over the enti…
▽ More
We present performance measurements of direct gravitational N -body simulation on the grid, with and without specialized (GRAPE-6) hardware. Our inter-continental virtual organization consists of three sites, one in Tokyo, one in Philadelphia and one in Amsterdam. We run simulations with up to 196608 particles for a variety of topologies. In many cases, high performance simulations over the entire planet are dominated by network bandwidth rather than latency. With this global grid of GRAPEs our calculation time remains dominated by communication over the entire range of N, which was limited due to the use of three sites. Increasing the number of particles will result in a more efficient execution. Based on these timings we construct and calibrate a model to predict the performance of our simulation on any grid infrastructure with or without GRAPE. We apply this model to predict the simulation performance on the Netherlands DAS-3 wide area computer. Equipping the DAS-3 with GRAPE-6Af hardware would achieve break-even between calculation and communication at a few million particles, resulting in a compute time of just over ten hours for 1 N -body time unit. Key words: high-performance computing, grid, N-body simulation, performance modelling
△ Less
Submitted 5 November, 2007; v1 submitted 28 September, 2007;
originally announced September 2007.