A Flexible Instruction Set Architecture for Efficient GEMMs

Authors: Alexandre de Limas Santana, Adrià Armejach, Francesc Martinez, Erich Focht, Marc Casas

Abstract: GEneral Matrix Multiplications (GEMMs) are recurrent in high-performance computing and deep learning workloads. Typically, high-end CPUs accelerate GEMM workloads with Single-Instruction Multiple Data (SIMD) or vector Instruction Set Architectures (ISAs). Since these ISAs face significant issues when running GEMM workloads, particularly when dealing with small, tall, or skinny matrices, matrix ISAs have been proposed and implemented by major hardware vendors in recent years. Although these matrix ISAs deliver larger throughput when running GEMMs than their SIMD/vector counterparts, they are rigid solutions unable to dynamically adapt themselves to application-specific aspects like the data format. This paper demonstrates that the state-of-the-art matrix ISAs deliver suboptimal performance when running the most commonly used convolution and transformer models. This paper proposes the Matrix Tile Extension (MTE), the first matrix ISA that completely decouples the instruction set architecture from the microarchitecture and seamlessly interacts with existing vector ISAs. MTE incurs minimal implementation overhead since it only requires a few additional instructions and a 64-bit Control Status Register (CSR) to keep its state. Specifically, MTE can i) vectorize GEMMs across the three dimensions M, N, and K; ii) leverage the capacity of the existing vector register file; and iii) decouple the tile shape from the underlying microarchitecture. MTE achieves speed-ups of 1.35x over the best state-of-the-art matrix ISA.

Submitted 4 July, 2025; originally announced July 2025.

ACM Class: C.1.0
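
As an illustration of what tiling a GEMM across the three dimensions M, N, and K means, the following is a minimal software sketch in Python. It is not MTE and does not model any ISA; the function name `tiled_gemm` and the parameters `tile_m`, `tile_n`, and `tile_k` are hypothetical names chosen for this example. It only shows that the tile shape can be a free parameter of the loop nest while the result stays the same.

```python
def tiled_gemm(a, b, tile_m=2, tile_n=2, tile_k=2):
    """Compute C = A x B block by block over (tile_m x tile_n x tile_k) tiles.

    a is an M x K matrix and b is a K x N matrix, both non-empty lists of rows.
    The tile sizes are illustrative parameters, not hardware constants.
    """
    m, k = len(a), len(a[0])
    if len(b) != k:
        raise ValueError("inner dimensions of A and B must match")
    n = len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile_m):          # tiles along M
        for j0 in range(0, n, tile_n):      # tiles along N
            for k0 in range(0, k, tile_k):  # tiles along K
                # min() clips the tile at the matrix edge, so small, tall,
                # or skinny matrices need no padding.
                for i in range(i0, min(i0 + tile_m, m)):
                    for j in range(j0, min(j0 + tile_n, n)):
                        acc = c[i][j]
                        for p in range(k0, min(k0 + tile_k, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c
```

Calling `tiled_gemm(a, b, tile_m=1, tile_n=8, tile_k=4)` and `tiled_gemm(a, b)` on the same inputs should give the same product; only the traversal order changes.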

arXiv:2412.13235

Logic-Constrained Shortest Paths for Flight Planning

Authors: Ricardo Euler, Pedro Maristany de las Casas, Ralf Borndörfer

Abstract: The logic-constrained shortest path problem (LCSPP) combines a one-to-one shortest path problem with satisfiability constraints imposed on the routing graph. This setting arises in flight planning, where air traffic control (ATC) authorities are enforcing a set of traffic flow restrictions (TFRs) on aircraft routes in order to increase safety and throughput. We propose a new branch and bound-based algorithm for the LCSPP. The resulting algorithm has three main degrees of freedom: the node selection rule, the branching rule and the conflict. While node selection and branching rules have been long studied in the MIP and SAT communities, most of them cannot be applied out of the box for the LCSPP. We review the existing literature and develop tailored variants of the most prominent rules. The conflict, the set of variables to which the branching rule is applied, is unique to the LCSPP. We analyze its theoretical impact on the B&B algorithm. In the second part of the paper, we show how to model the flight planning problem with TFRs as an LCSPP and solve it using the branch and bound algorithm. We demonstrate the algorithm's efficiency on a dataset consisting of a global flight graph and a set of around 20000 real TFRs obtained from our industry partner Lufthansa Systems GmbH. We make this dataset publicly available. Finally, we conduct an empirical in-depth analysis of dynamic shortest path algorithms, node selection rules, branching rules and conflicts. Carefully choosing an appropriate combination yields an improvement of an order of magnitude compared to an uninformed choice.

Submitted 11 June, 2025; v1 submitted 17 December, 2024; originally announced December 2024.

arXiv:2406.02579

doi 10.1109/FPL60245.2023.00011

An Open-Source Framework for Efficient Numerically-Tailored Computations

Authors: Louis Ledoux, Marc Casas

Abstract: We present a versatile open-source framework designed to facilitate efficient, numerically-tailored Matrix-Matrix Multiplications (MMMs). The framework offers two primary contributions: first, a fine-tuned, automated pipeline for arithmetic datapath generation, enabling highly customizable systolic MMM kernels; second, seamless integration of the generated kernels into user code, irrespective of the programming language employed, without necessitating modifications. The framework demonstrates a systematic enhancement in accuracy per energy cost across diverse High Performance Computing (HPC) workloads displaying a variety of numerical requirements, such as Artificial Intelligence (AI) inference and Sea Surface Height (SSH) computation. For AI inference, we consider a set of state-of-the-art neural network models, namely ResNet18, ResNet34, ResNet50, DenseNet121, DenseNet161, DenseNet169, and VGG11, in conjunction with two datasets, two computer formats, and 27 distinct intermediate arithmetic datapaths. Our approach consistently reduces energy consumption across all cases, with a notable example being the reduction by factors of $3.3\times$ for IEEE754-32 and $1.4\times$ for Bfloat16 during ImageNet inference with ResNet50. This is accomplished while maintaining accuracies of $82.3\%$ and $86\%$, comparable to those achieved with conventional Floating-Point Units (FPUs). In the context of SSH computation, our method achieves fully-reproducible results using double-precision words, surpassing the accuracy of conventional double- and quad-precision arithmetic in FPUs. Our approach enhances SSH computation accuracy by a minimum of $5\times$ and $27\times$ compared to IEEE754-64 and IEEE754-128, respectively, resulting in $5.6\times$ and $15.1\times$ improvements in accuracy per power cost.

Submitted 29 May, 2024; originally announced June 2024.

Comments: 6 pages, open-source

Journal ref: International Conference on Field Programmable Logic and Applications 2023

arXiv:2403.15181

doi 10.1109/HPCA57654.2024.00046

A Two Level Neural Approach Combining Off-Chip Prediction with Adaptive Prefetch Filtering

Authors: Alexandre Valentin Jamet, Georgios Vavouliotis, Daniel A. Jiménez, Lluc Alvarez, Marc Casas

Abstract: To alleviate the performance and energy overheads of contemporary applications with large data footprints, we propose the Two Level Perceptron (TLP) predictor, a neural mechanism that effectively combines predicting whether an access will be off-chip with adaptive prefetch filtering at the first-level data cache (L1D). TLP is composed of two connected microarchitectural perceptron predictors, named First Level Predictor (FLP) and Second Level Predictor (SLP). FLP performs accurate off-chip prediction by using several program features based on virtual addresses and a novel selective delay component. The novelty of SLP relies on leveraging off-chip prediction to drive L1D prefetch filtering by using physical addresses and the FLP prediction as features. TLP constitutes the first hardware proposal targeting both off-chip prediction and prefetch filtering using a multi-level perceptron hardware approach. TLP only requires 7KB of storage. To demonstrate the benefits of TLP we compare its performance with state-of-the-art approaches using off-chip prediction and prefetch filtering on a wide range of single-core and multi-core workloads. Our experiments show that TLP reduces the average DRAM transactions by 30.7% and 17.7%, as compared to a baseline using state-of-the-art cache prefetchers but no off-chip prediction mechanism, across the single-core and multi-core workloads, respectively, while recent work significantly increases DRAM transactions. As a result, TLP achieves geometric mean performance speedups of 6.2% and 11.8% across single-core and multi-core workloads, respectively. In addition, our evaluation demonstrates that TLP is effective independently of the L1D prefetching logic.

Submitted 22 March, 2024; originally announced March 2024.

Comments: To appear in 30th International Symposium on High-Performance Computer Architecture (HPCA), 2024

arXiv:2309.10377

K-Shortest Simple Paths Using Biobjective Path Search

Authors: Pedro Maristany de las Casas, Antonio Sedeño-Noda, Ralf Borndörfer, Max Huneshagen

Abstract: In this paper we introduce a new algorithm for the $k$-Shortest Simple Paths ($k$-SPP) problem with an asymptotic running time matching the state of the art from the literature. It is based on a black-box algorithm due to Roditty and Zwick (2012) that solves at most $2k$ instances of the Second Shortest Simple Path ($2$-SPP) problem without specifying how this is done. We fill this gap using a novel approach: we turn the scalar $2$-SPP into instances of the Biobjective Shortest Path problem. Our experiments on grid graphs and on road networks show that the new algorithm is very efficient in practice.

Submitted 19 September, 2023; originally announced September 2023.

MSC Class: 90C99 ACM Class: G.4; G.2.2

arXiv:2309.07158

Compressed Real Numbers for AI: a case-study using a RISC-V CPU

Authors: Federico Rossi, Marco Cococcioni, Roger Ferrer Ibàñez, Jesùs Labarta, Filippo Mantovani, Marc Casas, Emanuele Ruffaldi, Sergio Saponara

Abstract: As recently demonstrated, Deep Neural Networks (DNNs), usually trained using single-precision IEEE 754 floating-point numbers (binary32), can also work using lower precision. Therefore, 16-bit and 8-bit compressed formats have attracted considerable attention. In this paper, we focus on two families of formats that have already achieved interesting results in compressing binary32 numbers in machine learning applications without appreciable degradation of accuracy: bfloat and posit. Even if 16-bit and 8-bit bfloat/posit are routinely used for reducing the storage of the weights/biases of trained DNNs, the inference still often happens on the 32-bit FPU of the CPU (especially if GPUs are not available). In this paper we propose a way to decompress a tensor of bfloat/posits just before computations, i.e., after the compressed operands have been loaded within the vector registers of a vector-capable CPU, in order to save bandwidth usage and increase cache efficiency. Finally, we show the architectural parameters and considerations under which this solution is advantageous with respect to the uncompressed one.

Submitted 11 September, 2023; originally announced September 2023.
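
To make the idea of decompressing just before computation concrete for the 16-bit bfloat case, here is a minimal scalar sketch in Python. It relies only on the fact that bfloat16 is the upper 16 bits of a binary32 value; the function names are hypothetical, the compression step uses simple truncation (an assumption, since other rounding modes exist), and nothing here reflects the paper's vector-register implementation or the posit format.

```python
import struct


def float32_to_bfloat16_bits(x):
    """Compress: keep the upper 16 bits of the binary32 encoding of x.

    This truncates the low 16 mantissa bits (round toward zero), which is
    an assumption made for simplicity in this sketch.
    """
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16


def bfloat16_bits_to_float32(h):
    """Decompress: place the 16 stored bits in the upper half of a binary32."""
    (x,) = struct.unpack("<f", struct.pack("<I", (h & 0xFFFF) << 16))
    return x


def decompress_tensor(words):
    """Expand a list of 16-bit bfloat16 words into Python floats."""
    return [bfloat16_bits_to_float32(h) for h in words]
```

Values whose binary32 mantissa fits in 7 explicit bits, such as 1.0 or -2.5, survive the round trip exactly; other values lose their low mantissa bits.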

arXiv:2307.10332

doi 10.1016/j.ejor.2024.05.002

Labeling Methods for Partially Ordered Paths

Authors: Ricardo Euler, Pedro Maristany de las Casas

Abstract: The landscape of applications and subroutines relying on shortest path computations continues to grow steadily. This growth is driven by the undeniable success of shortest path algorithms in theory and practice. It also introduces new challenges as the models and assessing the optimality of paths become more complicated. Hence, multiple recent publications in the field adapt existing labeling methods in an ad hoc fashion to their specific problem variant without considering the underlying general structure: they always deal with multi-criteria scenarios, and those criteria define different partial orders on the paths. In this paper, we introduce the partial order shortest path problem (POSP), a generalization of the multi-objective shortest path problem (MOSP) and in turn also of the classical shortest path problem. POSP captures the particular structure of many shortest path applications as special cases. In this generality, we study optimality conditions or the lack of them, depending on the objective functions' properties. Our final contribution is a big lookup table summarizing our findings and providing the reader with an easy way to choose among the most recent multi-criteria shortest path algorithms depending on their problems' weight structure. Examples range from time-dependent shortest path and bottleneck path problems to the electric vehicle shortest path problem with recharging and complex financial weight functions studied in the public transportation community. Our results hold for general digraphs and, therefore, surpass previous generalizations that were limited to acyclic graphs.

Submitted 12 August, 2024; v1 submitted 19 July, 2023; originally announced July 2023.

Journal ref: European Journal of Operational Research, Volume 318, Issue 1, 1 October 2024, Pages 19-30

arXiv:2306.16203

New Dynamic Programming Algorithm for the Multiobjective Minimum Spanning Tree Problem

Authors: Pedro Maristany de las Casas, Antonio Sedeño-Noda, Ralf Borndörfer

Abstract: The Multiobjective Minimum Spanning Tree (MO-MST) problem is a variant of the Minimum Spanning Tree problem, in which the costs associated with every edge of the input graph are vectors. In this paper, we design a new dynamic programming MO-MST algorithm. Dynamic programming for a MO-MST instance leads to the definition of an instance of the One-to-One Multiobjective Shortest Path (MOSP) problem and both instances have equivalent solution sets. The arising MOSP instance is defined on a so-called transition graph. We study the original size of this graph in detail and reduce its size using cost-dependent arc pruning criteria. To solve the MOSP instance on the reduced transition graph, we design the Implicit Graph Multiobjective Dijkstra Algorithm (IG-MDA), exploiting recent improvements on MOSP algorithms from the literature. All in all, the new IG-MDA outperforms the current state of the art on a big set of instances from the literature. Our code and results are publicly available.

Submitted 28 June, 2023; originally announced June 2023.

Comments: 35 pages; 30 pages without appendix. 4 Tables, 13 Figures

MSC Class: 90C29 ACM Class: G.2.2

arXiv:2305.18328

Open-Source GEMM Hardware Kernels Generator: Toward Numerically-Tailored Computations

Authors: Louis Ledoux, Marc Casas

Abstract: Many scientific computing problems can be reduced to Matrix-Matrix Multiplications (MMM), making the General Matrix Multiply (GEMM) kernels in the Basic Linear Algebra Subroutine (BLAS) of interest to the high-performance computing community. However, these workloads have a wide range of numerical requirements. Ill-conditioned linear systems require high-precision arithmetic to ensure correct and reproducible results. In contrast, emerging workloads such as deep neural networks, which can have millions up to billions of parameters, have shown resilience to arithmetic tinkering and precision lowering.

Submitted 23 May, 2023; originally announced May 2023.

arXiv:2305.06696

Characterizing the impact of last-level cache replacement policies on big-data workloads

Authors: Alexandre Valentin Jamet, Lluc Alvarez, Marc Casas

Abstract: In recent years, graph processing has become an essential class of workloads with applications in a rapidly growing number of fields. Graph processing typically uses large input sets, often at multi-gigabyte scale, and data-dependent graph traversal methods exhibiting irregular memory access patterns. Recent work demonstrates that, due to the highly irregular memory access patterns of data-dependent graph traversals, state-of-the-art graph-processing workloads spend up to 80% of the total execution time waiting for memory accesses to be served by the DRAM. The vast disparity between the Last Level Cache (LLC) and main memory latencies is a problem that has been addressed for years in computer architecture. One of the prevailing approaches when it comes to mitigating this performance gap between modern CPUs and DRAM is cache replacement policies. In this work, we characterize the challenges posed by graph-processing workloads and evaluate the most relevant cache replacement policies.

Submitted 11 May, 2023; originally announced May 2023.

Comments: Extended abstract submitted to the 10th BSC Doctoral Symposium

arXiv:2303.02471

Optimization of SpGEMM with Risc-V vector instructions

Authors: Valentin Le Fèvre, Marc Casas

Abstract: The Sparse GEneral Matrix-Matrix multiplication (SpGEMM) $C = A \times B$ is a fundamental routine extensively used in domains like machine learning or graph analytics. Despite its relevance, the efficient execution of SpGEMM on vector architectures is a relatively unexplored topic. The most recent algorithm to run SpGEMM on these architectures is based on the SParse Accumulator (SPA) approach, and it is relatively efficient for sparse matrices featuring several tens of non-zero coefficients per column as it computes C columns one by one. However, when dealing with matrices containing just a few non-zero coefficients per column, the state-of-the-art algorithm is not able to fully exploit long vector architectures when computing the SpGEMM kernel. To overcome this issue we propose the SPA paRallel with Sorting (SPARS) algorithm, which computes in parallel several C columns among other optimizations, and the HASH algorithm, which uses dynamically sized hash tables to store intermediate output values. To combine the efficiency of SPA for relatively dense matrix blocks with the high performance that SPARS and HASH deliver for very sparse matrix blocks we propose H-SPA(t) and H-HASH(t), which dynamically switch between different algorithms. H-SPA(t) and H-HASH(t) obtain 1.24$\times$ and 1.57$\times$ average speed-ups with respect to SPA respectively, over a set of 40 sparse matrices obtained from the SuiteSparse Matrix Collection. For the 22 most sparse matrices, H-SPA(t) and H-HASH(t) deliver 1.42$\times$ and 1.99$\times$ average speed-ups respectively.

Submitted 2 June, 2023; v1 submitted 4 March, 2023; originally announced March 2023.
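
The sparse accumulator (SPA) approach mentioned in the abstract above, which computes the columns of C one by one, can be sketched in scalar Python as follows. This is a generic textbook-style SPA, not the vectorized baseline nor the SPARS or HASH algorithms from the paper; the column-list storage format and the name `spgemm_spa` are assumptions made for this example.

```python
def spgemm_spa(a_cols, b_cols, n_rows):
    """Compute C = A x B one output column at a time with a sparse accumulator.

    Each matrix is a list of columns; each column is a list of
    (row_index, value) pairs. n_rows is the number of rows of A (and C).
    """
    values = [0.0] * n_rows       # dense accumulator for the current column
    occupied = [False] * n_rows   # marks which accumulator slots are in use
    c_cols = []
    for b_col in b_cols:
        touched = []              # rows of C that become non-zero candidates
        for k, b_kj in b_col:
            # Column j of C is the sum over k of B[k, j] * A[:, k].
            for i, a_ik in a_cols[k]:
                if not occupied[i]:
                    occupied[i] = True
                    touched.append(i)
                values[i] += a_ik * b_kj
        touched.sort()
        c_cols.append([(i, values[i]) for i in touched])
        # Reset only the slots that were used, keeping the cost per column
        # proportional to the work done rather than to n_rows.
        for i in touched:
            values[i] = 0.0
            occupied[i] = False
    return c_cols
```

With only a few non-zeros per column, the inner loops in this sketch are very short, which gives some intuition for why the abstract reports that a column-by-column scheme struggles to fill long vector registers.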

arXiv:2211.08272

Low-Thrust Orbital Transfer using Dynamics-Agnostic Reinforcement Learning

Authors: Carlos M. Casas, Belen Carro, Antonio Sanchez-Esguevillas

Abstract: Low-thrust trajectory design and in-flight control remain two of the most challenging topics for new-generation satellite operations. Most of the solutions currently implemented are based on reference trajectories and lead to sub-optimal fuel usage. Other solutions are based on simple guidance laws that need to be updated periodically, increasing the cost of operations. Whereas some optimization strategies leverage Artificial Intelligence methods, all of the approaches studied so far need either previously generated data or a strong a priori knowledge of the satellite dynamics. This study uses model-free Reinforcement Learning to train an agent on a constrained pericenter raising scenario for a low-thrust medium-Earth-orbit satellite. The agent does not have any prior knowledge of the environment dynamics, which makes it unbiased from classical trajectory optimization patterns. The trained agent is then used to design a trajectory and to autonomously control the satellite during the cruise. Simulations show that a dynamics-agnostic agent is able to learn a quasi-optimal guidance law and responds well to uncertainties in the environment dynamics. The results obtained open the door to the usage of Reinforcement Learning on more complex scenarios, multi-satellite problems, or to explore trajectories in environments where a reference solution is not known.

Submitted 6 October, 2022; originally announced November 2022.

arXiv:2202.09288

Optimization of the Sparse Multi-Threaded Cholesky Factorization for A64FX

Authors: Valentin Le Fèvre, Tetsuzo Usui, Marc Casas

Abstract: Sparse linear algebra routines are fundamental building blocks of a large variety of scientific applications. Direct solvers, which are methods for solving linear systems via the factorization of matrices into products of triangular matrices, are commonly used in many contexts. The Cholesky factorization is the fastest direct method for symmetric positive definite matrices. This paper presents selective nesting, a method to determine the optimal task granularity for the parallel Cholesky factorization based on the structure of sparse matrices. We propose the OPT-D-COST algorithm, which automatically and dynamically applies selective nesting. OPT-D-COST leverages matrix sparsity to drive complex task-based parallel workloads in the context of direct solvers. We run an extensive evaluation campaign considering a heterogeneous set of 60 sparse matrices and a parallel machine featuring the A64FX processor. OPT-D-COST delivers an average performance speedup of 1.46$\times$ with respect to the best state-of-the-art parallel method to run direct solvers.

Submitted 18 February, 2022; originally announced February 2022.

arXiv:2110.10978

Targeted Multiobjective Dijkstra Algorithm

Authors: Pedro Maristany de las Casas, Luitgard Kraus, Antonio Sedeño-Noda, Ralf Borndörfer

Abstract: In this paper, we introduce the Targeted Multiobjective Dijkstra Algorithm (T-MDA), a label setting algorithm for the One-to-One Multiobjective Shortest Path (MOSP) Problem. The T-MDA is based on the recently published Multiobjective Dijkstra Algorithm (MDA) and equips it with A*-like techniques. The resulting speedup is comparable to the speedup that the original A* algorithm achieves for Dijkstra's algorithm. Unlike other methods from the literature, which rely on special properties of the biobjective case, the T-MDA works for any dimension. To the best of our knowledge, it gives rise to the first efficient implementation that can deal with large scale instances with more than two objectives. A version tuned for the biobjective case, the T-BDA, outperforms state-of-the-art methods on almost every instance of a standard benchmark testbed that is not solvable in fractions of a second.

Submitted 17 December, 2021; v1 submitted 21 October, 2021; originally announced October 2021.

Comments: 20 pages, 58 figures, 10 tables

MSC Class: 90C29; 90C35; 68W99 ACM Class: G.2.2

arXiv:2009.08698

Generating Efficient DNN-Ensembles with Evolutionary Computation

Authors: Marc Ortiz, Florian Scheidegger, Marc Casas, Cristiano Malossi, Eduard Ayguadé

Abstract: In this work, we leverage ensemble learning as a tool for the creation of faster, smaller, and more accurate deep learning models. We demonstrate that we can jointly optimize for accuracy, inference time, and the number of parameters by combining DNN classifiers. To achieve this, we combine multiple ensemble strategies: bagging, boosting, and an ordered chain of classifiers. To reduce the number of DNN ensemble evaluations during the search, we propose EARN, an evolutionary approach that optimizes the ensemble according to three objectives subject to the constraints specified by the user. We run EARN on 10 image classification datasets with an initial pool of 32 state-of-the-art DCNNs on both CPU and GPU platforms, and we generate models with speedups up to $7.60\times$, reductions of parameters by $10\times$, or increases in accuracy up to $6.01\%$ relative to the best DNN in the pool. In addition, our method generates models that are $5.6\times$ faster than the state-of-the-art methods for automatic model generation.

Submitted 3 May, 2021; v1 submitted 18 September, 2020; originally announced September 2020.

Comments: 8 pages

arXiv:2004.02297

Reducing Data Motion to Accelerate the Training of Deep Neural Networks

Authors: Sicong Zhuang, Cristiano Malossi, Marc Casas

Abstract: This paper reduces the cost of DNN training by decreasing the amount of data movement across heterogeneous architectures composed of several GPUs and multicore CPU devices. In particular, this paper proposes an algorithm to dynamically adapt the data representation format of network weights during training. This algorithm drives a compression procedure that reduces the size of the data before sending it over the parallel system. We run an extensive evaluation campaign considering several up-to-date deep neural network models and two high-end parallel architectures composed of multiple GPUs and CPU multicore chips. Our solution achieves average performance improvements from 6.18\% up to 11.91\%.

Submitted 5 April, 2020; originally announced April 2020.

arXiv:1810.06472

doi 10.1109/IOLTS.2019.8854397

Memory Vulnerability: A Case for Delaying Error Reporting

Authors: Luc Jaulmes, Miquel Moretó, Mateo Valero, Marc Casas

Abstract: To face future reliability challenges, it is necessary to quantify the risk of error in any part of a computing system. To this end, the Architectural Vulnerability Factor (AVF) has long been used for chips. However, this metric is used for offline characterisation, which is inappropriate for memory. We survey the literature and formalise one of the metrics used, the Memory Vulnerability Factor (MVF), and extend it to take into account false errors. These are reported errors which would have no impact on the program if they were ignored. We measure the False Error Aware MVF (FEA) and related metrics precisely in a cycle-accurate simulator, and compare them with the effects of injecting faults in a program's data, in native parallel runs. Our findings show that MVF and FEA are the only two metrics that are safe to use at runtime, as they both consistently give an upper bound on the probability of incorrect program outcome. FEA gives a tighter bound than MVF, and is the metric that correlates best with the incorrect outcome probability of all considered metrics.

Submitted 15 October, 2018; originally announced October 2018.

arXiv:1804.05267 [pdf, other]

Low-Precision Floating-Point Schemes for Neural Network Training

Authors: Marc Ortiz, Adrián Cristal, Eduard Ayguadé, Marc Casas

Abstract: The use of low-precision fixed-point arithmetic along with stochastic rounding has been proposed as a promising alternative to the commonly used 32-bit floating-point arithmetic to enhance neural network training in terms of performance and energy efficiency. In the first part of this paper, the behaviour of the 12-bit fixed-point arithmetic when training a convolutional neural network with the CIFAR-10 dataset is analysed, showing that such arithmetic is not the most appropriate for the training phase. After that, the paper presents and evaluates, under the same conditions, alternative low-precision arithmetics, starting with the 12-bit floating-point arithmetic. These two representations are then leveraged using local scaling in order to increase accuracy and get closer to the baseline 32-bit floating-point arithmetic. Finally, the paper introduces a simplified model in which both the outputs and the gradients of the neural networks are constrained to power-of-two values, using just 7 bits for their representation. The evaluation demonstrates a minimal loss in accuracy for the proposed Power-of-Two neural network, avoiding the use of multiplications and divisions and thereby significantly reducing the training time as well as the energy consumption and memory requirements during the training and inference phases.

Submitted 14 April, 2018; originally announced April 2018.

Comments: 16 pages, 9 figures and 4 tables

ACM Class: I.2.6; I.5

arXiv:1707.08951 [pdf, ps, other]

Handwritten character recognition using some (anti)-diagonal structural features

Authors: José Manuel Casas, Nick Inassaridze, Manuel Ladra, Susana Ladra

Abstract: In this paper, we present a methodology for off-line handwritten character recognition. The proposed methodology relies on a new feature extraction technique based on structural characteristics, histograms and profiles. As a novelty, we propose the extraction of eight new histograms and four profiles from the $32\times 32$ matrices that represent the characters, creating 256-dimensional feature vectors. These feature vectors are then employed in a classification step that uses a $k$-means algorithm. We performed experiments using the NIST database to evaluate our proposal. Namely, the recognition system was trained using 1000 samples and 64 classes for each symbol and was tested on 500 samples for each symbol. We obtain promising accuracy results that vary from 81.74\% to 93.75\%, depending on the difficulty of the character category, showing better accuracy results than other state-of-the-art methods also based on structural characteristics.

Submitted 14 February, 2018; v1 submitted 27 July, 2017; originally announced July 2017.

Comments: Revised version with a number of improvements and updated references, 9 pages

arXiv:1501.02282 [pdf, other]

Adaptive and application dependent runtime guided hardware prefetcher reconfiguration on the IBM POWER7

Authors: David Prat, Cristobal Ortega, Marc Casas, Miquel Moretó, Mateo Valero

Abstract: Hardware data prefetcher engines have been extensively used to reduce the impact of memory latency. However, microprocessors' hardware prefetcher engines do not include any automatic hardware control able to dynamically tune their operation. This missing architectural feature causes systems to operate with prefetchers in a fixed configuration, which in many cases harms performance and energy consumption. In this paper, a piece of software that solves the discussed problem in the context of the IBM POWER7 microprocessor is presented. The proposed solution involves using the runtime software as a bridge that is able to characterize user applications' workload and dynamically reconfigure the prefetcher engine. The proposed mechanisms have been deployed over OmpSs, a state-of-the-art task-based programming model. The paper shows significant performance improvements over a representative set of microbenchmarks and High Performance Computing (HPC) applications.

Submitted 9 January, 2015; originally announced January 2015.

Comments: Part of ADAPT Workshop proceedings, 2015 (arXiv:1412.2347)

Report number: ADAPT/2015/07

Showing 1–20 of 20 results for author: Casas, M