Showing 1–9 of 9 results for author: Castelló, A

  1. arXiv:2506.11728  [pdf, ps, other]

    cs.CL

    The Cambrian Explosion of Mixed-Precision Matrix Multiplication for Quantized Deep Learning Inference

    Authors: Héctor Martínez, Adrián Castelló, Francisco D. Igual, Enrique S. Quintana-Ortí

    Abstract: Recent advances in deep learning (DL) have led to a shift from traditional 64-bit floating point (FP64) computations toward reduced-precision formats, such as FP16, BF16, and 8- or 16-bit integers, combined with mixed-precision arithmetic. This transition enhances computational throughput, reduces memory and bandwidth usage, and improves energy efficiency, offering significant advantages for resou…

    Submitted 13 June, 2025; originally announced June 2025.

    Comments: 16 pages, 7 tables, 7 figures

  2. arXiv:2407.07273  [pdf, other]

    physics.data-an cs.CE

    Combination of operational modal analysis algorithms to identify modal parameters of an actual centrifugal compressor

    Authors: Leandro O. Zague, Daniel A. Castello, Carlos F. T. Matt

    Abstract: The novelty of the current work is precisely to propose a statistical procedure to combine estimates of the modal parameters provided by any set of Operational Modal Analysis (OMA) algorithms so as to avoid preference for a particular one and also to derive an approximate joint probability distribution of the modal parameters, from which engineering statistics of interest such as mean value and va…

    Submitted 9 July, 2024; originally announced July 2024.

    Comments: 6 figures

  3. Performance Analysis of Matrix Multiplication for Deep Learning on the Edge

    Authors: Cristian Ramírez, Adrián Castelló, Héctor Martínez, Enrique S. Quintana-Ortí

    Abstract: The devices designed for the Internet-of-Things encompass a large variety of distinct processor architectures, forming a highly heterogeneous zoo. In order to tackle this, we employ a simulator to estimate the performance of the matrix-matrix multiplication (GEMM) kernel on processors designed to operate at the edge. Our simulator adheres to the modern implementations of GEMM, advocated by GotoBLA…

    Submitted 12 March, 2024; originally announced March 2024.

    Comments: 12 pages, 2 Tables, 6 Figures

    Journal ref: High Performance Computing. ISC High Performance 2022 International Workshops. ISC High Performance 2022. Lecture Notes in Computer Science, vol 13387. Springer, Cham

  4. arXiv:2310.20347  [pdf, other]

    cs.CL

    Automatic Generators for a Family of Matrix Multiplication Routines with Apache TVM

    Authors: Guillermo Alaejos, Adrián Castelló, Pedro Alonso-Jordá, Francisco D. Igual, Héctor Martínez, Enrique S. Quintana-Ortí

    Abstract: We explore the utilization of the Apache TVM open source framework to automatically generate a family of algorithms that follow the approach taken by popular linear algebra libraries, such as GotoBLAS2, BLIS and OpenBLAS, in order to obtain high-performance blocked formulations of the general matrix multiplication (GEMM). In addition, we fully automatize the generation process, by also leveragin…

    Submitted 31 October, 2023; originally announced October 2023.

    Comments: 35 pages, 22 figures. Submitted to ACM TOMS

  5. arXiv:2310.17408  [pdf, other]

    cs.MS cs.CL cs.PF

    Tackling the Matrix Multiplication Micro-kernel Generation with Exo

    Authors: Adrián Castelló, Julian Bellavita, Grace Dinh, Yuka Ikarashi, Héctor Martínez

    Abstract: The optimization of the matrix multiplication (or GEMM) has been a need during the last decades. This operation is considered the flagship of current linear algebra libraries such as BLIS, OpenBLAS, or Intel OneAPI because of its widespread use in a large variety of scientific applications. The GEMM is usually implemented following the GotoBLAS philosophy, which tiles the GEMM operands and uses a…

    Submitted 27 October, 2023; v1 submitted 26 October, 2023; originally announced October 2023.

    Comments: 11 pages, 18 figures. Presented at CGO 2024. It includes a software artifact step-by-step execution

  6. arXiv:2109.09686  [pdf, other]

    eess.AS cs.LG

    Acoustic Echo Cancellation using Residual U-Nets

    Authors: J. Silva-Rodríguez, M. F. Dolz, M. Ferrer, A. Castelló, V. Naranjo, G. Piñero

    Abstract: This paper presents an acoustic echo canceler based on a U-Net convolutional neural network for single-talk and double-talk scenarios. U-Net networks have previously been used in the audio processing area for source separation problems because of their ability to reproduce the finest details of audio signals, but to our knowledge, this is the first time they have been used for acoustic echo cancel…

    Submitted 20 September, 2021; originally announced September 2021.

    Comments: 6 pages, 2 figures, submitted to the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing in October 2020

  7. arXiv:2105.09187  [pdf, other]

    cs.DC cs.AR cs.PF

    High performance and energy efficient inference for deep learning on ARM processors

    Authors: Adrián Castelló, Sergio Barrachina, Manuel F. Dolz, Enrique S. Quintana-Ortí, Pau San Juan

    Abstract: We evolve PyDTNN, a framework for distributed parallel training of Deep Neural Networks (DNNs), into an efficient inference tool for convolutional neural networks. Our optimization process on multicore ARM processors involves several high-level transformations of the original framework, such as the development and integration of Cython routines to exploit thread-level parallelism; the design and d…

    Submitted 19 May, 2021; originally announced May 2021.

    Comments: 13 pages, 7 figures

  8. arXiv:2005.06410  [pdf, other]

    cs.PF

    High Performance and Portable Convolution Operators for ARM-based Multicore Processors

    Authors: Pablo San Juan, Adrián Castelló, Manuel F. Dolz, Pedro Alonso-Jordá, Enrique S. Quintana-Ortí

    Abstract: The considerable impact of Convolutional Neural Networks on many Artificial Intelligence tasks has led to the development of various high performance algorithms for the convolution operator present in this type of networks. One of these approaches leverages the im2col transform followed by a general matrix multiplication (GEMM) in order to take advantage of the highly optimized realizations of the…

    Submitted 13 May, 2020; originally announced May 2020.

    ACM Class: B.8; C.4; I.2; I.4

  9. arXiv:1804.07017  [pdf, other]

    cs.DC cs.MS

    Programming Parallel Dense Matrix Factorizations with Look-Ahead and OpenMP

    Authors: Sandra Catalán, Adrián Castelló, Francisco D. Igual, Rafael Rodríguez-Sánchez, Enrique S. Quintana-Ortí

    Abstract: We investigate a parallelization strategy for dense matrix factorization (DMF) algorithms, using OpenMP, that departs from the legacy (or conventional) solution, which simply extracts concurrency from a multithreaded version of BLAS. This approach is also different from the more sophisticated runtime-assisted implementations, which decompose the operation into tasks and identify dependencies via d…

    Submitted 19 April, 2018; originally announced April 2018.

    Comments: 28 pages